llmizeoff
v0.4.5
llmizeOFF — Self-hosted LLM runtime for VPS, cPanel, Android, and local tools. OpenAI-compatible. No cloud, no subscriptions. Live demo: zulqurnainj.com/chat
llmizeOFF
Formerly offllama. Now packaged as llmizeoff, a smarter, more production-ready self-hosted LLM runtime and toolkit. Migrating? Run npm install llmizeoff and change imports from offllama to llmizeoff. The old OffLlamaClient / OffLlamaError names still work as aliases.
Try it live → zulqurnainj.com/chat · Nayab is the hosted demo of llmizeOFF
Run LLM inference on any host — cPanel shared hosting, VPS, Raspberry Pi, Android — with zero cloud dependencies, no subscriptions, and no external lock-in.
Self-hosted · Offline-first · VPS & cPanel ready · Android compatible · No subscriptions
What is llmizeOFF?
llmizeOFF (npm: llmizeoff) is an open-source LLM runtime designed to run where cloud AI cannot:
- $5/month VPS — works on the smallest DigitalOcean or Hetzner droplets
- cPanel shared hosting — deploys as a Node.js app without root access
- Android apps — native JNI/NDK module for fully offline on-device inference
- Local machines — zero-config CLI for developer tools and scripts
- Web apps — universal HTTP client for browser, React Native, and Node.js
No GPU required. No monthly API bills. No data sent to third parties.
Live demo — Nayab
Nayab is the hosted demo of llmizeOFF. It runs Qwen 2.5-1.5B on a standard VPS with real token streaming. Try it free to see what llmizeOFF can do in a production environment.
What's inside
| Export | Environment | Description |
|--------|-------------|-------------|
| llmizeoff (default) | Node.js / cPanel | Embedded inference via node-llama-cpp + OpenAI-compatible HTTP server |
| llmizeoff/client | Browser, RN, Node.js, Deno, Bun | Zero-dependency HTTP client for any llmizeOFF server |
| llmizeoff/react-native | React Native (iOS + Android) | Offline on-device inference via llama.rn OR HTTP fallback |
| llmizeoff/nano | Any | Zero-model regex extraction + template message builders |
| android/ | Kotlin / Android native | Full JNI/NDK library — 100% offline, no server needed |
Quick start (Node.js / VPS / cPanel)

    npm install llmizeoff

    # Download recommended model (Qwen 2.5-1.5B, ~1.1 GB; best quality/speed on CPU)
    npx llmizeoff download

    # Start OpenAI-compatible server with real token streaming
    npx llmizeoff serve --port 8080 --api-key my-secret

Call it from anywhere:

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer my-secret" \
      -d '{
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": true
      }'

Recommended models for self-hosting
Tested on a 6-core AMD EPYC VPS (12GB RAM, no GPU):
| Model | Size | First token | Tokens/sec | Verdict |
|-------|------|-------------|------------|---------|
| Qwen 2.5-1.5B Q4_K_M | 1.1 GB | 3-5 s | 4-6 | Best for VPS |
| Qwen 2.5-0.5B Q4_K_M | 469 MB | 2-3 s | 8-12 | Fastest, limited quality |
| Qwen 2.5-3B Q4_K_M | 2.0 GB | 6-10 s | 2-3 | Better quality, slower |
| Phi-3.5-mini 3.8B | 2.3 GB | 15-20 s | 1-2 | Too slow for interactive chat |
Qwen 2.5-1.5B is the sweet spot: smart enough for real tasks, fast enough for streaming chat.
For sub-1-second responses, pair llmizeOFF with Groq (free tier, 800+ tok/s). Groq is a hosted service, so requests routed to it leave your machine.
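To get a rough feel for chat latency from the table above, total response time can be approximated as the first-token delay plus the reply length divided by throughput. The sketch below uses the midpoints of the measured ranges. The `estimateSeconds` helper and the 100-token reply length are illustrative assumptions, not part of llmizeoff.

```javascript
// Midpoints of the ranges in the table above (6-core AMD EPYC VPS, no GPU).
const models = {
  "Qwen 2.5-1.5B": { firstTokenSec: 4, tokensPerSec: 5 },
  "Qwen 2.5-0.5B": { firstTokenSec: 2.5, tokensPerSec: 10 },
  "Qwen 2.5-3B": { firstTokenSec: 8, tokensPerSec: 2.5 },
  "Phi-3.5-mini 3.8B": { firstTokenSec: 17.5, tokensPerSec: 1.5 },
};

// Hypothetical helper: first-token delay + generation time for the reply.
function estimateSeconds(model, outputTokens) {
  const m = models[model];
  if (!m) throw new Error(`unknown model: ${model}`);
  return m.firstTokenSec + outputTokens / m.tokensPerSec;
}

for (const name of Object.keys(models)) {
  const secs = Math.round(estimateSeconds(name, 100));
  console.log(`${name}: ~${secs}s for a 100-token reply`);
}
```

With these midpoints, a 100-token reply takes roughly 24 seconds on the 1.5B model and roughly 48 seconds on the 3B model. This is why streaming matters: the user starts reading after the first-token delay instead of waiting for the whole reply.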
Embed in Next.js / Express

    import { LlamaEngine } from "llmizeoff";

    const llama = new LlamaEngine({ contextSize: 2048 });
    await llama.load(); // auto-downloads Qwen 2.5-1.5B on first run

    const reply = await llama.chat([
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "What is the capital of France?" }
    ]);
    console.log(reply); // "The capital of France is Paris."

Android: fully offline on-device

    // build.gradle
    dependencies {
        implementation("com.github.Zulqurnain:llmizeoff-android:0.3.1")
    }

    // Kotlin
    val engine = LlmizeOffEngine(context, modelPath = "models/qwen2.5-0.5b-q4_k_m.gguf")
    engine.load()
    val reply = engine.chat("Explain recursion in one sentence.")

Universal HTTP client
Works in browser, React Native, Node.js, Deno, and Bun with zero dependencies:

    import { LlmizeOffClient } from "llmizeoff/client";

    const client = new LlmizeOffClient({
      baseUrl: "https://your-server.example.com:8080",
      apiKey: "my-secret",
    });

    // Streaming (process.stdout is Node-only; in a browser, append tokens to the page instead)
    for await (const token of client.streamChat([
      { role: "user", content: "Write a haiku about self-hosting" }
    ])) {
      process.stdout.write(token);
    }

VPS deployment (PM2 + nginx)

    # 1. Install llmizeoff locally and PM2 globally
    npm install llmizeoff
    npm install -g pm2

    # 2. Download model
    npx llmizeoff download --model qwen2.5-1.5b

    # 3. Start with PM2 (PM2 passes the current shell environment to the process)
    PORT=8080 HOSTNAME=127.0.0.1 OFFLLAMA_API_KEY=your-key \
      pm2 start node_modules/llmizeoff/dist/server.js --name llmizeoff

    pm2 save && pm2 startup

Add an nginx reverse proxy to expose the server on your domain. See the Nayab source for a complete production example.
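Once the server is up, a quick way to check it is to read the streamed reply from the /v1/chat/completions endpoint used in the quick start. The sketch below assumes the server emits OpenAI-style server-sent events (lines of the form `data: {json}`, ending with `data: [DONE]`). The README implies this by calling the API OpenAI-compatible but does not spell out the wire format. `extractTokens` is a hypothetical helper, not part of llmizeoff.

```javascript
// Hypothetical helper: pull text tokens out of OpenAI-style SSE text.
// Assumes each event is a `data: {...}` line and the stream ends with `data: [DONE]`.
function extractTokens(sseText) {
  const tokens = [];
  for (const line of sseText.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank lines and SSE comments
    const payload = line.slice(5).trim();
    if (payload === "" || payload === "[DONE]") continue;
    let event;
    try {
      event = JSON.parse(payload);
    } catch {
      continue; // this sketch ignores malformed or partial lines
    }
    const delta = event.choices?.[0]?.delta?.content;
    if (typeof delta === "string") tokens.push(delta);
  }
  return tokens;
}

// Sample input shaped like an OpenAI-compatible streaming response.
const sample = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"lo!"}}]}',
  "",
  "data: [DONE]",
  "",
].join("\n");

console.log(extractTokens(sample).join("")); // prints "Hello!"
```

Against a live server, network chunks can split an event in the middle of a line, so a real client should buffer the trailing incomplete line before parsing. The bundled client's streamChat presumably handles this already; the helper is mainly useful for debugging raw responses from curl or fetch.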
Why llmizeOFF?
| Feature | llmizeOFF | Cloud AI APIs |
|---------|-----------|---------------|
| Monthly cost | $0 (your VPS) | $20-100+/month |
| Data privacy | 100% local | Sent to cloud |
| Works offline | Yes | No |
| VPS / cPanel | Yes | No |
| Android (offline) | Yes | No |
| Vendor lock-in | None | Yes |
Roadmap — llmizeOFF Pro (coming soon)
The core runtime stays open-source and free forever. The upcoming Pro edition adds:
- Visual dashboard for model management
- One-click model download and switching
- Multi-user support with rate limiting
- Android SDK (AAR package)
- Priority support and SLAs
Support development and shape the roadmap:
License
MIT © Zulqurnain Haider
