llmizeoff v0.4.5 · Published 17 days ago

llmizeOFF — Self-hosted LLM runtime for VPS, cPanel, Android, and local tools. OpenAI-compatible. No cloud, no subscriptions. Live demo: zulqurnainj.com/chat

llmizeOFF

Formerly offllama. Now packaged as llmizeoff, a smarter, more production-ready self-hosted LLM runtime and toolkit. Migrating? Run npm install llmizeoff and change your imports from offllama to llmizeoff. The old OffLlamaClient / OffLlamaError names still work as aliases.

Try it live → zulqurnainj.com/chat · Nayab is the hosted demo of llmizeOFF

Run LLM inference on any host — cPanel shared hosting, VPS, Raspberry Pi, Android — with zero cloud dependencies, no subscriptions, and no external lock-in.

Self-hosted · Offline-first · VPS & cPanel ready · Android compatible · No subscriptions

What is llmizeOFF?

llmizeOFF (npm: llmizeoff) is an open-source LLM runtime designed to run where cloud AI cannot:

$5/month VPS — works on the smallest DigitalOcean or Hetzner droplets
cPanel shared hosting — deploys as a Node.js app without root access
Android apps — native JNI/NDK module for fully offline on-device inference
Local machines — zero-config CLI for developer tools and scripts
Web apps — universal HTTP client for browser, React Native, and Node.js

No GPU required. No monthly API bills. No data sent to third parties.

Live demo — Nayab

Nayab is the hosted demo of llmizeOFF. It runs Qwen 2.5-1.5B on a standard VPS with real token streaming. Try it free to see what llmizeOFF can do in a production environment.

What's inside

| Export | Environment | Description |
|--------|-------------|-------------|
| llmizeoff (default) | Node.js / cPanel | Embedded inference via node-llama-cpp + OpenAI-compatible HTTP server |
| llmizeoff/client | Browser, RN, Node.js, Deno, Bun | Zero-dependency HTTP client for any llmizeOFF server |
| llmizeoff/react-native | React Native (iOS + Android) | Offline on-device inference via llama.rn OR HTTP fallback |
| llmizeoff/nano | Any | Zero-model regex extraction + template message builders |
| android/ | Kotlin / Android native | Full JNI/NDK library; 100% offline, no server needed |

Quick start (Node.js / VPS / cPanel)

npm install llmizeoff

# Download recommended model (Qwen 2.5-1.5B ~1.1GB — best quality/speed on CPU)
npx llmizeoff download

# Start OpenAI-compatible server with real token streaming
npx llmizeoff serve --port 8080 --api-key my-secret

Call it from anywhere:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-secret" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
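The same request can be issued from JavaScript with `fetch`. The sketch below only builds the request arguments, mirroring the endpoint path and headers from the curl call above; the helper name `buildChatRequest` is hypothetical and not part of llmizeoff, and the response shape mentioned in the usage comment is assumed to follow the OpenAI chat completions format.

```javascript
// Build fetch() arguments equivalent to the curl example.
// baseUrl and apiKey are placeholders: use the values you passed to `serve`.
function buildChatRequest(baseUrl, apiKey, userText, stream = false) {
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: baseUrl.replace(/\/+$/, "") + "/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + apiKey,
      },
      body: JSON.stringify({
        messages: [{ role: "user", content: userText }],
        stream,
      }),
    },
  };
}

// Usage (Node.js 18+ or a browser):
// const { url, init } = buildChatRequest("http://localhost:8080", "my-secret", "Hello!");
// const res = await fetch(url, init);
// const data = await res.json();
// console.log(data.choices[0].message.content); // assumes an OpenAI-style response body
```

With `stream` left at `false` the server is expected to return a single JSON body, which is simpler to handle than the streamed variant.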

Recommended models for self-hosting

Tested on a 6-core AMD EPYC VPS (12GB RAM, no GPU):

| Model | Size | First token | Tokens/sec | Verdict |
|-------|------|-------------|------------|---------|
| Qwen 2.5-1.5B Q4_K_M | 1.1 GB | 3-5s | 4-6 | Best for VPS |
| Qwen 2.5-0.5B Q4_K_M | 469 MB | 2-3s | 8-12 | Fastest, limited quality |
| Qwen 2.5-3B Q4_K_M | 2.0 GB | 6-10s | 2-3 | Better quality, slower |
| Phi-3.5-mini 3.8B | 2.3 GB | 15-20s | 1-2 | Too slow for interactive chat |

Qwen 2.5-1.5B is the sweet spot: smart enough for real tasks, fast enough for streaming chat.
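To turn the benchmark figures above into an expected wait, a rough estimate is time to first token plus reply length divided by throughput. The sketch below applies that to the Qwen 2.5-1.5B row; the 120-token reply length is an illustrative assumption, and the function name is hypothetical.

```javascript
// Rough wall-clock estimate for a full reply:
// time to first token + (reply length / generation speed).
function estimateReplySeconds(firstTokenSec, tokensPerSec, replyTokens) {
  if (tokensPerSec <= 0) throw new RangeError("tokensPerSec must be positive");
  return firstTokenSec + replyTokens / tokensPerSec;
}

// Qwen 2.5-1.5B figures from the table: 3-5 s first token, 4-6 tokens/sec.
const best = estimateReplySeconds(3, 6, 120);  // 3 + 120/6 = 23
const worst = estimateReplySeconds(5, 4, 120); // 5 + 120/4 = 35
console.log(`120-token reply: ${best}-${worst} s`); // prints "120-token reply: 23-35 s"
```

Because streaming shows tokens as they arrive, the perceived latency is closer to the first-token time than to this total.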

For sub-1-second responses, pair llmizeOFF with Groq (free tier, 800+ tok/s).

Embed in Next.js / Express

import { LlamaEngine } from "llmizeoff";

const llama = new LlamaEngine({ contextSize: 2048 });
await llama.load(); // auto-downloads Qwen 2.5-1.5B on first run

const reply = await llama.chat([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "What is the capital of France?" }
]);

console.log(reply); // "The capital of France is Paris."
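When `llama.chat` sits behind an Express or Next.js route, it may help to validate the incoming body before passing it to the engine. This is a minimal sketch under stated assumptions: `validateMessages` is a hypothetical helper, not part of llmizeoff, and the allowed role list follows the common OpenAI convention (the example above only shows `system` and `user`).

```javascript
// Roles accepted by this sketch; "assistant" is assumed from the OpenAI convention.
const ALLOWED_ROLES = new Set(["system", "user", "assistant"]);

// Returns a cleaned copy of body.messages, or throws a TypeError describing the problem.
function validateMessages(body) {
  const messages = body && body.messages;
  if (!Array.isArray(messages) || messages.length === 0) {
    throw new TypeError("messages must be a non-empty array");
  }
  return messages.map((m, i) => {
    if (!m || !ALLOWED_ROLES.has(m.role) || typeof m.content !== "string") {
      throw new TypeError(`messages[${i}] needs a valid role and string content`);
    }
    // Copy only the two expected fields so stray properties never reach the engine.
    return { role: m.role, content: m.content };
  });
}

// Hypothetical route wiring:
// app.post("/chat", async (req, res) => {
//   try {
//     res.json({ reply: await llama.chat(validateMessages(req.body)) });
//   } catch (err) {
//     res.status(err instanceof TypeError ? 400 : 500).json({ error: err.message });
//   }
// });
```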

Android — fully offline on-device

// build.gradle
dependencies {
    implementation("com.github.Zulqurnain:llmizeoff-android:0.3.1")
}

val engine = LlmizeOffEngine(context, modelPath = "models/qwen2.5-0.5b-q4_k_m.gguf")
engine.load()
val reply = engine.chat("Explain recursion in one sentence.")

Universal HTTP client

Works in browser, React Native, Node.js, Deno, and Bun — zero dependencies:

import { LlmizeOffClient } from "llmizeoff/client";

const client = new LlmizeOffClient({
  baseUrl: "https://your-server.example.com:8080",
  apiKey: "my-secret",
});

// Streaming
for await (const token of client.streamChat([
  { role: "user", content: "Write a haiku about self-hosting" }
])) {
  process.stdout.write(token);
}
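For readers who want to consume the stream without the client, the sketch below shows how text tokens could be pulled out of OpenAI-style server-sent events. It assumes the server emits `data: {...}` lines carrying `choices[0].delta.content` and ends with `data: [DONE]`, as the OpenAI streaming format does; the README does not spell out the wire format, and this is not a description of how `streamChat` is implemented. The function name is hypothetical.

```javascript
// Extract text tokens from a block of OpenAI-style server-sent events.
function tokensFromSse(sseText) {
  const tokens = [];
  for (const line of sseText.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank separator lines and comments
    const payload = line.slice(5).trim();
    if (payload === "[DONE]") break;         // end-of-stream sentinel
    const delta = JSON.parse(payload).choices?.[0]?.delta;
    if (delta && typeof delta.content === "string") tokens.push(delta.content);
  }
  return tokens;
}

const sample =
  'data: {"choices":[{"delta":{"role":"assistant"}}]}\n\n' +
  'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n' +
  'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n' +
  "data: [DONE]\n\n";
console.log(tokensFromSse(sample).join("")); // prints "Hello"
```

Network reads can split an event across chunks, so a real consumer should buffer incoming text and only parse up to the last blank line.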

VPS deployment (PM2 + nginx)

# 1. Install llmizeoff in the project and PM2 globally
npm install llmizeoff
npm install -g pm2

# 2. Download model
npx llmizeoff download --model qwen2.5-1.5b

# 3. Start with PM2 (PM2 passes the current shell environment to the process)
PORT=8080 HOSTNAME=127.0.0.1 OFFLLAMA_API_KEY=your-key \
  pm2 start node_modules/llmizeoff/dist/server.js --name llmizeoff

pm2 save && pm2 startup

Add an nginx reverse proxy to expose the server on your domain. See the Nayab source for a complete production example.

Why llmizeOFF?

| Feature | llmizeOFF | Cloud AI APIs |
|---------|-----------|---------------|
| Monthly cost | $0 (your VPS) | $20-100+/month |
| Data privacy | 100% local | Sent to cloud |
| Works offline | Yes | No |
| VPS / cPanel | Yes | No |
| Android (offline) | Yes | No |
| Vendor lock-in | None | Yes |

Roadmap — llmizeOFF Pro (coming soon)

The core runtime stays open-source and free forever. The upcoming Pro edition adds:

Visual dashboard for model management
One-click model download and switching
Multi-user support with rate limiting
Android SDK (AAR package)
Priority support and SLAs

Support development and shape the roadmap:

☕ Ko-fi: @zulqurnainjj
