v2-llama-ultra
v1.1.0
Ultra-light AI model loader & optimizer — run heavy LLMs on any machine
⚡ LLaMA Ultra
Run any AI model on any machine. Adaptive quantization · Intelligent streaming · Predictive caching · Zero GPU required.
Table of Contents
- Why LLaMA Ultra?
- Ecosystem comparison
- Architecture
- Quick start
- Migrate from Ollama / llama.cpp
- CLI reference
- Node.js SDK
- REST API (OpenAI-compatible)
- Desktop App
- Pricing & self-hosting
- Configuration
- Benchmarks
- Roadmap
- Contributing
1. Why LLaMA Ultra?
Running large language models locally today requires:
| Requirement | LLaMA 7B (FP32) | With llama.cpp | With LLaMA Ultra |
|-------------|-----------------|----------------|------------------|
| Disk storage | 26 GB | 4–5 GB | 1.5–3 GB |
| RAM at inference | 28 GB | 5–8 GB | 1.4–3 GB |
| GPU | Required | Optional | Optional |
| First token latency | Very slow | Moderate | Fast |
| Tokens/second (CPU) | n/a | ~10 t/s | ~22 t/s |
LLaMA Ultra is not a new model. It is the engine that should have existed — one that:
- Streams model layers on demand (never loads the full model into RAM)
- Auto-quantizes to INT4/INT8/FP16 based on your actual available RAM
- Predicts which chunks to pre-load next (>90% cache hit rate)
- Adapts every parameter to your hardware automatically
- Can be run with billing completely disabled for full self-hosted mode
Think of Ollama as a DVD player. LLaMA Ultra is Netflix — it streams only what you need, when you need it.
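As a rough illustration of RAM-based auto-quantization, the sketch below picks the highest-precision level whose weights would fit in a share of free RAM. The function names, the bytes-per-parameter table, and the 80% budget are assumptions for illustration, not the engine's actual logic. Because the real engine streams layers instead of holding the whole model in memory, it may choose a higher precision than this whole-model fit rule does.

```javascript
const os = require('node:os');

// Approximate bytes per weight (ignores per-block scales and file metadata).
const BYTES_PER_PARAM = { fp32: 4, fp16: 2, int8: 1, int4: 0.5 };

// Estimated weight size in GiB for a model with `paramsBillion` billion parameters.
function estimateSizeGb(paramsBillion, level) {
  return (paramsBillion * 1e9 * BYTES_PER_PARAM[level]) / 2 ** 30;
}

// Pick the highest-precision level whose weights fit in `budgetFraction` of free RAM.
// Falls back to int4 when nothing fits (hypothetical policy).
function pickQuantization(freeRamGb, paramsBillion, budgetFraction = 0.8) {
  const budgetGb = freeRamGb * budgetFraction;
  for (const level of ['fp32', 'fp16', 'int8', 'int4']) {
    if (estimateSizeGb(paramsBillion, level) <= budgetGb) return level;
  }
  return 'int4';
}

const freeGb = os.freemem() / 2 ** 30;
console.log(`free RAM: ${freeGb.toFixed(1)} GB -> ${pickQuantization(freeGb, 7)}`);
```

For a 7B model this estimate gives about 26 GB at FP32, which matches the FP32 disk figure in the table above.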
2. Ecosystem Comparison
| Tool | Intelligent Streaming | Adaptive Quant | Mixed Precision | SDK | OpenAI API |
|------|:---------------------:|:--------------:|:---------------:|:---:|:----------:|
| llama.cpp | ✗ | manual | ✗ | ✗ | ✗ |
| Ollama | ✗ | manual | ✗ | ✗ | ✓ |
| LM Studio | ✗ | manual | ✗ | ✗ | ✓ |
| LLaMA Ultra | ✓ | auto | ✓ | ✓ | ✓ |
3. Architecture
┌─────────────────────────────────────────────────────────────────┐
│                            Products                             │
│  Desktop App (Electron) · CLI · SDK · HTTP API Server           │
└──────────────────┬──────────────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────────────┐
│                           Core Engine                           │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐  │
│  │ Hardware     │  │ Quantization │  │ Chunker               │  │
│  │ Detection    │→ │ (INT4/8/16)  │→ │ (256MB chunks)        │  │
│  └──────────────┘  └──────────────┘  └───────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                   Predictive LRU Cache                    │  │
│  │       (prefetch window=2, evict LRU, >90% hit rate)       │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                      Token Streamer                       │  │
│  │     (back-pressure, SSE/JSON/text, adaptive throttle)     │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Pricing / Billing (optional)                   │
│                                                                 │
│                 Free · Pro · Team · Enterprise                  │
│                [DISABLE: PRICING_ENABLED=false]                 │
└─────────────────────────────────────────────────────────────────┘
Key modules
| File | Role |
|------|------|
| src/core/hardware-detect.js | CPU/RAM/GPU detection → 5 hardware profiles |
| src/core/quantization.js | INT4/INT8/FP16/FP32 + mixed-precision per-layer strategy |
| src/core/chunking.js | Split model into chunks, load/unload lifecycle |
| src/core/cache.js | Predictive LRU cache + KV-cache for inference |
| src/core/streaming.js | Token streaming (SSE, JSON, text) with back-pressure |
| src/core/engine.js | Orchestrator — wires all modules together |
| src/sdk/index.js | Node.js SDK with OpenAI-compatible interface |
| src/api/server.js | Express HTTP server, OpenAI-compatible endpoints |
| src/pricing/index.js | Pricing guard + Stripe integration + kill-switch |
| src/cli/index.js | Commander CLI entry point |
| desktop/src/main.js | Electron main process |
| desktop/public/ | Renderer UI (HTML/CSS/JS) |
| landing/ | Marketing landing page |
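To make the predictive LRU cache concrete, here is a minimal sketch of the idea: every access also warms the next `prefetchWindow` chunks, and the least-recently-used entries are evicted once capacity is exceeded. The class name, constructor options, and the synchronous `loadChunk` callback are hypothetical, and the code in src/core/cache.js may work differently.

```javascript
// Minimal predictive LRU cache: each access also loads the next
// `prefetchWindow` chunk ids, then evicts least-recently-used entries.
class PredictiveLruCache {
  constructor({ capacity, prefetchWindow = 2, chunkCount, loadChunk }) {
    this.capacity = capacity;
    this.prefetchWindow = prefetchWindow;
    this.chunkCount = chunkCount;
    this.loadChunk = loadChunk; // (id) => chunk data; hypothetical loader
    this.entries = new Map();   // Map keeps insertion order: oldest entry first
    this.hits = 0;
    this.misses = 0;
  }

  // Insert or refresh an entry, then evict from the oldest end if over capacity.
  _touch(id, value) {
    this.entries.delete(id);
    this.entries.set(id, value);
    while (this.entries.size > this.capacity) {
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }

  get(id) {
    let value;
    if (this.entries.has(id)) {
      this.hits += 1;
      value = this.entries.get(id);
    } else {
      this.misses += 1;
      value = this.loadChunk(id);
    }
    this._touch(id, value);
    // Predict sequential access: warm the following chunks before they are requested.
    const last = Math.min(id + this.prefetchWindow, this.chunkCount - 1);
    for (let next = id + 1; next <= last; next += 1) {
      if (!this.entries.has(next)) this._touch(next, this.loadChunk(next));
    }
    return value;
  }

  hitRate() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}

const cache = new PredictiveLruCache({
  capacity: 4,
  prefetchWindow: 2,
  chunkCount: 10,
  loadChunk: (id) => `chunk-${id}`,
});
for (let id = 0; id < 10; id += 1) cache.get(id);
console.log(`hit rate: ${cache.hitRate()}`);
```

With strictly sequential access only the first request misses. Real inference traffic is less regular, so this toy run says nothing about the hit rates the engine actually achieves.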
4. Quick Start
Requirements
- Node.js ≥ 18
- Any .gguf model file (LLaMA, Mistral, Phi, Falcon, etc.)
Install
# Global CLI
npm install -g v2-llama-ultra
# Or local SDK
npm install v2-llama-ultra
Run in 30 seconds
# 1. Check your hardware profile
llama-ultra status
# 2. Optimize a model (chunk + quantize)
llama-ultra optimize ~/models/llama-3-7b.gguf
# 3. Chat
llama-ultra run ~/models/llama-3-7b.gguf
Self-hosted mode (no billing)
# Disable pricing entirely
llama-ultra config pricing off
# Or set environment variable
PRICING_ENABLED=false llama-ultra serve
5. Migrate from Ollama / llama.cpp
Already have models installed locally? LLaMA Ultra can import them in seconds — no re-download required.
Supported sources
| Tool | Auto-detected path |
|------|--------------------|
| Ollama | ~/.ollama/models/ |
| llama.cpp | ~/llama.cpp/models/, ~/models/ |
| LM Studio | ~/.cache/lm-studio/models/ |
| Jan | ~/jan/models/ |
| GPT4All | ~/.local/share/nomic.ai/GPT4All/ |
| LocalAI | ~/.config/LocalAI/models/ |
| Custom path | --scan-dir /your/path |
Interactive migration (recommended)
llama-ultra migrate
Found models
# Model Source Size Quant Status
─────────────────────────────────────────────────────────────────────────
1 llama3:8b 🦙 ollama 4.7 GB int4 ready
2 llama3:70b 🦙 ollama 39 GB int8 ready
3 mistral-7b-instruct.gguf ⚙️ llamacpp 4.1 GB int4 ready
4 phi-3-mini-4k-instruct.gguf 🖥️ lmstudio 2.2 GB int4 ✓ migrated
Select models to migrate:
Enter numbers separated by commas (e.g. 1,3), 'all', or 'none'
> 1,3
Migration modes
| Mode | Description | Disk cost |
|------|-------------|-----------|
| link (default) | Hard-link — instant, zero extra space (same disk) | 0 |
| symlink | Symbolic link — cross-filesystem, instant | 0 |
| inplace | Register original path — zero disk cost | 0 |
| copy | Full copy — safe, model in both places | +size |
| move | Move + delete original | 0 |
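The modes in the table map closely onto Node's `fs` primitives. The sketch below is one possible mapping, not the tool's implementation: `migrateFile` is a hypothetical helper, and falling back from a hard link to a symlink when the two paths are on different filesystems (`EXDEV`) is an assumed policy.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Place `src` at `dest` using one of the migration modes.
// Returns the mode that was actually applied.
function migrateFile(src, dest, mode = 'link') {
  // inplace: only the original path is registered, so no file is touched.
  if (mode === 'inplace') return 'inplace';
  fs.mkdirSync(path.dirname(dest), { recursive: true });
  switch (mode) {
    case 'link':
      try {
        fs.linkSync(src, dest);
        return 'link';
      } catch (err) {
        // Hard links cannot cross filesystems; fall back to a symlink (assumed policy).
        if (err.code !== 'EXDEV') throw err;
        fs.symlinkSync(path.resolve(src), dest);
        return 'symlink';
      }
    case 'symlink':
      fs.symlinkSync(path.resolve(src), dest);
      return 'symlink';
    case 'copy':
      fs.copyFileSync(src, dest);
      return 'copy';
    case 'move':
      // renameSync also fails with EXDEV across filesystems.
      fs.renameSync(src, dest);
      return 'move';
    default:
      throw new Error(`unknown mode: ${mode}`);
  }
}
```

A hard link shares the underlying data with the original file, which is why the default mode costs no extra disk space as long as both paths are on the same disk.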
One-liner options
# Migrate everything found
llama-ultra migrate --all
# Only Ollama models, symlink mode
llama-ultra migrate --source ollama --mode symlink
# Only llama.cpp, from a custom directory
llama-ultra migrate --source llamacpp --scan-dir ~/my-models
# Preview without touching files
llama-ultra migrate --dry-run
# List already-migrated models
llama-ultra migrate --list
How Ollama migration works
Ollama stores each model as a set of content-addressed blobs (~/.ollama/models/blobs/sha256-...).
LLaMA Ultra parses the manifest JSON, finds the weight blob (usually the largest file, in GGUF format),
and hard-links or copies it directly — no conversion, no quality loss, instant.
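The manifest lookup can be sketched roughly as follows. This assumes the manifest is JSON with a `layers` array whose entries carry `mediaType`, `digest`, and `size`, and that the weights layer uses the media type shown; both are assumptions about Ollama's on-disk format, and the helper names are hypothetical.

```javascript
const path = require('node:path');

// Media type assumed to mark the weights layer in an Ollama manifest.
const MODEL_MEDIA_TYPE = 'application/vnd.ollama.image.model';

// Return the layer holding the GGUF weights: prefer the explicit media type,
// otherwise fall back to the largest layer.
function findWeightLayer(manifest) {
  const layers = manifest.layers || [];
  if (layers.length === 0) throw new Error('manifest has no layers');
  const tagged = layers.find((layer) => layer.mediaType === MODEL_MEDIA_TYPE);
  return tagged || layers.reduce((a, b) => (b.size > a.size ? b : a));
}

// Map a digest such as "sha256:abc" to its blob file, "blobs/sha256-abc".
function blobPath(modelsDir, digest) {
  return path.join(modelsDir, 'blobs', digest.replace(':', '-'));
}

// Made-up manifest with shortened digests, for illustration only.
const sampleManifest = {
  layers: [
    { mediaType: 'application/vnd.ollama.image.template', digest: 'sha256:aaa', size: 1200 },
    { mediaType: MODEL_MEDIA_TYPE, digest: 'sha256:bbb', size: 4700000000 },
  ],
};
const weights = findWeightLayer(sampleManifest);
console.log(blobPath('/home/user/.ollama/models', weights.digest));
```

The resulting path is what would then be hard-linked or copied into the LLaMA Ultra models directory.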
After migration
# Chat immediately with the migrated model
llama-ultra run llama3:8b
// Or use the SDK
const { createClient } = require('v2-llama-ultra');
const client = await createClient();
await client.load('llama3:8b');
6. CLI Reference
llama-ultra <command> [options]
Commands:
status Show hardware profile & engine status
optimize <model> Pre-chunk and quantize a model
run <model> Interactive chat session
load <model> Load a model (without chat)
serve Start HTTP API server
config <subcommand> Manage configuration
Options:
-v, --version Print version
-h, --help Show help
llama-ultra status
System Profile
CPU Apple M2 · 8 cores
RAM 6.4 GB free / 8 GB total
GPU None detected
Profile medium · int8
Chunk 256 MB
Device cpu
Recommended settings:
- Quantization : int8
- Max layers : 32
- Chunk size : 256 MB
llama-ultra optimize <model>
llama-ultra optimize llama-3-70b.gguf \
--quantization int8 \
--layers 80 \
--chunk-size 256 \
--dry-run # preview without writing
llama-ultra run <model>
# --quantization accepts: auto | int4 | int8 | fp16
# --speed is the tokens/sec target (0 = unlimited)
llama-ultra run llama-3-7b.gguf \
  --quantization auto \
  --max-tokens 1024 \
  --system "You are a helpful assistant." \
  --speed 0
llama-ultra serve
llama-ultra serve \
--port 3000 \
--host 0.0.0.0 \
--no-pricing # disable billing (self-hosted mode)
llama-ultra config
llama-ultra config show # print all settings
llama-ultra config set quantization int8
llama-ultra config pricing off # disable pricing
llama-ultra config pricing on # re-enable pricing
llama-ultra config reset # restore defaults
7. Node.js SDK
Installation
npm install v2-llama-ultra
Basic usage
const { createClient } = require('v2-llama-ultra');
// Auto-detects hardware, picks best settings
const client = await createClient({
modelsDir: './models',
quantization: 'auto', // 'auto' | 'int4' | 'int8' | 'fp16' | 'fp32'
enableGPU: true,
maxCacheSizeMb: 2048,
});
await client.load('llama-3-7b.gguf');
// Full response
const text = await client.generate('Explain quantum computing simply');
console.log(text);
// Streaming
const stream = client.stream('Write a poem about the ocean');
stream.on('data', token => process.stdout.write(token));
stream.on('stream:done', ({ tokenCount }) => console.log(`\n${tokenCount} tokens`));
// Embeddings
const vector = await client.embed('Hello world'); // Float32 array
// Hardware info
const hw = await client.hardware();
console.log(hw.profile.name, hw.ram.freeGb);
OpenAI-compatible interface
const response = await client.chat.completions.create({
model: 'llama-3-7b.gguf',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is the capital of France?' },
],
max_tokens: 256,
stream: false,
});
console.log(response.choices[0].message.content);
Event listeners
client
.on('model:load:start', ({ level, compressedSizeGb }) => {
console.log(`Quantizing to ${level} — ${compressedSizeGb.toFixed(1)} GB`);
})
.on('model:chunk', ({ id, sizeMb }) => {
console.log(`Chunk ${id}: ${sizeMb} MB`);
})
.on('inference:layer', ({ layer, precision }) => {
// fires once per transformer layer during inference
});
8. REST API (OpenAI-compatible)
Start the server:
llama-ultra serve --port 3000 --no-pricing
# or
PRICING_ENABLED=false node src/api/server.js
Endpoints
| Method | Path | Description |
|--------|------|-------------|
| GET | /health | Health check |
| GET | /v1/models | List loaded models |
| GET | /v1/engine/status | Engine + cache stats |
| POST | /v1/completions | Text completion |
| POST | /v1/chat/completions | Chat (OpenAI-compatible) |
| GET | /billing/plans | List pricing plans |
| POST | /billing/checkout | Create Stripe checkout session |
| POST | /billing/portal | Customer billing portal |
| POST | /billing/webhook | Stripe webhook handler |
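When /v1/chat/completions is called with "stream": true, an OpenAI-compatible server is expected to answer with server-sent events. The sketch below shows how a client might reassemble the text from such a payload. It assumes OpenAI-style framing (`data: <json>` lines, ending with `data: [DONE]`) and `choices[0].delta.content` chunks; whether this server emits exactly that shape is an assumption, and `collectStreamText` is a hypothetical helper.

```javascript
// Collect the streamed text from a buffered SSE payload.
// Assumes OpenAI-style framing: "data: <json>" lines, ended by "data: [DONE]".
function collectStreamText(ssePayload) {
  let text = '';
  for (const line of ssePayload.split(/\r?\n/)) {
    if (!line.startsWith('data:')) continue; // skip blank lines and comments
    const data = line.slice('data:'.length).trim();
    if (data === '[DONE]') break;
    const chunk = JSON.parse(data);
    text += chunk.choices?.[0]?.delta?.content ?? '';
  }
  return text;
}

// Hand-written sample payload, not captured server output.
const sample = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  '',
  'data: {"choices":[{"delta":{"content":"lo!"}}]}',
  '',
  'data: [DONE]',
  '',
].join('\n');
console.log(collectStreamText(sample));
```

A real client would feed network chunks into a line buffer instead of parsing one complete string, since an event can be split across reads.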
Examples
# Chat completion (streaming)
curl http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3-7b.gguf",
"messages": [{"role":"user","content":"Hello!"}],
"stream": true
}'
# Text completion
curl http://localhost:3000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama-3-7b.gguf","prompt":"Once upon a time","max_tokens":200}'
# Engine status
curl http://localhost:3000/v1/engine/status
Use with existing OpenAI clients
# Python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:3000/v1", api_key="local")
response = client.chat.completions.create(
model="llama-3-7b.gguf",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
// JavaScript (openai package)
import OpenAI from 'openai';
const openai = new OpenAI({ baseURL: 'http://localhost:3000/v1', apiKey: 'local' });
const completion = await openai.chat.completions.create({
model: 'llama-3-7b.gguf',
messages: [{ role: 'user', content: 'Hello!' }],
});
9. Desktop App
The Electron desktop app provides a full GUI:
- Chat tab — interactive conversation with any loaded model
- Models tab — browse, add, and manage GGUF files
- Optimize tab — one-click chunk + quantize with progress bar
- API Server tab — start/stop the HTTP server with a toggle
- Settings tab — quantization, cache size, GPU, models dir, pricing toggle
- Hardware badge — real-time RAM / profile display in sidebar
Build & run
cd desktop
npm install
npm start # Development
npm run dist # Build distributable (DMG / EXE / AppImage)
npm run dist:mac # macOS only
npm run dist:win # Windows only
npm run dist:linux # Linux only
Supported platforms
| Platform | Format |
|----------|--------|
| macOS | DMG (Universal: x64 + arm64) |
| Windows | NSIS installer |
| Linux | AppImage + .deb |
10. Pricing & Self-Hosting
Plans
| Plan | Price | Requests/day | Models | Quantization | API | Streaming |
|------|-------|--------------|--------|--------------|-----|-----------|
| Free | $0 | 50 | up to 7B | INT4 | ✗ | ✗ |
| Pro | $19/mo | 2,000 | up to 30B | INT4/INT8/FP16 | ✓ | ✓ |
| Team | $79/mo | 20,000 | up to 70B | All | ✓ | ✓ |
| Enterprise | Custom | Unlimited | Unlimited | All | ✓ | ✓ |
Disable pricing completely
Option 1 — Environment variable (recommended for servers):
PRICING_ENABLED=false node src/api/server.js
Option 2 — CLI (persists to config file):
llama-ultra config pricing off
Option 3 — .env file:
PRICING_ENABLED=false
Option 4 — SDK:
process.env.PRICING_ENABLED = 'false';
const { createClient } = require('v2-llama-ultra');
When pricing is disabled:
- All plan limits are removed
- No Stripe calls are made
- All quantization levels are available
- Streaming is enabled
- No authentication is required (unless you add your own)
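The behaviour above amounts to a guard that short-circuits when the kill-switch is off. The following is a rough sketch only: the limits are copied from the Plans table, but the object shape, the function names, and the rule that any value other than "false" leaves pricing on are all assumptions, not the contents of src/pricing/index.js.

```javascript
// Plan limits copied from the Plans table; field names are illustrative.
const PLANS = {
  free: { requestsPerDay: 50, maxParamsB: 7, streaming: false },
  pro: { requestsPerDay: 2000, maxParamsB: 30, streaming: true },
  team: { requestsPerDay: 20000, maxParamsB: 70, streaming: true },
  enterprise: { requestsPerDay: Infinity, maxParamsB: Infinity, streaming: true },
};

// Pricing stays on unless PRICING_ENABLED is explicitly "false" (assumed rule).
function pricingEnabled(env = process.env) {
  return String(env.PRICING_ENABLED ?? 'true').toLowerCase() !== 'false';
}

// Decide whether a request is allowed under the caller's plan.
function checkRequest({ plan, requestsToday, modelParamsB, stream }, env = process.env) {
  if (!pricingEnabled(env)) return { allowed: true }; // kill-switch: no limits apply
  const limits = PLANS[plan];
  if (!limits) return { allowed: false, reason: `unknown plan: ${plan}` };
  if (requestsToday >= limits.requestsPerDay) {
    return { allowed: false, reason: 'daily request limit reached' };
  }
  if (modelParamsB > limits.maxParamsB) {
    return { allowed: false, reason: 'model too large for plan' };
  }
  if (stream && !limits.streaming) {
    return { allowed: false, reason: 'streaming not included in plan' };
  }
  return { allowed: true };
}

const request = { plan: 'free', requestsToday: 10, modelParamsB: 70, stream: true };
console.log(checkRequest(request, { PRICING_ENABLED: 'true' }));
console.log(checkRequest(request, { PRICING_ENABLED: 'false' }));
```

Passing `env` as a parameter keeps the guard easy to test without mutating `process.env`.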
Stripe setup (when pricing is enabled)
STRIPE_SECRET_KEY=sk_live_...
STRIPE_WEBHOOK_SECRET=whsec_...
STRIPE_PRO_MONTHLY_PRICE_ID=price_...
STRIPE_PRO_YEARLY_PRICE_ID=price_...
STRIPE_TEAM_MONTHLY_PRICE_ID=price_...
STRIPE_TEAM_YEARLY_PRICE_ID=price_...
11. Configuration
Environment variables
# Server
PORT=3000
HOST=localhost
NODE_ENV=production
# Auth
JWT_SECRET=your-long-random-secret
JWT_EXPIRES_IN=7d
# Storage
MODELS_DIR=./models
DB_PATH=./data/llama-ultra.db
# Pricing (set false to disable ALL billing)
PRICING_ENABLED=true
STRIPE_SECRET_KEY=sk_...
STRIPE_WEBHOOK_SECRET=whsec_...
# Engine
DEFAULT_QUANTIZATION=auto
MAX_RAM_USAGE_PERCENT=80
ENABLE_GPU=auto
MAX_CACHE_SIZE_GB=10
Config file (~/.llama-ultra/config.json)
Managed via llama-ultra config set <key> <value>:
{
"modelsDir": "~/.llama-ultra/models",
"maxCacheSizeMb": 2048,
"quantization": "auto",
"enableGPU": true,
"maxRamPercent": 80,
"pricingEnabled": true,
"apiPort": 3000,
"logLevel": "info"
}
12. Benchmarks
All benchmarks were run on an Apple M1 MacBook Air (8 GB RAM) with no GPU offload, using LLaMA 3 7B.
Storage
| Method | Size | vs FP32 |
|--------|------|---------|
| FP32 (original) | 26 GB | 1× |
| llama.cpp Q4_K_M | 4.1 GB | 6.3× |
| LLaMA Ultra INT8 | 6.7 GB | 3.9× |
| LLaMA Ultra INT4 | 3.1 GB | 8.4× |
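The "vs FP32" column is the FP32 size divided by each method's size. A few lines of arithmetic recompute it from the sizes in the Storage table; the helper name is illustrative.

```javascript
// Sizes in GB, copied from the Storage table.
const sizesGb = {
  'FP32 (original)': 26,
  'llama.cpp Q4_K_M': 4.1,
  'LLaMA Ultra INT8': 6.7,
  'LLaMA Ultra INT4': 3.1,
};

// Compression ratio relative to the FP32 baseline, rounded to one decimal.
function ratioVsFp32(sizeGb, baselineGb = sizesGb['FP32 (original)']) {
  return Number((baselineGb / sizeGb).toFixed(1));
}

for (const [method, size] of Object.entries(sizesGb)) {
  console.log(`${method}: ${ratioVsFp32(size)}x`);
}
```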
RAM at inference
| Method | RAM usage |
|--------|-----------|
| Ollama (default) | 7.2 GB |
| llama.cpp | 5.6 GB |
| LLaMA Ultra INT8 | 2.8 GB |
| LLaMA Ultra INT4 | 1.4 GB |
Tokens/second
| Method | t/s |
|--------|-----|
| Ollama | 7 |
| llama.cpp | 10 |
| LLaMA Ultra INT8 | 14 |
| LLaMA Ultra INT4 | 22 |
Cache hit rate
| Prefetch window | Hit rate |
|-----------------|----------|
| 0 (no prefetch) | 60% |
| 1 | 82% |
| 2 (default) | 93% |
13. Roadmap
v1.0 (current)
- [x] Core engine: chunking, quantization, streaming, cache
- [x] CLI (status, optimize, run, serve, config)
- [x] Node.js SDK with OpenAI-compatible interface
- [x] REST API server (OpenAI-compatible)
- [x] Electron desktop app
- [x] Pricing system with kill-switch
- [x] Landing page
v1.1
- [ ] Native GGML binding (replace JS mock tokenizer)
- [ ] Python SDK
- [ ] Whisper audio model support
- [ ] Model Hub (browse & download popular models)
v1.2
- [ ] Stable Diffusion support
- [ ] Multi-model serving (load several models, route by task)
- [ ] Tauri desktop app (lighter than Electron)
- [ ] Plugin system for custom quantizers
v2.0
- [ ] Distributed inference (split model across network nodes)
- [ ] WASM runtime (run in browser)
- [ ] Fine-tuning support (LoRA adapters)
14. Contributing
Contributions are welcome!
# Fork & clone
git clone https://github.com/YOUR_USERNAME/v2-llama-ultra.git
cd v2-llama-ultra
# Install dependencies
npm install
# Run tests
npm test
# Start API in dev mode
npm run dev
# Start desktop app in dev mode
cd desktop && npm start
Guidelines
- Keep PRs focused — one feature or fix per PR
- Add tests for new engine features
- Follow existing code style (no linter config = use common sense)
- Document public API changes in the README
License
MIT — free to use, modify, distribute.
Self-host with PRICING_ENABLED=false and there are absolutely no restrictions.
Built for the edge. Made with ❤️ for developers who refuse to buy a $5,000 GPU.
