# local-llm

v1.0.3
Run LLMs locally in Node.js with an OpenAI-compatible API. No cloud, no API keys, no data leaves your machine.
```bash
npm install local-llm
```

```js
import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({
  model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
});

const response = await ai.chat.completions.create({
  messages: [{ role: 'user', content: 'What is the capital of France?' }],
  max_tokens: 128,
});

console.log(response.choices[0].message.content);
```

## Features
- **OpenAI-compatible API** - the same `chat.completions.create()` interface you already know
- **Vision / Multimodal** - send images alongside text using the GPT-4V content format
- **Vercel AI SDK** - drop-in provider for `generateText()` and `streamText()`
- **Auto model download** - pass a HuggingFace URL or shorthand; models are downloaded and cached automatically
- **GPU auto-detection** - detects Metal (macOS) and CUDA (Linux/Windows) automatically
- **Streaming** - full streaming support via async iterators
- **TypeScript-first** - complete type definitions out of the box
- **No dependencies** - native C++ bindings to llama.cpp; no Python, no external servers
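The streaming feature surfaces completions as async iterators. A minimal sketch of the consumption pattern (using a mock stream for illustration, since real chunks would come from `ai.chat.completions.create({ stream: true })`):

```javascript
// Mock stream standing in for a streaming completion; real chunks follow
// the OpenAI shape: { choices: [{ delta: { content } }] }.
async function* mockStream() {
  for (const piece of ['Hello', ', ', 'world']) {
    yield { choices: [{ delta: { content: piece } }] };
  }
}

// Accumulate streamed deltas into the full response text.
async function collectText(stream) {
  let text = '';
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content ?? '';
  }
  return text;
}

collectText(mockStream()).then((text) => console.log(text)); // Hello, world
```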
## Platform Support

| Platform | GPU | Status |
|---|---|---|
| macOS Apple Silicon (M1-M4) | Metal | Supported |
| macOS Intel | Metal | Supported |
| Linux x64 | CPU | Supported |
| Windows x64 | CPU | Supported |
| Linux ARM64 | CPU | Coming soon |
| Linux/Windows CUDA | NVIDIA GPU | Coming soon |
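Given the support matrix above, the `'auto'` compute mode can be pictured as a platform check. This is a hypothetical sketch (`detectCompute` is not part of the package's API) that mirrors the table rather than the library's actual hardware probing:

```javascript
// Hypothetical sketch of 'auto' compute selection based on the support matrix.
function detectCompute(platform, arch) {
  if (platform === 'darwin') return 'gpu'; // Metal on Apple Silicon and Intel Macs
  if (platform === 'linux' && arch === 'arm64') {
    throw new Error('Linux ARM64 is not supported yet');
  }
  return 'cpu'; // Linux/Windows x64 run on CPU until CUDA support lands
}

console.log(detectCompute(process.platform, process.arch));
```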
## Quick Start

### 1. Install

```bash
npm install local-llm
```

### 2. Choose a Model

Any GGUF model from HuggingFace works. Some recommendations:
| Model | Size | Good for |
|---|---|---|
| TinyLlama 1.1B Q4_K_M | ~636 MB | Testing, development |
| Llama 3.1 8B Q4_K_M | ~4.9 GB | General use |
| Mistral 7B Q4_K_M | ~4.4 GB | General use |
| Phi-3 Mini Q4_K_M | ~2.2 GB | Lightweight, fast |
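The `model` option accepts a HuggingFace shorthand, a full URL, or a local path. A sketch of how such a shorthand could map to a download URL (a hypothetical helper; the package's real resolver may differ):

```javascript
// Hypothetical resolver for the three accepted model source forms.
function resolveModelSource(model) {
  if (/^https?:\/\//.test(model)) return { kind: 'url', url: model };
  if (model.startsWith('.') || model.startsWith('/')) return { kind: 'path', path: model };
  // Shorthand 'user/repo/file.gguf': last segment is the file, the rest is the repo.
  const segments = model.split('/');
  const file = segments.pop();
  const repo = segments.join('/');
  return { kind: 'url', url: `https://huggingface.co/${repo}/resolve/main/${file}` };
}

console.log(resolveModelSource('./models/my-model.gguf').kind); // path
```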
### 3. Use

```js
import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({
  model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
});

// Chat completion (same API as OpenAI)
const response = await ai.chat.completions.create({
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain gravity in one sentence.' },
  ],
  max_tokens: 128,
  temperature: 0.7,
});
console.log(response.choices[0].message.content);

// Streaming
const stream = await ai.chat.completions.create({
  messages: [{ role: 'user', content: 'Write a haiku about coding.' }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}

// Clean up
ai.dispose();
```

## Vercel AI SDK
```js
import { generateText } from 'ai';
import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({ model: 'user/repo/model.gguf' });

const { text } = await generateText({ model: ai.languageModel(), prompt: 'Hello!' });
console.log(text);

ai.dispose();
```

## Preloading
Pre-download a model at app startup so users don't wait:
```js
// App startup — download runs in the background, app doesn't block
LocalLLM.preload('user/repo/model.gguf');

// Later, when AI is needed — the model is cached, so create() is fast
const ai = await LocalLLM.create({ model: 'user/repo/model.gguf' });
```

## Vision / Multimodal
Send images alongside text using the same OpenAI GPT-4V content format. Requires a vision model and its projector file:
```js
import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({
  model: 'Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf',
  projector: 'Qwen/Qwen3-VL-8B-Instruct-GGUF/mmproj-Qwen3VL-8B-Instruct-F16.gguf',
});

const response = await ai.chat.completions.create({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this image?' },
      { type: 'image_url', image_url: { url: 'data:image/png;base64,...' } },
    ],
  }],
  max_tokens: 256,
});

console.log(response.choices[0].message.content);

ai.dispose();
```

Images can be provided as `data:` URIs (base64), local file paths, or HTTP URLs. Streaming works too — just add `stream: true`.
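To build the base64 `data:` URI form from a local file, Node's standard library suffices (the helper name here is our own, not part of the package):

```javascript
import { readFileSync } from 'node:fs';

// Encode a local image file as a base64 data: URI for the image_url content part.
function toDataUri(filePath, mimeType = 'image/png') {
  const base64 = readFileSync(filePath).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

// Usage: { type: 'image_url', image_url: { url: toDataUri('./photo.png') } }
```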
## Configuration

```js
const ai = await LocalLLM.create({
  // Model source (required)
  model: 'user/repo/file.gguf',           // HuggingFace shorthand
  // model: 'https://huggingface.co/...', // Full URL
  // model: './models/my-model.gguf',     // Local file path

  // Vision projector (optional — required for vision models)
  // projector: 'user/repo/mmproj-file.gguf',

  // Compute mode (default: 'auto')
  compute: 'auto',    // Auto-detect GPU
  // compute: 'gpu',    // Force GPU (Metal/CUDA)
  // compute: 'cpu',    // Force CPU only
  // compute: 'hybrid', // Split between CPU and GPU

  // Context options
  contextSize: 2048, // Context window size
  batchSize: 512,    // Batch size for prompt processing
  threads: 4,        // CPU thread count

  // Download options
  cacheDir: '~/.local-llm/models', // Model cache directory
  onProgress: (pct) => {           // Download progress callback
    console.log(`${pct.toFixed(1)}%`);
  },
});
```

## Generation Options
```js
const response = await ai.chat.completions.create({
  messages: [...],
  max_tokens: 256,        // Maximum tokens to generate
  temperature: 0.7,       // Randomness (0.0 = deterministic, 2.0 = very random)
  top_p: 0.9,             // Nucleus sampling
  top_k: 40,              // Top-k sampling
  frequency_penalty: 1.1, // Repetition penalty
  seed: 42,               // Reproducible output
  stream: false,          // Set to true for streaming
});
```

## Peer Dependencies
The Vercel AI SDK integration is optional. Install `ai` if you want to use `generateText()` / `streamText()`:

```bash
npm install ai
```

## API Reference
See the full API documentation.
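For orientation while reading the API docs: responses throughout this README follow the OpenAI chat-completion shape. A minimal sketch of the fields the examples access (illustrative only, not the package's full typings):

```javascript
// Illustrative response object showing the fields used throughout this README.
const response = {
  choices: [
    {
      message: { role: 'assistant', content: 'Paris.' },
      finish_reason: 'stop',
    },
  ],
};

console.log(response.choices[0].message.content); // Paris.
```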
## Advanced Usage
For lower-level control, you can use the engine classes directly:
```js
import { Model, InferenceContext } from 'local-llm';

const model = new Model('./model.gguf', { compute: 'gpu' });
const ctx = model.createContext({ contextSize: 4096 });

// Tokenize
const tokens = model.tokenize('Hello world');
const text = model.detokenize(tokens);

// Chat template
const prompt = model.applyChatTemplate([
  { role: 'user', content: 'Hello' },
], true);

// Generate
const result = await ctx.generate(prompt, { maxTokens: 128 });

// Stream
for await (const token of ctx.stream(prompt, { maxTokens: 128 })) {
  process.stdout.write(token);
}

ctx.dispose();
model.dispose();
```

## Model Manager
Download and cache models programmatically:
```js
import { ModelManager } from 'local-llm';

const manager = new ModelManager();

// Download with progress
const path = await manager.downloadModel(
  'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
  {
    onProgress: (downloaded, total, pct) => {
      console.log(`${pct.toFixed(1)}%`);
    },
  },
);

// List cached models
const models = await manager.listModels();

// Remove a cached model
await manager.removeModel('https://huggingface.co/...');
```

## License
MIT - See LICENSE for details.
Made by Hilum Labs.
