# local-llm

v1.0.3
Run LLMs locally in Node.js with an OpenAI-compatible API. No cloud, no API keys, no data leaves your machine.
```bash
npm install local-llm
```

```js
import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({
  model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
});

const response = await ai.chat.completions.create({
  messages: [{ role: 'user', content: 'What is the capital of France?' }],
  max_tokens: 128,
});

console.log(response.choices[0].message.content);
```

## Features
- **OpenAI-compatible API** - the same `chat.completions.create()` interface you already know
- **Vision / Multimodal** - send images alongside text using the GPT-4V content format
- **Vercel AI SDK** - drop-in provider for `generateText()` and `streamText()`
- **Auto model download** - pass a HuggingFace URL or shorthand; models are downloaded and cached automatically
- **GPU auto-detection** - detects Metal (macOS) and CUDA (Linux/Windows) automatically
- **Streaming** - full streaming support via async iterators
- **TypeScript-first** - complete type definitions out of the box
- **No dependencies** - native C++ bindings to llama.cpp; no Python, no external servers
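The streaming feature surfaces completions as async iterators. A minimal sketch of the consumption pattern (using a mock stream for illustration, since real chunks would come from `ai.chat.completions.create({ stream: true })`):

```javascript
// Mock stream standing in for a streaming completion; real chunks follow
// the OpenAI shape: { choices: [{ delta: { content } }] }.
async function* mockStream() {
  for (const piece of ['Hello', ', ', 'world']) {
    yield { choices: [{ delta: { content: piece } }] };
  }
}

// Accumulate streamed deltas into the full response text.
async function collectText(stream) {
  let text = '';
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content ?? '';
  }
  return text;
}

collectText(mockStream()).then((text) => console.log(text)); // Hello, world
```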
## Platform Support

| Platform | GPU | Status |
|---|---|---|
| macOS Apple Silicon (M1-M4) | Metal | Supported |
| macOS Intel | Metal | Supported |
| Linux x64 | CPU | Supported |
| Windows x64 | CPU | Supported |
| Linux ARM64 | CPU | Coming soon |
| Linux/Windows CUDA | NVIDIA GPU | Coming soon |
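Given the support matrix above, the `'auto'` compute mode can be pictured as a platform check. This is a hypothetical sketch (`detectCompute` is not part of the package's API) that mirrors the table rather than the library's actual hardware probing:

```javascript
// Hypothetical sketch of 'auto' compute selection based on the support matrix.
function detectCompute(platform, arch) {
  if (platform === 'darwin') return 'gpu'; // Metal on Apple Silicon and Intel Macs
  if (platform === 'linux' && arch === 'arm64') {
    throw new Error('Linux ARM64 is not supported yet');
  }
  return 'cpu'; // Linux/Windows x64 run on CPU until CUDA support lands
}

console.log(detectCompute(process.platform, process.arch));
```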
## Quick Start

### 1. Install

```bash
npm install local-llm
```

### 2. Choose a Model

Any GGUF model from HuggingFace works. Some recommendations:
| Model | Size | Good for |
|---|---|---|
| TinyLlama 1.1B Q4_K_M | ~636 MB | Testing, development |
| Llama 3.1 8B Q4_K_M | ~4.9 GB | General use |
| Mistral 7B Q4_K_M | ~4.4 GB | General use |
| Phi-3 Mini Q4_K_M | ~2.2 GB | Lightweight, fast |
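The `model` option accepts a HuggingFace shorthand, a full URL, or a local path. A sketch of how such a shorthand could map to a download URL (a hypothetical helper; the package's real resolver may differ):

```javascript
// Hypothetical resolver for the three accepted model source forms.
function resolveModelSource(model) {
  if (/^https?:\/\//.test(model)) return { kind: 'url', url: model };
  if (model.startsWith('.') || model.startsWith('/')) return { kind: 'path', path: model };
  // Shorthand 'user/repo/file.gguf': last segment is the file, the rest is the repo.
  const segments = model.split('/');
  const file = segments.pop();
  const repo = segments.join('/');
  return { kind: 'url', url: `https://huggingface.co/${repo}/resolve/main/${file}` };
}

console.log(resolveModelSource('./models/my-model.gguf').kind); // path
```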
### 3. Use

```js
import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({
  model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
});

// Chat completion (same API as OpenAI)
const response = await ai.chat.completions.create({
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain gravity in one sentence.' },
  ],
  max_tokens: 128,
  temperature: 0.7,
});
console.log(response.choices[0].message.content);

// Streaming
const stream = await ai.chat.completions.create({
  messages: [{ role: 'user', content: 'Write a haiku about coding.' }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}

// Clean up
ai.dispose();
```

## Vercel AI SDK
```js
import { generateText } from 'ai';
import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({ model: 'user/repo/model.gguf' });

const { text } = await generateText({ model: ai.languageModel(), prompt: 'Hello!' });
console.log(text);

ai.dispose();
```

## Preloading
Pre-download a model at app startup so users don't wait:
```js
// App startup — download runs in the background, app doesn't block
LocalLLM.preload('user/repo/model.gguf');

// Later, when AI is needed — the model is cached, so create() is fast
const ai = await LocalLLM.create({ model: 'user/repo/model.gguf' });
```

## Vision / Multimodal
Send images alongside text using the same OpenAI GPT-4V content format. Requires a vision model and its projector file:
```js
import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({
  model: 'Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf',
  projector: 'Qwen/Qwen3-VL-8B-Instruct-GGUF/mmproj-Qwen3VL-8B-Instruct-F16.gguf',
});

const response = await ai.chat.completions.create({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this image?' },
      { type: 'image_url', image_url: { url: 'data:image/png;base64,...' } },
    ],
  }],
  max_tokens: 256,
});

console.log(response.choices[0].message.content);

ai.dispose();
```

Images can be provided as `data:` URIs (base64), local file paths, or HTTP URLs. Streaming works too — just add `stream: true`.
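To build the base64 `data:` URI form from a local file, Node's standard library suffices (the helper name here is our own, not part of the package):

```javascript
import { readFileSync } from 'node:fs';

// Encode a local image file as a base64 data: URI for the image_url content part.
function toDataUri(filePath, mimeType = 'image/png') {
  const base64 = readFileSync(filePath).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

// Usage: { type: 'image_url', image_url: { url: toDataUri('./photo.png') } }
```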
## Configuration

```js
const ai = await LocalLLM.create({
  // Model source (required)
  model: 'user/repo/file.gguf',           // HuggingFace shorthand
  // model: 'https://huggingface.co/...', // Full URL
  // model: './models/my-model.gguf',     // Local file path

  // Vision projector (optional — required for vision models)
  // projector: 'user/repo/mmproj-file.gguf',

  // Compute mode (default: 'auto')
  compute: 'auto',    // Auto-detect GPU
  // compute: 'gpu',    // Force GPU (Metal/CUDA)
  // compute: 'cpu',    // Force CPU only
  // compute: 'hybrid', // Split between CPU and GPU

  // Context options
  contextSize: 2048, // Context window size
  batchSize: 512,    // Batch size for prompt processing
  threads: 4,        // CPU thread count

  // Download options
  cacheDir: '~/.local-llm/models', // Model cache directory
  onProgress: (pct) => {           // Download progress callback
    console.log(`${pct.toFixed(1)}%`);
  },
});
```

## Generation Options
```js
const response = await ai.chat.completions.create({
  messages: [...],
  max_tokens: 256,        // Maximum tokens to generate
  temperature: 0.7,       // Randomness (0.0 = deterministic, 2.0 = very random)
  top_p: 0.9,             // Nucleus sampling
  top_k: 40,              // Top-k sampling
  frequency_penalty: 1.1, // Repetition penalty
  seed: 42,               // Reproducible output
  stream: false,          // Set to true for streaming
});
```

## Peer Dependencies
The Vercel AI SDK integration is optional. Install `ai` if you want to use `generateText()` / `streamText()`:

```bash
npm install ai
```

## API Reference
See the full API documentation.
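For orientation while reading the API docs: responses throughout this README follow the OpenAI chat-completion shape. A minimal sketch of the fields the examples access (illustrative only, not the package's full typings):

```javascript
// Illustrative response object showing the fields used throughout this README.
const response = {
  choices: [
    {
      message: { role: 'assistant', content: 'Paris.' },
      finish_reason: 'stop',
    },
  ],
};

console.log(response.choices[0].message.content); // Paris.
```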
## Advanced Usage
For lower-level control, you can use the engine classes directly:
```js
import { Model, InferenceContext } from 'local-llm';

const model = new Model('./model.gguf', { compute: 'gpu' });
const ctx = model.createContext({ contextSize: 4096 });

// Tokenize
const tokens = model.tokenize('Hello world');
const text = model.detokenize(tokens);

// Chat template
const prompt = model.applyChatTemplate([
  { role: 'user', content: 'Hello' },
], true);

// Generate
const result = await ctx.generate(prompt, { maxTokens: 128 });

// Stream
for await (const token of ctx.stream(prompt, { maxTokens: 128 })) {
  process.stdout.write(token);
}

ctx.dispose();
model.dispose();
```

## Model Manager
Download and cache models programmatically:
```js
import { ModelManager } from 'local-llm';

const manager = new ModelManager();

// Download with progress
const path = await manager.downloadModel(
  'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
  {
    onProgress: (downloaded, total, pct) => {
      console.log(`${pct.toFixed(1)}%`);
    },
  },
);

// List cached models
const models = await manager.listModels();

// Remove a cached model
await manager.removeModel('https://huggingface.co/...');
```

## License
MIT - See LICENSE for details.
Made by Hilum Labs.
