local-llm

Run LLMs locally in Node.js with an OpenAI-compatible API. No cloud, no API keys, no data leaves your machine.

npm install local-llm

import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({
  model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
});

const response = await ai.chat.completions.create({
  messages: [{ role: 'user', content: 'What is the capital of France?' }],
  max_tokens: 128,
});

console.log(response.choices[0].message.content);

Features

  • OpenAI-compatible API - Same chat.completions.create() interface you already know
  • Vision / Multimodal - Send images alongside text using the GPT-4V content format
  • Vercel AI SDK - Drop-in provider for generateText() and streamText()
  • Auto model download - Pass a HuggingFace URL or shorthand, models are downloaded and cached automatically
  • GPU auto-detection - Detects Metal (macOS) and CUDA (Linux/Windows) automatically
  • Streaming - Full streaming support via async iterators
  • TypeScript-first - Complete type definitions out of the box
  • No dependencies - Native C++ bindings to llama.cpp, no Python, no external servers

Platform Support

| Platform | GPU | Status |
|---|---|---|
| macOS Apple Silicon (M1-M4) | Metal | Supported |
| macOS Intel | Metal | Supported |
| Linux x64 | CPU | Supported |
| Windows x64 | CPU | Supported |
| Linux ARM64 | CPU | Coming soon |
| Linux/Windows CUDA | NVIDIA GPU | Coming soon |

Quick Start

1. Install

npm install local-llm

2. Choose a Model

Any GGUF model from HuggingFace works. Some recommendations:

| Model | Size | Good for |
|---|---|---|
| TinyLlama 1.1B Q4_K_M | ~636 MB | Testing, development |
| Llama 3.1 8B Q4_K_M | ~4.9 GB | General use |
| Mistral 7B Q4_K_M | ~4.4 GB | General use |
| Phi-3 Mini Q4_K_M | ~2.2 GB | Lightweight, fast |

3. Use

import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({
  model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
});

// Chat completion (same API as OpenAI)
const response = await ai.chat.completions.create({
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain gravity in one sentence.' },
  ],
  max_tokens: 128,
  temperature: 0.7,
});

console.log(response.choices[0].message.content);

// Streaming
const stream = await ai.chat.completions.create({
  messages: [{ role: 'user', content: 'Write a haiku about coding.' }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}

// Clean up
ai.dispose();

Vercel AI SDK

import { generateText } from 'ai';
import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({ model: 'user/repo/model.gguf' });
const { text } = await generateText({ model: ai.languageModel(), prompt: 'Hello!' });
console.log(text);
ai.dispose();
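
streamText() works the same way. The sketch below assumes the streaming interface of recent ai releases, where streamText() returns a result whose textStream is an async iterable; check the version you have installed.

import { streamText } from 'ai';
import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({ model: 'user/repo/model.gguf' });

// Iterate the text stream as chunks arrive
const result = streamText({ model: ai.languageModel(), prompt: 'Write a haiku about coding.' });
for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}

ai.dispose();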

Preloading

Pre-download a model at app startup so users don't wait:

// App startup — download runs in the background, app doesn't block
LocalLLM.preload('user/repo/model.gguf');

// Later, when AI is needed — cached, create() is fast
const ai = await LocalLLM.create({ model: 'user/repo/model.gguf' });
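
Since the download runs in the background, it is worth attaching an error handler so a failed download does not surface as an unhandled rejection. This sketch assumes preload() returns a promise, as the non-blocking usage above suggests:

LocalLLM.preload('user/repo/model.gguf').catch((err) => {
  // Log the failure; the app keeps running and can retry when the model is actually needed
  console.error('Model preload failed:', err);
});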

Vision / Multimodal

Send images alongside text using the same OpenAI GPT-4V content format. Requires a vision model and its projector file:

import { LocalLLM } from 'local-llm';

const ai = await LocalLLM.create({
  model: 'Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf',
  projector: 'Qwen/Qwen3-VL-8B-Instruct-GGUF/mmproj-Qwen3VL-8B-Instruct-F16.gguf',
});

const response = await ai.chat.completions.create({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this image?' },
      { type: 'image_url', image_url: { url: 'data:image/png;base64,...' } },
    ],
  }],
  max_tokens: 256,
});

console.log(response.choices[0].message.content);
ai.dispose();

Images can be provided as data: URIs (base64), local file paths, or HTTP URLs. Streaming works too — just add stream: true.
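
For example, a local image file plus streaming looks like this. The sketch reuses the vision-capable ai instance created above (before dispose()), and ./photo.png is a placeholder path:

const stream = await ai.chat.completions.create({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this photo.' },
      { type: 'image_url', image_url: { url: './photo.png' } }, // local file path
    ],
  }],
  max_tokens: 256,
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}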

Configuration

const ai = await LocalLLM.create({
  // Model source (required)
  model: 'user/repo/file.gguf',       // HuggingFace shorthand
  // model: 'https://huggingface.co/...', // Full URL
  // model: './models/my-model.gguf',     // Local file path

  // Vision projector (optional — required for vision models)
  // projector: 'user/repo/mmproj-file.gguf',

  // Compute mode (default: 'auto')
  compute: 'auto',    // Auto-detect GPU
  // compute: 'gpu',  // Force GPU (Metal/CUDA)
  // compute: 'cpu',  // Force CPU only
  // compute: 'hybrid', // Split between CPU and GPU

  // Context options
  contextSize: 2048,   // Context window size
  batchSize: 512,      // Batch size for prompt processing
  threads: 4,          // CPU thread count

  // Download options
  cacheDir: '~/.local-llm/models',  // Model cache directory
  onProgress: (pct) => {            // Download progress callback
    console.log(`${pct.toFixed(1)}%`);
  },
});

Generation Options

const response = await ai.chat.completions.create({
  messages: [...],
  max_tokens: 256,       // Maximum tokens to generate
  temperature: 0.7,      // Randomness (0.0 = deterministic, 2.0 = very random)
  top_p: 0.9,            // Nucleus sampling
  top_k: 40,             // Top-k sampling
  frequency_penalty: 1.1, // Repetition penalty
  seed: 42,              // Reproducible output
  stream: false,         // Set to true for streaming
});
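
For example, to reproduce a sampled response across runs, fix the seed while keeping sampling enabled; a small sketch using only the options listed above:

// Same seed, prompt, and model should yield the same sampled output on each run
const reproducible = await ai.chat.completions.create({
  messages: [{ role: 'user', content: 'Suggest a name for a cat.' }],
  max_tokens: 32,
  temperature: 0.7,
  seed: 42,
});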

Peer Dependencies

The Vercel AI SDK integration is optional. Install ai if you want to use generateText() / streamText():

npm install ai

API Reference

See the full API documentation.

Advanced Usage

For lower-level control, you can use the engine classes directly:

import { Model, InferenceContext } from 'local-llm';

const model = new Model('./model.gguf', { compute: 'gpu' });
const ctx = model.createContext({ contextSize: 4096 });

// Tokenize
const tokens = model.tokenize('Hello world');
const text = model.detokenize(tokens);

// Chat template
const prompt = model.applyChatTemplate([
  { role: 'user', content: 'Hello' },
], true);

// Generate
const result = await ctx.generate(prompt, { maxTokens: 128 });

// Stream
for await (const token of ctx.stream(prompt, { maxTokens: 128 })) {
  process.stdout.write(token);
}

ctx.dispose();
model.dispose();
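
One practical use of the low-level API is checking that a prompt fits the context window before generating. This sketch continues the example above (before the dispose() calls) and uses only tokenize(), the 4096-token contextSize, and ctx.generate():

// Budget check: prompt tokens plus planned completion tokens must fit in the context window
const promptTokens = model.tokenize(prompt);
const maxNewTokens = 128;

if (promptTokens.length + maxNewTokens > 4096) {
  throw new Error(`Prompt uses ${promptTokens.length} tokens; too long for a 4096-token context`);
}

const reply = await ctx.generate(prompt, { maxTokens: maxNewTokens });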

Model Manager

Download and cache models programmatically:

import { ModelManager } from 'local-llm';

const manager = new ModelManager();

// Download with progress
const path = await manager.downloadModel(
  'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
  {
    onProgress: (downloaded, total, pct) => {
      console.log(`${pct.toFixed(1)}%`);
    },
  },
);

// List cached models
const models = await manager.listModels();

// Remove a cached model
await manager.removeModel('https://huggingface.co/...');

License

MIT - See LICENSE for details.

Made by Hilum Labs.