@localmode/wllama

v2.0.0

Published

a month ago

wllama provider for @localmode - GGUF model inference via llama.cpp WASM with metadata inspection and browser compatibility checking

Downloads

0High
0Medium
0Low

localmode

wllama gguf llama.cpp wasm llm local-first privacy ai language-model text-generation offline

@localmode/wllama

wllama provider for LocalMode -- run any GGUF model in the browser via llama.cpp compiled to WebAssembly.

Features

Run any of the 135,000+ GGUF models from HuggingFace
Works in all modern browsers (no WebGPU required, only WASM)
GGUF metadata inspection via HTTP Range requests (~4KB download)
Browser compatibility checking before downloading multi-GB files
Auto-detects CORS isolation for multi-threaded inference
Streaming text generation
Full AbortSignal cancellation support

Installation

pnpm install @localmode/wllama @localmode/core

Quick Start

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

const { text } = await generateText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});

console.log(text);

Streaming

import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

const result = await streamText({ model, prompt: 'Write a haiku.' });

for await (const chunk of result.stream) {
  process.stdout.write(chunk.text);
}

GGUF Metadata Inspection

Inspect any GGUF model before downloading:

import { parseGGUFMetadata } from '@localmode/wllama';

const metadata = await parseGGUFMetadata(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

console.log(metadata.architecture);   // 'llama'
console.log(metadata.quantization);   // 'Q4_K_M'
console.log(metadata.contextLength);  // 131072
console.log(metadata.parameterCount); // ~1.24B

Browser Compatibility Check

Check if a model can run on the current device:

import { checkGGUFBrowserCompatFromURL } from '@localmode/wllama';

const result = await checkGGUFBrowserCompatFromURL(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

if (result.canRun) {
  console.log('Ready to run!', result.estimatedSpeed);
} else {
  console.log('Warnings:', result.warnings);
  console.log('Suggestions:', result.recommendations);
}

CORS Multi-Threading

For 2-4x faster inference, add these HTTP headers to your server:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Without these headers, wllama falls back to single-threaded mode automatically.

Documentation

Full documentation at localmode.dev/docs/wllama.

Acknowledgments

This package is built on wllama by ngxson and llama.cpp by Georgi Gerganov — GGUF model inference via llama.cpp compiled to WebAssembly. GGUF metadata parsing uses @huggingface/gguf by HuggingFace.

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@localmode/wllama

Features

Installation

Quick Start

Streaming

GGUF Metadata Inspection

Browser Compatibility Check

CORS Multi-Threading

Documentation

Acknowledgments

License