localm-web
v0.5.0
Published
Browser-only TypeScript SDK for running LLMs and SLMs locally with WebGPU. Ultralytics-style DX, Vite-first.
⚠️ Status: pre-alpha. Public API is being designed and is expected to change. Code in this repo is intentionally minimal until v0.1.
Browser-only TypeScript SDK for running Language Models (LLMs and SLMs) locally in the user's browser, with a developer experience modeled directly on ort-vision-sdk-web.
```typescript
import { Chat } from "localm-web";

const chat = await Chat.create("phi-3.5-mini-int4");
for await (const token of chat.stream("Explain ONNX in one sentence.")) {
  // Browser-only SDK: append to the page, since `process.stdout` does not exist here.
  document.body.append(token.text);
}
```

That's it. No server, no API key, no round trip: the model runs on the user's GPU via WebGPU.
Why does this exist?
The Python ecosystem for local Language Models is saturated: llama-cpp-python, Ollama, vLLM, transformers, text-generation-inference, and dozens more. Picking up another Python wrapper adds nothing.
The browser side is different. The closest equivalents are:
| Project | What it is | Why it's not enough |
| ----------------------------------------------------------------- | ------------------------------ | ----------------------------------------------------------------- |
| WebLLM (MLC) | Best-in-class WebGPU runtime | Engine-centric, low-level API, no opinionated tasks |
| transformers.js | HF pipeline API in the browser | Slower (no WebGPU-first compilation in many paths), broad surface |
| onnxruntime-genai-web | Microsoft's web LM build | Preview, unstable, no high-level tasks |
There is no opinionated, task-oriented, strictly typed, Ultralytics-style SDK that just works in a Vite app. localm-web fills that gap.
The mental model is straightforward: if ort-vision-sdk-web is what Detector / Classifier / Segmenter look like for vision, then localm-web is what Chat / Completion / Embeddings / Reranker look like for language.
Design principles
- Browser-only. No Node target, no server runtime. If your code runs on a backend, this SDK is the wrong tool; use `transformers`, vLLM, Ollama, or any of the dozens of mature Python options.
- Maximum performance. WebGPU-first via WebLLM (MLC). Web Worker execution by default so the UI thread stays free. WASM-SIMD fallback for non-WebGPU browsers from v0.5.
- Ultralytics-style DX. `await Class.create(model)` then `predict()` / `send()` / `embed()` / `score()`. Mirrors `ort-vision-sdk-web` so a developer using both feels continuity.
- ESM only. No CJS, no UMD, no IIFE. The browser is ESM-native, modern bundlers expect ESM, and shipping multiple formats just bloats the package.
- Vite-first. The build is optimized for Vite 5+ consumers. Other bundlers will still work, but Vite is the supported smooth path.
- Not tied to Vercel. No `vercel.json`, no Next-specific helpers, no Edge runtime exports. Examples deploy to any static host (Cloudflare Pages, Netlify, GitHub Pages, S3, self-hosted).
- Wrap, don't fork. WebLLM stays a peer dependency. We add the API layer, the task abstractions, and the missing pieces (embeddings, reranker, structured output, fallback runtime).
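
The `await Class.create(model)` shape can be sketched as an async static factory. This is a minimal illustration under stated assumptions, not the SDK's implementation: `TinyTask`, `loadWeights`, and the placeholder `predict()` body are hypothetical names invented for the example.

```typescript
// Hypothetical sketch of the `await Class.create(model)` pattern.
// `TinyTask` and `loadWeights` are illustrative names, not localm-web APIs.

// Stand-in for an async model load (download, compile, warm-up).
async function loadWeights(modelId: string): Promise<Uint8Array> {
  return new Uint8Array(modelId.length);
}

class TinyTask {
  readonly modelId: string;
  private readonly weights: Uint8Array;

  // Private constructor: instances only exist once loading has finished.
  private constructor(modelId: string, weights: Uint8Array) {
    this.modelId = modelId;
    this.weights = weights;
  }

  // Async static factory: constructors cannot await, so the load happens here.
  static async create(modelId: string): Promise<TinyTask> {
    const weights = await loadWeights(modelId);
    return new TinyTask(modelId, weights);
  }

  // Placeholder "inference" that only reports what was loaded.
  async predict(input: string): Promise<string> {
    return `${this.modelId} (${this.weights.length} bytes): ${input}`;
  }
}
```

Because the constructor is private, callers cannot obtain a half-initialized instance; every object they hold has already finished loading.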
Scope
In scope
- Browser-only execution (WebGPU primary, WASM-SIMD fallback from v0.5).
- High-level tasks: `Chat`, `Completion`, `Embeddings`, `Reranker`.
- Streaming token output via async generators with `AbortSignal` support.
- Tokenization, chat templates, sampling, KV cache (delegated to the underlying runtime).
- Model caching (Cache API + OPFS) with resume on interrupted downloads.
- Curated registry of supported SLMs: Phi-3.5-mini, Llama-3.2-1B/3B, Qwen2.5-0.5B/1.5B/3B, Gemma-2-2B, SmolLM2.
- Structured output: JSON Schema → constrained decoding.
- Web Worker execution out of the box.
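
Streaming through an async generator that honors an `AbortSignal`, as listed above, might look roughly like this. The sketch assumes nothing about the SDK internals: `fakeTokens` and `collect` are hypothetical helpers, and a fixed word list stands in for a real decoding loop.

```typescript
// Hypothetical sketch: an async generator that stops when an AbortSignal fires.
// `fakeTokens` stands in for a real decoding loop; it is not a localm-web API.

interface Token {
  text: string;
}

async function* fakeTokens(
  words: string[],
  signal?: AbortSignal,
): AsyncGenerator<Token> {
  for (const word of words) {
    // Check before each step so cancellation takes effect between tokens.
    if (signal?.aborted) return;
    // Yield to the event loop, as a real decode step would.
    await Promise.resolve();
    yield { text: word };
  }
}

async function collect(limit: number): Promise<string[]> {
  const controller = new AbortController();
  const seen: string[] = [];
  for await (const token of fakeTokens(["a", "b", "c", "d"], controller.signal)) {
    seen.push(token.text);
    // Cancel once enough tokens have arrived (e.g. the user pressed "Stop").
    if (seen.length === limit) controller.abort();
  }
  return seen;
}
```

Returning quietly on abort is one design choice; an implementation could instead throw an `AbortError`, as `fetch` does.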
Out of scope
- Server-side execution (Node, Bun, Deno).
- Training, fine-tuning, LoRA loading.
- Multi-modal models at v1.0 (a future composite SDK may combine `ort-vision-sdk-web` + `localm-web`).
- A llama.cpp / GGUF backend. Community-maintained options exist; that's not our differentiation.
- A pre-built chat UI. This is an SDK, not a chatbot kit.
- Bundling model weights into the package — models are downloaded at runtime.
- Non-ESM module formats.
Architecture
```
localm-web/
├── src/
│   ├── core/         # backend abstraction + WebLLM / ORT-Web engines
│   ├── tasks/        # Chat, Completion, Embeddings, Reranker
│   ├── io/           # tokenizer + chat-template loaders
│   ├── sampling/     # greedy, top-k, top-p, temperature
│   ├── cache/        # KV cache + model file cache (Cache API / OPFS)
│   ├── streaming/    # async iterator + AbortSignal plumbing
│   ├── structured/   # JSON Schema → grammar / logit-mask
│   ├── presets/      # curated model registry
│   ├── worker/       # Web Worker entrypoint for inference
│   ├── results.ts    # typed result classes
│   ├── types.ts      # primitive types (Message, ChatRequest, etc.)
│   └── index.ts      # public API
├── test/
├── examples/
├── docs/
└── ...
```

A full layer-by-layer breakdown lives in CLAUDE.md.
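
To give a feel for what the `sampling/` layer covers, here is a minimal sketch of temperature plus top-k sampling over raw logits. It is not the SDK's sampler (sampling is delegated to the underlying runtime); `sampleTopK` and its injected `rng` parameter are hypothetical, chosen so the behavior is reproducible.

```typescript
// Hypothetical sketch of temperature + top-k sampling over raw logits.
// Not the SDK's sampler; `rng` is injected so results are reproducible.

function sampleTopK(
  logits: number[],
  k: number,
  temperature: number,
  rng: () => number = Math.random,
): number {
  if (logits.length === 0) throw new Error("empty logits");

  // Greedy decoding: temperature 0 means "pick the arg-max".
  if (temperature <= 0) {
    return logits.indexOf(Math.max(...logits));
  }

  // Keep the k highest-scoring token ids.
  const top = logits
    .map((logit, id) => ({ id, logit }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, Math.max(1, k));

  // Softmax over the survivors, scaled by temperature.
  // Subtracting the max logit keeps Math.exp from overflowing.
  const maxLogit = top[0].logit;
  const weights = top.map((t) => Math.exp((t.logit - maxLogit) / temperature));
  const total = weights.reduce((sum, w) => sum + w, 0);

  // Walk the cumulative distribution until it passes the random draw.
  let threshold = rng() * total;
  for (let i = 0; i < top.length; i++) {
    threshold -= weights[i];
    if (threshold <= 0) return top[i].id;
  }
  return top[top.length - 1].id;
}
```

With `k = 2` and logits `[1, 5, 3]`, token 0 can never be drawn: only the two highest-scoring ids survive the cut.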
Tech stack
- Language: TypeScript 5.4+, strict mode, ES2022 target.
- Module format: ESM only.
- Build: Vite 5+ in library mode, `tsc` for declarations.
- Primary runtime: WebLLM (MLC), Apache 2.0, WebGPU-first.
- Fallback runtime (v0.5+): `onnxruntime-web` + `@huggingface/transformers`.
- Tokenizer: `@huggingface/transformers` tokenizer module.
- Chat templates: `@huggingface/jinja`.
- Storage: Cache API + OPFS (Origin Private File System).
- Concurrency: Web Worker via `Comlink` (or native `MessagePort`).
- Tests: Vitest + Playwright (real browser for WebGPU).
- Lint/format: ESLint + Prettier.
Public API (target shape)
```typescript
import { Chat, Completion, Embeddings, Reranker } from "localm-web";

// Chat: multi-turn conversation with chat template applied
const chat = await Chat.create("phi-3.5-mini-int4");
const reply = await chat.send("Explain ONNX in one sentence.");
console.log(reply.text);

// Streaming
const controller = new AbortController();
for await (const token of chat.stream("Explain ONNX.", { signal: controller.signal })) {
  // Browser-only SDK: append to the page, since `process.stdout` does not exist here.
  document.body.append(token.text);
}

// Completion: raw text-in text-out (no chat template)
const comp = await Completion.create("qwen2.5-0.5b-int4");
const out = await comp.predict("Once upon a time", { maxTokens: 100 });

// Embeddings
const emb = await Embeddings.create("bge-small-en-v1.5");
const vectors = await emb.embed(["hello world", "another sentence"]);

// Reranker
const rerank = await Reranker.create("bge-reranker-base");
const scores = await rerank.score("query", ["doc1", "doc2", "doc3"]);

// Structured output: free-form JSON
const jsonReply = await chat.send("List three pros and cons of WebGPU as JSON.", { json: true });
const data = jsonReply.json<{ pros: string[]; cons: string[] }>();

// Structured output: JSON Schema constrained decoding (xgrammar via WebLLM)
const userReply = await chat.send("Extract user info from: 'Ada, 36, …'", {
  jsonSchema: {
    type: "object",
    required: ["name", "age"],
    properties: {
      name: { type: "string" },
      age: { type: "integer", minimum: 0 },
    },
  },
});
const user = userReply.json<{ name: string; age: number }>();
```

The shape mirrors `ort-vision-sdk-web`: `await Class.create(model)` then `predict()` / `send()` / `embed()` / `score()`.
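
Once `embed()` has produced vectors, a common next step is comparing them. The helper below is a sketch, not part of the SDK: `cosineSimilarity` is a hypothetical name, and it assumes each embedding is a plain numeric array (or typed array) of equal length, which this README does not specify.

```typescript
// Hypothetical helper: cosine similarity between two embedding vectors.
// Assumes each embedding is a numeric array (or typed array) of equal length.

function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; report 0 instead of dividing by zero.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

Values near 1 indicate vectors pointing the same way, values near 0 indicate unrelated directions.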
Versioning roadmap
| Version | Scope |
| -------- | -------------------------------------------------------------------------------------------- |
| v0.1 | Chat via WebLLM. Phi-3.5-mini, Llama-3.2-1B, Qwen2.5-1.5B. Streaming with AbortSignal. |
| v0.2 | Completion task. Model caching (Cache API + OPFS). Web Worker by default. Progress events. |
| v0.3 | Embeddings and Reranker tasks. BGE family via transformers.js. |
| v0.4 | Structured output (JSON Schema → grammar / logit masking). |
| v0.5 | ORT-Web fallback for browsers without WebGPU. Auto-detection and graceful degradation. |
| v0.6 | Function calling helper (tool use with schema-validated arguments). |
| v1.0 | Documentation site, runnable demos, stable API contract. |
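
The "logit masking" named in the v0.4 row can be illustrated in a few lines. This is a toy sketch, not the xgrammar or WebLLM implementation: `maskLogits` and `argMax` are hypothetical helpers, and the set of allowed token ids is assumed to come from a grammar engine that is not shown.

```typescript
// Hypothetical sketch of logit masking, the last step of constrained decoding.
// A grammar engine (not shown) decides which token ids are legal next;
// everything else is pushed to -Infinity so it can never be selected.

function maskLogits(logits: number[], allowed: ReadonlySet<number>): number[] {
  return logits.map((logit, id) => (allowed.has(id) ? logit : -Infinity));
}

// Index of the largest value (first one wins on ties).
function argMax(values: number[]): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}
```

For a toy vocabulary `["{", "hello", "}"]` with logits `[0.1, 2.0, 0.3]`, unconstrained arg-max picks `"hello"`; if the grammar only allows ids 0 and 2, the masked arg-max picks `"}"` instead.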
Browser support
- WebGPU: Chrome 113+, Edge 113+, recent Firefox Nightly with `dom.webgpu.enabled`, Safari 18+ on macOS Sonoma+ / iOS 18+.
- Without WebGPU: from v0.5, a WASM-SIMD fallback path will run smaller models acceptably. Below v0.5, a clear runtime error is raised when WebGPU is missing.
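
Detecting WebGPU and falling back could be sketched as below. This is an assumption about how such a check might be written, not the SDK's detection code: `pickBackend`, `NavigatorLike`, and the `"webgpu"` / `"wasm"` labels are hypothetical. The only real API relied on is `navigator.gpu.requestAdapter()`, which resolves to `null` when no adapter is available.

```typescript
// Hypothetical sketch of backend auto-detection. `pickBackend` is an
// illustrative name; the minimal interfaces below avoid needing WebGPU typings.

interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend = "webgpu" | "wasm";

async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  // `navigator.gpu` is missing entirely in browsers without WebGPU.
  if (!nav.gpu) return "wasm";
  try {
    // The API can exist while no usable adapter does (blocklisted GPU, etc.).
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any failure while probing as "no WebGPU".
    return "wasm";
  }
}
```

In a browser this would be called as `await pickBackend(navigator as NavigatorLike)`; taking a navigator-like argument keeps the function testable outside the browser.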
Installation
```
npm install localm-web @mlc-ai/web-llm
```

`@mlc-ai/web-llm` is a peer dependency: the consumer pins the version, which keeps the SDK lightweight and avoids version conflicts.
For a step-by-step walkthrough covering install, model selection, downloading weights, running the example app and troubleshooting, see docs/getting-started.md.
Vite usage
The package is designed to drop into a Vite app with no extra config. The Web Worker is bundled via Vite's native worker support; just import the SDK and use it.
A runnable example lives under examples/vite-chat/ — cd into it, npm install, npm run dev, open the browser, pick a model, send a prompt. The full guide in docs/getting-started.md walks through it.
Why not server-side?
Three reasons:
- Mature alternatives exist. Python and TS already have excellent server-side LM tooling (Ollama, vLLM, llama-cpp-python, transformers, llama.cpp Node bindings). Adding another wrapper is noise.
- The browser is the underserved surface. Running models on the user's device removes the server cost, keeps data local, and unlocks offline use cases — but the DX is currently rough.
- Different concerns. Server inference cares about throughput, batching, multi-tenant scheduling. Browser inference cares about cold-start time, model caching, UI thread isolation, WebGPU compatibility. Conflating them produces a bad SDK on both sides.
Security
localm-web is a browser SDK — its dependencies execute in your users' browsers. Two layers, treated differently:
| Layer | What it is | Vuln policy |
| ------------------- | --------------------------------------- | --------------------------------------------------------------------------------------- |
| Runtime (peers) | @mlc-ai/web-llm, future runtime peers | Zero known CVEs. Releases are blocked if npm audit --omit=dev reports anything. |
| Dev tooling | Vite, Vitest, ESLint, esbuild, etc. | Fixed promptly via dependency bumps or overrides. Never reaches the published bundle. |
Reporting vulnerabilities
If you find a vulnerability in localm-web itself (not in a transitive dep), open a private security advisory at https://github.com/mauriciobenjamin700/localm-web/security/advisories/new. Please don't open public issues for unpatched runtime vulns.
What we do on every release
- `npm ci` (locked install, so there is no drift between dev machine and CI).
- `npm audit` reviewed manually; nothing handwaved.
- ESM-only build, no `eval` / `Function()` / dynamic remote code.
- Signed publish via `npm publish --provenance` (provenance attestation visible on the npm package page).
What you should do as a consumer
- Pin the SDK version (`localm-web@x.y.z`, not `^x.y.z`) until you've validated a release.
- Self-host model weights or use Subresource Integrity (SRI) when the runtime fetches them, since model URLs are external.
- Models are cached locally (Cache API + OPFS) — surface this in your app's privacy policy.
- Run inference inside a Web Worker (the SDK does this by default from v0.2). Don't bypass it to "save a thread" — it isolates faulty model code from your UI.
The full maintainer policy lives in CLAUDE.md → Security & vulnerabilities.
Contributing
Pre-alpha. Issues and design discussion welcome. PRs deferred until the v0.1 surface stabilizes.
License
MIT — see LICENSE.
Related projects
- `ort-vision-sdk`: sibling SDK for computer vision (classification, detection, segmentation). Same DX patterns, same author.
