@tryhamster/gerbil
v1.6.1
On-device LLM inference for the browser and Node.js — text, vision, speech (TTS/STT), and embeddings on a native WebGPU engine. Offline, private, no API keys. React hooks, MCP, and Vercel AI SDK included.
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";
const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });
const { text } = await engine.generate("Explain recursion in one sentence");
📚 Full docs → gerbilsdk.com/docs · live playground & demos at gerbilsdk.com
Why Gerbil?
- One native engine: pure WebGPU/WGSL compute shaders, nothing extra to ship.
- ~90 KB gzipped: the entire multimodal engine. No heavyweight ML runtime; model weights stream from the HuggingFace Hub at run time.
- Multimodal, all native: text, vision (image→text), embeddings, and speech run on the same engine, loading safetensors directly from the HuggingFace Hub.
- Browser & Node: Chrome 113+, Safari 26+ (iOS 26+), Firefox 141+, and Node via Dawn (the webgpu npm package), anywhere there's a real GPU.
- Local & private: no API keys, nothing leaves the device.
- React-first: useEngine owns load / unload / hot-swap and shares one engine across components (reference-counted), with dtype: "auto" picking int4 on mobile.
- Framework ready: Vercel AI SDK v5, Next.js, Express, LangChain adapters.
- Skills & tools: built-in + custom skills with Zod validation; agentic tool calling.
Install
# Try without installing (one-off usage)
npx @tryhamster/gerbil
# Install globally
npm install -g @tryhamster/gerbil
# Or install in your project
npm install @tryhamster/gerbil
After a global install, use gerbil directly instead of npx @tryhamster/gerbil.
Native WebGPU Engine
Gerbil runs LLMs directly on the GPU using pure WebGPU/WGSL compute: no Python, no native runtime, nothing to install. The same code runs in a browser tab and in Node (via Dawn). It streams weights straight from the HuggingFace Hub and loads only the tensors a request needs, so a text-only request skips the vision tower.
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";
// dtype "auto" picks int4 on mobile, the repo's native precision on desktop.
const engine = await WebGPUEngine.create({
repo: "mlx-community/Qwen3.5-0.8B-4bit",
dtype: "auto",
});
// Generate
const { text, tokensPerSecond } = await engine.generate("Write a haiku about gerbils");
console.log(text, `(${tokensPerSecond.toFixed(1)} tok/s)`);
// Stream
for await (const token of engine.stream("Tell me a story")) {
process.stdout.write(token);
}
engine.destroy();
WebGPUEngine.create({ repo, dtype, enableVision, embedding, maxSeqLen }) returns an
engine with generate, stream, describeImage, embed, and speak. See the
native engine docs below for the model lineup.
Benchmark it
The same engine runs server-side on Node (via Dawn), so you can watch local decode rip on your own GPU — no API key, no cloud:
npx @tryhamster/gerbil bench # tok/s + first-token latency on your machine
Reports steady-state decode tok/s, time-to-first-token, and the device it ran on.
For a copy-pasteable version see examples/benchmark.ts
(and the rest of examples/).
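The two headline numbers can be measured around any token stream. The following is a minimal sketch, assuming tokens arrive as an async iterable of strings (as engine.stream does above). measureStream and fakeStream are hypothetical names, not part of Gerbil, and the bench command may measure things differently.

```typescript
// Hypothetical helper: times any async token stream.
// Not part of Gerbil; `gerbil bench` may measure differently.
interface StreamStats {
  tokens: number;
  timeToFirstTokenMs: number;
  decodeTokensPerSecond: number;
}

async function measureStream(
  stream: AsyncIterable<string>,
  now: () => number = () => Date.now(),
): Promise<StreamStats> {
  const start = now();
  let firstTokenAt: number | undefined;
  let lastTokenAt = start;
  let tokens = 0;

  for await (const _token of stream) {
    lastTokenAt = now();
    if (firstTokenAt === undefined) firstTokenAt = lastTokenAt;
    tokens++;
  }

  const first = firstTokenAt ?? lastTokenAt;
  // Steady-state decode rate excludes the first token (prefill and warm-up).
  const decodeSeconds = (lastTokenAt - first) / 1000;
  return {
    tokens,
    timeToFirstTokenMs: first - start,
    decodeTokensPerSecond: decodeSeconds > 0 ? (tokens - 1) / decodeSeconds : 0,
  };
}

// Stand-in for engine.stream(): yields a fixed list of tokens.
async function* fakeStream(tokens: string[]): AsyncGenerator<string> {
  for (const t of tokens) yield t;
}
```

With a real engine you would pass engine.stream(prompt) in place of fakeStream(...). The injectable clock is only there to make the helper easy to test.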
React Quickstart
useEngine (from @tryhamster/gerbil/hooks) owns the full engine lifecycle —
load, unload, hot-swap on config change, and reference-counted sharing so multiple
components never upload the same weights to the GPU twice.
import { useEngine } from "@tryhamster/gerbil/hooks";
function Chat() {
const { complete, completion, isLoading, isGenerating, tps } = useEngine({
model: "mlx-community/Qwen3.5-0.8B-4bit",
autoLoad: true, // dtype defaults to "auto": int4 on mobile, native on desktop
});
if (isLoading) return <div>Loading model…</div>;
return (
<div>
<button onClick={() => complete("What is 2+2?")} disabled={isGenerating}>
Generate
</button>
<p>{completion}</p>
{isGenerating && <span>{tps?.toFixed(1)} tok/s</span>}
</div>
);
}
The same hook exposes describeImage (vision), embed/similarity (embeddings), stop,
and dispose. Pass enableVision: true or embedding: true to load those modalities.
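Reference-counted sharing can be pictured as a small cache keyed by configuration. The sketch below is an illustration under assumptions: SharedPool, acquire, and release are hypothetical names, and Gerbil's internals may be organized differently.

```typescript
// Hypothetical sketch of reference-counted sharing; not Gerbil's actual code.
class SharedPool<T> {
  private entries = new Map<string, { value: T; refs: number }>();
  private load: (key: string) => T;
  private unload: (value: T) => void;

  constructor(load: (key: string) => T, unload: (value: T) => void) {
    this.load = load;
    this.unload = unload;
  }

  // The first caller for a key triggers the load; later callers reuse it.
  acquire(key: string): T {
    let entry = this.entries.get(key);
    if (!entry) {
      entry = { value: this.load(key), refs: 0 };
      this.entries.set(key, entry);
    }
    entry.refs++;
    return entry.value;
  }

  // The resource is unloaded only when the last holder releases it.
  release(key: string): void {
    const entry = this.entries.get(key);
    if (!entry) return;
    entry.refs--;
    if (entry.refs === 0) {
      this.unload(entry.value);
      this.entries.delete(key);
    }
  }
}
```

Two components that acquire the same key get the same instance, so the expensive load (here, uploading weights to the GPU) happens once.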
Gate your app behind the download. Wrap your app in <GerbilGate> to show a splash
until the engine is ready — and keep it warm across navigation (no GPU re-upload):
import { GerbilGate } from "@tryhamster/gerbil/hooks";
<GerbilGate
model="mlx-community/Qwen3.5-0.8B-4bit"
fallback={({ progress }) => <Splash percent={progress} />}
>
<App />
</GerbilGate>;
Structured Output
generateObject makes the model return a JSON object: it generates, extracts the JSON,
validates it, and retries with a corrective nudge until it's valid (or maxRetries is hit).
Validate with a predicate (o) => boolean or a minimal { required: [...] } schema; omit
schema to accept any valid JSON.
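The generate, extract, validate, retry cycle described above can be sketched in a few lines. This is an illustration only: generateObjectSketch, extractJson, and isValid are hypothetical names, the generate parameter stands in for any text generator, and the default of 3 retries is an assumption rather than Gerbil's documented value.

```typescript
// Hypothetical sketch of a generate → extract → validate → retry loop.
// `generate` stands in for any text generator; this is not Gerbil's implementation.
type Schema = { required: string[] } | ((o: unknown) => boolean);

// Take the span from the first "{" to the last "}" and try to parse it.
function extractJson(text: string): unknown {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) return undefined;
  try {
    return JSON.parse(text.slice(start, end + 1));
  } catch {
    return undefined;
  }
}

function isValid(obj: unknown, schema?: Schema): boolean {
  if (schema === undefined) return true; // no schema: any parsed object passes
  if (typeof schema === "function") return schema(obj);
  if (typeof obj !== "object" || obj === null) return false;
  return schema.required.every((key) => key in obj);
}

async function generateObjectSketch(
  generate: (prompt: string) => Promise<string>,
  prompt: string,
  options: { schema?: Schema; maxRetries?: number } = {},
): Promise<{ object: unknown; attempts: number }> {
  const maxRetries = options.maxRetries ?? 3; // assumed default
  let currentPrompt = prompt;
  for (let attempts = 1; attempts <= maxRetries + 1; attempts++) {
    const obj = extractJson(await generate(currentPrompt));
    if (obj !== undefined && isValid(obj, options.schema)) {
      return { object: obj, attempts };
    }
    // Corrective nudge: restate the request and ask for JSON only.
    currentPrompt = `${prompt}\nRespond with a single valid JSON object only.`;
  }
  throw new Error(`No valid object after ${maxRetries + 1} attempts`);
}
```

The library call itself is shown next.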
import { generateObject } from "@tryhamster/gerbil";
const { object, attempts } = await generateObject<{ name: string; age: number }>(
'Extract {name, age} from: "I am Sarah, 28"',
{ schema: { required: ["name", "age"] } },
);
// object === { name: "Sarah", age: 28 }
It's available on the engine, the Gerbil class, and the one-liner API:
import { Gerbil, WebGPUEngine } from "@tryhamster/gerbil";
const g = new Gerbil();
await g.loadModel("qwen3.5-0.8b");
const { object } = await g.generateObject("List 3 primes as {primes: number[]}", {
schema: (o) => Array.isArray((o as any).primes),
});
// Or directly on the engine:
const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });
await engine.generateObject("…", { schema: { required: ["title"] } });
In React, use useObject (from @tryhamster/gerbil/hooks):
import { useObject } from "@tryhamster/gerbil/hooks";
const { generate, object, isGenerating } = useObject<{ city: string }>();
await generate("Extract the city from: I live in Paris", {
schema: { required: ["city"] },
});
From the CLI:
gerbil object "Extract {name, age}: I am Sarah, 28" --schema person.json
# person.json: { "required": ["name", "age"] }
Embeddings
Native text embeddings via EmbeddingGemma-300M (mean-pooled Gemma3 encoder + Dense
head, 768-dim, L2-normalized). EmbeddingGemma is asymmetric — pass taskType so queries
and documents get the right prefix.
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";
const engine = await WebGPUEngine.create({
repo: "mlx-community/embeddinggemma-300m-4bit",
embedding: true,
});
const query = await engine.embed("capital of France", { taskType: "query" });
const doc = await engine.embed("Paris is the capital of France", { taskType: "document" });
// Vectors are unit-norm, so cosine similarity is a dot product.
const sim = query.reduce((s, v, i) => s + v * doc[i], 0);
📖 Full Embeddings Documentation →
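Because the vectors are unit-norm, ranking documents against a query reduces to sorting by dot product. Here is a small sketch with hand-made 2-D vectors standing in for real 768-dim embeddings; dot, normalize, and rankBySimilarity are hypothetical helpers, not Gerbil exports.

```typescript
// Hypothetical helpers; the vectors used with them are made up, not real embeddings.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  return a.reduce((sum, v, i) => sum + v * (b[i] ?? 0), 0);
}

// Scale a vector to unit length (the embedding vectors are described as L2-normalized already).
function normalize(v: number[]): number[] {
  const norm = Math.hypot(...v);
  return v.map((x) => x / norm);
}

// For unit-norm vectors the dot product equals cosine similarity.
function rankBySimilarity(
  query: number[],
  docs: { id: string; vector: number[] }[],
): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: dot(query, d.vector) }))
    .sort((a, b) => b.score - a.score);
}
```

For semantic search you would embed the query with taskType "query", embed each document with taskType "document", and pass the results to a ranking step like this one.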
Vision
Image-in → text-out via the native vision towers (Qwen3.5 ViT and Gemma 4 ViT). Load with
enableVision: true, then call describeImage.
const engine = await WebGPUEngine.create({
repo: "Qwen/Qwen3.5-0.8B",
enableVision: true,
});
// In Node, decode the image to RGB pixels (HWC, 0..255) yourself; in the browser the
// React hook's describeImage() takes a URL / data-URL directly.
const { text } = await engine.describeImage(
{ pixels, width, height },
"What's in this image?",
);
Speech
Text-to-speech: native Kani-TTS-2 (LFM2-350M codec-LM + NVIDIA NeMo NanoCodec).
engine.speak() returns 22.05 kHz mono PCM.
const engine = await WebGPUEngine.create({ repo: "nineninesix/kani-tts-2-en" });
const { pcm, sampleRate } = await engine.speak("Hello, I'm Gerbil!"); // sampleRate === 22050
Speech-to-text: native Moonshine (raw-waveform encoder/decoder, no FFT/log-mel)
via the dedicated MoonshineSTT class.
import { MoonshineSTT } from "@tryhamster/gerbil/gpu";
const stt = await MoonshineSTT.create({ repo: "UsefulSensors/moonshine-base" });
const { text, noSpeech } = await stt.transcribe(pcm16kMono); // noSpeech flags silence
transcribe returns noSpeech (RMS VAD + min-duration + marker denylist) so you can skip
silent/empty clips; useSTT surfaces it too, with an onNoSpeech callback.
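To give a feel for the RMS and minimum-duration parts of that check, here is a rough sketch. It assumes the audio is a Float32Array of samples in [-1, 1] at 16 kHz; the function names and both thresholds are illustrative guesses, not the values Gerbil uses, and the marker denylist is not modelled.

```typescript
// Hypothetical sketch of an RMS + minimum-duration silence check.
// Thresholds are illustrative assumptions, not Gerbil's values.
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

function looksLikeNoSpeech(
  samples: Float32Array,
  sampleRate = 16000,
  minDurationSec = 0.3,
  rmsThreshold = 0.01,
): boolean {
  const durationSec = samples.length / sampleRate;
  // Too short to contain a word, or too quiet to be speech.
  return durationSec < minDurationSec || rms(samples) < rmsThreshold;
}
```

A check like this is cheap enough to run before transcription, which is why it is useful for skipping empty clips.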
📖 Full TTS Documentation → · Full STT Documentation →
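Raw PCM from engine.speak() needs a container before most players will open it. The sketch below wraps mono samples in a 16-bit WAV header. It assumes pcm is a Float32Array with values in [-1, 1], which the README does not state, so check the actual return type first. encodeWav is a hypothetical helper, not a Gerbil export.

```typescript
// Hypothetical helper: wraps mono float PCM in a 16-bit WAV container.
// Assumes samples are floats in [-1, 1]; verify what speak() really returns.
function encodeWav(pcm: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = pcm.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // format 1 = integer PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp each sample and scale it to the signed 16-bit range.
  pcm.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    view.setInt16(44 + i * 2, Math.round(clamped * 32767), true);
  });
  return new Uint8Array(buffer);
}
```

In Node the resulting bytes can be written to disk with fs.writeFileSync; in the browser they can be wrapped in a Blob of type audio/wav.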
Skills
Built-in AI skills with Zod-validated inputs:
import { commit, summarize, explain, review } from "@tryhamster/gerbil/skills";
const msg = await commit({ type: "conventional" });
const summary = await summarize({ content: doc, length: "short" });
const explanation = await explain({ content: code, level: "beginner" });
Custom Skills
import { z } from "zod";
import { defineSkill, loadSkills, useSkill } from "@tryhamster/gerbil/skills";
// Define inline
const sentiment = defineSkill({
  name: "sentiment",
  description: "Analyze text sentiment",
  input: z.object({ text: z.string() }),
  async run(input, gerbil) {
    // outputSchema is your own schema for the result, defined elsewhere
    return gerbil.json(`Sentiment of: ${input.text}`, { schema: outputSchema });
  },
});
// Or load from files
await loadSkills("./skills"); // loads *.skill.ts
const skill = useSkill("my-skill");
Tools & Agents
Gerbil supports tool calling with Qwen3 models for agentic workflows. A tool is a
plain object (AgentTool) — give it a name, description, optional parameters,
and an execute function:
import type { AgentTool } from "@tryhamster/gerbil/gpu";
const weatherTool: AgentTool = {
name: "get_weather",
description: "Get weather for a city",
parameters: { city: "string" },
execute: ({ city }) => `Weather in ${city}: 72°F, sunny`,
};
Agentic loop, on-device. engine.generateWithTools (and the useAgent React hook)
run the whole loop (generate → call a tool → feed the result back → repeat) and return a
step trace for UIs:
import { useAgent } from "@tryhamster/gerbil/hooks";
const { run, steps, answer, isRunning } = useAgent({
model: "mlx-community/Qwen3.5-0.8B-4bit",
tools: [
{
name: "get_weather",
description: "Get the weather for a city",
parameters: { city: "string" },
execute: ({ city }) => `Weather in ${city}: 72°F, sunny`,
},
],
});
await run("What's the weather in Paris?"); // steps[]: tool_call → tool_result → answer
Built-in tools:
- gerbil_docs: Search Gerbil documentation
- run_skill: Execute any Gerbil skill
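The loop that generateWithTools and useAgent run (generate, call a tool, feed the result back, repeat) has roughly the following shape. This is a sketch under assumptions: runAgent, the ModelTurn and Step types, and the maxSteps limit are hypothetical, and the model parameter is a stub standing in for the LLM.

```typescript
// Hypothetical sketch of an agentic loop; `model` stands in for the LLM.
// The step and tool-call shapes are assumptions, not Gerbil's actual types.
interface Tool {
  name: string;
  description: string;
  execute: (args: Record<string, string>) => string | Promise<string>;
}

type ModelTurn =
  | { type: "tool_call"; name: string; args: Record<string, string> }
  | { type: "answer"; text: string };

type Step = ModelTurn | { type: "tool_result"; name: string; result: string };

async function runAgent(
  model: (history: Step[], prompt: string) => Promise<ModelTurn>,
  tools: Tool[],
  prompt: string,
  maxSteps = 5,
): Promise<{ answer: string; steps: Step[] }> {
  const steps: Step[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const turn = await model(steps, prompt);
    steps.push(turn);
    if (turn.type === "answer") return { answer: turn.text, steps };

    // Run the requested tool and feed its result back as the next step.
    const tool = tools.find((t) => t.name === turn.name);
    const result = tool ? await tool.execute(turn.args) : `Unknown tool: ${turn.name}`;
    steps.push({ type: "tool_result", name: turn.name, result });
  }
  throw new Error(`No answer after ${maxSteps} steps`);
}
```

The step cap matters in practice: without it, a model that keeps requesting tools would loop forever.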
In the REPL, Agent mode is on by default and enables tool calling:
npx @tryhamster/gerbil repl
# Press ⌘A to toggle agent mode on/off
# Ask: "how do I use gerbil with next.js?"
# Gerbil will call the docs tool and synthesize an answer
Autocomplete & Rewrite
Inline autocomplete: engine.autocomplete(prefix) and the debounced useAutocomplete
hook return a brief single-line continuation, with low-latency defaults and cleanup built in:
import { useAutocomplete } from "@tryhamster/gerbil/hooks";
const { suggestion, onInput, accept, dismiss } = useAutocomplete({
model: "mlx-community/Qwen3.5-0.8B-4bit",
});
// <input onChange={(e) => onInput(e.target.value)} /> — render `suggestion` as ghost text;
// Tab → accept(), Esc → dismiss()
Tone rewrite: engine.rewrite(text, { tone }) (and useEngine().rewrite) regenerates
text in a target tone ("professional", "friendly", "concise", "playful",
"pirate") or with free-form instructions.
📖 Full Autocomplete Documentation →
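Debouncing is what keeps autocomplete from firing a generation on every keystroke. A generic sketch of the technique follows; debounce here is a hypothetical standalone helper, and the hook's own timing and cancellation behaviour may differ.

```typescript
// Hypothetical debounce helper; Gerbil's hook may debounce differently.
function debounce<A extends unknown[]>(fn: (...args: A) => void, delayMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Each call resets the timer, so only the last call in a burst runs.
  const debounced = (...args: A): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => {
      timer = undefined;
      fn(...args);
    }, delayMs);
  };

  // Drop any pending call, e.g. when the input loses focus.
  debounced.cancel = (): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = undefined;
  };

  return debounced;
}
```

Typing "h", "he", "hel" in quick succession would trigger a single request for "hel" once the user pauses.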
CLI
# Without installing (use npx)
npx @tryhamster/gerbil # Interactive REPL (default)
npx @tryhamster/gerbil "Write a haiku" # Generate text
# After installing globally (npm i -g @tryhamster/gerbil)
gerbil # Interactive REPL
gerbil "Write a haiku" # Generate text
gerbil commit # Commit message from staged changes
gerbil summarize README.md # Summarize file
gerbil chat --thinking # Interactive chat
gerbil speak "Hello world" --voice af_heart # Text-to-speech
gerbil transcribe audio.wav # Speech-to-text
gerbil serve --mcp # MCP server for Claude/Cursor
gerbil update # Update to latest version
Updates: Gerbil checks for updates but never installs without permission. Press
u in the REPL or run gerbil update.
Browser Usage
Run LLMs directly in the browser with WebGPU — no server required. The React hooks
live at @tryhamster/gerbil/hooks and run pure WebGPU compute:
import { useChat } from "@tryhamster/gerbil/hooks";
function Chat() {
const { messages, send, isLoading, isGenerating } = useChat();
if (isLoading) return <div>Loading model...</div>;
return (
<div>
{messages.map((m, i) => <div key={i}>{m.role}: {m.content}</div>)}
<button onClick={() => send("Hello!")} disabled={isGenerating}>Send</button>
</div>
);
}
@tryhamster/gerbil/browser exports the device & storage utilities
(isModelSafeForDevice, detectMemoryCrash, downloadModelChunked,
getRecommendedModels, requestPersistentStorage, …).
📖 Full Browser Documentation →
Integrations
| Integration | Import | Docs |
|-------------|--------|------|
| Browser | @tryhamster/gerbil/browser | 📖 Browser |
| AI SDK v5 | @tryhamster/gerbil/ai | 📖 AI SDK |
| Next.js | @tryhamster/gerbil/next | 📖 Next.js |
| Express | @tryhamster/gerbil/express | 📖 Express |
| LangChain | @tryhamster/gerbil/langchain | 📖 LangChain |
| MCP Server | npx @tryhamster/gerbil serve --mcp | 📖 MCP |
Native engine: import { WebGPUEngine } from "@tryhamster/gerbil/gpu" (or useEngine from @tryhamster/gerbil/hooks for React) is the primary surface for text, vision, embeddings, and speech.
Supported Models
The native engine runs these modalities today. All load straight from the HuggingFace Hub
via WebGPUEngine.create({ repo }).
Text
| Model | Repo | Notes |
|-------|------|-------|
| Qwen3.5-0.8B | mlx-community/Qwen3.5-0.8B-4bit | Default text model; vision-capable (Qwen/Qwen3.5-0.8B for the ViT) |
| Qwen3.5-2B | Qwen/Qwen3.5-2B | Higher quality; 262k context; multimodal (vision-capable) |
| LFM2.5-350M | LiquidAI/LFM2.5-350M | Hybrid conv/attention, very fast, ~199 MB q4 |
| Gemma 4 E2B | mlx-community/gemma-4-e2b-it-4bit | PLE CPU-streamed; vision-capable |
Vision (image → text, describeImage)
| Tower | From | Notes |
|-------|------|-------|
| Qwen3.5 ViT | Qwen/Qwen3.5-0.8B (enableVision: true) | Bit-exact vs HF |
| Gemma 4 ViT | mlx-community/gemma-4-e2b-it-4bit (enableVision: true) | Native projector |
Embeddings (embed)
| Model | Repo | Notes |
|-------|------|-------|
| EmbeddingGemma-300M | mlx-community/embeddinggemma-300m-4bit | 768-dim, asymmetric (taskType), runs on iPad |
Speech
| Model | Type | Repo | Notes |
|-------|------|------|-------|
| Kani-TTS-2 | TTS | nineninesix/kani-tts-2-en | engine.speak() → 22.05 kHz PCM |
| Moonshine | STT | UsefulSensors/moonshine-base | MoonshineSTT.transcribe(), raw-waveform |
Quantization & dtype
dtype: "auto" (the React-hook default) picks int4 on mobile and the repo's native
precision on desktop. For Qwen3.5-0.8B on Dawn/Node:
| Format | Download | tok/s | Notes |
|---|---|---|---|
| MLX 4-bit (affine) | 404 MB | fastest | Smallest. Recommended. |
| GPTQ (AutoRound) | 734 MB | fast | Pre-quantized linears, F16 embed |
| F32 (on-the-fly Q4) | 1666 MB | slowest | No pre-quantization needed |
Throughput varies from run to run and as the engine continues to be optimized; treat these rankings as relative, not promises.
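For readers unfamiliar with affine 4-bit formats, the general idea is that each weight is stored as a 4-bit integer q and recovered as scale × q + bias, with one scale and bias shared per group of weights. The CPU sketch below illustrates that idea only: the packing order (low nibble first), the group size of 64, and the dequantizeInt4 name are assumptions, and Gerbil's WGSL kernels may lay data out differently.

```typescript
// Sketch of group-wise affine 4-bit dequantization: w ≈ scale * q + bias.
// Packing order and group size are assumptions for illustration.
function dequantizeInt4(
  packed: Uint32Array, // 8 four-bit values per 32-bit word
  scales: Float32Array, // one scale per group
  biases: Float32Array, // one bias per group
  groupSize = 64,
): Float32Array {
  const out = new Float32Array(packed.length * 8);
  packed.forEach((word, wordIndex) => {
    for (let nibble = 0; nibble < 8; nibble++) {
      const index = wordIndex * 8 + nibble;
      const group = Math.floor(index / groupSize);
      const q = (word >>> (4 * nibble)) & 0xf; // low nibble first (assumed)
      out[index] = (scales[group] ?? 0) * q + (biases[group] ?? 0);
    }
  });
  return out;
}
```

Storing 4 bits per weight plus a small per-group overhead is what brings the download well below the F32 size in the table.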
WGSL Kernels
MatMul, MatMulInt4, EmbeddingInt4, RMSNorm, RoPE, GQA Attention (flash-style, causal + bidirectional), SwiGLU/GeGLU, CrossAttention, CausalConv1d, M-RoPE, EmbedSplice, FSQ + HiFi-GAN (NanoCodec decoder), and more.
High-level Gerbil class. import { Gerbil } from "@tryhamster/gerbil" (plus the one-liner and @tryhamster/gerbil/skills) is a supported convenience wrapper over the native WebGPUEngine, ideal for quick scripts, the CLI, and the AI SDK. Reach for WebGPUEngine / useEngine directly when you want lower-level control over loading, vision, embeddings, and speech.
Documentation
Full documentation, guides, and a live playground are available at gerbilsdk.com/docs.
| Guide | Description |
|-------|-------------|
| 📖 Getting Started | Install, load a model, core concepts |
| 📖 Structured Output | generateObject / useObject — validated JSON with retries |
| 📖 Embeddings | EmbeddingGemma semantic search, similarity, RAG |
| 📖 Vision | Image → text with Qwen3.5 ViT & Gemma 4 ViT |
| 📖 Text-to-Speech | Native Kani-TTS-2 (engine.speak()) |
| 📖 Speech-to-Text | Native Moonshine (MoonshineSTT) |
| 📖 Browser | WebGPU inference, React hooks |
| 📖 Hooks | useEngine / useObject / useTTS / useSTT |
| 📖 Skills | Built-in skills, custom skill development |
| 📖 Tools | Tool calling, agentic workflows |
| 📖 REPL | Interactive terminal dashboard |
| 📖 AI SDK | Vercel AI SDK v5 (LLM, TTS, STT, Embeddings) |
| 📖 Frameworks | Next.js, Express, React, LangChain |
| 📖 CLI | All CLI commands and options |
| 📖 Mobile | iOS / iPadOS guidance & memory guards |
| 📖 MCP | MCP server for Claude Desktop & Cursor |
Requirements
The native engine needs a real GPU and a WebGPU runtime:
- Browser: Chrome/Edge 113+, Safari 26+ (iOS/iPadOS 26+), or Firefox 141+
- Node: Node.js 18+ with the webgpu package (Dawn) installed
On devices without WebGPU the engine throws a clear error rather than silently degrading.
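If you would rather show a fallback UI than catch that error, you can probe for WebGPU yourself before creating an engine. The sketch below uses the standard navigator.gpu.requestAdapter() call, which resolves to null when no suitable adapter exists. hasUsableWebGPU and the GpuLike type are hypothetical names for illustration.

```typescript
// Hypothetical pre-flight check; `nav` is any navigator-like object.
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}

async function hasUsableWebGPU(nav: { gpu?: GpuLike } | undefined): Promise<boolean> {
  if (!nav?.gpu) return false; // the WebGPU API is not exposed at all
  try {
    // requestAdapter() resolves to null when no suitable GPU is available.
    return (await nav.gpu.requestAdapter()) !== null;
  } catch {
    return false;
  }
}
```

In a browser you would pass the global navigator object and render a "WebGPU not supported" message when the result is false.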
License
MIT
