@tryhamster/gerbil

v1.6.1

Published

20 hours ago

On-device LLM inference for the browser and Node.js — text, vision, speech (TTS/STT), and embeddings on a native WebGPU engine. Offline, private, no API keys. React hooks, MCP, and Vercel AI SDK included.


eyaltoledano

crunchyman-ralph

llm local gpu webgpu inference ai-sdk transformers qwen thinking mcp langchain nextjs express cli

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });
const { text } = await engine.generate("Explain recursion in one sentence");

📚 Full docs → gerbilsdk.com/docs · live playground & demos at gerbilsdk.com

Why Gerbil?

One native engine — pure WebGPU/WGSL compute shaders, nothing extra to ship.
~90 KB gzipped — the entire multimodal engine. No heavyweight ML runtime; model weights stream from the HuggingFace Hub at run time.
Multimodal, all native — text, vision (image→text), embeddings, and speech run on the same engine, loading safetensors directly from the HuggingFace Hub.
Browser & Node — Chrome 113+, Safari 26+ (iOS 26+), Firefox 141+, and Node via Dawn (webgpu npm), anywhere there's a real GPU.
Local & private — no API keys, nothing leaves the device.
React-first — useEngine owns load / unload / hot-swap and shares one engine across components (reference-counted), with dtype: "auto" picking int4 on mobile.
Framework ready — Vercel AI SDK v5, Next.js, Express, LangChain adapters.
Skills & tools — built-in + custom skills with Zod validation; agentic tool calling.

Install

# Try without installing (one-off usage)
npx @tryhamster/gerbil

# Install globally
npm install -g @tryhamster/gerbil

# Or install in your project
npm install @tryhamster/gerbil

After global install, use gerbil directly instead of npx @tryhamster/gerbil.

Native WebGPU Engine

Gerbil runs LLMs directly on the GPU: pure WebGPU/WGSL compute, no Python, no native runtime, nothing to install. The same code runs in a browser tab and in Node (via Dawn). Weights stream straight from the HuggingFace Hub, and only the tensors a request needs are loaded, so a text-only session skips the vision tower.

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// dtype "auto" picks int4 on mobile, the repo's native precision on desktop.
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  dtype: "auto",
});

// Generate
const { text, tokensPerSecond } = await engine.generate("Write a haiku about gerbils");
console.log(text, `(${tokensPerSecond.toFixed(1)} tok/s)`);

// Stream
for await (const token of engine.stream("Tell me a story")) {
  process.stdout.write(token);
}

engine.destroy();

WebGPUEngine.create({ repo, dtype, enableVision, embedding, maxSeqLen }) returns an engine with generate, stream, describeImage, embed, and speak. See Supported Models below for the model lineup.

Benchmark it

The same engine runs server-side on Node (via Dawn), so you can watch local decode rip on your own GPU — no API key, no cloud:

npx @tryhamster/gerbil bench           # tok/s + first-token latency on your machine

Reports steady-state decode tok/s, time-to-first-token, and the device it ran on. For a copy-pasteable version see examples/benchmark.ts (and the rest of examples/).
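
The two headline numbers can also be measured around any token stream. The sketch below assumes a stream is an `AsyncIterable<string>`, as the `for await` loop over `engine.stream(...)` above suggests. `measureStream` and its injectable clock are hypothetical helpers, not part of Gerbil, and this is not necessarily how `gerbil bench` computes its figures.

```typescript
// Hypothetical helper (not a Gerbil API): times any async token stream.
interface StreamStats {
  tokens: number;
  firstTokenMs: number; // time-to-first-token
  tokensPerSecond: number; // decode rate, measured after the first token
}

async function measureStream(
  stream: AsyncIterable<string>,
  now: () => number = () => Date.now(),
): Promise<StreamStats> {
  const start = now();
  let first = -1;
  let last = start;
  let tokens = 0;
  for await (const _token of stream) {
    last = now();
    if (first < 0) first = last;
    tokens++;
  }
  const firstTokenMs = first < 0 ? 0 : first - start;
  const decodeMs = first < 0 ? 0 : last - first;
  // Count only tokens after the first one so prefill time is excluded.
  const tokensPerSecond =
    tokens > 1 && decodeMs > 0 ? ((tokens - 1) * 1000) / decodeMs : 0;
  return { tokens, firstTokenMs, tokensPerSecond };
}
```

Separating the first token from the rest matters because prompt processing and steady-state decoding usually have very different costs.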

React Quickstart

useEngine (from @tryhamster/gerbil/hooks) owns the full engine lifecycle — load, unload, hot-swap on config change, and reference-counted sharing so multiple components never upload the same weights to the GPU twice.

import { useEngine } from "@tryhamster/gerbil/hooks";

function Chat() {
  const { complete, completion, isLoading, isGenerating, tps } = useEngine({
    model: "mlx-community/Qwen3.5-0.8B-4bit",
    autoLoad: true, // dtype defaults to "auto": int4 on mobile, native on desktop
  });

  if (isLoading) return <div>Loading model…</div>;
  return (
    <div>
      <button onClick={() => complete("What is 2+2?")} disabled={isGenerating}>
        Generate
      </button>
      <p>{completion}</p>
      {isGenerating && <span>{tps?.toFixed(1)} tok/s</span>}
    </div>
  );
}

The same hook exposes describeImage (vision), embed/similarity (embeddings), stop, and dispose. Pass enableVision: true or embedding: true to load those modalities.
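
Reference-counted sharing, as described for useEngine above, can be pictured with a small cache: the first consumer creates the resource, later consumers reuse it, and it is destroyed only when the last one releases it. This is a generic sketch of the technique. `RefCountedCache` and its method names are illustrative and are not Gerbil's internals.

```typescript
// Hypothetical reference-counted cache; not Gerbil's actual implementation.
class RefCountedCache<T> {
  private entries = new Map<string, { value: T; refs: number }>();
  private create: (key: string) => T;
  private destroy: (value: T) => void;

  constructor(create: (key: string) => T, destroy: (value: T) => void) {
    this.create = create;
    this.destroy = destroy;
  }

  // Returns the shared value for `key`, creating it on first use.
  acquire(key: string): T {
    let entry = this.entries.get(key);
    if (!entry) {
      entry = { value: this.create(key), refs: 0 };
      this.entries.set(key, entry);
    }
    entry.refs++;
    return entry.value;
  }

  // Drops one reference; the value is destroyed when no references remain.
  release(key: string): void {
    const entry = this.entries.get(key);
    if (!entry) return;
    entry.refs--;
    if (entry.refs === 0) {
      this.entries.delete(key);
      this.destroy(entry.value);
    }
  }
}
```

In a React setting, `acquire` would typically run when a component mounts and `release` when it unmounts, keyed by the model configuration.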

Gate your app behind the download. Wrap your app in <GerbilGate> to show a splash until the engine is ready — and keep it warm across navigation (no GPU re-upload):

import { GerbilGate } from "@tryhamster/gerbil/hooks";

<GerbilGate
  model="mlx-community/Qwen3.5-0.8B-4bit"
  fallback={({ progress }) => <Splash percent={progress} />}
>
  <App />
</GerbilGate>;

Structured Output

generateObject makes the model return a JSON object: it generates, extracts the JSON, validates it, and retries with a corrective nudge until it's valid (or maxRetries is hit). Validate with a predicate (o) => boolean or a minimal { required: [...] } schema; omit schema to accept any valid JSON.

import { generateObject } from "@tryhamster/gerbil";

const { object, attempts } = await generateObject<{ name: string; age: number }>(
  'Extract {name, age} from: "I am Sarah, 28"',
  { schema: { required: ["name", "age"] } },
);
// object === { name: "Sarah", age: 28 }

It's available on the engine, the Gerbil class, and the one-liner API:

import { Gerbil, WebGPUEngine } from "@tryhamster/gerbil";

const g = new Gerbil();
await g.loadModel("qwen3.5-0.8b");
const { object } = await g.generateObject("List 3 primes as {primes: number[]}", {
  schema: (o) => Array.isArray((o as any).primes),
});

// Or directly on the engine:
const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });
await engine.generateObject("…", { schema: { required: ["title"] } });

In React, use useObject (from @tryhamster/gerbil/hooks):

import { useObject } from "@tryhamster/gerbil/hooks";

const { generate, object, isGenerating } = useObject<{ city: string }>();
await generate("Extract the city from: I live in Paris", {
  schema: { required: ["city"] },
});

From the CLI:

gerbil object "Extract {name, age}: I am Sarah, 28" --schema person.json
# person.json: { "required": ["name", "age"] }
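
The generate, extract, validate, retry loop described at the start of this section can be sketched independently of the library. Everything here is an assumption made for illustration: the `generate` callback stands in for any text generator, the default of 3 retries and the wording of the corrective nudge are invented, and Gerbil's real `generateObject` may extract and validate differently.

```typescript
// Illustrative sketch only; names and defaults are hypothetical.
type Validator = ((o: unknown) => boolean) | { required: string[] };

// Take the span from the first "{" to the last "}" and try to parse it.
function extractJson(text: string): unknown {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start < 0 || end <= start) return undefined;
  try {
    return JSON.parse(text.slice(start, end + 1));
  } catch {
    return undefined;
  }
}

// No schema accepts any parsed JSON; a predicate or a required-keys list narrows it.
function isValid(obj: unknown, schema?: Validator): boolean {
  if (obj === undefined) return false;
  if (!schema) return true;
  if (typeof schema === "function") return schema(obj);
  if (typeof obj !== "object" || obj === null) return false;
  return schema.required.every((key) => key in obj);
}

async function generateObjectSketch(
  generate: (prompt: string) => Promise<string>,
  prompt: string,
  options: { schema?: Validator; maxRetries?: number } = {},
): Promise<{ object: unknown; attempts: number }> {
  const maxRetries = options.maxRetries ?? 3; // assumed default
  let currentPrompt = prompt;
  for (let attempts = 1; attempts <= maxRetries + 1; attempts++) {
    const obj = extractJson(await generate(currentPrompt));
    if (isValid(obj, options.schema)) return { object: obj, attempts };
    // Corrective nudge for the next attempt.
    currentPrompt = `${prompt}\nReturn only one valid JSON object with the requested fields.`;
  }
  throw new Error(`No valid JSON after ${maxRetries + 1} attempts`);
}
```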

Embeddings

Native text embeddings via EmbeddingGemma-300M (mean-pooled Gemma3 encoder + Dense head, 768-dim, L2-normalized). EmbeddingGemma is asymmetric — pass taskType so queries and documents get the right prefix.

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "mlx-community/embeddinggemma-300m-4bit",
  embedding: true,
});

const query = await engine.embed("capital of France", { taskType: "query" });
const doc = await engine.embed("Paris is the capital of France", { taskType: "document" });

// Vectors are unit-norm, so cosine similarity is a dot product.
const sim = query.reduce((s, v, i) => s + v * doc[i], 0);
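
Extending the dot product above to a small semantic search is mostly a sort. The sketch below ranks documents against a query, assuming every vector is already unit-norm and has the same dimension. `dot` and `rankBySimilarity` are hypothetical helpers, not Gerbil exports.

```typescript
// Hypothetical helpers (not Gerbil exports) for ranking unit-norm embeddings.
interface EmbeddedDoc {
  id: string;
  vector: ArrayLike<number>;
}

// Dot product; equals cosine similarity when both vectors are unit-norm.
function dot(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Returns the topK documents, highest similarity first.
function rankBySimilarity(
  query: ArrayLike<number>,
  docs: EmbeddedDoc[],
  topK: number = docs.length,
): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: dot(query, d.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

In practice the query vector would come from an embed call with taskType "query" and the document vectors from calls with taskType "document".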

📖 Full Embeddings Documentation →

Vision

Image-in → text-out via the native vision towers (Qwen3.5 ViT and Gemma 4 ViT). Load with enableVision: true, then call describeImage.

const engine = await WebGPUEngine.create({
  repo: "Qwen/Qwen3.5-0.8B",
  enableVision: true,
});

// In Node, decode the image to RGB pixels (HWC, 0..255) yourself; in the browser the
// React hook's describeImage() takes a URL / data-URL directly.
const { text } = await engine.describeImage(
  { pixels, width, height },
  "What's in this image?",
);
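
Many image decoders and the canvas `ImageData` type produce RGBA bytes, while the comment above asks for RGB in HWC order. A minimal conversion sketch follows. It assumes `pixels` can be a `Uint8Array`, which the source does not state explicitly, and `rgbaToRgb` is a hypothetical helper rather than a Gerbil function.

```typescript
// Hypothetical helper: drop the alpha channel from row-major RGBA bytes.
function rgbaToRgb(
  rgba: Uint8Array | Uint8ClampedArray,
  width: number,
  height: number,
): { pixels: Uint8Array; width: number; height: number } {
  if (rgba.length !== width * height * 4) {
    throw new Error("RGBA buffer size does not match width * height * 4");
  }
  const pixels = new Uint8Array(width * height * 3);
  for (let src = 0, dst = 0; src < rgba.length; src += 4, dst += 3) {
    pixels[dst] = rgba[src]; // R
    pixels[dst + 1] = rgba[src + 1]; // G
    pixels[dst + 2] = rgba[src + 2]; // B (alpha at src + 3 is discarded)
  }
  return { pixels, width, height };
}
```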

📖 Full Vision Documentation →

Speech

Text-to-speech — native Kani-TTS-2 (LFM2-350M codec-LM + NVIDIA NeMo NanoCodec). engine.speak() returns 22.05 kHz mono PCM.

const engine = await WebGPUEngine.create({ repo: "nineninesix/kani-tts-2-en" });
const { pcm, sampleRate } = await engine.speak("Hello, I'm Gerbil!"); // sampleRate === 22050
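
To save or play the result, the PCM samples need a container such as WAV. The sketch below writes a 16-bit mono WAV file in memory. It assumes `pcm` is a `Float32Array` of samples in the range -1 to 1, which the source does not confirm. `encodeWav` is a hypothetical helper, not a Gerbil export.

```typescript
// Hypothetical helper: wrap mono float PCM in a 16-bit WAV (RIFF) container.
function encodeWav(pcm: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = pcm.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // format 1 = integer PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < pcm.length; i++) {
    const clamped = Math.max(-1, Math.min(1, pcm[i]));
    view.setInt16(44 + i * 2, Math.round(clamped * 32767), true);
  }
  return new Uint8Array(buffer);
}
```

In Node the returned bytes can be written with `fs.writeFileSync("out.wav", bytes)`. In the browser they can be wrapped in a `Blob` with type `audio/wav`.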

Speech-to-text — native Moonshine (raw-waveform encoder/decoder, no FFT/log-mel) via the dedicated MoonshineSTT class.

import { MoonshineSTT } from "@tryhamster/gerbil/gpu";

const stt = await MoonshineSTT.create({ repo: "UsefulSensors/moonshine-base" });
const { text, noSpeech } = await stt.transcribe(pcm16kMono); // noSpeech flags silence

transcribe returns noSpeech (RMS VAD + min-duration + marker denylist) so you can skip silent/empty clips; useSTT surfaces it too, with an onNoSpeech callback.
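
An RMS-plus-minimum-duration check of the kind mentioned above can be sketched in a few lines. The thresholds here (0.01 RMS, 0.25 s) are illustrative guesses, not Gerbil's actual values, and the marker denylist is omitted. `looksLikeNoSpeech` is a hypothetical helper and assumes float samples in the range -1 to 1.

```typescript
// Hypothetical silence check; thresholds are illustrative, not Gerbil's.
function rms(pcm: Float32Array): number {
  if (pcm.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < pcm.length; i++) sumSquares += pcm[i] * pcm[i];
  return Math.sqrt(sumSquares / pcm.length);
}

function looksLikeNoSpeech(
  pcm: Float32Array,
  sampleRate: number,
  rmsThreshold: number = 0.01,
  minDurationSec: number = 0.25,
): boolean {
  const durationSec = pcm.length / sampleRate;
  if (durationSec < minDurationSec) return true; // too short to contain speech
  return rms(pcm) < rmsThreshold; // too quiet to contain speech
}
```

Running a cheap check like this before transcription avoids spending GPU time on clips that cannot contain speech.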

📖 Full TTS Documentation → · Full STT Documentation →

Skills

Built-in AI skills with Zod-validated inputs:

import { commit, summarize, explain, review } from "@tryhamster/gerbil/skills";

const msg = await commit({ type: "conventional" });
const summary = await summarize({ content: doc, length: "short" });
const explanation = await explain({ content: code, level: "beginner" });

Custom Skills

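import { z } from "zod";
import { defineSkill, loadSkills, useSkill } from "@tryhamster/gerbil/skills";

// Illustrative output shape (an assumption; adjust to your use case)
const outputSchema = z.object({ sentiment: z.enum(["positive", "negative", "neutral"]) });

// Define inline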
const sentiment = defineSkill({
  name: "sentiment",
  description: "Analyze text sentiment",
  input: z.object({ text: z.string() }),
  async run(input, gerbil) {
    return gerbil.json(`Sentiment of: ${input.text}`, { schema: outputSchema });
  },
});

// Or load from files
await loadSkills("./skills");  // loads *.skill.ts
const skill = useSkill("my-skill");

📖 Full Skills Documentation →

Tools & Agents

Gerbil supports tool calling with Qwen3 models for agentic workflows. A tool is a plain object (AgentTool) — give it a name, description, optional parameters, and an execute function:

import type { AgentTool } from "@tryhamster/gerbil/gpu";

const weatherTool: AgentTool = {
  name: "get_weather",
  description: "Get weather for a city",
  parameters: { city: "string" },
  execute: ({ city }) => `Weather in ${city}: 72°F, sunny`,
};

Agentic loop, on-device. engine.generateWithTools (and the useAgent React hook) run the whole loop — generate → call a tool → feed the result back → repeat — and return a step trace for UIs:

import { useAgent } from "@tryhamster/gerbil/hooks";

const { run, steps, answer, isRunning } = useAgent({
  model: "mlx-community/Qwen3.5-0.8B-4bit",
  tools: [
    {
      name: "get_weather",
      description: "Get the weather for a city",
      parameters: { city: "string" },
      execute: ({ city }) => `Weather in ${city}: 72°F, sunny`,
    },
  ],
});
await run("What's the weather in Paris?"); // steps[]: tool_call → tool_result → answer
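
The loop itself (generate, call a tool, feed the result back, repeat) is small enough to sketch without the library. The tool shape mirrors the AgentTool fields shown above. Everything else is an assumption: the `ModelTurn` format, the transcript strings, the step limit of 5, and the name `runAgentLoop` are invented for illustration and do not describe how `generateWithTools` works internally.

```typescript
// Illustrative agent loop; the model interface and step format are hypothetical.
interface Tool {
  name: string;
  description: string;
  parameters?: Record<string, string>;
  execute: (args: Record<string, string>) => string | Promise<string>;
}

type ModelTurn =
  | { type: "tool_call"; name: string; args: Record<string, string> }
  | { type: "answer"; text: string };

type Step =
  | { kind: "tool_call"; name: string; args: Record<string, string> }
  | { kind: "tool_result"; name: string; result: string }
  | { kind: "answer"; text: string };

async function runAgentLoop(
  model: (transcript: string[]) => Promise<ModelTurn>,
  tools: Tool[],
  prompt: string,
  maxSteps: number = 5,
): Promise<Step[]> {
  const transcript = [`user: ${prompt}`];
  const steps: Step[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const turn = await model(transcript);
    if (turn.type === "answer") {
      steps.push({ kind: "answer", text: turn.text });
      return steps;
    }
    steps.push({ kind: "tool_call", name: turn.name, args: turn.args });
    const tool = tools.find((t) => t.name === turn.name);
    const result = tool ? await tool.execute(turn.args) : `Unknown tool: ${turn.name}`;
    steps.push({ kind: "tool_result", name: turn.name, result });
    // Feed the tool output back so the next model turn can use it.
    transcript.push(`tool ${turn.name}: ${result}`);
  }
  steps.push({ kind: "answer", text: "Stopped: step limit reached." });
  return steps;
}
```

The step limit keeps a confused model from calling tools forever, and the returned trace is what a UI would render.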

Built-in tools:

gerbil_docs — Search Gerbil documentation
run_skill — Execute any Gerbil skill

In the REPL, Agent mode is on by default and enables tool calling:

npx @tryhamster/gerbil repl
# Press ⌘A to toggle agent mode on/off
# Ask: "how do I use gerbil with next.js?"
# Gerbil will call the docs tool and synthesize an answer

📖 Full Tools Documentation →

Autocomplete & Rewrite

Inline autocomplete — engine.autocomplete(prefix) and the debounced useAutocomplete hook return a brief single-line continuation (low-latency defaults + cleanup):

import { useAutocomplete } from "@tryhamster/gerbil/hooks";

const { suggestion, onInput, accept, dismiss } = useAutocomplete({
  model: "mlx-community/Qwen3.5-0.8B-4bit",
});
// <input onChange={(e) => onInput(e.target.value)} /> — render `suggestion` as ghost text;
// Tab → accept(), Esc → dismiss()
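
Debouncing for autocomplete usually means two things: wait for a pause in typing before calling the model, and ignore a result if the user has typed again since the request started. The sketch below shows both. The 150 ms delay, the `Timers` interface, and `createDebouncedAutocomplete` are assumptions for illustration and do not describe how useAutocomplete is implemented.

```typescript
// Hypothetical debouncer; not the implementation behind useAutocomplete.
interface Timers {
  set: (fn: () => void, ms: number) => unknown;
  clear: (handle: unknown) => void;
}

const realTimers: Timers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

function createDebouncedAutocomplete(
  complete: (prefix: string) => Promise<string>,
  onSuggestion: (suggestion: string) => void,
  delayMs: number = 150, // assumed delay
  timers: Timers = realTimers,
): (prefix: string) => void {
  let handle: unknown;
  let latest = 0; // id of the most recent keystroke

  return function onInput(prefix: string): void {
    const id = ++latest;
    if (handle !== undefined) timers.clear(handle); // cancel the pending call
    handle = timers.set(async () => {
      const suggestion = await complete(prefix);
      // Drop results that arrive after a newer keystroke.
      if (id === latest) onSuggestion(suggestion);
    }, delayMs);
  };
}
```

The injectable `timers` argument exists only so the behavior can be tested without real delays.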

Tone rewrite — engine.rewrite(text, { tone }) (and useEngine().rewrite) re-generates text in a target tone ("professional", "friendly", "concise", "playful", "pirate") or with free-form instructions.

📖 Full Autocomplete Documentation →

CLI

# Without installing (use npx)
npx @tryhamster/gerbil                        # Interactive REPL (default)
npx @tryhamster/gerbil "Write a haiku"        # Generate text

# After installing globally (npm i -g @tryhamster/gerbil)
gerbil                                        # Interactive REPL
gerbil "Write a haiku"                        # Generate text
gerbil commit                                 # Commit message from staged changes
gerbil summarize README.md                    # Summarize file
gerbil chat --thinking                        # Interactive chat
gerbil speak "Hello world" --voice af_heart   # Text-to-speech
gerbil transcribe audio.wav                   # Speech-to-text
gerbil serve --mcp                            # MCP server for Claude/Cursor
gerbil update                                 # Update to latest version

Updates: Gerbil checks for updates but never installs without permission. Press u in REPL or run gerbil update.

📖 Full CLI Documentation →

Browser Usage

Run LLMs directly in the browser with WebGPU — no server required. The React hooks live at @tryhamster/gerbil/hooks and run pure WebGPU compute:

import { useChat } from "@tryhamster/gerbil/hooks";

function Chat() {
  const { messages, send, isLoading, isGenerating } = useChat();

  if (isLoading) return <div>Loading model...</div>;

  return (
    <div>
      {messages.map((m, i) => <div key={i}>{m.role}: {m.content}</div>)}
      <button onClick={() => send("Hello!")} disabled={isGenerating}>Send</button>
    </div>
  );
}

@tryhamster/gerbil/browser exports the device & storage utilities (isModelSafeForDevice, detectMemoryCrash, downloadModelChunked, getRecommendedModels, requestPersistentStorage, …).

📖 Full Browser Documentation →

Integrations

| Integration | Import | Docs |
|-------------|--------|------|
| Browser | @tryhamster/gerbil/browser | 📖 Browser |
| AI SDK v5 | @tryhamster/gerbil/ai | 📖 AI SDK |
| Next.js | @tryhamster/gerbil/next | 📖 Next.js |
| Express | @tryhamster/gerbil/express | 📖 Express |
| LangChain | @tryhamster/gerbil/langchain | 📖 LangChain |
| MCP Server | npx @tryhamster/gerbil serve --mcp | 📖 MCP |

Native engine: import { WebGPUEngine } from "@tryhamster/gerbil/gpu" (or useEngine from @tryhamster/gerbil/hooks for React) is the primary surface for text, vision, embeddings, and speech.

Supported Models

The native engine runs these modalities today. All load straight from the HuggingFace Hub via WebGPUEngine.create({ repo }).

Text

| Model | Repo | Notes |
|-------|------|-------|
| Qwen3.5-0.8B | mlx-community/Qwen3.5-0.8B-4bit | Default text model; vision-capable (Qwen/Qwen3.5-0.8B for the ViT) |
| Qwen3.5-2B | Qwen/Qwen3.5-2B | Higher quality; 262k context; multimodal (vision-capable) |
| LFM2.5-350M | LiquidAI/LFM2.5-350M | Hybrid conv/attention, very fast, ~199 MB q4 |
| Gemma 4 E2B | mlx-community/gemma-4-e2b-it-4bit | PLE CPU-streamed; vision-capable |

Vision (image → text, `describeImage`)

| Tower | From | Notes |
|-------|------|-------|
| Qwen3.5 ViT | Qwen/Qwen3.5-0.8B (enableVision: true) | Bit-exact vs HF |
| Gemma 4 ViT | mlx-community/gemma-4-e2b-it-4bit (enableVision: true) | Native projector |

Embeddings (`embed`)

| Model | Repo | Notes |
|-------|------|-------|
| EmbeddingGemma-300M | mlx-community/embeddinggemma-300m-4bit | 768-dim, asymmetric (taskType), runs on iPad |

Speech

| Model | Type | Repo | Notes |
|-------|------|------|-------|
| Kani-TTS-2 | TTS | nineninesix/kani-tts-2-en | engine.speak() → 22.05 kHz PCM |
| Moonshine | STT | UsefulSensors/moonshine-base | MoonshineSTT.transcribe(), raw-waveform |

Quantization & dtype

dtype: "auto" (the React-hook default) picks int4 on mobile and the repo's native precision on desktop. For Qwen3.5-0.8B on Dawn/Node:

| Format | Download | tok/s | Notes |
|---|---|---|---|
| MLX 4-bit (affine) | 404 MB | fastest | Smallest. Recommended. |
| GPTQ (AutoRound) | 734 MB | fast | Pre-quantized linears, F16 embed |
| F32 (on-the-fly Q4) | 1666 MB | slowest | No pre-quantization needed |

Throughput varies from run to run and changes as the engine is optimized, so treat the tok/s column as a relative ranking, not a guarantee.
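
For readers unfamiliar with the term, affine 4-bit quantization stores each weight as an integer from 0 to 15 plus a shared scale and offset per group of weights. The sketch below shows the general idea on one group. It is not the MLX file layout: the group boundaries are left to the caller, and values are stored one per byte for clarity where real formats pack two per byte. `quantizeGroup` and `dequantizeGroup` are hypothetical names.

```typescript
// Generic affine 4-bit quantization of one weight group (illustrative only).
interface QuantGroup {
  q: Uint8Array; // values in 0..15, one per byte for readability
  scale: number;
  min: number;
}

function quantizeGroup(weights: Float32Array): QuantGroup {
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < weights.length; i++) {
    if (weights[i] < min) min = weights[i];
    if (weights[i] > max) max = weights[i];
  }
  // 16 levels span the group's range; fall back to 1 for a constant group.
  const scale = max > min ? (max - min) / 15 : 1;
  const q = new Uint8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    q[i] = Math.min(15, Math.max(0, Math.round((weights[i] - min) / scale)));
  }
  return { q, scale, min };
}

// Reconstruct approximate weights: w ≈ q * scale + min.
function dequantizeGroup(group: QuantGroup): Float32Array {
  const out = new Float32Array(group.q.length);
  for (let i = 0; i < out.length; i++) out[i] = group.q[i] * group.scale + group.min;
  return out;
}
```

The reconstruction error per weight is at most half a quantization step, which is why smaller groups (tighter ranges) generally preserve quality better at the cost of more scale and offset metadata.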

WGSL Kernels

MatMul, MatMulInt4, EmbeddingInt4, RMSNorm, RoPE, GQA Attention (flash-style, causal + bidirectional), SwiGLU/GeGLU, CrossAttention, CausalConv1d, M-RoPE, EmbedSplice, FSQ + HiFi-GAN (NanoCodec decoder), and more.

High-level Gerbil class. import { Gerbil } from "@tryhamster/gerbil" (plus the one-liner and @tryhamster/gerbil/skills) is a supported convenience wrapper over the native WebGPUEngine — ideal for quick scripts, the CLI, and the AI SDK. Reach for WebGPUEngine / useEngine directly when you want lower-level control over loading, vision, embeddings, and speech.

Documentation

Full documentation, guides, and a live playground are available at gerbilsdk.com/docs.

| Guide | Description |
|-------|-------------|
| 📖 Getting Started | Install, load a model, core concepts |
| 📖 Structured Output | generateObject / useObject: validated JSON with retries |
| 📖 Embeddings | EmbeddingGemma semantic search, similarity, RAG |
| 📖 Vision | Image → text with Qwen3.5 ViT & Gemma 4 ViT |
| 📖 Text-to-Speech | Native Kani-TTS-2 (engine.speak()) |
| 📖 Speech-to-Text | Native Moonshine (MoonshineSTT) |
| 📖 Browser | WebGPU inference, React hooks |
| 📖 Hooks | useEngine / useObject / useTTS / useSTT |
| 📖 Skills | Built-in skills, custom skill development |
| 📖 Tools | Tool calling, agentic workflows |
| 📖 REPL | Interactive terminal dashboard |
| 📖 AI SDK | Vercel AI SDK v5 (LLM, TTS, STT, Embeddings) |
| 📖 Frameworks | Next.js, Express, React, LangChain |
| 📖 CLI | All CLI commands and options |
| 📖 Mobile | iOS / iPadOS guidance & memory guards |
| 📖 MCP | MCP server for Claude Desktop & Cursor |

Requirements

The native engine needs a real GPU and a WebGPU runtime:

Browser — Chrome/Edge 113+, Safari 26+ (iOS/iPadOS 26+), or Firefox 141+
Node — Node.js 18+ with the webgpu package (Dawn) installed

On devices without WebGPU the engine throws a clear error rather than silently degrading.
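
An app can check for WebGPU up front and show its own message instead of waiting for the engine to throw. The sketch below uses the standard `navigator.gpu.requestAdapter()` call, which resolves to `null` when no suitable adapter exists. `hasUsableWebGPU` is a hypothetical helper, not a Gerbil export, and it takes the navigator-like object as a parameter so it can run where `navigator` is absent.

```typescript
// Hypothetical pre-flight check built on the standard WebGPU entry point.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

async function hasUsableWebGPU(nav?: { gpu?: GpuLike }): Promise<boolean> {
  if (!nav || !nav.gpu) return false; // API not exposed at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined; // null means no usable GPU
  } catch {
    return false;
  }
}
```

In a browser this would be called with the global `navigator` object before mounting any component that loads a model.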

License

MIT
