mohdel
v0.114.0
Mohdel
Self-hosted LLM gateway and SDK for Node — think LiteLLM, for the JS world. One answer() call for 13 providers; swap models by changing one string; get real per-call USD cost back on every result, with OpenTelemetry built in and process isolation when you need it. Your keys, your infra, no SaaS proxy in the path.
npm install -g mohdel
mo # interactive setup — pick a provider, paste your API key
mo ask gemini/gemini-3-flash-preview "why is the sky blue"

Providers: Anthropic, OpenAI, Gemini, Mistral, Groq, xAI, Cerebras, Fireworks, DeepSeek, Qwen Cloud, Xiaomi, OpenRouter, Novita. Node 22+, ES modules.
Why mohdel
- Real numbers on every call. Token counts and per-call USD cost computed from your own pricing catalog (curated.json): not estimates, not provider-specific shapes. Bill tenants, alert on spend, reconcile invoices. See docs/CATALOG.md for the catalog format.
- One interface across providers. Same answer() call, same event stream, same { status, output, inputTokens, outputTokens, cost } result. Switching from anthropic/claude-sonnet-4-6 to openai/gpt-5.4-mini is one string change; adapter differences stay inside mohdel.
- Self-hosted, no vendor in the path. API keys live in ~/.config/mohdel/. Mohdel calls provider APIs directly; nothing routes through a third party, nothing marks up your tokens, no extra hop of availability risk.
- Observability without instrumentation. OpenTelemetry spans, trace-linked logs, and OTLP metrics over one endpoint. Set OTEL_EXPORTER_OTLP_ENDPOINT; everything else is wired.
- Two integration paths, same API. In-process factory for CLI tools, scripts, single-process services. Optional thin-gate subprocess for fault isolation, cross-process quota, and any-language HTTP callers, with no code change to switch.
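As a rough illustration of the "bill tenants" use case, the sketch below totals calls, tokens, and cost per tenant from a list of results. It relies only on the result fields named above (inputTokens, outputTokens, cost); the tenant key and the sumByTenant helper are hypothetical names for this example, not part of mohdel, and the numbers are made up.

```javascript
// Hypothetical helper: aggregate per-call results by tenant.
// Each record pairs a caller-chosen tenant id with a mohdel-style result
// ({ inputTokens, outputTokens, cost }, cost in USD).
function sumByTenant (records) {
  const totals = new Map()
  for (const { tenant, result } of records) {
    const t = totals.get(tenant) ?? { calls: 0, inputTokens: 0, outputTokens: 0, cost: 0 }
    t.calls += 1
    t.inputTokens += result.inputTokens
    t.outputTokens += result.outputTokens
    t.cost += result.cost
    totals.set(tenant, t)
  }
  return totals
}

// Example with invented numbers (not real pricing).
const totals = sumByTenant([
  { tenant: 'acme', result: { inputTokens: 100, outputTokens: 20, cost: 0.25 } },
  { tenant: 'acme', result: { inputTokens: 50, outputTokens: 10, cost: 0.5 } },
  { tenant: 'globex', result: { inputTokens: 10, outputTokens: 5, cost: 0.125 } }
])
console.log(Object.fromEntries(totals))
```

Floating-point sums drift over many calls, so for invoicing it may be safer to accumulate in integer micro-dollars and convert at the end.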
How it compares
The one-paragraph version: LiteLLM is the closest analog but lives in Python; Vercel AI SDK is an application toolkit, not an infra layer; OpenRouter is the same one-API promise as a SaaS in your request path; raw provider SDKs are N different shapes with no cost accounting.
| | mohdel | LiteLLM | Vercel AI SDK | OpenRouter | Raw SDKs |
|---|---|---|---|---|---|
| Runs in a Node stack natively | yes | Python service | yes | n/a (SaaS) | yes |
| Per-call USD cost on the result | yes | yes | no | yes | no |
| Self-hosted, keys never leave your infra | yes | yes | yes | no | yes |
| Provider-SDK process isolation | yes (thin-gate) | proxy only | no | n/a | no |
| OTel spans + metrics out of the box | yes | via callbacks | no | no | no |
| UI streaming helpers, structured output, agents | no (by design) | no | yes | no | varies |
- vs LiteLLM: same core promise (unified calls, cost tracking, self-hosted gateway), but Node-native. If your stack is JS, there's no Python sidecar to deploy, version, and monitor. The honest gap: LiteLLM's proxy exposes an OpenAI-compatible endpoint and admin features (virtual keys, budgets); thin-gate speaks its own wire protocol, so callers use the JS client or implement the protocol.
- vs Vercel AI SDK: different layer, not a rival. The AI SDK is an application toolkit (UI streaming, structured outputs, agent loops) with no per-call cost, no gateway, no process isolation. Use it above mohdel if you like it; mohdel is the inference primitive underneath.
- vs OpenRouter: the self-hosted version of the same idea. With a SaaS router you accept their uptime, their markup, and your prompts transiting their infra. Mohdel goes direct to providers with your keys, and ships an openrouter adapter for when you want both.
- vs raw provider SDKs: no abstraction tax to escape later. Mohdel's envelope is flat and close to the SDKs underneath, and cost/tokens come back normalized so you never parse five different usage shapes.
Documentation
- INTEGRATION.md: JS library guide (factory, client, answer options, tools, streaming, vision, transcription, errors, OTel)
- docs/COOKBOOK.md: copy-paste recipes (summarize a file, stream, swap providers, tools, vision, batch + cost)
- docs/CATALOG.md: curated.json walkthrough with worked examples
- docs/GLOSSARY.md: short definitions for envelope, thin-gate, session, creator vs provider, status, …
- ARCHITECTURE.md: design rationale, three-plane architecture
- PROTOCOL.md: wire format for porting clients/sessions to other languages
- LOGGING.md: log levels, prefixes, pino integration
Quick Start
The three lines at the top of this README are the whole onboarding: install, run mo to pick a provider and paste your API key, then mo ask. Gemini, Groq, and Cerebras all have free tiers — start there if you don't already have a paid key.
Model IDs always use the <provider>/<model> format:
gemini/gemini-3-flash-preview
anthropic/claude-sonnet-4-6
openai/gpt-5.4-mini
groq/llama-4-scout-17b-16e-instruct

What mohdel is not
Scope-capping is deliberate. If you're shopping for any of the following, mohdel is the wrong layer — use it alongside your framework of choice, not instead of it.
- Not an orchestrator. No chains, no agents, no memory, no prompt templates, no retrieval. Wrap mohdel with LangChain, LangGraph, LlamaIndex, Vercel AI SDK, or your own tool loop — mohdel exposes the inference primitive, orchestration stays in your application.
- Not a retry / fallback engine. Errors are classified (retryable, severity, type) so the caller can decide, but mohdel never retries or swaps models silently. Silent model-swapping would conflict with existing multi-model logic upstream; the caller owns the retry budget and fallback choice.
- Not a response cache. The cache: true flag on envelopes is for provider-side prompt caching (Anthropic, OpenAI), not mohdel-level memoization. Caching inference results is orchestration-policy territory and depends on invariants only the caller knows.
- Not a context-window / token manager. No pre-call token count, no projected-cost guard. The caller owns what goes in the prompt and is the source of truth for what counts.
- Not a SaaS proxy. Self-hosted. Your API keys, your infra. No routing through a third party, no vendor lock-in.
See ARCHITECTURE.md §Design principles for the full rationale behind each.
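Because retries are left to the caller, a caller-owned retry loop might look like the sketch below. It assumes the failure reaches the caller as a rejected promise whose value carries the classified fields (retryable, severity, type); whether a given mohdel entry point throws or reports the error another way is not stated here, so treat that as an assumption. withRetry, maxAttempts, and baseDelayMs are hypothetical names.

```javascript
// Hypothetical caller-side retry wrapper. Assumes `attempt` rejects with an
// object that carries the classified fields ({ retryable, severity, type }).
async function withRetry (attempt, { maxAttempts = 3, baseDelayMs = 200 } = {}) {
  let lastError
  for (let i = 1; i <= maxAttempts; i++) {
    try {
      return await attempt(i)
    } catch (err) {
      lastError = err
      // Stop on errors not marked retryable, or when the budget is spent.
      if (!err || err.retryable !== true || i === maxAttempts) break
      // Exponential backoff: baseDelayMs, then 2x, 4x, ...
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** (i - 1)))
    }
  }
  throw lastError
}
```

The retry budget and backoff stay visible in application code, which matches the stated design: nothing is retried unless the caller asks for it. If errors arrive as error events rather than exceptions, the same retryable check applies to the event's error field.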
CLI
# One-shot inference — pipeable
mo ask anthropic/claude-sonnet-4-6 "explain monads"
cat article.txt | mo ask openai/gpt-5.4 "summarize in 3 bullets"
echo "hello" | mo ask gemini/gemini-3-flash-preview --json | jq .cost
# Streaming
mo ask anthropic/claude-sonnet-4-6 --stream "write a haiku about recursion"
# With thinking effort
mo ask anthropic/claude-opus-4-6 --effort high "prove P != NP"
# Speech → text from an audio file
mo transcribe groq/whisper-large-v3-turbo meeting.mp3
mo transcribe mistral/voxtral-mini-transcribe interview.wav --language fr
# Browse the model catalog
mo ls # list all curated models
mo ls --sort price # sorted by input price
mo search sonnet # filter by name/label
mo show anthropic/claude-sonnet-4-6 # model details
mo stats # catalog summary
mo providers # providers with key status & rate limits
# Rank models by benchmarks
mo rank # curated models, balanced weights
mo rank --use-case tool-loop # weighted for tool reliability
mo rank --json # machine-readable
# Manage the catalog
mo curate anthropic # add new models from a provider
mo setup anthropic # configure API key
mo model add fireworks/deepseek-r1 # add a model manually
mo model set <model> <key> <value> # set any field on a model
mo model rm <model> <key> # remove a field
mo check # validate schema + upstream drift
# Rate limits
mo rl show anthropic # provider or model limits
mo rl set anthropic/claude-sonnet-4-6 60 100000
# Benchmark with live inference
mo bench anthropic/claude-sonnet-4-6 # single model
mo bench --tag fast --effort low # suite by tag

All list/show commands support --json [fields]; bare --json lists available fields (like gh).
Library Usage
Two integration paths, same adapters underneath: start with the in-process factory; graduate to the cross-process client when you want gateway-grade isolation.
Factory — in-process (start here)
import mohdel from 'mohdel'
const mo = await mohdel()
const result = await mo.use('anthropic/claude-sonnet-4-6').answer('Hello')
console.log(result.output, result.cost)

No subprocess, no setup beyond your API key. Right for CLI tools (mo ask), scripts, tests, and single-process services, which is most projects.
Client — cross-process (the production gateway)
import { call } from 'mohdel/client'
const envelope = {
callId: 'c-1', authId: 'u-1', auth: { key: process.env.ANTHROPIC_API_SK },
model: 'anthropic/claude-haiku-4-5', prompt: 'Hello'
}
for await (const ev of call(envelope, { socketPath: '/tmp/mohdel-data.sock' })) {
if (ev.type === 'delta') process.stdout.write(ev.delta.delta)
else if (ev.type === 'done') console.log('\n→', ev.result.cost)
}

Same API, but inference runs in a pooled subprocess behind the thin-gate supervisor (Rust): a crashing provider SDK can't take your service down, quota is enforced across processes, and non-JS callers can speak the same wire. Switching from factory to client is a configuration change, not a rewrite. See INTEGRATION.md §Client for setup.
For the full API — initialization, alias resolution, answer options, response shape, tool use, streaming, vision, error handling, OpenTelemetry, sub-path exports — see INTEGRATION.md.
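When streaming is not needed, the event stream can be drained into a single value. The sketch below is one way to do that, assuming the three event shapes listed under Canonical types (delta, done, error). collect and fakeEvents are hypothetical helpers for illustration; fakeEvents stands in for a real call(envelope, { socketPath }) stream.

```javascript
// Drain an async iterable of events into the streamed text plus the final result.
// Event shapes assumed:
//   { type: 'delta', delta: { type, delta } }
//   { type: 'done', result }
//   { type: 'error', error }
async function collect (events) {
  let text = ''
  for await (const ev of events) {
    if (ev.type === 'delta') {
      // Only message deltas are text; function_call deltas are skipped here.
      if (ev.delta.type === 'message') text += ev.delta.delta
    } else if (ev.type === 'done') {
      return { text, result: ev.result }
    } else if (ev.type === 'error') {
      // Surface the typed error as an exception; the caller decides what to do.
      throw Object.assign(new Error(ev.error.message), ev.error)
    }
  }
  throw new Error('stream ended without a done or error event')
}

// Stand-in stream for illustration only.
async function * fakeEvents () {
  yield { type: 'delta', delta: { type: 'message', delta: 'Hel' } }
  yield { type: 'delta', delta: { type: 'message', delta: 'lo' } }
  yield { type: 'done', result: { status: 'completed', output: 'Hello', cost: 0 } }
}
```

In real use the argument would be the iterable returned by call(); the fake generator only makes the sketch runnable without a gateway.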
Observability
Every call emits:
- OpenTelemetry span (mohdel.session.answer) under the caller's traceparent, with GenAI semantic-convention attributes (gen_ai.request.model, gen_ai.system, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) plus mohdel's own (mohdel.status, mohdel.cost, mohdel.thinking_tokens, mohdel.time_to_first_token_ms, mohdel.cooldown on fast-fail).
- Trace-linked logs: every stderr log line carries {traceId, spanId, callId, authId, provider, model}. Dump logs + traces into the same collector (SigNoz, Honeycomb, Jaeger + Loki) and they're correlated for free. No per-call instrumentation code.
- Gate-side OTLP metrics (when running thin-gate): mohdel.sessions.{alive,respawned,spawn_failures}, mohdel.calls{provider,status}, mohdel.call.duration_ms, mohdel.cooldown.rejections, mohdel.quota.rejections, mohdel.policy.errors.
One endpoint for everything: set OTEL_EXPORTER_OTLP_ENDPOINT and spans + metrics flow to it over gRPC. No-op when unset — zero overhead for callers who aren't wired. See INTEGRATION.md §OpenTelemetry and LOGGING.md for details.
The OTel SDK packages (@opentelemetry/sdk-node, @opentelemetry/exporter-trace-otlp-grpc) are optionalDependencies — installed by default, but npm install --omit=optional skips them (along with their gRPC transitive tree). If you do that and later want trace export, install them explicitly:
npm install @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc

@opentelemetry/api stays in dependencies; the no-op tracer needs it regardless of whether export is wired.
Architecture
Mohdel splits into three planes that can be deployed independently:
           ┌──────────┐  unix   ┌─────────────┐  stdin/stdout  ┌──────────┐
           │  client  │ socket  │  thin-gate  │     NDJSON     │ session  │ × N
caller ──► │   (JS)   │ ─HTTP─► │   (Rust)    │ ─────────────► │   (JS)   │
           └──────────┘         └─────────────┘                └──────────┘
                                       │
                                       ▼ admin plane (unix socket, HTTP)
                                GET /v1/health

- mohdel/client (JS): thin stub that callers import. Opens a unix socket to thin-gate, sends a CallEnvelope, receives an async-iterable of Events. Zero transitive provider-SDK imports, so caller-side code stays light.
- mohdel-thin-gate (Rust binary, prebuilt and shipped via the mohdel-thin-gate-<platform> npm sub-packages): scheduler / state owner / supervisor. Binds the data-plane socket, validates the envelope, dispatches to a pooled session subprocess, relays events back, handles graceful cancellation on client disconnect. Binds the admin plane for GET /v1/health. Pushes OTLP metrics (sessions alive/respawned, calls by provider/status, call-duration histogram, cooldown / quota / policy rejections) when OTEL_EXPORTER_OTLP_ENDPOINT is set. Internal trait hooks (RoutePolicy, QuotaPolicy, ConfigSource, CachePolicy) make the crate testable and fork-friendly for deployments that need bespoke policy; they are not a published-library surface.
- mohdel/session (JS subprocess): provider executor. Spawned by thin-gate, reads envelopes from stdin, dispatches to the matching adapter, writes events to stdout. A napi-rs addon was scoped for hot-loop optimization but current benchmarks show per-call JS CPU is not the bottleneck; the stub stays under rust/napi-addon/ for future reactivation.
Running thin-gate
cargo run --bin mohdel-thin-gate /tmp/mohdel-data.sock /tmp/mohdel-admin.sock /path/to/js/session/bin.js
# or with a pre-built release binary:
./target/release/mohdel-thin-gate /tmp/mohdel-data.sock /tmp/mohdel-admin.sock ./js/session/bin.js

Positional args are optional (data socket, admin socket, session bin). Env overrides:
- MOHDEL_SESSION_BIN: path to session entrypoint (defaults to none; if unset, data plane returns synthetic events)
- MOHDEL_SESSION_POOL_SIZE: pre-warmed sessions (default 2)
With no session-bin configured, thin-gate runs in demo mode: POST /v1/call returns a synthetic echo event sequence. Useful for health-checking the HTTP layer without a runtime dependency on Node.
Calling from JS
The client snippet under Library Usage above is the full surface: call(envelope, { socketPath, signal? }) returns an async iterable of events. Pass an AbortSignal to cancel in flight; thin-gate forwards a cancel control message to the session, which then goes back into the pool for reuse. The envelope is the flat answer(prompt, options) surface plus transport metadata (callId, authId, auth.key, optional traceparent); see js/core/envelope.js for the full field list.
Canonical types (frozen wire contract)
Wire format is JSON over NDJSON frames, camelCase. Types are defined in js/core/ (JSDoc) and mirrored in rust/thin-gate/src/protocol.rs (serde). Cross-language conformance tests enforce round-trip fidelity. The session-side protocol (envelopes in, events out, cancel control messages) is specified in PROTOCOL.md — read that to implement a session in another language.
- CallEnvelope: flat answer() options plus transport metadata: callId, authId, auth.key, traceparent?, baggage?, provider, model, prompt, outputBudget?, outputType?, outputStyle?, outputEffort?, images?, videos?, cache?, tools?, toolChoice?, parallelToolCalls?, identifier?.
- Event: three-variant union discriminated on type:
  - { type: 'delta', delta: { type: 'message' | 'function_call', delta: string } }
  - { type: 'done', result: AnswerResult }
  - { type: 'error', error: TypedError }
- AnswerResult: status, output, inputTokens, outputTokens, thinkingTokens, cost (single number), timestamps, warning?, toolCalls?.
- Status: 'completed' | 'tool_use' | 'incomplete'.
- Warning: additive string union: 'insufficientOutputBudget', 'cancelled', ...
- TypedError: { message, detail?, severity, retryable, type }. message is a stable machine key; detail is user-facing context; severity is 'trace' | 'debug' | 'info' | 'warn' | 'error' | 'fatal'; type is an optional canonical tag (e.g. 'AUTH_INVALID', 'PROVIDER_COOLDOWN').
A cancel control message { op: "cancel", callId } on session stdin aborts the matching in-flight call.
Extending the frozen wire types is breaking — additive changes only on trait method sets and non-frozen internals. See ARCHITECTURE.md §What isn't frozen for the refinable-vs-frozen split.
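For readers porting a client or session, the NDJSON framing itself is small: one JSON document per line. The sketch below shows a generic encoder and an incremental decoder that tolerates frames split across chunks. It is not taken from mohdel's source; encodeFrame and createFrameDecoder are hypothetical names, and PROTOCOL.md remains the reference for the actual message set.

```javascript
// Encode one message as an NDJSON frame: a single JSON line ending in '\n'.
function encodeFrame (message) {
  return JSON.stringify(message) + '\n'
}

// Incremental decoder: feed it string chunks of any size, get back the frames
// completed so far. A partial trailing line is buffered until its newline arrives.
function createFrameDecoder () {
  let buffer = ''
  return function push (chunk) {
    buffer += chunk
    const lines = buffer.split('\n')
    buffer = lines.pop() // last piece is an incomplete line (or '')
    return lines
      .filter(line => line.trim() !== '')
      .map(line => JSON.parse(line))
  }
}

// The cancel control message from this section, framed for a session's stdin.
const cancel = encodeFrame({ op: 'cancel', callId: 'c-1' })
```

When reading from a Node stream, calling setEncoding('utf8') first keeps multi-byte characters from being split across chunks before they reach the decoder.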
Adding a new provider adapter
See CONTRIBUTING.md. Short version:
- Create js/session/adapters/<provider>.js exporting async function* <provider>(envelope, { client?, signal? }).
- Map provider-native events to the canonical Event union.
- Pass { signal } to the SDK's streaming method so cancellation aborts in-flight HTTP.
- On SDK throw: if signal?.aborted, return silently (run() emits call.cancelled); else yield call.error via classifyProviderError(e) from ./_errors.js.
- Register in js/session/adapters/index.js.
- Write unit tests with a dependency-injected mock client.
- Optionally add a gated live test in test/live/<provider>.live.test.js.
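The steps above can be sketched as a skeleton adapter. This is an illustration under several assumptions, not mohdel's actual adapter code: the provider is made up, client.streamText is an invented method standing in for a real SDK's streaming call, the yielded objects use the public Event shapes (the internal event names such as call.error may differ), and classifyProviderError is replaced by a local stub because the real helper's output is not shown in this README.

```javascript
// Stub standing in for classifyProviderError from ./_errors.js (assumption).
function classifyProviderError (e) {
  return {
    message: 'PROVIDER_ERROR',
    detail: String(e?.message ?? e),
    severity: 'error',
    retryable: false
  }
}

// Hypothetical adapter for a made-up provider.
async function * example (envelope, { client, signal } = {}) {
  let output = ''
  try {
    // Pass { signal } so cancellation aborts the in-flight request.
    const chunks = client.streamText({ model: envelope.model, prompt: envelope.prompt }, { signal })
    for await (const chunk of chunks) {
      output += chunk.text
      // Map the provider-native chunk to a canonical delta event.
      yield { type: 'delta', delta: { type: 'message', delta: chunk.text } }
    }
  } catch (e) {
    if (signal?.aborted) return // cancelled: stay silent, the runner reports it
    yield { type: 'error', error: classifyProviderError(e) }
    return
  }
  yield { type: 'done', result: { status: 'completed', output } }
}
```

Taking the client as a parameter is what makes the unit-test step practical: a test can pass an object with a fake streamText generator and assert on the events that come out.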
Configuration
API keys live in ~/.config/mohdel/environment (one KEY=value per line, loaded automatically):
ANTHROPIC_API_SK=sk-ant-...
OPENAI_API_SK=sk-...
GEMINI_API_SK=AI...
GROQ_API_SK=gsk_...
XAI_API_SK=xai-...
CEREBRAS_API_SK=csk-...
MISTRAL_API_SK=...
FIREWORKS_API_SK=fw_...
DEEPSEEK_API_SK=sk-...
OPENROUTER_API_SK=sk-or-...
NOVITA_API_SK=...

Only set keys for providers you use. Run mo with no arguments for interactive setup.
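For tooling that needs to read the same file, the KEY=value format is simple to parse. The sketch below is a generic parser, not mohdel's loader: parseEnvironment is a hypothetical name, and skipping lines that start with # is an assumption, since comment handling is not described here.

```javascript
// Parse KEY=value lines into a plain object. Blank lines are skipped;
// treating '#' lines as comments is an assumption, not documented behavior.
function parseEnvironment (text) {
  const env = {}
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim()
    if (line === '' || line.startsWith('#')) continue
    const eq = line.indexOf('=')
    if (eq <= 0) continue // no '=' at all, or an empty key
    // Split on the first '=' only, so values may themselves contain '='.
    env[line.slice(0, eq).trim()] = line.slice(eq + 1).trim()
  }
  return env
}
```

Splitting on the first = only matters for keys or tokens that contain = padding.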
File locations
| Path | Purpose |
|------|---------|
| ~/.config/mohdel/environment | API keys |
| ~/.config/mohdel/default.json | Default model selection |
| ~/.config/mohdel/curated.json | Model catalog with metadata, tags, pricing |
| ~/.config/mohdel/providers.json | Provider-level rate limits |
| ~/.config/mohdel/excluded.json | Excluded models |
| ~/.cache/mohdel/uploaded-files.json | Gemini file upload cache |
Paths follow the XDG convention via env-paths.
Provider Matrix
What each provider supports through mohdel's unified interface:
| Provider | Streaming | Tools | Vision | Video | Thinking | Notes |
|----------|-----------|-------|--------|-------|----------|-------|
| Anthropic | Yes | Yes | Yes | No | Yes (adaptive / budget) | identifier → metadata.user_id |
| OpenAI | Yes | Yes | Yes | No | Yes (o-series) | GPT-5 verbosity via outputStyle |
| Gemini | Yes | Yes | Yes | Yes | Yes (thinkingLevel / thinkingBudget) | Auto-uploads large videos; content-hashed cache |
| Cerebras | No | Yes | Yes | No | Yes (reasoning_effort or zai disable_reasoning) | Non-streaming chat completions |
| Groq | No | Yes | Yes | No | No | Non-streaming; shared chat-completions path |
| xAI | Yes | Yes | Yes | No | Auto | OpenAI Responses API over api.x.ai/v1 |
| DeepSeek | No | Yes | Yes | No | No | DSML tool-call fallback when model emits tags in content |
| Fireworks | Yes | Yes | Yes | No | Yes (reasoning_effort) | OpenAI SDK + baseURL; model id auto-prefixed |
| Mistral | No | Yes | Yes | No | No | tool_choice: "any" = required |
| Qwen Cloud | No | Yes | No | No | Yes (enable_thinking + thinking_budget) | Alibaba DashScope intl; hybrid models think by default — effort none sends explicit off |
| Xiaomi | No | Yes | Yes | No | Auto | MiMo; shared chat-completions path, reasoning_content captured |
| OpenRouter | Yes | Yes | Yes | No | Varies | Meta-provider; providerOptions.openrouter for routing prefs |
| Novita | No | No | No | No | No | Image generation only |
Adapter capability ≠ model capability — whether a given model accepts images, tools, or thinking effort depends on the model spec in curated.json. The adapter passes through what the envelope supplies; the provider rejects unsupported combos.
Local Development
git clone <repo> && cd mohdel
npm install
npm test # unit tests, no API keys

Rust tests
cargo test --workspace # thin-gate + napi-addon
cargo build --release --bin mohdel-thin-gate

Test files under rust/thin-gate/tests/:
| File | Coverage |
|------|----------|
| conformance.rs | JS↔Rust protocol round-trip |
| protocol.rs | serde (de)serialization of envelope/events/results |
| server.rs | HTTP layer, synthetic dispatch, 404/400 paths |
| session_dispatch.rs | real node js/session/bin.js spawn + dispatch + graceful cancel |
| policy.rs | RoutePolicy + QuotaPolicy + Enforcer end-to-end |
| config.rs | TOML ConfigSource parsing, defaults, malformed, env override |
| supervision.rs | readiness ping/pong + readiness timeout + garbage-response handling |
| stress.rs | 100 concurrent calls, cancel storm, session-death-under-load |
Spawning tests require node in PATH.
Provider integration tests
These hit real provider APIs. Models are drawn from your local curated.json — one per provider. Each provider block is skipped automatically when its API key is missing.
npm run test:provider # all providers via the factory path
TAG=fast npm run test:provider # filter by model tag
npm run test:multiturn # multi-turn conversation tests (incl. tool round-trip)
npm run test:vision # image input tests

Live adapter tests
Exercise the session adapters directly against real provider APIs. Gated on env keys; skipped cleanly when keys are absent. See test/live/README.md for details.
ANTHROPIC_API_SK=sk-ant-... npm run test:live
OPENAI_API_SK=sk-... npm run test:live

Scenario-driven testing (the fake provider)
For deterministic stress, benchmark, and bug-repro work, set provider: "fake" in the envelope and pass a JSON prompt that drives the scenario:
{ mode: 'volume', tokens: 1000 } // throughput stress
{ mode: 'slow', tokens: 50, delayMs: 100 } // streaming cadence
{ mode: 'error', type: 'AUTH_INVALID' } // error classification
{ mode: 'hang' } // cancel / timeout plumbing
{ mode: 'tool', name: 'f', args: { x: 1 } } // tool round-trip
{ mode: 'incomplete' } // status contract
{ mode: 'crash' } // process isolation (exits the adapter process)
{ mode: 'cancel_after', tokens: 5 } // cancel mid-stream

All modes honor AbortSignal. The benchmarks in bench/ use this to pin adapter work to a fixed shape and isolate what's being measured; see bench/bench.js (throughput) and bench/isolation.js (crash containment).
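A scenario object has to be serialized before it goes into the prompt field. The sketch below shows one way to build such an envelope, assuming the envelope fields listed under Canonical types. fakeEnvelope is a hypothetical helper, and the callId, authId, auth.key, and model values are placeholders; in particular the model id fake/scenario is invented for this example.

```javascript
// Build an envelope for the fake provider. The scenario object is serialized
// to JSON and carried in the prompt; the other values are placeholders.
function fakeEnvelope (scenario, callId = 'bench-1') {
  return {
    callId,
    authId: 'test',
    auth: { key: 'unused' },
    provider: 'fake',
    model: 'fake/scenario', // hypothetical model id
    prompt: JSON.stringify(scenario)
  }
}

// Streaming-cadence scenario from the list above.
const slow = fakeEnvelope({ mode: 'slow', tokens: 50, delayMs: 100 })
```

The resulting object can then be passed to call() in place of a real provider envelope.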
npm scripts
| Command | Description |
|---------|-------------|
| npm test | Unit tests (vitest) |
| npm run test:provider | Provider integration via the factory — real API calls |
| npm run test:live | Live session-adapter tests (env-key gated) |
| npm run lint | StandardJS lint |
| npm run cli | Interactive model picker |
| cargo test --workspace | Rust tests (thin-gate + protocol + policy + stress + ...) |
| node bench/bench.js | In-process vs via-gate throughput benchmark |
| node bench/isolation.js | Crash-isolation demo (in-process dies, via-gate contains) |
Contributing
Fork the repository and submit a pull request. Code style: Node 22+, ES modules, no semicolons, 2-space indent, single quotes (StandardJS). See CONTRIBUTING.md for details.
Mohdel's wire is language-agnostic. The JS client is the first implementation, not the only one — a Python / Go / Ruby / Swift / Elixir / ... client is a great starter contribution. See CONTRIBUTING.md §Porting a client to another language and PROTOCOL.md.
License
MIT. See LICENSE.
