wandler

v2.6.4

Published

2 months ago

transformers.js inference server — OpenAI-compatible API for ONNX models

timpietrusky

inference transformers.js onnx openai llm webgpu server

OpenAI-compatible inference server powered by transformers.js. Run ONNX models locally with WebGPU acceleration or CPU — no Python, no CUDA required.

Think vLLM or llama.cpp, but for the TypeScript crowd.

Quickstart

npx wandler --llm onnx-community/gemma-4-E4B-it-ONNX:q4

# custom model, precision, device, port
npx wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX:fp16 --device cpu --port 3000

# with embeddings and STT
npx wandler --llm onnx-community/gemma-4-E4B-it-ONNX:q4 \
  --embedding Xenova/all-MiniLM-L6-v2:q8 \
  --stt onnx-community/whisper-tiny:q4

Use it with the OpenAI SDK:

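import OpenAI from "openai";

// Point the SDK at the local wandler server (default port 8000).
// The apiKey here is a placeholder; use your real key if the server
// was started with --api-key or WANDLER_API_KEY.
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "-" });

const response = await client.chat.completions.create({
  model: "onnx-community/gemma-4-E4B-it-ONNX",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

// Print each streamed token as it arrives.
for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}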

CLI

wandler — transformers.js inference server

Usage:
  wandler --llm org/repo[:precision] [options]
  wandler model ls [--type <type>]

Commands:
  model ls                    List available models from the catalog

Model:
  -l, --llm <id>              LLM model
      --backend <name>        LLM backend: wandler, transformersjs (default: wandler)
  -e, --embedding <id>        Embedding model
  -s, --stt <id>              STT model
  -d, --device <type>         Device: auto, cpu, cuda, coreml, dml, webgpu, wasm (default: auto)
      --hf-token <token>      HuggingFace token for gated models
      --cache-dir <path>      Model cache directory

Server:
  -p, --port <number>         Port (default: 8000)
      --host <addr>           Bind address (default: 127.0.0.1)
  -k, --api-key <key>         API key for auth (or WANDLER_API_KEY)
      --cors-origin <origin>  Allowed CORS origin (default: *)
      --max-tokens <n>        Max tokens per request (default: 2048)
      --max-concurrent <n>    Max concurrent requests (default: 1)
      --timeout <ms>          Request timeout in ms (default: 120000)
      --log-level <level>     debug, info, warn, error (default: info)
      --quiet                 Suppress non-error startup/profile logs
      --prefill-chunk-size <n>
                              Chunk size for long-prompt prefill; auto uses a 640MB GPU attention budget; auto:<mb> customizes it; 0/off disables it
      --decode-loop <mode>    Wandler decode loop: auto/on/off (default: auto; on is experimental)
      --prefix-cache <mode>   Enable prefix KV cache: true/false (default: true)
      --prefix-cache-entries <n>
                              Prefix KV cache entries (default: 2)
      --prefix-cache-min-tokens <n>
                              Minimum prefix tokens to cache (default: 512)
      --warmup-tokens <n>     Approximate prompt tokens to run once before serving
      --warmup-max-new-tokens <n>
                              Max new tokens for startup warmup

Info:
  -v, --version               Show version
  -h, --help                  Show this help

Precision suffixes: q4, q8, fp16, fp32 (default: q4)
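
The `org/repo[:precision]` form accepted by `--llm`, `--embedding`, and `--stt` can be split on the last colon. The following is a minimal sketch of that idea, assuming the `q4` default stated in the help text. `parseModelId`, `ModelRef`, and `PRECISIONS` are hypothetical names for illustration and are not part of wandler.

```typescript
// Hypothetical helper for illustration; not wandler's actual parser.
const PRECISIONS = ["q4", "q8", "fp16", "fp32"] as const;
type Precision = (typeof PRECISIONS)[number];

interface ModelRef {
  repo: string; // e.g. "onnx-community/whisper-tiny"
  precision: Precision;
}

function parseModelId(id: string): ModelRef {
  const colon = id.lastIndexOf(":");
  // No suffix: fall back to the documented default precision.
  if (colon === -1) return { repo: id, precision: "q4" };

  const repo = id.slice(0, colon);
  const suffix = id.slice(colon + 1);
  if ((PRECISIONS as readonly string[]).indexOf(suffix) === -1) {
    throw new Error(`Unknown precision suffix: ${suffix}`);
  }
  return { repo, precision: suffix as Precision };
}

console.log(parseModelId("LiquidAI/LFM2.5-1.2B-Instruct-ONNX:fp16"));
console.log(parseModelId("onnx-community/whisper-tiny"));
```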

Environment Variables

Every CLI flag has a corresponding environment variable:

| Variable | Default | Description |
|----------|---------|-------------|
| `WANDLER_LLM` | onnx-community/gemma-4-E4B-it-ONNX:q4 | LLM model with precision |
| `WANDLER_BACKEND` | wandler | LLM backend: wandler for Wandler's serving layer, transformersjs for the direct baseline |
| `WANDLER_STT` | onnx-community/whisper-tiny:q4 | Speech-to-text model |
| `WANDLER_EMBEDDING` | — | Embedding model (disabled by default) |
| `WANDLER_DEVICE` | webgpu | Device: webgpu, cpu, wasm |
| `WANDLER_PORT` | 8000 | Server port |
| `WANDLER_HOST` | 127.0.0.1 | Bind address |
| `WANDLER_API_KEY` | — | API key for auth |
| `WANDLER_CORS_ORIGIN` | `*` | Allowed CORS origin |
| `WANDLER_MAX_TOKENS` | 2048 | Max tokens per request |
| `WANDLER_MAX_CONCURRENT` | 1 | Max concurrent requests |
| `WANDLER_TIMEOUT` | 120000 | Request timeout (ms) |
| `WANDLER_LOG_LEVEL` | info | Log level |
| `WANDLER_QUIET` | false | Suppress non-error startup/profile logs |
| `WANDLER_CACHE_DIR` | ~/.cache/huggingface | Model cache directory (also respects `HF_HOME`) |
| `WANDLER_PREFILL_CHUNK_SIZE` | auto | Chunk size for long-prompt prefill; auto uses the fastest GPU path that fits a 640MB attention budget, `auto:<mb>` customizes it; set 0/off to disable |
| `WANDLER_DECODE_LOOP` | auto | Wandler-owned decode loop for supported text generation; auto uses the safe transformers.js `generate()` path, on opts into the experimental Wandler loop, off disables it |
| `WANDLER_PREFIX_CACHE` | true | Enable in-memory prefix KV caching for repeated system/tool prefixes |
| `WANDLER_PREFIX_CACHE_ENTRIES` | 2 | Prefix KV cache entry count |
| `WANDLER_PREFIX_CACHE_MIN_TOKENS` | 512 | Minimum prefix size before caching |
| `WANDLER_WARMUP_TOKENS` | 0 | Approximate prompt tokens to run once before serving |
| `WANDLER_WARMUP_MAX_NEW_TOKENS` | 8 | Max new tokens for startup warmup |
| `HF_TOKEN` | — | HuggingFace token for gated models |
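
Since the settings are available as `WANDLER_*` variables, a launcher script could build the server environment from a typed config object. This is only a sketch: `WandlerConfig`, `ENV_NAMES`, and `toEnv` are hypothetical names, and only the variable names themselves are taken from the table above.

```typescript
// Hypothetical config shape; only the variable names come from the table above.
interface WandlerConfig {
  llm?: string;
  device?: string;
  port?: number;
  maxConcurrent?: number;
  prefixCache?: boolean;
}

const ENV_NAMES: Record<keyof WandlerConfig, string> = {
  llm: "WANDLER_LLM",
  device: "WANDLER_DEVICE",
  port: "WANDLER_PORT",
  maxConcurrent: "WANDLER_MAX_CONCURRENT",
  prefixCache: "WANDLER_PREFIX_CACHE",
};

function toEnv(config: WandlerConfig): Record<string, string> {
  const env: Record<string, string> = {};
  for (const key of Object.keys(config) as (keyof WandlerConfig)[]) {
    const value = config[key];
    // Skip unset options so the server falls back to its own defaults.
    if (value !== undefined) env[ENV_NAMES[key]] = String(value);
  }
  return env;
}

console.log(toEnv({ llm: "LiquidAI/LFM2.5-350M-ONNX:q4", port: 3000, prefixCache: false }));
```

The resulting object can be merged into the `env` option of Node's `child_process.spawn` when starting `npx wandler` from a script.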

Endpoints

LLM

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Chat completion (streaming + non-streaming) |
| `/v1/completions` | POST | Text completion (legacy) |
| `/v1/models` | GET | List loaded models |
| `/v1/models/{id}` | GET | Get model details |
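
Clients that do not use the OpenAI SDK can read the streaming response of `/v1/chat/completions` directly. The sketch below assumes wandler emits the same server-sent-event framing as the OpenAI API (`data: {json}` lines, terminated by `data: [DONE]`); this README only states that the endpoint is OpenAI-compatible, so treat the framing as an assumption. `extractDeltas` and `ChatChunk` are hypothetical names.

```typescript
// Assumes OpenAI-style SSE framing; wandler's exact output is not shown in this README.
interface ChatChunk {
  choices: { delta?: { content?: string } }[];
}

// Extracts the text deltas from one block of complete SSE lines.
// Returns the concatenated content and whether the [DONE] sentinel was seen.
function extractDeltas(sseText: string): { text: string; done: boolean } {
  let text = "";
  let done = false;
  for (const line of sseText.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.indexOf("data:") !== 0) continue; // skip blank lines and comments
    const payload = trimmed.slice("data:".length).trim();
    if (payload === "[DONE]") {
      done = true;
      break;
    }
    const chunk = JSON.parse(payload) as ChatChunk;
    // A chunk may carry no choices (e.g. a usage-only chunk), hence the optional chaining.
    text += chunk.choices[0]?.delta?.content ?? "";
  }
  return { text, done };
}

const sample =
  'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n' +
  'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n' +
  "data: [DONE]\n\n";
const result = extractDeltas(sample);
console.log(result.text, result.done); // text is "Hello", done is true
```

A network read can end in the middle of a line, so a real client would buffer incomplete lines and only pass complete ones to a function like this.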

Embeddings

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/embeddings` | POST | Text embeddings |

Audio

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/audio/transcriptions` | POST | Speech-to-text (Whisper) |

Utilities

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/tokenize` | POST | Text to token IDs |
| `/detokenize` | POST | Token IDs to text |
| `/health` | GET | Server status |
| `/admin/metrics` | GET | Request metrics |

Parameters

Chat & Text Completions

Standard OpenAI parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `messages` / `prompt` | array / string | required | Input |
| `temperature` | float | 0.7 | Sampling temperature (0 = greedy) |
| `top_p` | float | 0.95 | Nucleus sampling |
| `max_tokens` | int | 2048 | Max tokens to generate |
| `stream` | bool | false | Enable SSE streaming |
| `stop` | `string \| string[]` | — | Stop sequences |
| `presence_penalty` | float | 0 | Penalize token presence |
| `frequency_penalty` | float | 0 | Penalize token frequency |
| `response_format` | object | — | `{"type": "json_object"}` for JSON mode |
| `tools` | array | — | Function calling definitions |
| `stream_options` | object | — | `{"include_usage": true}` |

Extended parameters (vLLM/llama.cpp compatible):

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `top_k` | int | — | Top-k sampling |
| `min_p` | float | — | Minimum probability threshold |
| `typical_p` | float | — | Locally typical sampling |
| `repetition_penalty` | float | — | Direct repetition penalty (> 1.0) |
| `no_repeat_ngram_size` | int | — | Prevent N-gram repetition |
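
The extended parameters sit alongside the standard ones in the same request body. One way to keep them type-checked is to describe the body yourself, as in this sketch. `WandlerChatRequest` and `ChatMessage` are hypothetical type names; the field names come from the tables above, and the sampling values are arbitrary illustrations rather than recommended settings.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical request type combining standard and extended parameters.
interface WandlerChatRequest {
  model: string;
  messages: ChatMessage[];
  temperature?: number;
  top_p?: number;
  max_tokens?: number;
  stream?: boolean;
  // Extended parameters
  top_k?: number;
  min_p?: number;
  typical_p?: number;
  repetition_penalty?: number;
  no_repeat_ngram_size?: number;
}

const body: WandlerChatRequest = {
  model: "onnx-community/gemma-4-E4B-it-ONNX",
  messages: [{ role: "user", content: "Summarize what ONNX is in one sentence." }],
  temperature: 0.7,
  top_k: 40, // illustrative value
  min_p: 0.05, // illustrative value
  repetition_penalty: 1.1, // values above 1.0 penalize repetition
};

// Send with any HTTP client, for example:
// await fetch("http://localhost:8000/v1/chat/completions", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(body),
// });
console.log(JSON.stringify(body, null, 2));
```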

Embeddings

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input` | `string \| string[]` | required | Text to embed |
| `encoding_format` | string | "float" | "float" or "base64" |
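
Embedding vectors are commonly compared with cosine similarity. The sketch below assumes the response follows the OpenAI shape, with vectors at `data[i].embedding` when `encoding_format` is `"float"`; that shape is an assumption, as this README does not show a response. `cosineSimilarity`, `rankBySimilarity`, and `EmbeddingResponse` are hypothetical names.

```typescript
// Assumed OpenAI-style response shape; not confirmed by this README.
interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Cosine similarity is undefined for zero vectors; report 0 in that case.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Returns the input indices ordered from most to least similar to the query.
function rankBySimilarity(query: number[], res: EmbeddingResponse): number[] {
  return res.data
    .map((d) => ({ index: d.index, score: cosineSimilarity(query, d.embedding) }))
    .sort((x, y) => y.score - x.score)
    .map((d) => d.index);
}
```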

Compatible Models

List all verified models with their capabilities:

wandler model ls

type      | size  | prec | capabilities             | repo:precision                                   | name
------------------------------------------------------------------------------------------------------------------------
llm       | 2B    | q4   | chat, tool-calling       | onnx-community/gemma-4-E4B-it-ONNX:q4            | Gemma 4 E4B
llm       | 1.2B  | q4   | chat, tool-calling       | LiquidAI/LFM2.5-1.2B-Instruct-ONNX:q4            | LFM 2.5 1.2B
llm       | 350M  | q4   | chat, tool-calling       | LiquidAI/LFM2.5-350M-ONNX:q4                     | LFM 2.5 350M
llm       | 0.8B  | q4   | chat, tool-calling       | onnx-community/Qwen3.5-0.8B-Text-ONNX:q4         | Qwen 3.5 0.8B
llm       | 1.7B  | q4   | chat                     | HuggingFaceTB/SmolLM2-1.7B-Instruct:q4           | SmolLM2 1.7B
embedding | 22M   | q8   | embedding                | Xenova/all-MiniLM-L6-v2:q8                       | all-MiniLM-L6-v2
embedding | 33M   | q8   | embedding                | Xenova/bge-small-en-v1.5:q8                      | BGE Small EN v1.5
embedding | 137M  | q8   | embedding                | nomic-ai/nomic-embed-text-v1.5:q8                | Nomic Embed Text v1.5
stt       | 39M   | q4   | transcription            | onnx-community/whisper-tiny:q4                   | Whisper Tiny
stt       | 74M   | q4   | transcription            | onnx-community/whisper-base:q4                   | Whisper Base
stt       | 244M  | q4   | transcription            | onnx-community/whisper-small:q4                  | Whisper Small

Filter by type:

wandler model ls --type llm
wandler model ls --type embedding
wandler model ls --type stt

Use the repo:precision value directly with --llm, --embedding, or --stt.

Beyond the verified catalog, any ONNX model from onnx-community, or any other transformers.js-compatible model, should work.

Tool Calling

wandler parses tool calls from multiple model output formats:

LFM: [func_name(arg="val")] and [tool_calls [{...}]]
Qwen: <tool_call>{"name": "...", "arguments": {...}}</tool_call>
OpenAI JSON: {"tool_calls": [...]}

Thinking blocks (<think>...</think>) are automatically stripped before parsing.
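
To illustrate the kind of parsing involved, here is a sketch that handles only the Qwen-style `<tool_call>` format and strips thinking blocks first. It is not wandler's implementation; `stripThinking`, `parseQwenToolCalls`, and `ToolCall` are hypothetical names, and the `get_weather` call is made-up sample data.

```typescript
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Removes <think>...</think> blocks, including multi-line ones.
function stripThinking(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>/g, "");
}

// Illustrative parser for the Qwen-style format only.
function parseQwenToolCalls(output: string): ToolCall[] {
  const calls: ToolCall[] = [];
  const cleaned = stripThinking(output);
  const pattern = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(cleaned)) !== null) {
    try {
      const parsed = JSON.parse(match[1]);
      if (typeof parsed.name === "string") {
        calls.push({ name: parsed.name, arguments: parsed.arguments ?? {} });
      }
    } catch (e) {
      // Malformed JSON inside the tag: skip this call rather than fail the whole response.
    }
  }
  return calls;
}

const modelOutput =
  "<think>The user wants the weather.</think>" +
  '<tool_call>{"name": "get_weather", "arguments": {"city": "Berlin"}}</tool_call>';
console.log(parseQwenToolCalls(modelOutput));
```

Stripping the thinking block before matching means a tool call the model only drafted while reasoning is not executed.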

License

MIT
