@fugood/buttress-server
v2.24.9
Buttress Server
A high-performance RPC server for managing GGML LLM generators with configurable defaults and runtime management.
Installation
npm install -g @fugood/buttress-server
Quick Start
Using CLI
# Start with config file
npx bricks-buttress --config ./config.toml
# Start without config (uses env vars and defaults)
npx bricks-buttress
Workspace Binding (bricks buttress)
By default, a buttress-server runs in public mode: any client on the LAN can connect, no auth required. To restrict access to a single BRICKS workspace and enable workspace-scoped JWT auth, bind the server with the bricks buttress CLI commands. Once bound, the server only accepts WebSocket / file-transfer requests carrying a valid access token signed by that workspace's issuer.
The bricks CLI is the tool that performs the binding and writes the local state file. Install it first (see the bricks-cli docs), then run bricks auth login with the workspace owner's account before running the commands below.
Bind a server to a workspace
# Pair the local machine's buttress-server with the workspace of the current bricks-cli profile
bricks buttress bind
# Override the auto-detected server id, give it a friendly name, or write to a custom state dir
bricks buttress bind --server-id buttress-mac-studio --name "Studio LLM" --state-dir /etc/buttress
# For headless/remote setups: emit state.json to stdout instead of writing to disk
bricks buttress bind --print > /etc/buttress/state.json
The state file (~/.bricks-cli/buttress/state.json by default, or $BRICKS_BUTTRESS_STATE_DIR) stores:
- workspace.id / workspace.name: which workspace this server belongs to
- workspace.serverId: the server's stable id (defaults to buttress-<machineId>)
- workspace.issuerPublicKey + workspace.kid: Ed25519 SPKI used to verify access tokens
Restart bricks-buttress after binding for the change to take effect — the state file is read once at startup.
Inspect bindings
# Show local state.json + the workspace-side bound list
bricks buttress status
# Same, JSON-formatted
bricks buttress status --json
Discover servers on the LAN
# UDP scan + HTTP /buttress/info verification (3s timeout by default)
bricks buttress scan
# UDP only (skip the /buttress/info round-trip)
bricks buttress scan --udp-only
# Machine-readable
bricks buttress scan --json
scan lists every buttress-server visible on the LAN, including unbound (public) ones, with their version, auth state (open vs JWT required + kid), bound workspace, and per-generator hardware caps (score, GPU, usable memory). Servers whose workspace matches your current bricks-cli profile are highlighted; this is purely a discovery command and does not mint any tokens.
Unbind
# Remove the binding from the workspace and delete the local state.json
bricks buttress unbind
# Keep the local state file (useful if you only want to revoke server-side)
bricks buttress unbind --keep-local
After unbinding, restart the server to return it to public mode.
Issue a long-lived access token
For headless callers (CI, ctor agents) that already hold a workspace token, mint a long-lived buttress access token instead of relying on a per-launcher session token:
# Default 30-day TTL
bricks buttress issue-token
# Custom TTL (seconds), JSON output for scripting
bricks buttress issue-token --ttl 3600 --json
The token carries the claims { k: 'ba', w_id, st: 'ws', sid, jti, exp }, and any buttress-server bound to the same workspace will accept it.
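When scripting around issued tokens, it can help to inspect the claims before using one. The sketch below assumes the token is a standard three-part JWT and that exp is expressed in seconds since the epoch; decodeClaims and looksLikeLiveButtressToken are hypothetical helper names, not part of the bricks CLI. It does not verify the signature, so it is only suitable for inspection.

```javascript
// Decode the payload segment of a JWT WITHOUT verifying its signature.
// Inspection only: a real verifier must check the Ed25519 signature
// against the workspace issuer key before trusting any claim.
function decodeClaims(token) {
  const parts = token.split('.')
  if (parts.length !== 3) throw new Error('expected header.payload.signature')
  return JSON.parse(Buffer.from(parts[1], 'base64url').toString('utf8'))
}

// True when the token kind is 'ba' and the token has not expired.
// ASSUMPTION: exp is in seconds, as in standard JWTs.
// nowSeconds is injectable so the check is easy to test.
function looksLikeLiveButtressToken(claims, nowSeconds = Math.floor(Date.now() / 1000)) {
  return claims.k === 'ba' && typeof claims.exp === 'number' && claims.exp > nowSeconds
}
```

A script could call decodeClaims on the issued token and refuse to start a long job when looksLikeLiveButtressToken returns false.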
Configuration
Configuration is loaded from a TOML file passed via --config / -c. Every top-level table is optional — missing sections fall back to defaults. See config/sample.toml for an end-to-end example.
Top-level sections
| Section | Purpose |
| ------------------------- | -------------------------------------------------------------------------------------------------- |
| [env] | Environment variables exported into the process only if not already set |
| [server] | HTTP/RPC listener (port, log level, body limits) |
| [runtime] | Global defaults shared by every generator (most [generators.model] keys may live here too) |
| [runtime.session_cache] | KV-cache reuse store — see Session State Cache |
| [autodiscover] | LAN UDP / HTTP / mDNS discovery toggles |
| [openai_compat] | Enable /oai-compat/v1/* — see Compatibility Endpoints |
| [anthropic_messages] | Enable /anthropic-messages — see Compatibility Endpoints|
| [[generators]] | Array of generator instances — one entry per loaded model |
[env]
[env]
HUGGINGFACE_TOKEN = "hf_xxx" # ggml backends read this; HF_TOKEN is not picked up automatically
CUDA_VISIBLE_DEVICES = "0"
Values here are exported only when the variable isn't already set in the process — see Environment Variable Priority. For HuggingFace auth across all backends, [runtime] huggingface_token = "hf_xxx" works regardless of variable name.
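The "only if not already set" rule can be sketched as follows. This is a minimal illustration, not the server's code: applyEnvDefaults is a hypothetical name, and treating only undefined (rather than an empty string) as "unset" is an assumption.

```javascript
// Copy each [env] entry into `env` only when the variable is not already set.
// Returns the names that were actually applied.
// ASSUMPTION: a variable counts as "set" whenever it is defined, even if empty.
function applyEnvDefaults(configEnv, env = process.env) {
  const applied = []
  for (const [name, value] of Object.entries(configEnv)) {
    if (env[name] === undefined) {
      env[name] = String(value)
      applied.push(name)
    }
  }
  return applied
}
```

With CUDA_VISIBLE_DEVICES already exported as "1" in the shell, the config value "0" would be skipped and the shell value kept.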
[server]
| Key | Type | Default |
| ----------------- | -------------- | ---------------------------------------------------------------------- |
| id | string | buttress-<machineId> — stable id used for autodiscover / binding |
| name | string | Buttress Server (<short id>) — display name |
| port | number | 2080 (overridden by --port) |
| log_level | "debug"/"info"/"warn"/"error" | unset |
| max_body_size | string / number | "50MB" (e.g. "100MB", "1GB", or raw bytes) |
| session_timeout | string / number | 60000 ms (accepts ms numbers or duration strings such as "30s") |
| temp_file_dir | string | $TMPDIR/.buttress |
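The string forms accepted by max_body_size and session_timeout could be normalized roughly as shown below. This is a sketch under assumptions: parseSize and parseDuration are hypothetical names, the unit sets are illustrative, and the use of binary multiples (1MB = 1024 * 1024 bytes) is a guess; the server may use decimal units instead.

```javascript
// ASSUMPTION: binary multiples. The server may interpret "MB" as 10^6 bytes.
const SIZE_UNITS = { B: 1, KB: 1024, MB: 1024 ** 2, GB: 1024 ** 3 }

// Accept raw byte counts or strings such as "50MB" / "1GB"; return bytes.
function parseSize(value) {
  if (typeof value === 'number') return value
  const match = /^(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)$/i.exec(value.trim())
  if (!match) throw new Error(`unrecognized size: ${value}`)
  return Math.round(Number(match[1]) * SIZE_UNITS[match[2].toUpperCase()])
}

const DURATION_UNITS = { ms: 1, s: 1000, m: 60000, h: 3600000 }

// Accept millisecond numbers or strings such as "30s"; return milliseconds.
function parseDuration(value) {
  if (typeof value === 'number') return value
  const match = /^(\d+(?:\.\d+)?)\s*(ms|s|m|h)$/i.exec(value.trim())
  if (!match) throw new Error(`unrecognized duration: ${value}`)
  return Number(match[1]) * DURATION_UNITS[match[2].toLowerCase()]
}
```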
[runtime] — global generator defaults
Most ggml-llm [generators.model] keys can also live in [runtime] as defaults. Per-generator values win; otherwise the runtime default applies.
| Key | Type | Notes |
| ---------------------------- | ----------------------------- | ---------------------------------------------------------------------- |
| cache_dir | string | Model + metadata cache root (default ~/.buttress/models) |
| huggingface_token | string | Falls back to $HUGGINGFACE_TOKEN |
| http_headers | table | Extra headers attached to HF / HTTP downloads |
| context_release_delay_ms | number | Idle time before unloading a context (default 10000; 0 = immediate)|
| prefer_variants | string[] | Override variant probe order (ggml backends) |
| n_threads | number | CPU thread count |
| n_ctx | number | Context window (per-model value wins; auto-capped at training context) |
| n_gpu_layers | number / "auto" | Layers offloaded to GPU (default "auto") |
| n_batch / n_ubatch | number | Prompt batch / micro-batch size. Note: n_batch has a model-level default of 512 that shadows the runtime value unless [generators.model] n_batch is set explicitly. |
| n_parallel | number | Parallel sequences (default 4) |
| n_cpu_moe | number | MoE expert layers offloaded to CPU |
| flash_attn_type | "on" / "off" / "auto" | When a GPU backend is selected, defaults to "auto"; on CPU, defaults to "off". Explicit "on" / "off" / "auto" overrides. |
| cache_type_k, cache_type_v | string | KV-cache dtype (f16, f32, q8_0, q4_0, …) |
| kv_unified | boolean | Use a unified KV cache across sequences |
| swa_full | boolean | Materialize full attention even for sliding-window layers |
| ctx_shift | boolean | Allow llama.cpp's rolling context shift |
| use_mmap, use_mlock | boolean | Memory-mapping / locking |
| no_extra_bufts | boolean | Disable extra compute buffer types |
| cpu_mask, cpu_strict | string / boolean | CPU affinity (advanced) |
| devices | string[] | Restrict to specific GGML devices |
| Speculative keys | various | speculative, spec_type, spec_draft_n_max/n_min/p_min/p_split |
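The precedence between [runtime] defaults and per-generator values, including the n_batch caveat in the table, can be pictured as a layered merge. The sketch below is illustrative only: resolveModelOptions and MODEL_DEFAULTS are hypothetical names, and n_batch = 512 is the only model-level default taken from this document.

```javascript
// Hypothetical model-level defaults; only n_batch = 512 is stated in the table.
const MODEL_DEFAULTS = { n_batch: 512 }

// Later spreads win: runtime defaults, then model-level defaults,
// then explicit [generators.model] values.
function resolveModelOptions(runtime = {}, model = {}) {
  return { ...runtime, ...MODEL_DEFAULTS, ...model }
}

const resolved = resolveModelOptions(
  { n_ctx: 8192, n_batch: 2048, n_parallel: 2 }, // [runtime]
  { n_ctx: 4096 }                                // [generators.model]
)
// resolved.n_ctx is 4096 (per-generator value wins),
// resolved.n_parallel is 2 (runtime default applies),
// resolved.n_batch is 512 (model-level default shadows the runtime 2048).
```

Setting n_batch under [generators.model] is the way to get a value other than 512 in this picture.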
[autodiscover]
Set autodiscover = true for defaults, false (or omit) to disable, or a table for fine control:
[autodiscover]
udp.port = 8089
udp.announcements = { enabled = true, interval = 5000 }
udp.requests = { enabled = true, responseDelay = 100 }
http.enabled = true
http.path = "/buttress/info"
http.cors = true
# mdns.enabled = false # Bonjour/Avahi advertisement (optional)
[[generators]]
Every generator entry has a type, an optional [generators.backend] table, and a [generators.model] table:
[[generators]]
type = "ggml-llm" # or "ggml-stt" / "mlx-llm"
[generators.backend]
# (see per-type sections below)
[generators.model]
repo_id = "..."
# (see per-type sections below)
Common [generators.model] keys
Shared by all generator types:
| Key | Type | Notes |
| ------------------------- | --------- | -------------------------------------------------------------------------------- |
| repo_id (required) | string | HuggingFace repo (org/repo) |
| revision | string | Default "main" |
| download | boolean | Pre-download at server startup (default false) |
Additional keys honored by ggml-llm and ggml-stt (mlx-llm gets quantization from the repo itself and does not use these):
| Key | Type | Notes |
| ------------------------- | --------- | -------------------------------------------------------------------------------- |
| filename | string | Pin a specific artifact in the repo |
| url | string | Direct download URL (skips manifest lookup) |
| quantization | string | Preferred quant tag — e.g. q4_0, q8_0, mxfp4 |
| preferred_quantizations | string[] | Ordered fallback list when quantization doesn't match (alias: quantizations) |
| allow_local_file | boolean | Required to use local_path / mmproj_local_path |
| local_path | string | Use a local file as the load path. Repo metadata is still resolved from HF, so repo_id is still required. |
| api_base, base_url | string | Override HF API / blob hosts (mirrors / proxies) |
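The interplay of quantization and preferred_quantizations can be sketched as an ordered search. This is an assumption-laden illustration: pickByQuantization is a hypothetical name, and matching a quant tag as a case-insensitive substring of the file name is a guess; the real resolver may rely on repo metadata instead.

```javascript
// Return the first file whose name contains a wanted quant tag, trying
// `quantization` first and then each fallback in order.
// ASSUMPTION: tags are matched as case-insensitive substrings of file names.
function pickByQuantization(filenames, quantization, preferred = []) {
  const candidates = [quantization, ...preferred].filter(Boolean)
  for (const tag of candidates) {
    const hit = filenames.find((name) => name.toLowerCase().includes(tag.toLowerCase()))
    if (hit) return { filename: hit, tag }
  }
  return null // nothing matched; the caller decides how to fail
}
```

For a repo containing only Q8_0 and Q4_0 files, asking for mxfp4 with fallbacks ["q4_0", "q8_0"] would select the Q4_0 file in this sketch.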
ggml-llm (llama.cpp via @fugood/llama.node)
Loads a GGUF LLM. Runtime keys above can be overridden per-generator under [generators.model]; [generators.backend] only controls backend selection and resource planning.
[generators.backend]
| Key | Type | Default | Notes |
| --------------------- | -------- | --------------------------------------------- | ---------------------------------------------------------------- |
| variant | string | auto | Force cuda / vulkan / snapdragon / default |
| variant_preference | string[] | ["cuda","vulkan","snapdragon","default"] | Probe order when variant is unset |
| gpu_memory_fraction | number | 0.85 | Max GPU fraction the hardware guardrails may plan against |
| cpu_memory_fraction | number | 0.5 | Max RAM fraction for CPU-side buffers |
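How variant and variant_preference might combine is sketched below. pickVariant and the isAvailable probe are hypothetical; treating "auto" the same as an unset variant and falling back to "default" when nothing probes successfully are both assumptions.

```javascript
const DEFAULT_PREFERENCE = ['cuda', 'vulkan', 'snapdragon', 'default']

// A forced variant is returned as-is; otherwise walk the preference list
// and take the first variant the probe reports as usable.
// ASSUMPTIONS: "auto" behaves like unset; "default" is the last resort.
function pickVariant(backend = {}, isAvailable = () => false) {
  if (backend.variant && backend.variant !== 'auto') return backend.variant
  const order = backend.variant_preference ?? DEFAULT_PREFERENCE
  return order.find((variant) => isAvailable(variant)) ?? 'default'
}
```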
[generators.model] — in addition to the common keys above:
| Key | Type | Notes |
| ----------------------------------------------------------------------------- | ---------------- | -------------------------------------------------------------------- |
| n_ctx | number | Context window. Auto-capped at the model's training context. |
| n_gpu_layers | number / "auto" | Layers offloaded to GPU (default "auto") |
| n_batch | number | Prompt batch size (default 512) |
| n_ubatch, n_threads, n_parallel, n_cpu_moe | number | Same semantics as the [runtime] defaults |
| flash_attn_type, cache_type_k, cache_type_v, kv_unified, swa_full, ctx_shift, use_mmap, use_mlock, no_extra_bufts, cpu_mask, cpu_strict, devices | various | Per-model overrides for the [runtime] defaults |
Multimodal (mtmd) — auto-downloads the matching mmproj-*.gguf from the same repo and calls initMultimodal:
| Key | Type | Notes |
| ------------------------- | ------- | ------------------------------------------------------------------ |
| enable_mtmd | boolean | Default false |
| mmproj_filename | string | Pin a specific projector file |
| mmproj_url | string | Direct URL override |
| mmproj_local_path | string | Local projector (requires allow_local_file = true) |
| mmproj_use_gpu | boolean | null = auto (true when n_gpu_layers > 0) |
| mmproj_image_min_tokens | number | Min visual tokens (dynamic-resolution models; -1 = unset) |
| mmproj_image_max_tokens | number | Max visual tokens (-1 = unset) |
Speculative decoding
| Key | Type | Notes |
| -------------------- | ------ | -------------------------------------------------- |
| speculative | string | Draft model identifier |
| spec_type | string | Strategy (backend-defined) |
| spec_draft_n_max | int | Max drafted tokens per step |
| spec_draft_n_min | int | Min drafted tokens |
| spec_draft_p_min | number | Min acceptance probability |
| spec_draft_p_split | number | Split threshold |
Example
[[generators]]
type = "ggml-llm"
[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]
gpu_memory_fraction = 0.95
[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
n_ctx = 12800
download = true
ggml-stt (whisper.cpp via @fugood/whisper.node)
Loads a Whisper GGML model for speech-to-text.
[generators.backend]
| Key | Type | Default | Notes |
| --------------------- | -------- | ----------------------------- | ---------------------------------- |
| variant | string | auto | cuda / vulkan / default |
| variant_preference | string[] | ["cuda","vulkan","default"] | Probe order |
| gpu_memory_fraction | number | 0.85 | |
| cpu_memory_fraction | number | 0.5 | |
[generators.model] — common keys plus:
| Key | Type | Default | Notes |
| ------------------------- | ----------------------------- | -------------------------------- | ---------------------------------------------------- |
| repo_id | string | "BricksDisplay/whisper-ggml" | Defaulted (unlike ggml-llm) |
| preferred_quantizations | string[] | ["q8_0", <no-quant>, "q5_1"] | Default fallback chain |
| use_gpu | boolean | true | Force-disable GPU even when available |
| use_flash_attn | "on" / "off" / "auto" / boolean | "auto" | "auto" enables flash-attn when GPU is in use. true/false are accepted as shortcuts for "on"/"off". |
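The accepted forms of use_flash_attn can be reduced to one of three modes, as in this sketch. normalizeFlashAttn and flashAttnEnabled are hypothetical names used for illustration; only the mapping itself (booleans as shortcuts, "auto" following GPU use) comes from the table.

```javascript
// Map booleans to their string shortcuts and validate string modes.
function normalizeFlashAttn(value = 'auto') {
  if (value === true) return 'on'
  if (value === false) return 'off'
  if (['on', 'off', 'auto'].includes(value)) return value
  throw new Error(`invalid use_flash_attn: ${String(value)}`)
}

// "auto" follows GPU usage; "on" / "off" are taken literally.
function flashAttnEnabled(value, gpuInUse) {
  const mode = normalizeFlashAttn(value)
  return mode === 'auto' ? gpuInUse : mode === 'on'
}
```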
Runtime extras — under [runtime] for ggml-stt only:
| Key | Type | Notes |
| ------------- | ------ | ------------------------------------------- |
| max_threads | number | Caps the whisper.cpp thread count |
Example
[[generators]]
type = "ggml-stt"
[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]
[generators.model]
repo_id = "BricksDisplay/whisper-ggml"
filename = "ggml-large-v3-turbo-q8_0.bin"
use_gpu = true
use_flash_attn = "on"
download = true
mlx-llm (Apple Silicon, Python mlx-lm / mlx-vlm bridge)
Loads an MLX-format model on Apple Silicon. On first use, the backend creates a virtualenv at {cache_dir}/mlx-env and installs mlx_lm_package, mlx_vlm_package, plus torch and torchvision (required by some VLM processors). If an existing venv already has mlx_vlm and torch importable, the install step is skipped. There is no [generators.backend] section.
[generators.model] — common repo_id / revision / download plus:
| Key | Type | Default | Notes |
| ------------------ | ------------------- | --------- | ----------------------------------------------------------------------------- |
| adapter_path | string | — | Local LoRA adapter directory |
| vlm | "auto" / boolean | "auto" | Force VLM (true) vs text-only (false); "auto" infers from the repo |
| tokenizer_config | table | — | Forwarded to mlx_lm.load(..., tokenizer_config=...) |
| model_config | table | — | Forwarded to mlx_lm.load(..., model_config=...) |
quantization, filename, and preferred_quantizations are not used — the MLX repo itself determines the quantization.
Runtime extras — under [runtime] for mlx-llm:
| Key | Type | Default | Notes |
| ------------------- | ------ | ----------------------------- | -------------------------------------------------------------------- |
| mlx_env_dir | string | {cache_dir}/mlx-env | Location of the auto-managed Python venv |
| mlx_lm_package | string | "mlx-lm==0.31.1" | pip spec used when provisioning the venv |
| mlx_vlm_package | string | "mlx-vlm==0.4.0" | pip spec used when provisioning the venv |
| session_cache.* | table | enabled, 5GB, 100 entries | Separate cache from ggml-llm (lives in {cache_dir}/mlx-session-cache) |
Example
[[generators]]
type = "mlx-llm"
[generators.model]
repo_id = "mlx-community/Qwen2.5-VL-3B-Instruct-4bit"
vlm = true
download = true
Programmatic Usage
import { startServer } from '@fugood/buttress-server'
startServer({
  port: 3000,
  defaultConfig: {
    runtime: {
      cache_dir: './.buttress-cache'
    },
    generators: [
      {
        type: 'ggml-llm',
        model: {
          repo_id: 'ggml-org/gemma-3-270m-qat-GGUF',
          quantization: 'mxfp4',
        }
      }
    ]
  }
})
  .then(({ port }) => {
    console.log(`Server running on port ${port}`)
  })
  .catch(console.error)
Environment Variable Priority
Environment variables can be set in the [env] section of the TOML config. These values will only be applied if the environment variable is not already set in the system. This allows:
- Default values in config file
- System environment variables to override config values
- Command-line exports to have highest priority
Example:
# Config has: [env] HF_TOKEN = "default_token"
# This will use the system env variable (highest priority)
HF_TOKEN=my_token npx bricks-buttress
# This will use the config value
npx bricks-buttress
Port Priority
Port can be configured via multiple sources (highest priority first):
- Command-line flag: --port 3000
- Config file: [server] port = 2080
- Default: 2080
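The priority order above amounts to a simple fallback chain. A minimal sketch, with resolvePort as a hypothetical helper name:

```javascript
// The first defined source wins: CLI flag, then config file, then the default.
function resolvePort({ cliPort, configPort } = {}) {
  return cliPort ?? configPort ?? 2080
}
```

For example, resolvePort({ cliPort: 3000, configPort: 2080 }) yields 3000, while resolvePort({}) yields 2080.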
CLI Reference
bricks-buttress v2.23.0-beta.22
Buttress server for remote inference with GGML backends.
Usage:
bricks-buttress [options]
Options:
-h, --help Show this help message
-v, --version Show version number
-p, --port <port> Port to listen on (default: 2080)
-c, --config <path|toml> Path to TOML config file or inline TOML string
Testing Options:
--test-caps <backend> Test model capabilities (ggml-llm or ggml-stt)
--test-caps-model-id <id> Model ID to test (used with --test-caps)
--test-models <ids> Comma-separated list of model IDs to test
--test-models-default Test default set of models
Note: --test-models and --test-models-default output a markdown report
file (e.g., ggml-llm-model-capabilities-YYYY-MM-DD.md)
Environment Variables:
NODE_ENV Set to 'development' for dev mode
Examples:
bricks-buttress
bricks-buttress --port 3000
bricks-buttress --config ./config.toml
bricks-buttress --test-caps ggml-llm --test-models-default
bricks-buttress --test-caps ggml-stt --test-caps-model-id BricksDisplay/whisper-ggml:ggml-small.bin
Compatibility Endpoints (Experimental)
The server can expose OpenAI- and Anthropic-compatible HTTP endpoints in addition to the native RPC. Each endpoint is opt-in via the TOML config:
[openai_compat]
enabled = true
# cors_allowed_origins = "*" # Or a list of origins; defaults to disabled
[anthropic_messages]
enabled = true
# cors_allowed_origins = ["http://localhost:3000"]
| Endpoint | Config flag |
| --------------------- | ------------------------------------ |
| /oai-compat/v1/* | [openai_compat] enabled = true |
| /anthropic-messages | [anthropic_messages] enabled = true |
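A client request against the OpenAI-compatible endpoint might be assembled as in the sketch below. Everything past the /oai-compat/v1/ prefix is an assumption: the chat/completions route and the request body are modeled on OpenAI's public API, buildChatRequest is a hypothetical helper, and the model name to pass is not specified here. A bound server would additionally require an access token, which this sketch omits.

```javascript
// Build (but do not send) a chat request for the OpenAI-compatible endpoint.
// ASSUMPTION: the route mirrors OpenAI's POST /v1/chat/completions;
// check the server's actual routes before relying on it.
function buildChatRequest(baseUrl, model, userText) {
  const url = new URL('/oai-compat/v1/chat/completions', baseUrl).toString()
  const init = {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      messages: [{ role: 'user', content: userText }],
    }),
  }
  return { url, init }
}
```

The returned pair can be passed straight to fetch(url, init) once the server is running with [openai_compat] enabled = true.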
Session State Cache
The server supports session state caching for ggml-llm generators, which saves KV cache state to disk after completions. This enables:
- Prompt reuse: Same or similar prompts can reuse cached state, skipping prompt processing
- Multi-turn conversations: Conversation history state is preserved across requests
Configuration
[runtime.session_cache]
enabled = true # Enable/disable session caching (default: true)
max_size_bytes = "10GB" # Supports string (e.g., "10GB", "500MB") or number (default: 10GB)
max_entries = 1000 # Max number of cached entries (default: 1000)
How it works
- After a successful completion, the KV cache state is saved to disk
- On new completions, the server checks if any cached state matches the prompt prefix
- If a match is found, the cached state is loaded, skipping redundant prompt processing
- LRU eviction removes the least recently used entries when limits are exceeded
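The steps above can be pictured with a small in-memory model. This is a sketch, not the server's implementation: PrefixStateCache is a hypothetical class, prompts are represented as token arrays, states are opaque values rather than files on disk, only the entry-count limit is modeled, and a cached entry is assumed to match only when its whole token sequence is a prefix of the new prompt.

```javascript
// In-memory stand-in for the on-disk store. A Map keeps insertion order,
// so the first key is always the least recently used one.
class PrefixStateCache {
  constructor(maxEntries = 1000) {
    this.maxEntries = maxEntries
    this.entries = new Map() // key: tokens joined by ',', value: { tokens, state }
  }

  // Called after a successful completion.
  save(tokens, state) {
    const key = tokens.join(',')
    this.entries.delete(key) // re-inserting moves the key to the newest slot
    this.entries.set(key, { tokens, state })
    while (this.entries.size > this.maxEntries) {
      this.entries.delete(this.entries.keys().next().value) // evict the LRU entry
    }
  }

  // Find the longest cached token sequence that is a prefix of `prompt`.
  lookup(prompt) {
    let best = null
    for (const [key, entry] of this.entries) {
      const isPrefix =
        entry.tokens.length <= prompt.length &&
        entry.tokens.every((token, i) => token === prompt[i])
      if (isPrefix && (!best || entry.tokens.length > best.entry.tokens.length)) {
        best = { key, entry }
      }
    }
    if (!best) return null
    this.entries.delete(best.key) // mark as recently used
    this.entries.set(best.key, best.entry)
    return { state: best.entry.state, reusedTokens: best.entry.tokens.length }
  }
}
```

In this model, reusedTokens is the number of prompt tokens that would not need to be processed again; a multi-turn conversation benefits because each new request extends the previous prompt.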
Cache location
Cache files are stored in {cache_dir}/.session-state-cache/:
- cache-map.json: Index of cached entries
- states/: Binary state files
- temp/: Temporary files (auto-cleaned after 1 hour)
Tips
- macOS: Use sudo sysctl iogpu.wired_limit_mb=<number> to increase the GPU memory allocation. By default, about 70% of memory is available to the GPU. For example, on hardware with 128GB of memory, sudo sysctl iogpu.wired_limit_mb=137438 raises the limit to 128GB. Run sudo sysctl iogpu.wired_limit_mb=0 to return to the default.
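The value 137438 in the tip matches 128 GiB expressed in decimal megabytes. The helper below reproduces that arithmetic; wiredLimitMb is a hypothetical name, and this reading of the units is inferred from the example value rather than from Apple documentation.

```javascript
// Convert a size in GiB to decimal megabytes, rounded down.
// 128 GiB = 128 * 2^30 bytes = 137438953472 bytes, i.e. 137438 MB.
function wiredLimitMb(gib) {
  return Math.floor((gib * 2 ** 30) / 1e6)
}
```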
