Buttress Server
A high-performance RPC server for managing GGML LLM generators with configurable defaults and runtime management.
Installation
npm install -g @fugood/buttress-server
Quick Start
Using CLI
# Start with config file
npx bricks-buttress --config ./config.toml
# Start without config (uses env vars and defaults)
npx bricks-buttress
Configuration
Configuration can be provided via:
--config/-c flag with a TOML file path or an inline TOML string
Configuration Format (TOML)
# Environment variables (only set if not already defined in system)
[env]
HF_TOKEN = "your_huggingface_token_here"
CUDA_VISIBLE_DEVICES = "0"
[server]
port = 2080
log_level = "info"
[runtime]
cache_dir = "~/.buttress/models"
# Session state cache for ggml-llm (saves KV cache to disk for prompt reuse)
[runtime.session_cache]
enabled = true
max_size_bytes = "10GB" # Supports string (e.g., "10GB", "500MB") or number
max_entries = 1000
# GGML LLM generator
[[generators]]
type = "ggml-llm"
[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]
[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
n_ctx = 12800
# GGML STT (Speech-to-Text) generator
[[generators]]
type = "ggml-stt"
[generators.backend]
variant_preference = ["coreml", "default"]
[generators.model]
repo_id = "BricksDisplay/whisper-ggml"
filename = "ggml-small.bin"
Programmatic Usage
import { startServer } from '@fugood/buttress-server'
startServer({
port: 3000,
defaultConfig: {
runtime: {
cache_dir: './.buttress-cache'
},
generators: [
{
type: 'ggml-llm',
model: {
repo_id: 'ggml-org/gemma-3-270m-qat-GGUF',
quantization: 'mxfp4',
}
}
]
}
})
.then(({ port }) => {
console.log(`Server running on port ${port}`)
})
.catch(console.error)
Environment Variable Priority
Environment variables can be set in the [env] section of the TOML config. These values are applied only if the environment variable is not already set in the system (see the sketch after the list below). This allows:
- Default values in config file
- System environment variables to override config values
- Command-line exports to have highest priority
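For illustration, a minimal TypeScript sketch of this precedence; applyEnvDefaults is a hypothetical helper, not part of the package's API:
// Illustrative only: a config [env] value is used solely as a fallback
// when the variable is absent from the process environment.
function applyEnvDefaults(envSection: Record<string, string>): void {
  for (const [name, value] of Object.entries(envSection)) {
    // System environment variables and command-line exports win.
    if (process.env[name] === undefined) {
      process.env[name] = value
    }
  }
}
applyEnvDefaults({ HF_TOKEN: 'default_token', CUDA_VISIBLE_DEVICES: '0' })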
Example:
# Config has: [env] HF_TOKEN = "default_token"
# This will use the system env variable (highest priority)
HF_TOKEN=my_token npx bricks-buttress
# This will use the config value
npx bricks-buttress
Port Priority
Port can be configured via multiple sources (highest priority first):
- Command-line flag: --port 3000
- Config file: [server] port = 2080
- Default: 2080
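A minimal sketch of this resolution order; resolvePort and its arguments are illustrative, not the server's actual internals:
// Illustrative only: pick the first defined source in priority order.
function resolvePort(cliPort?: number, configPort?: number): number {
  const DEFAULT_PORT = 2080
  return cliPort ?? configPort ?? DEFAULT_PORT
}
resolvePort(3000, 2080)      // -> 3000 (CLI flag wins)
resolvePort(undefined, 2080) // -> 2080 (config file)
resolvePort()                // -> 2080 (default)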
CLI Reference
bricks-buttress v2.23.0-beta.22
Buttress server for remote inference with GGML backends.
Usage:
bricks-buttress [options]
Options:
-h, --help Show this help message
-v, --version Show version number
-p, --port <port> Port to listen on (default: 2080)
-c, --config <path|toml> Path to TOML config file or inline TOML string
Testing Options:
--test-caps <backend> Test model capabilities (ggml-llm or ggml-stt)
--test-caps-model-id <id> Model ID to test (used with --test-caps)
--test-models <ids> Comma-separated list of model IDs to test
--test-models-default Test default set of models
Note: --test-models and --test-models-default output a markdown report
file (e.g., ggml-llm-model-capabilities-YYYY-MM-DD.md)
Environment Variables:
NODE_ENV Set to 'development' for dev mode
Examples:
bricks-buttress
bricks-buttress --port 3000
bricks-buttress --config ./config.toml
bricks-buttress --test-caps ggml-llm --test-models-default
bricks-buttress --test-caps ggml-stt --test-caps-model-id BricksDisplay/whisper-ggml:ggml-small.bin
Session State Cache
The server supports session state caching for ggml-llm generators, which saves KV cache state to disk after completions. This enables:
- Prompt reuse: Same or similar prompts can reuse cached state, skipping prompt processing
- Multi-turn conversations: Conversation history state is preserved across requests
Configuration
[runtime.session_cache]
enabled = true # Enable/disable session caching (default: true)
max_size_bytes = "10GB" # Supports string (e.g., "10GB", "500MB") or number (default: 10GB)
max_entries = 1000      # Max number of cached entries (default: 1000)
How it works
- After a successful completion, the KV cache state is saved to disk
- On new completions, the server checks if any cached state matches the prompt prefix
- If a match is found, the cached state is loaded, skipping redundant prompt processing
- LRU eviction removes oldest entries when limits are exceeded
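A simplified TypeScript sketch of this lookup; the entry fields and the findBestPrefixMatch/evictLru helpers are illustrative assumptions, not the server's actual on-disk format or API:
// Illustrative sketch of prefix matching over cached session states.
interface CachedState {
  id: string        // name of the binary state file under states/
  prompt: string    // prompt whose processed KV cache state was saved
  lastUsed: number  // timestamp used for LRU eviction
}

// Pick the cached entry whose prompt is the longest prefix of the new
// prompt; reusing it skips re-processing that shared prefix.
function findBestPrefixMatch(entries: CachedState[], prompt: string): CachedState | null {
  let best: CachedState | null = null
  for (const entry of entries) {
    if (prompt.startsWith(entry.prompt)) {
      if (!best || entry.prompt.length > best.prompt.length) best = entry
    }
  }
  if (best) best.lastUsed = Date.now()
  return best
}

// LRU eviction: keep only the most recently used entries once the
// max_entries limit (or the size budget) is exceeded.
function evictLru(entries: CachedState[], maxEntries: number): CachedState[] {
  return [...entries]
    .sort((a, b) => b.lastUsed - a.lastUsed)
    .slice(0, maxEntries)
}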
Cache location
Cache files are stored in {cache_dir}/.session-state-cache/:
- cache-map.json - Index of cached entries
- states/ - Binary state files
- temp/ - Temporary files (auto-cleaned after 1 hour)
Tips
- macOS: Use sudo sysctl iogpu.wired_limit_mb=<number> to increase the GPU memory allocation limit. By default only about ~70% of system memory is available to the GPU. For example, if the hardware has 128GB of memory, you can run sudo sysctl iogpu.wired_limit_mb=137438 to raise the limit to 128GB. Run sudo sysctl iogpu.wired_limit_mb=0 to restore the default.
