@fugood/buttress-server

v2.24.2

Buttress Server

A high-performance RPC server for managing GGML LLM generators with configurable defaults and runtime management.

Installation

npm install -g @fugood/buttress-server

Quick Start

Using CLI

# Start with config file
npx bricks-buttress --config ./config.toml

# Start without config (uses env vars and defaults)
npx bricks-buttress

Configuration

Configuration is provided via the --config / -c flag, which takes the path to a TOML file (per the CLI reference, an inline TOML string is also accepted).

Configuration Format (TOML)

# Environment variables (only set if not already defined in system)
[env]
HF_TOKEN = "your_huggingface_token_here"
CUDA_VISIBLE_DEVICES = "0"

[server]
port = 2080
log_level = "info"

[runtime]
cache_dir = "~/.buttress/models"

# Session state cache for ggml-llm (saves KV cache to disk for prompt reuse)
[runtime.session_cache]
enabled = true
max_size_bytes = "10GB"  # Supports string (e.g., "10GB", "500MB") or number
max_entries = 1000

# GGML LLM generator
[[generators]]
type = "ggml-llm"
[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]
[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
n_ctx = 12800

# GGML STT (Speech-to-Text) generator
[[generators]]
type = "ggml-stt"
[generators.backend]
variant_preference = ["coreml", "default"]
[generators.model]
repo_id = "BricksDisplay/whisper-ggml"
filename = "ggml-small.bin"

Programmatic Usage

import { startServer } from '@fugood/buttress-server'

startServer({
  port: 3000,
  defaultConfig: {
    runtime: {
      cache_dir: './.buttress-cache'
    },
    generators: [
      {
        type: 'ggml-llm',
        model: {
          repo_id: 'ggml-org/gemma-3-270m-qat-GGUF',
          quantization: 'mxfp4',
        }
      }
    ]
  }
})
  .then(({ port }) => {
    console.log(`Server running on port ${port}`)
  })
  .catch(console.error)

Environment Variable Priority

Environment variables can be set in the [env] section of the TOML config. These values will only be applied if the environment variable is not already set in the system. This allows:

Default values in config file
System environment variables to override config values
Command-line exports to have highest priority

Example:

# Config has: [env] HF_TOKEN = "default_token"

# This will use the system env variable (highest priority)
HF_TOKEN=my_token npx bricks-buttress

# This will use the config value
npx bricks-buttress
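
The "only set if not already defined" rule can be sketched in a few lines. This is an illustration of the described behavior, not the server's actual code; applyEnvDefaults is a hypothetical helper name, and the simulated systemEnv object stands in for process.env.

```javascript
// Hypothetical helper, not part of the @fugood/buttress-server API.
// Copies [env] entries into `target` only when the key is not already set.
function applyEnvDefaults(envSection, target = process.env) {
  const applied = []
  for (const [key, value] of Object.entries(envSection)) {
    if (target[key] === undefined) {
      target[key] = String(value)
      applied.push(key)
    }
  }
  return applied
}

// Simulated system environment where HF_TOKEN was exported beforehand.
const systemEnv = { HF_TOKEN: 'my_token' }
const applied = applyEnvDefaults(
  { HF_TOKEN: 'default_token', CUDA_VISIBLE_DEVICES: '0' },
  systemEnv,
)

console.log(systemEnv.HF_TOKEN) // my_token (the system value wins)
console.log(applied)            // [ 'CUDA_VISIBLE_DEVICES' ]
```

Only CUDA_VISIBLE_DEVICES is taken from the config here, because HF_TOKEN already existed in the environment.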

Port Priority

Port can be configured via multiple sources (highest priority first):

Command-line flag: --port 3000
Config file: [server] port = 2080
Default: 2080
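
As a rough sketch of this lookup order, the function below picks the first defined source. resolvePort and its parameter names are hypothetical and only model the priority list above; the range check is an added assumption, not documented server behavior.

```javascript
// Hypothetical sketch of the documented priority: CLI flag > config file > default.
const DEFAULT_PORT = 2080

function resolvePort({ cliPort, configPort } = {}) {
  for (const candidate of [cliPort, configPort]) {
    if (candidate === undefined || candidate === null) continue
    const port = Number(candidate)
    // Assumption: reject anything outside the valid TCP port range.
    if (Number.isInteger(port) && port > 0 && port <= 65535) return port
    throw new RangeError(`Invalid port: ${candidate}`)
  }
  return DEFAULT_PORT
}

console.log(resolvePort({ cliPort: '3000', configPort: 2080 })) // 3000
console.log(resolvePort({ configPort: 8080 }))                  // 8080
console.log(resolvePort())                                      // 2080
```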

CLI Reference

bricks-buttress v2.23.0-beta.22

Buttress server for remote inference with GGML backends.

Usage:
  bricks-buttress [options]

Options:
  -h, --help                    Show this help message
  -v, --version                 Show version number
  -p, --port <port>             Port to listen on (default: 2080)
  -c, --config <path|toml>      Path to TOML config file or inline TOML string

Testing Options:
  --test-caps <backend>         Test model capabilities (ggml-llm or ggml-stt)
  --test-caps-model-id <id>     Model ID to test (used with --test-caps)
  --test-models <ids>           Comma-separated list of model IDs to test
  --test-models-default         Test default set of models

  Note: --test-models and --test-models-default output a markdown report
        file (e.g., ggml-llm-model-capabilities-YYYY-MM-DD.md)

Environment Variables:
  NODE_ENV                      Set to 'development' for dev mode

Examples:
  bricks-buttress
  bricks-buttress --port 3000
  bricks-buttress --config ./config.toml
  bricks-buttress --test-caps ggml-llm --test-models-default
  bricks-buttress --test-caps ggml-stt --test-caps-model-id BricksDisplay/whisper-ggml:ggml-small.bin

Session State Cache

The server supports session state caching for ggml-llm generators, which saves KV cache state to disk after completions. This enables:

Prompt reuse: Same or similar prompts can reuse cached state, skipping prompt processing
Multi-turn conversations: Conversation history state is preserved across requests

Configuration

[runtime.session_cache]
enabled = true                  # Enable/disable session caching (default: true)
max_size_bytes = "10GB"         # Supports string (e.g., "10GB", "500MB") or number (default: 10GB)
max_entries = 1000              # Max number of cached entries (default: 1000)
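
Since max_size_bytes accepts either a string or a number, a parser along these lines could normalize both forms. This is a sketch under assumptions: parseSizeBytes is a hypothetical name, and the README does not say whether "GB" means 1024^3 or 1000^3 bytes, so binary multiples are assumed here.

```javascript
// Hypothetical parser for values such as "10GB" or "500MB".
// Assumption: binary multiples (1 KB = 1024 bytes); the server may use decimal units.
const UNITS = { B: 1, KB: 1024, MB: 1024 ** 2, GB: 1024 ** 3, TB: 1024 ** 4 }

function parseSizeBytes(value) {
  if (typeof value === 'number') return value // already a byte count
  const match = /^\s*(\d+(?:\.\d+)?)\s*(B|KB|MB|GB|TB)\s*$/i.exec(String(value))
  if (!match) throw new TypeError(`Unrecognized size: ${value}`)
  return Math.round(Number(match[1]) * UNITS[match[2].toUpperCase()])
}

console.log(parseSizeBytes('10GB'))  // 10737418240
console.log(parseSizeBytes('500MB')) // 524288000
console.log(parseSizeBytes(1024))    // 1024
```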

How it works

After a successful completion, the KV cache state is saved to disk
On new completions, the server checks if any cached state matches the prompt prefix
If a match is found, the cached state is loaded, skipping redundant prompt processing
LRU eviction removes the least recently used entries when the size or entry-count limit is exceeded
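
The steps above can be modeled with a small in-memory index. This is a simplified sketch, not the server's implementation: SessionCacheIndex is a hypothetical class, prompts are represented as token arrays, only the entry-count limit is modeled (not max_size_bytes), and "matches the prompt prefix" is assumed to mean the longest cached token sequence that is a prefix of the new prompt.

```javascript
// Hypothetical in-memory model of the session cache index.
class SessionCacheIndex {
  constructor(maxEntries) {
    this.maxEntries = maxEntries
    // A Map iterates in insertion order, so the first key is the least recently used.
    this.entries = new Map()
  }

  // Step 1: record the state saved after a completion, then evict if over the limit.
  save(promptTokens, stateFile) {
    const key = promptTokens.join(',')
    this.entries.delete(key)
    this.entries.set(key, { tokens: promptTokens, stateFile })
    while (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value
      this.entries.delete(oldestKey)
    }
  }

  // Steps 2-3: find the cached entry whose tokens form the longest prefix of the prompt.
  findLongestPrefix(promptTokens) {
    let best = null
    let bestKey = null
    for (const [key, entry] of this.entries) {
      const { tokens } = entry
      if (tokens.length > promptTokens.length) continue
      if (best && tokens.length <= best.tokens.length) continue
      if (tokens.every((t, i) => t === promptTokens[i])) {
        best = entry
        bestKey = key
      }
    }
    if (best) {
      // Re-insert the hit so it becomes the most recently used entry.
      this.entries.delete(bestKey)
      this.entries.set(bestKey, best)
    }
    return best
  }
}

const index = new SessionCacheIndex(2)
index.save([1, 2, 3], 'a.bin')
index.save([1, 2, 3, 4, 5], 'b.bin')
const hit = index.findLongestPrefix([1, 2, 3, 9])
console.log(hit.stateFile) // a.bin

index.save([9], 'c.bin') // over the limit: b.bin is now least recently used and is evicted
const afterEviction = index.findLongestPrefix([1, 2, 3, 4, 5])
console.log(afterEviction.stateFile) // a.bin
```

The second lookup falls back to a.bin because the longer b.bin entry was evicted after the hit on a.bin refreshed its position.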

Cache location

Cache files are stored in {cache_dir}/.session-state-cache/:

cache-map.json - Index of cached entries
states/ - Binary state files
temp/ - Temporary files (auto-cleaned after 1 hour)

Tips

macOS: Use sudo sysctl iogpu.wired_limit_mb=<number> to increase the memory available to the GPU. By default, roughly 70% of system memory is available to the GPU. For example, on a machine with 128GB of memory, run sudo sysctl iogpu.wired_limit_mb=137438 to raise the limit to 128GB. Run sudo sysctl iogpu.wired_limit_mb=0 to restore the default.
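
The example value 137438 appears to be 128 GiB expressed in units of 10^6 bytes. The snippet below reproduces that arithmetic for other memory sizes; wiredLimitMb is a hypothetical helper, and the unit interpretation is an inference from the example number rather than something the README states.

```javascript
// Assumption: the limit is given as GiB converted to units of 10^6 bytes, rounded down.
function wiredLimitMb(gib) {
  return Math.floor((gib * 1024 ** 3) / 1e6)
}

console.log(wiredLimitMb(128)) // 137438
console.log(`sudo sysctl iogpu.wired_limit_mb=${wiredLimitMb(128)}`)
```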
