athanor v1.4.0
athanor

Athanor — personal LLM alchemy. Discover, download, configure, and switch between MLX and llama.cpp models on Apple Silicon from a single TUI or CLI, while keeping an OpenAI-compatible HTTP endpoint live for downstream tools (pi-agent, editors, etc.).

What it does

  • Discovers MLX models in your HuggingFace cache and GGUF files in ~/.models.
  • Downloads new models from HuggingFace via the hf CLI.
  • Runs them via mlx_lm.server or llama-server, one or more at a time, each on a stable port.
  • Supervises the processes as detached children with per-process log files and automatic reattach.
  • Publishes them to a pi-agent catalog (~/.pi/agent/models.json) as one custom provider per model, leaving your other (cloud, Ollama, etc.) providers alone.
  • Exposes an optional local control API so other tools can ask athanor to activate a model on demand.

Prerequisites

  • macOS on Apple Silicon
  • Node.js ≥ 18
  • mlx_lm.server (from mlx-lm) — text-only MLX models
  • mlx_vlm.server (from mlx-vlm) — vision/multimodal MLX models; optional if you never run VLMs
  • llama-server (from llama.cpp)
  • hf (from huggingface_hub) — only required for athanor pull

Run athanor doctor at any point to verify all four are on your PATH.

Agent-assisted setup

If you use an AI coding agent (Claude Code, Cursor, Aider, etc.), the fastest path is to open this repo in the agent and ask it to set athanor up for you. AGENTS.md has an Onboarding a user section written for that case: it tells the agent how to install the CLI (npm start vs npm link), run athanor doctor, install any missing runtime binaries, profile the host (Apple Silicon check, unified memory via vm_stat, HF cache size), and pick a starter model sized for the machine.

A minimal prompt, once the repo is open:

Set up athanor on this machine. Profile what I have, install anything missing, and suggest a starter model I can actually run.

The rest of this README walks the same path manually.

Setup

Install the runtime helpers athanor shells out to: mlx-lm, mlx-vlm (optional), llama.cpp, and hf (only needed for pulls). They all land on your PATH and can be verified with athanor doctor.

mlx-lm (MLX runtime)

mlx-lm is a Python package; the mlx_lm.server entry point is what athanor invokes. A dedicated virtualenv keeps it isolated from your system Python.

# with uv (recommended)
brew install uv
uv tool install mlx-lm
# ⇒ `mlx_lm.server` is now on PATH via ~/.local/bin

# or with pipx
brew install pipx
pipx install mlx-lm

# or with a plain venv
python3 -m venv ~/.venvs/mlx && source ~/.venvs/mlx/bin/activate
pip install -U mlx-lm

Verify:

mlx_lm.server --help

MLX requires Apple Silicon and macOS 13.5+. Models are downloaded to ~/.cache/huggingface/hub on first use.

mlx-vlm (MLX vision/multimodal runtime)

Required only if you flip a model to mlxFlavor: "vlm" to feed it actual image input. Athanor defaults every MLX entry to mlx_lm.server, which handles text-only chat for most VLM-tagged repos (Qwen2-VL, Qwen2.5-VL, Qwen3-VL, LLaVA, etc.) without torch. Only install this when you actually need vision — and then opt in per model with athanor flavor <slug> vlm.

mlx_vlm.server imports transformers' VLM processors, which in turn import PyTorch and Torchvision. Installing mlx-vlm alone is not enough; torch and torchvision must be present in the same environment. With uv:

uv tool install mlx-vlm --with torch --with torchvision

Or with pipx:

pipx install mlx-vlm
pipx inject mlx-vlm torch torchvision

Or with pip into the same venv you used for mlx-lm:

pip install -U mlx-vlm torch torchvision

Verify:

mlx_vlm.server --help
python3 -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"

If mlx_vlm.server starts but a request fails with Qwen3VLVideoProcessor requires the PyTorch library (or similar), torch/torchvision are missing or installed into a different interpreter than mlx_vlm.server is using. Re-run the install above into the right environment.

llama.cpp (GGUF runtime)

The easiest path on macOS is Homebrew — the bottle is built with Metal enabled, so GPU acceleration works out of the box:

brew install llama.cpp

You can also build from source if you want a specific revision or custom flags:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# copy or symlink build/bin/llama-server onto your PATH

Verify:

llama-server --help

hf (HuggingFace CLI — only for athanor pull)

Needed only if you want athanor to download new models from the Hub. Skip if you'll populate ~/.cache/huggingface/hub by other means (e.g. mlx_lm.server --model <repo> auto-downloads on first use).

The Hugging Face Hub team has replaced the legacy huggingface-cli command with a new CLI called hf. Athanor invokes hf download. The recommended install path is the standalone installer, which drops a self-contained hf binary onto your PATH without touching system Python:

curl -LsSf https://hf.co/cli/install.sh | bash

Alternatives:

# Homebrew
brew install hf

# via uvx — runs the latest release on demand, no install step
uvx hf --help

# via pip (the `hf` entry point ships with huggingface_hub ≥ 0.34)
pip install -U huggingface_hub

Verify:

hf --help
# optional: log in if you need access to gated/private repos
hf auth login

Final check

athanor doctor
# mlx_lm.server:   /Users/you/.local/bin/mlx_lm.server  version 0.31.3 (uv)
# mlx_vlm.server:  /Users/you/.local/bin/mlx_vlm.server  version 0.4.4 (uv)
# llama-server:    /opt/homebrew/bin/llama-server  version 9010 (brew)
# hf:              /Users/you/.local/bin/hf  version 1.13.0 (uv)

athanor doctor --check-updates
# ... latest <version> up to date
# ... or latest <version> update available
# ... hint uv tool upgrade <tool>

Quick start

Fastest path

If you want one working local text model quickly on a typical 16 GB+ Apple Silicon Mac:

npm install
npm start -- doctor
npm start -- pull mlx-community/Qwen3.5-9B-MLX-4bit
npm start -- start qwen3-5-9b-mlx-4bit
npm start -- expose qwen3-5-9b-mlx-4bit

Then in pi-agent, select provider athanor-mlx-qwen3-5-9b-mlx-4bit and model mlx-community/Qwen3.5-9B-MLX-4bit.

If you only want text chat, you do not need mlx_vlm.server; mlx_lm.server is enough even for many VLM-tagged repos when used as text-only models.

Full quick start

npm install

# verify external runtime binaries are on PATH
npm start -- doctor

# one-time: ingest whatever's already on disk
npm start -- scan

# see what's in the registry — if empty, this prints curated
# starter models with reviewed 8 / 16 / 32 GB memory tiers and task tags
# that you can copy the `athanor pull ...` line from
npm start -- ls

# pull one and start it (by slug)
npm start -- pull mlx-community/Qwen3.5-9B-MLX-4bit
npm start -- start qwen3-5-9b-mlx-4bit

# or drop into the TUI (no args) — the empty state has the same
# suggestions with memory tiers/task tags and pulls them inline when you press Enter
npm start

npm start runs the app via tsx, so no build step is needed for development. For a live development loop, use:

npm run dev

That runs a small custom watcher (scripts/dev-watch.mjs) which watches only src/**/*.ts and src/**/*.tsx, then respawns tsx src/index.tsx with ATHANOR_DEV_TUI=1. This avoids tsx watch's stdin/restart behavior, which can interfere with Ink/TUI key handling in tmux.

In this dev mode athanor still starts the real ingress path (router when needed, control API if enabled), so pi-agent integration behaves like the normal app, but it skips the alt-screen/cursor toggles to make UI iteration safer in split panes. The TUI also collapses to compact/minimal layouts in short or narrow terminals, so model selection stays usable in tmux splits. Router-driven model switches are reflected in the TUI by polling persisted live instance state, not only the local process's in-memory supervisor map. For one-shot runs without the dev safeguards, keep using npm start.

If you want a compiled build or to install the athanor binary globally:

npm run build        # emit dist/
npm link             # expose `athanor` on PATH
athanor ls           # now usable directly

bin/athanor imports dist/index.js, so npm link requires a prior npm run build. Linked mode does not auto-rebuild; re-run the build after pulling changes or stay on npm start for the dev loop.

Concepts

Registry

~/.athanor/models.json is the source of truth. Every model has:

| field | purpose |
| --------- | ------------------------------------------------------------- |
| id | stable canonical id (HF repo, repo+file, or local:…) |
| slug | short user-editable handle (qwen-32b) |
| path | on-disk location athanor passes to the runtime |
| runtime | mlx or llama.cpp |
| source | { type: "hf", repo, [revision], [file] } or { type: "local" } |
| port | stable port allocated once per model, never changes |
| preset | per-model overrides that merge on top of global runtime config |
| mlxFlavor | "lm" or "vlm" — picks which MLX binary to use (MLX only) |
| publish | whether pi-agent sees this model |
| piAlias | the name pi-agent uses (defaults to slug) |
| tags | free-form labels (chat, coder, …) |

Stable per-model ports

Each model is bound to a port at first ingest and keeps it forever. This means pi-agent's catalog is configured once per model; switching which model is active does not change pi's URLs, only the status field athanor writes into each entry.

Port range is configurable (portRange in ~/.athanor/config.json, default 8081–8099).
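Conceptually the allocation is just "first free port in the configured range"; a minimal sketch, with illustrative names (allocatePort, RegistryEntry) rather than athanor's actual internals:

```typescript
// Sketch of stable per-model port allocation. Each model gets the first
// free port in the range at ingest time and keeps it in the registry.
interface RegistryEntry {
  slug: string;
  port: number;
}

function allocatePort(
  existing: RegistryEntry[],
  range = { min: 8081, max: 8099 }, // mirrors the portRange default
): number {
  const used = new Set(existing.map((e) => e.port));
  for (let p = range.min; p <= range.max; p++) {
    if (!used.has(p)) return p; // first free port wins, then never changes
  }
  throw new Error("portRange exhausted; raise max in ~/.athanor/config.json");
}
```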

MLX capabilities and flavor routing

MLX entries track two independent axes:

  • mlxCapabilities — what the model advertises. Detected from the snapshot's config.json at scan and pull time, primarily by looking for a vision_config block, with fallbacks for known VLM model_type values (qwen2_vl, qwen2_5_vl, llava*, mllama, pixtral, idefics2/3, phi3_v) and architecture-name patterns such as Qwen2VLForConditionalGeneration. Today the only capability is "vlm". Capabilities are refreshed on every scan.
  • mlxFlavor — which server binary to launch. "lm" routes to mlx_lm.server (the default, no torch/torchvision required); "vlm" routes to mlx_vlm.server (requires torch + torchvision; needed for actual image input). Never set automatically — you choose with athanor flavor <slug> lm|vlm.
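The capability check described above can be sketched roughly like this; the field names follow HF config.json conventions, but the function and regex are illustrative, not athanor's exact code:

```typescript
// Illustrative VLM-capability detection from a model's config.json.
// Primary signal: a vision_config block; fallbacks: known VLM model_type
// values and architecture-name patterns.
const VLM_MODEL_TYPES =
  /^(qwen2_vl|qwen2_5_vl|llava|mllama|pixtral|idefics[23]|phi3_v)/;

function detectMlxCapabilities(config: {
  vision_config?: unknown;
  model_type?: string;
  architectures?: string[];
}): string[] {
  if (config.vision_config) return ["vlm"]; // primary signal
  if (config.model_type && VLM_MODEL_TYPES.test(config.model_type)) {
    return ["vlm"];
  }
  if (config.architectures?.some((a) => /VLForConditionalGeneration/.test(a))) {
    return ["vlm"]; // e.g. Qwen2VLForConditionalGeneration
  }
  return []; // "vlm" is the only capability today
}
```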

The split is deliberate: many VLM-tagged repos (e.g. Qwen2.5-VL, Qwen3-VL-MLX) run fine as text-only under mlx_lm.server, which is lighter, faster to load, and doesn't need a PyTorch install. Auto-routing every VLM-capable repo to mlx_vlm.server would silently break text-only workflows whenever torch isn't available. So athanor defaults everything to lm and leaves the upgrade to you.

In athanor ls and athanor show, entries with mlxFlavor: "vlm" display mlx-vlm in the runtime column; athanor show also prints a caps row and, for vision-capable entries still on lm, a hint pointing at athanor flavor <slug> vlm. In pi-agent, VLM-flavored entries render as [mlx-vlm] <slug> (athanor); the provider id stays athanor-mlx-<slug> regardless of flavor, so pi URLs don't churn if a model's flavor is toggled later.

Supervisor and policies

The runtime supervisor manages N concurrent child processes. Three policies:

  • single-active (default) — starting a model stops any others.
  • multi-active-lru — keep up to supervisor.maxConcurrent running; evict the least-recently-started.
  • manual — never auto-stop; you decide.

Children are started with detached: true, stdio redirected to ~/.athanor/logs/<slug>-<pid>.log, and unref()ed so the CLI/TUI can exit without killing them. On next launch athanor reattaches via PID.
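The detached-child pattern above is standard Node child_process usage; a hedged sketch with hypothetical helper names (logFileFor, launchDetached), not athanor's actual supervisor code:

```typescript
// Sketch of launching a runtime as a detached child with its output
// redirected to a per-process log file, so the CLI/TUI can exit freely.
import { spawn } from "node:child_process";
import { openSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

export function logFileFor(slug: string, pid: number): string {
  return join(homedir(), ".athanor", "logs", `${slug}-${pid}.log`);
}

export function launchDetached(
  cmd: string,
  args: string[],
  logPath: string,
): number {
  const fd = openSync(logPath, "a"); // per-process log file
  const child = spawn(cmd, args, {
    detached: true, // own process group; survives parent exit
    stdio: ["ignore", fd, fd], // stdout + stderr -> log file
  });
  child.unref(); // don't keep the CLI/TUI alive for this child
  return child.pid!; // persisted so a later launch can reattach by PID
}
```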

Readiness is detected by polling the runtime's health endpoint (/health for llama.cpp, /v1/models for mlx_lm.server), not by matching stdout strings.
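A minimal sketch of that polling loop, assuming the endpoints named above; the helper names and defaults (mirroring the supervisor config keys) are illustrative:

```typescript
// Readiness probe: poll the runtime's health endpoint until it answers,
// instead of matching stdout strings.
export function healthUrl(runtime: "mlx" | "llama.cpp", port: number): string {
  const path = runtime === "llama.cpp" ? "/health" : "/v1/models";
  return `http://127.0.0.1:${port}${path}`;
}

export async function waitUntilReady(
  url: string,
  timeoutMs = 120_000, // mirrors supervisor.startupTimeoutMs
  intervalMs = 500, // mirrors supervisor.healthPollIntervalMs
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(url);
      if (res.ok) return; // runtime is serving
    } catch {
      // not listening yet; keep polling
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`runtime not healthy after ${timeoutMs}ms: ${url}`);
}
```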

Observability

The TUI banner shows a second line with system CPU and RAM bars plus the 1-minute load average, refreshed once a second. Each running row in the model list gets a compact suffix CPU% · RSS · tok/s (e.g. 340% · 4.2G · 22.5 tok/s). The CLI mirrors this: athanor status adds CPU, RSS, and tok/s columns for every running instance.

Caveats worth knowing:

  • CPU% is per-core, not per-machine, matching ps and Activity Monitor. A runtime using 8 cores reads as ~800%. Divide by os.cpus().length yourself if you want a whole-machine number.
  • RSS is resident set size, not reserved allocation. On Apple Silicon's unified memory this is the honest "how much of my RAM is this model currently pinning" number.
  • tok/s is post-request, not live. Athanor does not sit in the request path — clients connect directly to each runtime's port. The number is parsed from the per-completion timing line the runtime already writes to its log (eval time = … tokens per second for llama.cpp, Generation: … tokens, … tokens-per-sec for mlx_lm / mlx_vlm), and updates once a generation finishes. While a request is streaming, the most recent completed request's rate is shown. If no completion has happened yet, the column is blank.
  • Sampling is best-effort — if ps fails, the log format changes, or timing lines are not yet present, the affected column is hidden rather than showing a wrong number.
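The tok/s extraction can be sketched like so; exact log formats vary by runtime version, so treat these regexes as illustrative rather than the shipped patterns:

```typescript
// Parse tokens/sec from a runtime log line. Two formats are handled:
//   llama.cpp: "... eval time = ... ( ..., 50.00 tokens per second)"
//   mlx_lm / mlx_vlm: "Generation: 57 tokens, 22.5 tokens-per-sec"
const LLAMA_TPS = /([\d.]+)\s+tokens per second/;
const MLX_TPS = /([\d.]+)\s+tokens-per-sec/;

export function parseTokensPerSecond(logLine: string): number | undefined {
  const m = LLAMA_TPS.exec(logLine) ?? MLX_TPS.exec(logLine);
  return m ? Number(m[1]) : undefined; // undefined -> column stays blank
}
```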

Pi-agent sync

Athanor publishes into pi-agent's custom providers system, rewriting ~/.pi/agent/models.json on every state change. The per-model provider shape described here applies when the ingress router is disabled; with the router enabled (the default), athanor instead emits the aggregator providers described under Ingress. In per-model mode:

  • Each exposed athanor model becomes its own pi provider named athanor-<runtime>-<slug> (e.g. athanor-mlx-qwen3-32b, athanor-llama-llama3-8b), with baseUrl pointing at that model's stable port. One provider per model is required because a pi provider has exactly one baseUrl and each athanor model runs on its own port. The runtime segment keeps the backing engine visible in pi's /model picker. MLX VLM entries keep the athanor-mlx-<slug> provider id (so URLs don't churn if flavor is corrected) but render as [mlx-vlm] <slug> (athanor) in pi's model list.
  • Providers whose name does not start with athanor- are preserved untouched — your OpenAI, Anthropic, Ollama, OpenRouter, etc. entries are safe.
  • Each athanor provider uses api: "openai-completions" (both mlx_lm.server and llama-server are OpenAI-compatible), a placeholder apiKey: "athanor" (required but ignored by both runtimes), and compat: { supportsDeveloperRole: false, supportsReasoningEffort: false } — the same flags pi's docs recommend for Ollama/vLLM-style local servers.
  • The provider's single model uses an id that matches exactly what the runtime was launched with, because mlx_lm.server compares the request's model field literally and falls back to a HuggingFace lookup on mismatch (see ml-explore/mlx-lm#1133). Concretely:
    • MLX HF-sourced models are launched with --model <repo> (e.g. --model mlx-community/Qwen3-32B-4bit) and the pi id is that same repo string. mlx_lm.server resolves the repo from the local HF cache with no network access.
    • MLX local models are launched with --model <path> and the pi id is that same path.
    • llama.cpp models are launched with -m <path> --alias <piAlias|slug> and the pi id is that alias. llama-server ignores the request's model field, so the alias is just what appears in /v1/models.
  • ~/.pi/agent/settings.json is only touched when an athanor model is started as the active default, at which point defaultProvider and defaultModel are set to point at it. All other settings keys (theme, compaction, etc.) are preserved.

Example exposed provider (MLX, HF-sourced):

{
  "providers": {
    "athanor-mlx-qwen3-32b": {
      "baseUrl": "http://127.0.0.1:8081/v1",
      "api": "openai-completions",
      "apiKey": "athanor",
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "mlx-community/Qwen3-32B-4bit",
          "name": "[mlx] qwen3-32b (athanor)",
          "input": ["text"],
          "contextWindow": 16384
        }
      ]
    }
  }
}

Then in pi: pi --provider athanor-mlx-qwen3-32b --model mlx-community/Qwen3-32B-4bit, or just select it from /model.

Disable athanor's sync entirely with "enablePiSync": false in ~/.athanor/config.json.
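The preserve-everything-foreign merge rule can be sketched as follows; mergeCatalog is an illustrative name, not athanor's API:

```typescript
// Namespaced catalog merge: providers not prefixed "athanor-" pass through
// untouched; stale athanor-* entries are dropped and fresh ones written.
type Providers = Record<string, unknown>;

export function mergeCatalog(existing: Providers, athanor: Providers): Providers {
  const merged: Providers = {};
  for (const [name, p] of Object.entries(existing)) {
    if (!name.startsWith("athanor-")) merged[name] = p; // foreign providers are safe
  }
  return { ...merged, ...athanor };
}
```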

CLI reference

athanor                          launch the TUI
athanor scan                     rescan model dirs and update registry
athanor ls                       list registry entries (with live status)
athanor status                   list running instances
athanor show     <id|slug>       inspect a model: runtime, effective config, launch command
athanor start    <id|slug>       start a model
athanor stop     [<id|slug>|--all]  stop one or all
athanor restart  <id|slug>       stop + start
athanor logs     <id|slug> [-n N] tail last N lines of a running model's log
athanor pull     <repo> [--file F] [--revision R]
                                 download from HuggingFace and register
athanor search   [q] [--mlx|--gguf|--any] [--author A] [--sort S] [--limit N]
                                 search the HuggingFace Hub
athanor trending [--mlx|--gguf] [--limit N]
                                 top trending MLX/GGUF models
athanor preset   <slug> show|set k=v...|unset k...|clear|apply <recipe>
                                 view or modify a model's preset
athanor recipes                  list built-in + user recipes and tunable keys
athanor flavor   <slug> lm|vlm   force MLX runtime flavor (lm = mlx_lm, vlm = mlx_vlm)
athanor expose    <id|slug>      include in pi-agent catalog
athanor hide      <id|slug>      remove from pi-agent catalog
athanor rm       <id|slug>       remove from registry (must be stopped)
athanor sync                     manually rewrite pi catalog
athanor config                   print resolved config and its path
athanor doctor                   check that required binaries are on PATH and show installed versions
athanor doctor --check-updates   also compare installed versions with latest available and print upgrade hints

<id|slug> accepts either the canonical id or the short slug.

TUI key bindings

| key | action |
| ---------- | --------------------------------------------------- |
| ↑ / ↓ / wheel | move selection (one entry per wheel notch) |
| ⏎ | start (if idle) / stop (if running) the highlighted model |
| r | restart the highlighted model |
| k | kill the highlighted model |
| P | toggle pi-agent visibility (expose/hide) |
| d | remove the highlighted entry from the registry |
| D | open the downloads modal |
| s | rescan and ingest new models (automatic on start and when the HF cache changes) |
| p | open the pull modal (esc cancels in progress) |
| e | open the preset editor for the highlighted model |
| / | filter the list by substring of slug or id |
| tab | hide the model selector and expand the log pane; press again to restore |
| q | quit (does not stop running models) |

The downloads modal shows queued/running/completed pulls. Inside it: ↑↓ selects a task, c cancels the selected running task, C clears finished tasks, and esc closes the modal.

With the selector hidden (tab), the log pane grows to fill the space and the arrow keys switch roles:

| key | action |
| --------------------- | ---------------------------------------------------- |
| mouse wheel | scroll the log (3 lines per notch) |
| ↑ / ↓ | scroll the log one line at a time |
| PgUp / PgDn | scroll by half a page |
| g / Home | jump to the top of the buffer |
| G / End | jump back to the tail (resumes live follow) |

When scrolled up, the header shows +N ↑ paused and the log stops auto-updating until you return to the tail. All other keys (r, k, ⏎, tab) still act on the model you had selected before hiding the list.

Mouse reporting is enabled only while the TUI is running (SGR mode, \x1b[?1000h\x1b[?1006h) and disabled on exit, including on uncaught exceptions and SIGTERM/SIGHUP. While it's active, click-and-drag text selection in some terminals (iTerm2, Terminal.app) requires holding ⌥/Alt; copy-on-select typically still works. If the process is killed with SIGKILL, nothing can reset the terminal — run reset or relaunch the TUI to restore it.

The bottom pane continuously tails the log file of whichever model is highlighted. When any model enters the running set — whether you started it, restarted it, or the router auto-started it on an incoming request — the cursor jumps to it so its logs appear immediately; if an active filter hides the newcomer, the filter is cleared.

Press tab for a full-screen log view that shows model details (id, runtime, port, path), live instance telemetry (pid, uptime, CPU, RSS, tok/s), and a larger log tail. It honors the same cursor-follows-active behavior, so a router-driven model swap auto-switches the view to the new active model. tab again returns to the split list+log layout.

Models downloaded out-of-band (hf download in another terminal, or pulled while the TUI was closed) are picked up automatically: every TUI start runs a scan, and while the TUI is running an fs.watch on modelDirs.mlx / modelDirs.llama debounces cache changes into an incremental ingestDiscovered call. New entries toast in the footer as +N new: <slug>…. Pressing s still works as an explicit rescan.

Configuration

~/.athanor/config.json. Missing fields fall back to these defaults:

{
  "portRange": { "min": 8081, "max": 8099 },
  "enablePiSync": true,
  "modelDirs": {
    "mlx": "~/.cache/huggingface/hub",
    "llama": "~/.models"
  },
  "mlx": {
    "prefillStepSize": 512,
    "promptCacheSize": 32768,
    "decodeConcurrency": 1
  },
  "llama": {
    "nGpuLayers": 999,
    "threads": 8,
    "ctxSize": 32768,
    "batchSize": 512,
    "ubatchSize": 256,
    "parallel": 1
  },
  "supervisor": {
    "policy": "single-active",
    "maxConcurrent": 1,
    "startupTimeoutMs": 120000,
    "healthPollIntervalMs": 500
  },
  "controlApi": {
    "enabled": false,
    "port": 8079,
    "host": "127.0.0.1"
  },
  "router": {
    "enabled": true,
    "port": 8080,
    "host": "127.0.0.1",
    "drainTimeoutMs": 30000
  }
}

Per-model presets

mlx and llama above are global defaults. Athanor now ships a practical 32K default context baseline; built-in recipes scale that up or down by use case. Any model in the registry can override the globals with its preset field, which is merged on top per-runtime. Manage presets via the CLI (preferred) or the TUI (press e on a highlighted model):

# inspect effective config, launch command, and running state
athanor show qwen-32b

# set / unset individual fields — kebab-case and camelCase both work
athanor preset qwen-32b set ctx-size=32768 nGpuLayers=48
athanor preset qwen-32b unset ctx-size
athanor preset qwen-32b clear

# apply a named recipe for the model's runtime
athanor preset qwen-32b apply coding

# list built-in + user recipes and every tunable key per runtime
athanor recipes

Built-in recipes: balanced, fast, quality, long-context, coding.

  • balanced — recommended default, 32K context
  • fast — lower latency, 8K context
  • quality — larger 32K context for more stable long reasoning
  • coding — 32K context for multi-file and agent workflows
  • long-context — 64K context, higher memory use

balanced is an explicit preset recipe; clearing a preset is a separate action (athanor preset <slug> clear). Drop your own into ~/.athanor/recipes.json (a plain list or { "recipes": [...] }); user recipes override built-ins of the same name.

Presets survive re-scans: athanor scan only refreshes path, sizeBytes, and — for MLX — mlxCapabilities. Everything else is left alone. athanor ls marks tuned models with [tuned].

Under the hood, a preset looks like this in ~/.athanor/models.json — you can edit it directly if you prefer:

{
  "id": "mlx-community/Qwen2.5-32B-Instruct-4bit",
  "slug": "qwen-32b",
  "preset": {
    "runtime": "mlx",
    "mlx": { "decodeConcurrency": 1, "prefillStepSize": 512, "promptCacheSize": 32768 }
  }
}

Restart the model for the preset to take effect.

pi-agent receives the model's effective served context window from athanor's merged runtime configuration (global defaults plus any per-model preset), so pi metadata matches the actual launch settings rather than only explicit override fields.
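The merge itself is a shallow per-runtime spread, global defaults first, preset on top; a sketch with an illustrative llama.cpp config shape:

```typescript
// Effective config = global runtime defaults overridden by the model's
// preset; keys the preset omits fall back to the globals.
type LlamaConfig = { ctxSize: number; nGpuLayers: number; threads: number };

export function effectiveLlamaConfig(
  globals: LlamaConfig,
  preset?: Partial<LlamaConfig>,
): LlamaConfig {
  return { ...globals, ...preset }; // preset keys win
}
```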

Environment variables

  • ATHANOR_HOME — overrides ~/.athanor. Useful for running multiple profiles side by side or for tests.
  • PI_HOME — overrides ~/.pi.

Finding new models

Browse the Hub without leaving the terminal. Both commands query https://huggingface.co/api/models with MLX/GGUF tag filters and print a grouped, readable list. No auth is required for public models.

# free-text search, both runtimes (default)
athanor search qwen

# restrict to one runtime
athanor search coder --mlx
athanor search llama --gguf

# by author and sort key
athanor search --author mlx-community --sort downloads --limit 30
athanor search --author bartowski --gguf --sort likes

# what's hot right now (sorts by HF's trendingScore)
athanor trending
athanor trending --mlx --limit 15

Supported sorts: downloads (default), likes, trending, modified, size. Each row shows download count, likes, license, and a relative last-modified time.

Search is intentionally biased toward athanor's actual domain: the Hub query asks for pipeline_tag=text-generation, and athanor also prunes obvious non-LLM tasks client-side (ASR, TTS, feature-extraction, image-generation, etc.). The goal is to surface local text-generation candidates for MLX / llama.cpp, not to behave like a general-purpose Hugging Face browser.
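Under stated assumptions about the public huggingface.co/api/models query interface, the request construction looks roughly like this; buildSearchUrl and the exact parameter mapping are illustrative:

```typescript
// Build a Hub search URL biased toward local text-generation candidates,
// per the behavior described above.
export function buildSearchUrl(opts: {
  q?: string;
  runtime?: "mlx" | "gguf";
  author?: string;
  sort?: string;
  limit?: number;
}): string {
  const params = new URLSearchParams({ pipeline_tag: "text-generation" });
  if (opts.q) params.set("search", opts.q);
  if (opts.runtime) params.set("filter", opts.runtime); // tag filter: mlx or gguf
  if (opts.author) params.set("author", opts.author);
  params.set("sort", opts.sort ?? "downloads"); // default sort
  params.set("limit", String(opts.limit ?? 20));
  return `https://huggingface.co/api/models?${params}`;
}
```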

The footer hints at the follow-up:

→ athanor pull <repo>                 # MLX: downloads the whole repo
→ athanor pull <repo> --file F.gguf   # GGUF: pick one file

HuggingFace pull

athanor pull mlx-community/Qwen2.5-7B-Instruct-4bit
athanor pull bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

What happens:

  1. GET https://huggingface.co/api/models/<repo> is used to list siblings and decide runtime.
    • any .gguf sibling → llama.cpp (you must specify --file if more than one exists)
    • tags: ["mlx"] or mlx in the repo id → mlx
    • .safetensors only, nothing matching MLX → fallback to mlx
  2. hf download is invoked, output streamed.
  3. On success, a registry entry is created with publish: true, a fresh port from portRange, and piAlias: slug.
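Step 1's runtime decision can be sketched as follows; decideRuntime is an illustrative name for logic that lives in athanor's pull path:

```typescript
// Decide which runtime a repo targets, from its file list, tags, and id.
export function decideRuntime(
  siblings: string[], // filenames from the HF API response
  tags: string[],
  repoId: string,
): "llama.cpp" | "mlx" {
  if (siblings.some((f) => f.endsWith(".gguf"))) return "llama.cpp";
  if (tags.includes("mlx") || /mlx/i.test(repoId)) return "mlx";
  return "mlx"; // .safetensors-only, nothing matching MLX -> fallback to mlx
}
```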

Cancellation is safe: press Ctrl-C during athanor pull (or Esc in the TUI pull modal) and athanor SIGTERMs the hf child — escalating to SIGKILL after 3s if it ignores the signal — and exits with code 130. No registry entry is written on abort, so you can re-run the same pull later without cleanup.
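The escalation is a classic SIGTERM-then-SIGKILL timer; a sketch against a minimal child shape, with terminate as a hypothetical name:

```typescript
// Graceful cancellation: SIGTERM first, SIGKILL if the child ignores it
// past the grace period.
interface KillableChild {
  killed: boolean;
  kill(signal: "SIGTERM" | "SIGKILL"): void;
}

export function terminate(child: KillableChild, graceMs = 3000): void {
  child.kill("SIGTERM"); // polite first
  const timer = setTimeout(() => {
    if (!child.killed) child.kill("SIGKILL"); // escalate if ignored
  }, graceMs);
  // Don't keep the CLI alive just for the escalation timer (Node only).
  (timer as unknown as { unref?(): void }).unref?.();
}
```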

Control API (optional)

When controlApi.enabled is true, athanor exposes a small local HTTP server (default 127.0.0.1:8079) that other tools can drive:

GET  /status                     running instances + registry summary
POST /activate      { "id": "<id|slug>" }   start a model (respects supervisor policy)
POST /deactivate    { "id": "<id|slug>" }   stop a model

This is off by default. Enable it only on trusted machines.

Ingress

Athanor exposes an OpenAI-compatible ingress (default 127.0.0.1:8080) that fronts every exposed model on a single port. Pi-agent sees up to two providers — athanor-mlx and athanor-llama — both pointing at that ingress, each listing only models of its runtime. The split exists because pi's per-provider compat flags differ between engines (mlx_lm/vlm don't accept the developer role; llama-server does), and it also makes it obvious in pi's /model picker which backend is serving a given request. Switching models inside pi becomes a normal "different model field in the request body" swap, and athanor starts the target on demand (respecting supervisor policy) before proxying the request.

Ingress lifecycle follows active model serving state rather than the foreground TUI. When athanor is open it ensures ingress availability; when active models remain after the UI exits, the detached ingress companion stays up; when the last model stops, the detached ingress companion stops too. This lets you start a model, close the TUI, and keep pi-agent connectivity until you stop or switch models. Reopening the TUI later reattaches to the same detached runtime/ingress state and reflects ingress-driven model switches from persisted instance state.

GET  /health                                200 OK
GET  /v1/models                             synthesised list of exposed models
POST /v1/chat/completions  { "model": ... } activate + proxy (SSE streamed through)
POST /v1/completions       { "model": ... } same
POST /v1/embeddings        { "model": ... } same

The ingress config lives under router in ~/.athanor/config.json for backward compatibility:

{ "router": { "enabled": true, "port": 8080, "host": "127.0.0.1", "drainTimeoutMs": 30000 } }

By default, pi sync emits the ingress-backed aggregator providers (not per-model providers). If you've exposed only MLX models you'll see athanor-mlx alone; only GGUF, just athanor-llama. The model field in requests may be the runtime's model id (the HF repo for MLX, the launch alias for llama.cpp), the athanor slug, or the canonical id; all three are resolved. Unknown models return 404.
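The three-way resolution can be sketched as follows; resolveModel and the simplified entry shape are illustrative:

```typescript
// Resolve an incoming request's `model` field against the registry: it may
// be the runtime's model id, the athanor slug, or the canonical id.
interface Entry {
  id: string; // canonical id
  slug: string; // short handle
  runtimeModelId: string; // what the runtime was launched with
}

export function resolveModel(
  entries: Entry[],
  requested: string,
): Entry | undefined {
  return entries.find(
    (e) =>
      e.runtimeModelId === requested ||
      e.slug === requested ||
      e.id === requested,
  ); // undefined -> the ingress answers 404
}
```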

For users who don't want to keep the TUI open, athanor router runs the ingress server in the foreground and blocks on Ctrl-C:

athanor router                       # uses config.router.host / .port
athanor router --port 9000           # override
athanor router --host 0.0.0.0 --port 8080

The subcommand ignores router.enabled — invoking it is itself the opt-in — but it still respects 127.0.0.1 as the default host.

Caveats:

  • Cold start. First request on an idle model blocks until the runtime is healthy (often 10–60s for large MLX). No keepalive is injected into the stream — make sure your client's timeout is generous.
  • In-flight safety. athanor stop on a currently-streaming model waits briefly (up to router.drainTimeoutMs, default 30s) for open proxied streams to finish before SIGTERM; past that, the runtime is terminated and in-flight responses are cut.
  • Listen posture. Same as the control API: 127.0.0.1 only, no auth.

Troubleshooting

  • athanor start hangs or times out. Check ~/.athanor/logs/<slug>-<pid>.log. Most startup failures are the runtime itself complaining (missing weights, wrong quant, out of memory). Raise supervisor.startupTimeoutMs for very large models.
  • port already in use. Another process is on the model's stable port. Either stop it, or edit the entry's port in ~/.athanor/models.json and restart.
  • Pi-agent can't see a new model. Make sure it's exposed (CLI: athanor expose <slug>) and run athanor sync. Confirm ~/.pi/agent/models.json contains the expected athanor provider shape (per-model when router is off, athanor-mlx / athanor-llama aggregators when router is on), then open /model in pi (the file reloads on open).
  • Models from other tools disappeared from pi. They shouldn't — athanor only rewrites providers whose name starts with athanor-. If this happens, open an issue with the before/after of ~/.pi/agent/models.json.
  • Stale PID / router state. If a child or detached router crashed without athanor noticing, reopening athanor or running athanor sync / athanor status will reconcile persisted state and clear dead router metadata opportunistically. If a model port is still held, run athanor stop <slug> (a no-op when nothing is live) then athanor start <slug>.
  • doctor reports a missing binary. Install mlx_lm, mlx_vlm, llama.cpp, or huggingface_hub, or adjust your shell's PATH. mlx_vlm.server is only needed if you plan to run VLM models; athanor start on a VLM entry will fail with a clear error if it's missing.
  • doctor --check-updates reports update available. Follow the printed one-line hint. Today the built-in hints cover uv-managed Python tools (uv tool upgrade mlx-lm, uv tool upgrade mlx-vlm, uv tool upgrade hf) and Homebrew's llama.cpp formula (brew upgrade llama.cpp).

Development

npm install
npx tsc --noEmit      # typecheck
npm run test:run      # vitest run (one shot)
npm test              # vitest (watch)
npm run build         # tsc -> dist/

Tests redirect ATHANOR_HOME and PI_HOME to per-run temporary directories via test/setup.ts, so running the suite never touches your real config.

Layout

src/
  adapters/     # mlx (lm + vlm) + llama.cpp command builders and health probes
  cli/          # hand-rolled CLI dispatcher, doctor, output formatting
  config/       # config file load + defaults
  control/      # optional HTTP control API (off by default)
  discovery/    # HF cache scanner + registry ingest (MLX capability detection lives here)
  presets/      # preset merge, tunable-key metadata, recipes
  pull/         # HuggingFace repo inspection and download
  registry/     # models.json CRUD, slug + port allocation
  search/       # HuggingFace Hub search + trending
  supervisor/   # detached process lifecycle, policy, reattach
  sync/         # namespaced pi-agent catalog merge
  ui/           # Ink TUI (list, pull modal, preset editor)
  types/        # shared types

License

Copyright 2026 Myles Borins.

Licensed under the Apache License, Version 2.0. See LICENSE for the full text, or http://www.apache.org/licenses/LICENSE-2.0 for the canonical copy.