@pioneur/llama-launch v0.1.6

Launch Claude Code, Codex, and similar coding agents against local llama.cpp models.
# llama-launch

Ollama-style launch control for coding-agent harnesses on top of local `llama.cpp`.

llama-launch is for the case where you want to keep the provider UX but replace the hosted model with a local runtime. It starts the compatible backend, exposes the protocol shim the harness expects, and then launches the harness from your current project directory.
```sh
llama claude gemma4:31b
llama codex gemma4:31b
```

## Overview
llama-launch gives you a short CLI for agent runtimes:
```sh
llama claude gemma4:31b
llama codex gemma4:31b
llama launch codex ggml-org/gemma-4-31b-it-GGUF:Q4_K_M
```

It handles:
- starting `llama-server` when needed
- reusing managed `llama-server` backends across runs
- exposing a local gateway for Anthropic Messages, OpenAI Responses, and OpenAI Chat
- launching the selected harness with the right environment variables
- `npm install` for use inside any project
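Once a backend is up, the gateway endpoints are plain HTTP. The sketch below builds a standard OpenAI Chat Completions request body for the gateway's chat path; the base URL is an assumption for illustration (check `llama backend-status` for the real address):

```python
import json

# Hypothetical gateway address -- the real one depends on your launch config.
GATEWAY_BASE = "http://127.0.0.1:8080"

def chat_request_body(model: str, prompt: str) -> str:
    """Build an OpenAI Chat Completions request body for the local gateway."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    })

body = chat_request_body("gemma4:31b", "Say hello.")
# POST `body` to f"{GATEWAY_BASE}/v1/chat/completions" with any HTTP client.
```

Because the wire format is the standard one, any OpenAI-compatible client library can also point at the gateway directly.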
## Why this exists
Use llama-launch when you want local models to behave more like hosted coding providers.
Examples:
- run Claude Code against a local `gemma4:31b` backend
- run Codex against a local `llama.cpp` runtime
- keep the harness inside the current project while the model runtime is managed separately
## Install
Global install:
```sh
npm install -g @pioneur/llama-launch
llama claude gemma4:31b
```

Inside a project:
```sh
npm install --save-dev @pioneur/llama-launch
npx llama claude gemma4:31b
```

Local path install while developing:
```sh
npm install --save-dev /path/to/llama-launch
npx llama codex gemma4:31b
```

The npm package bootstraps its own private Python virtualenv during install, so users do not need to install the Python package manually first.
## Requirements

- Node.js 20+
- npm 10+
- Python 3
- `llama-server` on `PATH`
- the target harness installed if you want to launch it directly
Provider binaries:
- `claude` for `llama claude ...`
- `codex` for `llama codex ...`
- `opencode` for `llama opencode ...`
- `openclaw` for `llama openclaw ...`
- `hermes` for `llama hermes ...`
- `pi` for `llama pi ...`
If a provider binary is not on `PATH`, you can override it with `--provider-bin`.
## Quick start
Dry-run the resolved launch plan:
```sh
llama claude gemma4:31b --dry-run --output json
```

Run Claude against the preferred local profile:
```sh
llama claude gemma4:31b \
  --provider-arg=--print \
  --provider-arg=--output-format \
  --provider-arg=json \
  --provider-arg=--dangerously-skip-permissions \
  --provider-arg='Use the Bash tool to run "pwd" and answer with the working directory only.'
```

By default, llama-launch uses a compact terminal UI and keeps backend and gateway logs quiet. The default terminal behavior is meant to feel closer to a hosted provider run: tool activity plus the final answer, without the raw transport chatter. To see the unfiltered provider and backend output, use:
```sh
llama claude gemma4:31b --ui raw
```

First-run downloads from Hugging Face can take several minutes for large models. llama-launch waits up to 30 minutes for backend startup by default. If needed, override that with:
```sh
export LLAMA_LAUNCH_BACKEND_STARTUP_TIMEOUT_SECONDS=3600
```

Resident backend behavior:
- the first run starts a managed `llama-server`
- later runs on the same model and port reuse that backend instead of reloading the model
- inspect running managed backends with `llama ps`
- stop one explicitly with `llama stop --base-url http://127.0.0.1:8080`
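A quick way to see when a managed backend has finished loading is to poll `llama-server`'s `GET /health` endpoint, which returns 200 once the model is ready. A minimal sketch (the base URL comes from `llama ps`; this is not llama-launch's own readiness code):

```python
import time
import urllib.request
import urllib.error

def wait_for_backend(base_url: str, timeout: float = 30.0) -> bool:
    """Poll the backend's /health endpoint until it answers or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True  # model loaded and serving
        except (urllib.error.URLError, OSError):
            pass  # backend not up yet (or still loading the model)
        time.sleep(0.5)
    return False
```

This is mainly useful in scripts that want to gate work on backend readiness instead of relying on the launcher's own startup wait.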
Run Codex:
```sh
llama codex gemma4:31b
```

Run against a local GGUF:

```sh
llama codex ./models/gemma4-31b.gguf
```

Run against a Hugging Face GGUF repo:

```sh
llama codex ggml-org/gemma-4-31b-it-GGUF:Q4_K_M
```

## Install by provider
Claude:
```sh
npm install -g @pioneur/llama-launch
llama claude gemma4:31b
```

Codex:

```sh
npm install -g @pioneur/llama-launch
llama codex gemma4:31b
```

Pi:

```sh
npm install -g @pioneur/llama-launch
llama pi gemma4:31b
```

Project-local install:

```sh
npm install --save-dev @pioneur/llama-launch
npx llama claude gemma4:31b
```

## Model selection
The model selector is interpreted in three ways:
- local `.gguf` path: uses `llama-server -m`
- `owner/repo[:quant]`: uses `llama-server -hf`
- runtime model id: assumes a compatible `llama.cpp` backend is already serving it
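The three rules can be sketched as a small classifier (an illustrative heuristic only, not llama-launch's actual resolver):

```python
def classify_selector(selector: str) -> str:
    """Classify a model selector per the three documented forms."""
    if selector.endswith(".gguf"):
        return "local-gguf"        # served via `llama-server -m <path>`
    if "/" in selector:
        return "hf-repo"           # served via `llama-server -hf owner/repo[:quant]`
    return "runtime-model-id"      # assume a compatible backend already serves it
```

For example, `classify_selector("gemma4:31b")` falls through to the runtime-model-id case because it is neither a `.gguf` path nor an `owner/repo` reference.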
Examples:
```sh
llama codex ./models/qwen2.5-coder-7b.gguf
llama codex bartowski/Qwen2.5-Coder-7B-GGUF
llama codex ggml-org/gemma-4-31b-it-GGUF:Q4_K_M
llama claude gemma4:31b
```

The same selection can also be passed with `--model`:
```sh
llama claude --model gemma4:31b
llama codex --model ./models/model.gguf
```

## Choosing a model
Use the selector that matches how you want to source the model:
- local GGUF file when you already have a model downloaded on disk
- Hugging Face repo when you want `llama-server` to resolve the model from a GGUF repository
- runtime model id when your local backend already exposes a model name you want to target
Known aliases such as `gemma4:31b` can also auto-resolve to a built-in profile and start the matching backend for you.
Examples by source:
```sh
# Local GGUF
llama codex ./models/gemma4-31b.gguf

# Hugging Face GGUF repo
llama codex bartowski/Qwen2.5-Coder-7B-GGUF

# Existing runtime model id
llama claude gemma4:31b
```

Examples by harness:
```sh
# Claude-oriented local profile
llama claude gemma4:31b

# Codex against a local GGUF
llama codex ./models/qwen2.5-coder-7b.gguf

# Codex against a Hugging Face GGUF repo
llama codex ggml-org/gemma-4-31b-it-GGUF:Q4_K_M
```

Current provider constraints:
- `claude` is intentionally stricter and is aimed at stronger local tool-use profiles, with `gemma4:31b` as the primary supported target
- `codex` is more permissive and can be used with a wider range of local `llama.cpp` models
Known-good starting points:
- `claude`: `gemma4:31b`
- `codex`: `gemma4:31b` or a local coder-focused GGUF
- `opencode`: any solid OpenAI-compatible local model exposed through `llama.cpp`
- `openclaw`: experimental; start with stronger instruction-following models
- `hermes`: OpenAI-compatible local models are the intended path
- `pi`: OpenAI-compatible local models are the intended path
## What gets launched

When you run `llama claude gemma4:31b`, the launcher resolves the model selector, brings up the local backend if needed, exposes the right gateway shape, and then starts the provider CLI with the expected environment variables.
For Claude, that means Anthropic Messages compatibility on top of a local llama.cpp model. For Codex, that means the OpenAI-style path it expects.
## Provider support
Current harness support level:
- `claude`: supported through the Anthropic Messages adapter, tuned for `gemma4:31b`
- `codex`: supported through the OpenAI Responses adapter
- `opencode`: supported through the OpenAI Chat path
- `hermes`: supported through the OpenAI Chat path
- `pi`: supported through the OpenAI Chat path
- `openclaw`: experimental
The launcher ships the protocol shims and process orchestration. Actual runtime quality still depends on the selected model and the provider CLI's tolerance for non-hosted backends.
## Security and privacy
What the npm package contains:
- the CLI wrapper
- the Python gateway/runtime
- the README and license
What it does not publish:
- your local models
- your `~/.claude` directory
- local API keys or tokens
- shell history, logs, or project files
- the private virtualenv created during npm install
At runtime, llama-launch passes environment variables to the selected provider process so the provider can talk to the local gateway. That is local process behavior, not published package content.
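In sketch form, that runtime behavior looks like the following. The gateway URL here is a placeholder; `ANTHROPIC_BASE_URL` is the variable the Claude path documents (see the Claude compatibility section):

```python
import os
import subprocess
import sys

def launch_provider(argv, gateway_url):
    """Start a provider CLI with the gateway address in its environment.

    The child inherits the parent environment plus the gateway variable;
    nothing is written to disk or to the published package.
    """
    env = {**os.environ, "ANTHROPIC_BASE_URL": gateway_url}
    return subprocess.run(argv, env=env, capture_output=True, text=True)

# Demonstrate with a child Python process standing in for the provider CLI.
result = launch_provider(
    [sys.executable, "-c", "import os; print(os.environ['ANTHROPIC_BASE_URL'])"],
    "http://127.0.0.1:8080",
)
print(result.stdout.strip())  # prints http://127.0.0.1:8080
```

The key point is scope: the variable exists only in the child process's environment for the duration of that run.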
## Vault-backed secrets
If you want one authenticated service to broker credentials at runtime, this repo includes a Bitwarden Secrets Manager wrapper.
Files:
- `scripts/run-with-vault.sh`
- `scripts/publish-with-vault.sh`
- `.vault-secrets.example.json`
How it works:
- the machine account authenticates once through `BWS_ACCESS_TOKEN`
- `run-with-vault.sh` looks up a requested env var name in a local secret-id map
- it fetches the live secret value from Bitwarden with `bws`
- it exports that env var only for the target subprocess
- the target command runs without storing the downstream secret in the repo
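The same flow in sketch form, with the `bws` lookup stubbed out as an injected function so the example runs without a Bitwarden account:

```python
import os
import subprocess
import sys

def run_with_secret(var_name, command, secret_map, fetch_secret):
    """Map an env var name to a secret id, fetch the live value, and
    expose it only to the child process -- mirroring run-with-vault.sh."""
    secret_id = secret_map[var_name]          # local secret-id map lookup
    value = fetch_secret(secret_id)           # stands in for the `bws` call
    child_env = {**os.environ, var_name: value}
    return subprocess.run(command, env=child_env, capture_output=True, text=True)

# A child Python process stands in for the target command (e.g. npm whoami).
secret_map = {"NPM_TOKEN": "dummy-secret-id"}
out = run_with_secret(
    "NPM_TOKEN",
    [sys.executable, "-c", "import os; print(os.environ['NPM_TOKEN'])"],
    secret_map,
    fetch_secret=lambda sid: f"value-for-{sid}",
)
```

The design point: the secret value only ever lives in the child's environment, never in the repo or the parent shell.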
Setup:
```sh
cp .vault-secrets.example.json .vault-secrets.json
mkdir -p ~/.config/llama-launch
mv .vault-secrets.json ~/.config/llama-launch/vault-secrets.json
```

Populate `~/.config/llama-launch/vault-secrets.json` with real Bitwarden secret IDs:
```json
{
  "NPM_TOKEN": "your-secret-id",
  "ANTHROPIC_API_KEY": "your-secret-id",
  "OPENAI_API_KEY": "your-secret-id"
}
```

Make the machine account token available:
```sh
export BWS_ACCESS_TOKEN='your-machine-account-token'
```

Examples:
```sh
./scripts/run-with-vault.sh NPM_TOKEN -- npm whoami
./scripts/publish-with-vault.sh
./scripts/run-with-vault.sh ANTHROPIC_API_KEY -- claude --print 'hello'
```

By default the wrapper looks for the secret-id map in:
1. `LLAMA_LAUNCH_VAULT_CONFIG`
2. `./.vault-secrets.json`
3. `~/.config/llama-launch/vault-secrets.json`
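A sketch of that search order (illustrative; the actual lookup lives in the shell wrapper):

```python
import os
from pathlib import Path

def resolve_vault_config():
    """Return the first existing secret-id map, per the documented order."""
    candidates = []
    override = os.environ.get("LLAMA_LAUNCH_VAULT_CONFIG")
    if override:
        candidates.append(Path(override))      # 1. explicit override
    candidates += [
        Path(".vault-secrets.json"),           # 2. project-local map
        Path.home() / ".config" / "llama-launch" / "vault-secrets.json",  # 3. user config
    ]
    return next((p for p in candidates if p.is_file()), None)
```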
## Claude profile
The intended local Claude profile is:
```sh
llama claude gemma4:31b
```

That profile is tuned around:
- `ggml-org/gemma-4-31b-it-GGUF:Q4_K_M`
- Anthropic Messages compatibility
- local tool use as the primary target
Smaller Gemma profiles such as `gemma3:1b` are intentionally rejected for Claude launch because they are not reliable enough for Claude-style tool workflows.
## Claude compatibility
The Claude Code path includes:
- `ANTHROPIC_BASE_URL` pointed at the gateway root
- `ANTHROPIC_CUSTOM_MODEL_OPTION` for local model ids
- `CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1` for proxy mode
- `/v1/messages`
- `/v1/messages/count_tokens`
- streamed `tool_use` translation over a chat-completions backend
- `--bare` by default to avoid unrelated user plugins and hooks interfering with local-model runs
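For reference, the request shape the `/v1/messages` path accepts is the standard Anthropic Messages body with the local model id in the `model` field (values below are illustrative only):

```python
import json

# Standard Anthropic Messages request body; the gateway translates this
# onto the local chat-completions backend, including streamed tool_use.
payload = {
    "model": "gemma4:31b",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Use the Bash tool to run pwd."}
    ],
}
body = json.dumps(payload)
# POST `body` to <gateway>/v1/messages.
```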
## Commands

```sh
llama protocols
llama harnesses
llama backend-status
llama server-command --hf-model ggml-org/gemma-4-31b-it-GGUF:Q4_K_M
llama serve
llama launch codex gemma4:31b
llama claude gemma4:31b
llama codex gemma4:31b
```

If you prefer the long Python form during development:
```sh
python -m llama_launch.cli claude gemma4:31b
```

## Testing
Run the automated suite:
```sh
.venv/bin/python -m unittest discover -s tests -v
```

Run the opt-in real Claude probe:
```sh
RUN_REAL_CLAUDE_PROBE=1 .venv/bin/python -m unittest \
  tests.test_runtime_e2e.RuntimeE2ETests.test_real_claude_completes_bash_tool_roundtrip_through_gateway -v
```

## Status
Current focus:
- Claude Code
- Codex
- OpenCode
- OpenClaw
- Hermes
- Pi
The transport and launch stack is in place. Real-world harness quality still depends on the underlying model’s tool-use and instruction-following capability.
