@pioneur/llama-launch v0.1.6

Launch Claude Code, Codex, and similar coding agents against local llama.cpp models.
# llama-launch

Ollama-style launch control for coding-agent harnesses on top of local `llama.cpp`.

llama-launch is for the case where you want to keep the provider UX but replace the hosted model with a local runtime. It starts the compatible backend, exposes the protocol shim the harness expects, and then launches the harness from your current project directory.
```sh
llama claude gemma4:31b
llama codex gemma4:31b
```

## Overview
llama-launch gives you a short CLI for agent runtimes:
```sh
llama claude gemma4:31b
llama codex gemma4:31b
llama launch codex ggml-org/gemma-4-31b-it-GGUF:Q4_K_M
```

It handles:
- starting `llama-server` when needed
- reusing managed `llama-server` backends across runs
- exposing a local gateway for Anthropic Messages, OpenAI Responses, and OpenAI Chat
- launching the selected harness with the right environment variables
- `npm install` for use inside any project
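Once a backend is up, the gateway endpoints are plain HTTP. The sketch below builds a standard OpenAI Chat Completions request body for the gateway's chat path; the base URL is an assumption for illustration (check `llama backend-status` for the real address):

```python
import json

# Hypothetical gateway address -- the real one depends on your launch config.
GATEWAY_BASE = "http://127.0.0.1:8080"

def chat_request_body(model: str, prompt: str) -> str:
    """Build an OpenAI Chat Completions request body for the local gateway."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    })

body = chat_request_body("gemma4:31b", "Say hello.")
# POST `body` to f"{GATEWAY_BASE}/v1/chat/completions" with any HTTP client.
```

Because the wire format is the standard one, any OpenAI-compatible client library can also point at the gateway directly.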
## Why this exists
Use llama-launch when you want local models to behave more like hosted coding providers.
Examples:
- run Claude Code against a local `gemma4:31b` backend
- run Codex against a local `llama.cpp` runtime
- keep the harness inside the current project while the model runtime is managed separately
## Install
Global install:
```sh
npm install -g @pioneur/llama-launch
llama claude gemma4:31b
```

Inside a project:
```sh
npm install --save-dev @pioneur/llama-launch
npx llama claude gemma4:31b
```

Local path install while developing:
```sh
npm install --save-dev /path/to/llama-launch
npx llama codex gemma4:31b
```

The npm package bootstraps its own private Python virtualenv during install, so users do not need to install the Python package manually first.
## Requirements

- Node.js 20+
- npm 10+
- Python 3
- `llama-server` on `PATH`
- the target harness installed if you want to launch it directly
Provider binaries:
- `claude` for `llama claude ...`
- `codex` for `llama codex ...`
- `opencode` for `llama opencode ...`
- `openclaw` for `llama openclaw ...`
- `hermes` for `llama hermes ...`
- `pi` for `llama pi ...`
If a provider binary is not on `PATH`, you can override it with `--provider-bin`.
## Quick start
Dry-run the resolved launch plan:
```sh
llama claude gemma4:31b --dry-run --output json
```

Run Claude against the preferred local profile:
```sh
llama claude gemma4:31b \
  --provider-arg=--print \
  --provider-arg=--output-format \
  --provider-arg=json \
  --provider-arg=--dangerously-skip-permissions \
  --provider-arg='Use the Bash tool to run "pwd" and answer with the working directory only.'
```

By default, llama-launch uses a compact terminal UI and keeps backend and gateway logs quiet. The default terminal behavior is meant to feel closer to a hosted provider run: tool activity plus the final answer, without the raw transport chatter. To see the unfiltered provider and backend output, use:
```sh
llama claude gemma4:31b --ui raw
```

First-run downloads from Hugging Face can take several minutes for large models. llama-launch waits up to 30 minutes for backend startup by default. If needed, override that with:
```sh
export LLAMA_LAUNCH_BACKEND_STARTUP_TIMEOUT_SECONDS=3600
```

Resident backend behavior:
- the first run starts a managed `llama-server`
- later runs on the same model and port reuse that backend instead of reloading the model
- inspect running managed backends with `llama ps`
- stop one explicitly with `llama stop --base-url http://127.0.0.1:8080`
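A quick way to see when a managed backend has finished loading is to poll `llama-server`'s `GET /health` endpoint, which returns 200 once the model is ready. A minimal sketch (the base URL comes from `llama ps`; this is not llama-launch's own readiness code):

```python
import time
import urllib.request
import urllib.error

def wait_for_backend(base_url: str, timeout: float = 30.0) -> bool:
    """Poll the backend's /health endpoint until it answers or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True  # model loaded and serving
        except (urllib.error.URLError, OSError):
            pass  # backend not up yet (or still loading the model)
        time.sleep(0.5)
    return False
```

This is mainly useful in scripts that want to gate work on backend readiness instead of relying on the launcher's own startup wait.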
Run Codex:
```sh
llama codex gemma4:31b
```

Run against a local GGUF:

```sh
llama codex ./models/gemma4-31b.gguf
```

Run against a Hugging Face GGUF repo:

```sh
llama codex ggml-org/gemma-4-31b-it-GGUF:Q4_K_M
```

## Install by provider
Claude:
```sh
npm install -g @pioneur/llama-launch
llama claude gemma4:31b
```

Codex:

```sh
npm install -g @pioneur/llama-launch
llama codex gemma4:31b
```

Pi:

```sh
npm install -g @pioneur/llama-launch
llama pi gemma4:31b
```

Project-local install:

```sh
npm install --save-dev @pioneur/llama-launch
npx llama claude gemma4:31b
```

## Model selection
The model selector is interpreted in three ways:
- local `.gguf` path: uses `llama-server -m`
- `owner/repo[:quant]`: uses `llama-server -hf`
- runtime model id: assumes a compatible `llama.cpp` backend is already serving it
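The three rules can be sketched as a small classifier (an illustrative heuristic only, not llama-launch's actual resolver):

```python
def classify_selector(selector: str) -> str:
    """Classify a model selector per the three documented forms."""
    if selector.endswith(".gguf"):
        return "local-gguf"        # served via `llama-server -m <path>`
    if "/" in selector:
        return "hf-repo"           # served via `llama-server -hf owner/repo[:quant]`
    return "runtime-model-id"      # assume a compatible backend already serves it
```

For example, `classify_selector("gemma4:31b")` falls through to the runtime-model-id case because it is neither a `.gguf` path nor an `owner/repo` reference.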
Examples:
```sh
llama codex ./models/qwen2.5-coder-7b.gguf
llama codex bartowski/Qwen2.5-Coder-7B-GGUF
llama codex ggml-org/gemma-4-31b-it-GGUF:Q4_K_M
llama claude gemma4:31b
```

The same selection can also be passed with `--model`:
```sh
llama claude --model gemma4:31b
llama codex --model ./models/model.gguf
```

## Choosing a model
Use the selector that matches how you want to source the model:
- local GGUF file when you already have a model downloaded on disk
- Hugging Face repo when you want `llama-server` to resolve the model from a GGUF repository
- runtime model id when your local backend already exposes a model name you want to target
Known aliases such as `gemma4:31b` can also auto-resolve to a built-in profile and start the matching backend for you.
Examples by source:
```sh
# Local GGUF
llama codex ./models/gemma4-31b.gguf

# Hugging Face GGUF repo
llama codex bartowski/Qwen2.5-Coder-7B-GGUF

# Existing runtime model id
llama claude gemma4:31b
```

Examples by harness:
```sh
# Claude-oriented local profile
llama claude gemma4:31b

# Codex against a local GGUF
llama codex ./models/qwen2.5-coder-7b.gguf

# Codex against a Hugging Face GGUF repo
llama codex ggml-org/gemma-4-31b-it-GGUF:Q4_K_M
```

Current provider constraints:
- `claude` is intentionally stricter and is aimed at stronger local tool-use profiles, with `gemma4:31b` as the primary supported target
- `codex` is more permissive and can be used with a wider range of local `llama.cpp` models
Known-good starting points:
- `claude`: `gemma4:31b`
- `codex`: `gemma4:31b` or a local coder-focused GGUF
- `opencode`: any solid OpenAI-compatible local model exposed through `llama.cpp`
- `openclaw`: experimental; start with stronger instruction-following models
- `hermes`: OpenAI-compatible local models are the intended path
- `pi`: OpenAI-compatible local models are the intended path
## What gets launched

When you run `llama claude gemma4:31b`, the launcher resolves the model selector, brings up the local backend if needed, exposes the right gateway shape, and then starts the provider CLI with the expected environment variables.
For Claude, that means Anthropic Messages compatibility on top of a local llama.cpp model. For Codex, that means the OpenAI-style path it expects.
## Provider support
Current harness support level:
- `claude`: supported through the Anthropic Messages adapter, tuned for `gemma4:31b`
- `codex`: supported through the OpenAI Responses adapter
- `opencode`: supported through the OpenAI Chat path
- `hermes`: supported through the OpenAI Chat path
- `pi`: supported through the OpenAI Chat path
- `openclaw`: experimental
The launcher ships the protocol shims and process orchestration. Actual runtime quality still depends on the selected model and the provider CLI's tolerance for non-hosted backends.
## Security and privacy
What the npm package contains:
- the CLI wrapper
- the Python gateway/runtime
- the README and license
What it does not publish:
- your local models
- your `~/.claude` directory
- local API keys or tokens
- shell history, logs, or project files
- the private virtualenv created during npm install
At runtime, llama-launch passes environment variables to the selected provider process so the provider can talk to the local gateway. That is local process behavior, not published package content.
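In sketch form, that runtime behavior looks like the following. The gateway URL here is a placeholder; `ANTHROPIC_BASE_URL` is the variable the Claude path documents (see the Claude compatibility section):

```python
import os
import subprocess
import sys

def launch_provider(argv, gateway_url):
    """Start a provider CLI with the gateway address in its environment.

    The child inherits the parent environment plus the gateway variable;
    nothing is written to disk or to the published package.
    """
    env = {**os.environ, "ANTHROPIC_BASE_URL": gateway_url}
    return subprocess.run(argv, env=env, capture_output=True, text=True)

# Demonstrate with a child Python process standing in for the provider CLI.
result = launch_provider(
    [sys.executable, "-c", "import os; print(os.environ['ANTHROPIC_BASE_URL'])"],
    "http://127.0.0.1:8080",
)
print(result.stdout.strip())  # prints http://127.0.0.1:8080
```

The key point is scope: the variable exists only in the child process's environment for the duration of that run.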
## Vault-backed secrets
If you want one authenticated service to broker credentials at runtime, this repo includes a Bitwarden Secrets Manager wrapper.
Files:
- `scripts/run-with-vault.sh`
- `scripts/publish-with-vault.sh`
- `.vault-secrets.example.json`
How it works:
- the machine account authenticates once through `BWS_ACCESS_TOKEN`
- `run-with-vault.sh` looks up a requested env var name in a local secret-id map
- it fetches the live secret value from Bitwarden with `bws`
- it exports that env var only for the target subprocess
- the target command runs without storing the downstream secret in the repo
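The same flow in sketch form, with the `bws` lookup stubbed out as an injected function so the example runs without a Bitwarden account:

```python
import os
import subprocess
import sys

def run_with_secret(var_name, command, secret_map, fetch_secret):
    """Map an env var name to a secret id, fetch the live value, and
    expose it only to the child process -- mirroring run-with-vault.sh."""
    secret_id = secret_map[var_name]          # local secret-id map lookup
    value = fetch_secret(secret_id)           # stands in for the `bws` call
    child_env = {**os.environ, var_name: value}
    return subprocess.run(command, env=child_env, capture_output=True, text=True)

# A child Python process stands in for the target command (e.g. npm whoami).
secret_map = {"NPM_TOKEN": "dummy-secret-id"}
out = run_with_secret(
    "NPM_TOKEN",
    [sys.executable, "-c", "import os; print(os.environ['NPM_TOKEN'])"],
    secret_map,
    fetch_secret=lambda sid: f"value-for-{sid}",
)
```

The design point: the secret value only ever lives in the child's environment, never in the repo or the parent shell.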
Setup:
```sh
cp .vault-secrets.example.json .vault-secrets.json
mkdir -p ~/.config/llama-launch
mv .vault-secrets.json ~/.config/llama-launch/vault-secrets.json
```

Populate `~/.config/llama-launch/vault-secrets.json` with real Bitwarden secret IDs:
```json
{
  "NPM_TOKEN": "your-secret-id",
  "ANTHROPIC_API_KEY": "your-secret-id",
  "OPENAI_API_KEY": "your-secret-id"
}
```

Make the machine account token available:
```sh
export BWS_ACCESS_TOKEN='your-machine-account-token'
```

Examples:
```sh
./scripts/run-with-vault.sh NPM_TOKEN -- npm whoami
./scripts/publish-with-vault.sh
./scripts/run-with-vault.sh ANTHROPIC_API_KEY -- claude --print 'hello'
```

By default the wrapper looks for the secret-id map in:
1. `LLAMA_LAUNCH_VAULT_CONFIG`
2. `./.vault-secrets.json`
3. `~/.config/llama-launch/vault-secrets.json`
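A sketch of that search order (illustrative; the actual lookup lives in the shell wrapper):

```python
import os
from pathlib import Path

def resolve_vault_config():
    """Return the first existing secret-id map, per the documented order."""
    candidates = []
    override = os.environ.get("LLAMA_LAUNCH_VAULT_CONFIG")
    if override:
        candidates.append(Path(override))      # 1. explicit override
    candidates += [
        Path(".vault-secrets.json"),           # 2. project-local map
        Path.home() / ".config" / "llama-launch" / "vault-secrets.json",  # 3. user config
    ]
    return next((p for p in candidates if p.is_file()), None)
```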
## Claude profile
The intended local Claude profile is:
```sh
llama claude gemma4:31b
```

That profile is tuned around:
- `ggml-org/gemma-4-31b-it-GGUF:Q4_K_M`
- Anthropic Messages compatibility
- local tool use as the primary target
Smaller Gemma profiles such as `gemma3:1b` are intentionally rejected for Claude launch because they are not reliable enough for Claude-style tool workflows.
## Claude compatibility
The Claude Code path includes:
- `ANTHROPIC_BASE_URL` pointed at the gateway root
- `ANTHROPIC_CUSTOM_MODEL_OPTION` for local model ids
- `CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1` for proxy mode
- `/v1/messages`
- `/v1/messages/count_tokens`
- streamed `tool_use` translation over a chat-completions backend
- `--bare` by default to avoid unrelated user plugins and hooks interfering with local-model runs
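For reference, the request shape the `/v1/messages` path accepts is the standard Anthropic Messages body with the local model id in the `model` field (values below are illustrative only):

```python
import json

# Standard Anthropic Messages request body; the gateway translates this
# onto the local chat-completions backend, including streamed tool_use.
payload = {
    "model": "gemma4:31b",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Use the Bash tool to run pwd."}
    ],
}
body = json.dumps(payload)
# POST `body` to <gateway>/v1/messages.
```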
## Commands

```sh
llama protocols
llama harnesses
llama backend-status
llama server-command --hf-model ggml-org/gemma-4-31b-it-GGUF:Q4_K_M
llama serve
llama launch codex gemma4:31b
llama claude gemma4:31b
llama codex gemma4:31b
```

If you prefer the long Python form during development:
```sh
python -m llama_launch.cli claude gemma4:31b
```

## Testing
Run the automated suite:
```sh
.venv/bin/python -m unittest discover -s tests -v
```

Run the opt-in real Claude probe:
```sh
RUN_REAL_CLAUDE_PROBE=1 .venv/bin/python -m unittest \
  tests.test_runtime_e2e.RuntimeE2ETests.test_real_claude_completes_bash_tool_roundtrip_through_gateway -v
```

## Status
Current focus:
- Claude Code
- Codex
- OpenCode
- OpenClaw
- Hermes
- Pi
The transport and launch stack is in place. Real-world harness quality still depends on the underlying model’s tool-use and instruction-following capability.
