localpi
v0.3.0
Pi-compatible local model launcher with managed llama-server support.
Localpi is a local Pi launcher for open-weight models.
By default, Localpi discovers available local providers, lets you choose when more than one model is loaded, points Pi at the selected model, and writes Pi config for the other discovered models so /model can switch among them during the session.
Localpi supports LM Studio, vLLM, custom OpenAI-compatible servers, and an optional managed llama-server fallback.
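Discovery over OpenAI-compatible servers can be pictured as probing each provider's `/models` endpoint and collecting the ids it reports. The sketch below is a minimal illustration under that assumption. `DiscoveredModel`, `parseModelList`, `discoverModels`, and `FetchLike` are hypothetical names, not Localpi's API.

```typescript
// Sketch only: the names below are hypothetical, not Localpi's API.
// Assumes each provider answers GET <baseUrl>/models with an
// OpenAI-style body such as { "data": [{ "id": "gemma-4-e4b-it" }] }.

interface DiscoveredModel {
  provider: string; // provider id, e.g. "lmstudio"
  id: string; // model id reported by the server
}

// Minimal fetch-like signature so the sketch does not depend on DOM typings.
type FetchLike = (url: string) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

// Pull model ids out of a /v1/models response, skipping malformed entries.
function parseModelList(provider: string, body: unknown): DiscoveredModel[] {
  if (typeof body !== "object" || body === null) return [];
  const data = (body as { data?: unknown }).data;
  if (!Array.isArray(data)) return [];
  const models: DiscoveredModel[] = [];
  for (const entry of data) {
    if (typeof entry === "object" && entry !== null) {
      const id = (entry as { id?: unknown }).id;
      if (typeof id === "string") models.push({ provider, id });
    }
  }
  return models;
}

// Probe each provider in turn; an unreachable server simply contributes nothing.
async function discoverModels(
  providers: Record<string, string>, // provider id -> base URL ending in /v1
  fetchImpl: FetchLike,
): Promise<DiscoveredModel[]> {
  const found: DiscoveredModel[] = [];
  for (const [provider, baseUrl] of Object.entries(providers)) {
    try {
      const res = await fetchImpl(`${baseUrl.replace(/\/+$/, "")}/models`);
      if (res.ok) found.push(...parseModelList(provider, await res.json()));
    } catch {
      // Connection refused or invalid JSON: treat the provider as having no models.
    }
  }
  return found;
}
```

On Node 18 or later, the global `fetch` can be passed as `fetchImpl`. Injecting it keeps the parsing logic testable without a running server.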
Localpi is intentionally generic. It does not contain classifier prompts, dataset workflows, GitHub routing logic, or final-schema output machinery. Structured classifier runs belong in caller tools such as localpager-agent.
See:
Install
npm install -g localpi
During development:
npm run localpi -- --status
After build:
node dist/src/cli/main.js --status
Runtime Model
Target default:
localpi --model gemma-12b
This uses the default auto runtime. If exactly one model is loaded locally, Localpi selects it. If multiple models are loaded and the terminal is interactive, Localpi boots Pi with a temporary default and opens Pi's native model selector. If no external model is loaded and llama-server is installed, Localpi can fall back to the managed llama-server default. The initial thinking level comes from --thinking, then LOCALPI_THINKING, then the last saved Pi thinking level, and finally medium.
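The auto-runtime rules and the thinking-level precedence described above can be sketched as two small pure functions. This is a minimal illustration, not Localpi's source: `resolveThinking`, `chooseAutoModel`, and the `unresolved` outcome are hypothetical, and skipping unrecognized thinking values is an assumption.

```typescript
// Sketch only: resolveThinking, chooseAutoModel, and the types here are
// hypothetical illustrations of the rules above, not Localpi's source.

const THINKING_LEVELS = ["off", "minimal", "low", "medium", "high", "xhigh"] as const;
type ThinkingLevel = (typeof THINKING_LEVELS)[number];

function isThinkingLevel(value: string | undefined): value is ThinkingLevel {
  return value !== undefined && (THINKING_LEVELS as readonly string[]).includes(value);
}

// Precedence: --thinking, LOCALPI_THINKING, last saved Pi level, then "medium".
// Assumption: unrecognized values are skipped rather than rejected.
function resolveThinking(
  flag: string | undefined,
  env: Record<string, string | undefined>,
  savedLevel: string | undefined,
): ThinkingLevel {
  for (const candidate of [flag, env.LOCALPI_THINKING, savedLevel]) {
    if (isThinkingLevel(candidate)) return candidate;
  }
  return "medium";
}

type AutoChoice =
  | { kind: "selected"; model: string } // exactly one model loaded
  | { kind: "picker"; temporaryDefault: string; choices: string[] } // open Pi's selector
  | { kind: "managed-llama-server" } // nothing loaded, llama-server available
  | { kind: "unresolved" }; // assumption: caller must ask for --provider/--model

function chooseAutoModel(loaded: string[], interactive: boolean, llamaServerInstalled: boolean): AutoChoice {
  if (loaded.length === 1) return { kind: "selected", model: loaded[0] };
  if (loaded.length > 1) {
    // The README describes the picker only for interactive terminals.
    // Assumption: the first loaded model serves as the temporary default.
    return interactive ? { kind: "picker", temporaryDefault: loaded[0], choices: loaded } : { kind: "unresolved" };
  }
  return llamaServerInstalled ? { kind: "managed-llama-server" } : { kind: "unresolved" };
}
```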
Select LM Studio explicitly:
localpi --runtime lmstudio --model gemma-4-e4b-it
Select vLLM explicitly:
localpi --runtime vllm --model qwen
Custom OpenAI-compatible endpoints are also supported:
localpi --runtime openai-compatible --base-url http://127.0.0.1:8000/v1 --model my-model
Use --provider <id> together with --model <id> to select a catalog entry without opening the picker; --provider <id> on its own only narrows the available choices. Localpi avoids loading multiple heavyweight local runtimes at the same time: before starting another model on the managed llama-server runtime, it either stops its previous managed server or clearly reports what is already running.
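The --provider/--model rules can be illustrated as a filter over a catalog of discovered models. This is a hedged sketch: `CatalogEntry` and `selectFromCatalog` are hypothetical, and the handling of --model without --provider (an id served by several providers falls back to the picker) is an assumption the README does not state.

```typescript
// Sketch only: CatalogEntry and selectFromCatalog are hypothetical names
// illustrating the --provider/--model rules above.

interface CatalogEntry {
  provider: string; // e.g. "lmstudio", "vllm", or a registry id such as "vllm-qwen"
  id: string; // model id as reported by that provider
}

type Selection =
  | { kind: "direct"; entry: CatalogEntry } // launch without the picker
  | { kind: "picker"; choices: CatalogEntry[] } // let the user choose in Pi
  | { kind: "not-found" };

function selectFromCatalog(catalog: CatalogEntry[], provider?: string, model?: string): Selection {
  // --provider on its own only narrows the candidate list.
  const scoped = provider === undefined ? catalog : catalog.filter((e) => e.provider === provider);
  if (model === undefined) {
    return scoped.length > 0 ? { kind: "picker", choices: scoped } : { kind: "not-found" };
  }
  const matches = scoped.filter((e) => e.id === model);
  if (matches.length === 1) return { kind: "direct", entry: matches[0] };
  // Assumption: an id served by several providers falls back to the picker.
  return matches.length > 1 ? { kind: "picker", choices: matches } : { kind: "not-found" };
}
```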
Default Pi Behavior
Localpi launches Pi with:
- default tools: read,bash,edit,write,grep,find,ls
- a system prompt that explains local tool approval and local-model limits
- an approval gate before every tool call
- token speed and token count status while responses stream
- bounded Gemma/llama-server reasoning controlled by --thinking
- an in-session /thinking command for changing Pi's active thinking level
- local state under ~/.local/state/localpi
The approval gate makes failed or denied tool calls explicit to the model so the model does not claim that a blocked command ran.
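The idea behind the approval gate (a denied or failed call is returned to the model as an explicit result rather than silently dropped) can be sketched as a wrapper around tool execution. `ToolCall`, `ToolResult`, and `runWithApproval` are hypothetical names and the message wording is illustrative; this is not Pi's extension API.

```typescript
// Sketch only: ToolCall, ToolResult, and runWithApproval are hypothetical,
// not Pi's extension API. Every outcome becomes an explicit result for the model.

interface ToolCall {
  tool: string; // e.g. "bash"
  args: Record<string, unknown>;
}

interface ToolResult {
  ok: boolean;
  output: string; // text the model will see
}

async function runWithApproval(
  call: ToolCall,
  approve: (call: ToolCall) => Promise<boolean>, // e.g. an interactive prompt
  execute: (call: ToolCall) => Promise<string>,
): Promise<ToolResult> {
  if (!(await approve(call))) {
    // Tell the model plainly that nothing ran, so it cannot claim otherwise.
    return { ok: false, output: `Tool call "${call.tool}" was denied by the user and did not run.` };
  }
  try {
    return { ok: true, output: await execute(call) };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, output: `Tool call "${call.tool}" failed: ${reason}` };
  }
}
```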
LM Studio Alternative
LM Studio exposes an OpenAI-compatible endpoint, usually:
http://127.0.0.1:1234/v1
Load Gemma in LM Studio:
~/.lmstudio/bin/lms server start
~/.lmstudio/bin/lms load gemma-4-e4b-it -y
Then run localpi against LM Studio explicitly:
localpi --runtime lmstudio --model gemma-4-e4b-it
Usage
Run Pi interactively on the default local model:
localpi
Run a non-interactive Pi prompt:
localpi -p "summarize this repo"
Run an endless TUI demo:
localpi --demo --model gemma-e4b
Demo mode requires an explicit model. It opens the normal Pi TUI and keeps one live Pi session, so follow-up prompts continue from the first prompt while Pi itself handles streaming, tok/s status, slash commands, and exit behavior.
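The demo prompt order (the initial prompt once, then the follow-up prompt repeated) maps naturally onto an infinite generator. This is a minimal sketch: `demoPrompts` and `firstPrompts` are hypothetical helpers, and in Localpi the prompts are sent into one live Pi session until it is interrupted or Pi exits.

```typescript
// Sketch only: demoPrompts and firstPrompts are hypothetical helpers.
// The initial prompt is yielded once; the follow-up repeats until the
// consumer stops iterating (in Localpi: until interrupted or Pi exits).
function* demoPrompts(initial: string, followup: string): IterableIterator<string> {
  yield initial;
  while (true) {
    yield followup;
  }
}

// Take a bounded number of prompts, e.g. for a dry run of the sequence.
function firstPrompts(n: number, initial: string, followup: string): string[] {
  const out: string[] = [];
  for (const prompt of demoPrompts(initial, followup)) {
    if (out.length === n) break;
    out.push(prompt);
  }
  return out;
}
```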
Override the demo prompts:
localpi --demo --model gemma-e4b --demo-initial-prompt-file ./prompts/story.txt --demo-followup-prompt "Continue. Try to write as long as possible."
Pin a model alias:
localpi --model gemma-e4b -p "write a detailed implementation plan"
Use a bounded reasoning budget with managed llama-server:
localpi --model gemma-12b --thinking low -p "classify this item"
In an interactive session, use /thinking to pick a level or /thinking high to set one directly. This changes Pi's active thinking level for later turns and saves it for the next localpi launch. For managed llama-server, the server-side reasoning budget is still chosen at startup because changing it requires restarting the local server process.
For managed llama-server, thinking levels map to server-side reasoning:
| Level | llama-server reasoning |
| --------- | ---------------------------------------- |
| off | --reasoning off |
| minimal | --reasoning on --reasoning-budget 32 |
| low | --reasoning on --reasoning-budget 128 |
| medium | --reasoning on --reasoning-budget 512 |
| high | --reasoning on --reasoning-budget 2048 |
| xhigh | --reasoning on --reasoning-budget 8192 |
The fallback default is medium.
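The table above, together with the managed llama-server defaults from the options list, can be turned into a server argv. In this sketch the reasoning flags and budgets come straight from the table. `--model`, `--host`, `--port`, `--n-gpu-layers`, `--parallel`, `--ctx-size`, and `--chat-template-file` are llama.cpp server flags, but whether Localpi passes exactly these is an assumption. `buildLlamaServerArgs` and `ManagedServerOptions` are hypothetical names.

```typescript
// Sketch only: buildLlamaServerArgs and ManagedServerOptions are hypothetical.
// Reasoning flags follow the README table; the other flag names are an assumption.
type ThinkingLevel = "off" | "minimal" | "low" | "medium" | "high" | "xhigh";

const REASONING_BUDGET: Record<Exclude<ThinkingLevel, "off">, number> = {
  minimal: 32,
  low: 128,
  medium: 512,
  high: 2048,
  xhigh: 8192,
};

interface ManagedServerOptions {
  modelPath: string; // GGUF file
  thinking?: ThinkingLevel; // defaults to "medium", the documented fallback
  host?: string; // default 127.0.0.1
  port?: number; // default 18194
  gpuLayers?: number; // default 999
  parallel?: number; // default 1
  contextWindow?: number;
  chatTemplate?: string;
}

function reasoningArgs(level: ThinkingLevel): string[] {
  return level === "off"
    ? ["--reasoning", "off"]
    : ["--reasoning", "on", "--reasoning-budget", String(REASONING_BUDGET[level])];
}

function buildLlamaServerArgs(opts: ManagedServerOptions): string[] {
  const args = [
    "--model", opts.modelPath,
    "--host", opts.host ?? "127.0.0.1",
    "--port", String(opts.port ?? 18194),
    "--n-gpu-layers", String(opts.gpuLayers ?? 999),
    "--parallel", String(opts.parallel ?? 1),
  ];
  if (opts.contextWindow !== undefined) args.push("--ctx-size", String(opts.contextWindow));
  if (opts.chatTemplate !== undefined) args.push("--chat-template-file", opts.chatTemplate);
  args.push(...reasoningArgs(opts.thinking ?? "medium"));
  return args;
}
```

Because the budget is baked into the argv, changing it means restarting the server. This is consistent with the note that /thinking does not change the server-side budget mid-session.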
Point at vLLM:
localpi --runtime vllm --model qwen -p "review the src directory"
Point at a different OpenAI-compatible local server:
localpi --runtime openai-compatible --base-url http://127.0.0.1:8000/v1 -p "review the src directory"
Pass a Pi flag that localpi also owns after --:
localpi --model gemma-e4b -- --model some-pi-level-value
Stop the managed llama-server runtime:
localpi --stop
Options
- --runtime <auto|llama-server|lmstudio|vllm|openai-compatible>: runtime backend. Default: auto
- --provider <id>: catalog provider id to use, for example lmstudio or vllm
- --model <alias|id|path|auto>: model alias, model id, or GGUF path
- --ctx <n> / --context-window <n>: model context window
- --max-tokens <n>: generated model max output tokens
- --base-url <url>: OpenAI-compatible endpoint for LM Studio or custom endpoints
- --server-command <path>: llama-server executable path
- --llama-server <path>: alias for --server-command
- --host <host>: managed llama-server host. Default: 127.0.0.1
- --port <n>: managed llama-server port. Default: 18194
- --gpu-layers <n>: managed llama-server GPU layers. Default: 999
- --parallel <n>: managed llama-server parallel slots. Default: 1
- --chat-template <path>: optional llama.cpp chat template file
- --state-dir <path>: runtime state directory. Default: ~/.local/state/localpi
- --session-dir <path>: Pi session directory. Default: <state-dir>/sessions
- --pi-command <command>: Pi launch command
- --providers-file <path>: provider registry JSON
- --model-profile <path>: local model capability profile JSON
- --model-reasoning <bool>: override generated Pi reasoning capability
- --model-thinking-format <deepseek|qwen-chat-template>: override generated Pi thinking format
- --tools <list>: Pi tools allow list. Default: read,bash,edit,write,grep,find,ls
- --thinking <off|minimal|low|medium|high|xhigh>: Pi thinking level and managed llama-server reasoning budget. Default: last saved level, then medium
- --demo: endlessly run Pi prompts inside the normal Pi TUI until interrupted or Pi exits; requires an explicit non-auto model
- --demo-initial-prompt <text>: first demo prompt
- --demo-followup-prompt <text>: repeated demo prompt after the first run
- --demo-initial-prompt-file <path>: UTF-8 file for the first demo prompt
- --demo-followup-prompt-file <path>: UTF-8 file for repeated demo prompts
- --no-approval: disable the tool approval gate
- --no-token-status: disable the token status extension
- --status: print runtime, model, and Pi config status
- --stop: stop the managed llama-server process
- --list: list configured model aliases
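The flags above belong to localpi. Anything after a bare `--` is handed to Pi unchanged, which is how a Pi-level `--model` can be passed. A minimal sketch of that split follows; `splitPassthrough` is a hypothetical helper, not Localpi's parser.

```typescript
// Sketch only: splitPassthrough is a hypothetical helper. Everything before the
// first bare "--" is parsed as localpi options; everything after it goes to Pi
// untouched, even flags such as --model that localpi also owns.
function splitPassthrough(argv: string[]): { localpiArgs: string[]; piArgs: string[] } {
  const sep = argv.indexOf("--");
  return sep === -1
    ? { localpiArgs: argv, piArgs: [] }
    : { localpiArgs: argv.slice(0, sep), piArgs: argv.slice(sep + 1) };
}
```

For example, `localpi --model gemma-e4b -- --model some-pi-level-value` would split into `--model gemma-e4b` for localpi and `--model some-pi-level-value` for Pi.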
Environment
- LOCALPI_RUNTIME
- LOCALPI_MODEL
- LOCALPI_PROVIDER
- LOCALPI_BASE_URL
- LOCALPI_PROVIDERS_FILE
- LOCALPI_MODEL_PROFILE
- LOCALPI_MODEL_REASONING
- LOCALPI_MODEL_THINKING_FORMAT
- LOCALPI_STATE_DIR
- LOCALPI_SESSION_DIR
- LOCALPI_PI_CMD
- LOCALPI_CONTEXT_WINDOW
- LOCALPI_MAX_TOKENS
- LOCALPI_LLAMA_SERVER
- LOCALPI_HOST
- LOCALPI_PORT
- LOCALPI_GPU_LAYERS
- LOCALPI_PARALLEL
- LOCALPI_CHAT_TEMPLATE
- LOCALPI_TOOLS
- LOCALPI_THINKING
- LOCALPI_DEMO
- LOCALPI_DEMO_INITIAL_PROMPT
- LOCALPI_DEMO_FOLLOWUP_PROMPT
- LOCALPI_DEMO_INITIAL_PROMPT_FILE
- LOCALPI_DEMO_FOLLOWUP_PROMPT_FILE
- LOCALPI_MODELS_FILE
- LOCALPAGER_AGENT_PROFILE
- LOCALPAGER_AGENT_REASONING
- LOCALPAGER_AGENT_THINKING_FORMAT
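A few of these variables can be read alongside their CLI flags as sketched below. Defaults are taken from the options list. Two points are assumptions: that a flag always beats its variable (the README states this only for --thinking and LOCALPI_THINKING), and that numeric values must be positive integers. `readConfig`, `positiveInt`, and `LocalpiConfig` are hypothetical names.

```typescript
// Sketch only: readConfig, positiveInt, and LocalpiConfig are hypothetical.
// Assumptions: a CLI flag beats its environment variable, and numeric
// variables must be positive integers.

interface LocalpiConfig {
  runtime: string;
  port: number;
  stateDir: string;
  tools: string[];
}

function positiveInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) throw new Error(`${name} must be a positive integer, got "${raw}"`);
  return n;
}

function readConfig(
  flags: Partial<Record<keyof LocalpiConfig, string>>, // raw flag values, if given
  env: Record<string, string | undefined>,
  home: string, // user home directory
): LocalpiConfig {
  return {
    runtime: flags.runtime ?? env.LOCALPI_RUNTIME ?? "auto",
    port: positiveInt("port", flags.port ?? env.LOCALPI_PORT, 18194),
    stateDir: flags.stateDir ?? env.LOCALPI_STATE_DIR ?? `${home}/.local/state/localpi`,
    tools: (flags.tools ?? env.LOCALPI_TOOLS ?? "read,bash,edit,write,grep,find,ls")
      .split(",")
      .map((t) => t.trim())
      .filter((t) => t.length > 0),
  };
}
```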
LOCALPI_MODELS_FILE may point at a JSON file with this shape:
{
"models": {
"my-model": {
"id": "my-model-id",
"path": "/path/to/model.gguf",
"contextWindow": 32768,
"chatTemplate": "/path/to/template.jinja"
}
}
}
Provider registries use the same file or LOCALPI_PROVIDERS_FILE:
{
"providers": {
"vllm-qwen": {
"type": "openai-compatible",
"name": "vLLM Qwen",
"baseUrl": "http://127.0.0.1:8000/v1",
"discover": true
}
}
}
Use discover: false for endpoints that should not be probed during startup. They can still be selected explicitly with --provider vllm-qwen --model <id>.
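The two JSON shapes above can be typed as follows, with a helper that picks which providers to probe at startup. This is a sketch under stated assumptions: the interface and function names are hypothetical, the optional fields are guesses, and treating a missing `discover` field as "probe it" is an assumption (only an explicit `false` skips a provider here).

```typescript
// Sketch only: these types and helpers are hypothetical and mirror the
// example JSON above; optional fields and the discover default are assumptions.

interface ModelEntry {
  id: string;
  path?: string; // GGUF path for managed llama-server
  contextWindow?: number;
  chatTemplate?: string;
}

interface ProviderEntry {
  type: string; // e.g. "openai-compatible"
  name?: string;
  baseUrl: string;
  discover?: boolean;
}

interface RegistryFile {
  models?: Record<string, ModelEntry>;
  providers?: Record<string, ProviderEntry>;
}

function parseRegistry(text: string): RegistryFile {
  const data: unknown = JSON.parse(text);
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("registry file must contain a JSON object");
  }
  return data as RegistryFile;
}

// Providers with discover: false are skipped at startup but remain selectable.
function providersToProbe(registry: RegistryFile): string[] {
  return Object.entries(registry.providers ?? {})
    .filter(([, p]) => p.discover !== false)
    .map(([id]) => id);
}
```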
Model capability profiles can fill in metadata that OpenAI-compatible servers do not expose through /v1/models, such as vLLM reasoning support:
{
"id": "gemma4-26b-a4b-nvfp4",
"model": "nvidia/Gemma-4-26B-A4B-NVFP4",
"base_url": "http://127.0.0.1:8000/v1",
"client": {
"context_window": 32768,
"max_tokens": 4096
},
"capabilities": {
"reasoning": true,
"thinking_format": "qwen-chat-template"
}
}
LOCALPAGER_AGENT_PROFILE, LOCALPAGER_AGENT_REASONING, and LOCALPAGER_AGENT_THINKING_FORMAT are accepted as aliases so LocalPager Agent can pass the same profile metadata through to localpi.
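Resolving the alias variables and applying a capability profile can be sketched as below. `ModelProfile` mirrors the example JSON. `envWithAlias` and `resolveReasoning` are hypothetical helpers, and three behaviors are assumptions: the LOCALPI_* name wins when both are set, an explicit reasoning override beats the profile, and only the string "true" enables reasoning.

```typescript
// Sketch only: envWithAlias and resolveReasoning are hypothetical. Assumptions:
// LOCALPI_* wins over its LOCALPAGER_AGENT_* alias, an explicit override beats
// the profile, and only the string "true" enables reasoning.

interface ModelProfile {
  id: string;
  model: string;
  base_url: string;
  client?: { context_window?: number; max_tokens?: number };
  capabilities?: { reasoning?: boolean; thinking_format?: "deepseek" | "qwen-chat-template" };
}

type Env = Record<string, string | undefined>;

function envWithAlias(env: Env, primary: string, alias: string): string | undefined {
  return env[primary] ?? env[alias];
}

function resolveReasoning(profile: ModelProfile, env: Env): { reasoning: boolean; thinkingFormat?: string } {
  const override = envWithAlias(env, "LOCALPI_MODEL_REASONING", "LOCALPAGER_AGENT_REASONING");
  const formatOverride = envWithAlias(env, "LOCALPI_MODEL_THINKING_FORMAT", "LOCALPAGER_AGENT_THINKING_FORMAT");
  return {
    reasoning: override !== undefined ? override === "true" : profile.capabilities?.reasoning ?? false,
    thinkingFormat: formatOverride ?? profile.capabilities?.thinking_format,
  };
}
```

The profile path itself would come from `envWithAlias(env, "LOCALPI_MODEL_PROFILE", "LOCALPAGER_AGENT_PROFILE")` under the same assumption.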
Development
npm run format
npm run lint
npm run typecheck
npm test
npm run build
npm run check