localpi

v0.3.0

Published

16 hours ago

Pi-compatible local model launcher with managed llama-server support.

0High
0Medium
0Low

osolmaz

localpi

Localpi is a local Pi launcher for open-weight models.

By default, Localpi discovers available local providers, lets you choose when more than one model is loaded, points Pi at the selected model, and writes Pi config for the other discovered models so /model can switch among them during the session.

Localpi supports LM Studio, vLLM, custom OpenAI-compatible servers, and an optional managed llama-server fallback.

Localpi is intentionally generic. It does not contain classifier prompts, dataset workflows, GitHub routing logic, or final-schema output machinery. Structured classifier runs belong in caller tools such as localpager-agent.

See:

Runtime Specification

Install

npm install -g localpi

During development:

npm run localpi -- --status

After build:

node dist/src/cli/main.js --status

Runtime Model

Target default:

localpi --model gemma-12b

This uses the default auto runtime. If exactly one model is loaded locally, Localpi selects it. If multiple models are loaded in an interactive terminal, Localpi boots Pi with a temporary default and opens Pi's native model selector. If no external model is loaded and llama-server is installed, Localpi can fall back to the managed llama-server default. Thinking starts from --thinking, LOCALPI_THINKING, the last saved Pi thinking level, or medium.

LM Studio is explicit:

localpi --runtime lmstudio --model gemma-4-e4b-it

vLLM is explicit:

localpi --runtime vllm --model qwen

Custom OpenAI-compatible endpoints are also supported:

localpi --runtime openai-compatible --base-url http://127.0.0.1:8000/v1 --model my-model

Use --provider <id> with --model <id> to select a catalog entry without opening the picker. --provider <id> by itself only scopes the available choices. Localpi avoids loading multiple heavyweight local runtimes at the same time. When using the managed llama-server runtime, it either stops its previous managed server or clearly reports what is already running before starting another model.

Default Pi Behavior

Localpi launches Pi with:

default tools: read,bash,edit,write,grep,find,ls
a system prompt that explains local tool approval and local-model limits
an approval gate before every tool call
token speed and token count status while responses stream
bounded Gemma/llama-server reasoning controlled by --thinking
an in-session /thinking command for changing Pi's active thinking level
local state under ~/.local/state/localpi

The approval gate makes failed or denied tool calls explicit to the model so the model does not claim that a blocked command ran.

LM Studio Alternative

LM Studio exposes an OpenAI-compatible endpoint, usually:

http://127.0.0.1:1234/v1

Load Gemma in LM Studio:

~/.lmstudio/bin/lms server start
~/.lmstudio/bin/lms load gemma-4-e4b-it -y

Then run localpi against LM Studio explicitly:

localpi --runtime lmstudio --model gemma-4-e4b-it

Usage

Run Pi interactively on the default local model:

localpi

Run a non-interactive Pi prompt:

localpi -p "summarize this repo"

Run an endless TUI demo:

localpi --demo --model gemma-e4b

Demo mode requires an explicit model, opens the normal Pi TUI, and keeps one live Pi session so followup prompts continue from the first prompt while Pi owns streaming, tok/s status, slash commands, and exit behavior.

Override the demo prompts:

localpi --demo --model gemma-e4b --demo-initial-prompt-file ./prompts/story.txt --demo-followup-prompt "Continue. Try to write as long as possible."

Pin a model alias:

localpi --model gemma-e4b -p "write a detailed implementation plan"

Use a bounded reasoning budget with managed llama-server:

localpi --model gemma-12b --thinking low -p "classify this item"

In an interactive session, use /thinking to pick a level or /thinking high to set one directly. This changes Pi's active thinking level for later turns and saves it for the next localpi launch. For managed llama-server, the server-side reasoning budget is still chosen at startup because changing it requires restarting the local server process.

For managed llama-server, thinking levels map to server-side reasoning:

| Level | llama-server reasoning | | --------- | ---------------------------------------- | | off | --reasoning off | | minimal | --reasoning on --reasoning-budget 32 | | low | --reasoning on --reasoning-budget 128 | | medium | --reasoning on --reasoning-budget 512 | | high | --reasoning on --reasoning-budget 2048 | | xhigh | --reasoning on --reasoning-budget 8192 |

The fallback default is medium.

Point at vLLM:

localpi --runtime vllm --model qwen -p "review the src directory"

Point at a different OpenAI-compatible local server:

localpi --runtime openai-compatible --base-url http://127.0.0.1:8000/v1 -p "review the src directory"

Pass a Pi flag that localpi also owns after --:

localpi --model gemma-e4b -- --model some-pi-level-value

Stop the managed llama-server runtime:

localpi --stop

Options

--runtime <auto|llama-server|lmstudio|vllm|openai-compatible>: runtime backend. Default: auto
--provider <id>: catalog provider id to use, for example lmstudio or vllm
--model <alias|id|path|auto>: model alias, model id, or GGUF path
--ctx <n> / --context-window <n>: model context window
--max-tokens <n>: generated model max output tokens
--base-url <url>: OpenAI-compatible endpoint for LM Studio or custom endpoints
--server-command <path>: llama-server executable path
--llama-server <path>: alias for --server-command
--host <host>: managed llama-server host. Default: 127.0.0.1
--port <n>: managed llama-server port. Default: 18194
--gpu-layers <n>: managed llama-server GPU layers. Default: 999
--parallel <n>: managed llama-server parallel slots. Default: 1
--chat-template <path>: optional llama.cpp chat template file
--state-dir <path>: runtime state directory. Default: ~/.local/state/localpi
--session-dir <path>: Pi session directory. Default: <state-dir>/sessions
--pi-command <command>: Pi launch command
--providers-file <path>: provider registry JSON
--model-profile <path>: local model capability profile JSON
--model-reasoning <bool>: override generated Pi reasoning capability
--model-thinking-format <deepseek|qwen-chat-template>: override generated Pi thinking format
--tools <list>: Pi tools allow list. Default: read,bash,edit,write,grep,find,ls
--thinking <off|minimal|low|medium|high|xhigh>: Pi thinking level and managed llama-server reasoning budget. Default: last saved level, then medium
--demo: endlessly run Pi prompts inside the normal Pi TUI until interrupted or Pi exits; requires an explicit non-auto model
--demo-initial-prompt <text>: first demo prompt
--demo-followup-prompt <text>: repeated demo prompt after the first run
--demo-initial-prompt-file <path>: UTF-8 file for the first demo prompt
--demo-followup-prompt-file <path>: UTF-8 file for repeated demo prompts
--no-approval: disable the tool approval gate
--no-token-status: disable the token status extension
--status: print runtime, model, and Pi config status
--stop: stop the managed llama-server process
--list: list configured model aliases

Environment

LOCALPI_RUNTIME
LOCALPI_MODEL
LOCALPI_PROVIDER
LOCALPI_BASE_URL
LOCALPI_PROVIDERS_FILE
LOCALPI_MODEL_PROFILE
LOCALPI_MODEL_REASONING
LOCALPI_MODEL_THINKING_FORMAT
LOCALPI_STATE_DIR
LOCALPI_SESSION_DIR
LOCALPI_PI_CMD
LOCALPI_CONTEXT_WINDOW
LOCALPI_MAX_TOKENS
LOCALPI_LLAMA_SERVER
LOCALPI_HOST
LOCALPI_PORT
LOCALPI_GPU_LAYERS
LOCALPI_PARALLEL
LOCALPI_CHAT_TEMPLATE
LOCALPI_TOOLS
LOCALPI_THINKING
LOCALPI_DEMO
LOCALPI_DEMO_INITIAL_PROMPT
LOCALPI_DEMO_FOLLOWUP_PROMPT
LOCALPI_DEMO_INITIAL_PROMPT_FILE
LOCALPI_DEMO_FOLLOWUP_PROMPT_FILE
LOCALPI_MODELS_FILE
LOCALPAGER_AGENT_PROFILE
LOCALPAGER_AGENT_REASONING
LOCALPAGER_AGENT_THINKING_FORMAT

LOCALPI_MODELS_FILE may point at a JSON file with this shape:

{
  "models": {
    "my-model": {
      "id": "my-model-id",
      "path": "/path/to/model.gguf",
      "contextWindow": 32768,
      "chatTemplate": "/path/to/template.jinja"
    }
  }
}

Provider registries use the same file or LOCALPI_PROVIDERS_FILE:

{
  "providers": {
    "vllm-qwen": {
      "type": "openai-compatible",
      "name": "vLLM Qwen",
      "baseUrl": "http://127.0.0.1:8000/v1",
      "discover": true
    }
  }
}

Use discover: false for endpoints that should not be probed during startup. They can still be selected explicitly with --provider vllm-qwen --model <id>.

Model capability profiles can fill in metadata that OpenAI-compatible servers do not expose through /v1/models, such as vLLM reasoning support:

{
  "id": "gemma4-26b-a4b-nvfp4",
  "model": "nvidia/Gemma-4-26B-A4B-NVFP4",
  "base_url": "http://127.0.0.1:8000/v1",
  "client": {
    "context_window": 32768,
    "max_tokens": 4096
  },
  "capabilities": {
    "reasoning": true,
    "thinking_format": "qwen-chat-template"
  }
}

LOCALPAGER_AGENT_PROFILE, LOCALPAGER_AGENT_REASONING, and LOCALPAGER_AGENT_THINKING_FORMAT are accepted as aliases so LocalPager Agent can pass the same profile metadata through to localpi.

Development

npm run format
npm run lint
npm run typecheck
npm test
npm run build
npm run check

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

localpi

Install

Runtime Model

Default Pi Behavior

LM Studio Alternative

Usage

Options

Environment

Development