@darylkang/arbiter
v0.1.0
Research-grade CLI for studying LLM behavior as a distribution.
Arbiter
Arbiter is a research-grade CLI for studying LLM response distributions under repeated, controlled sampling.
It is designed for teams that need:
- deterministic trial planning,
- auditable artifact outputs,
- reproducible run verification,
- and clear provenance for requested vs. actual model behavior.
Arbiter focuses on measurement quality and traceability. It does not claim model correctness.
Contract Status
This README defines Arbiter's stabilized v1 product and artifact contracts.
Implementation rollout is tracked in docs/exec-plans/.
If runtime behavior diverges from this document, treat that as either:
- an implementation defect to fix, or
- an explicit migration step that must be recorded in an ExecPlan.
What Arbiter does
Arbiter runs many trials against a fixed question and configuration, then records:
- trial-level execution outputs with parse and embedding summaries,
- batch-level novelty monitoring signals,
- optional embedding-group outputs,
- and a complete run manifest for verification.
This supports analysis of how response behavior changes across model/persona/protocol sampling choices.
Core principles
- Schema-first: output contracts are defined by JSON Schemas.
- Deterministic planning: trial plans are seeded and reproducible.
- Audit-first artifacts: runs emit machine-verifiable files, not just terminal logs.
- Provenance-aware: requested and actual model identifiers are both recorded.
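To illustrate what "seeded and reproducible" can mean in practice, here is a minimal sketch of deterministic trial planning. It is not Arbiter's planner: the `PlannedTrial` shape, `planTrials`, and the choice of the mulberry32 generator are all assumptions made for illustration. The point is only that the same seed and inputs always yield the same plan.

```typescript
// Hypothetical sketch: these names and shapes are illustrative, not Arbiter's API.
interface PlannedTrial {
  trialId: number;
  model: string;
  persona: string;
}

// mulberry32: a small, widely used seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pick one item using the supplied random source instead of Math.random().
function pick<T>(items: readonly T[], rand: () => number): T {
  return items[Math.floor(rand() * items.length)];
}

// Build a plan whose contents depend only on the seed and the inputs.
function planTrials(
  seed: number,
  count: number,
  models: readonly string[],
  personas: readonly string[],
): PlannedTrial[] {
  const rand = mulberry32(seed);
  const plan: PlannedTrial[] = [];
  for (let trialId = 0; trialId < count; trialId++) {
    const model = pick(models, rand);
    const persona = pick(personas, rand);
    plan.push({ trialId, model, persona });
  }
  return plan;
}
```

Calling `planTrials` twice with the same seed, count, and lists returns identical plans, which is the property that lets a recorded plan be re-derived and checked later.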
Requirements
- Node.js >=24
- macOS/Linux terminal (TTY for interactive mode)
- OpenRouter API key, required only for live runs (OPENROUTER_API_KEY)
Install
Option A: Install globally from npm
npm install -g @darylkang/arbiter
Option B: Install from source (editable/local development)
git clone https://github.com/darylkang/arbiter.git
cd arbiter
npm install
npm run build
npm link
Verify installation:
arbiter --version
arbiter --help
Quick start
Wizard entry (TTY)
Launch the wizard:
arbiter
Initialize a config
arbiter init
This writes arbiter.config.json in CWD, or the first collision-safe filename:
- arbiter.config.1.json
- arbiter.config.2.json
- and so on
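The collision-safe naming rule above can be sketched as a pure function. This is an illustration under assumptions, not Arbiter's implementation: `nextConfigName` is a hypothetical helper, and it takes the set of filenames already present in CWD rather than touching the filesystem.

```typescript
// Hypothetical helper: returns the first config filename not already taken.
function nextConfigName(existing: ReadonlySet<string>): string {
  const base = "arbiter.config.json";
  if (!existing.has(base)) return base;
  // Try arbiter.config.1.json, arbiter.config.2.json, ... until one is free.
  for (let n = 1; ; n++) {
    const candidate = `arbiter.config.${n}.json`;
    if (!existing.has(candidate)) return candidate;
  }
}
```

A real implementation would likely also create the file with an exclusive-create flag (for example Node's `wx` flag) so that a file appearing between the check and the write is never overwritten.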
After writing, Arbiter prints:
- the created config file path
- suggested next commands: arbiter, arbiter run --config <file>
Headless run (default)
arbiter run --config arbiter.config.json
Live run override
export OPENROUTER_API_KEY=<your_key>
arbiter run --config arbiter.config.json --mode live
Dashboard monitor (human-only)
arbiter run --config arbiter.config.json --dashboard
If stdout is not a TTY, Arbiter prints a warning to stderr and continues headless.
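The dashboard fallback described above amounts to a small decision rule. The sketch below is hypothetical: `chooseDisplay`, `DisplayDecision`, and the warning text are invented for illustration. In Node.js, the TTY flag would typically come from `process.stdout.isTTY`.

```typescript
type DisplayMode = "dashboard" | "headless";

interface DisplayDecision {
  mode: DisplayMode;
  // Message to write to stderr, if any. The wording here is an assumption.
  warning?: string;
}

// Decide how to render a run given the --dashboard flag and the stdout type.
function chooseDisplay(dashboardRequested: boolean, stdoutIsTTY: boolean): DisplayDecision {
  if (!dashboardRequested) return { mode: "headless" };
  if (stdoutIsTTY) return { mode: "dashboard" };
  // Requested but unavailable: warn and keep going headless instead of failing.
  return {
    mode: "headless",
    warning: "warning: --dashboard requires a TTY; continuing headless",
  };
}
```

Falling back rather than exiting keeps piped or CI invocations working even when a dashboard flag is left in a shared command line.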
CLI Contract (v1)
Arbiter exposes exactly three primary entry points:
- arbiter
- arbiter init
- arbiter run
Global flags:
- --help, -h
- --version, -V
Command behavior:
- arbiter: launch wizard when stdout is TTY; otherwise print help and exit 0.
- arbiter init: write a collision-safe default config in CWD and never overwrite existing files.
- arbiter run: headless execution command, requires --config <path>.
Run override flags (arbiter run):
- --out <dir> (default: ./runs)
- --workers <n>
- --batch-size <n>
- --max-trials <n>
- --mode <mock|live>
- --dashboard (TTY-only Stage 2/3 monitor)
Not part of v1:
- no --headless
- no --verbose
- no wizard flag (--wizard)
- no experiment-variable CLI flags (models, personas, protocol, decode, debate params, clustering thresholds)
- no redundant aliases beyond -h and -V
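One way to picture a closed flag surface like this is a strict parser that accepts only the documented arbiter run flags and rejects everything else. The sketch below is an assumption-laden illustration, not Arbiter's parser: `parseRunArgs`, `RunOverrides`, the error messages, and the rule that numeric flags must be positive integers are all hypothetical.

```typescript
// Hypothetical shape for parsed `arbiter run` arguments.
interface RunOverrides {
  config?: string;
  out?: string;
  workers?: number;
  batchSize?: number;
  maxTrials?: number;
  mode?: "mock" | "live";
  dashboard: boolean;
}

// Flags that take a value; --dashboard is the only boolean flag.
const VALUE_FLAGS = new Set<string>([
  "--config", "--out", "--workers", "--batch-size", "--max-trials", "--mode",
]);

// Assumption: numeric overrides must be positive integers.
function positiveInt(flag: string, raw: string): number {
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${flag} expects a positive integer, got "${raw}"`);
  }
  return n;
}

function parseRunArgs(args: readonly string[]): RunOverrides {
  const parsed: RunOverrides = { dashboard: false };
  for (let i = 0; i < args.length; i++) {
    const flag = args[i];
    if (flag === "--dashboard") { parsed.dashboard = true; continue; }
    // Anything outside the documented set (e.g. --headless, --verbose) is rejected.
    if (!VALUE_FLAGS.has(flag)) throw new Error(`unknown flag: ${flag}`);
    const value = args[++i];
    if (value === undefined) throw new Error(`missing value for ${flag}`);
    switch (flag) {
      case "--config": parsed.config = value; break;
      case "--out": parsed.out = value; break;
      case "--workers": parsed.workers = positiveInt(flag, value); break;
      case "--batch-size": parsed.batchSize = positiveInt(flag, value); break;
      case "--max-trials": parsed.maxTrials = positiveInt(flag, value); break;
      case "--mode":
        if (value !== "mock" && value !== "live") throw new Error(`--mode must be mock or live, got "${value}"`);
        parsed.mode = value;
        break;
    }
  }
  if (parsed.config === undefined) throw new Error("--config <path> is required");
  return parsed;
}
```

Rejecting unknown flags outright, rather than ignoring them, keeps experiment variables in the config file where they are captured in the run artifacts.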
Config Resolution Contract
Resolution precedence, from lowest to highest (later sources override earlier ones):
- built-in defaults
- config file
- CLI override flags
Per run directory, Arbiter writes:
- config.source.json (exact input config as read)
- config.resolved.json (final resolved config used to execute)
The source config file is never mutated during run execution.
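As a rough sketch of this layering, the function below merges three layers in order so that CLI overrides win over the config file, which wins over built-in defaults. It is illustrative only: `resolveConfig` and the flat key names are assumptions, the merge is shallow, and Arbiter's real config schema may be nested.

```typescript
type ConfigLayer = Record<string, unknown>;

// Merge layers so that later layers override earlier ones.
// Inputs are never mutated; a fresh object is returned.
function resolveConfig(
  defaults: ConfigLayer,
  fileConfig: ConfigLayer,
  cliOverrides: ConfigLayer,
): ConfigLayer {
  const resolved: ConfigLayer = { ...defaults };
  for (const layer of [fileConfig, cliOverrides]) {
    for (const [key, value] of Object.entries(layer)) {
      // An unset CLI flag (undefined) must not erase a lower-priority value.
      if (value !== undefined) resolved[key] = value;
    }
  }
  return resolved;
}
```

For example, with defaults `{ out: "./runs", workers: 2 }`, a config file `{ workers: 8, mode: "mock" }`, and CLI overrides `{ mode: "live" }`, the resolved result keeps `out` from the defaults, `workers` from the file, and `mode` from the CLI.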
Run Directory Contract
Scope note:
- The artifact lists below apply to executed runs (arbiter run ...), including graceful user stop.
- Resolve-only directories are tooling-internal and intentionally slimmer than executed-run artifact packs.
Each run writes to:
runs/<run_id>/
Run ID format:
YYYYMMDDTHHMMSSZ_<random6> (UTC timestamp + random suffix)
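The run ID format above can be sketched as follows. This is a hypothetical illustration: `formatRunId` and `randomSuffix` are invented names, and the suffix alphabet (lowercase letters and digits) is an assumption, since the README only specifies six random characters.

```typescript
// Format a UTC timestamp plus suffix as YYYYMMDDTHHMMSSZ_<suffix>.
function formatRunId(now: Date, suffix: string): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const date = `${now.getUTCFullYear()}${pad(now.getUTCMonth() + 1)}${pad(now.getUTCDate())}`;
  const time = `${pad(now.getUTCHours())}${pad(now.getUTCMinutes())}${pad(now.getUTCSeconds())}`;
  return `${date}T${time}Z_${suffix}`;
}

// Assumed alphabet: lowercase letters and digits. Not cryptographically secure.
function randomSuffix(length: number = 6): string {
  const alphabet = "0123456789abcdefghijklmnopqrstuvwxyz";
  let suffix = "";
  for (let i = 0; i < length; i++) {
    suffix += alphabet[Math.floor(Math.random() * alphabet.length)];
  }
  return suffix;
}
```

Because the timestamp comes first and uses fixed-width UTC fields, run directories sort chronologically by name; the suffix only disambiguates runs started within the same second.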
Always-produced files:
- config.source.json
- config.resolved.json
- manifest.json
- trial_plan.jsonl
- trials.jsonl
- monitoring.jsonl
- receipt.txt
Conditionally produced files:
- embeddings.arrow when at least one eligible embedding is finalized to Arrow
- embeddings.jsonl as fallback when Arrow is not written, or when debug mode explicitly keeps JSONL embeddings
- groups/assignments.jsonl and groups/state.json when grouping artifacts are emitted
- debug/events.jsonl and debug/execution.log only when debug mode is enabled
Resolve-only run artifacts:
- config.resolved.json
- manifest.json
Consolidation notes:
- trials.jsonl is the canonical per-trial record and includes parse plus embedding summaries.
- for Debate runs, intermediate turns are persisted in per-trial transcript records in trials.jsonl.
- final run-level metrics and embedding provenance summaries live in manifest.json.
- this contract supersedes legacy artifact names such as parsed.jsonl, convergence_trace.jsonl, aggregates.json, embeddings.provenance.json, and clusters/*.
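A simple sanity check for an executed run directory is to confirm that every always-produced file from this contract is present. The sketch below assumes a hypothetical helper, `missingArtifacts`, that takes a list of filenames; in practice that list might come from something like Node's `fs.readdirSync` on the run directory. It does not validate file contents against the JSON Schemas.

```typescript
// The always-produced files for executed runs, as listed in this README.
const ALWAYS_PRODUCED = [
  "config.source.json",
  "config.resolved.json",
  "manifest.json",
  "trial_plan.jsonl",
  "trials.jsonl",
  "monitoring.jsonl",
  "receipt.txt",
] as const;

// Hypothetical helper: report which required artifacts are absent.
function missingArtifacts(present: Iterable<string>): string[] {
  const seen = new Set(present);
  return ALWAYS_PRODUCED.filter((name) => !seen.has(name));
}
```

An empty result means the directory has the expected file set for an executed run; resolve-only directories are intentionally slimmer and would report the other five files as missing.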
Exit Code Contract
Exit 0 for:
- normal completion,
- novelty saturation stop,
- max-trials stop,
- graceful Ctrl+C stop.
Exit non-zero only for:
- invalid config,
- inability to start run,
- fatal execution failure.
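The exit code rules above can be summarized as a mapping from run outcome to process exit code. The outcome names below are hypothetical labels for the cases listed in this section, and the specific non-zero value 1 is an assumption; the contract only requires that failures be non-zero.

```typescript
// Hypothetical labels for the outcomes named in the exit code contract.
type RunOutcome =
  | "completed"
  | "novelty_saturation"
  | "max_trials"
  | "user_interrupt"
  | "invalid_config"
  | "start_failure"
  | "fatal_error";

function exitCodeFor(outcome: RunOutcome): number {
  switch (outcome) {
    // Stops that still leave a valid, auditable run exit 0.
    case "completed":
    case "novelty_saturation":
    case "max_trials":
    case "user_interrupt":
      return 0;
    // Failures exit non-zero; 1 is an assumed value.
    case "invalid_config":
    case "start_failure":
    case "fatal_error":
      return 1;
  }
}
```

Treating a graceful Ctrl+C as success means wrapper scripts should read the run artifacts, not the exit code alone, to learn why a run stopped.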
Interpreting results responsibly
Arbiter measures distributional behavior, not correctness.
Important guidance:
- Stopping indicates novelty saturation under the configured measurement setup.
- Embedding groups are measurement artifacts, not ground-truth semantic classes.
- Free-tier models are useful for exploration but not ideal for publication-grade claims.
- Always report measurement settings and model provenance when sharing results.
Troubleshooting
error: config not found ...
Initialize a config first:
arbiter init
Live run fails with missing API key
Set the key in your environment:
export OPENROUTER_API_KEY=<your_key>
--dashboard used in non-TTY
Arbiter warns to stderr and continues headless by contract.
Documentation
- Design reference: docs/DESIGN.md
- Wizard UX spec: docs/product-specs/tui-wizard.md
- ExecPlan contract: docs/PLANS.md
- Contributor/agent rules: AGENTS.md
License
MIT
