@darylkang/arbiter

v0.1.0

Published

2 months ago

Research-grade CLI for studying LLM behavior as a distribution.

darylkang

Arbiter

Arbiter is a research-grade CLI for studying LLM response distributions under repeated, controlled sampling.

It is designed for teams that need:

deterministic trial planning,
auditable artifact outputs,
reproducible run verification,
and clear provenance for requested vs. actual model behavior.

Arbiter focuses on measurement quality and traceability. It does not claim model correctness.

Contract Status

This README defines Arbiter's stabilized v1 product and artifact contracts.

Implementation rollout is tracked in docs/exec-plans/. If runtime behavior diverges from this document, treat that as either:

an implementation defect to fix, or
an explicit migration step that must be recorded in an ExecPlan.

What Arbiter does

Arbiter runs many trials against a fixed question and configuration, then records:

trial-level execution outputs with parse and embedding summaries,
batch-level novelty monitoring signals,
optional embedding-group outputs,
and a complete run manifest for verification.

This supports analysis of how response behavior changes across model/persona/protocol sampling choices.

Core principles

Schema-first: output contracts are defined by JSON Schemas.
Deterministic planning: trial plans are seeded and reproducible.
Audit-first artifacts: runs emit machine-verifiable files, not just terminal logs.
Provenance-aware: requested and actual model identifiers are both recorded.
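
To illustrate what "seeded and reproducible" means in practice, here is a minimal Python sketch. It is not Arbiter's planner: the function names, the record fields (trial_id, model, persona), and the uniform sampling rule are all assumptions made for illustration. The point is only that a plan drawn from a generator seeded once can be regenerated byte for byte.

```python
import json
import random
from typing import Dict, List


def build_trial_plan(seed: int, models: List[str], personas: List[str],
                     trials: int) -> List[Dict[str, object]]:
    """Draw one (model, persona) pair per trial from a generator seeded once.

    Hypothetical sketch: field names and sampling rule are assumptions.
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    plan = []
    for trial_id in range(trials):
        plan.append({
            "trial_id": trial_id,
            "model": rng.choice(models),
            "persona": rng.choice(personas),
        })
    return plan


def to_jsonl(plan: List[Dict[str, object]]) -> str:
    """Serialize one JSON object per line with stable key order."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in plan) + "\n"
```

Calling build_trial_plan twice with the same seed and inputs yields identical plans, so a plan file can be regenerated and compared against the one a run recorded.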

Requirements

Node.js >=24
macOS or Linux terminal (a TTY is required for interactive mode)
OpenRouter API key (OPENROUTER_API_KEY), required only for live runs

Install

Option A: Install globally from npm

npm install -g @darylkang/arbiter

Option B: Install from source (editable/local development)

git clone https://github.com/darylkang/arbiter.git
cd arbiter
npm install
npm run build
npm link

Verify installation:

arbiter --version
arbiter --help

Quick start

Wizard entry (TTY)

Launch the wizard:

arbiter

Initialize a config

arbiter init

This writes arbiter.config.json in the current working directory or, if that name is already taken, the first collision-safe filename:

arbiter.config.1.json
arbiter.config.2.json
and so on
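
The naming rule above can be sketched in a few lines of Python. This is an illustration of the documented sequence, not Arbiter's own code; next_config_path and write_new_config are hypothetical names. Opening the file in exclusive-create mode ("x") is one way to honor the "never overwrite" guarantee even if another process creates the same name between the check and the write.

```python
from pathlib import Path


def next_config_path(directory: Path, stem: str = "arbiter.config",
                     suffix: str = ".json") -> Path:
    """Return the first name in the documented sequence that does not exist yet."""
    candidate = directory / f"{stem}{suffix}"
    index = 0
    while candidate.exists():
        index += 1
        candidate = directory / f"{stem}.{index}{suffix}"
    return candidate


def write_new_config(directory: Path, contents: str) -> Path:
    """Write contents to a fresh config file without overwriting anything."""
    while True:
        path = next_config_path(directory)
        try:
            # Mode "x" fails if the file appeared after the existence check.
            with open(path, "x", encoding="utf-8") as handle:
                handle.write(contents)
            return path
        except FileExistsError:
            continue  # lost a race; try the next free name
```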

After writing, Arbiter prints:

the created config file path
suggested next commands: arbiter, arbiter run --config <file>

Headless run (default)

arbiter run --config arbiter.config.json

Live run override

export OPENROUTER_API_KEY=<your_key>
arbiter run --config arbiter.config.json --mode live

Dashboard monitor (human-only)

arbiter run --config arbiter.config.json --dashboard

If stdout is not a TTY, Arbiter prints a warning to stderr and continues headless.

CLI Contract (v1)

Arbiter exposes exactly three primary entry points:

arbiter
arbiter init
arbiter run

Global flags:

--help, -h
--version, -V

Command behavior:

arbiter: launch the wizard when stdout is a TTY; otherwise print help and exit 0.
arbiter init: write a collision-safe default config in the current working directory; never overwrite existing files.
arbiter run: headless execution command; requires --config <path>.
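
The dispatch rules above, together with the documented --dashboard fallback, can be modeled as a small pure function. This Python sketch is only a reading of the contract: the Decision type, the action labels, and the message wording are hypothetical, and Arbiter's real argument parsing may differ (for example, in how it treats --config=<path>).

```python
from typing import List, NamedTuple


class Decision(NamedTuple):
    action: str      # "wizard", "help", "version", "init", "run", or "usage_error"
    dashboard: bool  # True only when the dashboard monitor should be shown
    warning: str     # message for stderr; empty when there is nothing to report


def dispatch(args: List[str], stdout_is_tty: bool) -> Decision:
    """Map CLI arguments and TTY state to an action (hypothetical model)."""
    if "--help" in args or "-h" in args:
        return Decision("help", False, "")
    if "--version" in args or "-V" in args:
        return Decision("version", False, "")
    if not args:
        # Bare `arbiter`: wizard on a TTY, otherwise help (which exits 0).
        return Decision("wizard" if stdout_is_tty else "help", False, "")
    command, rest = args[0], args[1:]
    if command == "init":
        return Decision("init", False, "")
    if command == "run":
        has_config = "--config" in rest and rest.index("--config") + 1 < len(rest)
        if not has_config:
            return Decision("usage_error", False, "arbiter run requires --config <path>")
        if "--dashboard" in rest and not stdout_is_tty:
            # Documented fallback: warn on stderr, then continue headless.
            return Decision("run", False, "--dashboard needs a TTY; continuing headless")
        return Decision("run", "--dashboard" in rest, "")
    return Decision("usage_error", False, f"unknown command: {command}")
```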

Run override flags (arbiter run):

--out <dir> (default: ./runs)
--workers <n>
--batch-size <n>
--max-trials <n>
--mode <mock|live>
--dashboard (TTY-only Stage 2/3 monitor)

Not part of v1:

no --headless
no --verbose
no wizard flag (--wizard)
no experiment-variable CLI flags (models, personas, protocol, decode, debate params, clustering thresholds)
no redundant aliases beyond -h and -V

Config Resolution Contract

Resolution precedence:

built-in defaults
config file
CLI override flags
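
Reading the list above from lowest to highest precedence (which the term "override flags" suggests), resolution amounts to layering three dictionaries. The Python sketch below shows one way to do that. The key names and the recursive merge of nested objects are assumptions; the README does not specify how nested values are combined.

```python
import copy
from collections.abc import Mapping
from typing import Any, Dict


def merge_layer(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where override wins over base; nested dicts are merged."""
    merged = copy.deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = merge_layer(merged[key], dict(value))
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_config(defaults: Dict[str, Any], file_config: Dict[str, Any],
                   cli_overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the three layers in order: defaults, then config file, then CLI flags."""
    # Flags the user did not pass are represented as None and skipped (assumption).
    passed = {key: value for key, value in cli_overrides.items() if value is not None}
    return merge_layer(merge_layer(defaults, file_config), passed)
```

Because each layer is copied rather than edited in place, the input dictionaries stay untouched, which mirrors the guarantee that the source config is never mutated.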

Per run directory, Arbiter writes:

config.source.json (exact input config as read)
config.resolved.json (final resolved config used to execute)

The source config file is never mutated during run execution.

Run Directory Contract

Scope note:

The artifact lists below apply to executed runs (arbiter run ...), including graceful user stop.
Resolve-only directories are tooling-internal and intentionally slimmer than executed-run artifact packs.

Each run writes to:

runs/<run_id>/

Run ID format:

YYYYMMDDTHHMMSSZ_<random6> (UTC timestamp + random suffix)
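
For scripts that create or parse run directory names, the format can be handled as in the Python sketch below. The alphabet of the random suffix is not documented; lowercase letters and digits are assumed for generation, and any six ASCII letters or digits are accepted when parsing. The helper names are hypothetical.

```python
import re
import secrets
import string
from datetime import datetime, timezone
from typing import Optional

# Assumption: the suffix is six ASCII letters or digits.
RUN_ID_PATTERN = re.compile(r"^(\d{8}T\d{6}Z)_([A-Za-z0-9]{6})$")
SUFFIX_ALPHABET = string.ascii_lowercase + string.digits
STAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def make_run_id(now: Optional[datetime] = None) -> str:
    """Build an ID shaped like YYYYMMDDTHHMMSSZ_<random6> from a UTC timestamp."""
    if now is None:
        now = datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    suffix = "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(6))
    return f"{stamp}_{suffix}"


def parse_run_timestamp(run_id: str) -> datetime:
    """Extract the UTC start time encoded in a run ID."""
    match = RUN_ID_PATTERN.match(run_id)
    if match is None:
        raise ValueError(f"not a run ID: {run_id!r}")
    return datetime.strptime(match.group(1), STAMP_FORMAT).replace(tzinfo=timezone.utc)
```

Because the timestamp is fixed-width and in UTC, sorting run IDs as plain strings also sorts them chronologically, down to one-second resolution.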

Always-produced files:

config.source.json
config.resolved.json
manifest.json
trial_plan.jsonl
trials.jsonl
monitoring.jsonl
receipt.txt
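
A quick completeness check for an executed run can be scripted against the list above. The Python sketch below only tests that each file exists; it does not validate contents against the JSON Schemas. The function name is hypothetical.

```python
from pathlib import Path
from typing import List

# File names copied from the always-produced list for executed runs.
ALWAYS_PRODUCED = (
    "config.source.json",
    "config.resolved.json",
    "manifest.json",
    "trial_plan.jsonl",
    "trials.jsonl",
    "monitoring.jsonl",
    "receipt.txt",
)


def missing_artifacts(run_dir: Path) -> List[str]:
    """Return the always-produced files that are absent from run_dir."""
    return [name for name in ALWAYS_PRODUCED if not (run_dir / name).is_file()]
```

An empty result means the directory has the full always-produced set. A resolve-only directory is expected to report missing files, since it carries only config.resolved.json and manifest.json.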

Conditionally produced files:

embeddings.arrow when at least one eligible embedding is finalized to Arrow
embeddings.jsonl as fallback when Arrow is not written, or when debug mode explicitly keeps JSONL embeddings
groups/assignments.jsonl and groups/state.json when grouping artifacts are emitted
debug/events.jsonl and debug/execution.log only when debug mode is enabled
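
Since these files may or may not be present, analysis scripts should probe for them rather than assume them. The Python sketch below reports which conditional artifacts exist and picks an embeddings file. Preferring embeddings.arrow when both formats are present is an assumption based on embeddings.jsonl being described as the fallback; the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional


def pick_embeddings_file(run_dir: Path) -> Optional[Path]:
    """Return embeddings.arrow if present, else embeddings.jsonl, else None."""
    for name in ("embeddings.arrow", "embeddings.jsonl"):
        candidate = run_dir / name
        if candidate.is_file():
            return candidate
    return None


def conditional_artifacts(run_dir: Path) -> Dict[str, bool]:
    """Report which groups of conditionally produced files are present."""
    def has_all(*names: str) -> bool:
        return all((run_dir / name).is_file() for name in names)

    return {
        "embeddings": pick_embeddings_file(run_dir) is not None,
        "groups": has_all("groups/assignments.jsonl", "groups/state.json"),
        "debug": has_all("debug/events.jsonl", "debug/execution.log"),
    }
```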

Resolve-only run artifacts:

config.resolved.json
manifest.json

Consolidation notes:

trials.jsonl is the canonical per-trial record and includes parse plus embedding summaries.
For Debate runs, intermediate turns are persisted in per-trial transcript records in trials.jsonl.
Final run-level metrics and embedding provenance summaries live in manifest.json.
This contract supersedes legacy artifact names such as parsed.jsonl, convergence_trace.jsonl, aggregates.json, embeddings.provenance.json, and clusters/*.
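
Because trials.jsonl holds one JSON object per line, it can be streamed without loading the whole file. The Python sketch below is a generic JSON Lines reader plus a tally helper. The record schema is defined by Arbiter's JSON Schemas and is not reproduced here, so any field name passed to tally_field (such as "status" in the usage note) is a placeholder, not a documented field.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed object per non-empty line, reporting the bad line on error."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc


def tally_field(path: Path, field: str) -> Counter:
    """Count the values of one top-level field across all records."""
    return Counter(str(record.get(field, "<missing>")) for record in iter_jsonl(path))
```

For example, tally_field(Path("runs/<run_id>/trials.jsonl"), "status") would count records by a hypothetical status field; consult the schema for the real field names.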

Exit Code Contract

Exit 0 for:

normal completion,
novelty saturation stop,
max-trials stop,
graceful Ctrl+C stop.

Exit non-zero only for:

invalid config,
inability to start run,
fatal execution failure.
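
For automation that wraps the CLI, the contract above means the exit code separates success from failure but does not say why a successful run stopped. The Python sketch below builds an arbiter run command from documented flags and interprets the result accordingly. The helper names are hypothetical, and run_arbiter assumes the arbiter executable is on PATH.

```python
import subprocess
from typing import List, Optional


def build_run_command(config: str, mode: Optional[str] = None,
                      out_dir: Optional[str] = None,
                      max_trials: Optional[int] = None) -> List[str]:
    """Assemble an `arbiter run` argument list using documented flags only."""
    command = ["arbiter", "run", "--config", config]
    if mode is not None:
        if mode not in ("mock", "live"):
            raise ValueError("mode must be 'mock' or 'live'")
        command += ["--mode", mode]
    if out_dir is not None:
        command += ["--out", out_dir]
    if max_trials is not None:
        command += ["--max-trials", str(max_trials)]
    return command


def describe_exit(returncode: int) -> str:
    """Translate an exit code into the two outcomes the contract distinguishes."""
    if returncode == 0:
        # Covers normal completion, saturation stop, max-trials stop, and Ctrl+C.
        return "completed or stopped gracefully; inspect the run directory for details"
    return "failed: invalid config, run could not start, or fatal execution failure"


def run_arbiter(command: List[str]) -> str:
    """Run the CLI and summarize its exit status (needs arbiter installed)."""
    completed = subprocess.run(command, check=False)
    return describe_exit(completed.returncode)
```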

Interpreting results responsibly

Arbiter measures distributional behavior, not correctness.

Important guidance:

An early stop indicates novelty saturation under the configured measurement setup, not that the responses are correct.
Embedding groups are measurement artifacts, not ground-truth semantic classes.
Free-tier models are useful for exploration but not ideal for publication-grade claims.
Always report measurement settings and model provenance when sharing results.

Troubleshooting

`error: config not found ...`

Initialize a config first:

arbiter init

Live run fails with missing API key

Set the key in your environment:

export OPENROUTER_API_KEY=<your_key>
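
Scripts that launch live runs can check for the key up front instead of waiting for the run to fail. The Python sketch below is a hypothetical preflight helper, not part of Arbiter; it relies only on the documented variable name OPENROUTER_API_KEY and the documented mode values.

```python
import os
from typing import Mapping, Optional


def missing_key_message(mode: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return an error message if a live run would start without an API key."""
    if env is None:
        env = os.environ
    if mode == "live" and not env.get("OPENROUTER_API_KEY", "").strip():
        return "OPENROUTER_API_KEY is not set; export it before a live run"
    return None  # mock runs do not need a key
```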

`--dashboard` used in a non-TTY environment

By contract, Arbiter prints a warning to stderr and continues headless.

Documentation

Design reference: docs/DESIGN.md
Wizard UX spec: docs/product-specs/tui-wizard.md
ExecPlan contract: docs/PLANS.md
Contributor/agent rules: AGENTS.md

License

MIT