@cvr/xp

v0.0.1

Published

2 months ago

Autonomous experiment daemon — LLM-driven optimization of any measurable metric


_cevr

ai autonomous benchmark claude cli codex effect experiment optimization

xp

Autonomous experiment daemon. Point an LLM at any benchmark and it optimizes the metric in a loop.

Built with Effect v4 and Bun.

Install

bun run build   # compiles binary to bin/xp + symlinks to ~/.bun/bin/

Usage

# Start an experiment
xp start optimize-fft \
  --metric latency --unit ms --direction min \
  --benchmark "./bench.sh" \
  --objective "reduce FFT latency" \
  --provider claude

# Monitor
xp status            # current state
xp logs              # daemon output
xp logs -f           # tail daemon output
xp results           # all trial results
xp results --last 5  # last 5 trials

# Steer the agent mid-run
xp steer "try SIMD intrinsics instead of auto-vectorization"

# Stop
xp stop

Commands

| Command            | Description                                   |
| ------------------ | --------------------------------------------- |
| `start <name>`     | Initialize and start an experiment            |
| `stop`             | Stop the daemon                               |
| `status`           | Show experiment state (`--json`)              |
| `logs`             | View daemon log (`-f` to follow)              |
| `results`          | Show trial results (`--last N`, `--json`)     |
| `steer <guidance>` | Send guidance to the running experiment       |

`start` Flags

| Flag               | Description                                    | Default  |
| ------------------ | ---------------------------------------------- | -------- |
| `--metric`         | Metric name to optimize                        | required |
| `--unit`           | Metric unit                                    | `""`     |
| `--direction`      | `min` or `max`                                 | required |
| `--benchmark`      | Shell command that emits `METRIC name=value`   | required |
| `--objective`      | What the agent should optimize                 | required |
| `--provider`       | `claude` or `codex`                            | `claude` |
| `--max-iterations` | Budget cap                                     | 50       |
| `--max-failures`   | Max consecutive failures                       | 5        |

Benchmark Contract

The benchmark command must print metrics to stdout in this format:

METRIC latency=42.5
METRIC throughput=1200

One METRIC name=value per line. The --metric flag selects which one to optimize.
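As a rough illustration of the contract, the sketch below shows both sides: a benchmark formatting its metric lines, and a consumer picking out the one named by `--metric`. The function names (`formatMetric`, `parseMetrics`, `selectMetric`) are hypothetical and are not part of xp. The parsing rules are assumptions too: non-matching lines are ignored, and names contain no whitespace or `=`. xp's own parser may be stricter or more lenient.

```typescript
// Hypothetical sketch of the METRIC contract; not xp's actual parser.

/** Format one metric line the way a benchmark would print it to stdout. */
function formatMetric(name: string, value: number): string {
  return `METRIC ${name}=${value}`;
}

/** Collect every well-formed `METRIC name=value` line from benchmark stdout. */
function parseMetrics(stdout: string): Map<string, number> {
  const metrics = new Map<string, number>();
  for (const line of stdout.split(/\r?\n/)) {
    // Assumption: names contain no whitespace or "=", values are plain numbers.
    const match = /^METRIC\s+([^\s=]+)=(\S+)$/.exec(line.trim());
    if (match === null) continue; // ordinary log output is ignored
    const [, name, raw] = match;
    if (name === undefined || raw === undefined) continue;
    const value = Number(raw);
    if (Number.isFinite(value)) metrics.set(name, value);
  }
  return metrics;
}

/** Pick the metric named by --metric, or undefined if it was not emitted. */
function selectMetric(stdout: string, metric: string): number | undefined {
  return parseMetrics(stdout).get(metric);
}

const sampleOutput = [
  "warming up...",
  formatMetric("latency", 42.5),
  formatMetric("throughput", 1200),
].join("\n");

console.log(selectMetric(sampleOutput, "latency")); // 42.5
```

A benchmark can print other output freely. Under the assumptions above, only lines that start with `METRIC` and carry a numeric value count.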

How It Works

Baseline: runs the benchmark on the current code to establish a starting point
Loop: invokes the LLM agent with context (objective, best score, dead ends, user guidance); the agent makes changes in a git worktree, the benchmark runs, and the result is kept or reverted
Persistence: all events are logged as append-only JSONL, with two-phase decisions for crash safety
Worktree isolation: experiments run in .xp/worktree/ on an xp/<name> branch — your working directory stays clean
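The keep-or-revert step and the two-phase log can be sketched roughly as follows. Everything here is an assumption made for illustration: the event names (`decision_intent`, `decision_done`), the strict-improvement rule, and the reading of "two-phase" as "record the intent before acting, record completion afterwards". xp's real event schema and recovery logic are not documented in this README and may differ.

```typescript
// Hypothetical sketch of the keep-or-revert step; not xp's actual implementation.

type Direction = "min" | "max";
type Decision = "keep" | "revert";

/** Assumption: a trial is kept only if it strictly beats the best score so far. */
function decide(direction: Direction, best: number, candidate: number): Decision {
  const improved = direction === "min" ? candidate < best : candidate > best;
  return improved ? "keep" : "revert";
}

// Assumed event shapes; the real log format is not documented here.
type LogEvent =
  | { type: "decision_intent"; trial: number; decision: Decision; value: number }
  | { type: "decision_done"; trial: number };

/** Serialize one event as a single JSONL line. */
function toJsonl(event: LogEvent): string {
  return JSON.stringify(event) + "\n";
}

/**
 * Assumed reading of "two-phase": after a crash, an intent without a matching
 * done record marks a decision that still has to be finished or undone.
 */
function pendingTrials(log: string): number[] {
  const pending = new Set<number>();
  for (const line of log.split("\n")) {
    if (line.trim() === "") continue;
    const event = JSON.parse(line) as LogEvent;
    if (event.type === "decision_intent") pending.add(event.trial);
    else pending.delete(event.trial);
  }
  return Array.from(pending);
}

// Trial 1 improves latency (direction "min") and completes.
// Trial 2 is worse and the process "crashes" after phase one.
let log = "";
log += toJsonl({ type: "decision_intent", trial: 1, decision: decide("min", 42.5, 40.1), value: 40.1 });
log += toJsonl({ type: "decision_done", trial: 1 });
log += toJsonl({ type: "decision_intent", trial: 2, decision: decide("min", 40.1, 41.0), value: 41.0 });

console.log(pendingTrials(log)); // only trial 2 is still pending
```

Because the log is only ever appended to, recovery in this sketch needs nothing more than a replay from the top of the file.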

Development

bun run dev -- --help   # run from source
bun run gate            # typecheck + lint + fmt + test + build
bun test                # tests only
