# @cliwatch/cli-bench
LLM CLI agent testing framework — benchmark how well AI models use your CLI tool. Runs tasks directly on the host (no Docker required), has models execute commands via tool-calling, and validates results with assertions.
## Quick start

```bash
# 1. Scaffold a config file
npx @cliwatch/cli-bench init

# 2. Edit cli-bench.yaml (define your CLI, providers, tasks)

# 3. Run (locally or in CI)
npx @cliwatch/cli-bench
```

## Config file (cli-bench.yaml)
```yaml
cli: docker
version_command: "docker --version"

providers:
  - anthropic/claude-sonnet-4-20250514
  - openai/gpt-4o

tasks:
  - id: pull-image
    intent: "Pull the latest nginx image"
    assert:
      - ran: "docker pull.*nginx"
      - verify:
          run: "docker images nginx --format '{{.Repository}}'"
          output_contains: "nginx"

  - id: create-project
    intent: "Create a new project called my-app"
    setup:
      - "mkdir -p /tmp/bench-workspace"
    assert:
      - ran: "mycli create.*my-app"
      - exit_code: 0
      - file_exists: "/tmp/bench-workspace/my-app/package.json"
      - verify:
          run: "mycli list --json"
          output_contains: "my-app"
```

## Split tasks across files
```yaml
cli: docker
providers: [anthropic/claude-sonnet-4-20250514]

tasks:
  - file://tasks/basics.yaml
  - file://tasks/advanced/*.yaml
  - file://tasks/**/*.yaml   # recursive glob
```

Each referenced file is a plain array of tasks:
```yaml
# tasks/basics.yaml
- id: list-containers
  intent: "List all running containers"
  assert:
    - ran: "docker ps"
    - exit_code: 0
```

## Config fields
| Field | Required | Description |
|-------|----------|-------------|
| cli | Yes | CLI name (must be in PATH) |
| version_command | No | e.g. "mycli --version", for tracking |
| providers | No | Model IDs (default: claude-sonnet-4) |
| help_modes | No | injected, discoverable, none (default: [injected]) |
| concurrency | No | Max concurrent API calls (default: 3) |
| workdir | No | Working directory (default: temp dir per task) |
| upload | No | auto, always, never (default: auto) |
| repeat | No | Run all tasks N times (default: 1, range: 1-100) |
| system_prompt | No | Custom prompt appended to the default agent system message |
| thresholds | No | Pass rate thresholds (see docs) |
| env | No | Environment variables for all tasks (supports {{workdir}}) |
| setup | No | Commands to run before each task (supports {{workdir}}) |
| cleanup | No | Commands to run after each task (supports {{workdir}}) |
| tasks | Yes | Array of tasks or file:// references |
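For example, the `env`, `setup`, and `cleanup` fields can prepare shared fixtures for every task, with `{{workdir}}` expanding per task. A minimal sketch; the `MYCLI_HOME` variable and the paths are hypothetical, chosen for illustration:

```yaml
env:
  MYCLI_HOME: "{{workdir}}/.mycli"   # hypothetical variable, for illustration
setup:
  - "mkdir -p {{workdir}}/fixtures"  # runs before each task
cleanup:
  - "rm -rf {{workdir}}/fixtures"    # runs after each task
```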
## Assertion types
| Assertion | Example | Description |
|-----------|---------|-------------|
| ran | ran: "docker ps" | Agent ran a command matching regex |
| not_ran | not_ran: "rm -rf" | No command matched regex |
| run_count | run_count: {pattern: "curl", min: 1, max: 3} | Count of matching commands |
| output_contains | output_contains: "hello" | Last command stdout contains |
| output_equals | output_equals: "ok" | Last command stdout exact match |
| error_contains | error_contains: "warning" | Last command stderr contains |
| exit_code | exit_code: 0 | Last command exit code |
| file_exists | file_exists: "./my-app/package.json" | File exists |
| file_contains | file_contains: {path: "...", text: "..."} | File content check |
| verify | verify: {run: "cmd", output_contains: "ok"} | Run post-agent command, check output |
`verify` is the universal escape hatch: it runs any command after the agent finishes and checks its output.
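Assertions compose within a single task's `assert` list. A sketch combining the examples from the table above (the patterns, paths, and the `mycli` commands are illustrative, not prescribed):

```yaml
assert:
  - ran: "mycli create.*my-app"        # agent must have run a matching command
  - not_ran: "rm -rf"                  # guard against destructive commands
  - run_count: {pattern: "curl", min: 1, max: 3}
  - file_contains: {path: "./my-app/package.json", text: "my-app"}
  - verify:                            # post-agent check
      run: "mycli list --json"
      output_contains: "my-app"
```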
## GitHub Actions
```yaml
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with: { node-version: 22 }
  - run: npm install -g my-cli
  - run: npx @cliwatch/cli-bench
    env:
      AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
      CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }} # optional, uploads to dashboard
```

No Docker required. Commands run directly on the CI runner.
## Environment variables
| Variable | Description |
|----------|-------------|
| AI_GATEWAY_API_KEY | Vercel AI Gateway key — provides access to all models |
| CLIWATCH_API_KEY | API key from app.cliwatch.com for uploading results |
## Uploading results
Results upload automatically when `CLIWATCH_API_KEY` is set (the default is `upload: auto`). Override with `upload: always` or `upload: never` in your config, or pass `--upload` on the CLI.
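For example, to keep results local even when the key is present in the environment (a config fragment using the `upload` field documented above):

```yaml
# cli-bench.yaml
upload: never   # results stay local even if CLIWATCH_API_KEY is set
```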
## Available models
| Model ID | Provider |
|----------|----------|
| anthropic/claude-sonnet-4-20250514 | Anthropic |
| anthropic/claude-haiku-4-5-20251001 | Anthropic |
| openai/gpt-4o | OpenAI |
| google/gemini-2.5-pro | Google |
Any model supported by the Vercel AI SDK gateway can be used; just pass the full `provider/model-id`.
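A single config can benchmark several models side by side by listing them under `providers`; this fragment reuses model IDs from the table above:

```yaml
providers:
  - anthropic/claude-sonnet-4-20250514
  - anthropic/claude-haiku-4-5-20251001
  - google/gemini-2.5-pro
```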
## Changelog

### 0.5.0
- `system_prompt` config field for custom agent instructions

### 0.4.0
- Repeat support, threshold checks, conversation traces, task suite hashing

### 0.3.0
- Config file mode, file references with globs, CI metadata, dashboard uploads
See CHANGELOG.md for full history.
## License
MIT
