# @cliwatch/cli-bench
LLM CLI agent testing framework — benchmark how well AI models use your CLI tool. Runs tasks directly on the host (no Docker required), has models execute commands via tool-calling, and validates results with assertions.
## Quick start

```bash
# 1. Scaffold a config file
npx @cliwatch/cli-bench init

# 2. Edit cli-bench.yaml (define your CLI, providers, tasks)

# 3. Run (locally or in CI)
npx @cliwatch/cli-bench
```

## Config file (cli-bench.yaml)
```yaml
cli: docker
version_command: "docker --version"

providers:
  - anthropic/claude-sonnet-4-20250514
  - openai/gpt-4o

tasks:
  - id: pull-image
    intent: "Pull the latest nginx image"
    assert:
      - ran: "docker pull.*nginx"
      - verify:
          run: "docker images nginx --format '{{.Repository}}'"
          output_contains: "nginx"

  - id: create-project
    intent: "Create a new project called my-app"
    setup:
      - "mkdir -p /tmp/bench-workspace"
    assert:
      - ran: "mycli create.*my-app"
      - exit_code: 0
      - file_exists: "/tmp/bench-workspace/my-app/package.json"
      - verify:
          run: "mycli list --json"
          output_contains: "my-app"
```

## Split tasks across files
```yaml
cli: docker
providers: [anthropic/claude-sonnet-4-20250514]

tasks:
  - file://tasks/basics.yaml
  - file://tasks/advanced/*.yaml
  - file://tasks/**/*.yaml   # recursive glob
```

Each referenced file is a plain array of tasks:
```yaml
# tasks/basics.yaml
- id: list-containers
  intent: "List all running containers"
  assert:
    - ran: "docker ps"
    - exit_code: 0
```

## Config fields
| Field | Required | Description |
|-------|----------|-------------|
| cli | Yes | CLI name (must be in PATH) |
| version_command | No | e.g. "mycli --version", for tracking |
| providers | No | Model IDs (default: claude-sonnet-4) |
| help_modes | No | injected, discoverable, none (default: [injected]) |
| concurrency | No | Max concurrent API calls (default: 3) |
| workdir | No | Working directory (default: temp dir per task) |
| upload | No | auto, always, never (default: auto) |
| repeat | No | Run all tasks N times (default: 1, range: 1-100) |
| system_prompt | No | Custom prompt appended to the default agent system message |
| thresholds | No | Pass rate thresholds (see docs) |
| env | No | Environment variables for all tasks (supports {{workdir}}) |
| setup | No | Commands to run before each task (supports {{workdir}}) |
| cleanup | No | Commands to run after each task (supports {{workdir}}) |
| tasks | Yes | Array of tasks or file:// references |
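For example, the `env`, `setup`, and `cleanup` fields can prepare shared fixtures for every task, with `{{workdir}}` expanding per task. A minimal sketch; the `MYCLI_HOME` variable and the paths are hypothetical, chosen for illustration:

```yaml
env:
  MYCLI_HOME: "{{workdir}}/.mycli"   # hypothetical variable, for illustration
setup:
  - "mkdir -p {{workdir}}/fixtures"  # runs before each task
cleanup:
  - "rm -rf {{workdir}}/fixtures"    # runs after each task
```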
## Assertion types
| Assertion | Example | Description |
|-----------|---------|-------------|
| ran | ran: "docker ps" | Agent ran a command matching regex |
| not_ran | not_ran: "rm -rf" | No command matched regex |
| run_count | run_count: {pattern: "curl", min: 1, max: 3} | Count of matching commands |
| output_contains | output_contains: "hello" | Last command stdout contains |
| output_equals | output_equals: "ok" | Last command stdout exact match |
| error_contains | error_contains: "warning" | Last command stderr contains |
| exit_code | exit_code: 0 | Last command exit code |
| file_exists | file_exists: "./my-app/package.json" | File exists |
| file_contains | file_contains: {path: "...", text: "..."} | File content check |
| verify | verify: {run: "cmd", output_contains: "ok"} | Run post-agent command, check output |
`verify` is the universal escape hatch: it runs any command after the agent finishes and checks its output.
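Assertions compose within a single task's `assert` list. A sketch combining the examples from the table above (the patterns, paths, and the `mycli` commands are illustrative, not prescribed):

```yaml
assert:
  - ran: "mycli create.*my-app"        # agent must have run a matching command
  - not_ran: "rm -rf"                  # guard against destructive commands
  - run_count: {pattern: "curl", min: 1, max: 3}
  - file_contains: {path: "./my-app/package.json", text: "my-app"}
  - verify:                            # post-agent check
      run: "mycli list --json"
      output_contains: "my-app"
```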
## GitHub Actions
```yaml
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with: { node-version: 22 }
  - run: npm install -g my-cli
  - run: npx @cliwatch/cli-bench
    env:
      AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
      CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }} # optional, uploads to dashboard
```

No Docker required. Commands run directly on the CI runner.
## Environment variables
| Variable | Description |
|----------|-------------|
| AI_GATEWAY_API_KEY | Vercel AI Gateway key — provides access to all models |
| CLIWATCH_API_KEY | API key from app.cliwatch.com for uploading results |
## Uploading results
Results upload automatically when `CLIWATCH_API_KEY` is set (the default is `upload: auto`). Override with `upload: always` or `upload: never` in your config, or pass `--upload` on the CLI.
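For example, to keep results local even when the key is present in the environment (a config fragment using the `upload` field documented above):

```yaml
# cli-bench.yaml
upload: never   # results stay local even if CLIWATCH_API_KEY is set
```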
## Available models
| Model ID | Provider |
|----------|----------|
| anthropic/claude-sonnet-4-20250514 | Anthropic |
| anthropic/claude-haiku-4-5-20251001 | Anthropic |
| openai/gpt-4o | OpenAI |
| google/gemini-2.5-pro | Google |
Any model supported by the Vercel AI SDK gateway can be used; just pass the full `provider/model-id`.
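A single config can benchmark several models side by side by listing them under `providers`; this fragment reuses model IDs from the table above:

```yaml
providers:
  - anthropic/claude-sonnet-4-20250514
  - anthropic/claude-haiku-4-5-20251001
  - google/gemini-2.5-pro
```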
## Changelog

### 0.5.0
- `system_prompt` config field for custom agent instructions

### 0.4.0
- Repeat support, threshold checks, conversation traces, task suite hashing

### 0.3.0
- Config file mode, file references with globs, CI metadata, dashboard uploads
See CHANGELOG.md for full history.
## License
MIT
