skill-eval-runner

v0.1.0

Published

10 days ago

CLI runner for testing AI coding-agent Skills.

Skill Eval Runner

ser is a CLI test runner for SKILL.md files. It runs a skill in a fresh workspace, sends the test prompt to an agent, and checks the observable result: files, stdout, stderr, exit code, JSON, commands, duration, token usage, and the agent's final response.

It is useful once "I tried the prompt once and it looked fine" stops being enough. Skills tend to become part of real workflows, and those workflows do not always fail loudly. An agent can give a convincing answer while forgetting to create a file, putting it in the wrong place, or dropping an important rule after a small edit to the instructions.

Skill Eval Runner screenshot: dry-run suite in the terminal

What `ser` does

Finds *.skill-test.yml and *.skill-test.yaml files next to skills.
Creates a sandbox for each test case.
Passes the SKILL.md content and test prompt to an adapter.
Lets the agent work inside the sandbox through file and command tools.
Verifies the result with assertions.
Writes reports for people and CI: console, JSON, HTML, JUnit, Markdown, and GitHub Actions annotations.
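
The discovery step in the list above can be pictured with a short Python sketch. This is an illustration only, not ser's implementation: the function name `discover_suites` is hypothetical, and the only detail taken from this README is the pair of file patterns.

```python
from pathlib import Path
import tempfile

# File patterns taken from this README; everything else is illustrative.
PATTERNS = ("*.skill-test.yml", "*.skill-test.yaml")


def discover_suites(root: Path) -> list[Path]:
    """Return suite files under root, sorted for a stable run order."""
    found: set[Path] = set()
    for pattern in PATTERNS:
        # rglob walks the tree recursively, like '**/<pattern>'.
        found.update(p for p in root.rglob(pattern) if p.is_file())
    return sorted(found)


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    skill_dir = demo_root / "skills" / "docs"
    skill_dir.mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text("# Docs skill\n")
    (skill_dir / "docs.skill-test.yml").write_text("name: docs-skill\n")
    (skill_dir / "notes.yml").write_text("ignored: true\n")
    for suite in discover_suites(demo_root):
        print(suite.relative_to(demo_root))
```

Only `docs.skill-test.yml` is reported here; `notes.yml` does not match either pattern.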

The short version: ser is pytest for AI Skills.

Why this matters

A manual chat run is a weak regression test. Everything can look reasonable in the conversation while the real contract of the skill sits somewhere else: created files, directory structure, commands, JSON output, exit codes, and the absence of errors.

ser makes those expectations explicit. You describe what the skill should do, then run the same check locally, in a pull request, or in CI.

Status

Current version: 0.1.0.

The runner can already execute real suite files, run in dry-run mode, use Claude, use the codex adapter over the OpenAI API, connect to OpenAI-compatible endpoints, run tmpdir and Docker sandboxes, save artifacts, redact secrets, retry failed calls, and generate several report formats.

The public surface is still small and, at 0.x, may change between releases, so pin the version in CI and upgrade deliberately.

Requirements:

Node.js >=20
Docker, only when using --sandbox docker
API keys, only for adapters that call a model

Installation

npm install -g skill-eval-runner
ser doctor

For local development from the repository:

npm install
npm run build
npm run dev -- doctor

Quick start

Start with dry-run. It does not call an LLM and does not create files on behalf of an agent. Its job is simpler: validate discovery, YAML, config, and reporting without tokens or API keys.

Create skill.skill-test.yml next to SKILL.md:

schema_version: '1.0'
name: sample-skill
tags: [sample, smoke]
skill: ./SKILL.md
adapter: dry-run

tests:
  - name: dry-run-smoke
    prompt: 'Explain what you would do in {{WORKSPACE}}.'
    assertions:
      - type: exit_code
        expected: 0
      - type: stderr_empty
      - type: response_contains
        contains: '[DRY RUN]'
      - type: token_usage_under
        max_total: 1

Run it:

ser run --dry-run --report console,json,html .

Once the suite shape is right, switch to a real adapter and assert on what the agent actually did in the workspace:

schema_version: '1.0'
name: migration-skill
tags: [critical]
skill: ./SKILL.md
adapter: claude

tests:
  - name: creates-user-migration
    prompt: 'Create a user table migration in {{WORKSPACE}}.'
    assertions:
      - type: exit_code
        expected: 0
      - type: stderr_empty
      - type: file_exists
        path: db/migrations/001_create_users.sql
      - type: file_contains
        path: db/migrations/001_create_users.sql
        contains: CREATE TABLE users

Example run with a provider key:

SER_ANTHROPIC_API_KEY=... ser run . --adapter claude --save-artifacts

Suite format

A minimal suite contains a path to the skill and a list of tests:

schema_version: '1.0'
name: docs-skill
skill: ./SKILL.md
tags: [docs]

setup_files:
  - src: ./fixtures/package.json
    dest: package.json

tests:
  - name: updates-readme
    tags: [smoke]
    prompt: 'Use {{WORKSPACE}} and update the README.'
    assertions:
      - type: file_exists
        path: README.md
      - type: response_not_contains
        contains: 'I cannot'

Useful fields:

| Field | Scope | Purpose |
| ------------------- | ------------------- | ---------------------------------------------------------------------- |
| tags | suite, test | Select fast, slow, critical, or experimental checks. |
| setup_files | suite, test | Copy fixtures into the sandbox before the agent runs. |
| setup_commands | suite, test | Prepare a workspace: install, generate, migrate, build. |
| teardown_commands | suite, test | Clean up temporary state after a case when needed. |
| env | suite, test, config | Pass environment variables to setup commands and adapters. |
| timeout_seconds | suite, test, config | Override the shared timeout for long or short cases. |
| adapter | suite, config, CLI | Choose dry-run, claude, codex, or openai-compat. |
| sandbox | suite, config, CLI | Choose tmpdir or docker. |
| sandbox_config | suite, config | Configure network, memory, CPU, process limits, or Dockerfile details. |

{{WORKSPACE}} in a prompt is replaced with the sandbox path. This is handy when a skill should work with a test project rather than the runner repository itself.
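
The substitution amounts to a plain string replacement. A minimal Python sketch, assuming every occurrence of the placeholder is replaced; `render_prompt` is a hypothetical name, not part of ser:

```python
PLACEHOLDER = "{{WORKSPACE}}"  # placeholder name taken from this README


def render_prompt(prompt: str, workspace: str) -> str:
    """Replace every {{WORKSPACE}} occurrence with the sandbox path."""
    return prompt.replace(PLACEHOLDER, workspace)


print(render_prompt("Create a user table migration in {{WORKSPACE}}.", "/tmp/ser-case-1"))
# prints: Create a user table migration in /tmp/ser-case-1.
```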

Commands

ser run [path]                  # find suites, run tests, write reports
ser init [path]                 # create example.skill-test.yml and .skilleval.yml
ser validate [path]             # validate YAML, skill paths, and suite structure
ser list [path]                 # show discovered suites and cases
ser report <run.json> --format html
ser report <run.json> --format junit
ser report <run.json> --format markdown
ser doctor                      # check Node, Docker, keys, and report directories
ser completion bash             # completion for bash, zsh, or fish

Common run options:

ser run . --filter 'migration-skill::creates-*'
ser run . --tags critical,not:slow
ser run . --adapter claude --model <model-name>
ser run . --adapter codex --model <model-name>
ser run . --adapter openai-compat --base-url http://localhost:11434/v1
ser run . --sandbox docker --save-artifacts
ser run . --max-cost 5
ser run . --report console,json,html,junit,markdown --report-dir .skilleval-reports
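
How `--filter` and `--tags` select cases is not spelled out here, so the following Python sketch rests on two stated assumptions: the filter is a glob matched against `suite::test`, and a tag expression selects cases that carry at least one of the plain tags and none of the `not:` tags. Both function names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_filter(suite: str, test: str, pattern: str) -> bool:
    """Glob-match 'suite::test' against a pattern like 'migration-skill::creates-*'."""
    return fnmatchcase(f"{suite}::{test}", pattern)


def matches_tags(case_tags: set[str], expression: str) -> bool:
    """Evaluate a comma-separated expression such as 'critical,not:slow'.

    Assumed semantics: the case must carry none of the 'not:' tags and,
    if any plain tags are listed, at least one of them.
    """
    terms = [t.strip() for t in expression.split(",") if t.strip()]
    excluded = {t[len("not:"):] for t in terms if t.startswith("not:")}
    required = {t for t in terms if not t.startswith("not:")}
    if case_tags & excluded:
        return False
    return not required or bool(case_tags & required)


print(matches_filter("migration-skill", "creates-user-migration", "migration-skill::creates-*"))  # True
print(matches_tags({"critical", "slow"}, "critical,not:slow"))  # False
```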

Exit codes:

| Code | Meaning |
| ---- | ------------------------------------------------------------------------------------- |
| 0 | All error-severity assertions passed. |
| 1 | At least one test failed. |
| 2 | Parsing or configuration error. |
| 3 | Runtime error, adapter error, timeout, interruption, or max-cost guardrail triggered. |
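
A wrapper script can branch on these codes instead of treating every non-zero exit the same way. A minimal Python sketch: the meanings are copied from the table, while the action names returned by `classify_exit` are hypothetical.

```python
# Exit-code meanings copied from the table above.
EXIT_MEANINGS = {
    0: "all error-severity assertions passed",
    1: "at least one test failed",
    2: "parsing or configuration error",
    3: "runtime error, adapter error, timeout, interruption, or max-cost guardrail",
}


def classify_exit(code: int) -> str:
    """Map a ser exit code to a coarse CI action (action names are illustrative)."""
    if code == 0:
        return "pass"
    if code == 1:
        return "fail"         # a skill regression: block the merge
    if code == 2:
        return "fix-config"   # the suite or config is broken, not the skill
    if code == 3:
        return "investigate"  # infrastructure, provider, or budget problem
    return "unknown"


for code, meaning in EXIT_MEANINGS.items():
    print(code, classify_exit(code), "-", meaning)
```

In practice the code would come from something like `subprocess.run(["ser", "run", "."]).returncode`.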

Assertions

Assert on behavior you can observe. If a skill should create a file, check the file. If it should return structured data, check the JSON. Response text assertions are useful, but they are usually weaker than workspace assertions.

| Group | Assertions |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| Files | file_exists, file_not_exists, file_contains, file_matches_regex, file_count, file_diff, dir_structure |
| Process | exit_code, stdout_contains, stdout_matches_regex, stderr_contains, stderr_empty, command_ran, duration_under, custom_command, no_exec_errors |
| JSON | json_schema, json_path_equals |
| Agent response | response_contains, response_not_contains, token_usage_under, semantic |
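
To make "observable behavior" concrete, here is roughly what a few file assertions boil down to. This is a Python sketch, not ser's code: the helpers borrow the assertion names for readability, and their signatures are assumptions.

```python
import re
import tempfile
from pathlib import Path


def file_exists(workspace: Path, rel_path: str) -> bool:
    """True when the path exists inside the workspace and is a regular file."""
    return (workspace / rel_path).is_file()


def file_contains(workspace: Path, rel_path: str, needle: str) -> bool:
    """True when the file exists and contains the literal substring."""
    target = workspace / rel_path
    return target.is_file() and needle in target.read_text(encoding="utf-8")


def file_matches_regex(workspace: Path, rel_path: str, pattern: str) -> bool:
    """True when the file exists and the pattern matches somewhere in it."""
    target = workspace / rel_path
    return target.is_file() and re.search(pattern, target.read_text(encoding="utf-8")) is not None


# Simulate a workspace where the agent claimed success but wrote nothing.
with tempfile.TemporaryDirectory() as tmp:
    empty_workspace = Path(tmp)
    print(file_exists(empty_workspace, "db/migrations/001_create_users.sql"))  # False
```

A response-text check would pass in that scenario if the agent merely said it created the file; the workspace check does not.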

Any assertion can be a warning:

- type: response_contains
  contains: 'used project conventions'
  severity: warning

Warnings appear in reports but do not fail the test case.
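
The pass/fail rule can be sketched in a few lines of Python. The `AssertionResult` shape and the default severity of `error` are assumptions made for illustration; only the behavior (warnings are reported but never fail a case) comes from this README.

```python
from dataclasses import dataclass


@dataclass
class AssertionResult:
    type: str
    passed: bool
    severity: str = "error"  # assumed default; "warning" is the documented alternative


def case_passed(results: list[AssertionResult]) -> bool:
    """A case fails only when an error-severity assertion fails."""
    return all(r.passed for r in results if r.severity == "error")


def failed_warnings(results: list[AssertionResult]) -> list[str]:
    """Failed warning-severity assertions, which a report would still show."""
    return [r.type for r in results if r.severity == "warning" and not r.passed]


results = [
    AssertionResult("file_exists", passed=True),
    AssertionResult("response_contains", passed=False, severity="warning"),
]
print(case_passed(results))      # True
print(failed_warnings(results))  # ['response_contains']
```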

semantic uses an LLM judge. It is useful for nuanced quality checks, but CI should lean on deterministic assertions: files, commands, JSON, and other concrete outputs.

Adapters

| Adapter | Use it when |
| --------------- | ----------------------------------------------------------------------- |
| dry-run | You want to check suite parsing, discovery, config, and reports. |
| claude | You want to run the skill through Anthropic Messages API with tools. |
| codex | You want to run the skill through the OpenAI API with the same sandbox. |
| openai-compat | You want to use an OpenAI-compatible endpoint, local model, or gateway. |

API key priority, highest first:

--api-key
SER_ANTHROPIC_API_KEY or SER_OPENAI_API_KEY
ANTHROPIC_API_KEY or OPENAI_API_KEY
api_key in .skilleval.yml

For openai-compat, you usually also need --base-url or base_url in config.
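
Assuming the key sources above are listed from highest to lowest priority, the lookup could be sketched like this in Python. The function name, the `provider` argument, and the grouping of variables by provider are hypothetical; only the variable names and the `api_key` config field come from this README.

```python
from typing import Mapping, Optional

# Environment variables per provider, highest priority first.
ENV_VARS = {
    "anthropic": ("SER_ANTHROPIC_API_KEY", "ANTHROPIC_API_KEY"),
    "openai": ("SER_OPENAI_API_KEY", "OPENAI_API_KEY"),
}


def resolve_api_key(
    provider: str,
    cli_key: Optional[str],
    env: Mapping[str, str],
    config: Mapping[str, str],
) -> Optional[str]:
    """Return the first key found: --api-key, SER_* env, plain env, then config."""
    if cli_key:
        return cli_key
    for name in ENV_VARS[provider]:
        if env.get(name):
            return env[name]
    return config.get("api_key") or None


print(resolve_api_key("anthropic", None, {"ANTHROPIC_API_KEY": "from-env"}, {"api_key": "from-config"}))
# prints: from-env
```

Passing `os.environ` as `env` gives the real lookup; taking it as a parameter keeps the function easy to test.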

Sandbox

tmpdir is fast and convenient locally. It creates a fresh temporary workspace, but it does not provide strict network or resource isolation.

docker is slower, but better suited for CI and untrusted skills. It can restrict network, memory, CPU, and process count.

Example:

sandbox: docker
docker_image: node:20-slim
sandbox_config:
  network: none
  memory_limit: '1g'
  cpu_limit: 1
  process_limit: 128
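
For intuition, these limits correspond to standard `docker run` flags. The Python sketch below shows one plausible mapping; it is an assumption about how the settings translate, since this README does not say how ser invokes Docker, and `docker_run_args` is a hypothetical helper.

```python
def docker_run_args(image: str, sandbox_config: dict) -> list[str]:
    """Translate sandbox_config keys into equivalent `docker run` flags."""
    args = ["docker", "run", "--rm"]
    if "network" in sandbox_config:
        args += ["--network", str(sandbox_config["network"])]
    if "memory_limit" in sandbox_config:
        args += ["--memory", str(sandbox_config["memory_limit"])]
    if "cpu_limit" in sandbox_config:
        args += ["--cpus", str(sandbox_config["cpu_limit"])]
    if "process_limit" in sandbox_config:
        args += ["--pids-limit", str(sandbox_config["process_limit"])]
    return args + [image]


config = {"network": "none", "memory_limit": "1g", "cpu_limit": 1, "process_limit": 128}
print(" ".join(docker_run_args("node:20-slim", config)))
# prints: docker run --rm --network none --memory 1g --cpus 1 --pids-limit 128 node:20-slim
```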

Reports

ser run . --report console,json,html,junit,markdown

Formats:

| Format | Purpose |
| ---------------- | ------------------------------------- |
| console | Short local summary in the terminal. |
| json | Full machine-readable run result. |
| html | Readable report for review and debug. |
| junit | CI test report integration. |
| markdown | PR comments or job summaries. |
| github-actions | Workflow annotations in stdout. |

You can rebuild reports from a saved JSON run:

ser report .skilleval-reports/<timestamp>.json --format html
ser report .skilleval-reports/<timestamp>.json --format junit
ser report .skilleval-reports/<timestamp>.json --format markdown

Use --save-artifacts when debugging failures. For failed and errored cases, ser saves the sandbox contents, adapter responses, and assertion details to the artifacts directory.

Configuration

ser looks for .skilleval.yml starting at the given path and walking up through parent directories. A typical config:

adapter: claude
sandbox: tmpdir
timeout_seconds: 120
concurrency: 1

report_formats:
  - console
  - json
  - junit
report_dir: .skilleval-reports

save_artifacts: false
artifacts_dir: .skilleval-artifacts
redact_secrets: true

retry:
  max_attempts: 2
  backoff: exponential
  delay_ms: 30000
  retry_on: [rate_limit, 5xx, network]

discovery_patterns:
  - '**/*.skill-test.yml'
  - '**/*.skill-test.yaml'
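
To see what the retry block in the config above implies for wait times, here is a Python sketch of an exponential schedule. It assumes that `max_attempts` counts the first call and that each retry doubles the previous delay; ser's exact formula is not documented here, and `backoff_delays_ms` is a hypothetical helper.

```python
def backoff_delays_ms(max_attempts: int, delay_ms: int, backoff: str = "exponential") -> list[int]:
    """Delays to wait before each retry (before attempt 2, 3, ...)."""
    retries = max(max_attempts - 1, 0)
    if backoff == "exponential":
        return [delay_ms * 2 ** i for i in range(retries)]
    return [delay_ms] * retries  # treat any other value as a fixed delay


print(backoff_delays_ms(2, 30000))  # [30000]
print(backoff_delays_ms(4, 30000))  # [30000, 60000, 120000]
```

Under these assumptions the sample config allows one retry, 30 seconds after the first failure.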

CLI flags take precedence over config. That keeps baseline settings in the repository while a CI job chooses the model, key, concurrency, and report formats.
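
Both behaviors, the upward search for .skilleval.yml and the CLI-over-config precedence, fit in a short Python sketch. The helper names are hypothetical, and treating an unset flag as `None` is an assumption made for the example.

```python
import tempfile
from pathlib import Path
from typing import Optional

CONFIG_NAME = ".skilleval.yml"


def find_config(start: Path) -> Optional[Path]:
    """Walk from start up to the filesystem root; return the first config found."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None


def effective_settings(config: dict, cli_flags: dict) -> dict:
    """CLI flags override config values; flags left unset (None) are ignored."""
    merged = dict(config)
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged


print(effective_settings({"adapter": "claude", "concurrency": 1}, {"adapter": "codex", "concurrency": None}))
# prints: {'adapter': 'codex', 'concurrency': 1}
```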

CI

Example GitHub Actions workflow:

name: skill-evals

on:
  pull_request:
  push:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g skill-eval-runner
      - run: ser run . --adapter claude --report console,junit --report-dir .skilleval-reports
        env:
          SER_ANTHROPIC_API_KEY: ${{ secrets.SER_ANTHROPIC_API_KEY }}

For the first CI run, start with ser run . --dry-run --report console,junit. That checks discovery, config, and reports before you spend tokens.

Development

npm install
npm run typecheck
npm test
npm run build

Useful local commands:

npm run dev -- list fixtures/sample-skill
npm run dev -- validate fixtures/sample-skill
npm run dev -- run --dry-run fixtures/sample-skill
npm run dev -- run --dry-run --report console,json,html fixtures/sample-skill

Troubleshooting

ser doctor

Common issues:

| Symptom | Check |
| ---------------------- | ------------------------------------------------------------------------------------------ |
| No suites found | File names must match *.skill-test.yml or *.skill-test.yaml. |
| SKILL.md not found | The skill path is resolved relative to the suite file. |
| Docker fails | Start the Docker daemon or use --sandbox tmpdir. |
| Missing API key | Set SER_ANTHROPIC_API_KEY, SER_OPENAI_API_KEY, or --api-key. |
| Flaky test | Move the check from response text to a file, command, JSON, or another deterministic fact. |
| Secrets appear in logs | Keep redact_secrets: true; it is enabled by default unless SER_REDACT_SECRETS=0. |

License

MIT. See LICENSE.