snapeval

v4.0.1

Published

14 days ago

Harness-agnostic eval runner for agentskills.io skills

Downloads

1,534

Vulnerabilities: 0 High, 0 Medium, 0 Low

mtsach

ai-skills agentskills evaluation harness

snapeval runs every eval case with and without your skill, grades assertions, and computes a benchmark delta — so you can see exactly what value your skill adds.

snapeval — greeter
Baseline = without SKILL.md (raw AI response)
────────────────────────────────────────────────────────────
  #1 formal greeting for Eleanor
    Skill: 100% | Baseline: 33% | 5.2s
  #2 casual greeting for Marcus
    Skill: 100% ↑ was 67% | Baseline: 67% | 2.7s
  #3 pirate greeting for Zoe
    Skill: 100% | Baseline: 67% | 2.5s
────────────────────────────────────────────────────────────
Summary:
  Skill pass rate:    100.0%
  Baseline pass rate: 55.6%
  Improvement:        +44.4%

How it works

1. You write a SKILL.md and an evals.json with test cases and assertions.
2. snapeval runs each eval twice: once with your skill loaded, once without (baseline).
3. Assertions are graded by an LLM judge (semantic) and/or shell scripts (deterministic).
4. A benchmark shows where your skill adds value and where the raw AI already handles the task.
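
The benchmark delta in the sample output can be reproduced with a few lines. This is a minimal sketch, not snapeval's actual implementation: the `EvalResult` shape and `computeBenchmark` are hypothetical names, and it assumes three assertions per eval (consistent with the 33% and 67% figures shown) and that pass rates are pooled over all assertions.

```typescript
// Hypothetical shape: assertion counts for one eval, with and without the skill.
interface EvalResult {
  id: number;
  skillPassed: number;
  baselinePassed: number;
  total: number;
}

interface Benchmark {
  skillPassRate: number;    // percent, 0-100
  baselinePassRate: number; // percent, 0-100
  delta: number;            // percentage points
}

// Assumption: assertions are pooled across evals before computing each rate.
function computeBenchmark(results: EvalResult[]): Benchmark {
  const total = results.reduce((sum, r) => sum + r.total, 0);
  if (total === 0) {
    return { skillPassRate: 0, baselinePassRate: 0, delta: 0 };
  }
  const skill = results.reduce((sum, r) => sum + r.skillPassed, 0);
  const baseline = results.reduce((sum, r) => sum + r.baselinePassed, 0);
  const skillPassRate = (skill / total) * 100;
  const baselinePassRate = (baseline / total) * 100;
  return { skillPassRate, baselinePassRate, delta: skillPassRate - baselinePassRate };
}

// Counts inferred from the greeter sample output (assumed 3 assertions each).
const greeter: EvalResult[] = [
  { id: 1, skillPassed: 3, baselinePassed: 1, total: 3 },
  { id: 2, skillPassed: 3, baselinePassed: 2, total: 3 },
  { id: 3, skillPassed: 3, baselinePassed: 2, total: 3 },
];

const b = computeBenchmark(greeter);
console.log(b.skillPassRate.toFixed(1), b.baselinePassRate.toFixed(1), b.delta.toFixed(1));
// prints: 100.0 55.6 44.4
```

Note that the "Improvement" figure is a difference in percentage points, not a relative change.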

Quick start

As a Copilot plugin

copilot plugin install matantsach/snapeval

Then, in Copilot CLI, say "evaluate my skill". The snapeval skill handles the rest.

Standalone CLI

git clone https://github.com/matantsach/snapeval.git
cd snapeval && npm install
npx tsx bin/snapeval.ts eval <skill-dir>

Eval format

my-skill/
├── SKILL.md
└── evals/
    ├── evals.json
    └── scripts/         ← optional deterministic checks
        └── validate.sh

evals.json:

{
  "skill_name": "greeter",
  "evals": [
    {
      "id": 1,
      "label": "formal greeting for Eleanor",
      "prompt": "Can you give me a formal greeting for Eleanor?",
      "expected_output": "Returns the formal greeting addressed to Eleanor.",
      "assertions": [
        "Output contains the name Eleanor",
        "Output uses a formal tone",
        "script:validate.sh"
      ]
    }
  ]
}

| Field | Required | Description |
|-------|----------|-------------|
| `id` | yes | Unique numeric identifier |
| `prompt` | yes | The user prompt sent to the harness |
| `expected_output` | yes | Human description of the expected behavior |
| `label` | no | Human-readable name shown in terminal output |
| `slug` | no | Filesystem-safe name for the eval directory |
| `assertions` | no | List of assertions to grade (LLM semantic or `script:` prefixed) |
| `files` | no | Input files to attach to the prompt |
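
For editor support or a pre-flight check, the fields above could be modeled roughly as follows. This is a sketch under assumptions: `EvalCase`, `EvalsFile`, and `validateEvals` are illustrative names rather than part of snapeval, and the element types of `assertions` and `files` (arrays of strings) are inferred from the example, not confirmed.

```typescript
// Field names and required/optional status follow the field table;
// the array element types are assumptions.
interface EvalCase {
  id: number;
  prompt: string;
  expected_output: string;
  label?: string;
  slug?: string;
  assertions?: string[];
  files?: string[];
}

interface EvalsFile {
  skill_name: string;
  evals: EvalCase[];
}

// Returns human-readable problems; an empty list means the required fields look valid.
function validateEvals(data: unknown): string[] {
  if (typeof data !== "object" || data === null) return ["root must be an object"];
  const evals = (data as { evals?: unknown }).evals;
  if (!Array.isArray(evals)) return ["'evals' must be an array"];

  const errors: string[] = [];
  const seen = new Set<number>();
  evals.forEach((raw: unknown, i: number) => {
    const e = (raw ?? {}) as Record<string, unknown>;
    const id = e["id"];
    if (typeof id !== "number") {
      errors.push(`evals[${i}]: 'id' must be a number`);
    } else if (seen.has(id)) {
      errors.push(`evals[${i}]: duplicate id ${id}`);
    } else {
      seen.add(id);
    }
    for (const field of ["prompt", "expected_output"]) {
      if (typeof e[field] !== "string") {
        errors.push(`evals[${i}]: '${field}' is required`);
      }
    }
  });
  return errors;
}

// The greeter example from this section, typed against the sketch.
const sample: EvalsFile = {
  skill_name: "greeter",
  evals: [
    {
      id: 1,
      label: "formal greeting for Eleanor",
      prompt: "Can you give me a formal greeting for Eleanor?",
      expected_output: "Returns the formal greeting addressed to Eleanor.",
      assertions: ["Output contains the name Eleanor", "script:validate.sh"],
    },
  ],
};

console.log(validateEvals(sample)); // prints: []
```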

Assertions

Semantic — graded by an LLM. Write specific, verifiable statements:

"Output contains a YAML block with an 'id' field for each issue"
"Response declines because the pipeline already has unclaimed issues"

Script — prefix with script:. Scripts live in evals/scripts/, receive the output directory as $1, and pass on exit code 0:

"script:validate-json-structure.sh"

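The two assertion kinds can be told apart by the `script:` prefix. The sketch below illustrates that split and the exit-code rule for scripts; it is not snapeval's code. `classifyAssertion`, `gradeScript`, and the injected `ScriptRunner` are hypothetical, and the runner is passed in so the logic stays independent of how the script is actually spawned.

```typescript
type Assertion =
  | { kind: "semantic"; statement: string }
  | { kind: "script"; scriptPath: string };

// Hypothetical runner: executes a script with arguments and returns its exit code.
type ScriptRunner = (scriptPath: string, args: string[]) => number;

const SCRIPT_PREFIX = "script:";

// Anything starting with "script:" resolves to a file under evals/scripts/;
// everything else is treated as a semantic statement for the LLM judge.
function classifyAssertion(raw: string, evalsDir: string): Assertion {
  if (raw.startsWith(SCRIPT_PREFIX)) {
    const name = raw.slice(SCRIPT_PREFIX.length).trim();
    return { kind: "script", scriptPath: `${evalsDir}/scripts/${name}` };
  }
  return { kind: "semantic", statement: raw };
}

// The script receives the output directory as its first argument ($1)
// and passes when it exits with code 0.
function gradeScript(scriptPath: string, outputDir: string, run: ScriptRunner): boolean {
  return run(scriptPath, [outputDir]) === 0;
}

const parsed = classifyAssertion("script:validate.sh", "my-skill/evals");
console.log(parsed);
```

In Node, a runner could wrap `spawnSync` from `node:child_process` and return the `status` field of its result.
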
CLI reference

`eval`

Run evals, grade assertions, compute benchmark.

npx snapeval eval [skill-dir] [options]

| Flag | Description | Default |
|------|-------------|---------|
| `--harness <name>` | Harness adapter | `copilot-sdk` |
| `--inference <name>` | Inference adapter for grading | `auto` |
| `--workspace <path>` | Output directory | `../{skill_name}-workspace` |
| `--runs <n>` | Harness invocations per eval for statistical averaging | `1` |
| `--concurrency <n>` | Parallel eval cases (1-10) | `1` |
| `--only <ids>` | Run specific eval IDs (e.g. `--only 1,3,5`) | all |
| `--threshold <rate>` | Minimum pass rate 0-1 for exit code 0 | none |
| `--old-skill <path>` | Compare against old skill version | none |
| `--feedback` | Write feedback.json template for human review | off |
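
The value formats in the flag table can be made concrete with a small sketch. The helpers `parseOnly` and `checkRanges` are hypothetical and only mirror the documented ranges (comma-separated IDs, concurrency 1-10, threshold 0-1); how snapeval itself parses and rejects flags is not shown in this README.

```typescript
// Parses an --only value such as "1,3,5" into numeric eval IDs.
function parseOnly(value: string): number[] {
  const ids = value
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map(Number);
  if (ids.some((n) => !Number.isInteger(n))) {
    throw new Error(`--only expects comma-separated integer IDs, got "${value}"`);
  }
  return ids;
}

// Checks the documented ranges and returns a list of problems (empty when valid).
function checkRanges(opts: { concurrency: number; threshold?: number }): string[] {
  const problems: string[] = [];
  if (!Number.isInteger(opts.concurrency) || opts.concurrency < 1 || opts.concurrency > 10) {
    problems.push("--concurrency must be an integer from 1 to 10");
  }
  if (opts.threshold !== undefined && (opts.threshold < 0 || opts.threshold > 1)) {
    problems.push("--threshold must be between 0 and 1");
  }
  return problems;
}

console.log(parseOnly("1,3,5"));                               // prints: [ 1, 3, 5 ]
console.log(checkRanges({ concurrency: 4, threshold: 0.8 }));  // prints: []
```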

Exit codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | Threshold not met (eval ran but pass rate below `--threshold`) |
| 2 | Config/input error (bad JSON, missing fields, invalid flags) |
| 3 | File not found (missing skill dir, evals.json, or script) |
| 4 | Runtime error (harness failure, grading failure, timeout) |
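
A wrapper script that calls snapeval might find it handy to name these codes. The following is an illustrative sketch: the `ExitCode` constant and `exitCodeForRun` are hypothetical, and the comparison assumes a pass rate exactly equal to the threshold counts as meeting it, since the table only says code 1 means the rate is below `--threshold`.

```typescript
// Names are illustrative; the numeric values come from the exit code table.
const ExitCode = {
  Success: 0,
  ThresholdNotMet: 1,
  ConfigError: 2,
  FileNotFound: 3,
  RuntimeError: 4,
} as const;

// passRate and threshold are fractions in the range 0-1.
// With no threshold set, a completed run is treated as a success.
function exitCodeForRun(passRate: number, threshold?: number): number {
  if (threshold === undefined) return ExitCode.Success;
  return passRate >= threshold ? ExitCode.Success : ExitCode.ThresholdNotMet;
}

console.log(exitCodeForRun(0.75, 0.8)); // prints: 1
console.log(exitCodeForRun(0.9, 0.8));  // prints: 0
```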

Output artifacts

Each run creates an iteration directory:

workspace/
└── iteration-1/
    ├── benchmark.json       ← aggregate stats with delta
    ├── SKILL.md.snapshot    ← copy of skill used
    └── eval-{slug}/
        ├── with_skill/
        │   ├── outputs/output.txt
        │   ├── timing.json
        │   ├── grading.json
        │   └── transcript.log
        └── without_skill/
            ├── outputs/output.txt
            ├── timing.json
            └── grading.json

benchmark.json includes metadata: eval_count, eval_ids, skill_name, runs_per_eval, timestamp.
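
When post-processing a run, the layout above maps to predictable paths. The helper below is a hypothetical convenience, not something snapeval exports; it simply mirrors the directory tree, including the detail that `transcript.log` is listed only under `with_skill`.

```typescript
type Variant = "with_skill" | "without_skill";

interface EvalArtifactPaths {
  output: string;
  timing: string;
  grading: string;
  transcript?: string;
}

// Builds artifact paths for one eval in one iteration, following the tree above.
function evalArtifactPaths(
  workspace: string,
  iteration: number,
  slug: string,
  variant: Variant,
): EvalArtifactPaths {
  const base = `${workspace}/iteration-${iteration}/eval-${slug}/${variant}`;
  const paths: EvalArtifactPaths = {
    output: `${base}/outputs/output.txt`,
    timing: `${base}/timing.json`,
    grading: `${base}/grading.json`,
  };
  // The tree lists transcript.log only for the with_skill run.
  if (variant === "with_skill") {
    paths.transcript = `${base}/transcript.log`;
  }
  return paths;
}

// Aggregate stats live one level up, next to the skill snapshot.
function benchmarkPath(workspace: string, iteration: number): string {
  return `${workspace}/iteration-${iteration}/benchmark.json`;
}

console.log(evalArtifactPaths("workspace", 1, "formal-greeting", "with_skill").grading);
// prints: workspace/iteration-1/eval-formal-greeting/with_skill/grading.json
```

The slug `formal-greeting` here is made up for the example.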

CI integration

name: Skill Evaluation
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npx snapeval eval skills/my-skill --threshold 0.8 --runs 3

snapeval exits with code 1 when the pass rate falls below the threshold, which fails the job and blocks the PR.

Configuration

Create snapeval.config.json in your skill or project root:

{
  "harness": "copilot-sdk",
  "inference": "auto",
  "workspace": "../{skill_name}-workspace",
  "runs": 1,
  "concurrency": 1
}

Resolution order: defaults → project config → skill-dir config → CLI flags.
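
One way to read the resolution order is as a layered merge in which each later source overrides the earlier ones. The sketch below assumes exactly that, and also assumes `{skill_name}` in the workspace path is replaced with the skill's name; `resolveConfig` and `expandWorkspace` are hypothetical names, not snapeval functions.

```typescript
interface SnapevalConfig {
  harness: string;
  inference: string;
  workspace: string;
  runs: number;
  concurrency: number;
}

// Defaults as listed in the example config.
const DEFAULTS: SnapevalConfig = {
  harness: "copilot-sdk",
  inference: "auto",
  workspace: "../{skill_name}-workspace",
  runs: 1,
  concurrency: 1,
};

// Layers are given in resolution order: project config, skill-dir config, CLI flags.
// Assumption: a key set in a later layer replaces the same key from earlier layers.
function resolveConfig(...layers: Partial<SnapevalConfig>[]): SnapevalConfig {
  return layers.reduce<SnapevalConfig>((acc, layer) => ({ ...acc, ...layer }), DEFAULTS);
}

// Assumption: the {skill_name} placeholder is substituted with the skill's name.
function expandWorkspace(workspace: string, skillName: string): string {
  return workspace.replace("{skill_name}", skillName);
}

const resolved = resolveConfig(
  { runs: 3 },        // project config
  { concurrency: 4 }, // skill-dir config
  { runs: 5 },        // CLI flags
);
console.log(resolved.runs, resolved.concurrency, expandWorkspace(resolved.workspace, "greeter"));
// prints: 5 4 ../greeter-workspace
```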

Harness adapters

| Adapter | Description | Default |
|---------|-------------|---------|
| `copilot-sdk` | Programmatic via `@github/copilot-sdk` with native skill loading | yes |
| `copilot-cli` | Shells out to the `copilot` CLI binary | no |

The SDK harness loads skills natively via skillDirectories, captures full transcripts, and extracts real token counts from assistant.usage events.

Inference adapters

| Adapter | Description |
|---------|-------------|
| `auto` | Uses `@github/copilot-sdk` by default, falls back to GitHub Models API |
| `copilot-sdk` | `@github/copilot-sdk` programmatic |
| `github-models` | GitHub Models API (requires `GITHUB_TOKEN`) |

Contributing

See CONTRIBUTING.md.

License

MIT
