snapeval
v4.0.1
Harness-agnostic eval runner for agentskills.io skills
snapeval runs every eval case with and without your skill, grades assertions, and computes a benchmark delta — so you can see exactly what value your skill adds.
snapeval — greeter
Baseline = without SKILL.md (raw AI response)
────────────────────────────────────────────────────────────
#1 formal greeting for Eleanor
Skill: 100% | Baseline: 33% | 5.2s
#2 casual greeting for Marcus
Skill: 100% ↑ was 67% | Baseline: 67% | 2.7s
#3 pirate greeting for Zoe
Skill: 100% | Baseline: 67% | 2.5s
────────────────────────────────────────────────────────────
Summary:
Skill pass rate: 100.0%
Baseline pass rate: 55.6%
Improvement: +44.4%
How it works
- You write a SKILL.md and an evals.json with test cases and assertions
- snapeval runs each eval twice: once with your skill loaded, once without (baseline)
- Assertions are graded by an LLM judge (semantic) and/or shell scripts (deterministic)
- A benchmark shows where your skill adds value vs. where the raw AI already handles it
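To make the benchmark arithmetic concrete, here is a minimal sketch of how a pass-rate delta could be computed from graded assertions. The `EvalResult` shape and the function names are hypothetical, and pooling all assertions into one rate is an assumption; snapeval's own aggregation may differ. The numbers assume three assertions per eval, which is consistent with the 33% and 67% baselines in the sample output above.

```typescript
// Hypothetical shape: one entry per eval case, counting passed assertions
// for the run with the skill and for the baseline run without it.
interface EvalResult {
  label: string;
  total: number; // assertions graded for this eval
  passedWithSkill: number;
  passedBaseline: number;
}

// Pass rate pooled across all assertions, as a fraction in [0, 1].
function passRate(results: EvalResult[], pick: (r: EvalResult) => number): number {
  const total = results.reduce((sum, r) => sum + r.total, 0);
  if (total === 0) return 0;
  const passed = results.reduce((sum, r) => sum + pick(r), 0);
  return passed / total;
}

// The delta is simply the skill pass rate minus the baseline pass rate.
function benchmarkDelta(results: EvalResult[]) {
  const skill = passRate(results, (r) => r.passedWithSkill);
  const baseline = passRate(results, (r) => r.passedBaseline);
  return { skill, baseline, delta: skill - baseline };
}

// Assumed counts: 3 assertions per eval, matching the sample output.
const greeter: EvalResult[] = [
  { label: "formal greeting for Eleanor", total: 3, passedWithSkill: 3, passedBaseline: 1 },
  { label: "casual greeting for Marcus", total: 3, passedWithSkill: 3, passedBaseline: 2 },
  { label: "pirate greeting for Zoe", total: 3, passedWithSkill: 3, passedBaseline: 2 },
];

const summary = benchmarkDelta(greeter);
console.log(`Skill pass rate: ${(summary.skill * 100).toFixed(1)}%`);
console.log(`Baseline pass rate: ${(summary.baseline * 100).toFixed(1)}%`);
console.log(`Improvement: +${(summary.delta * 100).toFixed(1)}%`);
```

With these counts the baseline passes 5 of 9 assertions, which gives the 55.6% and +44.4% figures shown in the summary.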
Quick start
As a Copilot plugin
copilot plugin install matantsach/snapeval
Then in Copilot CLI, just say "evaluate my skill" and the snapeval skill handles the rest.
Standalone CLI
git clone https://github.com/matantsach/snapeval.git
cd snapeval && npm install
npx tsx bin/snapeval.ts eval <skill-dir>
Eval format
my-skill/
├── SKILL.md
└── evals/
    ├── evals.json
    └── scripts/          ← optional deterministic checks
        └── validate.sh
evals.json:
{
  "skill_name": "greeter",
  "evals": [
    {
      "id": 1,
      "label": "formal greeting for Eleanor",
      "prompt": "Can you give me a formal greeting for Eleanor?",
      "expected_output": "Returns the formal greeting addressed to Eleanor.",
      "assertions": [
        "Output contains the name Eleanor",
        "Output uses a formal tone",
        "script:validate.sh"
      ]
    }
  ]
}

| Field | Required | Description |
|-------|----------|-------------|
| id | yes | Unique numeric identifier |
| prompt | yes | The user prompt sent to the harness |
| expected_output | yes | Human description of the expected behavior |
| label | no | Human-readable name shown in terminal output |
| slug | no | Filesystem-safe name for the eval directory |
| assertions | no | List of assertions to grade (LLM semantic or script: prefixed) |
| files | no | Input files to attach to the prompt |
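The field table can be mirrored as a type plus a small pre-flight check. This is an illustrative sketch, not snapeval's own validation: `EvalCase` and `validateEvals` are hypothetical names, treating `files` as a list of paths is an assumption, and the messages are made up.

```typescript
// Mirrors the field table; optional fields are marked with `?`.
interface EvalCase {
  id: number;
  prompt: string;
  expected_output: string;
  label?: string;
  slug?: string;
  assertions?: string[];
  files?: string[]; // assumed to be a list of paths
}

// Returns human-readable problems; an empty list means the file looks valid.
function validateEvals(data: unknown): string[] {
  if (typeof data !== "object" || data === null) return ["root must be an object"];
  const evals = (data as Record<string, unknown>).evals;
  if (!Array.isArray(evals)) return ["'evals' must be an array"];

  const problems: string[] = [];
  const seenIds = new Set<number>();
  evals.forEach((raw: unknown, index: number) => {
    const where = `evals[${index}]`;
    if (typeof raw !== "object" || raw === null) {
      problems.push(`${where}: must be an object`);
      return;
    }
    const e = raw as Record<string, unknown>;
    const id = e.id;
    if (typeof id !== "number") {
      problems.push(`${where}: 'id' must be a number`);
    } else if (seenIds.has(id)) {
      problems.push(`${where}: duplicate id ${id}`);
    } else {
      seenIds.add(id);
    }
    // Required string fields from the table.
    for (const field of ["prompt", "expected_output"]) {
      const value = e[field];
      if (typeof value !== "string" || value === "") {
        problems.push(`${where}: '${field}' is required`);
      }
    }
    if (e.assertions !== undefined && !Array.isArray(e.assertions)) {
      problems.push(`${where}: 'assertions' must be a list`);
    }
  });
  return problems;
}

const good: { skill_name: string; evals: EvalCase[] } = {
  skill_name: "greeter",
  evals: [
    {
      id: 1,
      prompt: "Can you give me a formal greeting for Eleanor?",
      expected_output: "Returns the formal greeting addressed to Eleanor.",
      assertions: ["Output contains the name Eleanor", "script:validate.sh"],
    },
  ],
};
// Second case reuses id 1 and omits the prompt.
const bad = {
  evals: [
    { id: 1, prompt: "hi", expected_output: "x" },
    { id: 1, expected_output: "y" },
  ],
};

console.log(validateEvals(good));
console.log(validateEvals(bad));
```

Running a check like this before invoking the harness surfaces malformed cases without spending any model calls.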
Assertions
Semantic: graded by an LLM. Write specific, verifiable statements:
"Output contains a YAML block with an 'id' field for each issue"
"Response declines because the pipeline already has unclaimed issues"
Script: deterministic checks, marked with the script: prefix. Scripts live in evals/scripts/, receive the output directory as $1, and pass on exit code 0:
"script:validate-json-structure.sh"
CLI reference
eval
Run evals, grade assertions, compute benchmark.
npx snapeval eval [skill-dir] [options]

| Flag | Description | Default |
|------|-------------|---------|
| --harness <name> | Harness adapter | copilot-sdk |
| --inference <name> | Inference adapter for grading | auto |
| --workspace <path> | Output directory | ../{skill_name}-workspace |
| --runs <n> | Harness invocations per eval for statistical averaging | 1 |
| --concurrency <n> | Parallel eval cases (1-10) | 1 |
| --only <ids> | Run specific eval IDs (e.g. --only 1,3,5) | all |
| --threshold <rate> | Minimum pass rate 0-1 for exit code 0 | none |
| --old-skill <path> | Compare against old skill version | none |
| --feedback | Write feedback.json template for human review | off |
Exit codes
| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | Threshold not met (eval ran but pass rate below --threshold) |
| 2 | Config/input error (bad JSON, missing fields, invalid flags) |
| 3 | File not found (missing skill dir, evals.json, or script) |
| 4 | Runtime error (harness failure, grading failure, timeout) |
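The interaction between --threshold and the exit codes can be sketched as follows. This is a hedged illustration rather than snapeval's source: `ExitCode` and `exitCodeFor` are hypothetical names, and mapping an out-of-range threshold to code 2 is an assumption based on the "invalid flags" wording in the table.

```typescript
// Exit codes as listed in the table above.
const ExitCode = {
  Success: 0,
  ThresholdNotMet: 1,
  ConfigError: 2,
  FileNotFound: 3,
  RuntimeError: 4,
} as const;

// Decide the exit code for a run that completed with the given pass rate.
// `threshold` is undefined when --threshold was not passed.
function exitCodeFor(passRate: number, threshold?: number): number {
  if (threshold === undefined) return ExitCode.Success;
  // Assumption: a threshold outside 0-1 counts as an invalid flag.
  if (Number.isNaN(threshold) || threshold < 0 || threshold > 1) {
    return ExitCode.ConfigError;
  }
  // The threshold is a minimum, so meeting it exactly still passes.
  return passRate >= threshold ? ExitCode.Success : ExitCode.ThresholdNotMet;
}

console.log(exitCodeFor(0.75, 0.8)); // below the threshold
console.log(exitCodeFor(0.8, 0.8)); // meets the threshold exactly
console.log(exitCodeFor(0.2)); // no threshold given
```

In CI, only code 1 means "the skill regressed"; codes 2 to 4 point at the setup or the run itself.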
Output artifacts
Each run creates an iteration directory:
workspace/
└── iteration-1/
    ├── benchmark.json        ← aggregate stats with delta
    ├── SKILL.md.snapshot     ← copy of skill used
    └── eval-{slug}/
        ├── with_skill/
        │   ├── outputs/output.txt
        │   ├── timing.json
        │   ├── grading.json
        │   └── transcript.log
        └── without_skill/
            ├── outputs/output.txt
            ├── timing.json
            └── grading.json
benchmark.json includes metadata: eval_count, eval_ids, skill_name, runs_per_eval, timestamp.
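If you post-process benchmark.json yourself, a type guard along these lines can catch a malformed file early. The field names come from the list above; the field types, the `BenchmarkMetadata` and `asMetadata` names, and the sample values are all assumptions for illustration, and where the metadata sits inside the file is not specified here.

```typescript
// Assumed field types; only the field names are documented.
interface BenchmarkMetadata {
  eval_count: number;
  eval_ids: number[];
  skill_name: string;
  runs_per_eval: number;
  timestamp: string;
}

// Narrow an unknown parsed value to BenchmarkMetadata, or return null.
function asMetadata(value: unknown): BenchmarkMetadata | null {
  if (typeof value !== "object" || value === null) return null;
  const v = value as Record<string, unknown>;
  const ids = v.eval_ids;
  const ok =
    typeof v.eval_count === "number" &&
    Array.isArray(ids) &&
    ids.every((id: unknown) => typeof id === "number") &&
    typeof v.skill_name === "string" &&
    typeof v.runs_per_eval === "number" &&
    typeof v.timestamp === "string";
  return ok ? (v as unknown as BenchmarkMetadata) : null;
}

// Made-up values, for illustration only.
const sample: unknown = JSON.parse(
  '{"eval_count":3,"eval_ids":[1,2,3],"skill_name":"greeter",' +
    '"runs_per_eval":1,"timestamp":"2025-01-01T00:00:00Z"}',
);
const meta = asMetadata(sample);
console.log(meta ? `${meta.skill_name}: ${meta.eval_count} evals` : "unrecognized metadata");
```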
CI integration
name: Skill Evaluation
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npx snapeval eval skills/my-skill --threshold 0.8 --runs 3
snapeval exits with code 1 when the pass rate falls below the threshold, which fails the job and blocks the PR.
Configuration
Create snapeval.config.json in your skill or project root:
{
  "harness": "copilot-sdk",
  "inference": "auto",
  "workspace": "../{skill_name}-workspace",
  "runs": 1,
  "concurrency": 1
}
Resolution order: defaults → project config → skill-dir config → CLI flags.
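The resolution order amounts to a layered merge in which later sources win key by key. A minimal sketch, assuming each layer is a partial config object; `SnapevalConfig`, `DEFAULTS`, and `resolveConfig` are hypothetical names, and the default values are copied from the example above.

```typescript
interface SnapevalConfig {
  harness: string;
  inference: string;
  workspace: string;
  runs: number;
  concurrency: number;
}

// Defaults as shown in the example config.
const DEFAULTS: SnapevalConfig = {
  harness: "copilot-sdk",
  inference: "auto",
  workspace: "../{skill_name}-workspace",
  runs: 1,
  concurrency: 1,
};

// Apply layers in order: project config, then skill-dir config, then CLI flags.
// Each later layer overrides only the keys it actually sets.
function resolveConfig(...layers: Partial<SnapevalConfig>[]): SnapevalConfig {
  let resolved: SnapevalConfig = { ...DEFAULTS };
  for (const layer of layers) {
    resolved = { ...resolved, ...layer };
  }
  return resolved;
}

const projectConfig = { runs: 3 }; // project-level snapeval.config.json
const skillConfig = { concurrency: 4 }; // skill-dir snapeval.config.json
const cliFlags = { runs: 5 }; // e.g. --runs 5

const resolved = resolveConfig(projectConfig, skillConfig, cliFlags);
console.log(resolved);
```

Here the CLI flag overrides the project's runs value, the skill-dir config supplies concurrency, and everything else falls through to the defaults.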
Harness adapters
| Adapter | Description | Default |
|---------|-------------|---------|
| copilot-sdk | Programmatic via @github/copilot-sdk with native skill loading | yes |
| copilot-cli | Shells out to copilot CLI binary | no |
The SDK harness loads skills natively via skillDirectories, captures full transcripts, and extracts real token counts from assistant.usage events.
Inference adapters
| Adapter | Description |
|---------|-------------|
| auto | Uses @github/copilot-sdk by default and falls back to the GitHub Models API |
| copilot-sdk | @github/copilot-sdk programmatic |
| github-models | GitHub Models API (requires GITHUB_TOKEN) |
Contributing
See CONTRIBUTING.md.
