# snapeval

Harness-agnostic eval runner for agentskills.io skills.


snapeval runs every eval case with and without your skill, grades assertions, and computes a benchmark delta — so you can see exactly what value your skill adds.

```
snapeval — greeter
Baseline = without SKILL.md (raw AI response)
────────────────────────────────────────────────────────────
  #1 formal greeting for Eleanor
    Skill: 100% | Baseline: 33% | 5.2s
  #2 casual greeting for Marcus
    Skill: 100% ↑ was 67% | Baseline: 67% | 2.7s
  #3 pirate greeting for Zoe
    Skill: 100% | Baseline: 67% | 2.5s
────────────────────────────────────────────────────────────
Summary:
  Skill pass rate:    100.0%
  Baseline pass rate: 55.6%
  Improvement:        +44.4%
```

## How it works

1. You write a `SKILL.md` and an `evals.json` with test cases and assertions
2. snapeval runs each eval twice: once with your skill loaded, once without (baseline)
3. Assertions are graded by an LLM judge (semantic) and/or shell scripts (deterministic)
4. A benchmark shows where your skill adds value vs. where the raw AI already handles it

## Quick start

### As a Copilot plugin

```sh
copilot plugin install matantsach/snapeval
```

Then in Copilot CLI, just say "evaluate my skill"; the snapeval skill handles the rest.

### Standalone CLI

```sh
git clone https://github.com/matantsach/snapeval.git
cd snapeval && npm install
npx tsx bin/snapeval.ts eval <skill-dir>
```

## Eval format

```
my-skill/
├── SKILL.md
└── evals/
    ├── evals.json
    └── scripts/         ← optional deterministic checks
        └── validate.sh
```

`evals.json`:

```json
{
  "skill_name": "greeter",
  "evals": [
    {
      "id": 1,
      "label": "formal greeting for Eleanor",
      "prompt": "Can you give me a formal greeting for Eleanor?",
      "expected_output": "Returns the formal greeting addressed to Eleanor.",
      "assertions": [
        "Output contains the name Eleanor",
        "Output uses a formal tone",
        "script:validate.sh"
      ]
    }
  ]
}
```

| Field | Required | Description |
|-------|----------|-------------|
| `id` | yes | Unique numeric identifier |
| `prompt` | yes | The user prompt sent to the harness |
| `expected_output` | yes | Human description of the expected behavior |
| `label` | no | Human-readable name shown in terminal output |
| `slug` | no | Filesystem-safe name for the eval directory |
| `assertions` | no | List of assertions to grade (LLM semantic or `script:`-prefixed) |
| `files` | no | Input files to attach to the prompt |

## Assertions

**Semantic** assertions are graded by an LLM. Write specific, verifiable statements:

```
"Output contains a YAML block with an 'id' field for each issue"
"Response declines because the pipeline already has unclaimed issues"
```

**Script** assertions are prefixed with `script:`. Scripts live in `evals/scripts/`, receive the output directory as `$1`, and pass on exit code 0:

```
"script:validate-json-structure.sh"
```

## CLI reference

### eval

Run evals, grade assertions, compute benchmark.

```sh
npx snapeval eval [skill-dir] [options]
```

| Flag | Description | Default |
|------|-------------|---------|
| `--harness <name>` | Harness adapter | `copilot-sdk` |
| `--inference <name>` | Inference adapter for grading | `auto` |
| `--workspace <path>` | Output directory | `../{skill_name}-workspace` |
| `--runs <n>` | Harness invocations per eval for statistical averaging | `1` |
| `--concurrency <n>` | Parallel eval cases (1-10) | `1` |
| `--only <ids>` | Run specific eval IDs (e.g. `--only 1,3,5`) | all |
| `--threshold <rate>` | Minimum pass rate (0-1) for exit code 0 | none |
| `--old-skill <path>` | Compare against old skill version | none |
| `--feedback` | Write `feedback.json` template for human review | off |
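
For example, to grade only evals 1 and 3, average each over three harness runs, and run two cases in parallel (the skill path and flag values here are illustrative):

```sh
npx snapeval eval skills/my-skill --only 1,3 --runs 3 --concurrency 2
```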

## Exit codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | Threshold not met (eval ran but pass rate below `--threshold`) |
| 2 | Config/input error (bad JSON, missing fields, invalid flags) |
| 3 | File not found (missing skill dir, `evals.json`, or script) |
| 4 | Runtime error (harness failure, grading failure, timeout) |
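
A sketch of how a wrapper script might branch on these codes (paths illustrative):

```sh
#!/usr/bin/env bash
# Illustrative only: separate "skill needs work" from "setup is broken".
npx snapeval eval skills/my-skill --threshold 0.8
case $? in
  0) echo "pass" ;;
  1) echo "ran, but pass rate below threshold" ;;
  2|3) echo "setup problem: bad config or missing files" ;;
  4) echo "runtime failure; inspect transcript.log" ;;
esac
```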

## Output artifacts

Each run creates an iteration directory:

```
workspace/
└── iteration-1/
    ├── benchmark.json       ← aggregate stats with delta
    ├── SKILL.md.snapshot    ← copy of skill used
    └── eval-{slug}/
        ├── with_skill/
        │   ├── outputs/output.txt
        │   ├── timing.json
        │   ├── grading.json
        │   └── transcript.log
        └── without_skill/
            ├── outputs/output.txt
            ├── timing.json
            └── grading.json
```

`benchmark.json` includes metadata: `eval_count`, `eval_ids`, `skill_name`, `runs_per_eval`, `timestamp`.
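
Assuming `jq` is available, those documented fields can be pulled straight from a run:

```sh
# Extract just the documented metadata keys from a benchmark.
jq '{skill_name, eval_count, eval_ids, runs_per_eval, timestamp}' \
  workspace/iteration-1/benchmark.json
```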

CI integration

name: Skill Evaluation
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npx snapeval eval skills/my-skill --threshold 0.8 --runs 3

snapeval exits with code 1 when the pass rate falls below the threshold, which fails the job and blocks the PR.
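
The comparison and review flags compose the same way; for example, to benchmark a revised skill against its previous version and emit a review template (the old-version path is illustrative):

```sh
npx snapeval eval skills/my-skill --old-skill ./skills/my-skill-v1 --feedback
```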

## Configuration

Create `snapeval.config.json` in your skill or project root:

```json
{
  "harness": "copilot-sdk",
  "inference": "auto",
  "workspace": "../{skill_name}-workspace",
  "runs": 1,
  "concurrency": 1
}
```

Resolution order (later sources override earlier ones): defaults → project config → skill-dir config → CLI flags.
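
So with the config above, a CLI flag still wins:

```sh
# --runs 3 overrides "runs": 1 from snapeval.config.json,
# since CLI flags sit last in the resolution order.
npx snapeval eval skills/my-skill --runs 3
```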

## Harness adapters

| Adapter | Description | Default |
|---------|-------------|---------|
| `copilot-sdk` | Programmatic via `@github/copilot-sdk` with native skill loading | yes |
| `copilot-cli` | Shells out to `copilot` CLI binary | no |

The SDK harness loads skills natively via `skillDirectories`, captures full transcripts, and extracts real token counts from `assistant.usage` events.
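
To shell out to the CLI binary instead of the default SDK harness:

```sh
npx snapeval eval skills/my-skill --harness copilot-cli
```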

## Inference adapters

| Adapter | Description |
|---------|-------------|
| `auto` | Uses `@github/copilot-sdk` by default, falls back to GitHub Models API |
| `copilot-sdk` | `@github/copilot-sdk` programmatic |
| `github-models` | GitHub Models API (requires `GITHUB_TOKEN`) |
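
To force grading through the GitHub Models API (the token value is a placeholder):

```sh
# GITHUB_TOKEN is required by the github-models adapter.
GITHUB_TOKEN="<your-token>" npx snapeval eval skills/my-skill --inference github-models
```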

## Contributing

See CONTRIBUTING.md.

## License

MIT