@cristobalme/skill-test
v0.1.0
Test agent Skills (SKILL.md): static lint, activation triggering, and behavioral grading. Zero-config, offline-capable, CI-first.
skill-test
Test agent Skills (SKILL.md files) before you ship them. skill-test
validates skills across three layers:
- Static lint — offline, deterministic checks against the live Agent Skills spec: frontmatter, naming rules, description length, body size, broken file references, and risky instruction patterns.
- Triggering — does the agent actually load your skill for the prompts it should (and skip the ones it shouldn't)? Measured as precision/recall over a labeled corpus.
- Behavioral — does the skill produce correct output on real tasks? (graded, sandboxed)
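As a worked illustration of the precision/recall measurement in the triggering layer, the sketch below scores a tiny labeled corpus. The `TriggerResult` shape and the `scoreTriggering` function are hypothetical names for this example, not skill-test's API, and scoring an empty denominator as 1 is a convention chosen here.

```typescript
// Hypothetical shape: one labeled prompt plus the load/skip decision observed for it.
interface TriggerResult {
  prompt: string;
  shouldActivate: boolean; // label from the corpus
  activated: boolean; // what the agent actually decided
}

interface TriggerMetrics {
  precision: number;
  recall: number;
  f1: number;
}

function scoreTriggering(results: TriggerResult[]): TriggerMetrics {
  let truePos = 0;
  let falsePos = 0;
  let falseNeg = 0;
  for (const r of results) {
    if (r.activated && r.shouldActivate) truePos++;
    else if (r.activated && !r.shouldActivate) falsePos++;
    else if (!r.activated && r.shouldActivate) falseNeg++;
  }
  // Assumption for this sketch: with nothing to get wrong, the score is 1.
  const precision = truePos + falsePos === 0 ? 1 : truePos / (truePos + falsePos);
  const recall = truePos + falseNeg === 0 ? 1 : truePos / (truePos + falseNeg);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

const sample: TriggerResult[] = [
  { prompt: "fill out this PDF form", shouldActivate: true, activated: true },
  { prompt: "complete the application pdf", shouldActivate: true, activated: false },
  { prompt: "write me a poem", shouldActivate: false, activated: false },
  { prompt: "summarize this spreadsheet", shouldActivate: false, activated: true },
];

const metrics = scoreTriggering(sample);
console.log(metrics); // { precision: 0.5, recall: 0.5, f1: 0.5 }
```

Here one of two activations was wanted (precision 50%) and one of two wanted activations happened (recall 50%).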
No telemetry. No phone-home. The static layer needs no network and no API key.
Quick start
npx @cristobalme/skill-test lint ./my-skill
The package is published under the @cristobalme scope; the binary it installs
is named skill-test.
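To give a feel for what static checks on frontmatter, naming rules, and description length involve, here is a minimal sketch. It is not skill-test's implementation: `lintSkill`, `readFrontmatter`, and `LintIssue` are hypothetical names, the two length limits are assumed values that should be confirmed against the current Agent Skills spec, and the frontmatter reader only understands single-line `key: value` pairs.

```typescript
// Assumed limits for illustration only; confirm against the current Agent Skills spec.
const MAX_NAME_LENGTH = 64;
const MAX_DESCRIPTION_LENGTH = 1024;

interface LintIssue {
  field: "frontmatter" | "name" | "description";
  message: string;
}

// Naive reader: finds the leading `---` block and splits each line on the first colon.
function readFrontmatter(source: string): Map<string, string> | null {
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(source);
  if (match === null) return null;
  const block = match[1] ?? "";
  const fields = new Map<string, string>();
  for (const line of block.split(/\r?\n/)) {
    const sep = line.indexOf(":");
    if (sep > 0) fields.set(line.slice(0, sep).trim(), line.slice(sep + 1).trim());
  }
  return fields;
}

function lintSkill(source: string): LintIssue[] {
  const fields = readFrontmatter(source);
  if (fields === null) {
    return [{ field: "frontmatter", message: "missing frontmatter block" }];
  }
  const issues: LintIssue[] = [];
  const skillName = fields.get("name") ?? "";
  const description = fields.get("description") ?? "";

  // Assumed naming rule: lowercase letters and digits, separated by single hyphens.
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(skillName)) {
    issues.push({ field: "name", message: "use lowercase letters, digits, and single hyphens" });
  } else if (skillName.length > MAX_NAME_LENGTH) {
    issues.push({ field: "name", message: `longer than ${MAX_NAME_LENGTH} characters` });
  }

  if (description.length === 0) {
    issues.push({ field: "description", message: "description is missing" });
  } else if (description.length > MAX_DESCRIPTION_LENGTH) {
    issues.push({ field: "description", message: `longer than ${MAX_DESCRIPTION_LENGTH} characters` });
  }
  return issues;
}

const goodSkill = "---\nname: pdf-forms\ndescription: Fill out PDF forms.\n---\n# Body\n";
const badSkill = "---\nname: PDF Forms\n---\n# Body\n";

console.log(lintSkill(goodSkill).length); // 0
console.log(lintSkill(badSkill).map((issue) => issue.field)); // [ 'name', 'description' ]
```

Checks like these need only the file contents, which is why a lint layer of this kind can run offline and deterministically.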
Commands
skill-test lint <path...> # Layer 1 — static, offline, deterministic
skill-test trigger <path...> # Layer 2 — activation precision/recall (needs API key + spec)
skill-test run <path...> # Layer 3 — behavioral task grading
skill-test check <path...> # Runs every layer available given config/keys
<path> accepts a single SKILL.md, a skill directory, or a directory of many
skills (walked recursively).
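The three accepted forms of `<path>` can be pictured with a short sketch of recursive discovery. This is one plausible way to resolve a path to SKILL.md files, not the tool's actual code; `findSkillFiles` is a hypothetical helper, and the demo builds a throwaway directory tree purely for illustration.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readdirSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { basename, join } from "node:path";

// Resolve a file or directory to every SKILL.md at or beneath it.
function findSkillFiles(target: string): string[] {
  if (!existsSync(target)) return [];
  if (statSync(target).isFile()) {
    return basename(target) === "SKILL.md" ? [target] : [];
  }
  const found: string[] = [];
  for (const entry of readdirSync(target, { withFileTypes: true })) {
    const child = join(target, entry.name);
    if (entry.isDirectory()) found.push(...findSkillFiles(child));
    else if (entry.isFile() && entry.name === "SKILL.md") found.push(child);
  }
  return found.sort();
}

// Demo tree: two skills (one nested) and one unrelated file.
const demoRoot = mkdtempSync(join(tmpdir(), "skills-"));
mkdirSync(join(demoRoot, "pdf-forms"));
mkdirSync(join(demoRoot, "nested", "poems"), { recursive: true });
writeFileSync(join(demoRoot, "pdf-forms", "SKILL.md"), "---\nname: pdf-forms\n---\n");
writeFileSync(join(demoRoot, "nested", "poems", "SKILL.md"), "---\nname: poems\n---\n");
writeFileSync(join(demoRoot, "README.md"), "not a skill\n");

console.log(findSkillFiles(demoRoot).length); // 2 (directory of many skills)
console.log(findSkillFiles(join(demoRoot, "pdf-forms")).length); // 1 (skill directory)
console.log(findSkillFiles(join(demoRoot, "pdf-forms", "SKILL.md")).length); // 1 (single file)
```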
Global flags
| Flag | Effect |
| ---------------- | --------------------------------------------------------------- |
| --json | Emit machine-readable JSON to stdout |
| --junit <file> | Write a JUnit XML report to <file> (renders in CI dashboards) |
| --cheap | Skip the behavioral (run) layer |
| --quiet | Only print failures |
| --no-color | Disable ANSI color (also auto-off when stdout isn't a TTY) |
| --model <id> | Override the classifier model for the trigger layer |
Exit codes
| Code | Meaning |
| ---- | ---------------------------------------------------------------------- |
| 0 | All checks that ran passed |
| 1 | One or more failures (lint errors, or trigger false positives/negatives) |
| 2 | Usage or configuration error (no skill found, trigger without a key) |
Layers degrade gracefully: check always runs lint, and runs the trigger layer
only when both an ANTHROPIC_API_KEY and a SKILL.test.yaml are present. If the
key or spec is missing, the trigger layer is skipped rather than failed, so
check is safe to drop into any CI pipeline.
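A wrapper script can branch on the documented exit codes. The sketch below maps each code from the table to an outcome; `interpretExitCode`, `runCheck`, and the `CheckOutcome` labels are hypothetical names for this example. `runCheck` shells out to the real CLI through `npx`, so it is defined but not invoked here.

```typescript
import { spawnSync } from "node:child_process";

type CheckOutcome = "passed" | "failed" | "misconfigured" | "unknown";

// Map the documented exit codes (0, 1, 2) to an outcome label.
function interpretExitCode(code: number | null): CheckOutcome {
  switch (code) {
    case 0:
      return "passed";
    case 1:
      return "failed";
    case 2:
      return "misconfigured";
    default:
      // null means the process was killed by a signal; other codes are undocumented.
      return "unknown";
  }
}

// Runs `skill-test check` on a path and reports the outcome.
// Assumes npx is on PATH and the package can be fetched or is already installed.
function runCheck(skillPath: string): CheckOutcome {
  const result = spawnSync("npx", ["@cristobalme/skill-test", "check", skillPath], {
    stdio: "inherit",
  });
  return interpretExitCode(result.status);
}

console.log(interpretExitCode(0)); // passed
console.log(interpretExitCode(2)); // misconfigured
```

Keeping code 2 separate from code 1 lets a pipeline tell a broken setup apart from a skill that failed its checks.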
The SKILL.test.yaml spec
Place a SKILL.test.yaml next to your SKILL.md to enable the trigger layer:
skill: ./SKILL.md
triggering:
  should_activate:
    - "fill out this PDF form"
    - "complete the application pdf"
  should_not_activate:
    - "write me a poem"
    - "summarize this spreadsheet"
tasks: [] # behavioral tasks (Phase 5)
The trigger layer asks the model to make the same load/skip decision a host
agent makes at startup, using only the skill's name + description (never
the body). It reports precision/recall/F1 over the labeled prompts. Results are
cached on disk (keyed by model + description + prompt), so reruns are free.
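One way to key a cache by model + description + prompt is to hash the three values together, so that changing any one of them produces a fresh entry. The sketch below assumes SHA-256 over a JSON-encoded tuple; the actual key format used by skill-test is not documented here, and `triggerCacheKey` and the model id are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Derive a stable cache key from the three inputs that determine a trigger result.
function triggerCacheKey(model: string, description: string, prompt: string): string {
  // JSON-encoding the tuple keeps field boundaries unambiguous,
  // so ("ab", "c") and ("a", "bc") cannot collide by concatenation.
  const payload = JSON.stringify([model, description, prompt]);
  return createHash("sha256").update(payload).digest("hex");
}

const keyA = triggerCacheKey("example-model", "Fill out PDF forms.", "fill out this PDF form");
const keyB = triggerCacheKey("example-model", "Fill out PDF forms.", "fill out this PDF form");
const keyC = triggerCacheKey("example-model", "Fill out and sign PDF forms.", "fill out this PDF form");

console.log(keyA === keyB); // true: identical inputs reuse the cached result
console.log(keyA === keyC); // false: an edited description invalidates the entry
```

Under this scheme, editing a skill's description re-queries only the prompts for that skill, while untouched skills keep hitting the cache.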
GitHub Action
Test every skill on each PR and get a results comment. Drop this into
.github/workflows/skill-test.yml (full copy in
examples/skill-test.yml):
name: skill-test
on: [pull_request]
permissions:
  contents: read
  pull-requests: write
jobs:
  skill-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - uses: OWNER/skill-test/action@v1
        with:
          path: .
          # optional: enables the trigger layer; lint runs without it
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
Replace OWNER with the org/user the action is published under. Without an API
key the Action runs the offline lint layer and still posts a comment. With one,
it adds activation precision/recall. It posts (and updates in place) a single PR
comment:
| Skill | Lint | Trigger |
| -------------- | ----------- | ------------------ |
| good-skill | ✅ | ✅ P 100% · R 100% |
| broken-skill | ❌ 2 errors | ⏭️ skipped |
Add the badge
[](https://www.npmjs.com/package/@cristobalme/skill-test)
Privacy
No telemetry, no analytics, no phone-home. The static lint layer runs fully
offline. Only trigger and run call the Anthropic API, and only with the
metadata/inputs needed for the check.
Status
v0.1.0 ships the static lint layer, the activation trigger layer, the unified
check with JSON/JUnit output, and the GitHub Action + badge. The behavioral
run layer (sandboxed task grading) is planned for the next release.
License
MIT
