@cristobalme/skill-test

v0.1.0

Published

23 days ago

Test agent Skills (SKILL.md): static lint, activation triggering, and behavioral grading. Zero-config, offline-capable, CI-first.


cristobalme

skill skills agent agent-skills SKILL.md lint test ci claude anthropic

skill-test

Test agent Skills (SKILL.md files) before you ship them. skill-test validates skills across three layers:

Static lint — offline, deterministic checks against the live Agent Skills spec: frontmatter, naming rules, description length, body size, broken file references, and risky instruction patterns.
Triggering — does the agent actually load your skill for the prompts it should (and skip the ones it shouldn't)? Measured as precision/recall over a labeled corpus.
Behavioral — does the skill produce correct output on real tasks? (graded, sandboxed)
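
To make the static layer concrete, here is a minimal sketch of the kind of frontmatter check it describes. It is not skill-test's actual implementation: the names lintFrontmatter, Frontmatter, and NAME_PATTERN are hypothetical, and the 64 and 1024 character limits are assumptions that should be verified against the current Agent Skills spec.

```typescript
// Hypothetical sketch; none of these names come from skill-test itself.
interface Frontmatter {
  name?: string;
  description?: string;
}

// Assumed limits. Verify them against the current Agent Skills spec.
const MAX_NAME_LENGTH = 64;
const MAX_DESCRIPTION_LENGTH = 1024;

// Lowercase letters and digits, with single hyphens between segments.
const NAME_PATTERN = /^[a-z0-9]+(-[a-z0-9]+)*$/;

// Returns a list of human-readable errors; an empty list means the check passed.
function lintFrontmatter(fm: Frontmatter): string[] {
  const errors: string[] = [];
  const name = fm.name ?? "";
  const description = (fm.description ?? "").trim();

  if (name.length === 0) {
    errors.push("name is missing");
  } else {
    if (name.length > MAX_NAME_LENGTH) {
      errors.push(`name exceeds ${MAX_NAME_LENGTH} characters`);
    }
    if (!NAME_PATTERN.test(name)) {
      errors.push("name must use lowercase letters, digits, and hyphens");
    }
  }

  if (description.length === 0) {
    errors.push("description is missing");
  } else if (description.length > MAX_DESCRIPTION_LENGTH) {
    errors.push(`description exceeds ${MAX_DESCRIPTION_LENGTH} characters`);
  }

  return errors;
}
```

Because a check like this reads only the parsed frontmatter, it needs no network access and gives the same result on every run, which is the property the static layer relies on.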

No telemetry. No phone-home. The static layer needs no network and no API key.

Quick start

npx @cristobalme/skill-test lint ./my-skill

The package is published under the @cristobalme scope; the binary it installs is named skill-test.

Commands

skill-test lint    <path...>   # Layer 1 — static, offline, deterministic
skill-test trigger <path...>   # Layer 2 — activation precision/recall (needs API key + spec)
skill-test run     <path...>   # Layer 3 — behavioral task grading
skill-test check   <path...>   # Runs every layer available given config/keys

<path> accepts a single SKILL.md, a skill directory, or a directory of many skills (walked recursively).

Global flags

| Flag           | Effect                                                       |
| -------------- | ------------------------------------------------------------ |
| --json         | Emit machine-readable JSON to stdout                         |
| --junit <file> | Write a JUnit XML report to <file> (renders in CI dashboards) |
| --cheap        | Skip the behavioral (run) layer                              |
| --quiet        | Only print failures                                          |
| --no-color     | Disable ANSI color (also auto-off when stdout isn't a TTY)   |
| --model <id>   | Override the classifier model for the trigger layer          |

Exit codes

| Code | Meaning                                                                    |
| ---- | -------------------------------------------------------------------------- |
| 0    | All checks that ran passed                                                 |
| 1    | One or more failures (lint errors, or trigger false positives/negatives)   |
| 2    | Usage or configuration error (no skill found, trigger without a key)       |

Layers degrade gracefully: check always runs lint, and runs the trigger layer only when both an ANTHROPIC_API_KEY and a SKILL.test.yaml are present. If either is missing, the trigger layer is skipped rather than failed, so check is safe to drop into any CI pipeline.
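
The difference between calling trigger directly and calling check can be sketched as a small decision function. This is an illustration only, not the tool's real code: planLayers, Env, and Plan are hypothetical names, and treating a missing spec for trigger as a configuration error is an assumption (the exit-code table above only mentions a missing key). The run layer is left out because it has not shipped in v0.1.0.

```typescript
// Hypothetical model of the behavior described above; not skill-test's source.
type Command = "lint" | "trigger" | "check";
type Layer = "lint" | "trigger";

interface Env {
  hasApiKey: boolean; // ANTHROPIC_API_KEY is set
  hasSpec: boolean;   // a SKILL.test.yaml sits next to SKILL.md
}

interface Plan {
  run: Layer[];         // layers that will execute
  skipped: Layer[];     // layers skipped without failing
  configError: boolean; // true corresponds to exit code 2
}

function planLayers(command: Command, env: Env): Plan {
  const triggerReady = env.hasApiKey && env.hasSpec;

  if (command === "lint") {
    // Lint is offline and needs neither a key nor a spec.
    return { run: ["lint"], skipped: [], configError: false };
  }

  if (command === "trigger") {
    // Asking for the trigger layer explicitly without its prerequisites is an error.
    // Assumption: a missing spec is treated the same way as a missing key.
    return triggerReady
      ? { run: ["trigger"], skipped: [], configError: false }
      : { run: [], skipped: [], configError: true };
  }

  // check: lint always runs; trigger runs only when it can, otherwise it is skipped.
  return triggerReady
    ? { run: ["lint", "trigger"], skipped: [], configError: false }
    : { run: ["lint"], skipped: ["trigger"], configError: false };
}
```

For example, planLayers("check", { hasApiKey: false, hasSpec: true }) runs lint and skips trigger, while the same environment passed to planLayers("trigger", ...) reports a configuration error.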

The `SKILL.test.yaml` spec

Co-locate a SKILL.test.yaml next to your SKILL.md to enable the trigger layer:

skill: ./SKILL.md
triggering:
  should_activate:
    - "fill out this PDF form"
    - "complete the application pdf"
  should_not_activate:
    - "write me a poem"
    - "summarize this spreadsheet"
tasks: [] # behavioral tasks — Phase 5
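
For readers who want to check a spec file before handing it to the tool, the shape above can be written down as a type plus a small validator. This is a sketch under assumptions: SkillTestSpec and validateSpec are hypothetical names, the input is assumed to be the already-parsed YAML (parsing is left to whatever YAML library you use), and treating every field shown above as required is a guess rather than documented behavior.

```typescript
// Hypothetical type mirroring the example spec above.
interface SkillTestSpec {
  skill: string;
  triggering: {
    should_activate: string[];
    should_not_activate: string[];
  };
  tasks: unknown[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Takes the parsed YAML document and returns a list of problems (empty when valid).
// Assumption: all fields in the example are required.
function validateSpec(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) {
    return ["spec must be a mapping"];
  }
  const spec = raw as Record<string, unknown>;
  const problems: string[] = [];

  if (typeof spec["skill"] !== "string") {
    problems.push("skill must be a path string");
  }

  const triggering = spec["triggering"];
  if (typeof triggering !== "object" || triggering === null) {
    problems.push("triggering must be a mapping");
  } else {
    const t = triggering as Record<string, unknown>;
    if (!isStringArray(t["should_activate"])) {
      problems.push("triggering.should_activate must be a list of strings");
    }
    if (!isStringArray(t["should_not_activate"])) {
      problems.push("triggering.should_not_activate must be a list of strings");
    }
  }

  return problems;
}

// The example spec from above, as it would look after parsing.
const exampleSpec: SkillTestSpec = {
  skill: "./SKILL.md",
  triggering: {
    should_activate: ["fill out this PDF form", "complete the application pdf"],
    should_not_activate: ["write me a poem", "summarize this spreadsheet"],
  },
  tasks: [],
};
```

Calling validateSpec(exampleSpec) returns an empty list, while an object missing the triggering block yields a single problem naming that field.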

The trigger layer asks the model to make the same load/skip decision a host agent makes at startup, using only the skill's name and description (never the body). It reports precision, recall, and F1 over the labeled prompts. Results are cached on disk, keyed by model, description, and prompt, so rerunning with unchanged inputs does not call the API again.
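
The reported metrics follow the standard definitions, which can be sketched as below. LabeledResult and score are hypothetical names, and returning 1 for an empty denominator is a convention chosen for this sketch; skill-test may handle that edge case differently.

```typescript
// Hypothetical record of one labeled prompt and the model's decision.
interface LabeledResult {
  prompt: string;
  shouldActivate: boolean; // label from should_activate / should_not_activate
  activated: boolean;      // what the model decided
}

interface Metrics {
  precision: number;
  recall: number;
  f1: number;
}

function score(results: LabeledResult[]): Metrics {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;

  for (const r of results) {
    if (r.activated && r.shouldActivate) truePositives++;
    else if (r.activated && !r.shouldActivate) falsePositives++;
    else if (!r.activated && r.shouldActivate) falseNegatives++;
  }

  // Assumed convention: with nothing to be wrong about, the score is 1.
  const predicted = truePositives + falsePositives;
  const expected = truePositives + falseNegatives;
  const precision = predicted === 0 ? 1 : truePositives / predicted;
  const recall = expected === 0 ? 1 : truePositives / expected;

  // F1 is the harmonic mean of precision and recall.
  const sum = precision + recall;
  const f1 = sum === 0 ? 0 : (2 * precision * recall) / sum;

  return { precision, recall, f1 };
}

// The four prompts from the example spec, with one false activation.
const sample: LabeledResult[] = [
  { prompt: "fill out this PDF form", shouldActivate: true, activated: true },
  { prompt: "complete the application pdf", shouldActivate: true, activated: true },
  { prompt: "write me a poem", shouldActivate: false, activated: false },
  { prompt: "summarize this spreadsheet", shouldActivate: false, activated: true },
];
```

With the sample above, both intended prompts activate the skill (recall 1) but one unrelated prompt does too, so precision is 2/3 and F1 is 0.8.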

GitHub Action

Test every skill on each PR and get a results comment. Drop this into .github/workflows/skill-test.yml (full copy in examples/skill-test.yml):

name: skill-test
on: [pull_request]
permissions:
  contents: read
  pull-requests: write
jobs:
  skill-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - uses: OWNER/skill-test/action@v1
        with:
          path: .
          # optional — enables the trigger layer; lint runs without it
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}

Replace OWNER with the org/user the action is published under. Without an API key the Action runs the offline lint layer and still posts a comment. With one, it adds activation precision/recall. It posts (and updates in place) a single PR comment:

| Skill        | Lint        | Trigger            |
| ------------ | ----------- | ------------------ |
| good-skill   | ✅          | ✅ P 100% · R 100% |
| broken-skill | ❌ 2 errors | ⏭️ skipped         |

Add the badge

[![skills tested](https://img.shields.io/badge/skills-tested-8A2BE2)](https://www.npmjs.com/package/@cristobalme/skill-test)

Privacy

No telemetry, no analytics, no phone-home. The static lint layer runs fully offline. Only trigger and run call the Anthropic API, and only with the metadata/inputs needed for the check.

Status

v0.1.0 ships the static lint layer, the activation trigger layer, the unified check with JSON/JUnit output, and the GitHub Action + badge. The behavioral run layer (sandboxed task grading) is planned for the next release.

License

MIT
