# skilltest

v0.10.0
The testing framework for Agent Skills. Lint, test triggering, improve, and evaluate your SKILL.md files.
skilltest is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
The repository itself uses a fast Vitest suite for offline unit and integration coverage of the parser, linters, trigger math, config resolution, reporters, and linter orchestration.
## Why skilltest?
Agent Skills are quick to write but hard to validate before deployment:
- Descriptions can be too vague to trigger reliably.
- Broken paths in `scripts/`, `references/`, or `assets/` fail silently.
- You cannot easily measure trigger precision/recall.
- You do not know whether outputs are good until users exercise the skill.
skilltest closes this gap with one CLI and five modes.
## Install
Global:
```sh
npm install -g skilltest
```

Without install:

```sh
npx skilltest --help
```

Requires Node.js >= 18.
## Quick Start
Lint a skill:
```sh
skilltest lint ./path/to/skill
```

Trigger test:

```sh
skilltest trigger ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
```

End-to-end eval:

```sh
skilltest eval ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
```

Run the full quality gate:

```sh
skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
```

Propose a verified rewrite without touching the source file:

```sh
skilltest improve ./path/to/skill --provider anthropic
```

Apply the verified rewrite in place:

```sh
skilltest improve ./path/to/skill --provider anthropic --apply
```

Write a self-contained HTML report:

```sh
skilltest check ./path/to/skill --html ./reports/check.html
```

Model-backed commands default to `--concurrency 5`. Use `--concurrency 1` to force sequential execution. Seeded trigger runs stay deterministic regardless of concurrency.

`lint`, `trigger`, `eval`, and `check` support `--html <path>` for offline reports. `improve` is terminal/JSON only in v1.
Example lint summary:
```text
skilltest lint
target: ./test-fixtures/sample-skill
summary: 29/29 checks passed, 0 warnings, 0 failures
```

## Configuration
skilltest resolves config in this order:
1. `.skilltestrc` in the target skill root
2. `.skilltestrc` in the current working directory
3. the nearest `package.json` containing a `skilltestrc` field
CLI flags override config values.
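A minimal sketch of this precedence, assuming the first config source found supplies the file config and CLI flags override it key by key (`resolveConfig` is a hypothetical name, not skilltest's API):

```js
// Sources are given in priority order; the first one present wins,
// then CLI flags override individual values from it.
function resolveConfig(cliFlags, sourcesInPriorityOrder) {
  const fileConfig = sourcesInPriorityOrder.find((source) => source != null) ?? {};
  return { ...fileConfig, ...cliFlags };
}
```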
Example .skilltestrc:
```json
{
  "provider": "anthropic",
  "model": "claude-sonnet-4-5-20250929",
  "concurrency": 5,
  "trigger": {
    "numQueries": 20,
    "threshold": 0.8,
    "seed": 123
  },
  "eval": {
    "numRuns": 5,
    "threshold": 0.9,
    "maxToolIterations": 10
  }
}
```

## Commands
### `skilltest lint <path-to-skill>`
Static analysis only. Fast and offline.
What it checks:
- Frontmatter:
  - YAML presence and validity
  - `name` required, max 64 characters, lowercase/numbers/hyphens, no leading/trailing/consecutive hyphens
  - `description` required, non-empty, max 1024 characters
  - warn if no `license`
  - warn if description is weak on both what and when
- Structure:
  - warns if `SKILL.md` exceeds 500 lines
  - warns if long references (300+ lines) have no table of contents
  - validates referenced files in `scripts/`, `references/`, `assets/`
  - detects broken relative file references
- Content heuristics:
  - warns if no headers
  - warns if no examples
  - warns on vague phrases
  - warns on angle brackets in frontmatter
  - fails on obvious secret patterns
  - warns on empty/too-short body
  - warns on very short description
- Security heuristics:
  - fails on dangerous command patterns (destructive deletes, pipe-to-shell remote scripts)
  - fails on obvious sensitive-data exfiltration instructions
  - warns on privilege-escalation language (`sudo`, disable approvals, `require_escalated`)
  - warns when shell instructions exist without explicit safety guardrails
- Progressive disclosure:
  - warns if `SKILL.md` is large and no `references/` exists
  - validates references are relative and inside the skill root
  - warns on deep reference chains
- Compatibility hints:
  - warns on provider-specific conventions such as `allowed-tools`
  - emits a likely compatibility summary
Flags:
- `--html <path>` write a self-contained HTML report
- `--plugin <path>` load a custom lint plugin file (repeatable)
#### Plugin Rules
You can run custom lint rules alongside the built-in checks. Plugin rules use the same `LintContext` and `LintIssue` types as the core linter, and their results appear in the same `LintReport`.
Config:
```json
{
  "lint": {
    "plugins": ["./my-rules.js"]
  }
}
```

CLI:

```sh
skilltest lint ./skill --plugin ./my-rules.js
```

Minimal plugin example:
```js
export default {
  rules: [
    {
      checkId: "custom:no-todo",
      title: "No TODO comments",
      check(context) {
        const body = context.frontmatter.content;
        if (/\bTODO\b/.test(body)) {
          return [
            {
              id: "custom.no-todo",
              checkId: "custom:no-todo",
              title: "No TODO comments",
              status: "warn",
              message: "SKILL.md contains a TODO marker."
            }
          ];
        }
        return [
          {
            id: "custom.no-todo",
            checkId: "custom:no-todo",
            title: "No TODO comments",
            status: "pass",
            message: "No TODO markers found."
          }
        ];
      }
    }
  ]
};
```

Notes:
- Plugin files are loaded with dynamic `import()`. `.js` and `.mjs` work directly; `.ts` plugins must be precompiled by the user.
- Plugin rules run after all built-in lint checks, in the order the plugin files are listed.
- CLI `--plugin` values replace config-file `lint.plugins` values.
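The ordering described in the notes above can be sketched as follows. This is illustrative only, not skilltest's actual orchestration code; `runLintRules`, `builtinRules`, and `pluginModules` are hypothetical names:

```js
// Sketch: run built-in rules first, then plugin rules in the order the
// plugin files were listed, collecting every issue into one report.
function runLintRules(context, builtinRules, pluginModules) {
  const issues = [];
  for (const rule of builtinRules) issues.push(...rule.check(context));
  for (const mod of pluginModules) {
    for (const rule of mod.rules) issues.push(...rule.check(context));
  }
  return {
    issues,
    failed: issues.some((issue) => issue.status === "fail"),
  };
}
```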
### `skilltest trigger <path-to-skill>`
Measures trigger behavior for your skill description with model simulation.
Flow:
1. Reads `name` and `description` from frontmatter.
2. Generates balanced trigger/non-trigger queries (or loads a custom query file).
3. For each query, asks the model to select one skill from a mixed list:
   - your skill under test
   - realistic fake skills
   - optional sibling competitor skills from `--compare`
4. Computes TP, TN, FP, FN, precision, recall, F1.
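The metrics in the last step follow the standard confusion-matrix definitions. A self-contained sketch (`triggerMetrics` is a hypothetical helper, not skilltest's API):

```js
// Confusion-matrix metrics from trigger outcomes.
// Each case: { shouldTrigger: boolean, triggered: boolean }.
function triggerMetrics(cases) {
  let tp = 0, tn = 0, fp = 0, fn = 0;
  for (const c of cases) {
    if (c.shouldTrigger && c.triggered) tp++;
    else if (!c.shouldTrigger && !c.triggered) tn++;
    else if (!c.shouldTrigger && c.triggered) fp++;
    else fn++;
  }
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { tp, tn, fp, fn, precision, recall, f1 };
}
```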
For reproducible fake-skill sampling, pass `--seed <number>`. When a seed is used, terminal and JSON output include it so the run can be repeated exactly. If you use `.skilltestrc`, `trigger.seed` sets the default and the CLI flag overrides it.
The fake-skill setup is precomputed before requests begin, so the same seed produces
the same trigger cases at any concurrency level.
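Seed-stable sampling can be sketched with a small deterministic PRNG such as mulberry32. This is illustrative only; skilltest's actual RNG and sampling code are not documented here:

```js
// mulberry32: tiny deterministic PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw `count` distinct items; the same seed always yields the same sample.
function sample(items, count, seed) {
  const rand = mulberry32(seed);
  const pool = [...items];
  const picked = [];
  while (picked.length < count && pool.length > 0) {
    picked.push(pool.splice(Math.floor(rand() * pool.length), 1)[0]);
  }
  return picked;
}
```

Because the sample depends only on the seed, it can be precomputed before any requests are issued, which is what makes the result independent of concurrency.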
Flags:
- `--model <model>` default: `claude-sonnet-4-5-20250929`
- `--provider <anthropic|openai>` default: `anthropic`
- `--queries <path>` use custom queries JSON
- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
- `--num-queries <n>` default: `20` (must be even)
- `--seed <number>` RNG seed for reproducible fake-skill sampling
- `--concurrency <n>` default: `5`
- `--html <path>` write a self-contained HTML report
- `--save-queries <path>` save generated query set
- `--api-key <key>` explicit key override
- `--verbose` show full model decision text
#### Comparative Trigger Testing
Test whether your skill is distinctive enough to be selected over similar real skills:
```sh
skilltest trigger ./my-skill --compare ../similar-skill-1 --compare ../similar-skill-2
```

Config:

```json
{
  "trigger": {
    "compare": ["../similar-skill-1", "../similar-skill-2"]
  }
}
```

Comparative mode includes the real competitor skills in the candidate list alongside the fake skills. This reveals confusion between skills with overlapping descriptions that standard trigger testing would miss.
### `skilltest eval <path-to-skill>`
Runs full skill behavior and grades outputs against assertions.
Flow:
1. Loads prompts from a file or auto-generates 5 prompts.
2. Injects the full `SKILL.md` as system instructions.
3. Runs each prompt on the chosen model.
4. Uses a grader model to score each assertion with evidence.
Flags:
- `--prompts <path>` custom prompts JSON
- `--model <model>` default: `claude-sonnet-4-5-20250929`
- `--grader-model <model>` default: same as `--model`
- `--provider <anthropic|openai>` default: `anthropic`
- `--concurrency <n>` default: `5`
- `--html <path>` write a self-contained HTML report
- `--save-results <path>` write full JSON result
- `--api-key <key>` explicit key override
- `--verbose` show full model responses
Config-only eval setting:
- `eval.maxToolIterations` default: `10`, safety cap for tool-aware eval loops
#### Tool-Aware Eval
When an eval prompt defines tools, skilltest runs the prompt in a mock tool
environment instead of plain text-only execution. The model can call the mocked
tools during eval, and skilltest records the calls alongside the normal grader
assertions.
Tool responses are always mocked. skilltest does not execute real tools,
scripts, shell commands, MCP servers, or APIs during eval.
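Given the `responses` map shown in the prompt-file format below, mock resolution can be sketched as an exact match on the JSON-serialized arguments with a `"*"` fallback. This is illustrative only (`mockToolResponse` is a hypothetical name), and the key-order sensitivity of `JSON.stringify` is a limitation of this sketch, not a documented skilltest behavior:

```js
// Resolve a mocked tool call: exact JSON-args key first, then "*" fallback.
function mockToolResponse(tool, args) {
  const key = JSON.stringify(args);
  if (Object.prototype.hasOwnProperty.call(tool.responses, key)) {
    return tool.responses[key];
  }
  return tool.responses["*"];
}
```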
Example prompt file:
```json
[
  {
    "prompt": "Parse this deployment checklist and tell me what is missing.",
    "assertions": ["output should mention the missing rollback plan"],
    "tools": [
      {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": [
          { "name": "path", "type": "string", "description": "File path to read", "required": true }
        ],
        "responses": {
          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
          "*": "[mock] File not found"
        }
      },
      {
        "name": "run_script",
        "description": "Execute a shell script",
        "parameters": [
          { "name": "command", "type": "string", "description": "Command to run", "required": true }
        ],
        "responses": {
          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
        }
      }
    ],
    "toolAssertions": [
      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
      {
        "type": "tool_argument_match",
        "toolName": "read_file",
        "expectedArgs": { "path": "checklist.md" },
        "description": "Model should read checklist.md specifically"
      }
    ]
  }
]
```

Run it with:

```sh
skilltest eval ./my-skill --prompts ./eval-prompts.json
```

### `skilltest check <path-to-skill>`
Runs lint + trigger + eval in one command and applies quality thresholds.
Default behavior:
1. Run lint.
2. Stop before model calls if lint has failures.
3. Run trigger and eval only when lint passes.
4. When concurrency is greater than `1`, run trigger and eval in parallel.
5. Fail the quality gate when either threshold is below target.
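The gate decision itself reduces to two threshold comparisons. A minimal sketch (`qualityGate` is a hypothetical name; the defaults mirror the documented `--min-f1` and `--min-assert-pass-rate` values):

```js
// Pass only when both metrics meet their thresholds.
function qualityGate({ f1, assertPassRate }, { minF1 = 0.8, minAssertPassRate = 0.9 } = {}) {
  return f1 >= minF1 && assertPassRate >= minAssertPassRate;
}
```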
Flags:
- `--provider <anthropic|openai>` default: `anthropic`
- `--model <model>` default: `claude-sonnet-4-5-20250929` (auto-switches to `gpt-4.1-mini` for `--provider openai` when unchanged)
- `--grader-model <model>` default: same as resolved `--model`
- `--api-key <key>` explicit key override
- `--queries <path>` custom trigger queries JSON
- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
- `--num-queries <n>` default: `20` (must be even)
- `--seed <number>` RNG seed for reproducible trigger sampling
- `--prompts <path>` custom eval prompts JSON
- `--plugin <path>` load a custom lint plugin file (repeatable)
- `--concurrency <n>` default: `5` (`1` keeps the old sequential `check` behavior)
- `--html <path>` write a self-contained HTML report
- `--min-f1 <n>` default: `0.8`
- `--min-assert-pass-rate <n>` default: `0.9`
- `--save-results <path>` save combined check result JSON
- `--continue-on-lint-fail` continue trigger/eval even if lint fails
- `--verbose` include detailed trigger/eval sections
### `skilltest improve <path-to-skill>`
Rewrites `SKILL.md`, verifies the rewrite on a frozen test set, and optionally applies it.
Default behavior:
1. Run a baseline `check` with `continue-on-lint-fail=true`.
2. Freeze the exact trigger queries and eval prompts used in that baseline run.
3. Ask the model for a structured JSON rewrite of `SKILL.md`.
4. Rebuild and validate the candidate locally:
   - must stay parseable
   - must keep the same skill `name`
   - must keep the current `license` when one already exists
   - must not introduce broken relative references
5. Verify the candidate by rerunning `check` against a copied skill directory with the frozen trigger/eval inputs.
6. Only write files when the candidate measurably improves the skill and passes the configured quality gates.
Flags:
- `--provider <anthropic|openai>` default: `anthropic`
- `--model <model>` default: `claude-sonnet-4-5-20250929` (auto-switches to `gpt-4.1-mini` for `--provider openai` when unchanged)
- `--api-key <key>` explicit key override
- `--queries <path>` custom trigger queries JSON
- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
- `--num-queries <n>` default: `20` (must be even when auto-generating)
- `--seed <number>` RNG seed for reproducible trigger sampling
- `--prompts <path>` custom eval prompts JSON
- `--plugin <path>` load a custom lint plugin file (repeatable)
- `--concurrency <n>` default: `5`
- `--output <path>` write the verified candidate `SKILL.md` to a separate file
- `--save-results <path>` save full improve result JSON
- `--min-f1 <n>` default: `0.8`
- `--min-assert-pass-rate <n>` default: `0.9`
- `--apply` write the verified rewrite back to the source `SKILL.md`
- `--verbose` include full baseline and verification reports
Notes:
- `improve` is dry-run by default. `--apply` only writes when parse, lint, trigger, and eval verification all pass.
- Before/after metrics are measured against the same generated or user-supplied trigger queries and eval prompts, not a fresh sample.
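The "measurably improves" condition can be sketched as a comparison of the baseline and candidate metrics on the frozen inputs. The exact acceptance rule is internal to `improve`; this sketch (hypothetical `isImprovement` helper) only illustrates the shape of such a check:

```js
// Accept a rewrite only if neither frozen metric regresses
// and at least one improves (illustrative acceptance rule).
function isImprovement(baseline, candidate) {
  const noRegression =
    candidate.f1 >= baseline.f1 &&
    candidate.assertPassRate >= baseline.assertPassRate;
  const gained =
    candidate.f1 > baseline.f1 ||
    candidate.assertPassRate > baseline.assertPassRate;
  return noRegression && gained;
}
```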
## Global Flags

- `--help` show help
- `--version` show version
- `--json` output only valid JSON to stdout
- `--no-color` disable terminal colors
## Input File Formats
Trigger queries (`--queries`):

```json
[
  {
    "query": "Please validate this deployment checklist and score it.",
    "should_trigger": true
  },
  {
    "query": "Write a SQL migration for adding an index.",
    "should_trigger": false
  }
]
```

Eval prompts (`--prompts`):
```json
[
  {
    "prompt": "Validate this markdown checklist for a production release.",
    "assertions": [
      "output should include pass/warn/fail style categorization",
      "output should provide at least one remediation recommendation"
    ]
  }
]
```

Tool-aware eval prompts (`--prompts`):
```json
[
  {
    "prompt": "Parse this deployment checklist and tell me what is missing.",
    "assertions": ["output should mention remediation steps"],
    "tools": [
      {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": [
          { "name": "path", "type": "string", "description": "File path to read", "required": true }
        ],
        "responses": {
          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
          "*": "[mock] File not found"
        }
      },
      {
        "name": "run_script",
        "description": "Execute a shell script",
        "parameters": [
          { "name": "command", "type": "string", "description": "Command to run", "required": true }
        ],
        "responses": {
          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
        }
      }
    ],
    "toolAssertions": [
      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
      {
        "type": "tool_argument_match",
        "toolName": "read_file",
        "expectedArgs": { "path": "checklist.md" },
        "description": "Model should read checklist.md specifically"
      }
    ]
  }
]
```

## Output and Exit Codes
Exit codes:
- `0`: success
- `1`: quality gate failed (`lint`, `check`, or `improve` blocked, or other command-specific failure conditions)
- `2`: runtime/config/API/parse error
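A wrapper script that consumes `--json` output could map outcomes to these codes like so. This is a sketch of the documented contract, not skilltest's own code; `exitCode` and the `outcome` shape are hypothetical:

```js
// Map a run outcome to the documented exit codes.
function exitCode(outcome) {
  if (outcome.error) return 2;       // runtime/config/API/parse error
  if (!outcome.gatePassed) return 1; // quality gate or command-specific failure
  return 0;                          // success
}
```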
JSON mode examples:
```sh
skilltest lint ./skill --json
skilltest trigger ./skill --json
skilltest eval ./skill --json
skilltest check ./skill --json
skilltest improve ./skill --json
```

HTML report examples:

```sh
skilltest lint ./skill --html ./reports/lint.html
skilltest trigger ./skill --html ./reports/trigger.html
skilltest eval ./skill --html ./reports/eval.html
skilltest check ./skill --json --html ./reports/check.html
```

Seeded trigger example:

```sh
skilltest trigger ./skill --seed 123
```

## API Keys
Anthropic:
```sh
export ANTHROPIC_API_KEY=your-key
```

OpenAI:

```sh
export OPENAI_API_KEY=your-key
```

Override at runtime:

```sh
skilltest trigger ./skill --api-key your-key
```

Current provider status:

- `anthropic`: implemented
- `openai`: implemented

OpenAI quick example:

```sh
skilltest trigger ./path/to/skill --provider openai --model gpt-4.1-mini
skilltest eval ./path/to/skill --provider openai --model gpt-4.1-mini
```

Note: if you pass `--provider openai` and keep the Anthropic default model value, skilltest automatically switches to `gpt-4.1-mini`.
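The auto-switch rule can be sketched as a small resolution step. This is illustrative of the documented behavior only; `resolveModel` is a hypothetical name, not skilltest's API:

```js
// If the provider is openai and the model is still the Anthropic default,
// switch to the OpenAI default; otherwise keep the requested model.
const ANTHROPIC_DEFAULT = "claude-sonnet-4-5-20250929";
const OPENAI_DEFAULT = "gpt-4.1-mini";

function resolveModel(provider, model = ANTHROPIC_DEFAULT) {
  if (provider === "openai" && model === ANTHROPIC_DEFAULT) return OPENAI_DEFAULT;
  return model;
}
```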
## CI/CD Integration
GitHub Actions example to lint skills on pull requests:
```yaml
name: skill-lint
on:
  pull_request:
    paths:
      - "**/SKILL.md"
      - "**/references/**"
      - "**/scripts/**"
      - "**/assets/**"
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run lint
      - run: npm run test
      - run: npm run build
      - run: npx skilltest lint path/to/skill --json
```

Optional nightly trigger/eval:

```yaml
name: skill-eval-nightly
on:
  schedule:
    - cron: "0 4 * * *"
jobs:
  trigger-eval:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run build
      - run: npx skilltest trigger path/to/skill --num-queries 20 --json
      - run: npx skilltest eval path/to/skill --prompts path/to/prompts.json --json
      - run: npx skilltest check path/to/skill --min-f1 0.8 --min-assert-pass-rate 0.9 --json
```

## Local Development
```sh
npm install
npm run lint
npm run test
npm run build
node dist/index.js --help
```

`npm test` runs the Vitest suite. The tests are offline and do not call model providers.

Manual CLI smoke tests:

```sh
node dist/index.js lint test-fixtures/sample-skill/
node dist/index.js lint test-fixtures/sample-skill/ --html lint-report.html
node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
node dist/index.js improve test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
```

## Release Checklist
```sh
npm run lint
npm run build
npm run test
npm pack --dry-run
npm publish --dry-run
```

Then publish:

```sh
npm publish
```

## Contributing
Issues and pull requests are welcome. Include:
- clear reproduction steps
- expected vs actual behavior
- a sample `SKILL.md` or fixtures when relevant
## License
MIT
