# skilltest

v0.10.0
The testing framework for Agent Skills. Lint, test triggering, improve, and evaluate your SKILL.md files.
skilltest is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
The repository itself uses a fast Vitest suite for offline unit and integration coverage of the parser, linters, trigger math, config resolution, reporters, and linter orchestration.
## Why skilltest?
Agent Skills are quick to write but hard to validate before deployment:
- Descriptions can be too vague to trigger reliably.
- Broken paths in `scripts/`, `references/`, or `assets/` fail silently.
- You cannot easily measure trigger precision/recall.
- You do not know whether outputs are good until users exercise the skill.
skilltest closes this gap with one CLI and five modes.
## Install
Global:
```sh
npm install -g skilltest
```

Without install:

```sh
npx skilltest --help
```

Requires Node.js >= 18.
## Quick Start
Lint a skill:
```sh
skilltest lint ./path/to/skill
```

Trigger test:

```sh
skilltest trigger ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
```

End-to-end eval:

```sh
skilltest eval ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
```

Run the full quality gate:

```sh
skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
```

Propose a verified rewrite without touching the source file:

```sh
skilltest improve ./path/to/skill --provider anthropic
```

Apply the verified rewrite in place:

```sh
skilltest improve ./path/to/skill --provider anthropic --apply
```

Write a self-contained HTML report:

```sh
skilltest check ./path/to/skill --html ./reports/check.html
```

Model-backed commands default to `--concurrency 5`. Use `--concurrency 1` to force sequential execution. Seeded trigger runs stay deterministic regardless of concurrency.

`lint`, `trigger`, `eval`, and `check` support `--html <path>` for offline reports. `improve` is terminal/JSON only in v1.
Example lint summary:
```text
skilltest lint
target: ./test-fixtures/sample-skill
summary: 29/29 checks passed, 0 warnings, 0 failures
```

## Configuration
skilltest resolves config in this order:
1. `.skilltestrc` in the target skill root
2. `.skilltestrc` in the current working directory
3. the nearest `package.json` containing a `skilltestrc` field
CLI flags override config values.
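A minimal sketch of this precedence, assuming the first config source found supplies the file config and CLI flags override it key by key (`resolveConfig` is a hypothetical name, not skilltest's API):

```js
// Sources are given in priority order; the first one present wins,
// then CLI flags override individual values from it.
function resolveConfig(cliFlags, sourcesInPriorityOrder) {
  const fileConfig = sourcesInPriorityOrder.find((source) => source != null) ?? {};
  return { ...fileConfig, ...cliFlags };
}
```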
Example .skilltestrc:
```json
{
  "provider": "anthropic",
  "model": "claude-sonnet-4-5-20250929",
  "concurrency": 5,
  "trigger": {
    "numQueries": 20,
    "threshold": 0.8,
    "seed": 123
  },
  "eval": {
    "numRuns": 5,
    "threshold": 0.9,
    "maxToolIterations": 10
  }
}
```

## Commands
### `skilltest lint <path-to-skill>`
Static analysis only. Fast and offline.
What it checks:
- Frontmatter:
  - YAML presence and validity
  - `name` required, max 64 characters, lowercase/numbers/hyphens, no leading/trailing/consecutive hyphens
  - `description` required, non-empty, max 1024 characters
  - warn if no `license`
  - warn if description is weak on both what and when
- Structure:
  - warns if `SKILL.md` exceeds 500 lines
  - warns if long references (300+ lines) have no table of contents
  - validates referenced files in `scripts/`, `references/`, `assets/`
  - detects broken relative file references
- Content heuristics:
  - warns if no headers
  - warns if no examples
  - warns on vague phrases
  - warns on angle brackets in frontmatter
  - fails on obvious secret patterns
  - warns on empty/too-short body
  - warns on very short description
- Security heuristics:
  - fails on dangerous command patterns (destructive deletes, pipe-to-shell remote scripts)
  - fails on obvious sensitive-data exfiltration instructions
  - warns on privilege-escalation language (`sudo`, disable approvals, `require_escalated`)
  - warns when shell instructions exist without explicit safety guardrails
- Progressive disclosure:
  - warns if `SKILL.md` is large and no `references/` exists
  - validates references are relative and inside the skill root
  - warns on deep reference chains
- Compatibility hints:
  - warns on provider-specific conventions such as `allowed-tools`
  - emits a likely compatibility summary
Flags:
- `--html <path>` write a self-contained HTML report
- `--plugin <path>` load a custom lint plugin file (repeatable)
#### Plugin Rules
You can run custom lint rules alongside the built-in checks. Plugin rules use the same `LintContext` and `LintIssue` types as the core linter, and their results appear in the same `LintReport`.
Config:
```json
{
  "lint": {
    "plugins": ["./my-rules.js"]
  }
}
```

CLI:

```sh
skilltest lint ./skill --plugin ./my-rules.js
```

Minimal plugin example:
```js
export default {
  rules: [
    {
      checkId: "custom:no-todo",
      title: "No TODO comments",
      check(context) {
        const body = context.frontmatter.content;
        if (/\bTODO\b/.test(body)) {
          return [
            {
              id: "custom.no-todo",
              checkId: "custom:no-todo",
              title: "No TODO comments",
              status: "warn",
              message: "SKILL.md contains a TODO marker."
            }
          ];
        }
        return [
          {
            id: "custom.no-todo",
            checkId: "custom:no-todo",
            title: "No TODO comments",
            status: "pass",
            message: "No TODO markers found."
          }
        ];
      }
    }
  ]
};
```

Notes:
- Plugin files are loaded with dynamic `import()`. `.js` and `.mjs` work directly; `.ts` plugins must be precompiled by the user.
- Plugin rules run after all built-in lint checks, in the order the plugin files are listed.
- CLI `--plugin` values replace config-file `lint.plugins` values.
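The ordering described in the notes above can be sketched as follows. This is illustrative only, not skilltest's actual orchestration code; `runLintRules`, `builtinRules`, and `pluginModules` are hypothetical names:

```js
// Sketch: run built-in rules first, then plugin rules in the order the
// plugin files were listed, collecting every issue into one report.
function runLintRules(context, builtinRules, pluginModules) {
  const issues = [];
  for (const rule of builtinRules) issues.push(...rule.check(context));
  for (const mod of pluginModules) {
    for (const rule of mod.rules) issues.push(...rule.check(context));
  }
  return {
    issues,
    failed: issues.some((issue) => issue.status === "fail"),
  };
}
```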
### `skilltest trigger <path-to-skill>`
Measures trigger behavior for your skill description with model simulation.
Flow:
1. Reads `name` and `description` from frontmatter.
2. Generates balanced trigger/non-trigger queries (or loads a custom query file).
3. For each query, asks the model to select one skill from a mixed list:
   - your skill under test
   - realistic fake skills
   - optional sibling competitor skills from `--compare`
4. Computes TP, TN, FP, FN, precision, recall, F1.
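The metrics in the last step follow the standard confusion-matrix definitions. A self-contained sketch (`triggerMetrics` is a hypothetical helper, not skilltest's API):

```js
// Confusion-matrix metrics from trigger outcomes.
// Each case: { shouldTrigger: boolean, triggered: boolean }.
function triggerMetrics(cases) {
  let tp = 0, tn = 0, fp = 0, fn = 0;
  for (const c of cases) {
    if (c.shouldTrigger && c.triggered) tp++;
    else if (!c.shouldTrigger && !c.triggered) tn++;
    else if (!c.shouldTrigger && c.triggered) fp++;
    else fn++;
  }
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { tp, tn, fp, fn, precision, recall, f1 };
}
```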
For reproducible fake-skill sampling, pass `--seed <number>`. When a seed is used, terminal and JSON output include it so the run can be repeated exactly. If you use `.skilltestrc`, `trigger.seed` sets the default and the CLI flag overrides it.
The fake-skill setup is precomputed before requests begin, so the same seed produces
the same trigger cases at any concurrency level.
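Seed-stable sampling can be sketched with a small deterministic PRNG such as mulberry32. This is illustrative only; skilltest's actual RNG and sampling code are not documented here:

```js
// mulberry32: tiny deterministic PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw `count` distinct items; the same seed always yields the same sample.
function sample(items, count, seed) {
  const rand = mulberry32(seed);
  const pool = [...items];
  const picked = [];
  while (picked.length < count && pool.length > 0) {
    picked.push(pool.splice(Math.floor(rand() * pool.length), 1)[0]);
  }
  return picked;
}
```

Because the sample depends only on the seed, it can be precomputed before any requests are issued, which is what makes the result independent of concurrency.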
Flags:
- `--model <model>` default: `claude-sonnet-4-5-20250929`
- `--provider <anthropic|openai>` default: `anthropic`
- `--queries <path>` use custom queries JSON
- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
- `--num-queries <n>` default: `20` (must be even)
- `--seed <number>` RNG seed for reproducible fake-skill sampling
- `--concurrency <n>` default: `5`
- `--html <path>` write a self-contained HTML report
- `--save-queries <path>` save generated query set
- `--api-key <key>` explicit key override
- `--verbose` show full model decision text
#### Comparative Trigger Testing
Test whether your skill is distinctive enough to be selected over similar real skills:
```sh
skilltest trigger ./my-skill --compare ../similar-skill-1 --compare ../similar-skill-2
```

Config:

```json
{
  "trigger": {
    "compare": ["../similar-skill-1", "../similar-skill-2"]
  }
}
```

Comparative mode includes the real competitor skills in the candidate list alongside the fake skills. This reveals confusion between skills with overlapping descriptions that standard trigger testing would miss.
### `skilltest eval <path-to-skill>`
Runs full skill behavior and grades outputs against assertions.
Flow:
1. Loads prompts from a file or auto-generates 5 prompts.
2. Injects the full `SKILL.md` as system instructions.
3. Runs each prompt on the chosen model.
4. Uses a grader model to score each assertion with evidence.
Flags:
- `--prompts <path>` custom prompts JSON
- `--model <model>` default: `claude-sonnet-4-5-20250929`
- `--grader-model <model>` default: same as `--model`
- `--provider <anthropic|openai>` default: `anthropic`
- `--concurrency <n>` default: `5`
- `--html <path>` write a self-contained HTML report
- `--save-results <path>` write full JSON result
- `--api-key <key>` explicit key override
- `--verbose` show full model responses
Config-only eval setting:
- `eval.maxToolIterations` default: `10`, safety cap for tool-aware eval loops
#### Tool-Aware Eval
When an eval prompt defines tools, skilltest runs the prompt in a mock tool
environment instead of plain text-only execution. The model can call the mocked
tools during eval, and skilltest records the calls alongside the normal grader
assertions.
Tool responses are always mocked. skilltest does not execute real tools,
scripts, shell commands, MCP servers, or APIs during eval.
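Given the `responses` map shown in the prompt-file format below, mock resolution can be sketched as an exact match on the JSON-serialized arguments with a `"*"` fallback. This is illustrative only (`mockToolResponse` is a hypothetical name), and the key-order sensitivity of `JSON.stringify` is a limitation of this sketch, not a documented skilltest behavior:

```js
// Resolve a mocked tool call: exact JSON-args key first, then "*" fallback.
function mockToolResponse(tool, args) {
  const key = JSON.stringify(args);
  if (Object.prototype.hasOwnProperty.call(tool.responses, key)) {
    return tool.responses[key];
  }
  return tool.responses["*"];
}
```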
Example prompt file:
```json
[
  {
    "prompt": "Parse this deployment checklist and tell me what is missing.",
    "assertions": ["output should mention the missing rollback plan"],
    "tools": [
      {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": [
          { "name": "path", "type": "string", "description": "File path to read", "required": true }
        ],
        "responses": {
          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
          "*": "[mock] File not found"
        }
      },
      {
        "name": "run_script",
        "description": "Execute a shell script",
        "parameters": [
          { "name": "command", "type": "string", "description": "Command to run", "required": true }
        ],
        "responses": {
          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
        }
      }
    ],
    "toolAssertions": [
      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
      {
        "type": "tool_argument_match",
        "toolName": "read_file",
        "expectedArgs": { "path": "checklist.md" },
        "description": "Model should read checklist.md specifically"
      }
    ]
  }
]
```

Run it with:

```sh
skilltest eval ./my-skill --prompts ./eval-prompts.json
```

### `skilltest check <path-to-skill>`
Runs lint + trigger + eval in one command and applies quality thresholds.
Default behavior:
1. Run lint.
2. Stop before model calls if lint has failures.
3. Run trigger and eval only when lint passes.
4. When concurrency is greater than `1`, run trigger and eval in parallel.
5. Fail the quality gate when either threshold is below target.
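The gate decision itself reduces to two threshold comparisons. A minimal sketch (`qualityGate` is a hypothetical name; the defaults mirror the documented `--min-f1` and `--min-assert-pass-rate` values):

```js
// Pass only when both metrics meet their thresholds.
function qualityGate({ f1, assertPassRate }, { minF1 = 0.8, minAssertPassRate = 0.9 } = {}) {
  return f1 >= minF1 && assertPassRate >= minAssertPassRate;
}
```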
Flags:
- `--provider <anthropic|openai>` default: `anthropic`
- `--model <model>` default: `claude-sonnet-4-5-20250929` (auto-switches to `gpt-4.1-mini` for `--provider openai` when unchanged)
- `--grader-model <model>` default: same as resolved `--model`
- `--api-key <key>` explicit key override
- `--queries <path>` custom trigger queries JSON
- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
- `--num-queries <n>` default: `20` (must be even)
- `--seed <number>` RNG seed for reproducible trigger sampling
- `--prompts <path>` custom eval prompts JSON
- `--plugin <path>` load a custom lint plugin file (repeatable)
- `--concurrency <n>` default: `5` (`1` keeps the old sequential `check` behavior)
- `--html <path>` write a self-contained HTML report
- `--min-f1 <n>` default: `0.8`
- `--min-assert-pass-rate <n>` default: `0.9`
- `--save-results <path>` save combined check result JSON
- `--continue-on-lint-fail` continue trigger/eval even if lint fails
- `--verbose` include detailed trigger/eval sections
### `skilltest improve <path-to-skill>`
Rewrites `SKILL.md`, verifies the rewrite on a frozen test set, and optionally applies it.
Default behavior:
1. Run a baseline `check` with `continue-on-lint-fail=true`.
2. Freeze the exact trigger queries and eval prompts used in that baseline run.
3. Ask the model for a structured JSON rewrite of `SKILL.md`.
4. Rebuild and validate the candidate locally:
   - must stay parseable
   - must keep the same skill `name`
   - must keep the current `license` when one already exists
   - must not introduce broken relative references
5. Verify the candidate by rerunning `check` against a copied skill directory with the frozen trigger/eval inputs.
6. Only write files when the candidate measurably improves the skill and passes the configured quality gates.
Flags:
- `--provider <anthropic|openai>` default: `anthropic`
- `--model <model>` default: `claude-sonnet-4-5-20250929` (auto-switches to `gpt-4.1-mini` for `--provider openai` when unchanged)
- `--api-key <key>` explicit key override
- `--queries <path>` custom trigger queries JSON
- `--compare <path>` path to a sibling skill directory to use as a competitor (repeatable)
- `--num-queries <n>` default: `20` (must be even when auto-generating)
- `--seed <number>` RNG seed for reproducible trigger sampling
- `--prompts <path>` custom eval prompts JSON
- `--plugin <path>` load a custom lint plugin file (repeatable)
- `--concurrency <n>` default: `5`
- `--output <path>` write the verified candidate `SKILL.md` to a separate file
- `--save-results <path>` save full improve result JSON
- `--min-f1 <n>` default: `0.8`
- `--min-assert-pass-rate <n>` default: `0.9`
- `--apply` write the verified rewrite back to the source `SKILL.md`
- `--verbose` include full baseline and verification reports
Notes:
- `improve` is dry-run by default. `--apply` only writes when parse, lint, trigger, and eval verification all pass.
- Before/after metrics are measured against the same generated or user-supplied trigger queries and eval prompts, not a fresh sample.
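The "measurably improves" condition can be sketched as a comparison of the baseline and candidate metrics on the frozen inputs. The exact acceptance rule is internal to `improve`; this sketch (hypothetical `isImprovement` helper) only illustrates the shape of such a check:

```js
// Accept a rewrite only if neither frozen metric regresses
// and at least one improves (illustrative acceptance rule).
function isImprovement(baseline, candidate) {
  const noRegression =
    candidate.f1 >= baseline.f1 &&
    candidate.assertPassRate >= baseline.assertPassRate;
  const gained =
    candidate.f1 > baseline.f1 ||
    candidate.assertPassRate > baseline.assertPassRate;
  return noRegression && gained;
}
```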
## Global Flags

- `--help` show help
- `--version` show version
- `--json` output only valid JSON to stdout
- `--no-color` disable terminal colors
## Input File Formats
Trigger queries (`--queries`):

```json
[
  {
    "query": "Please validate this deployment checklist and score it.",
    "should_trigger": true
  },
  {
    "query": "Write a SQL migration for adding an index.",
    "should_trigger": false
  }
]
```

Eval prompts (`--prompts`):
```json
[
  {
    "prompt": "Validate this markdown checklist for a production release.",
    "assertions": [
      "output should include pass/warn/fail style categorization",
      "output should provide at least one remediation recommendation"
    ]
  }
]
```

Tool-aware eval prompts (`--prompts`):
```json
[
  {
    "prompt": "Parse this deployment checklist and tell me what is missing.",
    "assertions": ["output should mention remediation steps"],
    "tools": [
      {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": [
          { "name": "path", "type": "string", "description": "File path to read", "required": true }
        ],
        "responses": {
          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
          "*": "[mock] File not found"
        }
      },
      {
        "name": "run_script",
        "description": "Execute a shell script",
        "parameters": [
          { "name": "command", "type": "string", "description": "Command to run", "required": true }
        ],
        "responses": {
          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
        }
      }
    ],
    "toolAssertions": [
      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
      {
        "type": "tool_argument_match",
        "toolName": "read_file",
        "expectedArgs": { "path": "checklist.md" },
        "description": "Model should read checklist.md specifically"
      }
    ]
  }
]
```

## Output and Exit Codes
Exit codes:
- `0`: success
- `1`: quality gate failed (`lint`, `check`, or `improve` blocked, or other command-specific failure conditions)
- `2`: runtime/config/API/parse error
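A wrapper script that consumes `--json` output could map outcomes to these codes like so. This is a sketch of the documented contract, not skilltest's own code; `exitCode` and the `outcome` shape are hypothetical:

```js
// Map a run outcome to the documented exit codes.
function exitCode(outcome) {
  if (outcome.error) return 2;       // runtime/config/API/parse error
  if (!outcome.gatePassed) return 1; // quality gate or command-specific failure
  return 0;                          // success
}
```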
JSON mode examples:
```sh
skilltest lint ./skill --json
skilltest trigger ./skill --json
skilltest eval ./skill --json
skilltest check ./skill --json
skilltest improve ./skill --json
```

HTML report examples:

```sh
skilltest lint ./skill --html ./reports/lint.html
skilltest trigger ./skill --html ./reports/trigger.html
skilltest eval ./skill --html ./reports/eval.html
skilltest check ./skill --json --html ./reports/check.html
```

Seeded trigger example:

```sh
skilltest trigger ./skill --seed 123
```

## API Keys
Anthropic:
```sh
export ANTHROPIC_API_KEY=your-key
```

OpenAI:

```sh
export OPENAI_API_KEY=your-key
```

Override at runtime:

```sh
skilltest trigger ./skill --api-key your-key
```

Current provider status:

- `anthropic`: implemented
- `openai`: implemented

OpenAI quick example:

```sh
skilltest trigger ./path/to/skill --provider openai --model gpt-4.1-mini
skilltest eval ./path/to/skill --provider openai --model gpt-4.1-mini
```

Note: if you pass `--provider openai` and keep the Anthropic default model value, skilltest automatically switches to `gpt-4.1-mini`.
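The auto-switch rule can be sketched as a small resolution step. This is illustrative of the documented behavior only; `resolveModel` is a hypothetical name, not skilltest's API:

```js
// If the provider is openai and the model is still the Anthropic default,
// switch to the OpenAI default; otherwise keep the requested model.
const ANTHROPIC_DEFAULT = "claude-sonnet-4-5-20250929";
const OPENAI_DEFAULT = "gpt-4.1-mini";

function resolveModel(provider, model = ANTHROPIC_DEFAULT) {
  if (provider === "openai" && model === ANTHROPIC_DEFAULT) return OPENAI_DEFAULT;
  return model;
}
```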
## CI/CD Integration
GitHub Actions example to lint skills on pull requests:
```yaml
name: skill-lint
on:
  pull_request:
    paths:
      - "**/SKILL.md"
      - "**/references/**"
      - "**/scripts/**"
      - "**/assets/**"
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run lint
      - run: npm run test
      - run: npm run build
      - run: npx skilltest lint path/to/skill --json
```

Optional nightly trigger/eval:

```yaml
name: skill-eval-nightly
on:
  schedule:
    - cron: "0 4 * * *"
jobs:
  trigger-eval:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run build
      - run: npx skilltest trigger path/to/skill --num-queries 20 --json
      - run: npx skilltest eval path/to/skill --prompts path/to/prompts.json --json
      - run: npx skilltest check path/to/skill --min-f1 0.8 --min-assert-pass-rate 0.9 --json
```

## Local Development
```sh
npm install
npm run lint
npm run test
npm run build
node dist/index.js --help
```

`npm test` runs the Vitest suite. The tests are offline and do not call model providers.

Manual CLI smoke tests:

```sh
node dist/index.js lint test-fixtures/sample-skill/
node dist/index.js lint test-fixtures/sample-skill/ --html lint-report.html
node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
node dist/index.js improve test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
```

## Release Checklist
```sh
npm run lint
npm run build
npm run test
npm pack --dry-run
npm publish --dry-run
```

Then publish:

```sh
npm publish
```

## Contributing
Issues and pull requests are welcome. Include:
- clear reproduction steps
- expected vs actual behavior
- a sample `SKILL.md` or fixtures when relevant
## License
MIT
