@sentry/skillet
v0.28.0
Create, evaluate, and iterate on agent skills
Skillet
Spec-driven authoring of agent skills. Define a structured spec.yaml
that captures intent, behaviors, and triggers; skillet generates
SKILL.md and eval cases from it, runs them, and iterates by patching
the spec until coverage and per-behavior results pass.
Install
npx @sentry/skillet install
This copies the skillet skill into your agent (auto-detects Claude Code, OpenCode, Pi). Your agent then knows how to use skillet when you ask it to create or improve skills.
Usage
Create a new skill from a description
npx @sentry/skillet create "Django N+1 query reviewer"
Generates spec.yaml from the description, derives SKILL.md and
eval cases, then runs the verify-driven iteration loop until
per-behavior checks pass.
Improve an existing skill
npx @sentry/skillet improve ./my-skill
If my-skill/ already has a spec.yaml, the loop iterates from
there. If it only has a legacy SKILL.md (no spec), the loop
auto-imports it first, so no separate migration step is needed.
Add a behavior
npx @sentry/skillet add-eval ./my-skill \
  "should flag N+1 queries in loops" \
  "should NOT flag single .get() calls"
Each behavior is appended to spec.yaml, and SKILL.md and the eval
files are regenerated. Internally, this is a thin wrapper over spec refine.
Edit the spec via natural language
npx @sentry/skillet spec refine \
  "tighten the N+1 rule to also cover list comprehensions" \
  ./my-skill
The LLM produces structured SpecPatch[] operations, applies them
to spec.yaml, and regenerates the derived files.
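To make the idea of structured patch operations concrete, here is a minimal sketch of applying a list of patches to an in-memory spec. The README does not document the real SpecPatch shape, so the `op` names, the `Spec` and `Behavior` interfaces, and `applyPatches` are all assumptions made for illustration, as is the `flag-missing-prefetch` behavior id.

```typescript
// Hypothetical shapes: skillet's real SpecPatch operations are not documented here.
interface Behavior {
  id: string;
  statement: string;
  rationale?: string;
}

interface Spec {
  name: string;
  intent: string;
  behaviors: Behavior[];
}

type SpecPatch =
  | { op: "add_behavior"; behavior: Behavior }
  | { op: "update_behavior"; id: string; statement: string }
  | { op: "remove_behavior"; id: string };

// Applies patches in order and returns a new spec; the input spec is not mutated.
function applyPatches(spec: Spec, patches: SpecPatch[]): Spec {
  let behaviors = spec.behaviors.slice();
  for (const patch of patches) {
    switch (patch.op) {
      case "add_behavior": {
        const added = patch.behavior;
        if (behaviors.some((b) => b.id === added.id)) {
          throw new Error(`behavior already exists: ${added.id}`);
        }
        behaviors = [...behaviors, added];
        break;
      }
      case "update_behavior": {
        const id = patch.id;
        const statement = patch.statement;
        if (!behaviors.some((b) => b.id === id)) {
          throw new Error(`unknown behavior: ${id}`);
        }
        behaviors = behaviors.map((b) => (b.id === id ? { ...b, statement } : b));
        break;
      }
      case "remove_behavior": {
        const id = patch.id;
        behaviors = behaviors.filter((b) => b.id !== id);
        break;
      }
    }
  }
  return { ...spec, behaviors };
}

const originalSpec: Spec = {
  name: "django-perf-review",
  intent: "Review Django code for performance regressions.",
  behaviors: [
    { id: "flag-n-plus-one", statement: "Flag N+1 queries in loops over querysets." },
  ],
};

const patchedSpec = applyPatches(originalSpec, [
  {
    op: "update_behavior",
    id: "flag-n-plus-one",
    statement: "Flag N+1 queries in loops and list comprehensions over querysets.",
  },
  {
    op: "add_behavior",
    behavior: { id: "flag-missing-prefetch", statement: "Flag missing prefetch_related." },
  },
]);
```

Keeping patches as data rather than free-text edits means each change can be validated (unknown ids, duplicates) before anything is written back to spec.yaml.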
Inspect the spec
npx @sentry/skillet spec show ./my-skill
Pretty-prints the spec with the banner stripped.
Verify a skill
npx @sentry/skillet verify ./my-skill
npx @sentry/skillet verify ./my-skill --semantic # also runs LLM-judged SKILL.md coverage
npx @sentry/skillet verify ./my-skill --json # structured output for CI
Verification runs four layers and short-circuits on the first failure:
- Structural — each file (spec, SKILL.md, evals) parses and has its required fields
- Cross-artifact coverage — every behavior has an eval case; no orphans
- Per-behavior results — when run data is available, every behavior has a passing case
- Semantic (opt-in) — LLM judge confirms SKILL.md actually encodes each behavior
Layers 1–3 make no LLM calls and run in under a second. verify replaces
the older validate command and adds cross-artifact awareness on top.
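The short-circuit behavior described above can be sketched as a small layer runner. This is an illustration under stated assumptions, not skillet's implementation: the `Layer`, `LayerResult`, and `VerifyReport` types and the `runLayers` function are hypothetical, and the failure reason is made up for the example.

```typescript
// Hypothetical types: skillet's verify internals are not documented in this README.
interface LayerResult {
  ok: boolean;
  reason?: string; // set when ok is false
}

interface Layer {
  label: string;
  check: () => LayerResult;
}

interface VerifyReport {
  passed: string[]; // layers that ran and passed, in order
  failed?: { layer: string; reason: string };
  skipped: string[]; // layers never run because an earlier one failed
}

// Runs layers in order and stops at the first failure.
function runLayers(layers: Layer[]): VerifyReport {
  const report: VerifyReport = { passed: [], skipped: [] };
  let index = 0;
  for (const layer of layers) {
    const result = layer.check();
    if (!result.ok) {
      report.failed = { layer: layer.label, reason: result.reason ?? "unknown" };
      report.skipped = layers.slice(index + 1).map((l) => l.label);
      return report;
    }
    report.passed.push(layer.label);
    index += 1;
  }
  return report;
}

const verifyReport = runLayers([
  { label: "structural", check: () => ({ ok: true }) },
  {
    label: "coverage",
    check: () => ({ ok: false, reason: "behavior flag-n-plus-one has no eval case" }),
  },
  { label: "per-behavior", check: () => ({ ok: true }) },
]);
```

Ordering the cheap, deterministic layers first means a malformed file or a missing eval case is reported before any slower check runs.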
Run evals once
npx @sentry/skillet eval ./my-skill
npx @sentry/skillet eval ./my-skill --json
Delegates to vitest. Runs whatever evals/*.eval.ts files exist. It does
not regenerate them; regeneration happens automatically on spec mutations.
Commands
| Command | Purpose |
|---------|---------|
| create "<description>" [--input <dir>]... | New skill: agentic spec-author loop (read-only tools over --input paths + bundled refs) + regen + improve loop |
| improve [path] | Iterate until per-behavior evals pass; auto-imports legacy |
| spec init "<description>" | Run interactive spec-author loop without the improve loop |
| spec show [path] | Pretty-print the spec (banner stripped) |
| spec refine "<feedback>" [path] | Natural-language patch; auto-regens |
| spec import [path] | Seed a spec from an existing SKILL.md, then run the spec-author loop |
| resume <path> --answer "..." | Resume a paused spec-author session (one --answer per pending question) |
| add-eval [path] "<behavior>" ... | Append behaviors to spec; auto-regens |
| verify [path] [--semantic] [--json] | Layered consistency check (subsumes validate) |
| eval [path] [--json] | Run evals once |
| install [path] | Install skillet skill into your agent |
Credentials
Skillet auto-discovers LLM credentials. No configuration needed when running inside Claude Code, Codex, GitHub Copilot, or any environment with standard API keys set.
Override with SKILLET_MODEL=provider/model-id if needed.
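As an illustration of the provider/model-id format, the sketch below splits such a value into its two parts. How skillet itself parses SKILLET_MODEL is not documented here, so `parseModelOverride`, the `ModelOverride` type, and the placeholder provider and model names are assumptions.

```typescript
// Hypothetical helper: skillet's actual handling of SKILLET_MODEL may differ.
interface ModelOverride {
  provider: string;
  modelId: string;
}

// Pass in the raw environment value, e.g. process.env.SKILLET_MODEL under Node.
function parseModelOverride(value: string | undefined): ModelOverride | undefined {
  if (value === undefined || value.trim() === "") {
    return undefined; // no override set: fall back to auto-discovery
  }
  const trimmed = value.trim();
  const slash = trimmed.indexOf("/");
  if (slash <= 0 || slash === trimmed.length - 1) {
    throw new Error(`SKILLET_MODEL must look like provider/model-id, got "${trimmed}"`);
  }
  // Split on the first slash only, so model ids that contain slashes stay intact.
  return { provider: trimmed.slice(0, slash), modelId: trimmed.slice(slash + 1) };
}

// Placeholder names, not real provider or model identifiers.
const parsedOverride = parseModelOverride("example-provider/example-model");
```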
How spec-driven authoring works
spec.yaml captures what the skill does — intent, behaviors,
must-nots, triggers — as a simple, user-readable document. SKILL.md
is derived from it (clobbered on regen; edit the spec to change rules).
evals/*.eval.ts are generated initially but durable after that —
edit them directly to refine specific test prompts or assertions.
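The generate, run, verify cycle ends when everything passes or an iteration cap is reached. A minimal sketch of that control flow follows. The `iterate` function and its callbacks are hypothetical names; the real loop is presumably asynchronous and decides for itself what to revise between attempts.

```typescript
interface IterationOutcome {
  passed: boolean;
  iterations: number; // attempts actually made
}

// runOnce: regenerate, run evals, and verify; returns true when everything passes.
// revise:  apply a fix before the next attempt (what gets revised is up to the caller).
function iterate(
  runOnce: (iteration: number) => boolean,
  revise: (iteration: number) => void,
  maxIterations: number,
): IterationOutcome {
  for (let i = 1; i <= maxIterations; i++) {
    if (runOnce(i)) {
      return { passed: true, iterations: i };
    }
    // No point revising after the final attempt; nothing would re-run.
    if (i < maxIterations) {
      revise(i);
    }
  }
  return { passed: false, iterations: maxIterations };
}

// Simulated run: the third attempt is the first to pass.
let attempts = 0;
let revisions = 0;
const outcome = iterate(
  () => {
    attempts += 1;
    return attempts >= 3;
  },
  () => {
    revisions += 1;
  },
  5,
);
```

The iteration cap guarantees termination even when a behavior can never be made to pass.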
spec.yaml ──► generate ──► SKILL.md + evals/*.eval.ts
                                   │
                                   ▼
                           run evals (vitest)
                                   │
                                   ▼
                           verify (4 layers)
                                   │
                                   ▼
                           tune SKILL.md prose
                                   │
                                   └──► loop until pass or max iterations
A spec.yaml looks like this:
managed_by: skillet
spec_version: 1
name: django-perf-review
intent: |
  Review Django code for performance regressions, focusing on N+1
  queries and queryset misuse.
triggers:
  should:
    - "review django performance"
    - "find N+1 queries"
    - "optimize django"
  should_not:
    - "review this React component"
behaviors:
  - id: flag-n-plus-one
    statement: Flag N+1 queries in loops over querysets.
    rationale: |
      Loops accessing related objects without select_related issue
      one query per iteration in production but pass tests.
must_not:
  - id: dont-flag-single-get
    statement: Don't flag single .get() calls as N+1.
    rationale: A single fetch isn't a query loop.
The spec is intent only: eval prompts, setup scripts, and assertions live in the generated eval file (see below), not here. This keeps the spec readable and lets you edit eval shapes directly without touching the source of truth.
Eval format
Eval files are TypeScript that vitest runs natively. They use the
harness-first API mirroring vitest-evals#41 —
imported through @sentry/skillet/evals so generated files don't
change when vitest-evals 0.9 ships.
import { fileURLToPath } from "node:url";
import { dirname } from "node:path";
import {
  describeEval,
  CriterionJudge,
  SubstringJudge,
  skilletHarness,
} from "@sentry/skillet/evals";

// The skill root is the parent of the evals/ directory this file lives in.
const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, "");

describeEval("django-perf-review", {
  data: [
    {
      name: "flag-n-plus-one__loop_over_books",
      tests_behavior: "flag-n-plus-one",
      input: "Review views.py for performance issues",
      expectedContains: "select_related",
      // Shell script that seeds the workspace; the heredoc body stays at column 0.
      setup: `cat > views.py <<'EOF'
for book in Book.objects.all():
    print(book.author.name)
EOF`,
    },
    {
      name: "dont-flag-single-get__single_call",
      tests_behavior: "dont-flag-single-get",
      input: "Is `User.objects.get(id=1)` an N+1?",
      criteria: "agent does not call this an N+1 issue",
    },
  ],
  harness: skilletHarness({ skill: skillRoot }),
  judges: [SubstringJudge(), CriterionJudge()],
  threshold: 0.75,
});
Each case sets up a workspace (optional setup), sends input to an
agent loaded with the skill, and grades the output with the judges.
tests_behavior links cases back to spec entries. Verification uses
this as the join key so failures land on the specific behavior they
affect, not on a free-text "something went wrong" signal.
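To show what joining on tests_behavior could look like, here is a small sketch that groups case results by behavior id and reports uncovered behaviors, behaviors with no passing case, and orphaned cases. The `CaseResult` and `BehaviorReport` shapes and the `joinByBehavior` function are assumptions for illustration, as are the `suggest-prefetch` and `old-rule` ids. Skillet's real result format is not documented here.

```typescript
// Hypothetical result shape; only tests_behavior and the case names come from the README.
interface CaseResult {
  name: string;
  tests_behavior: string;
  passed: boolean;
}

interface BehaviorReport {
  uncovered: string[]; // behaviors with no eval case at all
  failing: string[]; // behaviors that have cases, none of which passed
  orphans: string[]; // case names that point at a behavior id not in the spec
}

function joinByBehavior(behaviorIds: string[], results: CaseResult[]): BehaviorReport {
  const known = new Set(behaviorIds);
  const byBehavior = new Map<string, CaseResult[]>();
  const orphans: string[] = [];

  for (const result of results) {
    if (!known.has(result.tests_behavior)) {
      orphans.push(result.name);
      continue;
    }
    const bucket = byBehavior.get(result.tests_behavior) ?? [];
    bucket.push(result);
    byBehavior.set(result.tests_behavior, bucket);
  }

  const uncovered = behaviorIds.filter((id) => !byBehavior.has(id));
  const failing = behaviorIds.filter((id) => {
    const cases = byBehavior.get(id);
    return cases !== undefined && !cases.some((c) => c.passed);
  });
  return { uncovered, failing, orphans };
}

const behaviorReport = joinByBehavior(
  ["flag-n-plus-one", "dont-flag-single-get", "suggest-prefetch"],
  [
    { name: "flag-n-plus-one__loop_over_books", tests_behavior: "flag-n-plus-one", passed: true },
    { name: "dont-flag-single-get__single_call", tests_behavior: "dont-flag-single-get", passed: false },
    { name: "old-rule__stale", tests_behavior: "old-rule", passed: true },
  ],
);
```

Because every failure is keyed by a behavior id, a report like this points directly at the spec entry that needs attention.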
License
MIT
