@skilljack/evals
v1.2.1
CLI for evaluating AI agent skill discoverability, adherence, and output quality. Runs as standalone CLI or GitHub Action.
skilljack-evals
CLI for evaluating AI agent skills across multiple agent frameworks. Tests how well agents discover, load, and execute Agent Skills — measuring discoverability, instruction adherence, and output quality.
Supports the Claude Agent SDK, Vercel AI SDK, and OpenAI Agents SDK. Runs standalone or as a GitHub Action.
What are Agent Skills?
Agent Skills are a lightweight, open-source format for extending AI agent capabilities. Each skill is a folder containing a SKILL.md file with metadata and instructions that agents can discover and use. Learn more at agentskills.io.
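A minimal SKILL.md looks roughly like this (the frontmatter fields shown are the core ones; the body content and marker are illustrative, borrowed from this repo's greeting example):

```markdown
---
name: greeting
description: Greets the user warmly when they ask for a greeting
---

# Greeting

When the user asks to be greeted, respond with a friendly,
personalized hello and end with the marker GREETING_SUCCESS.
```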
Requirements
- Node.js >= 20.0.0
- API key for your chosen runner (see API Keys below)
Installation
npm install
npm run build
Quick Start
# Run the example greeting evaluation
skilljack-evals run evals/example-greeting/tasks.yaml --verbose
# Deterministic scoring only (no LLM judge, free)
skilljack-evals run evals/example-greeting/tasks.yaml --no-judge
# Validate a task file without running
skilljack-evals validate evals/example-greeting/tasks.yaml
Building Skills with Evals
Start by writing eval tasks that describe the outcomes you want, then build your skill to pass them. This eval-first approach works like TDD for agent skills:
1. Decide if a skill is the right tool — Skills are for capabilities that should only activate on demand. For instructions that always apply, use CLAUDE.md or AGENTS.md. For validation and formatting, consider static analysis, pre-commit hooks, or agent hooks instead.
2. Define desired outcomes — Write eval tasks with the prompts users will say, the markers your skill should output, and a checklist of what "good" looks like.
3. Add false-positive tests — Include prompts that are similar but should not trigger the skill. These catch over-eager activation and are just as important as positive tests.
4. Create a minimal SKILL.md — Start with basic instructions and metadata.
5. Run evals and iterate — Use skilljack-evals run to see where the skill falls short. Deterministic checks (--no-judge) are free and fast for rapid iteration. Add the LLM judge when you're ready to evaluate output quality.
6. Keep the eval suite — As you update the skill, run evals as a regression check. Add them to CI with the GitHub Action to catch regressions automatically.
# Scaffold eval tasks for a new skill
skilljack-evals create-eval my-skill -o evals/my-skill/tasks.yaml
# Fast iteration loop (deterministic only, no API cost for judging)
skilljack-evals run evals/my-skill/tasks.yaml --no-judge --verbose
# Full evaluation with LLM judge
skilljack-evals run evals/my-skill/tasks.yaml --verbose
This workflow ensures your skill is discoverable from the right prompts, doesn't activate when it shouldn't, and produces the output quality you expect.
Multi-Runner Support
Three runners are available, selected via the --runner CLI flag:
| Runner | Flag | Model Format | Example |
|--------|------|-------------|---------|
| Claude Agent SDK (default) | --runner claude-sdk | Model aliases | sonnet, haiku |
| Vercel AI SDK | --runner vercel-ai | provider:model | anthropic:claude-sonnet-4-6, google:gemini-2.5-pro, openai:gpt-5.2, openrouter:deepseek/deepseek-v3.2 |
| OpenAI Agents SDK | --runner openai-agents | Plain model name | gpt-5.2 |
# Claude SDK (default)
skilljack-evals run evals/example-greeting/tasks.yaml --model sonnet
# Vercel AI SDK with different providers
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "anthropic:claude-sonnet-4-6"
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "google:gemini-2.5-pro"
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "openai:gpt-5.2"
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "openrouter:deepseek/deepseek-v3.2"
# OpenRouter — tested models
# openrouter:deepseek/deepseek-v3.2
# openrouter:minimax/minimax-m2.5
# openrouter:moonshotai/kimi-k2.5
# openrouter:z-ai/glm-5
# openrouter:openai/gpt-oss-120b
# OpenAI Agents SDK
skilljack-evals run evals/example-greeting/tasks.yaml --runner openai-agents --model "gpt-5.2"
The Vercel AI SDK and OpenAI Agents SDK runners require their respective peer dependencies:
# Vercel AI SDK
npm install ai zod @ai-sdk/openai @ai-sdk/anthropic @ai-sdk/google @openrouter/ai-sdk-provider
# OpenAI Agents SDK
npm install @openai/agents openai
Skill Support by SDK
Each runner uses the SDK's native mechanism for skill discovery and loading:
- Claude Agent SDK — Skills via .claude/skills/ and the Skill tool. See Claude Code Skills and Agent Skills format.
- Vercel AI SDK — Skills via a loadSkill tool defined in the runner, following the Agent Skills cookbook guide.
- OpenAI Agents SDK — Skills via shellTool() with local skill bundles. See Skills in OpenAI API and the Skills cookbook.
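At the file-system level, skill discovery in every runner reduces to reading each skill folder's SKILL.md and parsing its frontmatter. A simplified sketch of that (the types, function names, and frontmatter parsing here are illustrative, not the package's actual internals):

```typescript
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Each skill is a folder containing a SKILL.md: YAML frontmatter
// (name, description) followed by markdown instructions.
interface SkillInfo {
  name: string;
  description: string;
  instructions: string;
}

function loadSkill(skillDir: string): SkillInfo {
  const raw = readFileSync(join(skillDir, "SKILL.md"), "utf8");
  // Split "---\n<frontmatter>\n---\n<body>" (naive line-based YAML parse)
  const match = raw.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!match) throw new Error(`No frontmatter in ${skillDir}/SKILL.md`);
  const meta: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return {
    name: meta.name ?? "",
    description: meta.description ?? "",
    instructions: match[2].trim(),
  };
}

// Discover every skill under a skills directory, e.g. .claude/skills/
function discoverSkills(skillsRoot: string): SkillInfo[] {
  return readdirSync(skillsRoot, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => loadSkill(join(skillsRoot, entry.name)));
}
```

The name and description from the frontmatter are what the agent sees when deciding whether a skill matches the prompt, which is why discoverability evals focus on those fields.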
Configuration
API Keys
Set the appropriate API key in your environment or a .env file (see .env.example):
| Runner | Required Key |
|--------|-------------|
| Claude SDK | ANTHROPIC_API_KEY |
| Vercel AI (anthropic:) | ANTHROPIC_API_KEY |
| Vercel AI (openai:) | OPENAI_API_KEY |
| Vercel AI (google:) | GOOGLE_GENERATIVE_AI_API_KEY |
| Vercel AI (openrouter:) | OPENROUTER_API_KEY |
| OpenAI Agents | OPENAI_API_KEY |
Bedrock
Set these environment variables — the Agent SDK handles the rest:
CLAUDE_CODE_USE_BEDROCK=1
AWS_REGION=us-west-2
AWS_PROFILE=your-profile
Config File
Create an eval.config.yaml in your project root (all fields optional):
models:
agent: sonnet # EVAL_AGENT_MODEL
judge: haiku # EVAL_JUDGE_MODEL
scoring:
weights:
discovery: 0.3
adherence: 0.4
output: 0.3
thresholds:
discovery_rate: 0.8 # EVAL_DISCOVERY_THRESHOLD
avg_score: 4.0 # EVAL_SCORE_THRESHOLD
runner:
timeout_ms: 300000 # EVAL_TASK_TIMEOUT_MS
allowed_write_dirs:
- ./results/
- ./fixtures/
output:
dir: ./results # EVAL_OUTPUT_DIR
judge_truncation: 5000
report_truncation: 2000
ci:
exit_on_failure: true
github_summary: false
Precedence (lowest to highest): YAML defaults → eval.config.yaml → environment variables (EVAL_*) → CLI flags.
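That precedence can be pictured as a layered merge where later layers override earlier ones (a sketch, not the library's actual loader; keys and values are illustrative):

```typescript
// Later layers win: defaults < eval.config.yaml < env vars < CLI flags
type Layer = Record<string, unknown>;

function resolveConfig(...layers: Layer[]): Layer {
  return layers.reduce((acc, layer) => {
    for (const [key, value] of Object.entries(layer)) {
      if (value !== undefined) acc[key] = value; // unset keys fall through
    }
    return acc;
  }, {} as Layer);
}

const resolved = resolveConfig(
  { judge: "haiku", timeout_ms: 300000 }, // built-in defaults
  { judge: "sonnet" },                    // eval.config.yaml
  { timeout_ms: 600000 },                 // EVAL_TASK_TIMEOUT_MS
  { judge: "haiku" },                     // --judge-model CLI flag
);
// resolved: { judge: "haiku", timeout_ms: 600000 }
```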
CLI Commands
run — Full evaluation pipeline
Runs the agent against tasks, scores results, and generates reports.
skilljack-evals run evals/greeting/tasks.yaml \
--runner vercel-ai --model "google:gemini-2.5-pro" \
--judge-model haiku \
--timeout 300000 \
--tasks gr-001,gr-002 \
--threshold-discovery 0.8 --threshold-score 4.0 \
--output-dir ./results \
--github-summary --verbose
score — Score existing results
skilljack-evals score results.json --judge-model haiku
report — Generate reports from scored results
skilljack-evals report -r results.json -o report.md --json report.json
validate — Check YAML syntax
skilljack-evals validate evals/greeting/tasks.yaml
create-eval — Generate task template
skilljack-evals create-eval greeting -o evals/greeting/tasks.yaml -n 10
parse — Parse YAML to JSON
skilljack-evals parse evals/greeting/tasks.yaml
Architecture
YAML tasks → Config → Runner (Claude SDK / Vercel AI / OpenAI Agents) → Scorer (deterministic + LLM judge) → Report
Pipeline
- Parse — Load and validate task definitions from YAML
- Setup — Copy skills to .claude/skills/ in the working directory
- Run — Execute agent against each task via the selected runner
- Score — Deterministic checks (free, fast) then optional LLM judge
- Report — Generate markdown + JSON reports, check pass/fail thresholds
- Cleanup — Remove copied skills
Scoring
Two scoring methods that can run independently or together:
Deterministic (free, fast):
- Checks tool calls for skill activation
- Searches output for expected marker strings
- Validates expected/forbidden tool usage
- Binary pass/fail
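In spirit, the deterministic pass is just string and set checks over the run transcript. A simplified sketch (the types, field names, and tool name here are illustrative assumptions, not the package's actual API):

```typescript
// Simplified deterministic scorer: binary pass/fail over a run transcript.
interface RunResult {
  toolCalls: string[]; // names of tools the agent invoked
  output: string;      // final agent output text
}

interface DeterministicChecks {
  expectSkillActivation?: boolean; // should the skill-loading tool be called?
  expectMarker?: string;           // marker string the output must contain
  expectNoToolCalls?: string[];    // tools that must NOT appear
}

function scoreDeterministicSketch(
  checks: DeterministicChecks,
  result: RunResult,
  skillToolName = "Skill", // assumed name of the skill-loading tool
): boolean {
  if (checks.expectSkillActivation !== undefined) {
    const activated = result.toolCalls.includes(skillToolName);
    if (activated !== checks.expectSkillActivation) return false;
  }
  if (checks.expectMarker && !result.output.includes(checks.expectMarker)) {
    return false;
  }
  if (checks.expectNoToolCalls?.some((t) => result.toolCalls.includes(t))) {
    return false;
  }
  return true; // all configured checks passed
}
```

Because these checks need no model calls, they are what makes the --no-judge iteration loop free.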
LLM Judge (richer, ~$0.20/run with default settings):
- Discovery (0 or 1) — Did the agent load the expected skill?
- Adherence (1-5) — How well did the agent follow skill instructions?
- Output Quality (1-5) — Does the output meet task requirements?
- Failure categorization
Combined score: w_d * discovery + w_a * ((adherence-1)/4) + w_o * ((outputQuality-1)/4)
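The combined-score formula above can be sketched as a small function (the weights and ratings below are illustrative; discovery is binary, while the two 1-5 ratings are rescaled to 0-1 before weighting):

```typescript
interface Weights { discovery: number; adherence: number; output: number }

// w_d * discovery + w_a * ((adherence - 1) / 4) + w_o * ((output - 1) / 4)
function combinedScore(
  w: Weights,
  discovery: 0 | 1,  // did the agent load the expected skill?
  adherence: number, // 1-5 judge rating
  output: number,    // 1-5 judge rating
): number {
  return (
    w.discovery * discovery +
    w.adherence * ((adherence - 1) / 4) +
    w.output * ((output - 1) / 4)
  );
}

// With the default weights, a loaded skill rated 4/5 adherence and
// 5/5 output quality scores ≈ 0.9 on the combined 0-1 scale.
const score = combinedScore(
  { discovery: 0.3, adherence: 0.4, output: 0.3 }, 1, 4, 5,
);
```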
Failure Categories
| Category | Meaning |
|----------|---------|
| discovery_failure | Agent didn't load the skill |
| false_positive | Agent loaded a skill it shouldn't have |
| instruction_ambiguity | Agent misinterpreted instructions |
| missing_guidance | Skill didn't cover the needed case |
| agent_error | Agent made a mistake despite guidance |
| none | No failure |
Task File Format
skill: greeting
version: "1.0"
defaults:
expected_skill_load: greeting
criteria:
discovery: { weight: 0.3 }
adherence: { weight: 0.4 }
output: { weight: 0.3 }
tasks:
- id: gr-001
prompt: "Hello! Please greet me using the greeting skill."
# Deterministic checks (optional, free)
deterministic:
expect_skill_activation: true
expect_marker: "GREETING_SUCCESS"
expect_tool_calls: []
expect_no_tool_calls: []
# LLM judge criteria (optional, costs API calls)
criteria:
discovery: { weight: 0.3, description: "Should load greeting skill" }
adherence: { weight: 0.4, description: "Should follow skill format" }
output: { weight: 0.3, description: "Greeting is friendly" }
golden_checklist:
- "Loaded the greeting skill"
- "Friendly tone"
# False positive test — skill should NOT activate
- id: gr-fp-001
prompt: "What are best practices for email greetings?"
expected_skill_load: none
deterministic:
expect_skill_activation: false
Both deterministic and criteria blocks are optional. If both are present, the scorer runs both and merges results.
GitHub Action
- uses: olaservo/skilljack-evals@v1
with:
tasks: evals/commit/tasks.yaml
threshold-discovery: '0.8'
threshold-score: '4.0'
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Inputs
| Input | Required | Default | Description |
|-------|----------|---------|-------------|
| tasks | Yes | — | Path to tasks YAML file |
| runner | No | claude-sdk | Runner type: claude-sdk, vercel-ai, openai-agents |
| model | No | sonnet | Agent model |
| judge-model | No | haiku | LLM judge model |
| config | No | — | Path to eval.config.yaml |
| threshold-discovery | No | 0.8 | Minimum discovery rate (0-1) |
| threshold-score | No | 4.0 | Minimum average score (1-5) |
| timeout | No | 300000 | Per-task timeout (ms) |
| tasks-filter | No | — | Comma-separated task IDs |
| skills-dir | No | — | Path to skills directory |
| no-judge | No | false | Skip LLM judge |
| no-deterministic | No | false | Skip deterministic scoring |
Outputs
| Output | Description |
|--------|-------------|
| passed | Whether all thresholds were met |
| discovery-rate | Discovery rate achieved (0-1) |
| avg-score | Average weighted score |
| report-path | Path to markdown report |
| json-path | Path to JSON report |
The action writes a condensed summary to $GITHUB_STEP_SUMMARY and exits with code 1 if thresholds are not met.
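The outputs can feed later workflow steps via the standard steps context; for example (the step id `evals` is an assumption of this sketch, and `always()` keeps the reporting step running even when thresholds fail the job):

```yaml
- uses: olaservo/skilljack-evals@v1
  id: evals
  with:
    tasks: evals/commit/tasks.yaml
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
- name: Show eval results
  if: always()
  run: |
    echo "passed=${{ steps.evals.outputs.passed }}"
    echo "discovery-rate=${{ steps.evals.outputs.discovery-rate }}"
    echo "avg-score=${{ steps.evals.outputs.avg-score }}"
```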
Library Usage
import {
parseEvalFile,
SkillJudge,
generateReport,
runPipeline,
scoreDeterministic,
loadConfig,
} from '@skilljack/evals';
// Full pipeline
const result = await runPipeline({
tasksFile: 'evals/greeting/tasks.yaml',
configOverrides: { defaultAgentModel: 'sonnet' },
verbose: true,
});
// Or individual steps
const evaluation = await parseEvalFile('path/to/tasks.yaml');
const judge = new SkillJudge({ model: 'haiku' });
const score = await judge.judgeResult(task, result);
const detScore = scoreDeterministic(task, result);
const report = generateReport(evaluation, results, scores);
Development
npm run dev # Run CLI in dev mode (tsx)
npm run build # Compile TypeScript
npm run typecheck # Type check without emitting
npm run start # Run compiled CLI