skillave
v0.1.6
Skill evaluation pipeline: generate test cases, execute them, and verify results
A skill evaluation pipeline for AI agents: generate test cases, execute them, and verify results.
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ GENERATION  │ ──► │ EXECUTION   │ ──► │ VERIFICATION│
│             │     │             │     │             │
│ Create test │     │ Run tests   │     │ Grade       │
│ cases       │     │ headlessly  │     │ results     │
└─────────────┘     └─────────────┘     └─────────────┘
What is skillave?
skillave is the execution engine in a skill evaluation pipeline:
- Generation: Define test cases with prompts and expectations (compatible with skill-creator schema)
- Execution: Run tests headlessly via ACP (Agent Client Protocol), capturing all tool calls into structured traces
- Verification: Evaluate traces against expectations - programmatically or with AI-assisted grading
The core tool focuses on execution - deterministic, parallel, and traceable. Grading is delegated to downstream tools for flexibility.
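As a rough illustration of what a downstream programmatic grader could look like, the sketch below checks whether a set of expected tools was called in a trace. It assumes events shaped like the sample in the Trace Format section (`type`, `name`, `input`, `tool_call_id`); `parseTrace`, `missingTools`, and the list of expected tool names are hypothetical and not part of skillave.

```typescript
// Assumed shape of one trace event, inferred from the Trace Format sample.
interface TraceEvent {
  type: string;
  name?: string;
  input?: unknown;
  tool_call_id?: number;
}

// Parse the text of a trace.jsonl file: one JSON object per non-empty line.
function parseTrace(jsonl: string): TraceEvent[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as TraceEvent);
}

// Return the expected tool names that never appear as a tool_call event.
function missingTools(events: TraceEvent[], expected: string[]): string[] {
  const called = new Set<string>();
  for (const event of events) {
    if (event.type === "tool_call" && typeof event.name === "string") {
      called.add(event.name);
    }
  }
  return expected.filter((name) => !called.has(name));
}

// The two sample events from the Trace Format section, as JSONL text.
const sample = [
  '{"type":"tool_call","name":"Write","input":{"path":"/tmp/test.txt","content":"Hello"},"tool_call_id":0}',
  '{"type":"tool_call","name":"bash","input":{"command":"echo \'done\'"},"tool_call_id":1}',
].join("\n");

// "Read" is the only expected tool that was never called.
console.log(missingTools(parseTrace(sample), ["Write", "Read"]));
```

A real grader would read the trace text from the workspace and would likely check inputs and ordering as well; this only shows the basic shape of a pass/fail check.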
Features
- Headless Execution: Runs ACP agents in a deterministic environment without UI.
- Structured Tracing: Records all interactions (Input, Output, Tool Calls) into standard JSONL format.
- Parallel Execution: Supports running multiple evals and runs concurrently.
- Bun-powered: Built with Bun for fast startup and execution.
Prerequisites
- Node.js (for npx)
- An ACP-capable agent command available in your PATH.
Installation
Run without installation (recommended):
npx -y skillave --version
From source (for contributors):
git clone https://github.com/williamfzc/skillave.git
cd skillave
bun install
bun run build
node dist/index.js --version
Evals Schema
skillave consumes a JSON file defining the evaluations to run. Each eval requires at least an id and a prompt.
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": "test-1",
      "prompt": "Write a python script to calculate fibonacci numbers",
      "//": "Other fields (expectations, files, etc.) are ignored by skillave but may be used by downstream graders"
    }
  ]
}
Usage
Run Evals
Execute evals and generate traces:
npx -y skillave eval \
  --config skillave.json
Arguments can also be passed directly, though a config file is recommended for complex setups.
Config File (skillave.json):
{
  "workspace": "./workspace",
  "evals": "./evals.json",
  "command": "opencode acp",
  "runs": 1,
  "jobs": 4,
  "timeout": 180
}
Options:
- --config <path>: Path to config file (default: skillave.json)
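To catch typos in a config before starting a run, a small pre-flight check along these lines may help. The field names mirror the sample skillave.json above; the types, the choice to treat every field as required, and the `checkConfig` helper itself are assumptions for illustration, not skillave's own validation.

```typescript
// Field names come from the sample config; the types are assumed.
interface SkillaveConfig {
  workspace: string;
  evals: string;
  command: string;
  runs: number;
  jobs: number;
  timeout: number;
}

// Hypothetical helper: list the problems found in a parsed config object.
// It treats every field as required, which skillave itself may not do.
function checkConfig(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) {
    return ["config must be a JSON object"];
  }
  const cfg = raw as Record<string, unknown>;
  const problems: string[] = [];
  for (const key of ["workspace", "evals", "command"]) {
    const value = cfg[key];
    if (typeof value !== "string" || value === "") {
      problems.push(`${key} must be a non-empty string`);
    }
  }
  for (const key of ["runs", "jobs", "timeout"]) {
    const value = cfg[key];
    if (typeof value !== "number" || !Number.isInteger(value) || value < 1) {
      problems.push(`${key} must be a positive integer`);
    }
  }
  return problems;
}

const example: SkillaveConfig = {
  workspace: "./workspace",
  evals: "./evals.json",
  command: "opencode acp",
  runs: 1,
  jobs: 4,
  timeout: 180,
};

console.log(checkConfig(example));                 // no problems reported
console.log(checkConfig({ ...example, jobs: 0 })); // reports the jobs field
```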
Output Structure
The tool generates a workspace with the following structure:
workspace/
├── result.json          # Workspace-level summary + index for locating each run
└── <eval_id>/
    └── <run_index>/
        └── trace.jsonl  # Structured trace events (tool calls, etc.)
result.json fields
result.json contains:
- acp: Execution configuration (runner mode, command/args, timeout, concurrency).
- runs_per_eval: How many times each eval is executed (from config runs).
- total_evals: Number of eval cases in the evals file.
- total_runs: total_evals * runs_per_eval.
- results: Array of per-run summaries. Each item points to trace_path.
Each item in results[] contains:
- duration_ms, tool_call_count, prompt_count
- acp_server (when available): protocol version + server name/version
- model / token_usage (when the ACP server returns these fields)
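A downstream script could fold these fields into a quick summary. The sketch below models only a subset of the fields listed above and assumes their value types (numbers for counts and durations, a string for trace_path); `summarize` is a hypothetical helper and the sample values are made up.

```typescript
// Subset of the per-run fields; value types are assumed.
interface RunResult {
  trace_path: string;
  duration_ms: number;
  tool_call_count: number;
  prompt_count: number;
}

// Subset of the workspace-level fields; value types are assumed.
interface WorkspaceResult {
  runs_per_eval: number;
  total_evals: number;
  total_runs: number;
  results: RunResult[];
}

// Hypothetical helper: aggregate the per-run summaries.
function summarize(result: WorkspaceResult) {
  const runs = result.results;
  const totalMs = runs.reduce((sum, run) => sum + run.duration_ms, 0);
  const totalToolCalls = runs.reduce((sum, run) => sum + run.tool_call_count, 0);
  return {
    // total_runs is documented as total_evals * runs_per_eval.
    consistent: result.total_runs === result.total_evals * result.runs_per_eval,
    runsReported: runs.length,
    meanDurationMs: runs.length === 0 ? 0 : totalMs / runs.length,
    totalToolCalls,
  };
}

// Made-up sample data for illustration only.
const sampleResult: WorkspaceResult = {
  runs_per_eval: 2,
  total_evals: 1,
  total_runs: 2,
  results: [
    { trace_path: "test-1/0/trace.jsonl", duration_ms: 1000, tool_call_count: 2, prompt_count: 1 },
    { trace_path: "test-1/1/trace.jsonl", duration_ms: 3000, tool_call_count: 4, prompt_count: 1 },
  ],
};

console.log(summarize(sampleResult));
```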
How many files are generated?
Let:
- E = total_evals
- R = runs_per_eval
Then the workspace will contain:
- Exactly 1 workspace-level file: result.json
- Exactly E * R per-run files: one trace.jsonl per run, stored at <eval_id>/<run_index>/trace.jsonl
So the total number of files is: 1 + (E * R) (not counting directories).
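The arithmetic above can be turned into a completeness check for a finished workspace. In this sketch `expectedFileCount` implements the 1 + (E * R) formula directly, while `expectedTracePaths` is a hypothetical helper: the README does not say where run_index starts, so the starting index is a parameter and its default of 0 is an assumption.

```typescript
// Total files in the workspace: one result.json plus one trace.jsonl per run.
function expectedFileCount(totalEvals: number, runsPerEval: number): number {
  return 1 + totalEvals * runsPerEval;
}

// Hypothetical helper: list the trace paths relative to the workspace root.
// firstRunIndex defaults to 0, which is an assumption about skillave's layout.
function expectedTracePaths(
  evalIds: string[],
  runsPerEval: number,
  firstRunIndex = 0,
): string[] {
  const paths: string[] = [];
  for (const id of evalIds) {
    for (let i = 0; i < runsPerEval; i++) {
      paths.push(`${id}/${firstRunIndex + i}/trace.jsonl`);
    }
  }
  return paths;
}

// With E = 3 evals and R = 2 runs each: 1 + (3 * 2) = 7 files.
console.log(expectedFileCount(3, 2));
console.log(expectedTracePaths(["test-1", "test-2"], 2));
```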
Trace Format
trace.jsonl contains a sequence of events. Example:
{"type":"tool_call","name":"Write","input":{"path":"/tmp/test.txt","content":"Hello"},"tool_call_id":0}
{"type":"tool_call","name":"bash","input":{"command":"echo 'done'"},"tool_call_id":1}
Included Skill
This repository ships with a skill for AI agents: skills/skillave/
The skill teaches agents how to use skillave across three phases:
| Phase | Reference | What it covers |
|-------|-----------|----------------|
| Generation | references/generation.md | Creating test cases, evals schema, prompt design |
| Execution | references/execution.md | Running evals, config options, output format |
| Verification | references/verification.md | Grading results, tool expectations, reporting |
Agents can invoke the skillave skill when they need to run evaluations or work with skill-creator workflows.
Development
Running Tests
bun test