skillave

v0.1.6

Published

9 days ago

Skill evaluation pipeline: generate test cases, execute them, and verify results

Vulnerabilities: 0 High, 0 Medium, 0 Low

williamfzc

skill evaluation ACP agent testing skill-creator

skillave

A skill evaluation pipeline for AI agents: generate test cases, execute them, and verify results.

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  GENERATION │ ──► │  EXECUTION  │ ──► │ VERIFICATION│
│             │     │             │     │             │
│ Create test │     │ Run tests   │     │ Grade       │
│ cases       │     │ headlessly  │     │ results     │
└─────────────┘     └─────────────┘     └─────────────┘

What is skillave?

skillave is the execution engine in a skill evaluation pipeline:

Generation: Define test cases with prompts and expectations (compatible with skill-creator schema)
Execution: Run tests headlessly via ACP (Agent Client Protocol), capturing all tool calls into structured traces
Verification: Evaluate traces against expectations, either programmatically or with AI-assisted grading

The core tool focuses on execution that is deterministic, parallel, and traceable. Grading is delegated to downstream tools for flexibility.

Features

Headless Execution: Runs ACP agents in a deterministic environment without UI.
Structured Tracing: Records all interactions (Input, Output, Tool Calls) into standard JSONL format.
Parallel Execution: Supports running multiple evals and runs concurrently.
Bun-powered: Built with Bun for fast startup and execution.

Prerequisites

Node.js (for npx)
An ACP-capable agent command available in your PATH.

Installation

Run without installation (recommended):

npx -y skillave --version

From source (for contributors):

git clone https://github.com/williamfzc/skillave.git
cd skillave
bun install
bun run build
node dist/index.js --version

Evals Schema

skillave consumes a JSON file that defines the evaluations to run. Each eval requires at least an id and a prompt.

{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": "test-1",
      "prompt": "Write a python script to calculate fibonacci numbers",
      "//": "Other fields (expectations, files, etc.) are ignored by skillave but may be used by downstream graders"
    }
  ]
}
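
Since only id and prompt are required, an evals file can be sanity-checked before a run. The following is a minimal sketch, not part of skillave: validate_evals is a hypothetical helper, and both the "non-empty string" rule and the duplicate-id check are assumptions (ids name the output directories, so unique ids seem prudent).

```python
import json


def validate_evals(doc: dict) -> list[str]:
    """Return a list of problems found in a parsed evals document.

    Hypothetical helper; skillave itself may validate differently.
    """
    evals = doc.get("evals")
    if not isinstance(evals, list):
        return ["'evals' must be a list"]

    problems = []
    seen_ids = set()
    for i, case in enumerate(evals):
        if not isinstance(case, dict):
            problems.append(f"evals[{i}] must be an object")
            continue
        # The two fields the README names as the minimum.
        for field in ("id", "prompt"):
            value = case.get(field)
            if not isinstance(value, str) or not value:
                problems.append(f"evals[{i}] is missing a non-empty string '{field}'")
        # Assumption: ids should be unique because they name output directories.
        case_id = case.get("id")
        if isinstance(case_id, str) and case_id:
            if case_id in seen_ids:
                problems.append(f"evals[{i}] repeats id '{case_id}'")
            seen_ids.add(case_id)
    return problems


example = json.loads(
    '{"skill_name": "example-skill",'
    ' "evals": [{"id": "test-1", "prompt": "Write a python script"}]}'
)
print(validate_evals(example))
```

Extra fields such as expectations or files pass through untouched, matching the note that skillave ignores them.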

Usage

Run Evals

Execute evals and generate traces:

npx -y skillave eval \
  --config skillave.json

Arguments can also be passed directly on the command line, but a config file is recommended for complex setups.

Config File (skillave.json):

{
  "workspace": "./workspace",
  "evals": "./evals.json",
  "command": "opencode acp",
  "runs": 1,
  "jobs": 4,
  "timeout": 180
}

Options:

--config <path>: Path to config file (default: skillave.json)
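
When skillave is driven from a script (for example in CI), the same command can be assembled programmatically. This is a rough sketch that uses only the eval subcommand and the --config option shown above; skillave_command and run_skillave are hypothetical names, and the meaning of the process exit code is not documented here, so it is returned as-is rather than interpreted.

```python
import shlex
import subprocess


def skillave_command(config_path: str = "skillave.json") -> list[str]:
    """Build the argv list for `npx -y skillave eval --config <path>`."""
    return ["npx", "-y", "skillave", "eval", "--config", config_path]


def run_skillave(config_path: str = "skillave.json") -> int:
    """Run skillave and return its exit code (semantics are an assumption)."""
    completed = subprocess.run(skillave_command(config_path), check=False)
    return completed.returncode


# Show the command that would be executed, without running it.
print(shlex.join(skillave_command()))
```

Passing the arguments as a list avoids shell quoting problems if the config path contains spaces.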

Output Structure

The tool generates a workspace with the following structure:

workspace/
  ├── result.json          # Workspace-level summary + index for locating each run
  └── <eval_id>/
      └── <run_index>/
          └── trace.jsonl       # Structured trace events (tool calls, etc.)

result.json fields

result.json contains:

acp: Execution configuration (runner mode, command/args, timeout, concurrency).
runs_per_eval: How many times each eval is executed (from config runs).
total_evals: Number of eval cases in the evals file.
total_runs: total_evals * runs_per_eval.
results: Array of per-run summaries. Each item points to trace_path.

Each item in results[] contains:

duration_ms, tool_call_count, prompt_count
acp_server (when available): protocol version + server name/version
model / token_usage (when the ACP server returns these fields)
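
A downstream script can use these fields for a quick health check. The sketch below relies only on the field names listed above; summarize is a hypothetical helper, the sample values are invented for illustration, and optional fields are treated as possibly absent.

```python
def summarize(result: dict) -> dict:
    """Aggregate a parsed result.json into a few headline numbers."""
    runs = result.get("results", [])
    durations = [r["duration_ms"] for r in runs if "duration_ms" in r]
    tool_calls = [r["tool_call_count"] for r in runs if "tool_call_count" in r]
    expected_runs = result.get("total_evals", 0) * result.get("runs_per_eval", 0)
    return {
        "runs_recorded": len(runs),
        # total_runs is documented as total_evals * runs_per_eval.
        "counts_consistent": result.get("total_runs") == expected_runs,
        "total_duration_ms": sum(durations),
        "mean_tool_calls": sum(tool_calls) / len(tool_calls) if tool_calls else 0.0,
    }


# Invented sample data shaped like the documented fields.
sample_result = {
    "runs_per_eval": 1,
    "total_evals": 2,
    "total_runs": 2,
    "results": [
        {"duration_ms": 1000, "tool_call_count": 2, "prompt_count": 1},
        {"duration_ms": 3000, "tool_call_count": 4, "prompt_count": 1},
    ],
}
print(summarize(sample_result))
```

For a real run, the input would come from parsing workspace/result.json with json.load.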

How many files are generated?

Let:

E = total_evals
R = runs_per_eval

Then the workspace will contain:

Exactly 1 workspace-level file: result.json
Exactly E * R per-run files: one trace.jsonl per run, stored at <eval_id>/<run_index>/trace.jsonl

So the total number of files is: 1 + (E * R) (not counting directories).
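
As a worked example, E = 3 and R = 2 gives 1 + (3 * 2) = 7 files. The sketch below enumerates the expected paths so a script can check that a run completed. expected_files is a hypothetical helper, and the starting value of run_index is not stated above, so it is exposed as a parameter with an assumed default of 0.

```python
from pathlib import PurePosixPath


def expected_files(eval_ids, runs_per_eval, first_run_index=0):
    """List the files a finished workspace should contain, relative to its root.

    Assumption: run indices are consecutive integers starting at first_run_index.
    """
    files = [PurePosixPath("result.json")]  # the single workspace-level file
    for eval_id in eval_ids:
        for offset in range(runs_per_eval):
            run_index = str(first_run_index + offset)
            files.append(PurePosixPath(eval_id) / run_index / "trace.jsonl")
    return files


paths = expected_files(["test-1", "test-2", "test-3"], runs_per_eval=2)
print(len(paths))
```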

Trace Format

trace.jsonl contains a sequence of events. Example:

{"type":"tool_call","name":"Write","input":{"path":"/tmp/test.txt","content":"Hello"},"tool_call_id":0}
{"type":"tool_call","name":"bash","input":{"command":"echo 'done'"},"tool_call_id":1}
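
Because each line is a standalone JSON object, a trace can be read line by line for programmatic verification. The following is a minimal sketch, assuming only the tool_call event shape shown in the example; tool_calls and tool_call_counts are hypothetical helpers, and any other event types are simply skipped.

```python
import json
from collections import Counter


def tool_calls(trace_text: str) -> list[dict]:
    """Parse JSONL text and keep only events whose type is 'tool_call'."""
    events = [json.loads(line) for line in trace_text.splitlines() if line.strip()]
    return [event for event in events if event.get("type") == "tool_call"]


def tool_call_counts(trace_text: str) -> Counter:
    """Count how many times each tool name appears in a trace."""
    return Counter(event.get("name", "<unknown>") for event in tool_calls(trace_text))


# The two example events from above.
sample_trace = (
    '{"type":"tool_call","name":"Write","input":{"path":"/tmp/test.txt","content":"Hello"},"tool_call_id":0}\n'
    '{"type":"tool_call","name":"bash","input":{"command":"echo \'done\'"},"tool_call_id":1}\n'
)
print(tool_call_counts(sample_trace))
```

A grader could then assert, for instance, that tool_call_counts(...)["Write"] is at least 1 for an eval that is expected to create a file.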

Included Skill

This repository ships with a skill for AI agents: skills/skillave/

The skill teaches agents how to use skillave across three phases:

| Phase | Reference | What it covers |
|-------|-----------|----------------|
| Generation | references/generation.md | Creating test cases, evals schema, prompt design |
| Execution | references/execution.md | Running evals, config options, output format |
| Verification | references/verification.md | Grading results, tool expectations, reporting |

Agents can invoke the skillave skill when they need to run evaluations or work with skill-creator workflows.

Development

Running Tests

bun test
