skills-dojo

v0.6.1

Toolkit for testing and evaluating AI agent skills

Skills Dojo

A CLI for testing and improving AI agent skills. You write a skill, write evals against it, and Dojo tells you whether the skill works.

Skills follow the Agent Skills specification.

[Image: example run output]

Install

npm install -g skills-dojo

What it tests

Selection evals answer: does the agent pick the right skill for the job?

Effectiveness evals answer: does the skill actually help the agent produce correct output?

Getting started

Selection eval

Create a skill with a SKILL.md and an eval file:

skills/
  code-review/
    SKILL.md
    evals/
      selection.yaml

skills/code-review/SKILL.md:

---
name: code-review
description: Review code changes for security, performance, and correctness issues.
---

# Code Review

Analyzes diffs and pull requests...

skills/code-review/evals/selection.yaml:

evals:
  - name: should-select-code-review
    prompt: "Review this pull request for potential security issues and suggest improvements."

  - name: should-not-select-code-review
    prompt: "Write a Python function that calculates the Fibonacci sequence."
    assert: none

When assert is omitted, Dojo expects the agent to select the skill the eval lives under. Use assert: none to test that the agent does not select any skill.

Effectiveness eval

Add an effectiveness.yaml and fixture directories:

skills/
  sql-queries/
    SKILL.md
    evals/
      effectiveness.yaml
      fixtures/
        aggregate-query/
          tests/
            schema.sql
          golden/
            notes.md

skills/sql-queries/evals/effectiveness.yaml:

evals:
  - name: aggregate-monthly-revenue
    prompt: "Write a SQL query that calculates total revenue per month from the orders table."
    criteria:
      - name: groups-by-month
        description: Uses GROUP BY with a date function to aggregate by month
        pass_threshold: 0.7
      - name: correct-columns
        description: Returns both month and revenue columns
        pass_threshold: 0.7
      - name: null-handling
        description: Handles NULL values appropriately
        pass_threshold: 0.5

The agent runs in a sandboxed temp directory with real tools (bash, read_file, write_file, list_files). Files from tests/ become the agent's working directory. An LLM judge scores the result against your criteria. Put reference material in golden/ to help the judge calibrate.
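
As a rough illustration of how pass_threshold might be applied, the sketch below assumes the judge returns one score between 0 and 1 per criterion and that a criterion passes when its score meets or exceeds its threshold. The score scale, the >= comparison, and the function name criteria_results are all assumptions made for illustration, not documented Dojo behavior.

```python
# Hypothetical model of threshold checking; not Dojo's implementation.
# Assumes judge scores are floats in [0, 1] and that ">=" counts as a pass.

def criteria_results(criteria, scores):
    """Map each criterion name to True (passed) or False (failed)."""
    results = {}
    for criterion in criteria:
        name = criterion["name"]
        # A criterion the judge did not score is treated as failed.
        score = scores.get(name, 0.0)
        results[name] = score >= criterion["pass_threshold"]
    return results


# Criteria copied from the effectiveness.yaml example above.
criteria = [
    {"name": "groups-by-month", "pass_threshold": 0.7},
    {"name": "correct-columns", "pass_threshold": 0.7},
    {"name": "null-handling", "pass_threshold": 0.5},
]

# Made-up judge scores, for illustration only.
scores = {"groups-by-month": 0.9, "correct-columns": 0.65, "null-handling": 0.5}

for name, passed in criteria_results(criteria, scores).items():
    print(name, "PASS" if passed else "FAIL")
```

With these made-up scores, correct-columns falls below its 0.7 threshold while the other two criteria pass.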

Run it

dojo run

Variants

Variants let you A/B test different skill formulations. There are two kinds, one for each eval type.

Selection variants (inline)

For selection evals, variants test different description values to see which wording helps the agent pick the right skill:

variants:
  - name: concise
    value: Write and optimize SQL queries across all major database dialects.

  - name: verbose
    value: >
      Write correct, performant SQL across all major data warehouse and database
      dialects including Snowflake, BigQuery, Databricks, PostgreSQL, MySQL, and
      SQL Server.

evals:
  - name: should-select-sql-queries
    prompt: "Write a query that finds the top 10 customers by revenue using a window function."

Each eval runs once with the current skill description, then once per variant. Results show up in a matrix so you can compare.
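
One way to picture the resulting matrix is one row per eval and one column per description tried. The sketch below is only a mental model: the "current" column label and the build_matrix helper are hypothetical, and the actual report layout may differ.

```python
# Hypothetical sketch of the run matrix; the "current" label is an assumption.

def build_matrix(eval_names, variant_names):
    """Return {eval name: [column labels]}, with the current skill first."""
    columns = ["current"] + list(variant_names)
    return {name: list(columns) for name in eval_names}


# Names taken from the selection variant example above.
matrix = build_matrix(["should-select-sql-queries"], ["concise", "verbose"])

for eval_name, columns in matrix.items():
    print(eval_name, "->", ", ".join(columns))
```

Under this model, one eval with two variants produces three runs.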

Effectiveness variants (filesystem)

For effectiveness evals, variants are full skill directories that follow the agentskills.io spec. This lets you test fundamentally different skill formulations — not just description changes, but different instructions, scripts, references, and assets.

Place variant skills in evals/variants/<name>/:

skills/
  sql-queries/
    SKILL.md
    evals/
      effectiveness.yaml
      fixtures/
        aggregate-query/
          tests/
            schema.sql
      variants/
        terse-instructions/
          SKILL.md
        verbose-instructions/
          SKILL.md
          scripts/
            validate.sh

Each variant directory is a complete agentskills.io skill. The directory name is the variant ID used to reference it in effectiveness.yaml:

evals:
  - name: aggregate-monthly-revenue
    variants: [terse-instructions, verbose-instructions]
    prompt: "Write a SQL query that calculates total revenue per month."
    criteria:
      - name: correct-aggregation
        description: Groups by month and sums revenue
        pass_threshold: 0.7

You can also define inline variants in effectiveness.yaml for quick experiments:

variants:
  - name: minimal-prompt
    value: |
      ---
      name: minimal-prompt
      description: Minimal SQL skill.
      ---

      # SQL

      Write SQL. Be concise.

If an inline variant has the same name as a filesystem variant, the filesystem version wins (with a warning).
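
The precedence rule can be sketched as a dictionary merge in which filesystem entries overwrite inline ones and each collision is recorded. The merge_variants helper, the dictionary shapes, and the warning text below are hypothetical; only the "filesystem wins, with a warning" rule comes from the description above.

```python
# Hypothetical sketch of variant precedence; not Dojo's implementation.

def merge_variants(inline, filesystem):
    """Merge variant maps keyed by name; filesystem entries take precedence.

    Returns the merged map and a list of warning messages, one per collision.
    """
    merged = dict(inline)
    warnings = []
    for name, source in filesystem.items():
        if name in merged:
            warnings.append(
                f"variant '{name}': filesystem version overrides inline definition"
            )
        merged[name] = source
    return merged, warnings


inline = {
    "minimal-prompt": "inline SKILL.md text",
    "terse-instructions": "inline SKILL.md text",
}
filesystem = {"terse-instructions": "evals/variants/terse-instructions/"}

merged, notes = merge_variants(inline, filesystem)
print(merged["terse-instructions"])
for note in notes:
    print("warning:", note)
```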

Run modes

Control which combinations run with run-mode:

| Mode | What runs |
|------|-----------|
| all (default) | Current skill + all variants |
| variants-only | Variants only, skips current |
| current-only | Current only, skips variants |

Set run-mode at the file level or per eval (the eval-level value wins):

run-mode: variants-only

evals:
  - name: compare-variants
    run-mode: all  # overrides file-level
    variants: [terse-instructions]
    prompt: "..."
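
To make the three modes and the override rule concrete, here is a minimal sketch of how the set of runs for a single eval could be derived. The plan_runs function and the "current" label are hypothetical; the mode names and the eval-over-file precedence are taken from the text above.

```python
# Hypothetical sketch of run-mode resolution; not Dojo's implementation.

def plan_runs(variants, file_mode="all", eval_mode=None):
    """List what runs for one eval; an eval-level run-mode beats the file-level one."""
    mode = eval_mode or file_mode
    if mode == "all":
        return ["current"] + list(variants)
    if mode == "variants-only":
        return list(variants)
    if mode == "current-only":
        return ["current"]
    raise ValueError(f"unknown run-mode: {mode}")


# Mirrors the YAML above: file-level variants-only, eval-level all.
runs = plan_runs(["terse-instructions"], file_mode="variants-only", eval_mode="all")
print(runs)
```

In this model the compare-variants eval still runs the current skill, because its own run-mode overrides the file-level setting.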

Filter to a single variant from the CLI:

dojo run --variant terse-instructions

Decoys

Decoys are fake skills injected alongside real ones to test whether the agent can tell the difference:

evals:
  - name: select-with-decoys
    prompt: "Review this pull request for potential security issues."
    decoys:
      - name: code-formatter
        value: Automatically format code to match style guidelines.
      - name: code-explainer
        value: Explain what a piece of code does in plain English.

CLI

dojo run [skill]              Run evals (optionally filter by skill name)
  -e, --eval <name>           Filter by eval name
  -V, --variant <name>        Run only a specific variant
  -t, --eval-type <type>      Filter: "selection", "effectiveness", or "all"
  --selection                  Run only selection evals
  --effectiveness              Run only effectiveness evals
  -m, --evaluation-model       Override evaluation model
  -j, --judge-model            Override judge model
  --model-provider             Override model provider
  -f, --fixture <name>        Filter to a specific fixture
  --judge-filter <id>         Filter to a specific judge
  -p, --parallelism <n>       Max concurrent eval runs (default: CPU cores)
  --no-parallelism            Run sequentially
  -o, --output <path>         Write combined report JSON
  -i, --inspect               Show session events (tool calls, errors)
  --keep-sandbox              Keep sandbox temp dirs after run
  -y, --yes                   Skip confirmation prompts

dojo list                     List discovered skills and evals
dojo validate                 Validate skills and eval files

Global flags:

-s, --skills-dir <dir>    Override skills directory (repeatable)
-c, --config <path>       Path to config file
-d, --cwd <dir>           Working directory

Configuration

Optional dojo.toml in your project root. Everything has sensible defaults, so you can skip this entirely until you need to customize something.

[skills]
# Default: searches skills/, .agents/skills/, .github/skills/, .claude/skills/,
#          .codex/skills/, .gemini/skills/, .openclaw/skills/, .opencode/skills/
dir = ['skills']

[model]
provider = 'anthropic'
evaluator = 'claude-sonnet-4-6'
judge = 'claude-opus-4-6'

[effectiveness]
warn_fixture_threshold = 4
confirm_fixture_threshold = 12

[reporting]
per-skill = true
consolidated = false

Providers

| Provider | Setup |
|----------|-------|
| anthropic (default) | Set ANTHROPIC_API_KEY |
| openai | Set OPENAI_API_KEY |
| copilot | GitHub Copilot SDK |
| vercel | Vercel AI SDK. Use <provider>/<model-id> model strings (e.g. openai/gpt-4o-mini) |

Eval schema reference

File-level fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| model | string | provider default | Model for the evaluator |
| timeout | number | 30 | Timeout in seconds |
| skills | "all" or string[] | "all" | Which skills to offer the agent |
| run-mode | "all", "variants-only", "current-only" | "all" | Which combinations to run |
| variants | Variant[] | -- | Variant definitions |
| evals | Eval[] | -- | Required. The eval definitions |

Eval-level fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| name | string | -- | Required. Eval identifier |
| prompt | string | -- | Required. The prompt sent to the agent |
| assert | string[], "none", "any" | [skillName] | Expected selection result |
| model | string | file-level | Override model for this eval |
| timeout | number | file-level | Override timeout |
| skills | "all" or string[] | file-level | Override available skills |
| run-mode | "all", "variants-only", "current-only" | file-level | Override run mode |
| variants | "all", string[], Variant[] | "all" | Which variants to run |
| decoys | Decoy[] | -- | Fake skills for discrimination testing |
| enabled | boolean | true | Skip this eval when false |

Assert behavior

  • omitted -- expects the skill the eval lives under
  • "none" -- agent must not load any skill
  • "any" -- agent must load something (any skill counts)
  • ["skill-a", "skill-b"] -- agent must load one of these
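
The four cases above can be modelled as a small check. The selection_passes function and its arguments are hypothetical, and one detail is an assumption rather than documented behavior: for a list assert, the sketch passes as long as at least one listed skill was loaded, even if the agent also loaded others.

```python
# Hypothetical model of assert handling; not Dojo's implementation.

def selection_passes(assertion, loaded, home_skill):
    """Check the skills an agent loaded against an eval's assert value.

    assertion  -- None (omitted), "none", "any", or a list of skill names
    loaded     -- list of skill names the agent loaded
    home_skill -- name of the skill the eval lives under
    """
    if assertion is None:
        # Omitted: expect the skill the eval lives under.
        assertion = [home_skill]
    if assertion == "none":
        return len(loaded) == 0
    if assertion == "any":
        return len(loaded) > 0
    # List form: assumed to pass when any listed skill was loaded.
    return any(name in assertion for name in loaded)


print(selection_passes(None, ["code-review"], "code-review"))
print(selection_passes("none", ["code-review"], "code-review"))
```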

Cascading

Fields cascade: eval-level beats file-level, file-level beats config, config beats defaults.
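
This lookup order behaves like a chain of dictionaries searched from most specific to least specific. The sketch below uses Python's collections.ChainMap to show the idea. The flat dictionaries and the resolve_fields helper are assumptions for illustration (the real config nests keys under TOML tables); the default values come from the file-level fields table.

```python
# Hypothetical sketch of field cascading; not Dojo's implementation.
from collections import ChainMap

# Defaults taken from the schema tables; the flat layout is an assumption.
DEFAULTS = {"timeout": 30, "skills": "all", "run-mode": "all"}


def resolve_fields(eval_level, file_level, config_level=None):
    """Resolve fields: eval-level, then file-level, then config, then defaults."""
    layers = [eval_level, file_level, config_level or {}, DEFAULTS]
    # ChainMap returns the value from the first layer that defines a key.
    return dict(ChainMap(*layers))


resolved = resolve_fields(
    eval_level={"timeout": 60},
    file_level={"timeout": 45, "run-mode": "variants-only"},
)
print(resolved["timeout"], resolved["run-mode"], resolved["skills"])
```

Here timeout comes from the eval, run-mode from the file, and skills from the defaults.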

Reports

Reports are saved per-skill after each run:

<skill-dir>/evals/reports/<run-id>/report.json
<skill-dir>/evals/reports/<run-id>/effectiveness-report.json
<skill-dir>/evals/reports/<run-id>/logs.json

Documentation

Full docs at skillsdojo.dev.