
@effectorhq/skill-eval

v0.1.0

skill-eval

Evaluation framework for AI agent skills — measure whether capabilities actually work before they ship.

Status: Alpha · License: MIT


Why

ClawHub has 13,729 skills. 67% of them fail in practice. The ecosystem needs a way to answer: does this skill actually do what it claims?

skill-eval is a framework for writing, running, and scoring evaluations against agent skills.

What It Measures

Structural Quality (static, no execution)

| Metric | What it checks | Weight |
|--------|---------------|--------|
| frontmatter_completeness | All required YAML fields present and valid | 0.15 |
| section_coverage | Purpose, When to Use, When NOT to Use, Setup, Commands, Examples, Notes | 0.15 |
| type_declaration | Has effector.toml with [effector.interface] input/output/context | 0.10 |
| permission_alignment | Declared permissions match detected behavior (via effector-audit) | 0.10 |
| description_quality | Length, specificity, starts with verb, avoids vague language | 0.05 |
| install_completeness | At least one install method with all required fields | 0.05 |
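As a sketch, a static check like frontmatter_completeness can be a simple presence ratio. The required-field list below is an assumption for illustration; the actual list is defined by the framework, not here.

```javascript
// Hypothetical required-field list, assumed for this sketch.
const REQUIRED_FIELDS = ["name", "description", "version", "permissions"];

function frontmatterCompleteness(frontmatter) {
  // Score is the fraction of required fields that are present and non-empty.
  const present = REQUIRED_FIELDS.filter(
    (field) => frontmatter[field] != null && frontmatter[field] !== ""
  );
  return present.length / REQUIRED_FIELDS.length;
}
```

A fully populated frontmatter scores 1.0; one with only a name scores 0.25 under this four-field assumption.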

Functional Quality (requires execution sandbox)

| Metric | What it checks | Weight |
|--------|---------------|--------|
| prerequisite_resolution | All declared requires.bins and requires.env resolve | 0.10 |
| invocation_success | Skill produces parseable output for a reference input | 0.15 |
| output_type_match | Actual output matches declared output type shape | 0.10 |
| error_handling | Graceful behavior on invalid input (no crash, clear message) | 0.05 |

Composite Score

`score = Σ(metric_score × weight)`

Scale: 0.0 (broken) → 1.0 (production-ready). Thresholds:

| Grade | Range | Meaning |
|-------|-------|---------|
| A | 0.85–1.00 | Production-ready, publish to ClawHub |
| B | 0.70–0.84 | Functional, needs polish |
| C | 0.50–0.69 | Partially working, significant gaps |
| D | 0.25–0.49 | Fundamentally broken |
| F | 0.00–0.24 | Non-functional |
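The weighted sum and the grade boundaries above can be sketched directly, using the thresholds from the table:

```javascript
// score = Σ(metric_score × weight); missing metrics count as 0.
function compositeScore(metricScores, weights) {
  return Object.entries(weights).reduce(
    (sum, [metric, weight]) => sum + (metricScores[metric] ?? 0) * weight,
    0
  );
}

// Grade boundaries from the thresholds table.
function grade(score) {
  if (score >= 0.85) return "A";
  if (score >= 0.70) return "B";
  if (score >= 0.50) return "C";
  if (score >= 0.25) return "D";
  return "F";
}
```

For example, a composite score of 0.97 falls in the 0.85–1.00 band and grades as A.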

Eval File Format

Evals are YAML files describing expected behavior:

```yaml
# evals/linear.eval.yml
skill: linear
version: ">=1.0.0"

prerequisites:
  env:
    - LINEAR_API_KEY
  bins:
    - curl
    - jq

cases:
  - name: list-open-issues
    input: "What are my open Linear issues?"
    expect:
      output_type: JSON
      contains_fields: ["id", "title", "state"]
      no_error: true

  - name: create-issue
    input: "Create a Linear issue titled 'Test from skill-eval'"
    expect:
      output_type: JSON
      contains_fields: ["id", "identifier"]
      no_error: true
    teardown: "Delete the created issue"

  - name: invalid-key
    input: "List my issues"
    env_override:
      LINEAR_API_KEY: "invalid_key"
    expect:
      no_crash: true
      error_message_contains: ["unauthorized", "401", "invalid"]

scoring:
  pass_threshold: 0.70
  weights:
    invocation_success: 0.4
    output_type_match: 0.3
    error_handling: 0.2
    prerequisite_resolution: 0.1
```
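To make the `expect` semantics concrete, here is a minimal sketch of checking one case's expectations against a skill's raw output. `checkCase` is a hypothetical helper, not part of the published API, and it assumes the case shape shown above.

```javascript
// Hypothetical helper: verify one case's `expect` block against raw output.
function checkCase(rawOutput, expect) {
  const result = { pass: true, failures: [] };
  let parsed;

  if (expect.output_type === "JSON") {
    try {
      parsed = JSON.parse(rawOutput);
    } catch {
      return { pass: false, failures: ["output is not valid JSON"] };
    }
  }

  for (const field of expect.contains_fields ?? []) {
    // Accept the field on the top-level object, or on items of an array.
    const target = Array.isArray(parsed) ? parsed[0] ?? {} : parsed ?? {};
    if (!(field in target)) {
      result.pass = false;
      result.failures.push(`missing field: ${field}`);
    }
  }

  return result;
}
```

Under this sketch, the list-open-issues case passes when the output parses as JSON and each item carries id, title, and state.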

Directory Structure

```text
skill-eval/
├── src/
│   ├── runner.js          # Eval execution engine
│   ├── scorer.js          # Metric computation + grading
│   ├── reporter.js        # Output formatting (terminal, JSON, markdown)
│   └── static-analyzer.js # Structural quality checks (no execution)
├── evals/
│   ├── linear.eval.yml    # Reference eval for linear-skill
│   └── README.md          # How to write evals
├── fixtures/
│   ├── passing-skill/     # A skill that scores A
│   └── failing-skill/     # A skill that scores F (for testing the framework)
├── scripts/
│   └── run-eval.js        # CLI entry point
├── package.json
└── README.md
```

Usage

```bash
# Evaluate a single skill (structural only — no execution)
npx skill-eval ./path/to/skill --static-only

# Evaluate with execution (requires sandbox + prerequisites)
npx skill-eval ./path/to/skill --eval evals/linear.eval.yml

# Evaluate all skills in a directory
npx skill-eval ./skills/ --static-only --report markdown > report.md

# JSON output for CI
npx skill-eval ./path/to/skill --format json
```
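In CI, the JSON report can gate a build against the pass threshold. This sketch assumes the report exposes a top-level `score` field; the actual report shape may differ.

```javascript
// CI gating sketch: return a process exit code from a JSON report string.
// Assumes a top-level `score` field (an assumption, not a documented shape).
function gate(reportJson, threshold = 0.7) {
  const { score } = JSON.parse(reportJson);
  return score >= threshold ? 0 : 1; // 0 = pass, 1 = fail the build
}
```

The 0.7 default mirrors the `pass_threshold` used in the reference eval above.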

Output Example

```text
skill-eval v0.1.0 — linear-skill

Structural Quality
  ✓ frontmatter_completeness    1.00  (all fields present)
  ✓ section_coverage            1.00  (7/7 sections)
  ✓ type_declaration            1.00  (effector.toml with typed interface)
  ✓ permission_alignment        1.00  (no drift detected)
  ✓ description_quality         0.90  (good length, specific)
  ✓ install_completeness        1.00  (manual install with steps)

Functional Quality
  ✓ prerequisite_resolution     1.00  (curl, jq found; LINEAR_API_KEY set)
  ✓ invocation_success          1.00  (3/3 cases passed)
  ✓ output_type_match           1.00  (JSON output matches declaration)
  ✓ error_handling              0.80  (graceful on invalid key, no crash)

──────────────────────────────────
Score: 0.97 / 1.00  →  Grade A
──────────────────────────────────
```

Integration with effectorHQ

  • skill-lint checks syntax and structure → skill-eval checks behavior and quality
  • effector-audit checks permission drift → skill-eval uses that as one metric
  • effector-types defines the type vocabulary → skill-eval checks output against declared types
  • clawhub-analysis provides corpus baselines → skill-eval grades against ecosystem norms

Roadmap

  • [ ] v0.1.0 — Static analyzer + reference eval for linear-skill
  • [ ] v0.2.0 — Execution sandbox (Docker-based) + functional metrics
  • [ ] v0.3.0 — CI integration (skill-eval-action for GitHub Actions)
  • [ ] v0.4.0 — Batch mode for ClawHub-wide audits
  • [ ] v1.0.0 — Stable API, published to npm as @effectorhq/skill-eval

Prior Art


MIT License — effectorHQ Contributors