npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

@zeroindex-ai/eval-pack

v0.3.0

Published

Opinionated open-source eval harness for RAG and agent apps built on Claude

Readme

@zeroindex-ai/eval-pack

Opinionated, dependency-light eval harness for RAG and agent applications built on Claude. By ZeroIndex LLC.

npm version License: MIT

Status: v0.2.0. API may change with feedback. See PROJECT.md for design, decisions, and the ordered work list.

What this is

eval-pack is the generalization of the LLM-as-judge + programmatic-checks pipeline that runs inside ask-zeroindex. It gives you three things you'd otherwise build yourself for every AI app:

  1. A golden-set runner with throttle handling, per-item error isolation, and threshold gating
  2. A judge abstraction (Claude-backed in v0.1) plus built-in programmatic checks (mustMention, mustNotMention, citationCount, expectRefusal)
  3. A standalone HTML report renderer that produces one self-contained file per run — suitable for CI artifacts, browser-local viewing, or iframe embedding

What it deliberately is not: a heavyweight platform. No service to host, no proprietary file format, no SaaS subscription. One package, subpath exports, a small CLI, and a companion GitHub composite action.

Install

# minimal — no judge
pnpm add -D @zeroindex-ai/eval-pack

# with Claude judge (default)
pnpm add -D @zeroindex-ai/eval-pack @anthropic-ai/sdk

Requires Node 20+. As of v0.2.0, @anthropic-ai/sdk is an optional peerDependency. Install it alongside eval-pack if you use the Claude judge; omit it if you only run --judge none (programmatic checks only).

Quick start

// evals/eval.ts
import { runEval, mustMention, citationCount, markerCitationExtractor } from '@zeroindex-ai/eval-pack';
import { claudeJudge } from '@zeroindex-ai/eval-pack/judge-claude';

// 1. Wrap your pipeline as a `subject` function.
const subject = async (question: string) => {
  const chunks = await yourRetriever(question);
  const text = await yourLLM(question, chunks);
  return { text, retrievedRefs: chunks.map((c) => String(c.id)) };
};

// 2. Configure checks + judge + run.
const report = await runEval({
  golden: 'evals/golden.json',
  subject,
  citationExtractor: markerCitationExtractor(/\[chunk:(\d+)\]/g),
  checks: [mustMention(), citationCount({ min: 1, skipWhen: (item) => item.expect_refusal === true })],
  judge: claudeJudge({ model: 'claude-sonnet-4-6' }),
  resultsDir: 'evals/results',
});

const passed = report.results.filter((r) => r.pass).length;
console.log(`${passed}/${report.results.length} passed`);

Or run from the CLI against a JS subject file:

pnpm eval-pack run \
  --subject ./evals/subject.js \
  --golden ./evals/golden.json \
  --threshold 0.8 \
  --html-out ./evals/results/latest.html

Golden-set format

A golden set is a JSON file with a version wrapper and an array of items:

{
  "version": "1.0",
  "items": [
    {
      "id": "services-list",
      "category": "positive",
      "question": "What services do you offer?",
      "relevant_refs": ["3", "4", "5"],
      "must_mention": ["audit", "build"],
      "must_not_mention": [],
      "expect_refusal": false,
      "tags": ["services"]
    }
  ]
}

Only id, category, and question are required. category is free-form — pick the axes that matter for your app (positive, negative, adversarial, etc.).

Package exports

| Subpath | What | Key exports | | -------------------------------------- | --------------------- | ------------------------------------------------------------------------------------------------ | | @zeroindex-ai/eval-pack | Core | runEval, types, built-in checks, citation extractors, defaultPassRule, byCategory, metrics | | @zeroindex-ai/eval-pack/checks | Convenience re-export | The four built-in checks | | @zeroindex-ai/eval-pack/judge-claude | Claude judge | claudeJudge() factory, default prompt builders | | @zeroindex-ai/eval-pack/report-html | Report renderer | renderHtml(report, opts) | | eval-pack (CLI binary) | Command-line entry | eval-pack run [options] |

Built-in checks

| Check | What it asserts | | ----------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | | mustMention(opts?) | Every term in item.must_mention appears in the answer (case-insensitive by default; set caseSensitive: true for exact) | | mustNotMention(opts?) | No term in item.must_not_mention appears | | citationCount({ min, skipWhen? }) | result.citationRefs.length >= min, with optional auto-skip for refusal items | | expectRefusal({ phrases? }) | Heuristic refusal detection. Passes when item.expect_refusal === (actual refusal detected). Auto-skips items without expect_refusal set. Coarse, English-only case-insensitive substring match — can false-positive (e.g. "I don't know why…") and miss non-English refusals; override phrases to adjust. |

Custom checks are functions with the shape (item: GoldenItem, partial: PartialResult) => CheckResult. No DSL, no plugin registration — just write a function and pass it.

CLI

eval-pack run [options]

Options:
  --golden <path>          Path to golden JSON (default: evals/golden.json)
  --subject <path>         JS file exporting a 'subject' function (required)
  --judge <name>           "claude" or "none" (default: claude)
  --judge-model <id>       Override judge model (default: claude-sonnet-4-6)
  --threshold <ratio>      Pass-rate threshold 0–1
  --filter <key=value>     Filter golden items. Keys are independent
                           (category, tags) — one value per key; repeating
                           the SAME key overwrites the earlier value.
                           Tags accepts a comma-separated list — items match
                           if they have ANY of the listed tags.
                           e.g. --filter tags=smoke,core
  --limit <n>              Max items to run
  --throttle-ms <n>        Sleep between items (vendor rate-limit hedging)
  --results-dir <path>     Where to write run JSON (default: evals/results)
  --html-out <path>        Write HTML report to this path
  --quiet                  Suppress per-item stdout
  --help, -h               Show this help

Exit codes: 0 pass, 1 below threshold, 2 usage error, 3 every item errored.

--threshold and --throttle-ms fall back to EVAL_PASS_THRESHOLD and EVAL_THROTTLE_MS env vars when not set on the CLI.

TypeScript subject files require running the CLI under tsx. The published bin has a node shebang, so point tsx at the bin script directly: pnpm exec tsx node_modules/.bin/eval-pack run ... (this is exactly what the GitHub composite action does). Or pre-compile your subject to JS. Bundling tsx is a v0.2 candidate.

GitHub composite action

Drop one uses: into your workflow instead of stitching pnpm + Node + the CLI invocation yourself:

# .github/workflows/eval.yml
name: Eval
on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: '0 14 * * *'
  workflow_dispatch:

jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - uses: zeroindex-ai/eval-pack/action@v1
        with:
          golden: evals/golden.json
          subject: ./evals/subject.ts
          threshold: '0.8'
          html-out: evals/results/latest.html
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          # plus whatever env your subject's pipeline needs

The action sets up pnpm + Node, runs pnpm install --frozen-lockfile, invokes the CLI via tsx (so TS subjects work), and uploads the JSON + HTML report as workflow artifacts (always — even on failure). Inputs mirror the CLI flags; tsx and @zeroindex-ai/eval-pack must be devDependencies in your package.json.

Full action reference: action/action.yml.

HTML report

renderHtml(report, opts) returns a single self-contained HTML string — no client JS, no external assets. Safe to view via file://, embed via iframe, or ship as a CI artifact.

import { renderHtml } from '@zeroindex-ai/eval-pack/report-html';
import { writeFileSync } from 'node:fs';

const html = renderHtml(report, {
  projectName: 'my-rag-app',
  threshold: 0.8,
});
writeFileSync('eval-report.html', html);

The report shows: headline pass-rate (vs threshold if provided), per-category table, expanded failure cards (question, retrieved refs, cited refs, recall, all checks, judge appropriate/grounded/reason, full answer text), a collapsed pass list, and an errors table when items threw.

Develop

git clone https://github.com/zeroindex-ai/eval-pack
cd eval-pack
pnpm install
pnpm typecheck
pnpm lint
pnpm test
pnpm build

CI runs the same gates across Node 20, 22, and 24 on every PR and push to main.

Status & roadmap

  • v0.1 — core, Claude judge, HTML report, CLI
  • v0.1.xask-zeroindex reference port, GitHub composite action, dummy-agent example, evals.zeroindex.ai published reports
  • v0.2.0@anthropic-ai/sdk moved to optional peerDependency; tag-driven release.yml workflow with npm provenance
  • 🔭 v0.2 candidates — OpenAI / Gemini judges (adapter pattern), agent-specific primitives (tool-call assertions, trajectory grading), run-over-run diffing + regression detection

Full design + decision log + ordered work list: PROJECT.md.

License

MIT — see LICENSE.