
agent-eval-kit

v0.1.0

TypeScript-native eval framework for AI agent workflows. Record-replay, deterministic + LLM graders, trajectory evaluation.

agent-eval-kit

TypeScript-native eval framework for AI agent workflows. Record once, replay forever, grade instantly.

Documentation · GitHub


Testing AI agents is expensive, slow, and non-deterministic. agent-eval-kit fixes this with a record-replay workflow:

  1. Record — capture live agent responses as fixtures (one-time API cost)
  2. Replay — grade recorded outputs instantly at zero cost
  3. Gate — enforce pass rates, cost budgets, and latency limits in CI
  4. Compare — diff two runs to catch regressions
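The record-replay loop above can be sketched in a few lines. This is an illustrative stand-in, not the library's actual internals; `withFixtures`, `Target`, and the in-memory fixture map are hypothetical names for the sake of the sketch:

```typescript
// Illustrative sketch of record-replay; not agent-eval-kit's real internals.
type Target = (input: string) => Promise<string>;

function withFixtures(
  target: Target,
  fixtures: Map<string, string>,
  mode: "record" | "replay",
): Target {
  return async (input) => {
    if (mode === "replay") {
      const cached = fixtures.get(input);
      if (cached === undefined) throw new Error(`No fixture for input: ${input}`);
      return cached; // zero API cost: answer comes from the recorded fixture
    }
    const output = await target(input); // live call, one-time API cost
    fixtures.set(input, output);        // persist for future replay runs
    return output;
  };
}
```

Once fixtures exist, grading can run against them any number of times without touching the live agent.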

Quick Start

npm install agent-eval-kit

Requires Node.js 20+. Generate a starter config with agent-eval-kit init, or write one manually:

// eval.config.ts
import { defineConfig, contains, latency } from "agent-eval-kit";

export default defineConfig({
  suites: [
    {
      name: "basic-qa",
      target: async (input) => {
        const response = await myAgent(input.prompt);
        return { text: response.text, latencyMs: response.duration };
      },
      cases: [
        {
          id: "capital-france",
          input: { prompt: "What is the capital of France?" },
          expected: { text: "Paris" },
        },
      ],
      defaultGraders: [
        { grader: contains("Paris"), required: true },
        { grader: latency(5000) },
      ],
      gates: { passRate: 0.95 },
    },
  ],
});
agent-eval-kit record --suite basic-qa   # record fixtures (live API calls)
agent-eval-kit run --mode replay         # replay recorded fixtures at $0 API cost
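A pass-rate gate like the `gates: { passRate: 0.95 }` entry in the config above comes down to simple arithmetic. This standalone sketch uses hypothetical names (`checkPassRateGate`, `CaseResult`) and is not the library's code; it only shows the check that would drive a non-zero CI exit:

```typescript
// Hypothetical pass-rate gate check, mirroring gates: { passRate: 0.95 }.
interface CaseResult {
  id: string;
  passed: boolean;
}

function checkPassRateGate(
  results: CaseResult[],
  threshold: number,
): { rate: number; ok: boolean } {
  const passed = results.filter((r) => r.passed).length;
  const rate = results.length === 0 ? 0 : passed / results.length;
  return { rate, ok: rate >= threshold }; // a failing gate maps to process.exit(1) in CI
}
```

With 19 of 20 cases passing, the rate is 0.95 and clears a 0.95 gate; 18 of 20 (0.90) would fail it.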

Features

  • 20 built-in graders — text (contains, regex, exactMatch), tool calls (toolSequence, toolArgsMatch), metrics (latency, cost, tokenCount), safety (safetyKeywords, noHallucinatedNumbers), structured output (jsonSchema), and LLM-as-judge (llmRubric, factuality, llmClassify)
  • Grader composition — combine with all(), any(), not()
  • 3 execution modes — live (real calls), replay (cached fixtures), judge-only (re-grade with new graders, no re-run)
  • Quality gates — enforce pass rate, max cost, and p95 latency thresholds; non-zero exit on failure
  • Run comparison — diff any two runs to surface regressions and improvements
  • Multi-trial runs — flakiness detection with Wilson score confidence intervals
  • Watch mode — re-run evals on file changes (--watch)
  • External cases — load from JSONL or YAML files alongside inline cases
  • Plugin system — custom graders and lifecycle hooks (beforeRun, afterTrial, afterRun)
  • 4 reporters — console, JSON, JUnit XML, Markdown
  • MCP server — 8 tools + 3 resources for AI assistant integration
  • CI-native — JUnit reporter, GitHub Actions Step Summary, git hook installation
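The all()/any()/not() combinators listed above are classic predicate composition. Here is a minimal standalone sketch of the pattern; the real library's grader type is richer than this boolean simplification, so treat the signatures as illustrative:

```typescript
// Simplified grader composition; real graders likely return scores, not booleans.
type Grader = (output: string) => boolean;

const contains = (needle: string): Grader => (out) => out.includes(needle);
const all = (...graders: Grader[]): Grader => (out) => graders.every((g) => g(out));
const any = (...graders: Grader[]): Grader => (out) => graders.some((g) => g(out));
const not = (grader: Grader): Grader => (out) => !grader(out);

// Pass if the answer mentions Paris and is not an apology or refusal.
const grade = all(contains("Paris"), not(any(contains("sorry"), contains("cannot"))));
```

Because each combinator returns another Grader, compositions nest arbitrarily deep without any special casing.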

Examples

| Example | What it covers | Run it |
|---------|---------------|--------|
| quickstart/ | Minimal setup — 1 case, 2 graders | agent-eval-kit run --config examples/quickstart |
| text-grading/ | Text, safety, metric, composition, and LLM judge graders | agent-eval-kit run --config examples/text-grading |
| tool-agent/ | Tool call grading, hallucination detection, plugins | agent-eval-kit run --config examples/tool-agent |

See examples/README.md for setup details.

Documentation

Full docs at flanaganse.github.io/agent-eval-kit.

Contributing

Contributions welcome — please open an issue first to discuss changes.

pnpm install && pnpm test && pnpm lint

License

MIT