evalkit

Lightweight deterministic evaluators for AI agents. Binary pass/fail checks, zero dependencies, no LLM cost.

Why

Before you reach for LLM-as-judge or complex scoring rubrics, you should have 10-20 core test cases with deterministic, binary checks that run on every commit. These checks have zero API cost, zero ambiguity, and produce the same result every time.
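What "deterministic, binary" means in practice: a check is just a pure function over the agent's output. This sketch (illustrative only, not the library's source) does what evalkit's `contentMatch` does, minus the structured result:

```typescript
// Illustrative sketch: a deterministic, binary check is a pure function
// from agent output to pass/fail. Same input, same verdict, every time.
function mustContainAll(responseText: string, needles: string[]): boolean {
  const haystack = responseText.toLowerCase();
  return needles.every((needle) => haystack.includes(needle.toLowerCase()));
}

mustContainAll('The GDP growth rate is 2.5%', ['GDP', '%']); // true
mustContainAll('No data available', ['GDP']);                // false
```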

Install

npm install evalkit

Quick Start

import { runSuite } from 'evalkit';

// Results stream to console as each case completes
const result = await runSuite({
  cases: 'golden-set.yaml',
  agent: async (query) => {
    const res = await myAgent.invoke(query);
    return {
      responseText: res.text,
      actualTools: res.toolsUsed,
      latencyMs: res.duration,
    };
  },
});
// eval-001  What is my portfolio allocation?           PASS  1.2s
// eval-002  Show me my current holdings                FAIL  3.4s
//           content_match: Missing: $
//
// 1/2 passed (4.6s)

if (result.failed > 0) process.exit(1);

Runner

Load test cases from JSON or YAML and run them against your agent.

Test case format

JSON

{
  "test_cases": [
    {
      "id": "eval-001",
      "query": "What is my portfolio allocation?",
      "checks": {
        "expectedTools": ["portfolio_holdings"],
        "mustContain": ["%", "AAPL"],
        "mustNotContain": ["I don't know"],
        "thresholdMs": 20000
      },
      "metadata": { "category": "portfolio" }
    }
  ]
}

YAML (built-in parser, no dependencies)

# golden-set.yaml
test_cases:
  - id: eval-001
    query: "What is my portfolio allocation?"
    checks:
      expectedTools:
        - portfolio_holdings
      mustContain:
        - "%"
        - "AAPL"
      mustNotContain:
        - "I don't know"
      thresholdMs: 20000
    metadata:
      category: portfolio
      difficulty: basic

Agent callback

The runner calls your function with each test case's query and expects an AgentResult back:

interface AgentResult {
  responseText: string;       // The agent's text response (required)
  actualTools?: string[];     // Tools the agent called
  latencyMs?: number;         // How long the agent took
  toolCallCount?: number;     // Number of tool calls (or derived from actualTools.length)
  cost?: number;              // Token count or dollar cost
}

You provide the adapter — evalkit never touches your SDK, keys, or auth.

runSuite() options

const result = await runSuite({
  cases: 'golden-set.yaml',     // File path (.json, .yaml, .yml) or inline SuiteConfig object
  agent: myAgentFn,             // Your (query: string) => Promise<AgentResult> callback
  name: 'Portfolio Suite',      // Optional suite name (overrides name from file)
  concurrency: 3,               // Run cases in parallel (default: 1, sequential)
  print: true,                  // Stream results to console (default: true)
  onCaseComplete: (caseResult) => {   // Optional progress callback, fired after each case
    console.log(`${caseResult.id}: ${caseResult.passed ? 'PASS' : 'FAIL'}`);
  },
});

Or pass cases inline — no file needed:

const result = await runSuite({
  cases: {
    test_cases: [
      { id: 'test-1', query: 'Hello', checks: { mustContain: ['hi'] } },
    ],
  },
  agent: myAgentFn,
});

Available checks

| Check field | What it validates |
|---|---|
| expectedTools | Agent called exactly these tools (set equality) |
| mustContain | Response contains all strings (case-insensitive) |
| mustNotContain | Response contains none of these strings |
| thresholdMs | Response time under threshold |
| json | Response is valid JSON ({ requireObject: true } for objects only) |
| schema | Parsed JSON has required keys with correct types |
| copOutPhrases | Response isn't empty or a cop-out ("I don't know", etc.) |
| lengthMin / lengthMax | Response character count within bounds |
| regexPatterns | Response matches regex patterns (regexMode: 'all' \| 'any') |
| toolCallMin / toolCallMax | Number of tool calls within bounds |
| costBudget | Cost under budget |

A case with no checks always passes — useful for smoke tests that just verify the agent doesn't crash.
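A minimal smoke suite could look like this (file name and cases are illustrative; only fields shown earlier in this README are used):

```yaml
# smoke.yaml — sketch of a smoke suite
test_cases:
  - id: smoke-001
    query: "Hello"
    # no checks: passes as long as the agent returns without crashing
  - id: smoke-002
    query: "What can you do?"
    checks:
      thresholdMs: 10000
```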

Individual evaluators

Every evaluator returns an EvalResult with passed: boolean and details: string. You can use them standalone without the runner.
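As a sketch, the result shape described above is just a verdict plus an explanation (illustrative, not the library's source):

```typescript
// Sketch of the EvalResult shape described above: every evaluator
// resolves to a binary verdict plus a human-readable explanation.
interface EvalResult {
  passed: boolean;
  details: string;
}

// e.g. a failing content check might produce:
const failing: EvalResult = {
  passed: false,
  details: 'content_match: Missing: $',
};
```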

Core checks

import { toolSelection, contentMatch, negativeMatch, latency } from 'evalkit';

toolSelection({
  expected: ['search', 'summarize'],
  actual: ['summarize', 'search'],
});
// passed: true — order-independent set equality

contentMatch({
  responseText: 'The GDP growth rate is 2.5%',
  mustContain: ['GDP', 'growth'],
});
// passed: true — case-insensitive substring match

negativeMatch({
  responseText: 'Here is your analysis.',
  mustNotContain: ["I don't know", 'error'],
});
// passed: true

latency({ latencyMs: 1200, thresholdMs: 5000 });
// passed: true — default threshold: 20,000ms

Format checks

import { jsonValid, schemaMatch, nonEmpty, lengthBounds } from 'evalkit';

jsonValid({ text: '{"valid": true}', requireObject: true });
// passed: true

schemaMatch({
  data: { name: 'Alice', age: 30 },
  requiredKeys: ['name', 'age'],
  typeChecks: { name: 'string', age: 'number' },
});
// passed: true — zero-dep, no Zod needed

nonEmpty({ responseText: "I don't know" });
// passed: false — "Response is a cop-out phrase"

lengthBounds({ responseText: 'Hello world', min: 5, max: 1000 });
// passed: true

Pattern & behavioral checks

import { regexMatch, toolCallCount, costBudget } from 'evalkit';

regexMatch({
  responseText: 'Contact: support@example.com',
  patterns: [/\S+@\S+\.\S+/],
  mode: 'all', // 'all' (default) or 'any'
});
// passed: true

toolCallCount({ count: 3, min: 1, max: 5 });
// passed: true

costBudget({ actual: 5000, budget: 10000 });
// passed: true — works with token counts or dollar amounts

runChecks()

Run any combination of checks at once. Only runs checks for which inputs are provided.

import { runChecks } from 'evalkit';

const result = runChecks({
  responseText: 'Your portfolio has 15 holdings',
  expectedTools: ['portfolio_holdings'],
  actualTools: ['portfolio_holdings'],
  mustContain: ['portfolio', 'holdings'],
  mustNotContain: ["I don't know"],
  latencyMs: 1500,
  thresholdMs: 5000,
});
// { passed: true, results: [...], summary: '4/4 checks passed' }

Factory pattern

Each evaluator has a create*Evaluator factory for reuse across test cases:

import { createContentMatchEvaluator } from 'evalkit';

const checkContent = createContentMatchEvaluator({
  mustContain: ['portfolio', 'holdings', 'allocation'],
});

// Reuse across test cases
const r1 = checkContent({ responseText: response1 });
const r2 = checkContent({ responseText: response2 });

Integrating with your agent

evalkit is SDK-agnostic. You write a thin adapter function that calls your agent and returns an AgentResult. Here are patterns for common setups:

Any agent (generic pattern)

import { runSuite, AgentFn } from 'evalkit';

const agent: AgentFn = async (query) => {
  const start = Date.now();
  const response = await callMyAgent(query);
  return {
    responseText: response.text,
    actualTools: response.toolCalls?.map((t) => t.name),
    latencyMs: Date.now() - start,
    toolCallCount: response.toolCalls?.length,
    cost: response.usage?.totalTokens,
  };
};

const result = await runSuite({ cases: 'golden-set.yaml', agent });

CI / GitHub Actions

// eval.ts — run with: npx tsx eval.ts
import { runSuite } from 'evalkit';

const result = await runSuite({
  cases: 'golden-set.yaml',
  agent: myAgentFn,
});

process.exit(result.failed > 0 ? 1 : 0);

# .github/workflows/evals.yml
- run: npx tsx eval.ts
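Expanded into a full workflow file, this might look like the following (job name, Node version, and action versions are illustrative):

```yaml
# .github/workflows/evals.yml — illustrative sketch
name: Evals
on: [push, pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # eval.ts exits non-zero if any case fails, failing the job
      - run: npx tsx eval.ts
```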

Design principles

  • Zero runtime dependencies — no Zod, no Ajv, no LLM calls
  • Binary results — every check returns passed: boolean
  • Deterministic — same input always produces same output
  • CI-friendly — fast enough to run on every commit
  • SDK-agnostic — you provide the agent callback, evalkit runs the checks

License

MIT