evalkit

v0.2.0

Published

3 months ago

Lightweight deterministic evaluators for AI agents. Binary pass/fail checks, zero dependencies, no LLM cost.

0High
0Medium
0Low

wkhori

ai agent evals evaluation testing golden-set deterministic langchain llm evalkit

evalkit

Lightweight deterministic evaluators for AI agents. Binary pass/fail checks, zero dependencies, no LLM cost.

Why

Before you reach for LLM-as-judge or complex scoring rubrics, you should have 10-20 core test cases with deterministic, binary checks that run on every commit. These checks have zero API cost, zero ambiguity, and produce the same result every time.

Install

npm install evalkit

Quick Start

import { runSuite } from 'evalkit';

// Results stream to console as each case completes
const result = await runSuite({
  cases: 'golden-set.yaml',
  agent: async (query) => {
    const res = await myAgent.invoke(query);
    return {
      responseText: res.text,
      actualTools: res.toolsUsed,
      latencyMs: res.duration,
    };
  },
});
// eval-001  What is my portfolio allocation?           PASS  1.2s
// eval-002  Show me my current holdings                FAIL  3.4s
//           content_match: Missing: $
//
// 1/2 passed (4.6s)

if (result.failed > 0) process.exit(1);

Runner

Load test cases from JSON or YAML and run them against your agent.

Test case format

JSON

{
  "test_cases": [
    {
      "id": "eval-001",
      "query": "What is my portfolio allocation?",
      "checks": {
        "expectedTools": ["portfolio_holdings"],
        "mustContain": ["%", "AAPL"],
        "mustNotContain": ["I don't know"],
        "thresholdMs": 20000
      },
      "metadata": { "category": "portfolio" }
    }
  ]
}

YAML (built-in parser, no dependencies)

# golden-set.yaml
test_cases:
  - id: eval-001
    query: "What is my portfolio allocation?"
    checks:
      expectedTools:
        - portfolio_holdings
      mustContain:
        - "%"
        - "AAPL"
      mustNotContain:
        - "I don't know"
      thresholdMs: 20000
    metadata:
      category: portfolio
      difficulty: basic

Agent callback

The runner calls your function with each test case's query and expects an AgentResult back:

interface AgentResult {
  responseText: string;       // The agent's text response (required)
  actualTools?: string[];     // Tools the agent called
  latencyMs?: number;         // How long the agent took
  toolCallCount?: number;     // Number of tool calls (or derived from actualTools.length)
  cost?: number;              // Token count or dollar cost
}

You provide the adapter — evalkit never touches your SDK, keys, or auth.

`runSuite()` options

const result = await runSuite({
  cases: 'golden-set.yaml',     // File path (.json, .yaml, .yml) or inline SuiteConfig object
  agent: myAgentFn,             // Your (query: string) => Promise<AgentResult> callback
  name: 'Portfolio Suite',      // Optional suite name (overrides name from file)
  concurrency: 3,              // Run cases in parallel (default: 1, sequential)
  print: true,                 // Stream results to console (default: true)
  onCaseComplete: (caseResult) => {
    console.log(`${caseResult.id}: ${caseResult.passed ? 'PASS' : 'FAIL'}`);
  },                            // Optional progress callback, fired after each case
});

Or pass cases inline — no file needed:

const result = await runSuite({
  cases: {
    test_cases: [
      { id: 'test-1', query: 'Hello', checks: { mustContain: ['hi'] } },
    ],
  },
  agent: myAgentFn,
});

Available checks

| Check field | What it validates | |---|---| | expectedTools | Agent called exactly these tools (set equality) | | mustContain | Response contains all strings (case-insensitive) | | mustNotContain | Response contains none of these strings | | thresholdMs | Response time under threshold | | json | Response is valid JSON ({ requireObject: true } for objects only) | | schema | Parsed JSON has required keys with correct types | | copOutPhrases | Response isn't empty or a cop-out ("I don't know", etc.) | | lengthMin / lengthMax | Response character count within bounds | | regexPatterns | Response matches regex patterns (regexMode: 'all' \| 'any') | | toolCallMin / toolCallMax | Number of tool calls within bounds | | costBudget | Cost under budget |

A case with no checks always passes — useful for smoke tests that just verify the agent doesn't crash.

Individual evaluators

Every evaluator returns an EvalResult with passed: boolean and details: string. You can use them standalone without the runner.

Core checks

import { toolSelection, contentMatch, negativeMatch, latency } from 'evalkit';

toolSelection({
  expected: ['search', 'summarize'],
  actual: ['summarize', 'search'],
});
// passed: true — order-independent set equality

contentMatch({
  responseText: 'The GDP growth rate is 2.5%',
  mustContain: ['GDP', 'growth'],
});
// passed: true — case-insensitive substring match

negativeMatch({
  responseText: 'Here is your analysis.',
  mustNotContain: ["I don't know", 'error'],
});
// passed: true

latency({ latencyMs: 1200, thresholdMs: 5000 });
// passed: true — default threshold: 20,000ms

Format checks

import { jsonValid, schemaMatch, nonEmpty, lengthBounds } from 'evalkit';

jsonValid({ text: '{"valid": true}', requireObject: true });
// passed: true

schemaMatch({
  data: { name: 'Alice', age: 30 },
  requiredKeys: ['name', 'age'],
  typeChecks: { name: 'string', age: 'number' },
});
// passed: true — zero-dep, no Zod needed

nonEmpty({ responseText: "I don't know" });
// passed: false — "Response is a cop-out phrase"

lengthBounds({ responseText: 'Hello world', min: 5, max: 1000 });
// passed: true

Pattern & behavioral checks

import { regexMatch, toolCallCount, costBudget } from 'evalkit';

regexMatch({
  responseText: 'Contact: [email protected]',
  patterns: [/\S+@\S+\.\S+/],
  mode: 'all', // 'all' (default) or 'any'
});
// passed: true

toolCallCount({ count: 3, min: 1, max: 5 });
// passed: true

costBudget({ actual: 5000, budget: 10000 });
// passed: true — works with token counts or dollar amounts

`runChecks()`

Run any combination of checks at once. Only runs checks for which inputs are provided.

import { runChecks } from 'evalkit';

const result = runChecks({
  responseText: 'Your portfolio has 15 holdings',
  expectedTools: ['portfolio_holdings'],
  actualTools: ['portfolio_holdings'],
  mustContain: ['portfolio', 'holdings'],
  mustNotContain: ["I don't know"],
  latencyMs: 1500,
  thresholdMs: 5000,
});
// { passed: true, results: [...], summary: '4/4 checks passed' }

Factory pattern

Each evaluator has a create*Evaluator factory for reuse across test cases:

import { createContentMatchEvaluator } from 'evalkit';

const checkContent = createContentMatchEvaluator({
  mustContain: ['portfolio', 'holdings', 'allocation'],
});

// Reuse across test cases
const r1 = checkContent({ responseText: response1 });
const r2 = checkContent({ responseText: response2 });

Integrating with your agent

evalkit is SDK-agnostic. You write a thin adapter function that calls your agent and returns an AgentResult. Here are patterns for common setups:

Any agent (generic pattern)

import { runSuite, AgentFn } from 'evalkit';

const agent: AgentFn = async (query) => {
  const start = Date.now();
  const response = await callMyAgent(query);
  return {
    responseText: response.text,
    actualTools: response.toolCalls?.map((t) => t.name),
    latencyMs: Date.now() - start,
    toolCallCount: response.toolCalls?.length,
    cost: response.usage?.totalTokens,
  };
};

const result = await runSuite({ cases: 'golden-set.yaml', agent });

CI / GitHub Actions

// eval.ts — run with: npx tsx eval.ts
import { runSuite } from 'evalkit';

const result = await runSuite({
  cases: 'golden-set.yaml',
  agent: myAgentFn,
});

process.exit(result.failed > 0 ? 1 : 0);

# .github/workflows/evals.yml
- run: npx tsx eval.ts

Design principles

Zero runtime dependencies — no Zod, no Ajv, no LLM calls
Binary results — every check returns passed: boolean
Deterministic — same input always produces same output
CI-friendly — fast enough to run on every commit
SDK-agnostic — you provide the agent callback, evalkit runs the checks

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

evalkit

Why

Install

Quick Start

Runner

Test case format

JSON

YAML (built-in parser, no dependencies)

Agent callback

runSuite() options

Available checks

Individual evaluators

Core checks

Format checks

Pattern & behavioral checks

runChecks()

Factory pattern

Integrating with your agent

Any agent (generic pattern)

CI / GitHub Actions

Design principles

License

`runSuite()` options

`runChecks()`