

@llmbench/core

Evaluation engine, providers, scorers, and SDK for the LLMBench platform.

This is the core engine that powers LLMBench. Use it directly if you want to build custom evaluation pipelines, integrate with your own tooling, or embed LLM evaluation into your application.

Installation

npm install @llmbench/core

SDK — One-Call Evaluation

The simplest way to run evaluations programmatically:

evaluate()

import { evaluate } from "@llmbench/core";

const result = await evaluate({
  testCases: [
    { input: "What is 2+2?", expected: "4" },
    { input: "Capital of France?", expected: "Paris" },
  ],
  providers: [
    { type: "openai", name: "GPT-4o", model: "gpt-4o" },
  ],
  scorers: [
    { id: "exact-match", name: "Exact Match", type: "exact-match" },
    { id: "contains", name: "Contains", type: "contains" },
  ],
});

console.log(result.status);          // "completed"
console.log(result.summary);         // { totalCases, completedCases, failedCases, totalCost, ... }
console.log(result.scorerAverages);  // { "exact-match": 1.0, "contains": 1.0 }

EvaluateOptions:

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| testCases | SimpleTestCase[] | Yes | -- | Array of test cases |
| providers | ProviderConfig[] | Yes | -- | Array of providers |
| scorers | ScorerConfig[] | No | [exact-match] | Scorers; [] = no scoring |
| onEvent | (event) => void | No | -- | Event listener for progress |
| concurrency | number | No | 5 | Parallel evaluations |
| maxRetries | number | No | 3 | Retries on transient errors |
| timeoutMs | number | No | 30000 | Per-request timeout |
| db | LLMBenchDB | No | -- | Pre-existing DB handle |
| dbPath | string | No | in-memory | Persistent DB path |
| projectName | string | No | sdk-eval | Project name |
| datasetName | string | No | sdk-dataset | Dataset name |
| customProviders | Map<string, fn> | No | -- | Custom provider functions |
| cache | { ttlHours? } | No | -- | Enable caching with TTL |

SimpleTestCase:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| input | string | Yes | Prompt text |
| expected | string | No | Expected output for global scorers |
| messages | ChatMessage[] | No | Multi-turn conversation |
| context | object | No | Template interpolation variables |
| tags | string[] | No | Tags |
| assert | TestCaseAssertion[] | No | Per-test-case assertions (override global scorers) |

evaluateQuick()

Convenience wrapper for single-prompt evaluation:

import { evaluateQuick } from "@llmbench/core";

const result = await evaluateQuick({
  prompt: "What is the meaning of life?",
  expected: "42",
  providers: [{ type: "openai", name: "GPT-4o", model: "gpt-4o" }],
});

Per-Test-Case Assertions

Test cases can override global scorers with inline assertions:

const result = await evaluate({
  testCases: [
    {
      input: "Name a color",
      assert: [
        { type: "regex", value: "(red|blue|green|yellow)" },
        { type: "contains", value: "color" },
      ],
    },
    {
      input: "What is 2+2?",
      expected: "4",  // uses global scorers
    },
  ],
  providers: [{ type: "openai", name: "GPT", model: "gpt-4o" }],
  scorers: [{ id: "exact-match", name: "Exact Match", type: "exact-match" }],
});

When assert is present, those assertions replace global scorers for that test case. Each assertion specifies its own expected value via the value field.

Custom Providers via SDK

const result = await evaluate({
  testCases: [{ input: "Hello", expected: "Hi" }],
  providers: [{ type: "custom", name: "MyAPI", model: "v1" }],
  customProviders: new Map([
    ["MyAPI", async (input) => ({
      output: "Hi there!",
      latencyMs: 50,
      tokenUsage: { inputTokens: 5, outputTokens: 3, totalTokens: 8 },
    })],
  ]),
});

Config & Dataset Loading

Config Loading

import { loadConfig, mergeWithDefaults } from "@llmbench/core/config";

// Auto-detects llmbench.config.ts, .js, .mjs, .yaml, or .yml in cwd
const config = await loadConfig();

// Or specify a path (YAML or TypeScript)
const config2 = await loadConfig("./path/to/config.yaml");

// Apply defaults (dbPath, port, concurrency, retries, timeout)
const full = mergeWithDefaults(config);

Dataset Loading

import { loadDataset } from "@llmbench/core/config";

// Auto-detects JSON or YAML by extension
const dataset = loadDataset("./datasets/qa.yaml");

// dataset.name        — dataset name
// dataset.testCases   — array of test cases with input, expected, assert, etc.

loadDataset validates all fields, including per-test-case assertions, and throws descriptive errors for invalid data.

Providers

Built-in Providers

import {
  OpenAIProvider,
  AnthropicProvider,
  GoogleProvider,
  OllamaProvider,
  CustomProvider,
  createProvider,  // factory function
} from "@llmbench/core/providers";

OpenAI

const provider = new OpenAIProvider({
  type: "openai",
  name: "GPT-4o",
  model: "gpt-4o",
  // apiKey: resolved from OPENAI_API_KEY env var by default
  temperature: 0,
  maxTokens: 1024,
  timeoutMs: 30000,
});

const response = await provider.generate("What is 2 + 2?");
// { output: "4", latencyMs: 230, tokenUsage: { inputTokens: 12, outputTokens: 1, totalTokens: 13 } }

Anthropic

const provider = new AnthropicProvider({
  type: "anthropic",
  name: "Claude Sonnet",
  model: "claude-sonnet-4-6",
  maxTokens: 1024,
});

Google AI

const provider = new GoogleProvider({
  type: "google",
  name: "Gemini Flash",
  model: "gemini-2.0-flash",
});

Ollama (local models)

const provider = new OllamaProvider({
  type: "ollama",
  name: "Llama 3.2",
  model: "llama3.2",
  baseUrl: "http://localhost:11434", // default
});

Custom Provider

const provider = new CustomProvider(
  { type: "custom", name: "My API", model: "v1" },
  async (input, config) => {
    const res = await fetch("https://my-api.com/generate", {
      method: "POST",
      body: JSON.stringify({ prompt: input }),
    });
    const data = await res.json();
    return {
      output: data.text,
      latencyMs: data.duration_ms,
      tokenUsage: {
        inputTokens: data.input_tokens,
        outputTokens: data.output_tokens,
        totalTokens: data.input_tokens + data.output_tokens,
      },
    };
  },
);

Factory Function

import { createProvider } from "@llmbench/core/providers";

// Automatically picks the right provider class based on config.type
const provider = createProvider({ type: "openai", name: "GPT-4o", model: "gpt-4o" });

Provider Features

All providers inherit from BaseProvider, which provides:

  • Config merging — Override temperature, maxTokens, etc. per-call via provider.generate(input, overrides)
  • Timeout signals — Uses AbortSignal.timeout() (Node 20+) for per-request timeouts
  • API key resolution — Reads from config or falls back to environment variables
  • System messages — Supports systemMessage with {{variable}} interpolation
  • Retry with backoff — Retries on 429/5xx with exponential backoff (1s, 2s, 4s... up to 30s)
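The retry schedule above (1s base doubling up to a 30s cap) can be sketched as follows. This is an illustration of the described behavior, not the package's internal RetryHandler; the function name is hypothetical.

```typescript
// Illustrative backoff schedule: delay doubles from a 1s base, capped at 30s.
// backoffDelayMs is a hypothetical name, not part of @llmbench/core's API.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// attempts 0..5 → 1000, 2000, 4000, 8000, 16000, 30000 (32000 hits the cap)
console.log([0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n)));
```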

Scorers

Built-in Scorers

import {
  ExactMatchScorer,
  ContainsScorer,
  RegexScorer,
  JsonMatchScorer,
  CosineSimilarityScorer,
  LLMJudgeScorer,
  WeightedAverageScorer,
  createScorer,  // factory function
} from "@llmbench/core/scorers";

All scorers implement IScorer and return ScoreResult with value (0-1), reason, and optional metadata.

Exact Match

const scorer = new ExactMatchScorer(); // case-insensitive, trimmed by default
await scorer.score("Paris", "paris");   // { value: 1 }
await scorer.score("Paris", "London");  // { value: 0 }

const strict = new ExactMatchScorer({ caseSensitive: true, trim: false });
await strict.score("Paris", "paris");   // { value: 0 }

Contains

const scorer = new ContainsScorer();
await scorer.score("The answer is 42", "42");   // { value: 1 }
await scorer.score("Hello World", "hello");      // { value: 1 }

const strict = new ContainsScorer({ caseSensitive: true });
await strict.score("Hello World", "hello");      // { value: 0 }

Regex

const scorer = new RegexScorer(); // case-insensitive by default
await scorer.score("The answer is 42", "\\d+");  // { value: 1 }
await scorer.score("hello", "^\\d+$");           // { value: 0 }

JSON Match

const scorer = new JsonMatchScorer();
await scorer.score('{"a":1,"b":2}', '{"b":2,"a":1}');  // { value: 1 } — order independent

const partial = new JsonMatchScorer({ partial: true });
await partial.score('{"a":1,"b":2,"c":3}', '{"a":1,"b":2}');  // { value: 1 }

Cosine Similarity

const scorer = new CosineSimilarityScorer();
await scorer.score("hello world", "hello world");                     // { value: 1.0 }
await scorer.score("The cat sat on the mat", "The cat is on the mat"); // { value: ~0.85 }
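As a rough illustration of how a text cosine score of this kind can be computed (this is an assumed term-frequency approach, not necessarily the package's actual implementation):

```typescript
// Hypothetical sketch: bag-of-words term-frequency cosine similarity.
function termCounts(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tok of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    counts.set(tok, (counts.get(tok) ?? 0) + 1);
  }
  return counts;
}

function cosineSimilarity(a: string, b: string): number {
  const ca = termCounts(a);
  const cb = termCounts(b);
  let dot = 0, na = 0, nb = 0;
  for (const [tok, n] of ca) {
    na += n * n;
    dot += n * (cb.get(tok) ?? 0);
  }
  for (const n of cb.values()) nb += n * n;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

console.log(cosineSimilarity("hello world", "hello world")); // 1
```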

LLM Judge

const judgeProvider = new OpenAIProvider({
  type: "openai", name: "Judge", model: "gpt-4o",
});

const scorer = new LLMJudgeScorer(judgeProvider, {
  name: "Quality Judge",
  promptTemplate: `Score 0-1. Input: {{input}} Expected: {{expected}} Actual: {{output}}
Return JSON: { "score": <number>, "reason": "<explanation>" }`,
});

Weighted Composite

const scorer = new WeightedAverageScorer([
  { scorer: new ExactMatchScorer(), weight: 3 },
  { scorer: new ContainsScorer(), weight: 1 },
]);
// If exact=0, contains=1: value = (0*3 + 1*1) / (3+1) = 0.25

Cost Calculation

import { CostCalculator } from "@llmbench/core/cost";

const calculator = new CostCalculator();
const estimate = calculator.calculate("gpt-4o", "openai", {
  inputTokens: 1000, outputTokens: 500, totalTokens: 1500,
});
// { inputCost: 0.0025, outputCost: 0.005, totalCost: 0.0075, currency: "USD" }

Built-in pricing for 50+ models across OpenAI, Anthropic, and Google AI.
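The arithmetic behind the estimate above is straightforward per-million-token pricing. The rates below are assumed for illustration and may not match the real PRICING_TABLE:

```typescript
// Illustrative cost arithmetic. The per-million-token rates here are
// assumptions for the example, not the package's actual pricing data.
const PRICE_PER_MTOK = { inputUsd: 2.5, outputUsd: 10 }; // hypothetical gpt-4o rates

function estimateCostUsd(inputTokens: number, outputTokens: number) {
  const inputCost = (inputTokens / 1_000_000) * PRICE_PER_MTOK.inputUsd;
  const outputCost = (outputTokens / 1_000_000) * PRICE_PER_MTOK.outputUsd;
  return { inputCost, outputCost, totalCost: inputCost + outputCost };
}

console.log(estimateCostUsd(1000, 500));
// totalCost ≈ 0.0075, consistent with the example above
```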

CI Gates

import { ThresholdGate } from "@llmbench/core/gate";

const gate = new ThresholdGate({
  minScore: 0.8,
  maxFailureRate: 0.1,
  maxCost: 5.00,
  maxLatencyMs: 10000,
  scorerThresholds: { "exact-match": 0.9 },
});

const result = gate.evaluateRun(run, scoresByResultId);
// { passed: true/false, violations: [{ gate, threshold, actual, message }] }

Run Comparison

import { RunComparator } from "@llmbench/core/comparison";

const comparator = new RunComparator(evalRunRepo, evalResultRepo, scoreRepo);
const result = await comparator.compare(runIdA, runIdB);

// result.scorerComparisons — per-scorer average score delta
// result.costComparison    — total cost delta and % change
// result.latencyComparison — avg latency delta and % change
// result.regressions       — test cases where Run B scored worse
//   severity: "high" (>30% drop) | "medium" (>15%) | "low" (>5%)

Engine Internals

The EvaluationEngine handles:

  • Concurrency — ConcurrencyManager limits parallel provider calls (configurable per run)
  • Retries — RetryHandler with exponential backoff (1s base, 30s max, configurable max retries)
  • Events — EventBus emits typed events: run:started, case:started, case:completed, case:failed, run:progress, run:completed, run:failed
  • Per-test-case assertions — When a test case has assert[], inline scorers override global scorers. Invalid inline types (llm-judge, composite) fail fast before making API calls.
  • Template interpolation — {{variable}} substitution in prompts and system messages using test case context
  • Caching — SHA-256 keyed response cache with optional TTL, stored in SQLite
  • Cost tracking — Calculated per request using the built-in pricing table
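The template interpolation step can be sketched as below. The semantics are assumed for illustration (unknown variables are left untouched here, which may differ from the engine's actual behavior):

```typescript
// Minimal sketch of {{variable}} interpolation using a test case's context.
// Assumed behavior: unknown variables are left as-is rather than erased.
function interpolate(template: string, context: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key: string) =>
    key in context ? context[key] : match,
  );
}

console.log(interpolate("Translate {{word}} to {{lang}}", { word: "cat", lang: "French" }));
// "Translate cat to French"
```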

Subpath Exports

| Import path | Contents |
|-------------|----------|
| @llmbench/core | All public exports |
| @llmbench/core/providers | Provider classes + createProvider factory |
| @llmbench/core/scorers | Scorer classes + createScorer factory |
| @llmbench/core/engine | EvaluationEngine, EventBus, ConcurrencyManager, RetryHandler |
| @llmbench/core/cost | CostCalculator, PRICING_TABLE |
| @llmbench/core/comparison | RunComparator |
| @llmbench/core/gate | ThresholdGate |
| @llmbench/core/config | loadConfig, loadDataset, validateConfig, mergeWithDefaults |
| @llmbench/core/sdk | evaluate, evaluateQuick |

Related Packages

| Package | Description |
|---------|-------------|
| @llmbench/cli | CLI tool for running evaluations |
| @llmbench/types | TypeScript type definitions |
| @llmbench/db | SQLite database layer |
| @llmbench/ui | React component library |

License

Apache License 2.0