@idriszade/eval

v0.1.7

Pipeline-kit eval foundation — defineEval, runEval, case/scorer/score types

Braintrust-style eval runner for pipeline-kit. Define test cases, a task function, and one or more scorers; run them in parallel with configurable concurrency; collect per-case EvalResult and a rolled-up EvalSummary. The API surface is a common-denominator design aligned with Braintrust, Inspect AI, Promptfoo, and LangSmith — scorer results are typed Score objects, not opaque strings.

Installation

pnpm add @idriszade/eval

Quick start

import { defineEval } from '@idriszade/eval';
import { exactMatch, llmJudge } from '@idriszade/eval-scorers';

const myEval = defineEval({
  name: 'summarise-v1',
  cases: [
    { input: 'The cat sat on the mat.', expected: 'A cat is on a mat.' },
    { input: 'Water boils at 100°C.', expected: 'Water boils at 100 degrees Celsius.' },
  ],
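  task: async (input) => {
    // Call your model or pipeline here.
    // mySummariser is a placeholder for your own function.
    return await mySummariser(input);
  },
  scorers: {
    exact: exactMatch(),
    // myModelClient is a placeholder for the model client you already use.
    judge: llmJudge({ model: 'gpt-4o', client: myModelClient }),
  },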
  concurrency: 4,
});

const summary = await myEval.run();
console.log(summary.passRate);        // fraction where ALL scorers passed
console.log(summary.totalUsage);      // merged UsageAccumulator snapshot

API

defineEval(opts)

Returns a DefinedEval with a .run() method.

interface DefineEvalOpts<I, O> {
  name: string;
  cases: ReadonlyArray<Case<I, O>>;
  task: (input: I) => Promise<O> | O;
  scorers: Record<string, Scorer<I, O>>;
  concurrency?: number;          // default: unbounded (all cases in parallel)
  judge?: { model: string; rubric?: string };  // metadata only — wire llmJudge in scorers
}

runEval(opts)

A lower-level function for running an eval directly, without the DefinedEval wrapper.

Scorer<I, O>

type Scorer<I, O> = (args: {
  input: I;
  output: O;
  expected?: O;
  metadata?: Record<string, unknown>;
}) => Score | Promise<Score>;
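
A custom scorer is just a function of that shape. The sketch below declares local stand-in types so it runs without the package. The Score fields (pass, score, reason) are an assumption inferred from the fields this README mentions and the real type may carry more. containsExpected is a hypothetical name, not something the package exports.

```typescript
// Local stand-ins; in real code these types come from '@idriszade/eval'.
// The Score fields are an assumption based on the fields this README mentions.
interface Score {
  pass: boolean;
  score: number;
  reason?: string;
}

type Scorer<I, O> = (args: {
  input: I;
  output: O;
  expected?: O;
  metadata?: Record<string, unknown>;
}) => Score | Promise<Score>;

// Hypothetical scorer: passes when the output contains the expected text,
// ignoring case. A missing expected value counts as a failure.
const containsExpected: Scorer<string, string> = ({ output, expected }) => {
  if (expected === undefined) {
    return { pass: false, score: 0, reason: 'no expected value provided' };
  }
  const hit = output.toLowerCase().includes(expected.toLowerCase());
  return { pass: hit, score: hit ? 1 : 0 };
};
```

A scorer like this would sit in the scorers record next to the built-in ones, for example scorers: { contains: containsExpected }.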

EvalSummary shape

interface EvalSummary<I, O> {
  name: string;
  results: ReadonlyArray<EvalResult<I, O>>;
  passRate: number;              // fraction where all scorer scores pass
  totalDurationMs: number;
  totalUsage: ReadonlyMap<string, number>;
}

interface EvalResult<I, O> {
  case: Case<I, O>;
  output?: O;
  scores: Array<{ name: string; score: Score }>;
  usage: ReadonlyMap<string, number>;
  durationMs: number;
  error?: { code: string; message: string };
}
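
To make the passRate definition concrete, here is a minimal sketch of how such a figure can be derived from per-case results. It is not the package's implementation. ScoreLike and ResultLike are local stand-ins for the real types, and returning 0 for an empty run is an assumption.

```typescript
// Local stand-ins for the package types (assumed shapes).
interface ScoreLike {
  pass: boolean;
  score: number;
  reason?: string;
}

interface ResultLike {
  scores: Array<{ name: string; score: ScoreLike }>;
}

// Fraction of cases in which every scorer passed.
// Note: a case with no scores counts as passing here, because
// Array.prototype.every returns true for an empty array.
function passRate(results: ReadonlyArray<ResultLike>): number {
  if (results.length === 0) return 0; // assumption: an empty run reports 0
  const passed = results.filter((r) => r.scores.every((s) => s.score.pass)).length;
  return passed / results.length;
}
```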

Concurrency

concurrency controls the maximum number of cases running at the same time. Omitting it runs all cases in parallel. Set it to 1 for sequential execution (useful when the model you call is rate-limited).

defineEval({ ..., concurrency: 5 });
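
For intuition about what a concurrency limit means in practice, the sketch below shows one common way to cap in-flight async work: a fixed number of "lanes" that each pull the next item when they finish the previous one. mapWithLimit is a hypothetical helper written for illustration. How the package schedules cases internally is not documented here.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithLimit<T, R>(
  items: ReadonlyArray<T>,
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane claims items one at a time until none are left.
  const lane = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  };

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

With limit set to 1 this degenerates to sequential execution, which matches the behaviour described for concurrency: 1.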

Errors are captured

If task throws, the case gets EvalResult.error and all scorers produce { pass: false, score: 0 }. If a scorer throws, that scorer's Score carries a reason describing the error. Neither propagates as a rejected promise — run() always resolves.
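
Because run() always resolves, failures have to be read out of the results rather than caught. A minimal sketch of doing that follows. ScoreLike and ResultLike are local stand-ins for the package types, and describeFailures is a hypothetical helper.

```typescript
// Local stand-ins for the package types (assumed shapes).
interface ScoreLike {
  pass: boolean;
  score: number;
  reason?: string;
}

interface ResultLike {
  error?: { code: string; message: string };
  scores: Array<{ name: string; score: ScoreLike }>;
}

// Produces one line per failed case or failed scorer.
// A task error is reported once; its zeroed scorer results are skipped.
function describeFailures(results: ReadonlyArray<ResultLike>): string[] {
  const lines: string[] = [];
  results.forEach((r, i) => {
    if (r.error) {
      lines.push(`case ${i}: task failed [${r.error.code}] ${r.error.message}`);
      return;
    }
    for (const { name, score } of r.scores) {
      if (!score.pass) {
        const why = score.reason ? ` (${score.reason})` : '';
        lines.push(`case ${i}: scorer "${name}" did not pass${why}`);
      }
    }
  });
  return lines;
}
```

In real use the argument would be summary.results from a completed run.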

Usage tracking

EvalResult.usage is a snapshot of the UsageAccumulator recorded during that case's task execution. EvalSummary.totalUsage merges all per-case snapshots — it is the union of all recorded keys with values summed across cases.
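
The merge rule (union of keys, values summed) can be sketched as below. mergeUsage is a hypothetical helper written to illustrate the rule, not the package's code, and the key names in the usage note are made up.

```typescript
// Merges per-case usage snapshots: every key that appears in any snapshot
// appears in the total, with its values summed across snapshots.
function mergeUsage(
  snapshots: ReadonlyArray<ReadonlyMap<string, number>>,
): Map<string, number> {
  const total = new Map<string, number>();
  for (const snapshot of snapshots) {
    snapshot.forEach((value, key) => {
      total.set(key, (total.get(key) ?? 0) + value);
    });
  }
  return total;
}
```

For example, merging { inputTokens: 10, outputTokens: 4 } with { outputTokens: 6, calls: 1 } gives { inputTokens: 10, outputTokens: 10, calls: 1 }.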

Wire UsageAccumulator into your task via the kit context to get automatic per-case tracking:

task: async (input, ctx) => {
  const result = await myProcess.execute(input, ctx);
  // ctx.usage is populated inside execute(); the runner reads it after.
  return result.data;
},

License — MIT