ai-contract-eval

v0.1.0

A lightweight evaluation layer for AI systems with structured cases, checks, and summaries.

A lightweight evaluation layer for AI systems built on structured contracts.

This project addresses a common problem in AI applications: teams often judge outputs informally instead of evaluating them against explicit expectations. That makes systems harder to compare, improve, and trust over time.

ai-contract-eval provides a lightweight framework for:

  • defining evaluation cases for AI tasks.
  • running model outputs against explicit expectations.
  • scoring structure, content, and constraints.
  • generating repeatable summaries across individual cases and suites.

It is intentionally small so it can be understood quickly, adopted easily, and extended as systems grow.

It can be used as both a reference implementation and a lightweight standard for evaluating AI interactions.

Why this project exists

Most AI evaluation is still ad hoc.

  • One workflow compares outputs manually.
  • Another relies on a few vague metrics.
  • A third stores examples with no consistent scoring method.
  • A fourth reruns prompts but cannot explain whether the system improved.

The result is weak comparability, poor traceability, and limited confidence in whether a system is getting better or worse.

This package defines a simple evaluation layer that sits on top of structured AI inputs and outputs.

Mental model

Think of the package as a small evaluation boundary:

AI Contract -> Evaluation Case -> Scoring -> Result -> Summary

The evaluation layer sits after the model interaction, turning outputs into structured results that can be compared across runs, prompts, and models.

This does not replace human judgment.

It makes evaluation more explicit, repeatable, and composable.

What is included

  • Evaluation case builder.
  • Rule-based scoring helpers.
  • Single-case evaluation runner.
  • Multi-case suite evaluation runner.
  • Structured result normalization.
  • Example usage demonstrating real-world integration.
  • Test coverage for evaluation and summary behavior.

Install

npm install ai-contract-eval

Example

import {
  createEvaluationCase,
  evaluateCase,
  summarizeSuite
} from "ai-contract-eval";

const testCase = createEvaluationCase({
  name: "summarization-basic",
  task: "summarization",
  input: {
    prompt: "Summarize the article in 2 sentences."
  },
  actual: {
    text: "The article explains how cities are redesigning streets to improve safety and reduce emissions."
  },
  expected: {
    contains: ["cities", "safety"],
    maxLength: 140
  }
});

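// Run the checks declared in `expected` against `actual`.
const result = evaluateCase(testCase);

// Aggregate one or more case results into a suite-level summary.
const suite = summarizeSuite([result]);

console.log(result);
console.log(suite);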

Evaluation case contract

An evaluation case looks like this:

{
  "version": "1.0",
  "name": "summarization-basic",
  "task": "summarization",
  "input": {
    "prompt": "Summarize the article in 2 sentences."
  },
  "actual": {
    "text": "The article explains how cities are redesigning streets to improve safety and reduce emissions.",
    "structured": null
  },
  "expected": {
    "contains": ["cities", "safety"],
    "notContains": [],
    "minLength": null,
    "maxLength": 140,
    "structuredKeys": []
  },
  "meta": {
    "traceId": "...",
    "createdAt": "..."
  }
}
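
The contract above carries more fields than the Example passes in, which suggests that unset expectations are filled with neutral defaults. The following is a minimal sketch of that kind of normalization, not the package's implementation. The function name `sketchNormalizeCase` is hypothetical, the default values are read off the contract shown above, and the timestamp format is an assumption.

```javascript
// Hypothetical sketch: NOT the code behind createEvaluationCase.
// It fills unset fields with the neutral defaults visible in the contract.
function sketchNormalizeCase(partial) {
  const actual = partial.actual ?? {};
  const expected = partial.expected ?? {};
  return {
    version: "1.0",
    name: partial.name,
    task: partial.task,
    input: partial.input ?? {},
    actual: {
      text: actual.text ?? "",
      structured: actual.structured ?? null
    },
    expected: {
      contains: expected.contains ?? [],
      notContains: expected.notContains ?? [],
      minLength: expected.minLength ?? null,
      maxLength: expected.maxLength ?? null,
      structuredKeys: expected.structuredKeys ?? []
    },
    meta: {
      // Assumption: the trace id comes from the caller; none is generated here.
      traceId: partial.meta?.traceId ?? null,
      // Assumption: an ISO 8601 timestamp; the README elides the real format.
      createdAt: new Date().toISOString()
    }
  };
}
```

Explicit defaults mean a scoring step can iterate over every expectation without first checking whether it was provided.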

Evaluation result contract

An evaluation result looks like this:

{
  "version": "1.0",
  "name": "summarization-basic",
  "task": "summarization",
  "status": "pass",
  "score": 1,
  "checks": [
    {
      "name": "contains:cities",
      "passed": true
    },
    {
      "name": "contains:safety",
      "passed": true
    },
    {
      "name": "maxLength",
      "passed": true
    }
  ],
  "issues": [],
  "meta": {
    "evaluatedAt": "...",
    "traceId": "..."
  }
}
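
To make the `checks` and `score` fields more concrete, here is a rough sketch of how rule-based checks of this kind could be computed. It is an illustration under stated assumptions, not the package's scoring code: `sketchRunChecks` is a hypothetical name, matching is assumed to be case-sensitive substring matching, length is assumed to be counted in JavaScript string units, and the score is assumed to be the fraction of checks that passed. The check names for `notContains`, `minLength`, and `structuredKeys` are guesses that follow the pattern of the names shown above.

```javascript
// Hypothetical sketch of rule-based checks; not the package's implementation.
function sketchRunChecks(testCase) {
  const text = testCase.actual?.text ?? "";
  const structured = testCase.actual?.structured ?? null;
  const expected = testCase.expected ?? {};
  const checks = [];

  // One check per required substring (assumed case-sensitive).
  for (const term of expected.contains ?? []) {
    checks.push({ name: `contains:${term}`, passed: text.includes(term) });
  }
  // One check per forbidden substring.
  for (const term of expected.notContains ?? []) {
    checks.push({ name: `notContains:${term}`, passed: !text.includes(term) });
  }
  // Length bounds are only checked when a bound is set.
  if (expected.minLength != null) {
    checks.push({ name: "minLength", passed: text.length >= expected.minLength });
  }
  if (expected.maxLength != null) {
    checks.push({ name: "maxLength", passed: text.length <= expected.maxLength });
  }
  // One check per key that must exist on the structured output.
  for (const key of expected.structuredKeys ?? []) {
    const present =
      structured !== null && typeof structured === "object" && key in structured;
    checks.push({ name: `structuredKeys:${key}`, passed: present });
  }

  // Assumed scoring: fraction of checks passed; no checks scores 1.
  const passedCount = checks.filter((check) => check.passed).length;
  const score = checks.length === 0 ? 1 : passedCount / checks.length;
  return { checks, score };
}
```

Applied to the summarization case from the Example, this sketch yields the same three check names as the result above, all passing, with a score of 1.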

Status model

The evaluation status is intentionally narrow:

  • pass means the case met all required checks.
  • fail means one or more required checks did not pass.
  • error means the case could not be evaluated safely.

This is useful because many AI workflows treat evaluation as a vague quality signal rather than a clear decision boundary.
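
The three statuses can be read as a small decision function. The sketch below is one possible reading, not the package's logic. Both function names are hypothetical, the condition that triggers `error` (a missing output text or a missing check list) is an assumption, and the summary shape is invented for illustration because the README does not show what `summarizeSuite` returns.

```javascript
// Hypothetical sketch: map a case and its checks to pass / fail / error.
function sketchDeriveStatus(testCase, checks) {
  // Assumption: a case without a string output, or without a check list,
  // cannot be evaluated safely and is reported as "error".
  const hasText = typeof testCase?.actual?.text === "string";
  if (!hasText || !Array.isArray(checks)) {
    return "error";
  }
  return checks.every((check) => check.passed) ? "pass" : "fail";
}

// Hypothetical sketch: count statuses across a suite of results.
function sketchSummarize(results) {
  const counts = { pass: 0, fail: 0, error: 0 };
  for (const result of results) {
    if (Object.prototype.hasOwnProperty.call(counts, result.status)) {
      counts[result.status] += 1;
    }
  }
  return { total: results.length, ...counts };
}
```

Keeping `error` separate from `fail` lets a suite distinguish "the output was wrong" from "the case itself was unusable".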

Quick wrapper example

import {
  createEvaluationCase,
  evaluateCase,
  summarizeSuite
} from "ai-contract-eval";

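// `runModel` is any async function that resolves to an object with `text`
// and, optionally, `structured` and `model`.
async function evaluateModelOutput(runModel, prompt) {
  const actual = await runModel(prompt);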

  const testCase = createEvaluationCase({
    name: "entity-extraction-basic",
    task: "extraction",
    input: { prompt },
    actual: {
      text: actual.text,
      structured: actual.structured ?? null
    },
    expected: {
      contains: ["Canada"],
      structuredKeys: ["entities"]
    },
    meta: {
      model: actual.model ?? "unknown"
    }
  });

  const result = evaluateCase(testCase);
  return {
    result,
    summary: summarizeSuite([result])
  };
}
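
To try the wrapper without calling a real model, a stub can stand in for `runModel`. The sketch below is hypothetical: the name `stubRunModel` and the canned values are made up, and the return shape (`text`, `structured`, `model`) is inferred only from the fields the wrapper reads.

```javascript
// Hypothetical stub standing in for a real model call.
// The return shape is inferred from the wrapper, not from a documented API.
async function stubRunModel(prompt) {
  return {
    text: `Canada signed the treaty. (prompt: ${prompt})`,
    structured: { entities: ["Canada"] },
    model: "stub-model"
  };
}

// Possible usage with the wrapper above (inside an async context):
//   const { result, summary } = await evaluateModelOutput(
//     stubRunModel,
//     "Extract the countries mentioned in the text."
//   );
```

Because the stub's text mentions "Canada" and its structured output has an `entities` key, the wrapper's two expectations should be satisfied by this stub.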

Non-Goals

This project does not attempt to:

  • Replace model SDKs or providers.
  • Provide judge-model evaluation frameworks.
  • Define domain-specific quality metrics for every AI task.

It focuses only on defining a consistent, lightweight contract for evaluating AI outputs.

Design Principles

This project is intentionally minimal.

It defines a small, explicit evaluation layer rather than a full platform. The goal is to provide a stable way to measure AI outputs that is easy to understand, easy to adopt, and easy to extend.

The design emphasizes:

  • Simplicity over abstraction.
  • Explicit checks over vague scoring.
  • Repeatability over novelty.
  • Composability over completeness.

This allows the evaluator to be used across different models, prompts, and workflows without introducing unnecessary complexity.

Roadmap

This project is designed as a foundation for more reliable AI systems. Future extensions may include:

  • JSON Schema export for evaluation cases and results.
  • TypeScript types for stronger developer ergonomics.
  • Integration with ai-contract-kit envelopes.
  • Integration with ai-contract-observer logs.
  • Pluggable scoring rules.
  • Golden test fixtures and replayable evaluation suites.

License

MIT