bff-eval

v0.1.5

Evaluation framework for LLM systems - Node.js CLI

Bolt Foundry Evals

"The user said x. Our assistant replied y. Is that actually useful?"

Overview

Bolt Foundry Evals helps developers create graders to test the outputs of LLMs across multiple underlying base models.

Features

  • Custom Graders: Define specialized graders with a DSL that makes evaluation criteria and output formats easy to create and update.
  • Multi-Model Evaluation: Grade responses across multiple LLMs simultaneously to compare performance and consistency (powered by OpenRouter).
  • Parallel Execution: Run evaluations concurrently for faster results across multiple models and iterations.
  • Meta Grader Analysis: Calibrate and validate grader quality against ground truth scores to ensure consistent and accurate evaluations.

Quickstart

Setup

  1. Get an OpenRouter API Key:

    • Sign up for an account at OpenRouter
    • Generate an API key from your dashboard
    • Set the environment variable:
    export OPENROUTER_API_KEY="your-api-key-here"
  2. Install and run:

    npx bff-eval --help

Run evaluation with sample data:

npx bff-eval --input packages/bolt-foundry/evals/examples/sample-data.jsonl \
         --grader packages/bolt-foundry/evals/examples/json-validator.ts

Running Demos

The quickest way to get started is with the --demo flag:

# Run a specific demo
npx bff-eval --demo json-validator

# Run a random demo
npx bff-eval --demo

Demos are pre-configured examples with graders and sample data. Each demo includes:

  • grader.ts - The evaluation logic
  • samples.jsonl - Test cases with expected scores

Available demos are located in the examples/ directory.

Input data

Provide input data as a file in JSONL format.

{"userMessage": "Extract user info from: 'John Doe, 30, NYC'", "assistantResponse": "{\"name\":\"John Doe\",\"age\":30,\"city\":\"NYC\"}"}
{"userMessage": "Parse address: '123 Main St'", "assistantResponse": "{\"street\":\"123 Main St\"}"}

The types for the input data are:

type Sample = {
  userMessage: string;
  assistantResponse: string;
  id?: string;
  score?: number; // Optional: expected score for meta-evaluation (-3 to 3)
  metadata?: Record<string, unknown>; // your custom metadata to forward along to the reporter
};

type InputSampleFile = Array<Sample>;
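
A sample that fills in the optional fields from the Sample type above might look like this (the id and metadata values are placeholders, not required names):

{"id": "sample-001", "userMessage": "Extract user info from: 'John Doe, 30, NYC'", "assistantResponse": "{\"name\":\"John Doe\",\"age\":30,\"city\":\"NYC\"}", "score": 3, "metadata": {"source": "manual-review"}}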

Create your grader

Graders let you build your eval logic in a structured way. Read more in our [prompting philosophy] or the [case studies] to see why we've designed it this way.

  1. Structure your grader with an initial spec explaining what your grader will do.
  2. Add cards to explain the evaluation criteria.
  3. Include any variables as context, INCLUDING THE OUTPUT FORMAT.

To be clear, you SHOULD NOT BE INTERPOLATING ANY STRINGS IN THE SPECS. Use .context builders to safely include variables. [See why].

import { makeGraderDeckBuilder } from "../makeGraderDeckBuilder.ts";

// Create a grader that evaluates JSON outputs
export default makeGraderDeckBuilder("json-validator")
  .spec(
    "You are an expert at evaluating JSON outputs for correctness and completeness.",
  )
  .card(
    "evaluation criteria",
    (c) =>
      c.spec("Check if the output is valid JSON syntax")
        .spec("Verify all required fields are present")
        .spec("Ensure data types match expected schema"),
  )
  .card(
    "scoring guidelines",
    (c) =>
      c.spec(
        "Score 3: Perfect - Strict valid JSON that exactly matches the expected schema",
      )
        .spec(
          "Score 2: Good - Valid JSON but uses relaxed syntax (single quotes, trailing commas, etc)",
        )
        .spec(
          "Score 1: Acceptable - Valid JSON but has missing optional fields",
        )
        .spec(
          "Score -1: Poor - Valid JSON but has hallucinated/extra keys not in the input",
        )
        .spec(
          "Score -3: Failure - Not JSON at all, plain text, or doesn't parse",
        ),
  );

// Note: The makeGraderDeckBuilder automatically:
// - Appends evaluation context (userMessage, assistantResponse, expected)
// - Adds output format requirements (JSON with score and notes)
// - Handles all the boilerplate for grader evaluation
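
If you need to feed runtime values (such as an expected output format) into a grader, the idea is to attach them through the .context builders rather than interpolating them into spec strings. The sketch below illustrates the difference; the .context call shape shown here is an assumption for illustration, not the confirmed bff-eval builder API:

import { makeGraderDeckBuilder } from "../makeGraderDeckBuilder.ts";

// Suppose the expected schema arrives at runtime.
const expectedSchema = '{"name": string, "age": number}';

// Anti-pattern: interpolating a runtime value directly into a spec string.
// makeGraderDeckBuilder("json-validator")
//   .spec(`Check the output against ${expectedSchema}`);

// Instead, keep spec strings static and pass the value as context.
// NOTE: the .context(...) shape below is assumed for illustration;
// check the bff-eval source for the actual builder signature.
export default makeGraderDeckBuilder("json-validator")
  .spec("Check that the output matches the expected schema.")
  .context((c) => c.string("expectedSchema", expectedSchema));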

Model Selection

Specify which model to use for evaluation with the --model flag:

npx bff-eval --input data.jsonl --grader grader.ts --model openai/gpt-4o  # Default
npx bff-eval --input data.jsonl --grader grader.ts --model anthropic/claude-3-opus

The evaluation runs through the OpenRouter API, so any model available on OpenRouter can be used.

Output Formats

Evaluations produce results that look like this:

export interface GradingResult {
  model: string;
  id?: string;
  iteration: number;
  score: -3 | -2 | -1 | 0 | 1 | 2 | 3;
  latencyInMs: number;
  rawOutput: string;
  output: {
    score: number;
    notes?: string;
  };
  sampleMetadata?: Record<string, unknown>;
}

// Example result:
{
  model: "openai/gpt-4o",
  id: "sample-001",
  iteration: 1,
  score: 3,
  latencyInMs: 1234,
  rawOutput: "{\"score\": 3, \"notes\": \"Perfect JSON with correct schema\"}",
  output: {
    score: 3,
    notes: "Perfect JSON with correct schema"
  },
  sampleMetadata: {
    groundTruthScore: 3  // If provided in input
  }
}
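
As a rough sketch of how these results can be consumed downstream (this helper is not part of bff-eval; it only relies on the GradingResult shape shown above):

// Summarize grader output per model: sample count, average score, average latency.
interface GradingResult {
  model: string;
  id?: string;
  iteration: number;
  score: -3 | -2 | -1 | 0 | 1 | 2 | 3;
  latencyInMs: number;
  rawOutput: string;
  output: { score: number; notes?: string };
  sampleMetadata?: Record<string, unknown>;
}

function summarizeByModel(results: GradingResult[]) {
  const byModel = new Map<string, GradingResult[]>();
  for (const r of results) {
    const group = byModel.get(r.model) ?? [];
    group.push(r);
    byModel.set(r.model, group);
  }
  return [...byModel.entries()].map(([model, rs]) => ({
    model,
    samples: rs.length,
    avgScore: rs.reduce((sum, r) => sum + r.score, 0) / rs.length,
    avgLatencyMs: rs.reduce((sum, r) => sum + r.latencyInMs, 0) / rs.length,
  }));
}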

Meta Grader Analysis

Bolt Foundry Evals supports "grading the grader" by comparing grader scores against ground truth scores. This calibration feature helps you:

  1. Validate Grader Quality: Ensure your graders score consistently and accurately
  2. Improve Grader Criteria: Identify areas where grader instructions need refinement
  3. Compare Grader Versions: Measure improvements when updating graders

Adding Ground Truth Scores

Include a groundTruthScore field in your input samples:

{"userMessage": "Extract: name=John", "assistantResponse": "{\"name\":\"John\"}", "groundTruthScore": 3}
{"userMessage": "Parse: color=red", "assistantResponse": "{'color': 'red'}", "groundTruthScore": 2}
{"userMessage": "Convert: [email protected]", "assistantResponse": "[email protected]", "groundTruthScore": -3}

Calibration Metrics

When ground truth scores are provided, the eval command reports:

  • Exact Match Rate: Percentage of samples where grader score equals ground truth
  • Within ±1 Accuracy: Percentage of samples within 1 point of ground truth
  • Average Absolute Error: Mean difference between grader and ground truth scores
  • Disagreements: Specific samples where grader and ground truth differ
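
For reference, these metrics come down to the following calculations (a sketch of the definitions above, not bff-eval's actual reporting code):

// Compare grader scores against ground truth scores for the same samples.
function calibrationMetrics(
  pairs: { graderScore: number; groundTruthScore: number }[],
) {
  const exact = pairs.filter((p) => p.graderScore === p.groundTruthScore);
  const withinOne = pairs.filter(
    (p) => Math.abs(p.graderScore - p.groundTruthScore) <= 1,
  );
  const totalError = pairs.reduce(
    (sum, p) => sum + Math.abs(p.graderScore - p.groundTruthScore),
    0,
  );
  return {
    exactMatchRate: exact.length / pairs.length,        // Exact Match Rate
    withinOneAccuracy: withinOne.length / pairs.length, // Within ±1 Accuracy
    averageAbsoluteError: totalError / pairs.length,    // Average Absolute Error
    disagreements: pairs.filter((p) => p.graderScore !== p.groundTruthScore), // Disagreements
  };
}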

Example: Improving a JSON Validator

Using calibration data, we improved our JSON validator's exact match rate from 60% to 90%:

Version 1 (vague criteria):

Exact Match Rate: 60% (6/10)
Average Error: 0.80

Version 2 (precise criteria):

Exact Match Rate: 90% (9/10)
Average Error: 0.30

The key improvements:

  • Clearly distinguished between strict JSON (double quotes) and relaxed syntax (single quotes)
  • Specified exact scoring for different failure modes (-1 for extra keys, -3 for non-JSON)
  • Added precise handling of edge cases (e.g., empty JSON when data expected)