checkmate-factbench

v0.2.0

Checkmate FactBench is a factual accuracy evaluation framework that benchmarks LLMs using the FEVER dataset. It measures how well a model can classify claims (Supported / Refuted / NEI) and cite the correct evidence from Wikipedia.

Checkmate FactBench

A CLI tool for benchmarking LLMs on factual accuracy using the FEVER dataset. Evaluate how well language models can classify claims as SUPPORTS, REFUTES, or NOT ENOUGH INFO.

Features

  • 🎯 FEVER Dataset Evaluation: Benchmark multiple LLMs on factual claim verification
  • 📊 Real-time Progress: Live terminal UI showing evaluation progress and metrics
  • 💾 Smart Caching: Avoids re-evaluating the same examples across runs
  • 📈 Detailed Reports: Generates markdown reports with confusion matrices and accuracy metrics
  • ⚡ Concurrent Evaluation: Configurable concurrency for faster evaluation
  • 🔄 Multiple Models: Evaluate multiple models in a single run

Installation

Prerequisites: This tool requires Bun to be installed.

npm install -g checkmate-factbench

or

bun add -g checkmate-factbench

After installation, you can run the CLI with:

checkmate-factbench --file val/train.jsonl --limit 10

Quick Start

  1. Set your OpenRouter API key:
export OPENROUTER_API_KEY=your_api_key_here

Get your API key from openrouter.ai

  2. Prepare your dataset:

Your dataset should be a JSONL file where each line is a JSON object with:

  • id: Unique identifier
  • claim: The claim to evaluate
  • label: One of "SUPPORTS", "REFUTES", or "NOT ENOUGH INFO"
  • verifiable: (optional) "VERIFIABLE" or "NOT VERIFIABLE"

Example (val/train.jsonl):

{"id": 1, "claim": "Barack Obama was born in Hawaii.", "label": "SUPPORTS", "verifiable": "VERIFIABLE"}
{"id": 2, "claim": "The moon is made of cheese.", "label": "REFUTES", "verifiable": "VERIFIABLE"}
{"id": 3, "claim": "Some unknown fact about X.", "label": "NOT ENOUGH INFO", "verifiable": "NOT VERIFIABLE"}

  3. Run the evaluation:
checkmate-factbench --file val/train.jsonl --limit 10

Usage

checkmate-factbench [options]

Options

  • --file <path> - Path to JSONL dataset file (default: val/train.jsonl)
  • --limit <n> - Number of examples to evaluate per model (default: 10)
  • --models <csv> - Comma-separated OpenRouter model IDs (default: see below)
  • --out <path> - Output markdown report path (default: runs/<timestamp>.md)
  • --concurrency <n> - Number of concurrent requests per model (default: 2)
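
The --concurrency flag bounds how many requests are in flight per model at once. A bound like this can be implemented in a few lines of TypeScript; the sketch below is illustrative only, and mapWithConcurrency is not part of the package:

// Sketch: run async tasks with at most `limit` in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker repeatedly claims the next unprocessed index.
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, () => worker()));
  return results;
}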

Default Models

If --models is not specified, the following models are evaluated:

  • meta-llama/llama-3.3-70b-instruct:free
  • nousresearch/hermes-3-llama-3.1-405b:free
  • google/gemini-2.0-flash-exp:free
  • google/gemma-3-12b-it:free
  • mistralai/mistral-small-3.1-24b-instruct:free

Examples

Evaluate 50 examples with default models:

checkmate-factbench --file val/train.jsonl --limit 50

Evaluate specific models:

checkmate-factbench --file val/train.jsonl --models "google/gemini-2.0-flash-exp:free,meta-llama/llama-3.3-70b-instruct:free" --limit 20

Custom output path and higher concurrency:

checkmate-factbench --file val/train.jsonl --limit 100 --out results.md --concurrency 5

Output

The CLI generates:

  1. Markdown Report (runs/<timestamp>.md or custom --out path):

    • Summary statistics for each model
    • Confusion matrices
    • Accuracy and invalid rate metrics
    • Per-model breakdowns
  2. Raw Results (runs/<timestamp>/raw/):

    • JSONL files for each model with detailed evaluation results
    • Each file contains all predictions, latencies, and metadata
  3. Cache (.cache/):

    • Cached results to avoid re-evaluating the same examples
    • Speeds up subsequent runs with overlapping datasets
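
For intuition, the accuracy and invalid-rate numbers in the report could be computed roughly as below. This is a minimal sketch; the RawResult type and summarize function are illustrative, not the package's internal API:

// Sketch: accuracy, invalid rate, and confusion counts from raw results.
type Label = "SUPPORTS" | "REFUTES" | "NOT ENOUGH INFO";

interface RawResult {
  gold: Label;
  predicted: Label | null; // null when the model's reply couldn't be parsed
}

function summarize(results: RawResult[]) {
  let correct = 0;
  let invalid = 0;
  const confusion = new Map<string, number>(); // "gold->predicted" -> count
  for (const r of results) {
    if (r.predicted === null) {
      invalid++;
      continue;
    }
    const key = `${r.gold}->${r.predicted}`;
    confusion.set(key, (confusion.get(key) ?? 0) + 1);
    if (r.predicted === r.gold) correct++;
  }
  return {
    accuracy: correct / results.length,
    invalidRate: invalid / results.length,
    confusion,
  };
}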

Live UI

During evaluation, you'll see a live terminal UI showing:

  • Current model being evaluated
  • Progress (completed/total examples)
  • Real-time accuracy for each model
  • Output file paths
  • Summary statistics as they're computed

Example output:

Checkmate FactBench — OpenRouter validation
file: val/train.jsonl | limit: 10 | concurrency: 2
models: google/gemini-2.0-flash-exp:free, meta-llama/llama-3.3-70b-instruct:free
labels: SUPPORTS, REFUTES, NOT ENOUGH INFO
---
current: google/gemini-2.0-flash-exp:free (7/10)
outputs: runs/2025-12-12T18-33-54-221Z.md (raw: runs/2025-12-12T18-33-54-221Z/raw/)
---
google/gemini-2.0-flash-exp:free: acc 80.0% (8/10) | invalid 0.0%
meta-llama/llama-3.3-70b-instruct:free: pending…
Running…
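
A live display like this typically just rewrites the screen on a timer using ANSI escape codes. The sketch below shows the general approach; it is illustrative and not the package's implementation:

// Sketch: redraw a status line periodically with ANSI escape codes.
const state = { model: "google/gemini-2.0-flash-exp:free", done: 0, total: 10 };

const timer = setInterval(() => {
  state.done = Math.min(state.done + 1, state.total); // stand-in for real progress
  process.stdout.write("\x1b[2J\x1b[H"); // clear screen, move cursor home
  process.stdout.write(`current: ${state.model} (${state.done}/${state.total})\n`);
  if (state.done >= state.total) clearInterval(timer);
}, 250);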

Environment Variables

  • OPENROUTER_API_KEY (required) - Your OpenRouter API key

Development

To run from source:

# Install dependencies
bun install

# Run the CLI
OPENROUTER_API_KEY=your_key bun run dev -- --file val/train.jsonl --limit 10

# Build
bun run build

Dataset Format

The JSONL file should contain one JSON object per line, each with the following shape (TypeScript notation):

{
  id: number | string;           // Unique identifier
  claim: string;                  // The claim to evaluate
  label: "SUPPORTS" | "REFUTES" | "NOT ENOUGH INFO";  // Gold label
  verifiable?: "VERIFIABLE" | "NOT VERIFIABLE";       // Optional
}
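
To sanity-check a dataset before a long run, you could validate it with a short standalone script in Bun-flavored TypeScript (an illustrative sketch, separate from the CLI itself; Bun.file is Bun's file API):

// Sketch: validate every line of a JSONL dataset before evaluating.
const VALID_LABELS = new Set(["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]);

const text = await Bun.file("val/train.jsonl").text();
const lines = text.split("\n").filter((l) => l.trim().length > 0);

lines.forEach((line, i) => {
  const ex = JSON.parse(line);
  if (ex.id === undefined || typeof ex.claim !== "string" || !VALID_LABELS.has(ex.label)) {
    throw new Error(`Invalid example on line ${i + 1}`);
  }
});
console.log(`OK: ${lines.length} examples`);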

How It Works

  1. Load Dataset: Reads examples from the specified JSONL file
  2. Check Cache: Looks for previously evaluated examples to avoid redundant API calls
  3. Evaluate Models: For each model:
    • Reuses cached results where available (instant)
    • Evaluates new examples via OpenRouter API
    • Updates cache with new results
  4. Generate Reports: Creates markdown report and saves raw JSONL results
  5. Display Results: Shows live progress and final statistics
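
In outline, the cache-aware evaluation of a single example might look like the sketch below. The cache layout, prompt wording, and function name are assumptions, not the package's source; the OpenRouter endpoint and request shape are the standard chat-completions API:

// Sketch: evaluate one claim with one model, consulting a file cache first.
import { mkdirSync, existsSync, readFileSync, writeFileSync } from "node:fs";

async function evaluateExample(model: string, claim: string): Promise<string> {
  const cachePath = `.cache/${encodeURIComponent(model)}-${Bun.hash(claim)}.json`;
  if (existsSync(cachePath)) return JSON.parse(readFileSync(cachePath, "utf8"));

  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [{
        role: "user",
        content: `Classify this claim as SUPPORTS, REFUTES, or NOT ENOUGH INFO: ${claim}`,
      }],
    }),
  });
  const data = await res.json();
  const answer: string = data.choices[0].message.content;

  mkdirSync(".cache", { recursive: true });
  writeFileSync(cachePath, JSON.stringify(answer));
  return answer;
}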

Publishing

This package can be published manually or automatically via GitHub Actions.

Manual Publishing

npm publish

Publishing via GitHub Actions

See .github/PUBLISH.md for detailed instructions on setting up automated publishing.

Quick steps:

  1. Create an npm access token with "Publish packages" permission
  2. Add it as NPM_TOKEN secret in GitHub repository settings
  3. Create a version tag: git tag v0.1.0 && git push origin v0.1.0
  4. GitHub Actions will automatically build and publish to npm

License

MIT