# Checkmate FactBench
A CLI tool for benchmarking LLMs on factual accuracy using the FEVER dataset. Evaluate how well language models can classify claims as SUPPORTS, REFUTES, or NOT ENOUGH INFO.
## Features
- 🎯 FEVER Dataset Evaluation: Benchmark multiple LLMs on factual claim verification
- 📊 Real-time Progress: Live terminal UI showing evaluation progress and metrics
- 💾 Smart Caching: Avoids re-evaluating the same examples across runs
- 📈 Detailed Reports: Generates markdown reports with confusion matrices and accuracy metrics
- ⚡ Concurrent Evaluation: Configurable concurrency for faster evaluation
- 🔄 Multiple Models: Evaluate multiple models in a single run
## Installation
Prerequisites: This tool requires Bun to be installed.
```
npm install -g checkmate-factbench
```

or

```
bun add -g checkmate-factbench
```

After installation, you can run the CLI with:
```
checkmate-factbench --file val/train.jsonl --limit 10
```
## Quick Start

- Set your OpenRouter API key:

```
export OPENROUTER_API_KEY=your_api_key_here
```

Get your API key from openrouter.ai.
- Prepare your dataset:

Your dataset should be a JSONL file where each line is a JSON object with:

- `id`: Unique identifier
- `claim`: The claim to evaluate
- `label`: One of `"SUPPORTS"`, `"REFUTES"`, or `"NOT ENOUGH INFO"`
- `verifiable`: (optional) `"VERIFIABLE"` or `"NOT VERIFIABLE"`
Example (`val/train.jsonl`):

```
{"id": 1, "claim": "Barack Obama was born in Hawaii.", "label": "SUPPORTS", "verifiable": "VERIFIABLE"}
{"id": 2, "claim": "The moon is made of cheese.", "label": "REFUTES", "verifiable": "VERIFIABLE"}
{"id": 3, "claim": "Some unknown fact about X.", "label": "NOT ENOUGH INFO", "verifiable": "NOT VERIFIABLE"}
```

- Run the evaluation:
```
checkmate-factbench --file val/train.jsonl --limit 10
```
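If you are starting from the official FEVER release rather than a ready-made file, a short script can trim its records down to the fields this tool reads. This is a minimal sketch, not part of the package, assuming the standard FEVER fields (`id`, `claim`, `label`, `verifiable`); the file paths are placeholders:

```ts
// convert-fever.ts — trim FEVER records down to the fields
// checkmate-factbench expects. Sketch only; paths are placeholders.
import { readFileSync, writeFileSync } from "node:fs";

const lines = readFileSync("fever-train.jsonl", "utf8")
  .split("\n")
  .filter((l) => l.trim().length > 0);

const out = lines.map((line) => {
  const r = JSON.parse(line);
  return JSON.stringify({
    id: r.id,
    claim: r.claim,
    label: r.label, // "SUPPORTS" | "REFUTES" | "NOT ENOUGH INFO"
    verifiable: r.verifiable,
  });
});

writeFileSync("val/train.jsonl", out.join("\n") + "\n");
console.log(`wrote ${out.length} examples`);
```

Run it with `bun convert-fever.ts`, then point `--file` at the output.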
## Usage

```
checkmate-factbench [options]
```

### Options
- `--file <path>` - Path to JSONL dataset file (default: `val/train.jsonl`)
- `--limit <n>` - Number of examples to evaluate per model (default: `10`)
- `--models <csv>` - Comma-separated OpenRouter model IDs (default: see below)
- `--out <path>` - Output markdown report path (optional, defaults to `runs/<timestamp>.md`)
- `--concurrency <n>` - Number of concurrent requests per model (default: `2`)
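For intuition, `--concurrency <n>` means at most `n` requests are in flight at once for a given model. The tool's internal implementation isn't shown here; the effect is that of a simple promise pool, sketched below for illustration only:

```ts
// Sketch of what a concurrency cap means: at most `limit` calls to
// `fn` are in flight at any moment. Illustration, not the tool's code.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker() {
    // Each worker pulls the next unclaimed index until none remain.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```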
### Default Models
If `--models` is not specified, the following models are evaluated:
- `meta-llama/llama-3.3-70b-instruct:free`
- `nousresearch/hermes-3-llama-3.1-405b:free`
- `google/gemini-2.0-flash-exp:free`
- `google/gemma-3-12b-it:free`
- `mistralai/mistral-small-3.1-24b-instruct:free`
## Examples
Evaluate 50 examples with default models:

```
checkmate-factbench --file val/train.jsonl --limit 50
```

Evaluate specific models:

```
checkmate-factbench --file val/train.jsonl --models "google/gemini-2.0-flash-exp:free,meta-llama/llama-3.3-70b-instruct:free" --limit 20
```

Custom output path and higher concurrency:

```
checkmate-factbench --file val/train.jsonl --limit 100 --out results.md --concurrency 5
```
## Output

The CLI generates:

- Markdown Report (`runs/<timestamp>.md` or custom `--out` path):
  - Summary statistics for each model
  - Confusion matrices
  - Accuracy and invalid rate metrics
  - Per-model breakdowns
- Raw Results (`runs/<timestamp>/raw/`):
  - JSONL files for each model with detailed evaluation results
  - Each file contains all predictions, latencies, and metadata (see the sketch after this list)
- Cache (`.cache/`):
  - Cached results to avoid re-evaluating the same examples
  - Speeds up subsequent runs with overlapping datasets
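Because the raw files are plain JSONL, you can post-process them yourself. Below is a sketch of recomputing accuracy and invalid rate from one raw file; the `gold` and `pred` field names are assumptions, so inspect an actual file under `runs/<timestamp>/raw/` and adjust to match:

```ts
// recompute-metrics.ts — re-derive accuracy and invalid rate from a
// raw results JSONL. Field names (gold, pred) are assumptions.
import { readFileSync } from "node:fs";

const LABELS = new Set(["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]);

const rows = readFileSync(process.argv[2], "utf8")
  .split("\n")
  .filter((l) => l.trim())
  .map((l) => JSON.parse(l));

let correct = 0;
let invalid = 0;
for (const r of rows) {
  if (!LABELS.has(r.pred)) invalid++;       // model answered off-label
  else if (r.pred === r.gold) correct++;    // matches the gold label
}

console.log(`accuracy: ${((100 * correct) / rows.length).toFixed(1)}%`);
console.log(`invalid:  ${((100 * invalid) / rows.length).toFixed(1)}%`);
```

Usage: `bun recompute-metrics.ts runs/<timestamp>/raw/<model>.jsonl`.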
## Live UI
During evaluation, you'll see a live terminal UI showing:
- Current model being evaluated
- Progress (completed/total examples)
- Real-time accuracy for each model
- Output file paths
- Summary statistics as they're computed
Example output:

```
Checkmate FactBench — OpenRouter validation
file: val/train.jsonl | limit: 10 | concurrency: 2
models: google/gemini-2.0-flash-exp:free, meta-llama/llama-3.3-70b-instruct:free
labels: SUPPORTS, REFUTES, NOT ENOUGH INFO
---
current: google/gemini-2.0-flash-exp:free (7/10)
outputs: runs/2025-12-12T18-33-54-221Z.md (raw: runs/2025-12-12T18-33-54-221Z/raw/)
---
google/gemini-2.0-flash-exp:free: acc 80.0% (8/10) | invalid 0.0%
meta-llama/llama-3.3-70b-instruct:free: pending…
Running…
```
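In-place status displays like this are usually drawn by moving the cursor back up over the previous frame with ANSI escapes and rewriting it. The sketch below shows the general technique, not the tool's actual renderer:

```ts
// General technique for an in-place status block: move the cursor up
// over the previous frame and redraw. Illustration only.
function makeRenderer() {
  let prevLines = 0;
  return (frame: string) => {
    // \x1b[NA moves the cursor up N lines; \x1b[0J clears to screen end.
    if (prevLines > 0) process.stdout.write(`\x1b[${prevLines}A\x1b[0J`);
    process.stdout.write(frame + "\n");
    prevLines = frame.split("\n").length;
  };
}

const render = makeRenderer();
render("current: model-a (3/10)\nmodel-b: pending…");
setTimeout(() => render("current: model-a (4/10)\nmodel-b: pending…"), 500);
```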
## Environment Variables

- `OPENROUTER_API_KEY` (required) - Your OpenRouter API key
## Development
To run from source:
```
# Install dependencies
bun install

# Run the CLI
OPENROUTER_API_KEY=your_key bun run dev -- --file val/train.jsonl --limit 10

# Build
bun run build
```
## Dataset Format

The JSONL file should contain one JSON object per line:
```ts
{
  id: number | string;                               // Unique identifier
  claim: string;                                     // The claim to evaluate
  label: "SUPPORTS" | "REFUTES" | "NOT ENOUGH INFO"; // Gold label
  verifiable?: "VERIFIABLE" | "NOT VERIFIABLE";      // Optional
}
```
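Before a long run, it can be worth checking every line against this shape. A minimal validation sketch, not part of the package:

```ts
// validate-dataset.ts — flag lines that don't match the shape above.
// Sketch only; adjust the path as needed.
import { readFileSync } from "node:fs";

const LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"];

function isValid(r: any): boolean {
  return (
    (typeof r.id === "number" || typeof r.id === "string") &&
    typeof r.claim === "string" &&
    LABELS.includes(r.label) &&
    (r.verifiable === undefined ||
      r.verifiable === "VERIFIABLE" ||
      r.verifiable === "NOT VERIFIABLE")
  );
}

const lines = readFileSync("val/train.jsonl", "utf8")
  .split("\n")
  .filter((l) => l.trim());

lines.forEach((line, i) => {
  try {
    if (!isValid(JSON.parse(line))) console.error(`line ${i + 1}: bad fields`);
  } catch {
    console.error(`line ${i + 1}: invalid JSON`);
  }
});
```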
## How It Works

- Load Dataset: Reads examples from the specified JSONL file
- Check Cache: Looks for previously evaluated examples to avoid redundant API calls
- Evaluate Models: For each model:
  - Evaluates cached examples (instant)
  - Evaluates new examples via OpenRouter API
  - Updates cache with new results
- Generate Reports: Creates markdown report and saves raw JSONL results
- Display Results: Shows live progress and final statistics
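The per-example call in step 3 goes through OpenRouter's OpenAI-compatible chat-completions endpoint. Here is a sketch of that core step; the prompt wording and response handling are assumptions, not the tool's actual code:

```ts
// Sketch of one evaluation call: ask a model over OpenRouter to
// classify a single claim. Endpoint is OpenRouter's standard
// chat-completions API; the prompt is an illustrative assumption.
async function classifyClaim(model: string, claim: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [
        {
          role: "user",
          content:
            `Classify the claim as SUPPORTS, REFUTES, or NOT ENOUGH INFO. ` +
            `Answer with the label only.\n\nClaim: ${claim}`,
        },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content.trim();
}

// Example:
console.log(
  await classifyClaim(
    "google/gemini-2.0-flash-exp:free",
    "Barack Obama was born in Hawaii.",
  ),
);
```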
## Publishing
This package can be published manually or automatically via GitHub Actions.
### Manual Publishing
```
npm publish
```

### Publishing via GitHub Actions
See `.github/PUBLISH.md` for detailed instructions on setting up automated publishing.
Quick steps:
- Create an npm access token with "Publish packages" permission
- Add it as `NPM_TOKEN` secret in GitHub repository settings
- Create a version tag: `git tag v0.1.0 && git push origin v0.1.0`
- GitHub Actions will automatically build and publish to npm
## License
MIT
