RAG Scorer
Automated RAG evaluation dataset generator. Test your RAG system with auto-generated ground truth questions.
npx rag-scorer ./your-docs
How It Works
- Sample - Randomly samples paragraphs from your document collection (see the sketch after this list)
- Generate - Uses an LLM to create questions answerable from each paragraph
- Validate - A second LLM pass verifies question quality
- Score Uniqueness - Estimates whether the answer is unique to its source paragraph (scored 0-1)
- Output - JSON dataset + beautiful HTML report
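To make the first step concrete, here is a small self-contained sketch of paragraph sampling. It only illustrates the idea: sampleParagraphs is a hypothetical name, not part of rag-scorer's API, and the real sampler also handles the other supported file types.
import { readFile, readdir } from 'node:fs/promises';
import { join } from 'node:path';
// Hypothetical illustration of step 1: randomly pick a few paragraphs
// from the markdown files in a folder, skipping very short ones.
async function sampleParagraphs(dir, perFile = 5) {
  const files = (await readdir(dir)).filter((f) => f.endsWith('.md'));
  const samples = [];
  for (const file of files) {
    const text = await readFile(join(dir, file), 'utf8');
    const paragraphs = text.split(/\n\s*\n/).filter((p) => p.trim().length > 80);
    const picked = paragraphs.sort(() => Math.random() - 0.5).slice(0, perFile);
    samples.push(...picked.map((content) => ({ file, content })));
  }
  return samples;
}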
Quick Start
# Generate evaluation dataset from your docs
npx rag-scorer ./docs
# With options
npx rag-scorer ./docs \
  --uniqueness 0.7 \
  --files 30 \
  --output my-eval-set.json
# Evaluate your RAG system
npx rag-scorer eval ./rag-eval-dataset.json \
  --endpoint http://localhost:8000/query
Installation
npm install -g rag-scorer
# or
pnpm add -g rag-scorer
Or just use npx (no install needed):
npx rag-scorer ./docs
Commands
generate (default)
Generate an evaluation dataset from your documents.
rag-scorer [generate] <source-folder> [options]
Options:
-o, --output <path> Output JSON path (default: ./rag-eval-dataset.json)
-r, --report <path> Output HTML report path (default: ./rag-eval-report.html)
--no-report Skip HTML report
-f, --files <n> Max files to sample (default: 20)
-p, --paragraphs <n> Max paragraphs per file (default: 5)
-u, --uniqueness <0-1> Min uniqueness threshold (default: 0.5)
--types <list> Question types (default: factual,definitional,procedural)
--api-key <key> Anthropic API key
--model <model> Model to use (default: claude-sonnet-4-20250514)
eval
Evaluate your RAG system against a generated dataset.
rag-scorer eval <dataset.json> --endpoint <url> [options]
Options:
-e, --endpoint <url> RAG API endpoint (required)
-o, --output <path> Output results JSON (default: ./rag-eval-results.json)
--api-key <key> Anthropic API key (for answer scoring)
Your RAG endpoint should accept POST requests with:
{ "question": "..." }
And respond with:
{
"answer": "...",
"sources": [{ "file": "...", "content": "..." }]
}
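A minimal sketch of a compatible endpoint, assuming Express (Express is not part of rag-scorer; any server works as long as the request and response shapes match):
import express from 'express';
const app = express();
app.use(express.json());
// Accepts { question } and returns { answer, sources } in the shape
// rag-scorer expects. Replace the stub with your retrieval + generation logic.
app.post('/query', async (req, res) => {
  const { question } = req.body;
  const sources = [{ file: 'docs/example.md', content: '...' }];
  res.json({ answer: `Stub answer for: ${question}`, sources });
});
app.listen(8000, () => console.log('RAG endpoint listening on :8000'));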
Question Types
- factual - Specific facts, names, numbers, dates
- definitional - "What is X?" style questions
- procedural - "How to..." questions
- comparative - Differences and similarities
- causal - Cause and effect relationships
- temporal - Time and sequence questions
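For example, to generate only comparison- and reasoning-oriented questions, pass a custom list to the --types option described above:
npx rag-scorer ./docs --types comparative,causal,temporal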
Supported File Types
- Markdown (.md, .mdx)
- Plain text (.txt)
- PDF (.pdf)
- Word documents (.docx)
- HTML (.html)
Output
Dataset JSON
{
"version": "1.0.0",
"generatedAt": "2024-01-15T...",
"questions": [
{
"id": "...",
"question": "When was the feature introduced?",
"expectedAnswer": "Version 2.0, March 2023",
"source": { "file": "changelog.md", "page": 1 },
"questionType": "temporal",
"uniquenessScore": 0.85,
"validation": { "isValid": true, "confidence": 0.92 }
}
],
"stats": {
"totalQuestionsGenerated": 50,
"totalQuestionsValid": 43,
"avgUniquenessScore": 0.78
}
}
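Because the dataset is plain JSON, it is easy to post-process before evaluation. A small sketch that keeps only validated, high-uniqueness questions (field names follow the schema above; the 0.8 cutoff and output filename are only examples):
import { readFile, writeFile } from 'node:fs/promises';
// Drop questions that failed validation or fall below a stricter uniqueness bar.
const dataset = JSON.parse(await readFile('./rag-eval-dataset.json', 'utf8'));
const strict = dataset.questions.filter(
  (q) => q.validation.isValid && q.uniquenessScore >= 0.8
);
await writeFile(
  './rag-eval-strict.json',
  JSON.stringify({ ...dataset, questions: strict }, null, 2)
);
console.log(`Kept ${strict.length} of ${dataset.questions.length} questions`);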
HTML Report
A beautiful visual report showing all generated questions, sources, and scores.
Programmatic Usage
import { runPipeline, RAGEvaluator } from 'rag-scorer';
// Generate dataset
const dataset = await runPipeline({
sourcePath: './docs',
sampling: { maxFiles: 20 },
generation: { uniquenessThreshold: 0.7 },
});
// Evaluate RAG
const evaluator = new RAGEvaluator();
const results = await evaluator.evaluateDataset(
dataset,
async (question) => {
// Your RAG query function
return { answer: '...', sources: [] };
}
);
console.log(results.summary);
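In practice the query callback usually forwards each question to a running RAG service. A sketch using the built-in fetch (Node 18+), pointed at the same example endpoint as the CLI eval command:
// Forward each question to your RAG service over HTTP.
const queryRag = async (question) => {
  const res = await fetch('http://localhost:8000/query', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ question }),
  });
  return res.json(); // expected shape: { answer, sources }
};
// Pass it as the query function: evaluator.evaluateDataset(dataset, queryRag)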
Environment Variables
ANTHROPIC_API_KEY - Your Anthropic API key
License
MIT
