EmbedEval v2
Binary evals. Trace-centric. Error-analysis-first.
A command-line tool for evaluating LLM outputs using binary pass/fail judgments, built on Hamel Husain's evaluation principles.
📖 Full LLM Guide | 🚀 Quick Start | 📊 GitHub Pages
Why EmbedEval?
Most teams struggle with LLM evaluation because they:
- ❌ Use complex 1-5 scales (hard to agree on)
- ❌ Skip manual error analysis (miss critical failures)
- ❌ Start with expensive LLM-as-judge (waste money)
- ❌ Build evals before understanding failures (measure wrong things)
EmbedEval fixes this with Hamel Husain's proven approach:
- ✅ Binary only - PASS or FAIL, no debating
- ✅ Error analysis first - Look at traces before automating
- ✅ Cheap evals first - Assertions before LLM-as-judge
- ✅ Trace-centric - Complete session records
- ✅ Single annotator - "Benevolent dictator" model
Quick Start
Install
Option 1: Quick Install (Recommended)
curl -fsSL https://raw.githubusercontent.com/Algiras/embedeval/main/install.sh | bash
Option 2: npm (Global)
npm install -g embedeval
Option 3: npx (No Install)
npx embedeval <command>
3-Command Workflow
# 1. COLLECT - Import your LLM traces
embedeval collect ./production-logs.jsonl --output traces.jsonl
# 2. ANNOTATE - Manual error analysis (30 min for 50-100 traces)
embedeval annotate traces.jsonl --user "you@example.com"
# 3. TAXONOMY - Build failure taxonomy
embedeval taxonomy build --annotations annotations.jsonl
That's it. You'll now see:
Pass Rate: 73%
Top Failure Categories:
1. Hallucination: 12 traces (44%)
2. Incomplete: 8 traces (30%)
3. Wrong Format: 5 traces (19%)
Development (Contributors)
If you're developing or contributing to EmbedEval, use the embedeval-dev script:
# Clone the repository
git clone https://github.com/Algiras/embedeval.git
cd embedeval
# Check your dev environment
./embedeval-dev --doctor
# Install dependencies
./embedeval-dev --install-deps
# Build TypeScript
./embedeval-dev --build
# Run CLI commands (no global install needed)
./embedeval-dev collect examples/v2/sample-traces.jsonl
./embedeval-dev view test-traces.jsonl
./embedeval-dev annotate test-traces.jsonl --user "dev@local"
# Development utilities
./embedeval-dev --watch # Watch mode for auto-rebuild
./embedeval-dev --test # Run test suite
./embedeval-dev --lint # Run ESLint
./embedeval-dev --types # TypeScript type check
./embedeval-dev --clean         # Clean build artifacts
Core Commands
Error Analysis (Primary Workflow)
# Import traces from JSONL
embedeval collect ./logs.jsonl --output traces.jsonl
# Interactive annotation (p=pass, f=fail, s=save)
embedeval annotate traces.jsonl --user "you@example.com"
# Read-only viewer
embedeval view traces.jsonl
# Build failure taxonomy
embedeval taxonomy build --user "you@example.com"
# Display taxonomy
embedeval taxonomy show
Binary Evaluation
# Add evaluator (interactive wizard)
embedeval eval add
# List evaluators
embedeval eval list
# Run evaluations
embedeval eval run traces.jsonl --config evals.yaml
# Generate report
embedeval eval report --results results.jsonl
Synthetic Data & Export
# Create dimensions template
embedeval generate init
# Generate synthetic traces
embedeval generate create --dimensions dims.yaml --count 50
# Export to Jupyter notebook
embedeval export traces.jsonl --format notebook
# Generate HTML dashboard
embedeval report --traces traces.jsonl --annotations annotations.jsonl
Key Principles
1. Binary Only ✓/✗
# GOOD: Clear, fast decisions
evals:
  - name: is_accurate
    type: llm-judge
    binary: true  # Only PASS or FAIL
# BAD: Never do this
evals:
  - name: quality_score
    type: 1_to_5  # Creates disagreement
2. Error Analysis First 👀
# Spend 60-80% of time here:
embedeval annotate traces.jsonl --user "you@example.com"
# NOT here (automate only after understanding):
# embedeval eval run traces.jsonl   # (do this AFTER annotation)
3. Cheap Evals First 💰
evals:
  # Run cheap evals first
  - name: has_content
    type: assertion
    check: "response.length > 100"
    priority: cheap
  # Expensive evals only for complex cases
  - name: factual_accuracy
    type: llm-judge
    priority: expensive
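The cheap assertion above is the kind of check you can reproduce outside the tool. A minimal sketch of the same length check, assuming jq is installed and traces use the JSONL shape shown under Data Formats:
# List ids of traces whose response is 100 characters or shorter
# (the same condition the has_content assertion flags)
jq -r 'select((.response | length) <= 100) | .id' traces.jsonl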
# One "benevolent dictator" owns quality:
embedeval annotate traces.jsonl --user "[email protected]"
# Not multiple people voting (causes conflict)Installation Methods
NPM (Recommended)
npm install -g embedeval
NPX (No Install)
npx embedeval collect ./logs.jsonl
From Source
git clone https://github.com/Algiras/embedeval.git
cd embedeval
npm install
npm run build
npm link  # Makes 'embedeval' command available globally
Data Formats
Trace (JSONL)
One JSON object per line:
{"id": "trace-001", "timestamp": "2026-01-30T10:00:00Z", "query": "What's your refund policy?", "response": "We offer full refunds within 30 days...", "metadata": {"provider": "google", "model": "gemini-1.5-flash", "latency": 180}}Annotation (JSONL)
{"id": "ann-001", "traceId": "trace-001", "annotator": "[email protected]", "timestamp": "2026-01-30T10:05:00Z", "label": "fail", "failureCategory": "hallucination", "notes": "Made up refund time limit"}Eval Config (YAML)
Eval Config (YAML)
evals:
  - id: has_content
    type: assertion
    priority: cheap
    config:
      check: "response.length > 50"
  - id: accurate
    type: llm-judge
    priority: expensive
    config:
      model: gemini-1.5-flash
      prompt: "PASS or FAIL: Is this accurate?"
      binary: true
Example Workflows
Weekly Evaluation
# Collect week's traces
embedeval collect ./logs/week-$(date +%Y-%m-%d).jsonl --output traces.jsonl
# Sample 100 traces at random for annotation
shuf -n 100 traces.jsonl > sample.jsonl
# Annotate
embedeval annotate sample.jsonl --user "you@example.com"
# Build/update taxonomy
embedeval taxonomy update
# Run all evals
embedeval eval run traces.jsonl --config evals.yaml
# Generate report
embedeval report --traces traces.jsonl --annotations annotations.jsonl
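To run the collection step unattended, a minimal cron sketch (the schedule and project path are placeholders, not something EmbedEval sets up for you):
# Every Monday at 09:00, collect the previous week's traces
# (% must be escaped in crontab entries)
0 9 * * 1 cd /path/to/project && embedeval collect ./logs/week-$(date +\%Y-\%m-\%d).jsonl --output traces.jsonl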
Add Eval for Common Failure
# 1. Build taxonomy to see top failures
embedeval taxonomy build
# 2. Add eval for top category (e.g., hallucination)
embedeval eval add
# Interactive wizard asks for type, model, prompt
# 3. Run the new eval
embedeval eval run traces.jsonl --config evals.yaml
Generate Test Data
# 1. Create dimensions file
embedeval generate init
# Edit dimensions.yaml to define test scenarios
# 2. Generate synthetic traces
embedeval generate create -d dimensions.yaml -n 50
# 3. Run your system on synthetic queries
# (Implementation depends on your system)
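# One possible shape for step 3, as a sketch only: your-llm-app and
# completed-traces.jsonl are hypothetical placeholders, not part of EmbedEval.
jq -r '.query' synthetic-traces.jsonl | while IFS= read -r query; do
  response=$(your-llm-app "$query")
  jq -cn --arg q "$query" --arg r "$response" '{query: $q, response: $r}'
done > completed-traces.jsonl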
# 4. Evaluate
embedeval annotate synthetic-traces.jsonl --user "you@example.com"
MCP Server (AI Agents)
For Claude, Cursor, or other MCP clients:
{
  "mcpServers": {
    "embedeval": {
      "command": "npx",
      "args": ["embedeval", "mcp-server"],
      "env": {
        "GEMINI_API_KEY": "your-api-key"
      }
    }
  }
}
See LLM.md for detailed agent usage guide.
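To check the server starts before wiring it into a client, you can run the same invocation the config above uses (the key value is a placeholder):
GEMINI_API_KEY=your-api-key npx embedeval mcp-server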
Deployment
Vercel (One-Click)
GitHub Pages
Already configured. Site updates automatically on push to main.
CI/CD
GitHub Actions workflow included. Runs on every PR:
- Type checking
- Build verification
- CLI command tests
Documentation
- LLM.md - Complete guide for AI agents and LLMs
- GitHub Pages - Visual documentation
- Hamel's Eval FAQ - Methodology reference
Features
- ✅ Interactive Annotation - Terminal UI for fast binary annotation
- ✅ Failure Taxonomy - Auto-categorize failures with axial coding
- ✅ Binary Evaluation - Assertions, regex, code, LLM-as-judge
- ✅ Synthetic Data - Dimension-based test generation
- ✅ Jupyter Export - Statistical analysis notebooks
- ✅ HTML Reports - Shareable dashboards
- ✅ JSONL Storage - Simple, grep-friendly, no databases
- ✅ Zero Infrastructure - No Redis, no queues, no setup
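"Grep-friendly" is meant literally: since every record is one JSON object per line, quick questions need nothing beyond the shell, e.g. (assuming the field formatting shown under Data Formats):
# How many annotated traces failed?
grep -c '"label": "fail"' annotations.jsonl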
Why v2?
v1 had 88 files with complex A/B testing, genetic algorithms, and BullMQ queues.
v2 has ~20 files with a simple philosophy: look at your traces first.
Before: Infrastructure-heavy, hard to understand
After: Simple CLI, clear workflow, Hamel Husain principles
License
MIT
Built with ❤️ following Hamel Husain's principles. The goal is understanding failures, not perfect metrics. Spend time looking at traces! 👀
