
embedeval v2.0.5

Binary evals, trace-centric, error-analysis-first CLI for LLM evaluation. Built on Hamel Husain's principles.

Downloads: 126

EmbedEval v2

Binary evals. Trace-centric. Error-analysis-first.


A command-line tool for evaluating LLM outputs using binary pass/fail judgments, built on Hamel Husain's evaluation principles.

📖 Full LLM Guide | 🚀 Quick Start | 📊 GitHub Pages


Why EmbedEval?

Most teams struggle with LLM evaluation because they:

  • ❌ Use complex 1-5 scales (hard to agree on)
  • ❌ Skip manual error analysis (miss critical failures)
  • ❌ Start with expensive LLM-as-judge (waste money)
  • ❌ Build evals before understanding failures (measure wrong things)

EmbedEval fixes this with Hamel Husain's proven approach:

  • Binary only - PASS or FAIL, no debating
  • Error analysis first - Look at traces before automating
  • Cheap evals first - Assertions before LLM-as-judge
  • Trace-centric - Complete session records
  • Single annotator - "Benevolent dictator" model

Quick Start

Install

Option 1: Quick Install (Recommended)

curl -fsSL https://raw.githubusercontent.com/Algiras/embedeval/main/install.sh | bash

Option 2: npm (Global)

npm install -g embedeval

Option 3: npx (No Install)

npx embedeval <command>

3-Command Workflow

# 1. COLLECT - Import your LLM traces
embedeval collect ./production-logs.jsonl --output traces.jsonl

# 2. ANNOTATE - Manual error analysis (30 min for 50-100 traces)
embedeval annotate traces.jsonl --user "[email protected]"

# 3. TAXONOMY - Build failure taxonomy
embedeval taxonomy build --annotations annotations.jsonl

That's it. You'll now see:

Pass Rate: 73%

Top Failure Categories:
1. Hallucination: 12 traces (44%)
2. Incomplete: 8 traces (30%)
3. Wrong Format: 5 traces (19%)
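Because annotations are stored as plain JSONL, the same summary can be reproduced with a few lines of Python. This is an illustrative sketch, assuming the `label` and `failureCategory` fields shown under Data Formats below; it is not part of EmbedEval itself:

```python
import json
from collections import Counter

def summarize(path):
    """Tally pass rate and failure categories from an annotations JSONL file."""
    labels, categories = [], Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            ann = json.loads(line)
            labels.append(ann["label"])
            if ann["label"] == "fail":
                categories[ann.get("failureCategory", "uncategorized")] += 1
    rate = 100 * labels.count("pass") / len(labels) if labels else 0.0
    return rate, categories.most_common()
```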

Development (Contributors)

If you're developing or contributing to EmbedEval, use the embedeval-dev script:

# Clone the repository
git clone https://github.com/Algiras/embedeval.git
cd embedeval

# Check your dev environment
./embedeval-dev --doctor

# Install dependencies
./embedeval-dev --install-deps

# Build TypeScript
./embedeval-dev --build

# Run CLI commands (no global install needed)
./embedeval-dev collect examples/v2/sample-traces.jsonl
./embedeval-dev view test-traces.jsonl
./embedeval-dev annotate test-traces.jsonl --user "dev@local"

# Development utilities
./embedeval-dev --watch     # Watch mode for auto-rebuild
./embedeval-dev --test      # Run test suite
./embedeval-dev --lint      # Run ESLint
./embedeval-dev --types     # TypeScript type check
./embedeval-dev --clean     # Clean build artifacts

Core Commands

Error Analysis (Primary Workflow)

# Import traces from JSONL
embedeval collect ./logs.jsonl --output traces.jsonl

# Interactive annotation (p=pass, f=fail, s=save)
embedeval annotate traces.jsonl --user "[email protected]"

# Read-only viewer
embedeval view traces.jsonl

# Build failure taxonomy
embedeval taxonomy build --user "[email protected]"

# Display taxonomy
embedeval taxonomy show

Binary Evaluation

# Add evaluator (interactive wizard)
embedeval eval add

# List evaluators
embedeval eval list

# Run evaluations
embedeval eval run traces.jsonl --config evals.yaml

# Generate report
embedeval eval report --results results.jsonl

Synthetic Data & Export

# Create dimensions template
embedeval generate init

# Generate synthetic traces
embedeval generate create --dimensions dimensions.yaml --count 50

# Export to Jupyter notebook
embedeval export traces.jsonl --format notebook

# Generate HTML dashboard
embedeval report --traces traces.jsonl --annotations annotations.jsonl

Key Principles

1. Binary Only ✓/✗

# GOOD: Clear, fast decisions
evals:
  - name: is_accurate
    type: llm-judge
    binary: true  # Only PASS or FAIL

# BAD: Never do this
evals:
  - name: quality_score
    type: 1_to_5  # Creates disagreement

2. Error Analysis First 👀

# Spend 60-80% of time here:
embedeval annotate traces.jsonl --user "[email protected]"

# NOT here (automate only after understanding):
# embedeval eval run traces.jsonl  # (do this AFTER annotation)

3. Cheap Evals First 💰

evals:
  # Run cheap evals first
  - name: has_content
    type: assertion
    check: "response.length > 100"
    priority: cheap
  
  # Expensive evals only for complex cases
  - name: factual_accuracy
    type: llm-judge
    priority: expensive
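The ordering above can be sketched as a short-circuiting loop: cheap assertion checks run first, and the expensive judge is only consulted when every cheap check passes. The evaluator functions here are hypothetical stand-ins, not EmbedEval internals:

```python
def has_content(trace):
    # Cheap assertion: mirrors check: "response.length > 100"
    return len(trace["response"]) > 100

def factual_accuracy(trace):
    # Placeholder for an expensive LLM-as-judge call
    raise NotImplementedError("call your judge model here")

def evaluate(trace, cheap_evals, expensive_evals):
    """Run cheap evals first; skip expensive ones if a cheap eval already failed."""
    for check in cheap_evals:
        if not check(trace):
            return "fail"  # short-circuit: no judge call needed
    for check in expensive_evals:
        if not check(trace):
            return "fail"
    return "pass"
```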

4. Single Annotator 👤

# One "benevolent dictator" owns quality:
embedeval annotate traces.jsonl --user "[email protected]"

# Not multiple people voting (causes conflict)

Installation Methods

NPM (Recommended)

npm install -g embedeval

NPX (No Install)

npx embedeval collect ./logs.jsonl

From Source

git clone https://github.com/Algiras/embedeval.git
cd embedeval
npm install
npm run build
npm link  # Makes 'embedeval' command available globally

Data Formats

Trace (JSONL)

One JSON object per line:

{"id": "trace-001", "timestamp": "2026-01-30T10:00:00Z", "query": "What's your refund policy?", "response": "We offer full refunds within 30 days...", "metadata": {"provider": "google", "model": "gemini-1.5-flash", "latency": 180}}
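Because each trace is an independent JSON object per line, malformed files are easy to screen before annotation. A minimal validator might look like this; treating `id`, `query`, and `response` as required is an assumption based on the example above:

```python
import json

REQUIRED = ("id", "query", "response")

def invalid_lines(path):
    """Yield (line_number, reason) for lines that fail basic trace checks."""
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines are ignored, not errors
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                yield n, "not valid JSON"
                continue
            missing = [k for k in REQUIRED if k not in obj]
            if missing:
                yield n, "missing fields: " + ", ".join(missing)
```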

Annotation (JSONL)

{"id": "ann-001", "traceId": "trace-001", "annotator": "[email protected]", "timestamp": "2026-01-30T10:05:00Z", "label": "fail", "failureCategory": "hallucination", "notes": "Made up refund time limit"}

Eval Config (YAML)

evals:
  - id: has_content
    type: assertion
    priority: cheap
    config:
      check: "response.length > 50"
  
  - id: accurate
    type: llm-judge
    priority: expensive
    config:
      model: gemini-1.5-flash
      prompt: "PASS or FAIL: Is this accurate?"
      binary: true

Example Workflows

Weekly Evaluation

# Collect week's traces
embedeval collect ./logs/week-$(date +%Y-%m-%d).jsonl --output traces.jsonl

# Take the first 100 traces for annotation
head -n 100 traces.jsonl > sample.jsonl

# Annotate
embedeval annotate sample.jsonl --user "[email protected]"

# Build/update taxonomy
embedeval taxonomy update

# Run all evals
embedeval eval run traces.jsonl --config evals.yaml

# Generate report
embedeval report --traces traces.jsonl --annotations annotations.jsonl
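`head -n 100` takes the first 100 traces, which can bias the sample toward one time window. If a uniform random sample is preferable, a small helper script (hypothetical, not an EmbedEval command) can draw one; it loads the file into memory, which is fine at weekly volumes:

```python
import random

def sample_jsonl(src, dst, k, seed=None):
    """Write a uniform random sample of k lines from src to dst."""
    with open(src) as f:
        lines = [line for line in f if line.strip()]
    rng = random.Random(seed)  # seed for a reproducible sample
    chosen = rng.sample(lines, min(k, len(lines)))
    with open(dst, "w") as f:
        f.writelines(chosen)
    return len(chosen)
```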

Add Eval for Common Failure

# 1. Build taxonomy to see top failures
embedeval taxonomy build

# 2. Add eval for top category (e.g., hallucination)
embedeval eval add
# Interactive wizard asks for type, model, prompt

# 3. Run the new eval
embedeval eval run traces.jsonl --config evals.yaml

Generate Test Data

# 1. Create dimensions file
embedeval generate init
# Edit dimensions.yaml to define test scenarios

# 2. Generate synthetic traces
embedeval generate create -d dimensions.yaml -n 50

# 3. Run your system on synthetic queries
# (Implementation depends on your system)

# 4. Evaluate
embedeval annotate synthetic-traces.jsonl --user "[email protected]"

MCP Server (AI Agents)

For Claude, Cursor, or other MCP clients:

{
  "mcpServers": {
    "embedeval": {
      "command": "npx",
      "args": ["embedeval", "mcp-server"],
      "env": {
        "GEMINI_API_KEY": "your-api-key"
      }
    }
  }
}

See LLM.md for detailed agent usage guide.


Deployment

Vercel (One-Click)

Deploy with Vercel

GitHub Pages

Already configured. Site updates automatically on push to main.

CI/CD

GitHub Actions workflow included. Runs on every PR:

  • Type checking
  • Build verification
  • CLI command tests


Features

  • Interactive Annotation - Terminal UI for fast binary annotation
  • Failure Taxonomy - Auto-categorize failures with axial coding
  • Binary Evaluation - Assertions, regex, code, LLM-as-judge
  • Synthetic Data - Dimension-based test generation
  • Jupyter Export - Statistical analysis notebooks
  • HTML Reports - Shareable dashboards
  • JSONL Storage - Simple, grep-friendly, no databases
  • Zero Infrastructure - No Redis, no queues, no setup

Why v2?

v1 had 88 files with complex A/B testing, genetic algorithms, and BullMQ queues.

v2 has ~20 files with a simple philosophy: look at your traces first.

Before: Infrastructure-heavy, hard to understand
After: Simple CLI, clear workflow, Hamel Husain principles


License

MIT


Built with ❤️ following Hamel Husain's principles. The goal is understanding failures, not perfect metrics. Spend time looking at traces! 👀