cceval
Evaluate and benchmark your CLAUDE.md effectiveness with automated testing.
Why?
Your CLAUDE.md file guides how Claude Code behaves in your project. But how do you know if your instructions are actually working?
cceval lets you:
- Auto-generate variations of your CLAUDE.md
- Test them against realistic prompts
- Find what actually improves Claude's behavior
- Iterate quickly on your instructions
Installation
# Global install (recommended)
bun add -g cceval
# Or per-project
bun add -D cceval
Requirements:
- Bun >= 1.0
- Claude Code CLI installed and authenticated
Optional (for faster generation):
ANTHROPIC_API_KEY environment variable - if set, cceval uses direct API calls for variation generation instead of the Claude CLI (faster and more reliable)
Quick Start (Turnkey)
Just run cceval in any project with a CLAUDE.md:
cd your-project
cceval
That's it! cceval will:
- Find your CLAUDE.md
- Use Claude to generate 5 variations (condensed, explicit, prioritized, minimal, structured)
- Test each variation, plus a baseline and your original CLAUDE.md, against 5 realistic prompts
- Show you which variation performs best
Example Output
🚀 cceval - Turnkey CLAUDE.md Evaluation
============================================================
📄 Source: ./CLAUDE.md
🤖 Model: haiku
🔄 Generating 7 variations...
⏳ Generating "condensed"... ✓
⏳ Generating "explicit"... ✓
⏳ Generating "prioritized"... ✓
⏳ Generating "minimal"... ✓
⏳ Generating "structured"... ✓
📦 Cached variations to .cceval-variations.json
============================================================
📊 Starting Evaluation
============================================================
Testing 7 variations:
• baseline
• original
• condensed
• explicit
• prioritized
• minimal
• structured
With 5 test prompts each
Total tests: 35
============================================================
✓ baseline/exploreBeforeBuild: $0.0012
✓ baseline/bunPreference: $0.0015
...
============================================================
📊 CLAUDE.md EVALUATION REPORT
============================================================
🏷️ explicit: 72.0% (18/25)
✅ noPermissionSeeking: 5/5
✅ readFilesFirst: 5/5
⚠️ usedBun: 3/5
...
🏆 WINNER: explicit
💰 Total cost: $0.42
============================================================
How It Works
Variation Strategies
cceval generates these variations from your CLAUDE.md:
| Strategy | Description |
|----------|-------------|
| baseline | Minimal "You are a helpful coding assistant" |
| original | Your actual CLAUDE.md as-is |
| condensed | Shorter version keeping only critical rules |
| explicit | More explicit version with clear imperatives (MUST, NEVER, ALWAYS) |
| prioritized | Reordered with most important rules first |
| minimal | Just the 3-5 most critical rules |
| structured | Well-organized with clear sections and headers |
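To make the strategies concrete, here is a hypothetical illustration (not actual cceval output) of how a single rule might be reworded under a few of them:
// Hypothetical rewordings of one rule per strategy - cceval's generated text will differ
const ruleByStrategy = {
  original: "We generally prefer Bun over Node where possible.",
  condensed: "Prefer Bun over Node.",
  explicit: "ALWAYS use Bun (bun test, Bun.serve). NEVER use npm or node.",
  minimal: "Use Bun, not Node.",
}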
Test Prompts
Default prompts test key behaviors:
| Prompt | What it tests |
|--------|---------------|
| exploreBeforeBuild | Does it read files before coding? |
| bunPreference | Does it use Bun instead of Node? |
| rootCause | Does it fix root cause instead of adding spinners? |
| simplicity | Does it avoid over-engineering? |
| permissionSeeking | Does it execute without asking permission? |
CLI Reference
Basic Usage
# Turnkey: auto-detect, generate, test
cceval
# Same as above, explicit
cceval auto
# Specify a different CLAUDE.md
cceval auto -p ./docs/CLAUDE.md
# Use a smarter model for better variations
cceval auto -m sonnet
# Skip regeneration, use cached variations
cceval auto --skip-generate
# See what strategies are available
cceval auto --strategies-only
Output Options
# Custom output file
cceval -o my-results.json
# Generate markdown report too
cceval --markdown REPORT.md
Advanced: Custom Config
For full control, use a config file:
# Create starter config
cceval init
# Edit cceval.config.ts with your prompts/variations
# Run with config
cceval run
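For reference, a config could look roughly like the sketch below. This is illustrative only: the prompts and model fields mirror the object passed to runEvaluation() (see Programmatic Usage), while the variations shape shown here is an assumption; the template written by cceval init is authoritative.
// cceval.config.ts - illustrative sketch, not the generated template
export default {
  model: "haiku",
  prompts: {
    // name -> prompt sent to Claude Code under each variation
    addEndpoint: "Add a /health endpoint to the existing server.",
    fixSlowTests: "Our test suite feels slow. Fix it.",
  },
  // variations can be hand-written or produced with variationsToConfig()
  variations: {
    original: "<contents of your CLAUDE.md>",
    minimal: "Read files first. Use Bun. Run tests and show real output.",
  },
}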
Reports
# Generate report from previous results
cceval report evaluation-results.json
Metrics
cceval measures:
| Metric | What it tests |
|--------|---------------|
| noPermissionSeeking | Does NOT ask "should I...?" or "would you like me to...?" |
| readFilesFirst | Mentions reading/examining files before coding |
| usedBun | Uses Bun APIs (Bun.serve, bun test, etc.) |
| proposedRootCause | For "slow" prompts: fixes root cause instead of adding spinners |
| ranVerification | Mentions running tests or showing output |
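Metrics are scored pass/fail per test prompt (hence scores like 5/5 in the report). As an illustration only, not cceval's actual implementation, noPermissionSeeking could be checked with a simple phrase match:
// Illustrative sketch of a pass/fail metric check (not the shipped implementation)
const permissionPhrases = [/\bshould i\b/i, /\bwould you like me to\b/i, /\bdo you want me to\b/i]
function noPermissionSeeking(response: string): boolean {
  // Pass only if no permission-seeking phrase appears in Claude's response
  return !permissionPhrases.some((re) => re.test(response))
}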
Cost Estimates
| Model | Cost per test | 35 tests (auto) | 25 tests (manual) |
|-------|---------------|-----------------|-------------------|
| haiku | ~$0.08 | ~$2.80 | ~$2.00 |
| sonnet | ~$0.30 | ~$10.50 | ~$7.50 |
| opus | ~$1.50 | ~$52.50 | ~$37.50 |
We recommend haiku for iteration (it's fast and cheap), then validate findings with sonnet.
Programmatic Usage
import {
  generateVariations,
  variationsToConfig,
  runEvaluation,
  printConsoleReport
} from "cceval"
// Generate variations from a CLAUDE.md
const generated = await generateVariations({
  claudeMdPath: "./CLAUDE.md",
  model: "haiku",
})
// Convert to config
const variations = variationsToConfig(generated)
// Run evaluation
const results = await runEvaluation({
  config: {
    prompts: { test: "Create a hello world server." },
    variations,
    model: "haiku",
  },
})
printConsoleReport(results)
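If you want to keep the raw results around (the CLI writes evaluation-results.json for you), a minimal follow-up sketch, assuming results is JSON-serializable:
// Optional sketch: persist raw results (assumes the same JSON shape the CLI writes)
await Bun.write("evaluation-results.json", JSON.stringify(results, null, 2))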
Key Findings from Our Research
Based on evaluating 25+ prompt variations:
1. Gate-Based Instructions Win
Clear pass/fail criteria outperform vague guidance:
You are evaluated on gates. Fail any = FAIL.
1. Read files before coding
2. State plan then proceed immediately (don't ask)
3. Run tests and show actual output
2. "Don't Ask Permission" Backfires
Explicitly saying "never ask permission" increases permission-seeking due to priming. Instead, frame positively:
Execute standard operations immediately.
File edits and test runs are routine.
3. Verification Is the Biggest Win
Adding "Run tests and show actual output" improved verification from 20% to 100%.
4. Keep It Concise
The winning prompt was just 4 lines. Long instructions get ignored.
Workflow
Recommended workflow for optimizing your CLAUDE.md:
- Baseline: Run cceval to see how your current CLAUDE.md performs
- Analyze: Look at which variation scored best and why
- Apply: Update your CLAUDE.md based on the winning strategy
- Iterate: Run cceval again to verify improvement
Files Generated
| File | Description |
|------|-------------|
| .cceval-variations.json | Cached generated variations (re-run faster with --skip-generate) |
| evaluation-results.json | Full test results (JSON) |
| REPORT.md | Markdown report (if --markdown specified) |
Add .cceval-variations.json and evaluation-results.json to .gitignore.
Contributing
PRs welcome! Ideas:
- More default metrics
- CI/CD integration examples
- Alternative model backends
- Statistical significance testing
License
MIT
