cceval
Evaluate and benchmark your CLAUDE.md effectiveness with automated testing.
Why?
Your CLAUDE.md file guides how Claude Code behaves in your project. But how do you know if your instructions are actually working?
cceval lets you:
- Auto-generate variations of your CLAUDE.md
- Test them against realistic prompts
- Find what actually improves Claude's behavior
- Iterate quickly on your instructions
Installation
# Global install (recommended)
bun add -g cceval
# Or per-project
bun add -D cceval
Requirements:
- Bun >= 1.0
- Claude Code CLI installed and authenticated
Optional (for faster generation):
ANTHROPIC_API_KEY environment variable - if set, cceval uses direct API calls for variation generation instead of the Claude CLI (faster and more reliable)
Quick Start (Turnkey)
Just run cceval in any project with a CLAUDE.md:
cd your-project
cceval
That's it! cceval will:
- Find your CLAUDE.md
- Use Claude to generate 5 variations (condensed, explicit, prioritized, minimal, structured)
- Test each variation, plus a baseline and your original CLAUDE.md, against 5 realistic prompts
- Show you which variation performs best
Example Output
🚀 cceval - Turnkey CLAUDE.md Evaluation
============================================================
📄 Source: ./CLAUDE.md
🤖 Model: haiku
🔄 Generating 7 variations...
⏳ Generating "condensed"... ✓
⏳ Generating "explicit"... ✓
⏳ Generating "prioritized"... ✓
⏳ Generating "minimal"... ✓
⏳ Generating "structured"... ✓
📦 Cached variations to .cceval-variations.json
============================================================
📊 Starting Evaluation
============================================================
Testing 7 variations:
• baseline
• original
• condensed
• explicit
• prioritized
• minimal
• structured
With 5 test prompts each
Total tests: 35
============================================================
✓ baseline/exploreBeforeBuild: $0.0012
✓ baseline/bunPreference: $0.0015
...
============================================================
📊 CLAUDE.md EVALUATION REPORT
============================================================
🏷️ explicit: 72.0% (18/25)
✅ noPermissionSeeking: 5/5
✅ readFilesFirst: 5/5
⚠️ usedBun: 3/5
...
🏆 WINNER: explicit
💰 Total cost: $0.42
============================================================
How It Works
Variation Strategies
cceval generates these variations from your CLAUDE.md:
| Strategy | Description |
|----------|-------------|
| baseline | Minimal "You are a helpful coding assistant" |
| original | Your actual CLAUDE.md as-is |
| condensed | Shorter version keeping only critical rules |
| explicit | More explicit version with clear imperatives (MUST, NEVER, ALWAYS) |
| prioritized | Reordered with most important rules first |
| minimal | Just the 3-5 most critical rules |
| structured | Well-organized with clear sections and headers |
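To make the strategies concrete, here is a hypothetical illustration (not actual cceval output) of how a single rule might be reworded under a few of them:
// Hypothetical rewordings of one rule per strategy - cceval's generated text will differ
const ruleByStrategy = {
  original: "We generally prefer Bun over Node where possible.",
  condensed: "Prefer Bun over Node.",
  explicit: "ALWAYS use Bun (bun test, Bun.serve). NEVER use npm or node.",
  minimal: "Use Bun, not Node.",
}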
Test Prompts
Default prompts test key behaviors:
| Prompt | What it tests |
|--------|---------------|
| exploreBeforeBuild | Does it read files before coding? |
| bunPreference | Does it use Bun instead of Node? |
| rootCause | Does it fix root cause instead of adding spinners? |
| simplicity | Does it avoid over-engineering? |
| permissionSeeking | Does it execute without asking permission? |
CLI Reference
Basic Usage
# Turnkey: auto-detect, generate, test
cceval
# Same as above, explicit
cceval auto
# Specify a different CLAUDE.md
cceval auto -p ./docs/CLAUDE.md
# Use a smarter model for better variations
cceval auto -m sonnet
# Skip regeneration, use cached variations
cceval auto --skip-generate
# See what strategies are available
cceval auto --strategies-only
Output Options
# Custom output file
cceval -o my-results.json
# Generate markdown report too
cceval --markdown REPORT.md
Advanced: Custom Config
For full control, use a config file:
# Create starter config
cceval init
# Edit cceval.config.ts with your prompts/variations
# Run with config
cceval run
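For reference, a config could look roughly like the sketch below. This is illustrative only: the prompts and model fields mirror the object passed to runEvaluation() (see Programmatic Usage), while the variations shape shown here is an assumption; the template written by cceval init is authoritative.
// cceval.config.ts - illustrative sketch, not the generated template
export default {
  model: "haiku",
  prompts: {
    // name -> prompt sent to Claude Code under each variation
    addEndpoint: "Add a /health endpoint to the existing server.",
    fixSlowTests: "Our test suite feels slow. Fix it.",
  },
  // variations can be hand-written or produced with variationsToConfig()
  variations: {
    original: "<contents of your CLAUDE.md>",
    minimal: "Read files first. Use Bun. Run tests and show real output.",
  },
}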
Reports
# Generate report from previous results
cceval report evaluation-results.json
Metrics
cceval measures:
| Metric | What it tests |
|--------|---------------|
| noPermissionSeeking | Does NOT ask "should I...?" or "would you like me to...?" |
| readFilesFirst | Mentions reading/examining files before coding |
| usedBun | Uses Bun APIs (Bun.serve, bun test, etc.) |
| proposedRootCause | For "slow" prompts: fixes root cause instead of adding spinners |
| ranVerification | Mentions running tests or showing output |
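Metrics are scored pass/fail per test prompt (hence scores like 5/5 in the report). As an illustration only, not cceval's actual implementation, noPermissionSeeking could be checked with a simple phrase match:
// Illustrative sketch of a pass/fail metric check (not the shipped implementation)
const permissionPhrases = [/\bshould i\b/i, /\bwould you like me to\b/i, /\bdo you want me to\b/i]
function noPermissionSeeking(response: string): boolean {
  // Pass only if no permission-seeking phrase appears in Claude's response
  return !permissionPhrases.some((re) => re.test(response))
}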
Cost Estimates
| Model | Cost per test | 35 tests (auto) | 25 tests (manual) |
|-------|---------------|-----------------|-------------------|
| haiku | ~$0.08 | ~$2.80 | ~$2.00 |
| sonnet | ~$0.30 | ~$10.50 | ~$7.50 |
| opus | ~$1.50 | ~$52.50 | ~$37.50 |
We recommend haiku for iteration (it's fast and cheap), then validate findings with sonnet.
Programmatic Usage
import {
  generateVariations,
  variationsToConfig,
  runEvaluation,
  printConsoleReport
} from "cceval"
// Generate variations from a CLAUDE.md
const generated = await generateVariations({
  claudeMdPath: "./CLAUDE.md",
  model: "haiku",
})
// Convert to config
const variations = variationsToConfig(generated)
// Run evaluation
const results = await runEvaluation({
  config: {
    prompts: { test: "Create a hello world server." },
    variations,
    model: "haiku",
  },
})
printConsoleReport(results)
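If you want to keep the raw results around (the CLI writes evaluation-results.json for you), a minimal follow-up sketch, assuming results is JSON-serializable:
// Optional sketch: persist raw results (assumes the same JSON shape the CLI writes)
await Bun.write("evaluation-results.json", JSON.stringify(results, null, 2))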
Key Findings from Our Research
Based on evaluating 25+ prompt variations:
1. Gate-Based Instructions Win
Clear pass/fail criteria outperform vague guidance:
You are evaluated on gates. Fail any = FAIL.
1. Read files before coding
2. State plan then proceed immediately (don't ask)
3. Run tests and show actual output
2. "Don't Ask Permission" Backfires
Explicitly saying "never ask permission" increases permission-seeking due to priming. Instead, frame positively:
Execute standard operations immediately.
File edits and test runs are routine.
3. Verification Is the Biggest Win
Adding "Run tests and show actual output" improved verification from 20% to 100%.
4. Keep It Concise
The winning prompt was just 4 lines. Long instructions get ignored.
Workflow
Recommended workflow for optimizing your CLAUDE.md:
- Baseline: Run cceval to see how your current CLAUDE.md performs
- Analyze: Look at which variation scored best and why
- Apply: Update your CLAUDE.md based on the winning strategy
- Iterate: Run cceval again to verify improvement
Files Generated
| File | Description |
|------|-------------|
| .cceval-variations.json | Cached generated variations (re-run faster with --skip-generate) |
| evaluation-results.json | Full test results (JSON) |
| REPORT.md | Markdown report (if --markdown specified) |
Add .cceval-variations.json and evaluation-results.json to .gitignore.
Contributing
PRs welcome! Ideas:
- More default metrics
- CI/CD integration examples
- Alternative model backends
- Statistical significance testing
License
MIT
