youBencha

A friendly, developer-first CLI framework for evaluating agentic coding tools.

What is youBencha?

youBencha is a testing and benchmarking framework designed to help developers evaluate and compare AI-powered coding agents. It provides:

  • Agent-agnostic architecture - Test any agent through pluggable adapters
  • Flexible evaluation - Use built-in evaluators or create custom ones
  • Reproducible results - Standardized logging and comprehensive result bundles
  • Developer-friendly CLI - Simple commands for running evaluations and generating reports

Requirements

  • Node.js 20+ - youBencha requires Node.js version 20 or higher
  • Git - For cloning repositories during evaluation
  • Agent CLI - At least one of:
    • GitHub Copilot CLI - For copilot-cli agent type
    • Claude Code CLI - For claude-code agent type (install via npm install -g @anthropic-ai/claude-code)
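
You can quickly confirm the first two requirements from a terminal:

node --version   # should report v20 or later
git --version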

Installation

# Install globally
npm install -g youbencha

# Or install locally in your project
npm install --save-dev youbencha

Install from Local Build

If you're developing youBencha locally:

# Build the project
npm run build

# Link globally to use the yb command
npm link

This creates a global symlink to your local package, making the yb command available system-wide. Any changes you make require rebuilding (npm run build) to take effect.

To unlink later:

npm unlink -g youbencha

Quick Start

New to youBencha? Check out the Getting Started Guide for a detailed walkthrough.

1. Install

npm install -g youbencha

2. Create a test case configuration

youBencha supports both YAML and JSON formats for configuration files.

Option A: YAML format (testcase.yaml)

name: "README Comment Addition"
description: "Tests the agent's ability to add a helpful comment explaining the repository purpose"

repo: https://github.com/youbencha/hello-world.git
branch: main

agent:
  type: copilot-cli
  config:
    prompt: "Add a comment to README explaining what this repository is about"

evaluators:
  - name: git-diff
  - name: agentic-judge
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        readme_modified: "README.md was modified. Score 1 if true, 0 if false."
        helpful_comment_added: "A helpful comment was added to README.md. Score 1 if true, 0 if false."

Option B: JSON format (testcase.json)

{
  "name": "README Comment Addition",
  "description": "Tests the agent's ability to add a helpful comment explaining the repository purpose",
  "repo": "https://github.com/youbencha/hello-world.git
",
  "branch": "main",
  "agent": {
    "type": "copilot-cli",
    "config": {
      "prompt": "Add a comment to README explaining what this repository is about"
    }
  },
  "evaluators": [
    { "name": "git-diff" },
    {
      "name": "agentic-judge",
      "config": {
        "type": "copilot-cli",
        "agent_name": "agentic-judge",
        "assertions": {
          "readme_modified": "README.md was modified. Score 1 if true, 0 if false.",
          "helpful_comment_added": "A helpful comment was added to README.md. Score 1 if true, 0 if false."
        }
      }
    }
  ]
}

💡 Tip: Both formats support the same features and are validated using the same schema. Choose the format that best fits your workflow or existing tooling.

📁 Prompt Files: Instead of inline prompts, you can load them from external files using prompt_file: ./path/to/prompt.md. See the Prompt Files Guide for details.

3. Run the evaluation

# Using YAML format
yb run -c testcase.yaml

# Or using JSON format
yb run -c testcase.json

# See examples directory for more configurations
yb run -c examples/testcase-simple.yaml
yb run -c examples/testcase-simple.json

The workspace is kept by default for inspection. Add --delete-workspace to clean up after completion.
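
For example, to remove the workspace automatically once the run finishes:

yb run -c testcase.yaml --delete-workspace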

4. View results

yb report --from .youbencha-workspace/run-*/artifacts/results.json

That's it! youBencha will clone the repo, run the agent, evaluate the output, and generate a comprehensive report.


What Makes youBencha Different?

  • Agent-Agnostic: Works with any AI coding agent through pluggable adapters
  • Reproducible: Standardized logging captures complete execution context
  • Flexible Evaluation: Use built-in evaluators or create custom ones
  • Developer-Friendly: Clear error messages, helpful CLI, extensive examples
  • Comprehensive Reports: From metrics to human-readable insights

Commands

yb run

Run a test case with agent execution.

yb run -c <config-file>

yb eval

Run evaluators on existing directories without executing an agent. Useful for re-evaluating outputs, testing evaluators, or evaluating manual changes.

yb eval -c <eval-config-file>

Use cases:

  • Re-evaluate agent outputs with different evaluator configurations
  • Evaluate manual code changes using youBencha's evaluators
  • Test custom evaluators during development
  • CI/CD integration with other tools
  • Comparative analysis of multiple outputs

See the Eval Command Guide for detailed documentation.

yb report

Generate a report from evaluation results.

yb report --from <results-file> [--format <format>] [--output <path>]

Options:
  --from <path>      Path to results JSON file (required)
  --format <format>  Report format: json, markdown (default: markdown)
  --output <path>    Output path (optional)

yb suggest-testcase

Generate test case suggestions using AI agent interaction.

yb suggest-testcase --agent <type> --output-dir <path> [--agent-file <path>]

Options:
  --agent <type>           Agent tool to use (e.g., copilot-cli) (required)
  --output-dir <path>      Path to successful agent output folder (required)
  --agent-file <path>      Custom agent file (default: agents/suggest-testcase.agent.md)
  --save <path>            Path to save generated test case (optional)

Interactive Workflow:

The suggest-testcase command launches an interactive AI agent session that:

  1. Analyzes your agent's output folder
  2. Asks about your baseline/source for comparison
  3. Requests your original instructions/intent
  4. Detects patterns in the changes (auth, tests, API, docs, etc.)
  5. Recommends appropriate evaluators with reasoning
  6. Generates a complete test case configuration

Example Session:

$ yb suggest-testcase --agent copilot-cli --output-dir ./my-feature

🤖 Launching interactive agent session...

Agent: What branch should I use as the baseline for comparison?
You: main

Agent: What were the original instructions you gave to the agent?
You: Add JWT authentication with rate limiting and comprehensive error handling

Agent: I've analyzed the changes and detected:
- Authentication/security code patterns
- New test files added
- Error handling patterns

Here's your suggested testcase.yaml:

[Generated test case configuration with reasoning]

To use this test case:
1. Save as 'testcase.yaml' in your project
2. Run: yb run -c testcase.yaml
3. Review evaluation results

Use Cases:

  • After successful agent work - Generate test case for validation
  • Quality assurance - Ensure agent followed best practices
  • Documentation - Understand what evaluations are appropriate
  • Learning - See how different changes map to evaluators

Expected Reference Comparison

youBencha supports comparing agent outputs against an expected reference branch. This is useful when you have a "correct" or "ideal" implementation to compare against.

Configuration

Add an expected reference to your test case configuration:

name: "Feature Implementation"
description: "Tests the agent's ability to implement a feature matching the reference implementation"

repo: https://github.com/youbencha/hello-world.git

branch: main
expected_source: branch
expected: feature/completed  # The reference branch

agent:
  type: copilot-cli
  config:
    prompt: "Implement the feature"

evaluators:
  - name: expected-diff
    config:
      threshold: 0.80  # Require 80% similarity to pass

Threshold Guidelines

The threshold determines how similar the agent output must be to the expected reference:

  • 1.0 (100%) - Exact match (very strict)
  • 0.9-0.99 - Very similar with minor differences (strict)
  • 0.7-0.89 - Mostly similar with moderate differences (balanced)
  • <0.7 - Significantly different (lenient)

Recommended thresholds:

  • 0.95+ for generated files (e.g., migrations, configs)
  • 0.80-0.90 for implementation code
  • 0.70-0.80 for creative tasks with multiple valid solutions
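
For instance, a stricter check for generated files might raise the threshold; a minimal sketch reusing the config keys shown above:

evaluators:
  - name: expected-diff
    config:
      threshold: 0.95  # generated files should track the reference closely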

Use Cases

1. Test-Driven Development

expected: tests-implemented
# Compare agent implementation against expected test-driven approach

2. Refactoring Verification

expected: refactored-solution
# Ensure agent refactoring matches expected improvements

3. Bug Fix Validation

expected: bug-fixed
# Compare agent's bug fix with known correct fix

Interpretation

The expected-diff evaluator provides:

  • Aggregate Similarity: Overall similarity score (0.0 to 1.0)
  • File-level Details: Individual similarity for each file
  • Status Counts: matched, changed, added, removed files

Example report section:

### expected-diff

| Metric | Value |
|--------|-------|
| Aggregate Similarity | 85.0% |
| Threshold | 80.0% |
| Files Matched | 5 |
| Files Changed | 2 |
| Files Added | 0 |
| Files Removed | 0 |

#### File-level Details

| File | Similarity | Status |
|------|-----------|--------|
| src/main.ts | 75.0% | 🔄 changed |
| src/utils.ts | 100.0% | ✓ matched |

Built-in Evaluators

git-diff

Analyzes Git changes made by the agent with assertion-based pass/fail thresholds.

Metrics: files_changed, lines_added, lines_removed, total_changes, change_entropy

Supported Assertions:

  • max_files_changed - Maximum number of files that can be changed
  • max_lines_added - Maximum number of lines that can be added
  • max_lines_removed - Maximum number of lines that can be removed
  • max_total_changes - Maximum total changes (additions + deletions)
  • min_change_entropy - Minimum entropy (enforces distributed changes)
  • max_change_entropy - Maximum entropy (enforces focused changes)

Example:

evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 5
        max_lines_added: 100
        max_change_entropy: 2.0  # Keep changes focused

expected-diff

Compares agent output against expected reference branch.

Metrics: aggregate_similarity, threshold, files_matched, files_changed, files_added, files_removed, file_similarities

Requires: expected_source and expected configured in test case
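
A minimal sketch of the required wiring (the reference branch name is illustrative):

branch: main
expected_source: branch
expected: feature/reference-solution   # illustrative reference branch

evaluators:
  - name: expected-diff
    config:
      threshold: 0.80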

agentic-judge

Uses an AI agent to evaluate code quality based on custom assertions. The agent reads files, searches for patterns, and makes judgments like a human reviewer.

Features:

  • Evaluates custom assertions as pass/fail
  • Supports multiple independent judges for different areas
  • Each judge maintains focused context (1-3 assertions recommended)

Metrics: Custom metrics based on your assertions

Multiple Judges: You can define multiple agentic-judge evaluators to break down evaluation into focused areas:

evaluators:
  # Judge 1: Error Handling
  - name: agentic-judge-error-handling
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        has_try_catch: "Code includes try-catch blocks. Score 1 if present, 0 if absent."
        errors_logged: "Errors are properly logged. Score 1 if logged, 0 if not."
  
  # Judge 2: Documentation
  - name: agentic-judge-documentation
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        functions_documented: "Functions have JSDoc. Score 1 if documented, 0 if not."

Naming Convention: Use agentic-judge-<focus-area> or agentic-judge:<focus-area> to create specialized judges.

See: docs/multiple-agentic-judges.md for a detailed guide

Development

Setup

# Clone repository
git clone https://github.com/yourusername/youbencha.git
cd youbencha

# Install dependencies
npm install

# Build
npm run build

# Run tests
npm test

# Run with coverage
npm test -- --coverage

Post-Evaluators: Exporting and Analyzing Results

Post-evaluators run after the main evaluation completes, letting you export results to external systems, run custom analysis, or trigger downstream workflows.

Available Post-Evaluators

1. Database Export - Append results to a JSONL file for time-series analysis

post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./results-history.jsonl
      include_full_bundle: true
      append: true

2. Webhook - POST results to an HTTP endpoint

post_evaluation:
  - name: webhook
    config:
      url: ${SLACK_WEBHOOK_URL}
      method: POST
      headers:
        Content-Type: "application/json"
      retry_on_failure: true
      timeout_ms: 5000

3. Custom Script - Execute custom analysis or integration

post_evaluation:
  - name: script
    config:
      command: ./scripts/notify-slack.sh
      args:
        - "${RESULTS_PATH}"
      env:
        SLACK_WEBHOOK_URL: "${SLACK_WEBHOOK_URL}"
      timeout_ms: 30000
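
As a sketch of what such a custom script could look like, here is a hypothetical notify-slack.sh that posts the results file path to a Slack webhook; the jq field lookup is an assumption about the results schema, so adapt it to the JSON your runs actually produce:

#!/usr/bin/env bash
# Hypothetical post-evaluation script: posts a short summary to Slack.
# The first argument is the results path passed via "${RESULTS_PATH}" in the config above.
set -euo pipefail

RESULTS_FILE="$1"

# ".name" is an assumed field; adjust the jq filter to your actual results schema.
TEST_NAME=$(jq -r '.name // "unknown test"' "$RESULTS_FILE")

curl -fsS -X POST \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"youBencha run finished: ${TEST_NAME} (results: ${RESULTS_FILE})\"}" \
  "$SLACK_WEBHOOK_URL"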

Value Propositions

Single Result: Immediate feedback on one evaluation

  • Quick validation during prompt engineering
  • Debugging agent failures
  • Understanding scope of changes

Suite of Results: Cross-test comparison

  • Identify difficult tasks
  • Compare agent configurations
  • Aggregate metrics and pass rates

Results Over Time: Regression detection and trends

  • Track performance changes across model/prompt updates
  • Cost optimization and ROI tracking
  • Long-term quality trends
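
As a rough illustration of working with the history, assuming the database post-evaluator above appended one JSON object per run to results-history.jsonl, a quick pass with jq might look like this (the .status field is hypothetical; check your actual export):

# Count how many runs have been recorded
wc -l < results-history.jsonl

# Show runs whose (hypothetical) status field is not "pass"
jq -c 'select(.status != "pass")' results-history.jsonl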

Example Scripts

See examples/scripts/ for ready-to-use scripts:

  • notify-slack.sh - Post results to Slack
  • analyze-trends.sh - Analyze time-series data
  • detect-regression.sh - Compare last two runs

Documentation

Project Structure

src/
  adapters/      - Agent adapters
  cli/           - CLI commands
  core/          - Core orchestration logic
  evaluators/    - Built-in evaluators
  post-evaluations/ - Post-evaluation exporters
  lib/           - Utility libraries
  reporters/     - Report generators
  schemas/       - Zod schemas for validation
tests/
  contract/      - Contract tests
  integration/   - Integration tests
  unit/          - Unit tests

Architecture

youBencha follows a pluggable architecture:

  • Agent-Agnostic: Agent-specific logic isolated in adapters
  • Pluggable Evaluators: Add new evaluators without core changes
  • Reproducible: Complete execution context captured
  • youBencha Log Compliance: Normalized logging format across agents

Security Considerations

⚠️ Important Security Notes

Before running evaluations:

  1. Test case configurations execute code: Only run test case configurations from trusted sources
  2. Agent file system access: Agents have full access to the workspace directory
  3. Isolation strongly recommended: Run evaluations in containers or VMs for untrusted code
  4. Repository cloning: youBencha validates repository URLs, but exercise caution with private repositories

Trusted Execution Environments

We recommend running youBencha in isolated environments:

# Docker example
docker run -it --rm \
  -v $(pwd):/workspace \
  -w /workspace \
  node:20 \
  npx youbencha run -c testcase.yaml

# Or use dedicated CI/CD runners

Reporting Security Issues

Please report security vulnerabilities via GitHub Security Advisories or email [email protected]. Do not open public issues for security vulnerabilities.

For more details, see SECURITY.md.

License

MIT