youBencha
A friendly, developer-first CLI framework for evaluating agentic coding tools.
What is youBencha?
youBencha is a testing and benchmarking framework designed to help developers evaluate and compare AI-powered coding agents. It provides:
- Agent-agnostic architecture - Test any agent through pluggable adapters
- Flexible evaluation - Use built-in evaluators or create custom ones
- Reproducible results - Standardized logging and comprehensive result bundles
- Developer-friendly CLI - Simple commands for running evaluations and generating reports
Requirements
- Node.js 20+ - youBencha requires Node.js version 20 or higher
- Git - For cloning repositories during evaluation
- Agent CLI - At least one of:
- GitHub Copilot CLI - For copilot-cli agent type
- Claude Code CLI - For claude-code agent type (install via npm install -g @anthropic-ai/claude-code)
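Both prerequisites can be checked from a shell before installing:
# Node.js must report version 20 or higher
node --version
# Git must be on the PATH
git --version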
Installation
# Install globally
npm install -g youbencha
# Or install locally in your project
npm install --save-dev youbencha
Install from Local Build
If you're developing youBencha locally:
# Build the project
npm run build
# Link globally to use the yb command
npm link
This creates a global symlink to your local package, making the yb command available system-wide. Any changes you make require rebuilding (npm run build) to take effect.
To unlink later:
npm unlink -g youbencha
Quick Start
New to youBencha? Check out the Getting Started Guide for a detailed walkthrough.
1. Install
npm install -g youbencha
2. Create a test case configuration
youBencha supports both YAML and JSON formats for configuration files.
Option A: YAML format (testcase.yaml)
name: "README Comment Addition"
description: "Tests the agent's ability to add a helpful comment explaining the repository purpose"
repo: https://github.com/youbencha/hello-world.git
branch: main
agent:
  type: copilot-cli
  config:
    prompt: "Add a comment to README explaining what this repository is about"
evaluators:
  - name: git-diff
  - name: agentic-judge
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        readme_modified: "README.md was modified. Score 1 if true, 0 if false."
        helpful_comment_added: "A helpful comment was added to README.md. Score 1 if true, 0 if false."
Option B: JSON format (testcase.json)
{
  "name": "README Comment Addition",
  "description": "Tests the agent's ability to add a helpful comment explaining the repository purpose",
  "repo": "https://github.com/youbencha/hello-world.git",
  "branch": "main",
  "agent": {
    "type": "copilot-cli",
    "config": {
      "prompt": "Add a comment to README explaining what this repository is about"
    }
  },
  "evaluators": [
    { "name": "git-diff" },
    {
      "name": "agentic-judge",
      "config": {
        "type": "copilot-cli",
        "agent_name": "agentic-judge",
        "assertions": {
          "readme_modified": "README.md was modified. Score 1 if true, 0 if false.",
          "helpful_comment_added": "A helpful comment was added to README.md. Score 1 if true, 0 if false."
        }
      }
    }
  ]
}
💡 Tip: Both formats support the same features and are validated using the same schema. Choose the format that best fits your workflow or existing tooling.
📁 Prompt Files: Instead of inline prompts, you can load them from external files using prompt_file: ./path/to/prompt.md. See the Prompt Files Guide for details.
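For example, the quick-start agent block could load its prompt from a file instead of inlining it (a minimal sketch, assuming prompt_file takes the place of prompt under config; the path is illustrative):
agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-readme-comment.md  # illustrative path, loaded instead of an inline prompt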
3. Run the evaluation
# Using YAML format
yb run -c testcase.yaml
# Or using JSON format
yb run -c testcase.json
# See examples directory for more configurations
yb run -c examples/testcase-simple.yaml
yb run -c examples/testcase-simple.json
The workspace is kept by default for inspection. Add --delete-workspace to clean up after completion.
4. View results
yb report --from .youbencha-workspace/run-*/artifacts/results.json
That's it! youBencha will clone the repo, run the agent, evaluate the output, and generate a comprehensive report.
What Makes youBencha Different?
- Agent-Agnostic: Works with any AI coding agent through pluggable adapters
- Reproducible: Standardized logging captures complete execution context
- Flexible Evaluation: Use built-in evaluators or create custom ones
- Developer-Friendly: Clear error messages, helpful CLI, extensive examples
- Comprehensive Reports: From metrics to human-readable insights
Commands
yb run
Run a test case with agent execution.
yb run -c <config-file>
yb eval
Run evaluators on existing directories without executing an agent. Useful for re-evaluating outputs, testing evaluators, or evaluating manual changes.
yb eval -c <eval-config-file>
Use cases:
- Re-evaluate agent outputs with different evaluator configurations
- Evaluate manual code changes using youBencha's evaluators
- Test custom evaluators during development
- CI/CD integration with other tools
- Comparative analysis of multiple outputs
See the Eval Command Guide for detailed documentation.
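An eval config reuses the evaluators block of a test case but points at directories that already exist instead of running an agent. A loose sketch only (the target_dir key below is hypothetical; the Eval Command Guide documents the actual schema):
target_dir: ./my-feature   # hypothetical key: directory holding the output to evaluate
evaluators:
  - name: git-diff
  - name: agentic-judge
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        tests_added: "New test files were added. Score 1 if true, 0 if false."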
yb report
Generate a report from evaluation results.
yb report --from <results-file> [--format <format>] [--output <path>]
Options:
--from <path> Path to results JSON file (required)
--format <format> Report format: json, markdown (default: markdown)
--output <path>     Output path (optional)
yb suggest-testcase
Generate test case suggestions using AI agent interaction.
yb suggest-testcase --agent <type> --output-dir <path> [--agent-file <path>]
Options:
--agent <type> Agent tool to use (e.g., copilot-cli) (required)
--output-dir <path> Path to successful agent output folder (required)
--agent-file <path> Custom agent file (default: agents/suggest-testcase.agent.md)
--save <path>        Path to save generated test case (optional)
Interactive Workflow:
The suggest-testcase command launches an interactive AI agent session that:
- Analyzes your agent's output folder
- Asks about your baseline/source for comparison
- Requests your original instructions/intent
- Detects patterns in the changes (auth, tests, API, docs, etc.)
- Recommends appropriate evaluators with reasoning
- Generates a complete test case configuration
Example Session:
$ yb suggest-testcase --agent copilot-cli --output-dir ./my-feature
🤖 Launching interactive agent session...
Agent: What branch should I use as the baseline for comparison?
You: main
Agent: What were the original instructions you gave to the agent?
You: Add JWT authentication with rate limiting and comprehensive error handling
Agent: I've analyzed the changes and detected:
- Authentication/security code patterns
- New test files added
- Error handling patterns
Here's your suggested testcase.yaml:
[Generated test case configuration with reasoning]
To use this test case:
1. Save as 'testcase.yaml' in your project
2. Run: yb run -c testcase.yaml
3. Review evaluation results
Use Cases:
- After successful agent work - Generate test case for validation
- Quality assurance - Ensure agent followed best practices
- Documentation - Understand what evaluations are appropriate
- Learning - See how different changes map to evaluators
Expected Reference Comparison
youBencha supports comparing agent outputs against an expected reference branch. This is useful when you have a "correct" or "ideal" implementation to compare against.
Configuration
Add an expected reference to your test case configuration:
name: "Feature Implementation"
description: "Tests the agent's ability to implement a feature matching the reference implementation"
repo: https://github.com/youbencha/hello-world.git
branch: main
expected_source: branch
expected: feature/completed # The reference branch
agent:
  type: copilot-cli
  config:
    prompt: "Implement the feature"
evaluators:
  - name: expected-diff
    config:
      threshold: 0.80  # Require 80% similarity to pass
Threshold Guidelines
The threshold determines how similar the agent output must be to the expected reference:
- 1.0 (100%) - Exact match (very strict)
- 0.9-0.99 - Very similar with minor differences (strict)
- 0.7-0.89 - Mostly similar with moderate differences (balanced)
- <0.7 - Significantly different (lenient)
Recommended thresholds:
- 0.95+ for generated files (e.g., migrations, configs)
- 0.80-0.90 for implementation code
- 0.70-0.80 for creative tasks with multiple valid solutions
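Applying these guidelines, a balanced setup for implementation code might look like:
evaluators:
  - name: expected-diff
    config:
      threshold: 0.85  # balanced: mostly similar, moderate differences tolerated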
Use Cases
1. Test-Driven Development
expected: tests-implemented
# Compare agent implementation against expected test-driven approach
2. Refactoring Verification
expected: refactored-solution
# Ensure agent refactoring matches expected improvements
3. Bug Fix Validation
expected: bug-fixed
# Compare agent's bug fix with known correct fix
Interpretation
The expected-diff evaluator provides:
- Aggregate Similarity: Overall similarity score (0.0 to 1.0)
- File-level Details: Individual similarity for each file
- Status Counts: matched, changed, added, removed files
Example report section:
### expected-diff
| Metric | Value |
|--------|-------|
| Aggregate Similarity | 85.0% |
| Threshold | 80.0% |
| Files Matched | 5 |
| Files Changed | 2 |
| Files Added | 0 |
| Files Removed | 0 |
#### File-level Details
| File | Similarity | Status |
|------|-----------|--------|
| src/main.ts | 75.0% | 🔄 changed |
| src/utils.ts | 100.0% | ✓ matched |
Built-in Evaluators
git-diff
Analyzes Git changes made by the agent with assertion-based pass/fail thresholds.
Metrics: files_changed, lines_added, lines_removed, total_changes, change_entropy
Supported Assertions:
- max_files_changed - Maximum number of files that can be changed
- max_lines_added - Maximum number of lines that can be added
- max_lines_removed - Maximum number of lines that can be removed
- max_total_changes - Maximum total changes (additions + deletions)
- min_change_entropy - Minimum entropy (enforces distributed changes)
- max_change_entropy - Maximum entropy (enforces focused changes)
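A quick intuition for the entropy assertions, assuming change_entropy is Shannon entropy over each file's share of the changed lines (an assumption; the exact formula is not stated here):
- All changed lines in one file: entropy = 0.0 (maximally focused)
- Changes split 50/50 across two files: entropy = -(0.5·log2 0.5 + 0.5·log2 0.5) = 1.0
- Changes split evenly across four files: entropy = 2.0, so max_change_entropy: 2.0 roughly caps the spread at four evenly-edited files, while min_change_entropy enforces the opposite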
Example:
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 5
        max_lines_added: 100
        max_change_entropy: 2.0  # Keep changes focused
expected-diff
Compares agent output against expected reference branch.
Metrics: aggregate_similarity, threshold, files_matched, files_changed, files_added, files_removed, file_similarities
Requires: expected_source and expected configured in test case
agentic-judge
Uses an AI agent to evaluate code quality based on custom assertions. The agent reads files, searches for patterns, and makes judgments like a human reviewer.
Features:
- Scores custom assertions as pass/fail
- Supports multiple independent judges for different areas
- Each judge maintains focused context (1-3 assertions recommended)
Metrics: Custom metrics based on your assertions
Multiple Judges: You can define multiple agentic-judge evaluators to break down evaluation into focused areas:
evaluators:
  # Judge 1: Error Handling
  - name: agentic-judge-error-handling
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        has_try_catch: "Code includes try-catch blocks. Score 1 if present, 0 if absent."
        errors_logged: "Errors are properly logged. Score 1 if logged, 0 if not."
  # Judge 2: Documentation
  - name: agentic-judge-documentation
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        functions_documented: "Functions have JSDoc. Score 1 if documented, 0 if not."
Naming Convention: Use agentic-judge-<focus-area> or agentic-judge:<focus-area> to create specialized judges.
See: docs/multiple-agentic-judges.md for a detailed guide
Development
Setup
# Clone repository
git clone https://github.com/yourusername/youbencha.git
cd youbencha
# Install dependencies
npm install
# Build
npm run build
# Run tests
npm test
# Run with coverage
npm test -- --coverage
Post-Evaluators: Exporting and Analyzing Results
Post-evaluators run after evaluation completes, letting you export results to external systems, run custom analysis, or trigger downstream workflows.
Available Post-Evaluators
1. Database Export - Append results to JSONL file for time-series analysis
post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./results-history.jsonl
      include_full_bundle: true
      append: true
2. Webhook - POST results to HTTP endpoint
post_evaluation:
  - name: webhook
    config:
      url: ${SLACK_WEBHOOK_URL}
      method: POST
      headers:
        Content-Type: "application/json"
      retry_on_failure: true
      timeout_ms: 5000
3. Custom Script - Execute custom analysis or integration
post_evaluation:
  - name: script
    config:
      command: ./scripts/notify-slack.sh
      args:
        - "${RESULTS_PATH}"
      env:
        SLACK_WEBHOOK_URL: "${SLACK_WEBHOOK_URL}"
      timeout_ms: 30000
Value Propositions
Single Result: Immediate feedback on one evaluation
- Quick validation during prompt engineering
- Debugging agent failures
- Understanding scope of changes
Suite of Results: Cross-test comparison
- Identify difficult tasks
- Compare agent configurations
- Aggregate metrics and pass rates
Results Over Time: Regression detection and trends
- Track performance changes across model/prompt updates
- Cost optimization and ROI tracking
- Long-term quality trends
Example Scripts
See examples/scripts/ for ready-to-use scripts:
- notify-slack.sh - Post results to Slack
- analyze-trends.sh - Analyze time-series data
- detect-regression.sh - Compare last two runs
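If the database post-evaluator is appending runs to results-history.jsonl, even a one-liner gives a quick pass-rate overview (a sketch; the status field name is hypothetical, so inspect your JSONL lines for the actual shape):
# tally results per status value (hypothetical field name)
jq -r '.status' results-history.jsonl | sort | uniq -c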
Documentation
- Getting Started Guide - Comprehensive walkthrough for new users
- Post-Evaluation Guide - Complete reference for post-evaluation hooks
- Analyzing Results Guide - Analysis patterns and best practices
- Prompt Files Guide - Loading prompts from external files
- Reusable Evaluators Guide - Sharing evaluator configurations
- Multiple Agentic Judges Guide - Using multiple focused evaluators
- Claude Code Adapter - Using Claude Code as an agent
Project Structure
src/
  adapters/         - Agent adapters
  cli/              - CLI commands
  core/             - Core orchestration logic
  evaluators/       - Built-in evaluators
  post-evaluations/ - Post-evaluation exporters
  lib/              - Utility libraries
  reporters/        - Report generators
  schemas/          - Zod schemas for validation
tests/
  contract/         - Contract tests
  integration/      - Integration tests
  unit/             - Unit tests
Architecture
youBencha follows a pluggable architecture:
- Agent-Agnostic: Agent-specific logic isolated in adapters
- Pluggable Evaluators: Add new evaluators without core changes
- Reproducible: Complete execution context captured
- youBencha Log Compliance: Normalized logging format across agents
Security Considerations
⚠️ Important Security Notes
Before running evaluations:
- Test case configurations execute code: Only run test case configurations from trusted sources
- Agent file system access: Agents have full access to the workspace directory
- Isolation strongly recommended: Run evaluations in containers or VMs for untrusted code
- Repository cloning: Repository URLs are validated, but exercise caution with private repos
Trusted Execution Environments
We recommend running youBencha in isolated environments:
# Docker example
docker run -it --rm \
  -v $(pwd):/workspace \
  -w /workspace \
  node:20 \
  npx youbencha run -c testcase.yaml
# Or use dedicated CI/CD runners
Reporting Security Issues
Please report security vulnerabilities via GitHub Security Advisories or email [email protected]. Do not open public issues for security vulnerabilities.
For more details, see SECURITY.md.
License
MIT
