Sip Benchmark Tool
Internal CLI tool for benchmarking the Sip AI persona evaluation agent.
This standalone tool allows you to evaluate and compare different versions of the IdeaEvaluationAgent by running systematic tests and measuring performance improvements.
Features
- 🎯 Systematic Testing: Create persistent test suites from real YouTube channels
- 📊 Statistical Analysis: P-values, Cohen's d, Olympic-trimmed means
- 🤖 LLM-as-Judge: Automated evaluation of agent predictions
- 🔄 Persona Regeneration: Test with fresh personas on each run
- 📈 Progress Tracking: Real-time progress bars and detailed logs
- 📝 Comprehensive Reports: Markdown + JSON reports with deployment recommendations
Installation
Option 1: npm (Global Installation)
cd /path/to/sip-benchmark
npm install
npm link
Now you can use sip-benchmark from anywhere!
Option 2: npm (Local to Web App)
cd /path/to/sip-ai
npm install --save-dev /path/to/sip-benchmark
Then use via npx:
npx sip-benchmark <command>
Option 3: Direct Execution (No Installation)
cd /path/to/sip-ai
node /path/to/sip-benchmark/bin/cli.js <command>
Requirements
- Sip-AI Web App: This tool must be run from within the Sip-AI web app directory
- Environment Setup: .env.local must be configured with:
  - POSTGRES_URL - Database connection
  - OPENAI_API_KEY - For agent and embeddings
  - YOUTUBE_API_KEY - For scraping channels
  - Other API keys as needed (Anthropic, Gemini, Langfuse, etc.)
- Dependencies: Web app dependencies must be installed (npm install)
Usage
Quick Start (Interactive Menu)
The easiest way to use the tool is with the interactive menu:
cd /path/to/sip-ai
sip-benchmark menu
This launches a user-friendly guided interface with options for:
- Quick Test (paste URL → auto-create → run → report)
- Create Benchmark Suite
- Run Benchmark
- View Reports
- List Benchmarks
Command-Line Interface
1. Create a Benchmark Suite
sip-benchmark setup \
--name "My Benchmark" \
--channels "UCxxx,UCyyy" \
--cases-per-tier 5
Options:
- --name (required): Benchmark name
- --channels (required): Comma-separated channel IDs or YouTube URLs
- --cases-per-tier (optional): Test cases per performance tier (default: 5)
- --description (optional): Benchmark description
What it does:
- Samples high-performing (top 25%) and low-performing (bottom 25%) videos (see the sketch after this list)
- Generates test cases with ground truth metrics (view percentile, engagement, sentiment)
- Creates idea descriptions from video title + description + transcript
- Stores test cases in database for reuse
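To make the tier sampling concrete, here is a minimal TypeScript sketch. The video shape, field names, and the selection logic are assumptions for illustration; the real implementation lives in the web app's benchmark scripts.

```typescript
// Hypothetical sketch of tier sampling: the top and bottom 25% of a channel's
// videos by view count become the "high" and "low" performance tiers.

interface Video {
  id: string;
  title: string;
  viewCount: number; // assumed field name
}

interface TestCase {
  videoId: string;
  tier: "high" | "low";
  viewPercentile: number; // ground-truth metric stored with the case
}

function sampleTiers(videos: Video[], casesPerTier: number): TestCase[] {
  // Sort ascending by views so percentiles are easy to compute.
  const sorted = [...videos].sort((a, b) => a.viewCount - b.viewCount);
  const n = sorted.length;

  const toCase = (v: Video, tier: "high" | "low"): TestCase => ({
    videoId: v.id,
    tier,
    viewPercentile: Math.round((sorted.indexOf(v) / Math.max(n - 1, 1)) * 100),
  });

  const cutoff = Math.floor(n * 0.25);
  const low = sorted.slice(0, cutoff);   // bottom 25%
  const high = sorted.slice(n - cutoff); // top 25%

  // Simple head/tail pick here; the real tool may sample randomly within each tier.
  return [
    ...low.slice(0, casesPerTier).map((v) => toCase(v, "low")),
    ...high.slice(-casesPerTier).map((v) => toCase(v, "high")),
  ];
}
```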
2. Run a Benchmark
sip-benchmark run --benchmark-id <uuid>
Options:
- --benchmark-id (required): Benchmark UUID to run
- --model (optional): Model to use (default: from env or gpt-4.1-2025-04-14)
- --temperature (optional): Temperature (default: 0.7)
- --skip-persona-regen (optional): Use cached personas instead of regenerating
What it does:
- Regenerates personas for all channels (by default)
- Runs IdeaEvaluationAgent (Steps 3-6) on all test cases
- Evaluates predictions with LLM judge (3 dimensions)
- Calculates statistics (mean, SD, p-value vs baseline); see the sketch after this list
- Generates markdown + JSON reports in the reports/ directory
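The scoring and statistics can be summarized in a short TypeScript sketch. The dimension weights follow the Test Flow diagram below (40/40/20); the function names and the exact trimming rule are assumptions, not the web app's actual implementation.

```typescript
// Hypothetical sketch of the scoring math used in a run.

interface JudgeScores {
  directionalAccuracy: number; // 0-10
  verdictAlignment: number;    // 0-10
  internalConsistency: number; // 0-10
}

// Weighted composite per test case (weights as documented: 40% / 40% / 20%).
function overallScore(s: JudgeScores): number {
  return 0.4 * s.directionalAccuracy + 0.4 * s.verdictAlignment + 0.2 * s.internalConsistency;
}

// Olympic-trimmed mean: drop the single highest and lowest score, average the rest.
function olympicTrimmedMean(scores: number[]): number {
  if (scores.length <= 2) return scores.reduce((a, b) => a + b, 0) / scores.length;
  const trimmed = [...scores].sort((a, b) => a - b).slice(1, -1);
  return trimmed.reduce((a, b) => a + b, 0) / trimmed.length;
}

// Cohen's d against a baseline run, using the pooled standard deviation.
function cohensD(current: number[], baseline: number[]): number {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
  };
  const pooled = Math.sqrt(
    ((current.length - 1) * variance(current) + (baseline.length - 1) * variance(baseline)) /
      (current.length + baseline.length - 2)
  );
  return (mean(current) - mean(baseline)) / pooled;
}

// The p-value comes from a two-sample t-test over the same arrays; computing it
// needs a t-distribution CDF (from a stats library), so it is omitted here.
```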
3. List Benchmarks
sip-benchmark list
Shows all benchmarks with test case counts and run history.
4. View Report
sip-benchmark report --run-id <uuid>
Displays a summary of a benchmark run with statistics and recommendations.
How It Works
Architecture
The tool is a lightweight wrapper that:
- Detects the Sip-AI web app directory
- Loads .env.local environment variables
- Validates required environment variables
- Calls the web app's npm run benchmark:* scripts
- Uses the exact production code for testing (no mocking)
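A minimal sketch of what such a wrapper can look like is shown below. It assumes dotenv and Node's built-in modules; the package-name check, required-variable list, and script name are illustrative and may differ from src/backend-importer.ts.

```typescript
// Hypothetical wrapper flow: detect the web app, load its env, delegate to npm.
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { spawnSync } from "node:child_process";
import dotenv from "dotenv";

const REQUIRED_VARS = ["POSTGRES_URL", "OPENAI_API_KEY", "YOUTUBE_API_KEY"];

function assertSipAiDir(cwd: string): void {
  const pkgPath = join(cwd, "package.json");
  if (!existsSync(pkgPath)) throw new Error("Not in a Sip-AI project directory");
  const pkg = JSON.parse(readFileSync(pkgPath, "utf8"));
  // Illustrative check; the real detection logic may look at other markers.
  if (pkg.name !== "sip-ai") throw new Error("Not in a Sip-AI project directory");
}

function runBenchmarkScript(cwd: string, args: string[]): number {
  assertSipAiDir(cwd);
  dotenv.config({ path: join(cwd, ".env.local") }); // load the web app's env

  const missing = REQUIRED_VARS.filter((v) => !process.env[v]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }

  // Delegate to the web app's own script so production code runs unmocked.
  const result = spawnSync("npm", ["run", "benchmark:run", "--", ...args], {
    cwd,
    stdio: "inherit",
    env: { ...process.env, BENCHMARK_MODE: "true" },
  });
  return result.status ?? 1;
}
```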
Test Flow
┌─────────────────────────────────────────────────┐
│ 1. Setup: Sample Videos + Generate Test Cases │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 2. Run: Regenerate Personas (optional) │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 3. Run: Execute Agent on All Test Cases │
│ - Topic Fit Score (Step 3) │
│ - Audience Resonance (Step 4) │
│ - Risk Assessment (Step 5) │
│ - Final Report (Step 6) │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 4. Judge: Evaluate Predictions vs Reality │
│ - Directional Accuracy (40%) │
│ - Verdict Alignment (40%) │
│ - Internal Consistency (20%) │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 5. Analyze: Calculate Statistics │
│ - Olympic-trimmed mean │
│ - Two-sample t-test vs baseline │
│ - Cohen's d effect size │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 6. Report: Generate Markdown + JSON │
│ - Statistics summary │
│ - Deployment recommendation │
│ - Individual test case results │
└─────────────────────────────────────────────────┘
Workflow Example
Scenario: Test a New Prompt Version
# 1. Create benchmark suite (one-time setup)
cd /path/to/sip-ai
sip-benchmark setup \
--name "Tech Channels Q1 2025" \
--channels "UCxxx,UCyyy,UCzzz" \
--cases-per-tier 10
# Output: Benchmark ID: abc123...
# 2. Run baseline (with current production prompts)
sip-benchmark run --benchmark-id abc123
# Output:
# Mean Score: 7.23
# Baseline established!
# 3. Edit prompts in Langfuse (label: "experimental")
# Visit Langfuse UI → Update prompts → Label as "experimental"
# 4. Run comparison (with new experimental prompts)
LANGFUSE_PROMPT_LABEL=experimental sip-benchmark run --benchmark-id abc123
# Output:
# Mean Score: 8.45
# P-value: 0.0023 (significant improvement!)
# Cohen's d: 0.82 (large effect size)
# Recommendation: DEPLOY - Significant improvement detected
# 5. View full report
sip-benchmark report --run-id <latest-run-id>
# 6. If satisfied, promote prompts to production in Langfuse UI
Configuration
Environment Variables
The tool uses the web app's .env.local automatically. Required variables:
# Database (Required)
POSTGRES_URL=postgres://...
# LLM APIs (Required)
OPENAI_API_KEY=sk-...
YOUTUBE_API_KEY=...
# Optional (uses defaults if not set)
LLM_MODEL=gpt-4.1-2025-04-14
MAML_MODEL=gpt-4o
ANTHROPIC_API_KEY=...
GOOGLE_GEMINI_API_KEY=...
# Langfuse (Optional - for prompt management)
LANGFUSE_PUBLIC_KEY=pk-...
LANGFUSE_SECRET_KEY=sk-...
LANGFUSE_PRODUCTION_PUBLIC_KEY=pk-...
LANGFUSE_PRODUCTION_SECRET_KEY=sk-...
LANGFUSE_PROMPT_LABEL=experimental
Benchmark Mode
When running benchmarks, the tool automatically sets:
BENCHMARK_MODE=true
This suppresses verbose logs for cleaner progress bars during parallel execution.
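How the flag is consumed is up to the web app's logging code; a minimal sketch of such a gate might look like this (the logger shape here is hypothetical).

```typescript
// Hypothetical logging gate keyed off BENCHMARK_MODE.
const benchmarkMode = process.env.BENCHMARK_MODE === "true";

function logVerbose(message: string): void {
  // Verbose logs would corrupt the progress bars during parallel runs,
  // so they are skipped when the benchmark CLI sets BENCHMARK_MODE=true.
  if (!benchmarkMode) console.log(message);
}

logVerbose("Fetching transcript for video ...");
```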
Performance & Costs
Speed (with parallel execution)
- Setup: ~5 min per channel (one-time)
- Run (18 test cases, 3 channels):
- With persona regen: ~3-6 minutes
- Without persona regen: ~2-4 minutes
Costs (OpenAI API)
- Setup: $0.60-$1.20 per channel (one-time)
- Run with personas: $4-$7 for 30 test cases
- Run without personas: $2-$4.50 for 30 test cases
Troubleshooting
Error: "Not in a Sip-AI project directory"
Solution: Navigate to the web app directory first:
cd /path/to/sip-ai
sip-benchmark <command>
Error: "Missing required environment variables"
Solution: Ensure .env.local exists and contains:
- POSTGRES_URL
- OPENAI_API_KEY
- YOUTUBE_API_KEY
Error: "Failed to import backend modules"
Solutions:
- Ensure web app dependencies are installed: npm install
- Check that TypeScript files exist in the backend/ directory
- Verify you're using Node.js 18+
Progress bars not showing
Solution: Ensure your terminal supports ANSI escape codes. Most modern terminals (iTerm2, Terminal.app, Windows Terminal) work fine.
Advanced Usage
Custom Model/Temperature
sip-benchmark run \
--benchmark-id abc123 \
--model gpt-4o \
--temperature 0.5
Skip Persona Regeneration (Quick Debugging)
sip-benchmark run \
--benchmark-id abc123 \
--skip-persona-regen
This skips persona generation and uses cached personas, saving ~2-3 minutes per run.
View All Benchmarks
sip-benchmark list
Export Report Data
Reports are saved in the reports/ directory:
- benchmark-<run-id>.md - Human-readable markdown
- benchmark-<run-id>.json - Machine-readable JSON
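Since the JSON report is machine-readable, it can be post-processed with a few lines of TypeScript. The report schema isn't documented here, so the field access below is a hypothetical example; inspect the keys of a real report first.

```typescript
// Hypothetical post-processing of a saved benchmark report.
import { readFileSync } from "node:fs";

const runId = process.argv[2]; // pass the run id on the command line
const report = JSON.parse(readFileSync(`reports/benchmark-${runId}.json`, "utf8"));

// Print the top-level structure first; the field names below are assumptions.
console.log(Object.keys(report));
console.log(report.summary?.meanScore); // e.g. the run's mean score, if present
```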
Development
Project Structure
sip-benchmark/
├── bin/
│ └── cli.js # Executable entry point
├── src/
│ ├── cli.ts # Commander.js CLI interface
│ ├── backend-importer.ts # Dynamic backend loader
│ ├── types.ts # Type definitions
│ └── commands/
│ ├── run.ts # Run benchmark command
│ ├── setup.ts # Setup benchmark command
│ ├── list.ts # List benchmarks command
│ ├── menu.ts # Interactive menu command
│ └── report.ts # View report command
├── package.json
├── tsconfig.json
└── README.md
Building
npm run build
Compiles TypeScript to the dist/ directory.
Testing Locally
cd /path/to/sip-ai
npm run build # Build the tool first
node /path/to/sip-benchmark/bin/cli.js list
Publishing (Optional)
Publish to npm
cd /path/to/sip-benchmark
npm login
npm publish
Then colleagues can install:
npm install -g sip-benchmark
Publish to Homebrew
- Create a formula
- Submit to homebrew-core or your own tap
Support
For issues or questions:
- Check this README first
- Review the web app's scripts/benchmark/README.md for detailed benchmark system docs
- Contact the team
License
MIT - Internal tool for Sip AI team
