Sip Benchmark Tool
Internal CLI tool for benchmarking the Sip AI persona evaluation agent.
This standalone tool allows you to evaluate and compare different versions of the IdeaEvaluationAgent by running systematic tests and measuring performance improvements.
Features
- 🎯 Systematic Testing: Create persistent test suites from real YouTube channels
- 📊 Statistical Analysis: P-values, Cohen's d, Olympic-trimmed means
- 🤖 LLM-as-Judge: Automated evaluation of agent predictions
- 🔄 Persona Regeneration: Test with fresh personas on each run
- 📈 Progress Tracking: Real-time progress bars and detailed logs
- 📝 Comprehensive Reports: Markdown + JSON reports with deployment recommendations
Installation
Option 1: npm (Global Installation)
cd /path/to/sip-benchmark
npm install
npm link
Now you can use sip-benchmark from anywhere!
Option 2: npm (Local to Web App)
cd /path/to/sip-ai
npm install --save-dev /path/to/sip-benchmark
Then use via npx:
npx sip-benchmark <command>
Option 3: Direct Execution (No Installation)
cd /path/to/sip-ai
node /path/to/sip-benchmark/bin/cli.js <command>
Requirements
- Sip-AI Web App: This tool must be run from within the Sip-AI web app directory
- Environment Setup: .env.local must be configured with:
  - POSTGRES_URL - Database connection
  - OPENAI_API_KEY - For agent and embeddings
  - YOUTUBE_API_KEY - For scraping channels
  - Other API keys as needed (Anthropic, Gemini, Langfuse, etc.)
- Dependencies: Web app dependencies must be installed (npm install)
Usage
Quick Start (Interactive Menu)
The easiest way to use the tool is with the interactive menu:
cd /path/to/sip-ai
sip-benchmark menu
This launches a user-friendly guided interface with options for:
- Quick Test (paste URL → auto-create → run → report)
- Create Benchmark Suite
- Run Benchmark
- View Reports
- List Benchmarks
Command-Line Interface
1. Create a Benchmark Suite
sip-benchmark setup \
--name "My Benchmark" \
--channels "UCxxx,UCyyy" \
--cases-per-tier 5
Options:
- --name (required): Benchmark name
- --channels (required): Comma-separated channel IDs or YouTube URLs
- --cases-per-tier (optional): Test cases per performance tier (default: 5)
- --description (optional): Benchmark description
What it does:
- Samples high-performing (top 25%) and low-performing (bottom 25%) videos (see the sketch after this list)
- Generates test cases with ground truth metrics (view percentile, engagement, sentiment)
- Creates idea descriptions from video title + description + transcript
- Stores test cases in database for reuse
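To make the tier sampling concrete, here is a minimal TypeScript sketch. The video shape, field names, and the selection logic are assumptions for illustration; the real implementation lives in the web app's benchmark scripts.

```typescript
// Hypothetical sketch of tier sampling: the top and bottom 25% of a channel's
// videos by view count become the "high" and "low" performance tiers.

interface Video {
  id: string;
  title: string;
  viewCount: number; // assumed field name
}

interface TestCase {
  videoId: string;
  tier: "high" | "low";
  viewPercentile: number; // ground-truth metric stored with the case
}

function sampleTiers(videos: Video[], casesPerTier: number): TestCase[] {
  // Sort ascending by views so percentiles are easy to compute.
  const sorted = [...videos].sort((a, b) => a.viewCount - b.viewCount);
  const n = sorted.length;

  const toCase = (v: Video, tier: "high" | "low"): TestCase => ({
    videoId: v.id,
    tier,
    viewPercentile: Math.round((sorted.indexOf(v) / Math.max(n - 1, 1)) * 100),
  });

  const cutoff = Math.floor(n * 0.25);
  const low = sorted.slice(0, cutoff);   // bottom 25%
  const high = sorted.slice(n - cutoff); // top 25%

  // Simple head/tail pick here; the real tool may sample randomly within each tier.
  return [
    ...low.slice(0, casesPerTier).map((v) => toCase(v, "low")),
    ...high.slice(-casesPerTier).map((v) => toCase(v, "high")),
  ];
}
```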
2. Run a Benchmark
sip-benchmark run --benchmark-id <uuid>
Options:
- --benchmark-id (required): Benchmark UUID to run
- --model (optional): Model to use (default: from env or gpt-4.1-2025-04-14)
- --temperature (optional): Temperature (default: 0.7)
- --skip-persona-regen (optional): Use cached personas instead of regenerating
What it does:
- Regenerates personas for all channels (by default)
- Runs IdeaEvaluationAgent (Steps 3-6) on all test cases
- Evaluates predictions with LLM judge (3 dimensions)
- Calculates statistics (mean, SD, p-value vs baseline); see the sketch after this list
- Generates markdown + JSON reports in the reports/ directory
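The scoring and statistics can be summarized in a short TypeScript sketch. The dimension weights follow the Test Flow diagram below (40/40/20); the function names and the exact trimming rule are assumptions, not the web app's actual implementation.

```typescript
// Hypothetical sketch of the scoring math used in a run.

interface JudgeScores {
  directionalAccuracy: number; // 0-10
  verdictAlignment: number;    // 0-10
  internalConsistency: number; // 0-10
}

// Weighted composite per test case (weights as documented: 40% / 40% / 20%).
function overallScore(s: JudgeScores): number {
  return 0.4 * s.directionalAccuracy + 0.4 * s.verdictAlignment + 0.2 * s.internalConsistency;
}

// Olympic-trimmed mean: drop the single highest and lowest score, average the rest.
function olympicTrimmedMean(scores: number[]): number {
  if (scores.length <= 2) return scores.reduce((a, b) => a + b, 0) / scores.length;
  const trimmed = [...scores].sort((a, b) => a - b).slice(1, -1);
  return trimmed.reduce((a, b) => a + b, 0) / trimmed.length;
}

// Cohen's d against a baseline run, using the pooled standard deviation.
function cohensD(current: number[], baseline: number[]): number {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
  };
  const pooled = Math.sqrt(
    ((current.length - 1) * variance(current) + (baseline.length - 1) * variance(baseline)) /
      (current.length + baseline.length - 2)
  );
  return (mean(current) - mean(baseline)) / pooled;
}

// The p-value comes from a two-sample t-test over the same arrays; computing it
// needs a t-distribution CDF (from a stats library), so it is omitted here.
```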
3. List Benchmarks
sip-benchmark list
Shows all benchmarks with test case counts and run history.
4. View Report
sip-benchmark report --run-id <uuid>
Displays a summary of a benchmark run with statistics and recommendations.
How It Works
Architecture
The tool is a lightweight wrapper that:
- Detects the Sip-AI web app directory
- Loads .env.local environment variables
- Validates required environment variables
- Calls the web app's npm run benchmark:* scripts
- Uses the exact production code for testing (no mocking)
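A minimal sketch of what such a wrapper can look like is shown below. It assumes dotenv and Node's built-in modules; the package-name check, required-variable list, and script name are illustrative and may differ from src/backend-importer.ts.

```typescript
// Hypothetical wrapper flow: detect the web app, load its env, delegate to npm.
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { spawnSync } from "node:child_process";
import dotenv from "dotenv";

const REQUIRED_VARS = ["POSTGRES_URL", "OPENAI_API_KEY", "YOUTUBE_API_KEY"];

function assertSipAiDir(cwd: string): void {
  const pkgPath = join(cwd, "package.json");
  if (!existsSync(pkgPath)) throw new Error("Not in a Sip-AI project directory");
  const pkg = JSON.parse(readFileSync(pkgPath, "utf8"));
  // Illustrative check; the real detection logic may look at other markers.
  if (pkg.name !== "sip-ai") throw new Error("Not in a Sip-AI project directory");
}

function runBenchmarkScript(cwd: string, args: string[]): number {
  assertSipAiDir(cwd);
  dotenv.config({ path: join(cwd, ".env.local") }); // load the web app's env

  const missing = REQUIRED_VARS.filter((v) => !process.env[v]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }

  // Delegate to the web app's own script so production code runs unmocked.
  const result = spawnSync("npm", ["run", "benchmark:run", "--", ...args], {
    cwd,
    stdio: "inherit",
    env: { ...process.env, BENCHMARK_MODE: "true" },
  });
  return result.status ?? 1;
}
```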
Test Flow
┌─────────────────────────────────────────────────┐
│ 1. Setup: Sample Videos + Generate Test Cases │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 2. Run: Regenerate Personas (optional) │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 3. Run: Execute Agent on All Test Cases │
│ - Topic Fit Score (Step 3) │
│ - Audience Resonance (Step 4) │
│ - Risk Assessment (Step 5) │
│ - Final Report (Step 6) │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 4. Judge: Evaluate Predictions vs Reality │
│ - Directional Accuracy (40%) │
│ - Verdict Alignment (40%) │
│ - Internal Consistency (20%) │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 5. Analyze: Calculate Statistics │
│ - Olympic-trimmed mean │
│ - Two-sample t-test vs baseline │
│ - Cohen's d effect size │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ 6. Report: Generate Markdown + JSON │
│ - Statistics summary │
│ - Deployment recommendation │
│ - Individual test case results │
└─────────────────────────────────────────────────┘
Workflow Example
Scenario: Test a New Prompt Version
# 1. Create benchmark suite (one-time setup)
cd /path/to/sip-ai
sip-benchmark setup \
--name "Tech Channels Q1 2025" \
--channels "UCxxx,UCyyy,UCzzz" \
--cases-per-tier 10
# Output: Benchmark ID: abc123...
# 2. Run baseline (with current production prompts)
sip-benchmark run --benchmark-id abc123
# Output:
# Mean Score: 7.23
# Baseline established!
# 3. Edit prompts in Langfuse (label: "experimental")
# Visit Langfuse UI → Update prompts → Label as "experimental"
# 4. Run comparison (with new experimental prompts)
LANGFUSE_PROMPT_LABEL=experimental sip-benchmark run --benchmark-id abc123
# Output:
# Mean Score: 8.45
# P-value: 0.0023 (significant improvement!)
# Cohen's d: 0.82 (large effect size)
# Recommendation: DEPLOY - Significant improvement detected
# 5. View full report
sip-benchmark report --run-id <latest-run-id>
# 6. If satisfied, promote prompts to production in Langfuse UI
Configuration
Environment Variables
The tool uses the web app's .env.local automatically. Required variables:
# Database (Required)
POSTGRES_URL=postgres://...
# LLM APIs (Required)
OPENAI_API_KEY=sk-...
YOUTUBE_API_KEY=...
# Optional (uses defaults if not set)
LLM_MODEL=gpt-4.1-2025-04-14
MAML_MODEL=gpt-4o
ANTHROPIC_API_KEY=...
GOOGLE_GEMINI_API_KEY=...
# Langfuse (Optional - for prompt management)
LANGFUSE_PUBLIC_KEY=pk-...
LANGFUSE_SECRET_KEY=sk-...
LANGFUSE_PRODUCTION_PUBLIC_KEY=pk-...
LANGFUSE_PRODUCTION_SECRET_KEY=sk-...
LANGFUSE_PROMPT_LABEL=experimental
Benchmark Mode
When running benchmarks, the tool automatically sets:
BENCHMARK_MODE=true
This suppresses verbose logs for cleaner progress bars during parallel execution.
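How the flag is consumed is up to the web app's logging code; a minimal sketch of such a gate might look like this (the logger shape here is hypothetical).

```typescript
// Hypothetical logging gate keyed off BENCHMARK_MODE.
const benchmarkMode = process.env.BENCHMARK_MODE === "true";

function logVerbose(message: string): void {
  // Verbose logs would corrupt the progress bars during parallel runs,
  // so they are skipped when the benchmark CLI sets BENCHMARK_MODE=true.
  if (!benchmarkMode) console.log(message);
}

logVerbose("Fetching transcript for video ...");
```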
Performance & Costs
Speed (with parallel execution)
- Setup: ~5 min per channel (one-time)
- Run (18 test cases, 3 channels):
- With persona regen: ~3-6 minutes
- Without persona regen: ~2-4 minutes
Costs (OpenAI API)
- Setup: $0.60-$1.20 per channel (one-time)
- Run with personas: $4-$7 for 30 test cases
- Run without personas: $2-$4.50 for 30 test cases
Troubleshooting
Error: "Not in a Sip-AI project directory"
Solution: Navigate to the web app directory first:
cd /path/to/sip-ai
sip-benchmark <command>
Error: "Missing required environment variables"
Solution: Ensure .env.local exists and contains:
- POSTGRES_URL
- OPENAI_API_KEY
- YOUTUBE_API_KEY
Error: "Failed to import backend modules"
Solutions:
- Ensure web app dependencies are installed: npm install
- Check that TypeScript files exist in the backend/ directory
- Verify you're using Node.js 18+
Progress bars not showing
Solution: Ensure your terminal supports ANSI escape codes. Most modern terminals (iTerm2, Terminal.app, Windows Terminal) work fine.
Advanced Usage
Custom Model/Temperature
sip-benchmark run \
--benchmark-id abc123 \
--model gpt-4o \
--temperature 0.5
Skip Persona Regeneration (Quick Debugging)
sip-benchmark run \
--benchmark-id abc123 \
--skip-persona-regen
This skips persona generation and uses cached personas, saving ~2-3 minutes per run.
View All Benchmarks
sip-benchmark list
Export Report Data
Reports are saved in the reports/ directory:
- benchmark-<run-id>.md - Human-readable markdown
- benchmark-<run-id>.json - Machine-readable JSON
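Since the JSON report is machine-readable, it can be post-processed with a few lines of TypeScript. The report schema isn't documented here, so the field access below is a hypothetical example; inspect the keys of a real report first.

```typescript
// Hypothetical post-processing of a saved benchmark report.
import { readFileSync } from "node:fs";

const runId = process.argv[2]; // pass the run id on the command line
const report = JSON.parse(readFileSync(`reports/benchmark-${runId}.json`, "utf8"));

// Print the top-level structure first; the field names below are assumptions.
console.log(Object.keys(report));
console.log(report.summary?.meanScore); // e.g. the run's mean score, if present
```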
Development
Project Structure
sip-benchmark/
├── bin/
│ └── cli.js # Executable entry point
├── src/
│ ├── cli.ts # Commander.js CLI interface
│ ├── backend-importer.ts # Dynamic backend loader
│ ├── types.ts # Type definitions
│ └── commands/
│ ├── run.ts # Run benchmark command
│ ├── setup.ts # Setup benchmark command
│ ├── list.ts # List benchmarks command
│ ├── menu.ts # Interactive menu command
│ └── report.ts # View report command
├── package.json
├── tsconfig.json
└── README.md
Building
npm run build
Compiles TypeScript to the dist/ directory.
Testing Locally
cd /path/to/sip-ai
npm run build # Build the tool first
node /path/to/sip-benchmark/bin/cli.js list
Publishing (Optional)
Publish to npm
cd /path/to/sip-benchmark
npm login
npm publish
Then colleagues can install:
npm install -g sip-benchmark
Publish to Homebrew
- Create a formula
- Submit to homebrew-core or your own tap
Support
For issues or questions:
- Check this README first
- Review the web app's scripts/benchmark/README.md for detailed benchmark system docs
- Contact the team
License
MIT - Internal tool for Sip AI team
