
Sip Benchmark Tool

Internal CLI tool for benchmarking the Sip AI persona evaluation agent.

This standalone tool allows you to evaluate and compare different versions of the IdeaEvaluationAgent by running systematic tests and measuring performance improvements.


Features

  • 🎯 Systematic Testing: Create persistent test suites from real YouTube channels
  • 📊 Statistical Analysis: P-values, Cohen's d, Olympic-trimmed means
  • 🤖 LLM-as-Judge: Automated evaluation of agent predictions
  • 🔄 Persona Regeneration: Test with fresh personas on each run
  • 📈 Progress Tracking: Real-time progress bars and detailed logs
  • 📝 Comprehensive Reports: Markdown + JSON reports with deployment recommendations

Installation

Option 1: npm (Global Installation)

cd /path/to/sip-benchmark
npm install
npm link

Now you can use sip-benchmark from anywhere!

Option 2: npm (Local to Web App)

cd /path/to/sip-ai
npm install --save-dev /path/to/sip-benchmark

Then use via npx:

npx sip-benchmark <command>

Option 3: Direct Execution (No Installation)

cd /path/to/sip-ai
node /path/to/sip-benchmark/bin/cli.js <command>

Requirements

  1. Sip-AI Web App: This tool must be run from within the Sip-AI web app directory
  2. Environment Setup: .env.local must be configured with:
    • POSTGRES_URL - Database connection
    • OPENAI_API_KEY - For agent and embeddings
    • YOUTUBE_API_KEY - For scraping channels
    • Other API keys as needed (Anthropic, Gemini, Langfuse, etc.)
  3. Dependencies: Web app dependencies must be installed (npm install)

Usage

Quick Start (Interactive Menu)

The easiest way to use the tool is with the interactive menu:

cd /path/to/sip-ai
sip-benchmark menu

This launches a user-friendly guided interface with options for:

  • Quick Test (paste URL → auto-create → run → report)
  • Create Benchmark Suite
  • Run Benchmark
  • View Reports
  • List Benchmarks

Command-Line Interface

1. Create a Benchmark Suite

sip-benchmark setup \
  --name "My Benchmark" \
  --channels "UCxxx,UCyyy" \
  --cases-per-tier 5

Options:

  • --name (required): Benchmark name
  • --channels (required): Comma-separated channel IDs or YouTube URLs
  • --cases-per-tier (optional): Test cases per performance tier (default: 5)
  • --description (optional): Benchmark description

What it does:

  • Samples high-performing (top 25%) and low-performing (bottom 25%) videos (tier sampling sketched after this list)
  • Generates test cases with ground truth metrics (view percentile, engagement, sentiment)
  • Creates idea descriptions from video title + description + transcript
  • Stores test cases in database for reuse
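
For intuition, here is a minimal TypeScript sketch of the tier sampling described above. It is not the tool's actual code: the Video shape, the field names, and the selection details are assumptions that only mirror the 25% cut-offs from this list.

// Illustrative sketch only; the real sampling lives in the web app's benchmark scripts.
interface Video {
  id: string;
  viewCount: number; // assumed ground-truth metric used for ranking
}

function sampleTiers(videos: Video[], casesPerTier: number): { high: Video[]; low: Video[] } {
  // Sort by views so the top and bottom quartiles are easy to slice off.
  const sorted = [...videos].sort((a, b) => b.viewCount - a.viewCount);
  const cut = Math.max(1, Math.floor(sorted.length * 0.25));
  return {
    high: sorted.slice(0, cut).slice(0, casesPerTier), // top 25% = high-performing
    low: sorted.slice(-cut).slice(0, casesPerTier),    // bottom 25% = low-performing
  };
}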

2. Run a Benchmark

sip-benchmark run --benchmark-id <uuid>

Options:

  • --benchmark-id (required): Benchmark UUID to run
  • --model (optional): Model to use (default: from env or gpt-4.1-2025-04-14)
  • --temperature (optional): Temperature (default: 0.7)
  • --skip-persona-regen (optional): Use cached personas instead of regenerating

What it does:

  • Regenerates personas for all channels (by default)
  • Runs IdeaEvaluationAgent (Steps 3-6) on all test cases
  • Evaluates predictions with LLM judge (3 dimensions)
  • Calculates statistics (mean, SD, p-value vs baseline)
  • Generates markdown + JSON reports in reports/ directory

3. List Benchmarks

sip-benchmark list

Shows all benchmarks with test case counts and run history.

4. View Report

sip-benchmark report --run-id <uuid>

Displays a summary of a benchmark run with statistics and recommendations.


How It Works

Architecture

The tool is a lightweight wrapper that (see the sketch after this list):

  1. Detects the Sip-AI web app directory
  2. Loads .env.local environment variables
  3. Validates required environment variables
  4. Calls the web app's npm run benchmark:* scripts
  5. Uses the exact production code for testing (no mocking)
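
The real implementation lives in src/ (see Project Structure below); the TypeScript sketch that follows only illustrates the wrapper flow above. The directory check, the hand-rolled .env.local parsing, and the benchmark:run script name are assumptions, not the actual code.

// Rough sketch of the wrapper flow; not the tool's real implementation.
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { spawnSync } from "node:child_process";

function loadEnvLocal(appDir: string): Record<string, string> {
  // Minimal .env parser; the real tool may rely on a library instead.
  const env: Record<string, string> = {};
  for (const line of readFileSync(join(appDir, ".env.local"), "utf8").split("\n")) {
    const match = line.match(/^([A-Za-z0-9_]+)=(.*)$/);
    if (match) env[match[1]] = match[2];
  }
  return env;
}

function runBenchmark(appDir: string): void {
  if (!existsSync(join(appDir, "package.json"))) {
    throw new Error("Not in a Sip-AI project directory");
  }
  const env = { ...process.env, ...loadEnvLocal(appDir), BENCHMARK_MODE: "true" };
  for (const key of ["POSTGRES_URL", "OPENAI_API_KEY", "YOUTUBE_API_KEY"]) {
    if (!env[key]) throw new Error(`Missing required environment variables: ${key}`);
  }
  // Delegate to the web app's own npm script so the exact production code runs, unmocked.
  spawnSync("npm", ["run", "benchmark:run"], { cwd: appDir, env, stdio: "inherit" });
}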

Test Flow

┌─────────────────────────────────────────────────┐
│ 1. Setup: Sample Videos + Generate Test Cases  │
└─────────────────┬───────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────┐
│ 2. Run: Regenerate Personas (optional)         │
└─────────────────┬───────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────┐
│ 3. Run: Execute Agent on All Test Cases        │
│    - Topic Fit Score (Step 3)                  │
│    - Audience Resonance (Step 4)               │
│    - Risk Assessment (Step 5)                  │
│    - Final Report (Step 6)                     │
└─────────────────┬───────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────┐
│ 4. Judge: Evaluate Predictions vs Reality      │
│    - Directional Accuracy (40%)                │
│    - Verdict Alignment (40%)                   │
│    - Internal Consistency (20%)                │
└─────────────────┬───────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────┐
│ 5. Analyze: Calculate Statistics               │
│    - Olympic-trimmed mean                      │
│    - Two-sample t-test vs baseline             │
│    - Cohen's d effect size                     │
└─────────────────┬───────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────┐
│ 6. Report: Generate Markdown + JSON            │
│    - Statistics summary                        │
│    - Deployment recommendation                 │
│    - Individual test case results              │
└─────────────────────────────────────────────────┘
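
To make steps 4 and 5 concrete, here is an illustrative TypeScript sketch of the scoring math. The 40/40/20 judge weights and the statistics named in the diagram come from above; the function shapes, and the exact trim (drop one highest and one lowest score), are assumptions rather than the tool's real code.

// Illustrative math only; the actual implementation lives in the web app's benchmark scripts.
function judgeScore(directional: number, verdict: number, consistency: number): number {
  // Step 4: weighted combination of the three judge dimensions (40% / 40% / 20%).
  return 0.4 * directional + 0.4 * verdict + 0.2 * consistency;
}

function olympicTrimmedMean(scores: number[]): number {
  // Step 5: drop the single highest and lowest score before averaging ("Olympic" trim).
  const sorted = [...scores].sort((a, b) => a - b);
  const trimmed = sorted.length > 2 ? sorted.slice(1, -1) : sorted;
  return trimmed.reduce((sum, s) => sum + s, 0) / trimmed.length;
}

function cohensD(current: number[], baseline: number[]): number {
  // Step 5: effect size = difference of means over the pooled standard deviation.
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  };
  const pooledSd = Math.sqrt(
    ((current.length - 1) * variance(current) + (baseline.length - 1) * variance(baseline)) /
      (current.length + baseline.length - 2)
  );
  return (mean(current) - mean(baseline)) / pooledSd;
}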

Workflow Example

Scenario: Test a New Prompt Version

# 1. Create benchmark suite (one-time setup)
cd /path/to/sip-ai
sip-benchmark setup \
  --name "Tech Channels Q1 2025" \
  --channels "UCxxx,UCyyy,UCzzz" \
  --cases-per-tier 10

# Output: Benchmark ID: abc123...

# 2. Run baseline (with current production prompts)
sip-benchmark run --benchmark-id abc123

# Output:
#   Mean Score: 7.23
#   Baseline established!

# 3. Edit prompts in Langfuse (label: "experimental")
# Visit Langfuse UI → Update prompts → Label as "experimental"

# 4. Run comparison (with new experimental prompts)
LANGFUSE_PROMPT_LABEL=experimental sip-benchmark run --benchmark-id abc123

# Output:
#   Mean Score: 8.45
#   P-value: 0.0023 (significant improvement!)
#   Cohen's d: 0.82 (large effect size)
#   Recommendation: DEPLOY - Significant improvement detected

# 5. View full report
sip-benchmark report --run-id <latest-run-id>

# 6. If satisfied, promote prompts to production in Langfuse UI

Configuration

Environment Variables

The tool loads the web app's .env.local automatically. Required and optional variables:

# Database (Required)
POSTGRES_URL=postgres://...

# LLM APIs (Required)
OPENAI_API_KEY=sk-...
YOUTUBE_API_KEY=...

# Optional (uses defaults if not set)
LLM_MODEL=gpt-4.1-2025-04-14
MAML_MODEL=gpt-4o
ANTHROPIC_API_KEY=...
GOOGLE_GEMINI_API_KEY=...

# Langfuse (Optional - for prompt management)
LANGFUSE_PUBLIC_KEY=pk-...
LANGFUSE_SECRET_KEY=sk-...
LANGFUSE_PRODUCTION_PUBLIC_KEY=pk-...
LANGFUSE_PRODUCTION_SECRET_KEY=sk-...
LANGFUSE_PROMPT_LABEL=experimental

Benchmark Mode

When running benchmarks, the tool automatically sets:

BENCHMARK_MODE=true

This suppresses verbose logs for cleaner progress bars during parallel execution.
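
For illustration, code inside the web app could gate its verbose logging on this flag; the snippet below is only a sketch, not the app's actual logging setup.

// Sketch of gating verbose output on BENCHMARK_MODE; the real log plumbing may differ.
const benchmarkMode = process.env.BENCHMARK_MODE === "true";

function logVerbose(message: string): void {
  // Stay quiet during benchmarks so progress bars render cleanly.
  if (!benchmarkMode) console.log(message);
}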


Performance & Costs

Speed (with parallel execution)

  • Setup: ~5 min per channel (one-time)
  • Run (18 test cases, 3 channels):
    • With persona regen: ~3-6 minutes
    • Without persona regen: ~2-4 minutes

Costs (OpenAI API)

  • Setup: $0.60-$1.20 per channel (one-time)
  • Run with personas: $4-$7 for 30 test cases
  • Run without personas: $2-$4.50 for 30 test cases

Troubleshooting

Error: "Not in a Sip-AI project directory"

Solution: Navigate to the web app directory first:

cd /path/to/sip-ai
sip-benchmark <command>

Error: "Missing required environment variables"

Solution: Ensure .env.local exists and contains:

  • POSTGRES_URL
  • OPENAI_API_KEY
  • YOUTUBE_API_KEY

Error: "Failed to import backend modules"

Solutions:

  1. Ensure web app dependencies are installed: npm install
  2. Check that TypeScript files exist in backend/ directory
  3. Verify you're using Node.js 18+

Progress bars not showing

Solution: Ensure your terminal supports ANSI escape codes. Most modern terminals (iTerm2, Terminal.app, Windows Terminal) work fine.


Advanced Usage

Custom Model/Temperature

sip-benchmark run \
  --benchmark-id abc123 \
  --model gpt-4o \
  --temperature 0.5

Skip Persona Regeneration (Quick Debugging)

sip-benchmark run \
  --benchmark-id abc123 \
  --skip-persona-regen

This skips persona generation and uses cached personas, saving ~2-3 minutes per run.

View All Benchmarks

sip-benchmark list

Export Report Data

Reports are saved in the reports/ directory (see the loading sketch after this list):

  • benchmark-<run-id>.md - Human-readable markdown
  • benchmark-<run-id>.json - Machine-readable JSON
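
If you want to post-process results programmatically, the JSON report can be loaded directly. The snippet below is a hypothetical example: the meanScore and recommendation fields are assumptions, not the documented report schema.

// Hypothetical: load a JSON report and print a couple of (assumed) fields.
import { readFileSync } from "node:fs";

const runId = process.argv[2];
const report = JSON.parse(readFileSync(`reports/benchmark-${runId}.json`, "utf8"));
console.log(`Mean score: ${report.meanScore}`);           // field name is an assumption
console.log(`Recommendation: ${report.recommendation}`);  // field name is an assumption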

Development

Project Structure

sip-benchmark/
├── bin/
│   └── cli.js              # Executable entry point
├── src/
│   ├── cli.ts              # Commander.js CLI interface
│   ├── backend-importer.ts # Dynamic backend loader
│   ├── types.ts            # Type definitions
│   └── commands/
│       ├── run.ts          # Run benchmark command
│       ├── setup.ts        # Setup benchmark command
│       ├── list.ts         # List benchmarks command
│       ├── menu.ts         # Interactive menu command
│       └── report.ts       # View report command
├── package.json
├── tsconfig.json
└── README.md

Building

npm run build

Compiles TypeScript to dist/ directory.

Testing Locally

cd /path/to/sip-benchmark
npm run build  # Build the tool first
cd /path/to/sip-ai
node /path/to/sip-benchmark/bin/cli.js list

Publishing (Optional)

Publish to npm

cd /path/to/sip-benchmark
npm login
npm publish

Then colleagues can install:

npm install -g sip-benchmark

Publish to Homebrew

  1. Create a formula
  2. Submit to homebrew-core or your own tap

Support

For issues or questions:

  1. Check this README first
  2. Review the web app's scripts/benchmark/README.md for detailed benchmark system docs
  3. Contact the team

License

MIT - Internal tool for the Sip AI team