
praisonaibench

v0.1.0

Published

Multi-provider LLM Benchmarking Framework - Evaluate code generation with OpenAI, Anthropic, Google, xAI, Mistral, Groq

Downloads

104

Readme

PraisonAI Bench TypeScript

🚀 A powerful multi-provider LLM benchmarking framework

Benchmark any LLM with automatic code evaluation, TypeScript/HTML execution, and comprehensive reports. Supports OpenAI, Anthropic, Google, xAI, Mistral, and Groq via Vercel AI SDK.

Node.js 16+ · MIT License · TypeScript

🎯 Features

| Feature | Description |
|---------|-------------|
| 🤖 Multi-Provider | OpenAI, Anthropic, Google, xAI, Mistral, Groq |
| 📊 Multi-Stage Evaluation | Syntax validation, code execution, output comparison |
| 💰 Cost & Token Tracking | Automatic token usage and cost calculation |
| 📈 HTML Reports | Beautiful dashboard reports with charts |
| ⚡ Parallel Execution | Run tests concurrently |
| 🔌 Plugin System | Extensible evaluators for any language (docs) |
| 🎯 Test Suites | YAML/JSON test suite support |
| 🔄 Retry Logic | Automatic retries with exponential backoff |

🤖 Supported Providers

| Provider | Models | Env Variable |
|----------|--------|--------------|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, o1, o1-mini | OPENAI_API_KEY |
| Anthropic | claude-3-5-sonnet-latest, claude-3-opus-latest, claude-3-haiku | ANTHROPIC_API_KEY |
| Google | gemini-2.0-flash-exp, gemini-1.5-pro, gemini-1.5-flash | GOOGLE_GENERATIVE_AI_API_KEY |
| xAI | grok-beta, grok-2-1212 | XAI_API_KEY |
| Mistral | mistral-large-latest, mistral-medium-latest | MISTRAL_API_KEY |
| Groq | llama-3.1-70b-versatile, mixtral-8x7b-32768 | GROQ_API_KEY |

📊 Evaluation System

| Stage | Points | Description |
|-------|--------|-------------|
| Syntax Validation | 30 | TypeScript compiler API parsing |
| Code Execution | 40 | Safe ts-node subprocess execution |
| Output Comparison | 30 | Fuzzy matching with expected output |
| Total | 100 | Pass threshold: ≥70 |

🚀 Quick Start

Installation

# Install globally
npm install -g praisonaibench

# Or install locally
npm install praisonaibench

Set API Keys

# OpenAI (default)
export OPENAI_API_KEY=your_openai_key

# Or use other providers
export ANTHROPIC_API_KEY=your_anthropic_key
export GOOGLE_GENERATIVE_AI_API_KEY=your_google_key
export GROQ_API_KEY=your_groq_key

Run Your First Test

# Single test with OpenAI (default)
praisonaibench --test "Write TypeScript code that prints Hello World"

# With specific model
praisonaibench --test "Calculate factorial of 5" --model gpt-4o-mini

# Use Anthropic Claude
praisonaibench --test "Write a hello world" --model claude-3-5-sonnet-latest

# Use Google Gemini
praisonaibench --test "Write a hello world" --model gemini-1.5-flash

# Cross-model comparison
praisonaibench --cross-model "Write hello world" --models openai/gpt-4o,anthropic/claude-3-5-sonnet-latest

# Run test suite with report
praisonaibench --suite tests.yaml --report

# List available providers
praisonaibench --list-providers

Verify Installation

# Check that the evaluator plugin loads (run from the project root)
node -e "const { TypeScriptEvaluator } = require('./dist'); console.log('Plugin loaded successfully!');"

Configuration

Create a .env file (or copy from .env.example):

# OpenAI API Key for LLM-based benchmarking
OPENAI_API_KEY=your_api_key_here

# Default model
DEFAULT_MODEL=gpt-4o-mini

# Execution timeout (seconds)
TYPESCRIPT_EXECUTION_TIMEOUT=5

Basic Usage

Create a test suite file tests.yaml:

tests:
  - name: "hello_world"
    language: "typescript"
    prompt: "Write TypeScript code that prints 'Hello World'"
    expected: "Hello World"
  
  - name: "calculate_factorial"
    language: "typescript"
    prompt: "Write a TypeScript function that calculates factorial of 5"
    expected: "120"

Run the benchmarks:

praisonaibench --suite tests.yaml --model gpt-4o-mini

📊 Evaluation System

Scoring Breakdown

The evaluator uses a three-stage assessment system:

| Stage | Points | Description |
|-------|--------|-------------|
| Syntax Validation | 30 | TypeScript compiler API parsing, import detection |
| Code Execution | 40 | Safe ts-node subprocess execution, error capture |
| Output Comparison | 30 | Fuzzy matching with expected output |
| Total | 100 | Combined score |

Pass Threshold: 70/100 points

Scoring Examples

Example 1: Perfect Score (100/100)

// Code: console.log("Hello World")
// Expected: "Hello World"

✅ Syntax: 30 points (valid TypeScript)
✅ Execution: 40 points (runs successfully)
✅ Output: 30 points (exact match)
─────────────────────────
Total: 100/100 ✅ PASSED

Example 2: Partial Score (70/100)

// Code: console.log("Hello")
// Expected: "Hello World"

✅ Syntax: 30 points (valid TypeScript)
✅ Execution: 40 points (runs successfully)
⚠️  Output: 0 points (different output)
─────────────────────────
Total: 70/100 ✅ PASSED

Example 3: Failure (30/100)

// Code: console.log(undefinedVar)
// Expected: "Hello World"

✅ Syntax: 30 points (valid syntax)
❌ Execution: 0 points (ReferenceError)
❌ Output: 0 points (didn't execute)
─────────────────────────
Total: 30/100 ❌ FAILED

📖 Usage Guide

TypeScript API

import { TypeScriptEvaluator } from 'praisonaibench';

// Create evaluator
const evaluator = new TypeScriptEvaluator(5); // 5 second timeout

// Evaluate code
const result = await evaluator.evaluate(
  'console.log("Hello World")',
  "hello_test",
  "Write TypeScript code that prints Hello World",
  "Hello World"
);

// Check results
console.log(`Score: ${result.score}/100`);
console.log(`Passed: ${result.passed}`);

// View feedback
for (const item of result.feedback) {
  console.log(`${item.level}: ${item.message}`);
}

// Access details
console.log(`Output: ${result.details.output}`);
console.log(`Score breakdown:`, result.details.score_breakdown);

Test Suite Format

Simple Test

tests:
  - name: "basic_math"
    language: "typescript"
    prompt: "Calculate 15 * 23 and print the result"
    expected: "345"

Advanced Test

tests:
  - name: "fibonacci"
    language: "typescript"
    prompt: |
      Write a TypeScript function that calculates the nth Fibonacci number.
      Calculate and print the 10th Fibonacci number.
    expected: "55"

Test Without Expected Output

tests:
  - name: "creative_code"
    language: "typescript"
    prompt: "Write a TypeScript class for a simple calculator"
    # No expected field - evaluation based on syntax and execution only

🎨 Features

Security Features

  • Subprocess Isolation - Code runs in separate process via ts-node
  • Timeout Protection - Configurable execution timeout (default: 5s)
  • Resource Limits - Prevents infinite loops and resource exhaustion
  • Error Handling - Graceful handling of all error types
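
The subprocess-isolation and timeout behavior above can be sketched with Node's `child_process`. This is a minimal illustration of the pattern, not the package's actual implementation: `runIsolated` is a hypothetical helper, and it runs plain Node rather than ts-node to stay self-contained.

```typescript
import { spawnSync } from "node:child_process";

interface RunResult {
  output: string;
  timedOut: boolean;
  failed: boolean;
}

// Run a code snippet in a separate Node process so a crash or an
// infinite loop cannot take down the evaluator itself.
function runIsolated(code: string, timeoutSeconds = 5): RunResult {
  const child = spawnSync(process.execPath, ["-e", code], {
    timeout: timeoutSeconds * 1000, // child is killed when the timeout elapses
    encoding: "utf8",
  });
  return {
    output: (child.stdout ?? "").trim(),
    timedOut: child.signal !== null,                      // killed by the timeout signal
    failed: child.status !== 0 && child.status !== null,  // non-zero exit code
  };
}

// A well-behaved snippet completes normally...
console.log(runIsolated('console.log("Hello World")').output);
// ...while a runaway loop is stopped by the timeout instead of hanging.
console.log(runIsolated("while (true) {}", 1).timedOut);
```

Because the child is a separate process, the timeout can always be enforced with a kill signal, which is not possible for synchronous code running in the evaluator's own event loop.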

Code Extraction

Automatically extracts code from various formats:

// Supports typescript code blocks
```typescript
console.log("Hello")
```

// Supports ts code blocks
```ts
console.log("Hello")
```

// Supports generic code blocks
```
console.log("Hello")
```

// Supports raw code
console.log('Hello')

Output Comparison

Smart fuzzy matching algorithm:

  • Exact match: 30/30 points
  • High similarity (>80%): 25-29 points
  • Medium similarity (50-80%): 15-24 points
  • Low similarity (<50%): 0-14 points

Features:

  • Case-insensitive comparison
  • Whitespace normalisation
  • Substring matching (e.g., "345" in "The answer is 345")
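
The tiers above can be sketched as follows. The package's actual similarity measure isn't documented here, so a normalized Levenshtein distance stands in for it; `normalize`, `levenshtein`, and `outputPoints` are hypothetical names used only for this illustration.

```typescript
// Case-insensitive, whitespace-normalized comparison, per the feature list.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Classic edit distance with a single rolling row.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i);
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0];
    dp[0] = j;
    for (let i = 1; i <= a.length; i++) {
      const tmp = dp[i];
      dp[i] = Math.min(
        dp[i] + 1,      // deletion
        dp[i - 1] + 1,  // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      prev = tmp;
    }
  }
  return dp[a.length];
}

// Map similarity onto the point tiers described above.
function outputPoints(actual: string, expected: string): number {
  const a = normalize(actual);
  const e = normalize(expected);
  if (a === e || a.includes(e)) return 30; // exact or substring match
  const similarity = 1 - levenshtein(a, e) / Math.max(a.length, e.length);
  if (similarity > 0.8) return Math.round(25 + (similarity - 0.8) * 20);   // 25-29
  if (similarity >= 0.5) return Math.round(15 + (similarity - 0.5) * 30);  // 15-24
  return Math.round(similarity * 28);                                      // 0-14
}

console.log(outputPoints("The answer is 345", "345")); // substring match: full points
```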

Detailed Feedback

{
  score: 85,
  passed: true,
  feedback: [
    { level: "success", message: "✅ Valid TypeScript syntax" },
    { level: "info", message: "📦 Imports: fs, path" },
    { level: "success", message: "✅ Code executed successfully" },
    { level: "info", message: "📤 Output: Hello World" },
    { level: "warning", message: "⚠️  Output partially matches expected" }
  ],
  details: {
    extracted_code: "console.log('Hello World')",
    executed: true,
    output: "Hello World",
    similarity: 0.95,
    score_breakdown: {
      syntax: 30,
      execution: 40,
      output_match: 28
    }
  }
}
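
Typed out, the result shape above might look like the following. These interface and field names are inferred from the example object and may differ from the package's actual typings.

```typescript
// Hypothetical typings inferred from the sample result above.
type FeedbackLevel = "success" | "info" | "warning" | "error";

interface FeedbackItem {
  level: FeedbackLevel;
  message: string;
}

interface ScoreBreakdown {
  syntax: number;       // up to 30
  execution: number;    // up to 40
  output_match: number; // up to 30
}

interface EvaluationResult {
  score: number;   // 0-100
  passed: boolean; // score >= 70
  feedback: FeedbackItem[];
  details: {
    extracted_code: string;
    executed: boolean;
    output: string;
    similarity: number; // 0-1
    score_breakdown: ScoreBreakdown;
  };
}
```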

📚 Examples

Example 1: Hello World

tests:
  - name: "hello_world"
    language: "typescript"
    prompt: "Write TypeScript code that prints 'Hello World'"
    expected: "Hello World"

Example 2: Factorial Function

tests:
  - name: "factorial"
    language: "typescript"
    prompt: |
      Write a TypeScript function that calculates the factorial of a number.
      Calculate factorial(5) and print the result.
    expected: "120"

Example 3: Interface Usage

tests:
  - name: "interface_test"
    language: "typescript"
    prompt: |
      Define a Person interface with name and age properties.
      Create a person and print their name.
    expected: "Alice"

More examples available in:

  • examples/simple_tests.yaml - Basic TypeScript tests
  • examples/advanced_tests.yaml - Complex TypeScript challenges
  • examples/algorithm_tests.yaml - Algorithm implementations

🧪 Testing

Run Unit Tests

# Install dependencies
npm install

# Run all tests
npm test

# Run with coverage
npm run test:coverage

Test Coverage

The plugin includes comprehensive tests:

  • Unit Tests (tests/evaluator.test.ts)

    • Code extraction
    • Syntax validation
    • Code execution
    • Output comparison
    • Error handling
    • Timeout protection
  • Integration Tests (tests/integration.test.ts)

    • Plugin interface compatibility
    • Multiple test scenarios
    • Concurrent evaluations
    • Large output handling
    • Import support

🔧 Configuration

Environment Variables

# Required
OPENAI_API_KEY=your_api_key_here

# Optional
DEFAULT_MODEL=gpt-4o-mini
TYPESCRIPT_EXECUTION_TIMEOUT=5
TS_NODE_EXECUTABLE=/path/to/ts-node  # Leave empty for npx

Programmatic Configuration

import { TypeScriptEvaluator } from 'praisonaibench';

// Custom timeout (seconds)
const evaluator = new TypeScriptEvaluator(10);

// Custom ts-node path
const evaluatorWithCustomPath = new TypeScriptEvaluator(
  5,
  "/usr/local/bin/ts-node"
);

🏗️ Architecture

Plugin Structure

praisonaibench/
├── src/
│   ├── index.ts              # Plugin exports
│   ├── evaluator.ts          # Main evaluator class
│   └── version.ts            # Version info
├── tests/
│   ├── evaluator.test.ts     # Unit tests
│   └── integration.test.ts   # Integration tests
├── examples/
│   ├── simple_tests.yaml
│   ├── advanced_tests.yaml
│   └── algorithm_tests.yaml
├── package.json              # Project configuration
├── tsconfig.json             # TypeScript config
├── .env                      # Configuration
└── README.md                 # This file

Class Hierarchy

BaseEvaluator (interface)
    └── TypeScriptEvaluator
        ├── getLanguage() → 'typescript'
        ├── getFileExtension() → 'ts'
        └── evaluate(code, testName, prompt, expected) → Promise<EvaluationResult>
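
Read as TypeScript, the hierarchy above might look like the sketch below. `EchoEvaluator` and the simplified result shape are hypothetical, shown only to illustrate how a custom evaluator would implement the same interface.

```typescript
// Simplified result shape for this sketch.
interface EvaluationResult {
  score: number;
  passed: boolean;
}

// The interface from the diagram above.
interface BaseEvaluator {
  getLanguage(): string;
  getFileExtension(): string;
  evaluate(
    code: string,
    testName: string,
    prompt: string,
    expected?: string
  ): Promise<EvaluationResult>;
}

// A hypothetical custom evaluator plugging into the same interface.
class EchoEvaluator implements BaseEvaluator {
  getLanguage(): string {
    return "typescript";
  }
  getFileExtension(): string {
    return "ts";
  }
  async evaluate(
    code: string,
    _testName: string,
    _prompt: string,
    _expected?: string
  ): Promise<EvaluationResult> {
    // Toy scoring for illustration: non-empty code "passes".
    // A real evaluator would parse and execute the code instead.
    const score = code.trim().length > 0 ? 100 : 0;
    return { score, passed: score >= 70 };
  }
}
```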

🤝 Contributing

Contributions are welcome! Here's how:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes
  4. Run tests: npm test
  5. Submit a pull request

Development Setup

# Clone repository
git clone https://github.com/MervinPraison/PraisonAIBench-TypeScript
cd praisonaibench

# Install dependencies
npm install

# Run tests
npm test

# Build
npm run build

📄 License

MIT License - see LICENSE file for details.

🎉 Acknowledgements

Built with ❤️ for the PraisonAI Bench community.

Special thanks to:

  • PraisonAI - For the amazing benchmarking framework
  • Contributors and testers
  • The TypeScript community

Ready to benchmark TypeScript code generation? Install the plugin and start testing! 🚀