PraisonAI Bench TypeScript
🚀 A powerful multi-provider LLM benchmarking framework
Benchmark any LLM with automatic code evaluation, TypeScript/HTML execution, and comprehensive reports. Supports OpenAI, Anthropic, Google, xAI, Mistral, and Groq via Vercel AI SDK.
🎯 Features
| Feature | Description |
|---------|-------------|
| 🤖 Multi-Provider | OpenAI, Anthropic, Google, xAI, Mistral, Groq |
| 📊 Multi-Stage Evaluation | Syntax validation, code execution, output comparison |
| 💰 Cost & Token Tracking | Automatic token usage and cost calculation |
| 📈 HTML Reports | Beautiful dashboard reports with charts |
| ⚡ Parallel Execution | Run tests concurrently |
| 🔌 Plugin System | Extensible evaluators for any language (docs) |
| 🎯 Test Suites | YAML/JSON test suite support |
| 🔄 Retry Logic | Automatic retries with exponential backoff |
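The retry logic noted above follows the usual exponential-backoff pattern. A minimal sketch of that pattern (illustrative only, not the framework's internal code; `withRetry` is a hypothetical helper):

```typescript
// Illustrative exponential backoff: wait baseMs * 2^attempt between retries.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries: rethrow
      await new Promise((r) => setTimeout(r, baseMs * 2 ** attempt));
    }
  }
}
```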
🤖 Supported Providers
| Provider | Models | Env Variable |
|----------|--------|--------------|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, o1, o1-mini | OPENAI_API_KEY |
| Anthropic | claude-3-5-sonnet-latest, claude-3-opus-latest, claude-3-haiku | ANTHROPIC_API_KEY |
| Google | gemini-2.0-flash-exp, gemini-1.5-pro, gemini-1.5-flash | GOOGLE_GENERATIVE_AI_API_KEY |
| xAI | grok-beta, grok-2-1212 | XAI_API_KEY |
| Mistral | mistral-large-latest, mistral-medium-latest | MISTRAL_API_KEY |
| Groq | llama-3.1-70b-versatile, mixtral-8x7b-32768 | GROQ_API_KEY |
📊 Evaluation System
| Stage | Points | Description |
|-------|--------|-------------|
| Syntax Validation | 30 | TypeScript compiler API parsing |
| Code Execution | 40 | Safe ts-node subprocess execution |
| Output Comparison | 30 | Fuzzy matching with expected output |
| Total | 100 | Pass threshold: ≥70 |
🚀 Quick Start
Installation
# Install globally
npm install -g praisonaibench
# Or install locally
npm install praisonaibench
Set API Keys
# OpenAI (default)
export OPENAI_API_KEY=your_openai_key
# Or use other providers
export ANTHROPIC_API_KEY=your_anthropic_key
export GOOGLE_GENERATIVE_AI_API_KEY=your_google_key
export GROQ_API_KEY=your_groq_key
Run Your First Test
# Single test with OpenAI (default)
praisonaibench --test "Write TypeScript code that prints Hello World"
# With specific model
praisonaibench --test "Calculate factorial of 5" --model gpt-4o-mini
# Use Anthropic Claude
praisonaibench --test "Write a hello world" --model claude-3-5-sonnet-latest
# Use Google Gemini
praisonaibench --test "Write a hello world" --model gemini-1.5-flash
# Cross-model comparison
praisonaibench --cross-model "Write hello world" --models openai/gpt-4o,anthropic/claude-3-5-sonnet-latest
# Run test suite with report
praisonaibench --suite tests.yaml --report
# List available providers
praisonaibench --list-providers
Verify Installation
# Verify that the evaluator plugin loads
node -e "const { TypeScriptEvaluator } = require('./dist'); console.log('Plugin loaded successfully!');"
Configuration
Create a .env file (or copy from .env.example):
# OpenAI API Key for LLM-based benchmarking
OPENAI_API_KEY=your_api_key_here
# Default model
DEFAULT_MODEL=gpt-4o-mini
# Execution timeout (seconds)
TYPESCRIPT_EXECUTION_TIMEOUT=5
Basic Usage
Create a test suite file tests.yaml:
tests:
- name: "hello_world"
language: "typescript"
prompt: "Write TypeScript code that prints 'Hello World'"
expected: "Hello World"
- name: "calculate_factorial"
language: "typescript"
prompt: "Write a TypeScript function that calculates factorial of 5"
expected: "120"
Run the benchmarks:
praisonaibench --suite tests.yaml --model gpt-4o-mini
📊 Evaluation System
Scoring Breakdown
The evaluator uses a three-stage assessment system:
| Stage | Points | Description |
|-------|--------|-------------|
| Syntax Validation | 30 | TypeScript compiler API parsing, import detection |
| Code Execution | 40 | Safe ts-node subprocess execution, error capture |
| Output Comparison | 30 | Fuzzy matching with expected output |
| Total | 100 | Combined score |
Pass Threshold: 70/100 points
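As a worked illustration, the stage scores simply sum to the total, and the pass flag applies the 70-point threshold. This is a sketch, not the library's API; `totalScore` is a hypothetical helper, though the `ScoreBreakdown` fields mirror the `score_breakdown` shape shown in the feedback example further below:

```typescript
// Illustrative only: the stage scores sum to the total, and a result
// passes when it reaches the documented 70-point threshold.
interface ScoreBreakdown {
  syntax: number;       // 0-30
  execution: number;    // 0-40
  output_match: number; // 0-30
}

function totalScore(b: ScoreBreakdown): { score: number; passed: boolean } {
  const score = b.syntax + b.execution + b.output_match; // max 100
  return { score, passed: score >= 70 };
}
```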
Scoring Examples
Example 1: Perfect Score (100/100)
// Code: console.log("Hello World")
// Expected: "Hello World"
✅ Syntax: 30 points (valid TypeScript)
✅ Execution: 40 points (runs successfully)
✅ Output: 30 points (exact match)
─────────────────────────
Total: 100/100 ✅ PASSED
Example 2: Partial Score (70/100)
// Code: console.log("Hello")
// Expected: "Hello World"
✅ Syntax: 30 points (valid TypeScript)
✅ Execution: 40 points (runs successfully)
⚠️ Output: 0 points (different output)
─────────────────────────
Total: 70/100 ✅ PASSED
Example 3: Failure (30/100)
// Code: console.log(undefinedVar)
// Expected: "Hello World"
✅ Syntax: 30 points (valid syntax)
❌ Execution: 0 points (ReferenceError)
❌ Output: 0 points (didn't execute)
─────────────────────────
Total: 30/100 ❌ FAILED
📖 Usage Guide
TypeScript API
import { TypeScriptEvaluator } from 'praisonaibench';
// Create evaluator
const evaluator = new TypeScriptEvaluator(5); // 5 second timeout
// Evaluate code
const result = await evaluator.evaluate(
'console.log("Hello World")',
"hello_test",
"Write TypeScript code that prints Hello World",
"Hello World"
);
// Check results
console.log(`Score: ${result.score}/100`);
console.log(`Passed: ${result.passed}`);
// View feedback
for (const item of result.feedback) {
console.log(`${item.level}: ${item.message}`);
}
// Access details
console.log(`Output: ${result.details.output}`);
console.log(`Score breakdown:`, result.details.score_breakdown);
Test Suite Format
Simple Test
tests:
- name: "basic_math"
language: "typescript"
prompt: "Calculate 15 * 23 and print the result"
expected: "345"
Advanced Test
tests:
- name: "fibonacci"
language: "typescript"
prompt: |
Write a TypeScript function that calculates the nth Fibonacci number.
Calculate and print the 10th Fibonacci number.
expected: "55"
Test Without Expected Output
tests:
- name: "creative_code"
language: "typescript"
prompt: "Write a TypeScript class for a simple calculator"
# No expected field - evaluation based on syntax and execution only
🎨 Features
Security Features
- ✅ Subprocess Isolation - Code runs in separate process via ts-node
- ✅ Timeout Protection - Configurable execution timeout (default: 5s)
- ✅ Resource Limits - Prevents infinite loops and resource exhaustion
- ✅ Error Handling - Graceful handling of all error types
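For illustration, isolated execution with a timeout can be done with a ts-node child process. The sketch below assumes `npx ts-node` is on the PATH; `runIsolated` is a hypothetical helper, not the package's internal API:

```typescript
import { execFile } from "node:child_process";
import { writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical sketch: write candidate code to a temp file and run it in a
// separate ts-node process; the child is killed if the timeout elapses.
function runIsolated(code: string, timeoutSeconds: number): Promise<string> {
  const file = join(tmpdir(), `bench-${Date.now()}.ts`);
  writeFileSync(file, code);
  return new Promise((resolve, reject) => {
    execFile(
      "npx",
      ["ts-node", file],
      { timeout: timeoutSeconds * 1000 }, // sends SIGTERM on timeout
      (error, stdout, stderr) => {
        if (error) reject(new Error(stderr || error.message));
        else resolve(stdout.trim());
      }
    );
  });
}
```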
Code Extraction
Automatically extracts code from various formats:
// Supports typescript code blocks
```typescript
console.log("Hello")
```
// Supports ts code blocks
```ts
console.log("Hello")
```
// Supports generic code blocks
```
console.log("Hello")
```
// Supports raw code
console.log('Hello')
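A minimal sketch of such extraction logic (illustrative; `extractCode` is a hypothetical helper, not necessarily the package's exact implementation): prefer a `typescript`/`ts` fence, fall back to a generic fence, and finally treat the whole response as raw code.

```typescript
// Illustrative extraction: fenced TypeScript block → generic fence → raw text.
function extractCode(response: string): string {
  const fenced =
    response.match(/```(?:typescript|ts)\s*\n([\s\S]*?)```/) ??
    response.match(/```\s*\n?([\s\S]*?)```/);
  return (fenced ? fenced[1] : response).trim();
}
```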
Output Comparison
Smart fuzzy matching algorithm:
- Exact match: 30/30 points
- High similarity (>80%): 25-29 points
- Medium similarity (50-80%): 15-24 points
- Low similarity (<50%): 0-14 points
Features:
- Case-insensitive comparison
- Whitespace normalisation
- Substring matching (e.g., "345" in "The answer is 345")
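The exact algorithm isn't spelled out here, but a minimal sketch that reproduces these point bands might look like the following (using a plain Levenshtein ratio as the similarity measure; `normalize`, `similarity`, and `outputScore` are hypothetical helpers):

```typescript
// Illustrative only: normalize, check exact/substring match, then map a
// Levenshtein-based similarity in [0, 1] onto the documented point bands.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Simple Levenshtein-distance similarity in [0, 1].
function similarity(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0)
  );
  for (let i = 0; i <= a.length; i++) dp[i][0] = i;
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
  return 1 - dp[a.length][b.length] / Math.max(a.length, b.length, 1);
}

function outputScore(actual: string, expected: string): number {
  const a = normalize(actual);
  const e = normalize(expected);
  if (a === e || a.includes(e)) return 30;                  // exact or substring
  const sim = similarity(a, e);
  if (sim > 0.8) return Math.round(25 + (sim - 0.8) * 20);  // 25-29 points
  if (sim >= 0.5) return Math.round(15 + (sim - 0.5) * 30); // 15-24 points
  return Math.round(sim * 28);                              // 0-14 points
}
```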
Detailed Feedback
{
score: 85,
passed: true,
feedback: [
{ level: "success", message: "✅ Valid TypeScript syntax" },
{ level: "info", message: "📦 Imports: fs, path" },
{ level: "success", message: "✅ Code executed successfully" },
{ level: "info", message: "📤 Output: Hello World" },
{ level: "warning", message: "⚠️ Output partially matches expected" }
],
details: {
extracted_code: "console.log('Hello World')",
executed: true,
output: "Hello World",
similarity: 0.95,
score_breakdown: {
syntax: 30,
execution: 40,
output_match: 28
}
}
}
📚 Examples
Example 1: Hello World
tests:
- name: "hello_world"
language: "typescript"
prompt: "Write TypeScript code that prints 'Hello World'"
expected: "Hello World"
Example 2: Factorial Function
tests:
- name: "factorial"
language: "typescript"
prompt: |
Write a TypeScript function that calculates the factorial of a number.
Calculate factorial(5) and print the result.
expected: "120"
Example 3: Interface Usage
tests:
- name: "interface_test"
language: "typescript"
prompt: |
Define a Person interface with name and age properties.
Create a person and print their name.
expected: "Alice"
More examples available in:
- examples/simple_tests.yaml - Basic TypeScript tests
- examples/advanced_tests.yaml - Complex TypeScript challenges
- examples/algorithm_tests.yaml - Algorithm implementations
🧪 Testing
Run Unit Tests
# Install dependencies
npm install
# Run all tests
npm test
# Run with coverage
npm run test:coverage
Test Coverage
The plugin includes comprehensive tests:
✅ Unit Tests (tests/evaluator.test.ts)
- Code extraction
- Syntax validation
- Code execution
- Output comparison
- Error handling
- Timeout protection
✅ Integration Tests (tests/integration.test.ts)
- Plugin interface compatibility
- Multiple test scenarios
- Concurrent evaluations
- Large output handling
- Import support
🔧 Configuration
Environment Variables
# Required
OPENAI_API_KEY=your_api_key_here
# Optional
DEFAULT_MODEL=gpt-4o-mini
TYPESCRIPT_EXECUTION_TIMEOUT=5
TS_NODE_EXECUTABLE=/path/to/ts-node # Leave empty for npx
Programmatic Configuration
import { TypeScriptEvaluator } from 'praisonaibench';
// Custom timeout
const evaluatorWithTimeout = new TypeScriptEvaluator(10);
// Custom ts-node path
const evaluatorWithCustomPath = new TypeScriptEvaluator(
  5,
  "/usr/local/bin/ts-node"
);
🏗️ Architecture
Plugin Structure
praisonaibench/
├── src/
│ ├── index.ts # Plugin exports
│ ├── evaluator.ts # Main evaluator class
│ └── version.ts # Version info
├── tests/
│ ├── evaluator.test.ts # Unit tests
│ └── integration.test.ts # Integration tests
├── examples/
│ ├── simple_tests.yaml
│ ├── advanced_tests.yaml
│ └── algorithm_tests.yaml
├── package.json # Project configuration
├── tsconfig.json # TypeScript config
├── .env # Configuration
└── README.md # This file
Class Hierarchy
BaseEvaluator (interface)
└── TypeScriptEvaluator
├── getLanguage() → 'typescript'
├── getFileExtension() → 'ts'
└── evaluate(code, testName, prompt, expected) → Promise<EvaluationResult>
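Based on that hierarchy, a custom evaluator for another language might be sketched as follows. The `BaseEvaluator` and `EvaluationResult` imports and the optional `expected` parameter are assumptions inferred from the interface above, not verified exports:

```typescript
import { BaseEvaluator, EvaluationResult } from 'praisonaibench'; // assumed exports

// Hypothetical evaluator for another language, implementing the
// BaseEvaluator interface described above.
class PythonEvaluator implements BaseEvaluator {
  getLanguage(): string { return 'python'; }
  getFileExtension(): string { return 'py'; }

  async evaluate(
    code: string,
    testName: string,
    prompt: string,
    expected?: string
  ): Promise<EvaluationResult> {
    // Run the code in an isolated subprocess and score it with the
    // same syntax/execution/output rubric (omitted in this sketch).
    throw new Error('not implemented in this sketch');
  }
}
```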
🤝 Contributing
Contributions are welcome! Here's how:
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes
- Run tests: npm test
- Submit a pull request
Development Setup
# Clone repository
git clone https://github.com/MervinPraison/PraisonAIBench-TypeScript
cd praisonaibench
# Install dependencies
npm install
# Run tests
npm test
# Build
npm run build
📄 License
MIT License - see LICENSE file for details.
🔗 Links
- npm Package - Install from npm
- PraisonAI Bench - Main project
- Plugin System Documentation
- Issue Tracker
📞 Support
- Issues: GitHub Issues
- Documentation: PraisonAI Bench Docs
- Community: Join the discussion on GitHub
🎉 Acknowledgements
Built with ❤️ for the PraisonAI Bench community.
Special thanks to:
- PraisonAI - For the amazing benchmarking framework
- Contributors and testers
- The TypeScript community
Ready to benchmark TypeScript code generation? Install the plugin and start testing! 🚀
