PraisonAI Bench TypeScript
🚀 A powerful multi-provider LLM benchmarking framework
Benchmark any LLM with automatic code evaluation, TypeScript/HTML execution, and comprehensive reports. Supports OpenAI, Anthropic, Google, xAI, Mistral, and Groq via Vercel AI SDK.
🎯 Features
| Feature | Description |
|---------|-------------|
| 🤖 Multi-Provider | OpenAI, Anthropic, Google, xAI, Mistral, Groq |
| 📊 Multi-Stage Evaluation | Syntax validation, code execution, output comparison |
| 💰 Cost & Token Tracking | Automatic token usage and cost calculation |
| 📈 HTML Reports | Beautiful dashboard reports with charts |
| ⚡ Parallel Execution | Run tests concurrently |
| 🔌 Plugin System | Extensible evaluators for any language (docs) |
| 🎯 Test Suites | YAML/JSON test suite support |
| 🔄 Retry Logic | Automatic retries with exponential backoff |
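The retry logic noted above follows the usual exponential-backoff pattern. A minimal sketch of that pattern (illustrative only, not the framework's internal code; `withRetry` is a hypothetical helper):

```typescript
// Illustrative exponential backoff: wait baseMs * 2^attempt between retries.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries: rethrow
      await new Promise((r) => setTimeout(r, baseMs * 2 ** attempt));
    }
  }
}
```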
🤖 Supported Providers
| Provider | Models | Env Variable |
|----------|--------|--------------|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, o1, o1-mini | OPENAI_API_KEY |
| Anthropic | claude-3-5-sonnet-latest, claude-3-opus-latest, claude-3-haiku | ANTHROPIC_API_KEY |
| Google | gemini-2.0-flash-exp, gemini-1.5-pro, gemini-1.5-flash | GOOGLE_GENERATIVE_AI_API_KEY |
| xAI | grok-beta, grok-2-1212 | XAI_API_KEY |
| Mistral | mistral-large-latest, mistral-medium-latest | MISTRAL_API_KEY |
| Groq | llama-3.1-70b-versatile, mixtral-8x7b-32768 | GROQ_API_KEY |
📊 Evaluation System
| Stage | Points | Description |
|-------|--------|-------------|
| Syntax Validation | 30 | TypeScript compiler API parsing |
| Code Execution | 40 | Safe ts-node subprocess execution |
| Output Comparison | 30 | Fuzzy matching with expected output |
| Total | 100 | Pass threshold: ≥70 |
🚀 Quick Start
Installation
# Install globally
npm install -g praisonaibench
# Or install locally
npm install praisonaibench
Set API Keys
# OpenAI (default)
export OPENAI_API_KEY=your_openai_key
# Or use other providers
export ANTHROPIC_API_KEY=your_anthropic_key
export GOOGLE_GENERATIVE_AI_API_KEY=your_google_key
export GROQ_API_KEY=your_groq_key
Run Your First Test
# Single test with OpenAI (default)
praisonaibench --test "Write TypeScript code that prints Hello World"
# With specific model
praisonaibench --test "Calculate factorial of 5" --model gpt-4o-mini
# Use Anthropic Claude
praisonaibench --test "Write a hello world" --model claude-3-5-sonnet-latest
# Use Google Gemini
praisonaibench --test "Write a hello world" --model gemini-1.5-flash
# Cross-model comparison
praisonaibench --cross-model "Write hello world" --models openai/gpt-4o,anthropic/claude-3-5-sonnet-latest
# Run test suite with report
praisonaibench --suite tests.yaml --report
# List available providers
praisonaibench --list-providers
Verify Installation
# Verify that the evaluator plugin loads
node -e "const { TypeScriptEvaluator } = require('./dist'); console.log('Plugin loaded successfully!');"
Configuration
Create a .env file (or copy from .env.example):
# OpenAI API Key for LLM-based benchmarking
OPENAI_API_KEY=your_api_key_here
# Default model
DEFAULT_MODEL=gpt-4o-mini
# Execution timeout (seconds)
TYPESCRIPT_EXECUTION_TIMEOUT=5
Basic Usage
Create a test suite file tests.yaml:
tests:
- name: "hello_world"
language: "typescript"
prompt: "Write TypeScript code that prints 'Hello World'"
expected: "Hello World"
- name: "calculate_factorial"
language: "typescript"
prompt: "Write a TypeScript function that calculates factorial of 5"
expected: "120"
Run the benchmarks:
praisonaibench --suite tests.yaml --model gpt-4o-mini
📊 Evaluation System
Scoring Breakdown
The evaluator uses a three-stage assessment system:
| Stage | Points | Description |
|-------|--------|-------------|
| Syntax Validation | 30 | TypeScript compiler API parsing, import detection |
| Code Execution | 40 | Safe ts-node subprocess execution, error capture |
| Output Comparison | 30 | Fuzzy matching with expected output |
| Total | 100 | Combined score |
Pass Threshold: 70/100 points
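As a worked illustration, the stage scores simply sum to the total, and the pass flag applies the 70-point threshold. This is a sketch, not the library's API; `totalScore` is a hypothetical helper, though the `ScoreBreakdown` fields mirror the `score_breakdown` shape shown in the feedback example further below:

```typescript
// Illustrative only: the stage scores sum to the total, and a result
// passes when it reaches the documented 70-point threshold.
interface ScoreBreakdown {
  syntax: number;       // 0-30
  execution: number;    // 0-40
  output_match: number; // 0-30
}

function totalScore(b: ScoreBreakdown): { score: number; passed: boolean } {
  const score = b.syntax + b.execution + b.output_match; // max 100
  return { score, passed: score >= 70 };
}
```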
Scoring Examples
Example 1: Perfect Score (100/100)
// Code: console.log("Hello World")
// Expected: "Hello World"
✅ Syntax: 30 points (valid TypeScript)
✅ Execution: 40 points (runs successfully)
✅ Output: 30 points (exact match)
─────────────────────────
Total: 100/100 ✅ PASSED
Example 2: Partial Score (70/100)
// Code: console.log("Hello")
// Expected: "Hello World"
✅ Syntax: 30 points (valid TypeScript)
✅ Execution: 40 points (runs successfully)
⚠️ Output: 0 points (different output)
─────────────────────────
Total: 70/100 ✅ PASSED
Example 3: Failure (30/100)
// Code: console.log(undefinedVar)
// Expected: "Hello World"
✅ Syntax: 30 points (valid syntax)
❌ Execution: 0 points (ReferenceError)
❌ Output: 0 points (didn't execute)
─────────────────────────
Total: 30/100 ❌ FAILED
📖 Usage Guide
TypeScript API
import { TypeScriptEvaluator } from 'praisonaibench';
// Create evaluator
const evaluator = new TypeScriptEvaluator(5); // 5 second timeout
// Evaluate code
const result = await evaluator.evaluate(
'console.log("Hello World")',
"hello_test",
"Write TypeScript code that prints Hello World",
"Hello World"
);
// Check results
console.log(`Score: ${result.score}/100`);
console.log(`Passed: ${result.passed}`);
// View feedback
for (const item of result.feedback) {
console.log(`${item.level}: ${item.message}`);
}
// Access details
console.log(`Output: ${result.details.output}`);
console.log(`Score breakdown:`, result.details.score_breakdown);
Test Suite Format
Simple Test
tests:
- name: "basic_math"
language: "typescript"
prompt: "Calculate 15 * 23 and print the result"
expected: "345"
Advanced Test
tests:
- name: "fibonacci"
language: "typescript"
prompt: |
Write a TypeScript function that calculates the nth Fibonacci number.
Calculate and print the 10th Fibonacci number.
expected: "55"
Test Without Expected Output
tests:
- name: "creative_code"
language: "typescript"
prompt: "Write a TypeScript class for a simple calculator"
# No expected field - evaluation based on syntax and execution only
🎨 Features
Security Features
- ✅ Subprocess Isolation - Code runs in separate process via ts-node
- ✅ Timeout Protection - Configurable execution timeout (default: 5s)
- ✅ Resource Limits - Prevents infinite loops and resource exhaustion
- ✅ Error Handling - Graceful handling of all error types
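For illustration, isolated execution with a timeout can be done with a ts-node child process. The sketch below assumes `npx ts-node` is on the PATH; `runIsolated` is a hypothetical helper, not the package's internal API:

```typescript
import { execFile } from "node:child_process";
import { writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical sketch: write candidate code to a temp file and run it in a
// separate ts-node process; the child is killed if the timeout elapses.
function runIsolated(code: string, timeoutSeconds: number): Promise<string> {
  const file = join(tmpdir(), `bench-${Date.now()}.ts`);
  writeFileSync(file, code);
  return new Promise((resolve, reject) => {
    execFile(
      "npx",
      ["ts-node", file],
      { timeout: timeoutSeconds * 1000 }, // sends SIGTERM on timeout
      (error, stdout, stderr) => {
        if (error) reject(new Error(stderr || error.message));
        else resolve(stdout.trim());
      }
    );
  });
}
```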
Code Extraction
Automatically extracts code from various formats:
// Supports typescript code blocks
```typescript
console.log("Hello")
```
// Supports ts code blocks
```ts
console.log("Hello")
```
// Supports generic code blocks
```
console.log("Hello")
```
// Supports raw code
console.log('Hello')
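A minimal sketch of such extraction logic (illustrative; `extractCode` is a hypothetical helper, not necessarily the package's exact implementation): prefer a `typescript`/`ts` fence, fall back to a generic fence, and finally treat the whole response as raw code.

```typescript
// Illustrative extraction: fenced TypeScript block → generic fence → raw text.
function extractCode(response: string): string {
  const fenced =
    response.match(/```(?:typescript|ts)\s*\n([\s\S]*?)```/) ??
    response.match(/```\s*\n?([\s\S]*?)```/);
  return (fenced ? fenced[1] : response).trim();
}
```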
Output Comparison
Smart fuzzy matching algorithm:
- Exact match: 30/30 points
- High similarity (>80%): 25-29 points
- Medium similarity (50-80%): 15-24 points
- Low similarity (<50%): 0-14 points
Features:
- Case-insensitive comparison
- Whitespace normalisation
- Substring matching (e.g., "345" in "The answer is 345")
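The exact algorithm isn't spelled out here, but a minimal sketch that reproduces these point bands might look like the following (using a plain Levenshtein ratio as the similarity measure; `normalize`, `similarity`, and `outputScore` are hypothetical helpers):

```typescript
// Illustrative only: normalize, check exact/substring match, then map a
// Levenshtein-based similarity in [0, 1] onto the documented point bands.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Simple Levenshtein-distance similarity in [0, 1].
function similarity(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0)
  );
  for (let i = 0; i <= a.length; i++) dp[i][0] = i;
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
  return 1 - dp[a.length][b.length] / Math.max(a.length, b.length, 1);
}

function outputScore(actual: string, expected: string): number {
  const a = normalize(actual);
  const e = normalize(expected);
  if (a === e || a.includes(e)) return 30;                  // exact or substring
  const sim = similarity(a, e);
  if (sim > 0.8) return Math.round(25 + (sim - 0.8) * 20);  // 25-29 points
  if (sim >= 0.5) return Math.round(15 + (sim - 0.5) * 30); // 15-24 points
  return Math.round(sim * 28);                              // 0-14 points
}
```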
Detailed Feedback
{
score: 85,
passed: true,
feedback: [
{ level: "success", message: "✅ Valid TypeScript syntax" },
{ level: "info", message: "📦 Imports: fs, path" },
{ level: "success", message: "✅ Code executed successfully" },
{ level: "info", message: "📤 Output: Hello World" },
{ level: "warning", message: "⚠️ Output partially matches expected" }
],
details: {
extracted_code: "console.log('Hello World')",
executed: true,
output: "Hello World",
similarity: 0.95,
score_breakdown: {
syntax: 30,
execution: 40,
output_match: 28
}
}
}
📚 Examples
Example 1: Hello World
tests:
- name: "hello_world"
language: "typescript"
prompt: "Write TypeScript code that prints 'Hello World'"
expected: "Hello World"
Example 2: Factorial Function
tests:
- name: "factorial"
language: "typescript"
prompt: |
Write a TypeScript function that calculates the factorial of a number.
Calculate factorial(5) and print the result.
expected: "120"
Example 3: Interface Usage
tests:
- name: "interface_test"
language: "typescript"
prompt: |
Define a Person interface with name and age properties.
Create a person and print their name.
expected: "Alice"
More examples available in:
- examples/simple_tests.yaml - Basic TypeScript tests
- examples/advanced_tests.yaml - Complex TypeScript challenges
- examples/algorithm_tests.yaml - Algorithm implementations
🧪 Testing
Run Unit Tests
# Install dependencies
npm install
# Run all tests
npm test
# Run with coverage
npm run test:coverage
Test Coverage
The plugin includes comprehensive tests:
✅ Unit Tests (tests/evaluator.test.ts)
- Code extraction
- Syntax validation
- Code execution
- Output comparison
- Error handling
- Timeout protection
✅ Integration Tests (tests/integration.test.ts)
- Plugin interface compatibility
- Multiple test scenarios
- Concurrent evaluations
- Large output handling
- Import support
🔧 Configuration
Environment Variables
# Required
OPENAI_API_KEY=your_api_key_here
# Optional
DEFAULT_MODEL=gpt-4o-mini
TYPESCRIPT_EXECUTION_TIMEOUT=5
TS_NODE_EXECUTABLE=/path/to/ts-node # Leave empty for npx
Programmatic Configuration
import { TypeScriptEvaluator } from 'praisonaibench';
// Custom timeout
const evaluatorWithTimeout = new TypeScriptEvaluator(10);
// Custom ts-node path
const evaluatorWithCustomPath = new TypeScriptEvaluator(
  5,
  "/usr/local/bin/ts-node"
);
🏗️ Architecture
Plugin Structure
praisonaibench/
├── src/
│ ├── index.ts # Plugin exports
│ ├── evaluator.ts # Main evaluator class
│ └── version.ts # Version info
├── tests/
│ ├── evaluator.test.ts # Unit tests
│ └── integration.test.ts # Integration tests
├── examples/
│ ├── simple_tests.yaml
│ ├── advanced_tests.yaml
│ └── algorithm_tests.yaml
├── package.json # Project configuration
├── tsconfig.json # TypeScript config
├── .env # Configuration
└── README.md # This file
Class Hierarchy
BaseEvaluator (interface)
└── TypeScriptEvaluator
├── getLanguage() → 'typescript'
├── getFileExtension() → 'ts'
└── evaluate(code, testName, prompt, expected) → Promise<EvaluationResult>
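Based on that hierarchy, a custom evaluator for another language might be sketched as follows. The `BaseEvaluator` and `EvaluationResult` imports and the optional `expected` parameter are assumptions inferred from the interface above, not verified exports:

```typescript
import { BaseEvaluator, EvaluationResult } from 'praisonaibench'; // assumed exports

// Hypothetical evaluator for another language, implementing the
// BaseEvaluator interface described above.
class PythonEvaluator implements BaseEvaluator {
  getLanguage(): string { return 'python'; }
  getFileExtension(): string { return 'py'; }

  async evaluate(
    code: string,
    testName: string,
    prompt: string,
    expected?: string
  ): Promise<EvaluationResult> {
    // Run the code in an isolated subprocess and score it with the
    // same syntax/execution/output rubric (omitted in this sketch).
    throw new Error('not implemented in this sketch');
  }
}
```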
🤝 Contributing
Contributions are welcome! Here's how:
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes
- Run tests: npm test
- Submit a pull request
Development Setup
# Clone repository
git clone https://github.com/MervinPraison/PraisonAIBench-TypeScript
cd praisonaibench
# Install dependencies
npm install
# Run tests
npm test
# Build
npm run build
📄 License
MIT License - see LICENSE file for details.
🔗 Links
- npm Package - Install from npm
- PraisonAI Bench - Main project
- Plugin System Documentation
- Issue Tracker
📞 Support
- Issues: GitHub Issues
- Documentation: PraisonAI Bench Docs
- Community: Join the discussion on GitHub
🎉 Acknowledgements
Built with ❤️ for the PraisonAI Bench community.
Special thanks to:
- PraisonAI - For the amazing benchmarking framework
- Contributors and testers
- The TypeScript community
Ready to benchmark TypeScript code generation? Install the plugin and start testing! 🚀
