eva
A powerful TypeScript evaluation framework for running concurrent evaluations with progress tracking and result persistence. Perfect for testing AI models, APIs, data processing pipelines, and any system that needs systematic evaluation against expected results.
✨ Features
- 🚀 Concurrent Processing - Configurable concurrency limits for optimal performance
- 📊 Progress Tracking - Visual progress bars with real-time updates
- 📁 Result Persistence - JSONL output for detailed analysis
- 🔍 Flexible Scoring - Support for string matching, JSON comparison, numerical analysis
- 🎯 Type Safety - Full TypeScript support with generic types
- ⚡ Async Support - Handle async data providers, tasks, and scorers
- 🛡️ Error Handling - Robust error handling in concurrent execution
📦 Installation
npm install eva-ts

🚀 Quick Start

import { Eval } from "eva-ts";
import type { DataItem, BaseScore } from "eva-ts";

// Define your evaluation types
interface MyInput {
  question: string;
}

interface MyExpected {
  answer: string;
}

interface MyOutput {
  response: string;
}

interface MyScore extends BaseScore {
  name: string;
  value: number;
}

// Create and run evaluation
const evaluation = new Eval<MyInput, MyExpected, MyOutput, MyScore>({
  // Provide test data
  dataProvider: () => [
    {
      input: { question: "What is 2+2?" },
      expected: { answer: "4" },
    },
    {
      input: { question: "What is the capital of France?" },
      expected: { answer: "Paris" },
    },
  ],

  // Define the task to evaluate
  taskFn: async ({ data }) => {
    // Your system under test (e.g., API call, model inference)
    const response = await myAIModel.generate(data.input.question);
    return { response };
  },

  // Define scoring functions
  scorers: [
    ({ output, data }) => ({
      name: "exact-match",
      value:
        output.response.toLowerCase() === data.expected?.answer.toLowerCase()
          ? 1
          : 0,
    }),
    ({ output, data }) => ({
      name: "contains-answer",
      value: output.response
        .toLowerCase()
        .includes(data.expected?.answer.toLowerCase() || "")
        ? 1
        : 0,
    }),
  ],

  // Configuration
  config: {
    name: "ai-model-evaluation",
    maxConcurrency: 3,
    outputDir: "./results",
  },
});

// Run the evaluation
const results = await evaluation.evaluate();
console.log(`Completed ${results.scores.length} evaluations`);

📚 Core Concepts
DataItem
Represents a single evaluation case:
interface DataItem<Input, Expected> {
  input: Input; // The input to your system
  expected?: Expected; // Expected output (optional)
  metadata?: Record<string, unknown>; // Additional context
}
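For example (reusing the MyInput and MyExpected types from the Quick Start; the difficulty field is hypothetical metadata, not required by eva-ts):

const item: DataItem<MyInput, MyExpected> = {
  input: { question: "What is 2+2?" },
  expected: { answer: "4" },
  metadata: { difficulty: "easy" }, // arbitrary context, surfaced to scorers as data.metadata
};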
Scorers

Functions that evaluate output quality:
type Scorer<Input, Expected, Output, Score> = ({
  output,
  data,
}: {
  output: Output;
  data: DataItem<Input, Expected>;
}) => Score | Promise<Score>;
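Since a scorer may return Promise<Score>, asynchronous scoring works out of the box. A minimal sketch, where judgeRelevance is a hypothetical async helper (e.g., an LLM-as-judge call) returning a value in [0, 1]:

const relevanceScorer = async ({
  output,
  data,
}: {
  output: MyOutput;
  data: DataItem<MyInput, MyExpected>;
}): Promise<MyScore> => {
  // judgeRelevance is an assumed external helper, not part of eva-ts
  const relevance = await judgeRelevance(data.input.question, output.response);
  return { name: "relevance", value: relevance };
};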
Configuration

interface EvalConfig {
  name: string; // Evaluation name
  maxConcurrency: number; // Concurrent task limit
  outputDir?: string; // Optional JSONL output directory
  projectName?: string; // Optional database project name
  evalDescription?: string; // Optional evaluation description
}

🗄️ Database Setup (Optional)
Eva supports optional database persistence for evaluation runs and results. We recommend using Supabase for easy setup and management.
Using Supabase (Recommended)
Create a Supabase project:
- Go to supabase.com and create a new project
- Note your project URL and password from the project settings
Set environment variable:
export DATABASE_URL="postgresql://postgres:[PASSWORD]@[HOST]:6543/postgres?pgbouncer=true&connection_limit=1"

Replace [PASSWORD] and [HOST] with your Supabase credentials. Use port 6543 for connection pooling (recommended) or 5432 for a direct connection.

Configure your evaluation:
const evaluation = new Eval({
  config: {
    name: "my-evaluation",
    maxConcurrency: 3,
    projectName: "my-project", // Required for database storage
    evalDescription: "Testing my AI model", // Optional description
  },
  // ... rest of configuration
});
Database Schema
When database config is provided, Eva automatically creates a hierarchical structure:
- Projects → Evaluation Names → Evaluation Runs → Results
- Stores input, expected output, actual output, scores, and metadata in JSONB format
- Supports querying, filtering, and statistical analysis of evaluation data
Alternative Database Setup
Eva uses Drizzle ORM and supports any PostgreSQL database. For other providers, simply set the DATABASE_URL environment variable to your PostgreSQL connection string.
🎯 Scoring Examples
String Matching
// Exact string match
({ output, data }) => ({
  name: 'exact-match',
  value: output.text === data.expected?.text ? 1 : 0
})

// Fuzzy string matching
({ output, data }) => ({
  name: 'similarity',
  value: calculateStringSimilarity(output.text, data.expected?.text || '')
})
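calculateStringSimilarity is not bundled with eva-ts; one possible implementation, sketched here, normalizes Levenshtein edit distance into a [0, 1] similarity score:

function calculateStringSimilarity(a: string, b: string): number {
  // Dynamic-programming Levenshtein distance
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - dp[a.length][b.length] / maxLen;
}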
JSON Comparison

// Deep JSON equality
({ output, data }) => ({
  name: 'json-match',
  value: JSON.stringify(output.data) === JSON.stringify(data.expected?.data) ? 1 : 0
})

// Field-specific validation
({ output, data }) => ({
  name: 'has-required-fields',
  value: output.data.id && output.data.name ? 1 : 0
})
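Note that JSON.stringify equality only passes when key order is identical. If that is too strict, a recursive deep-equality helper is one alternative (a sketch, not part of eva-ts):

function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aKeys = Object.keys(a);
  const bKeys = Object.keys(b);
  if (aKeys.length !== bKeys.length) return false;
  // Recurse into each key, ignoring declaration order
  return aKeys.every((k) =>
    deepEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k])
  );
}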
Numerical Analysis

// Absolute error
({ output, data }) => ({
  name: 'absolute-error',
  value: Math.abs(output.value - (data.expected?.value || 0))
})

// Relative error
({ output, data }) => ({
  name: 'relative-error',
  value: data.expected?.value
    ? Math.abs(output.value - data.expected.value) / Math.abs(data.expected.value)
    : 0
})

🔧 Advanced Usage
Async Data Provider
const evaluation = new Eval({
  dataProvider: async () => {
    const response = await fetch("/api/test-cases");
    return await response.json();
  },
  // ... rest of configuration
});

Custom Metadata Scoring
const scorers = [
  // Score based on input context
  ({ output, data }) => ({
    name: "difficulty-adjusted",
    value:
      data.metadata?.difficulty === "hard"
        ? output.score * 2 // Double points for hard questions
        : output.score,
  }),
];

Error Handling
const taskFn = async ({ data }) => {
  try {
    const result = await riskyApiCall(data.input);
    return { ...result, success: true };
  } catch (error) {
    // Record the failure so scorers can see it instead of crashing the run
    return { error: error instanceof Error ? error.message : String(error), success: false };
  }
};

const scorers = [
  ({ output }) => ({
    name: "success-rate",
    value: output.success ? 1 : 0,
  }),
];

📊 Output Format
When outputDir is specified, Eva generates JSONL files with detailed results:
{"scores":[{"name":"exact-match","value":1}],"index":0,"input":{"question":"What is 2+2?"},"expected":{"answer":"4"},"metadata":{},"output":{"response":"4"}}
{"scores":[{"name":"exact-match","value":0}],"index":1,"input":{"question":"Capital of France?"},"expected":{"answer":"Paris"},"metadata":{},"output":{"response":"The capital is Paris"}}
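Because every line is standalone JSON, results are easy to post-process. A sketch in Node.js that averages the exact-match score (the file name is hypothetical; actual names depend on your configuration):

import { readFileSync } from "node:fs";

// Read one evaluation's JSONL output and parse each line
const results = readFileSync("./results/ai-model-evaluation.jsonl", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

// Average a single named score across all cases
const values = results.map(
  (r) => r.scores.find((s: { name: string }) => s.name === "exact-match")?.value ?? 0
);
console.log("exact-match mean:", values.reduce((a, b) => a + b, 0) / values.length);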
🏗️ Architecture

Eva is built with performance and flexibility in mind:
- Concurrent Execution: Uses p-limit for controlled concurrency (see the sketch after this list)
- Progress Tracking: Real-time progress bars via cli-progress
- Type Safety: Full TypeScript generics support
- Memory Efficient: Streams results to disk for large evaluations
- Error Resilient: Continues evaluation even if individual tasks fail
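The concurrency pattern is the standard p-limit idiom, roughly as sketched below (illustrative, not Eva's exact internals; items and taskFn stand in for your data and task function):

import pLimit from "p-limit";

const limit = pLimit(3); // mirrors config.maxConcurrency

// At most 3 tasks run at once; Promise.all still returns results in input order
const outputs = await Promise.all(
  items.map((item) => limit(() => taskFn({ data: item })))
);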
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Run tests (npm test)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built with TypeScript for type safety
- Uses cli-progress for beautiful progress bars
- Powered by p-limit for concurrency control
Made with ❤️ by Lilac Labs
