# @peacethinking/llm-eval
Benchmark any LLM on conflict resolution tasks. The MMLU for PeaceTech.
A deterministic evaluation framework that scores LLM responses on structured conflict resolution tasks. No AI is used in scoring — all evaluation is pure schema validation + heuristics.
## Install

```bash
npm install @peacethinking/llm-eval
```

## CLI Usage
```bash
# Benchmark a single model
npx peaceeval --provider openai --model gpt-4o

# Compare two models
npx peaceeval --compare openai:gpt-4o anthropic:claude-sonnet-4-5-20250929

# Filter by category
npx peaceeval --provider openai --model gpt-4o --category workplace

# Filter by difficulty
npx peaceeval --provider openai --model gpt-4o --difficulty hard

# Write results to a JSON file
npx peaceeval --provider openai --model gpt-4o --output results.json
```

## Programmatic Usage
```ts
import { benchmark, compare, createOpenAIProvider, createAnthropicProvider } from '@peacethinking/llm-eval';

// Benchmark a model
const results = await benchmark(
  createOpenAIProvider({ model: 'gpt-4o' }),
  { category: 'workplace', difficulty: 'medium' }
);
console.log(results.summary);
// { avgScore: 78.5, schemaCompliance: 95, fairness: 72, ... }

// Compare two models
const comparison = await compare([
  createOpenAIProvider({ model: 'gpt-4o' }),
  createAnthropicProvider({ model: 'claude-sonnet-4-5-20250929' }),
]);
console.log(comparison.leaderboard);
```

## Scoring Dimensions
| Dimension | Weight | What it measures |
|---|---|---|
| Schema Compliance | 25% | Valid JSON that matches the Zod `VerdictSchema` |
| Score Plausibility | 20% | Scores sum to 100, dimensions show variation, confidence is reasonable |
| Reframing Quality | 20% | Substantive what-if scenarios, perspective shifts, bridge message |
| Fairness | 20% | Balanced verdicts for ambiguous cases, outcome matches scores |
| Language Compliance | 15% | Response is in the requested language |
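As a sketch of how these weights could combine into one overall score: the field names and the linear combination below are assumptions for illustration, not the package's internal scoring API.

```ts
// Hypothetical sketch: combining the five dimension scores (0-100 each)
// into an overall score using the weights from the table above.
// Field names are illustrative assumptions.
interface DimensionScores {
  schemaCompliance: number;
  scorePlausibility: number;
  reframingQuality: number;
  fairness: number;
  languageCompliance: number;
}

const WEIGHTS: Record<keyof DimensionScores, number> = {
  schemaCompliance: 0.25,
  scorePlausibility: 0.2,
  reframingQuality: 0.2,
  fairness: 0.2,
  languageCompliance: 0.15,
};

function overallScore(scores: DimensionScores): number {
  // Weighted sum; the weights add up to 1, so the result stays on the 0-100 scale.
  return (Object.keys(WEIGHTS) as (keyof DimensionScores)[])
    .reduce((sum, key) => sum + scores[key] * WEIGHTS[key], 0);
}
```

For example, a response with perfect schema and language compliance but middling fairness would land in the low 80s rather than at 100, because fairness carries a 20% weight.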
## Test Scenarios
20 anonymized scenarios across:
- 5 categories: workplace, family, neighbor, rental, partnership
- 3 difficulties: easy, medium, hard
- 3 languages: English, German, Turkish
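To make the taxonomy concrete, here is an illustrative sketch of a scenario record and the kind of filtering the `--category` and `--difficulty` flags imply. The type and field names are assumptions; they are not the package's actual exports.

```ts
// Illustrative only: assumed shape of a test scenario, mirroring the
// category/difficulty/language taxonomy listed above.
type Category = 'workplace' | 'family' | 'neighbor' | 'rental' | 'partnership';
type Difficulty = 'easy' | 'medium' | 'hard';
type Language = 'en' | 'de' | 'tr';

interface Scenario {
  id: string;
  category: Category;
  difficulty: Difficulty;
  language: Language;
}

// Narrow the suite to one slice, the way the CLI flags do.
// Omitted filters match everything.
function filterScenarios(
  all: Scenario[],
  f: Partial<Pick<Scenario, 'category' | 'difficulty'>>
): Scenario[] {
  return all.filter(s =>
    (f.category === undefined || s.category === f.category) &&
    (f.difficulty === undefined || s.difficulty === f.difficulty)
  );
}
```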
## Providers
Built-in adapters for:
| Provider | Factory | Env variable |
|---|---|---|
| OpenAI | `createOpenAIProvider()` | `OPENAI_API_KEY` |
| Anthropic | `createAnthropicProvider()` | `ANTHROPIC_API_KEY` |
| Google | `createGoogleProvider()` | `GOOGLE_API_KEY` |
| Ollama | `createOllamaProvider()` | (local, no key) |
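As a sketch of how the env-variable column could drive a preflight check before a benchmark run: the helper below mirrors the table, but it is a hypothetical illustration, not part of the package.

```ts
// Hypothetical helper: which environment variable each built-in provider
// needs, per the table above. Ollama runs locally and needs no key.
const PROVIDER_ENV: Record<string, string | null> = {
  openai: 'OPENAI_API_KEY',
  anthropic: 'ANTHROPIC_API_KEY',
  google: 'GOOGLE_API_KEY',
  ollama: null,
};

function requiredKey(provider: string): string | null {
  if (!(provider in PROVIDER_ENV)) {
    throw new Error(`unknown provider: ${provider}`);
  }
  return PROVIDER_ENV[provider];
}
```

A CLI wrapper could call `requiredKey(provider)` at startup and fail fast with a clear message when the key is missing from `process.env`, instead of surfacing an opaque HTTP 401 mid-run.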
### Custom Provider
```ts
import type { LLMProvider } from '@peacethinking/llm-eval';

const myProvider: LLMProvider = {
  name: 'my-provider',
  async complete(messages, options) {
    // Your implementation
    return { text: '...', latencyMs: 123 };
  },
};
```

## License

MIT
