# @peacethinking/llm-eval
Benchmark any LLM on conflict resolution tasks. The MMLU for PeaceTech.
A deterministic evaluation framework that scores LLM responses on structured conflict resolution tasks. No AI is used in scoring — all evaluation is pure schema validation + heuristics.
## Install

```bash
npm install @peacethinking/llm-eval
```

## CLI Usage
```bash
# Benchmark a single model
npx peaceeval --provider openai --model gpt-4o

# Compare two models
npx peaceeval --compare openai:gpt-4o anthropic:claude-sonnet-4-5-20250929

# Filter by category
npx peaceeval --provider openai --model gpt-4o --category workplace

# Filter by difficulty
npx peaceeval --provider openai --model gpt-4o --difficulty hard

# Write results to a JSON file
npx peaceeval --provider openai --model gpt-4o --output results.json
```

## Programmatic Usage
```ts
import { benchmark, compare, createOpenAIProvider, createAnthropicProvider } from '@peacethinking/llm-eval';

// Benchmark a model
const results = await benchmark(
  createOpenAIProvider({ model: 'gpt-4o' }),
  { category: 'workplace', difficulty: 'medium' }
);
console.log(results.summary);
// { avgScore: 78.5, schemaCompliance: 95, fairness: 72, ... }

// Compare two models
const comparison = await compare([
  createOpenAIProvider({ model: 'gpt-4o' }),
  createAnthropicProvider({ model: 'claude-sonnet-4-5-20250929' }),
]);
console.log(comparison.leaderboard);
```

## Scoring Dimensions
| Dimension | Weight | What it measures |
|---|---|---|
| Schema Compliance | 25% | Valid JSON that matches the Zod `VerdictSchema` |
| Score Plausibility | 20% | Scores sum to 100, dimensions show variation, confidence is reasonable |
| Reframing Quality | 20% | Substantive what-if scenarios, perspective shifts, bridge message |
| Fairness | 20% | Balanced verdicts for ambiguous cases, outcome matches scores |
| Language Compliance | 15% | Response is in the requested language |
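As a sketch of how these weights could combine into one overall score: the field names and the linear combination below are assumptions for illustration, not the package's internal scoring API.

```ts
// Hypothetical sketch: combining the five dimension scores (0-100 each)
// into an overall score using the weights from the table above.
// Field names are illustrative assumptions.
interface DimensionScores {
  schemaCompliance: number;
  scorePlausibility: number;
  reframingQuality: number;
  fairness: number;
  languageCompliance: number;
}

const WEIGHTS: Record<keyof DimensionScores, number> = {
  schemaCompliance: 0.25,
  scorePlausibility: 0.2,
  reframingQuality: 0.2,
  fairness: 0.2,
  languageCompliance: 0.15,
};

function overallScore(scores: DimensionScores): number {
  // Weighted sum; the weights add up to 1, so the result stays on the 0-100 scale.
  return (Object.keys(WEIGHTS) as (keyof DimensionScores)[])
    .reduce((sum, key) => sum + scores[key] * WEIGHTS[key], 0);
}
```

For example, a response with perfect schema and language compliance but middling fairness would land in the low 80s rather than at 100, because fairness carries a 20% weight.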
## Test Scenarios
20 anonymized scenarios across:
- 5 categories: workplace, family, neighbor, rental, partnership
- 3 difficulties: easy, medium, hard
- 3 languages: English, German, Turkish
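To make the taxonomy concrete, here is an illustrative sketch of a scenario record and the kind of filtering the `--category` and `--difficulty` flags imply. The type and field names are assumptions; they are not the package's actual exports.

```ts
// Illustrative only: assumed shape of a test scenario, mirroring the
// category/difficulty/language taxonomy listed above.
type Category = 'workplace' | 'family' | 'neighbor' | 'rental' | 'partnership';
type Difficulty = 'easy' | 'medium' | 'hard';
type Language = 'en' | 'de' | 'tr';

interface Scenario {
  id: string;
  category: Category;
  difficulty: Difficulty;
  language: Language;
}

// Narrow the suite to one slice, the way the CLI flags do.
// Omitted filters match everything.
function filterScenarios(
  all: Scenario[],
  f: Partial<Pick<Scenario, 'category' | 'difficulty'>>
): Scenario[] {
  return all.filter(s =>
    (f.category === undefined || s.category === f.category) &&
    (f.difficulty === undefined || s.difficulty === f.difficulty)
  );
}
```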
## Providers
Built-in adapters for:
| Provider | Factory | Env variable |
|---|---|---|
| OpenAI | `createOpenAIProvider()` | `OPENAI_API_KEY` |
| Anthropic | `createAnthropicProvider()` | `ANTHROPIC_API_KEY` |
| Google | `createGoogleProvider()` | `GOOGLE_API_KEY` |
| Ollama | `createOllamaProvider()` | (local, no key) |
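As a sketch of how the env-variable column could drive a preflight check before a benchmark run: the helper below mirrors the table, but it is a hypothetical illustration, not part of the package.

```ts
// Hypothetical helper: which environment variable each built-in provider
// needs, per the table above. Ollama runs locally and needs no key.
const PROVIDER_ENV: Record<string, string | null> = {
  openai: 'OPENAI_API_KEY',
  anthropic: 'ANTHROPIC_API_KEY',
  google: 'GOOGLE_API_KEY',
  ollama: null,
};

function requiredKey(provider: string): string | null {
  if (!(provider in PROVIDER_ENV)) {
    throw new Error(`unknown provider: ${provider}`);
  }
  return PROVIDER_ENV[provider];
}
```

A CLI wrapper could call `requiredKey(provider)` at startup and fail fast with a clear message when the key is missing from `process.env`, instead of surfacing an opaque HTTP 401 mid-run.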
### Custom Provider
```ts
import type { LLMProvider } from '@peacethinking/llm-eval';

const myProvider: LLMProvider = {
  name: 'my-provider',
  async complete(messages, options) {
    // Your implementation
    return { text: '...', latencyMs: 123 };
  },
};
```

## License

MIT
