@var-ia/eval
v0.1.1
Evaluation harness for ground truth validation and benchmark pages
Generic evaluation harness for L2 model quality: benchmarks, calibration, and L3 ground truth validation.
Exports
Harness
- createEvalHarness(): returns an EvalHarness with evaluate(), benchmarkPages(), and computeScores()
- EvalHarness: interface for running test cases against evidence events
- EvalTestCase: a single benchmark case (page, revision range, expected events)
- EvalResult: per-test result with precision, matches, misses, false positives
- EvalScoreSummary: aggregate scores across all tests
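
As a rough illustration of the per-test fields listed above (precision, matches, misses, false positives), the sketch below scores one case by comparing expected and observed event identifiers. It does not call @var-ia/eval: `SketchTestCase`, `SketchResult`, and `scoreTestCase` are hypothetical names, events are reduced to plain strings, and the package's real types and matching rules may differ.

```typescript
// Hypothetical shapes for illustration only. The real EvalTestCase and
// EvalResult types exported by @var-ia/eval may look different.
interface SketchTestCase {
  page: string;
  expectedEvents: string[];
}

interface SketchResult {
  page: string;
  matches: string[];
  misses: string[];
  falsePositives: string[];
  precision: number;
}

// Compare observed event identifiers with the expected ones for a single case.
function scoreTestCase(testCase: SketchTestCase, observedEvents: string[]): SketchResult {
  const expected = new Set(testCase.expectedEvents);
  const observed = new Set(observedEvents);

  // Observed events that were expected.
  const matches = Array.from(observed).filter((e) => expected.has(e));
  // Observed events that were not expected.
  const falsePositives = Array.from(observed).filter((e) => !expected.has(e));
  // Expected events that were never observed.
  const misses = Array.from(expected).filter((e) => !observed.has(e));

  // Precision = matches / everything observed. Defined as 0 when nothing was observed.
  const precision = observed.size === 0 ? 0 : matches.length / observed.size;

  return { page: testCase.page, matches, misses, falsePositives, precision };
}

const sketchResult = scoreTestCase(
  { page: "Example", expectedEvents: ["a", "b", "c"] },
  ["a", "b", "d"],
);
console.log(sketchResult.precision.toFixed(2), sketchResult.misses, sketchResult.falsePositives);
```

An aggregate summary in the spirit of EvalScoreSummary could then average `precision` across all cases, though how the package weights individual tests is not stated here.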
L2 Benchmark
- runL2Benchmark(): run L2 interpretation benchmark across a synthetic dataset
- buildL2Dataset(): construct a benchmark dataset of test cases
- printBenchmarkResult(): format benchmark results for display
Calibration
- computeCalibration(): compute calibration scores for model interpretations against expected labels
- ExpectedInterpretation: expected interpretation for a calibration case
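
This README does not say which calibration metric computeCalibration() uses. To show what a calibration score can mean in practice, the sketch below computes expected calibration error (ECE), one common choice, over confidence bins. The `SketchPrediction` type and `expectedCalibrationError` function are hypothetical and are not part of the package.

```typescript
// Hypothetical input shape: a model confidence in [0, 1] and whether the
// interpretation matched the expected label. Not the package's own type.
interface SketchPrediction {
  confidence: number;
  correct: boolean;
}

// Expected calibration error: the weighted average gap between mean confidence
// and observed accuracy, computed per equal-width confidence bin.
function expectedCalibrationError(predictions: SketchPrediction[], binCount = 10): number {
  if (predictions.length === 0) return 0;

  const bins = Array.from({ length: binCount }, () => ({
    confidenceSum: 0,
    correctCount: 0,
    total: 0,
  }));

  for (const p of predictions) {
    // Clamp the index so that confidence 1.0 falls in the last bin.
    const raw = Math.floor(p.confidence * binCount);
    const index = Math.max(0, Math.min(binCount - 1, raw));
    bins[index].confidenceSum += p.confidence;
    bins[index].correctCount += p.correct ? 1 : 0;
    bins[index].total += 1;
  }

  let ece = 0;
  for (const bin of bins) {
    if (bin.total === 0) continue; // Empty bins contribute nothing.
    const meanConfidence = bin.confidenceSum / bin.total;
    const accuracy = bin.correctCount / bin.total;
    ece += (bin.total / predictions.length) * Math.abs(meanConfidence - accuracy);
  }
  return ece;
}

const sketchEce = expectedCalibrationError([
  { confidence: 0.9, correct: true },
  { confidence: 0.9, correct: false },
]);
console.log(sketchEce.toFixed(2));
```

A score of 0 means stated confidence matches observed accuracy in every bin; larger values indicate over- or under-confidence.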
L3 Ground Truth
- validateAgainstGroundTruth(): validate L1 events against L3 outcome labels
- GROUND_TRUTH_LABELS: built-in ground truth labels
- getGroundTruthById() / getGroundTruthForPage(): lookup helpers
- OutcomeLabel: L3 ground truth label type
- L3ValidationResult / L3ValidationSummary: validation result types
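
The lookup-and-validate pattern described above can be pictured with a small stand-alone sketch. Everything in it is an assumption made for illustration: the label fields (`id`, `page`, `outcome`), the sample labels, the event shape, and the function names are invented and do not reflect the actual OutcomeLabel type or the contents of GROUND_TRUTH_LABELS.

```typescript
// Hypothetical label and event shapes, invented for this sketch.
interface SketchOutcomeLabel {
  id: string;
  page: string;
  outcome: string;
}

interface SketchEvent {
  page: string;
  outcome: string;
}

// Made-up sample labels standing in for a built-in ground truth table.
const SKETCH_LABELS: SketchOutcomeLabel[] = [
  { id: "gt-1", page: "Example_A", outcome: "reverted" },
  { id: "gt-2", page: "Example_A", outcome: "kept" },
  { id: "gt-3", page: "Example_B", outcome: "reverted" },
];

// Lookup helpers analogous in spirit to by-id and by-page lookups.
function findLabelById(id: string): SketchOutcomeLabel | undefined {
  return SKETCH_LABELS.find((label) => label.id === id);
}

function findLabelsForPage(page: string): SketchOutcomeLabel[] {
  return SKETCH_LABELS.filter((label) => label.page === page);
}

// A label counts as confirmed when some event on the same page has the same outcome.
function validateSketch(events: SketchEvent[]): {
  total: number;
  confirmed: number;
  unconfirmed: string[];
} {
  const unconfirmed: string[] = [];
  let confirmed = 0;
  for (const label of SKETCH_LABELS) {
    const hit = events.some((e) => e.page === label.page && e.outcome === label.outcome);
    if (hit) {
      confirmed += 1;
    } else {
      unconfirmed.push(label.id);
    }
  }
  return { total: SKETCH_LABELS.length, confirmed, unconfirmed };
}

const sketchSummary = validateSketch([{ page: "Example_A", outcome: "reverted" }]);
console.log(sketchSummary, findLabelById("gt-3"), findLabelsForPage("Example_A").length);
```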
License
AGPL-3.0
