@reaatech/pi-bench-scoring
v1.0.1
Published
Scoring engine, statistical analysis, and ranking for prompt-injection-bench
Readme
@reaatech/pi-bench-scoring
Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
Scoring engine, statistical analysis, and ranking algorithms for prompt-injection-bench. Computes weighted scores, confidence intervals, effect sizes, and cross-defense comparisons with multiple statistical tests.
Installation
npm install @reaatech/pi-bench-scoring
# or
pnpm add @reaatech/pi-bench-scoringFeature Overview
- Weighted scoring — Category-weighted overall score with FPR penalty and consistency bonus
- Confidence intervals — Wilson score intervals for detection rates
- Effect sizes — Cohen's h with
small/medium/largeinterpretation - Statistical tests — Two-proportion z-test, chi-square, Welch's t-test, one-way ANOVA
- Bayesian smoothing — Category-level scores smoothed with prior for small samples
- Tiered ranking — S/A/B/C/D composite ranking with configurable weights
- Dual ESM/CJS output — works with
importandrequire
Quick Start
import {
calculateDefenseScore,
compareMetrics,
createStatisticalTests,
} from "@reaatech/pi-bench-scoring";
const score = calculateDefenseScore(benchmarkResult);
console.log(`Detection rate: ${(1 - score.attackSuccessRate) * 100}%`);
console.log(`Weighted score: ${score.overallScore.toFixed(3)}`);
// Compare two defense results
const comparison = compareMetrics(scoreA, scoreB);
console.log(`Effect size (Cohen's h): ${comparison.effectSize}`);
// Run statistical tests
const stats = createStatisticalTests({ significanceLevel: 0.05 });
const zResult = stats.twoProportionZTest(0.9, 100, 0.7, 100);
console.log(`p-value: ${zResult.pValue.toFixed(4)}`);API Reference
MetricsCalculator
| Export | Description |
|--------|-------------|
| calculateDefenseScore(result) | Compute full DefenseScore from a BenchmarkResult |
| calculateCategoryScore(results, category) | Score a single attack category |
| calculateConfidenceInterval(proportion, n, level?) | Wilson score interval |
| calculateStatisticalSignificance(n1, r1, n2, r2, level?) | z-test between two proportions |
| calculateEffectSize(p1, p2) | Cohen's h effect size |
| interpretEffectSize(h) | Returns "small", "medium", or "large" |
| calculateConsistencyBonus(scores) | Computes bonus for consistent detection across categories |
| calculateMetrics(result) | Compute raw metrics (ASR, FPR, latency stats) |
| compareMetrics(scoreA, scoreB) | Pairwise comparison with effect size and significance |
MetricsConfig
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| confidenceLevel | number | 0.95 | 95% confidence level |
| minSampleSize | number | 10 | Minimum samples for valid statistics |
| fprPenalty | number | 1.0 | Multiplier for FPR penalty |
CategoryScorer
| Method | Description |
|--------|-------------|
| score(results, category) | Score a single attack category |
| scoreAll(results) | Score all categories with Bayesian smoothing |
| aggregate(scores) | Aggregate into AggregatedCategoryScores |
createCategoryScorer(config?)
Factory function.
StatisticalTests
| Method | Description |
|--------|-------------|
| twoProportionZTest(p1, n1, p2, n2) | Z-test for two independent proportions |
| chiSquareTest(observed, expected?) | Chi-square goodness of fit or independence |
| oneWayAnova(groups) | One-way ANOVA for comparing 3+ defense categories |
| tTest(a, b) | Welch's t-test (unequal variance assumed) |
HypothesisTestResult
| Property | Type | Description |
|----------|------|-------------|
| pValue | number | The computed p-value |
| significant | boolean | Whether p < significance level |
| testStatistic | number | The computed test statistic |
| degreesOfFreedom | number | Degrees of freedom |
createStatisticalTests(config?)
Factory function.
RankingAlgorithm
| Method | Description |
|--------|-------------|
| rank(scores) | Rank defenses by composite score |
| computeComposite(scores) | Compute composite weighted ranking |
| summarize(rankings) | Generate tiered summary with S/A/B/C/D tiers |
| compare(entryA, entryB) | Head-to-head comparison of two entries |
Tier Thresholds
| Tier | Composite Score | Description | |------|-----------------|-------------| | S | ≥ 0.90 | Exceptional — top-tier defense | | A | ≥ 0.75 | Excellent — strong all-around | | B | ≥ 0.60 | Good — acceptable for production | | C | ≥ 0.40 | Marginal — needs improvement | | D | < 0.40 | Poor — easily bypassed |
createRankingAlgorithm(config?)
Factory function.
Usage Patterns
Scoring a Benchmark Run
import { BenchmarkEngine } from "@reaatech/pi-bench-runner";
import { calculateDefenseScore, compareMetrics } from "@reaatech/pi-bench-scoring";
const result = await engine.runBenchmark(corpus);
const score = calculateDefenseScore(result);
// Check individual category performance
for (const cat of Object.keys(score.categoryScores)) {
const cs = score.categoryScores[cat];
console.log(`${cat}: ${(cs.detectionRate * 100).toFixed(1)}%`);
}Comparing Defenses
const comparison = compareMetrics(rebuffScore, lakeraScore);
console.log(`Rebuff ${comparison.isSignificant ? "significantly" : "marginally"} better`);
console.log(`Effect size (Cohen's h): ${comparison.effectSize} (${comparison.interpretation})`);Related Packages
@reaatech/pi-bench-core— Core types and domain models@reaatech/pi-bench-runner— Benchmark execution engine@reaatech/pi-bench-leaderboard— Leaderboard managementprompt-injection-bench— CLI and umbrella package
