@reaatech/pi-bench-scoring

v1.0.1

Published

6 days ago

Scoring engine, statistical analysis, and ranking for prompt-injection-bench

0High
0Medium
0Low

reaatech

@reaatech/pi-bench-scoring

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Scoring engine, statistical analysis, and ranking algorithms for prompt-injection-bench. Computes weighted scores, confidence intervals, effect sizes, and cross-defense comparisons with multiple statistical tests.

Installation

npm install @reaatech/pi-bench-scoring
# or
pnpm add @reaatech/pi-bench-scoring

Feature Overview

Weighted scoring — Category-weighted overall score with FPR penalty and consistency bonus
Confidence intervals — Wilson score intervals for detection rates
Effect sizes — Cohen's h with small / medium / large interpretation
Statistical tests — Two-proportion z-test, chi-square, Welch's t-test, one-way ANOVA
Bayesian smoothing — Category-level scores smoothed with prior for small samples
Tiered ranking — S/A/B/C/D composite ranking with configurable weights
Dual ESM/CJS output — works with import and require

Quick Start

import {
  calculateDefenseScore,
  compareMetrics,
  createStatisticalTests,
} from "@reaatech/pi-bench-scoring";

const score = calculateDefenseScore(benchmarkResult);
console.log(`Detection rate: ${(1 - score.attackSuccessRate) * 100}%`);
console.log(`Weighted score: ${score.overallScore.toFixed(3)}`);

// Compare two defense results
const comparison = compareMetrics(scoreA, scoreB);
console.log(`Effect size (Cohen's h): ${comparison.effectSize}`);

// Run statistical tests
const stats = createStatisticalTests({ significanceLevel: 0.05 });
const zResult = stats.twoProportionZTest(0.9, 100, 0.7, 100);
console.log(`p-value: ${zResult.pValue.toFixed(4)}`);

API Reference

MetricsCalculator

| Export | Description | |--------|-------------| | calculateDefenseScore(result) | Compute full DefenseScore from a BenchmarkResult | | calculateCategoryScore(results, category) | Score a single attack category | | calculateConfidenceInterval(proportion, n, level?) | Wilson score interval | | calculateStatisticalSignificance(n1, r1, n2, r2, level?) | z-test between two proportions | | calculateEffectSize(p1, p2) | Cohen's h effect size | | interpretEffectSize(h) | Returns "small", "medium", or "large" | | calculateConsistencyBonus(scores) | Computes bonus for consistent detection across categories | | calculateMetrics(result) | Compute raw metrics (ASR, FPR, latency stats) | | compareMetrics(scoreA, scoreB) | Pairwise comparison with effect size and significance |

`MetricsConfig`

| Property | Type | Default | Description | |----------|------|---------|-------------| | confidenceLevel | number | 0.95 | 95% confidence level | | minSampleSize | number | 10 | Minimum samples for valid statistics | | fprPenalty | number | 1.0 | Multiplier for FPR penalty |

CategoryScorer

| Method | Description | |--------|-------------| | score(results, category) | Score a single attack category | | scoreAll(results) | Score all categories with Bayesian smoothing | | aggregate(scores) | Aggregate into AggregatedCategoryScores |

`createCategoryScorer(config?)`

Factory function.

StatisticalTests

| Method | Description | |--------|-------------| | twoProportionZTest(p1, n1, p2, n2) | Z-test for two independent proportions | | chiSquareTest(observed, expected?) | Chi-square goodness of fit or independence | | oneWayAnova(groups) | One-way ANOVA for comparing 3+ defense categories | | tTest(a, b) | Welch's t-test (unequal variance assumed) |

HypothesisTestResult

| Property | Type | Description | |----------|------|-------------| | pValue | number | The computed p-value | | significant | boolean | Whether p < significance level | | testStatistic | number | The computed test statistic | | degreesOfFreedom | number | Degrees of freedom |

`createStatisticalTests(config?)`

Factory function.

RankingAlgorithm

| Method | Description | |--------|-------------| | rank(scores) | Rank defenses by composite score | | computeComposite(scores) | Compute composite weighted ranking | | summarize(rankings) | Generate tiered summary with S/A/B/C/D tiers | | compare(entryA, entryB) | Head-to-head comparison of two entries |

Tier Thresholds

| Tier | Composite Score | Description | |------|-----------------|-------------| | S | ≥ 0.90 | Exceptional — top-tier defense | | A | ≥ 0.75 | Excellent — strong all-around | | B | ≥ 0.60 | Good — acceptable for production | | C | ≥ 0.40 | Marginal — needs improvement | | D | < 0.40 | Poor — easily bypassed |

`createRankingAlgorithm(config?)`

Factory function.

Usage Patterns

Scoring a Benchmark Run

import { BenchmarkEngine } from "@reaatech/pi-bench-runner";
import { calculateDefenseScore, compareMetrics } from "@reaatech/pi-bench-scoring";

const result = await engine.runBenchmark(corpus);
const score = calculateDefenseScore(result);

// Check individual category performance
for (const cat of Object.keys(score.categoryScores)) {
  const cs = score.categoryScores[cat];
  console.log(`${cat}: ${(cs.detectionRate * 100).toFixed(1)}%`);
}

Comparing Defenses

const comparison = compareMetrics(rebuffScore, lakeraScore);
console.log(`Rebuff ${comparison.isSignificant ? "significantly" : "marginally"} better`);
console.log(`Effect size (Cohen's h): ${comparison.effectSize} (${comparison.interpretation})`);

Related Packages

@reaatech/pi-bench-core — Core types and domain models
@reaatech/pi-bench-runner — Benchmark execution engine
@reaatech/pi-bench-leaderboard — Leaderboard management
prompt-injection-bench — CLI and umbrella package

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@reaatech/pi-bench-scoring

Installation

Feature Overview

Quick Start

API Reference

MetricsCalculator

MetricsConfig

CategoryScorer

createCategoryScorer(config?)

StatisticalTests

HypothesisTestResult

createStatisticalTests(config?)

RankingAlgorithm

Tier Thresholds

createRankingAlgorithm(config?)

Usage Patterns

Scoring a Benchmark Run

Comparing Defenses

Related Packages

License

`MetricsConfig`

`createCategoryScorer(config?)`

`createStatisticalTests(config?)`

`createRankingAlgorithm(config?)`