ppef v1.5.0 · 1,081 downloads

PPEF - Portable Programmatic Evaluation Framework

A claim-driven, deterministic evaluation framework for experiments. PPEF provides a structured approach to testing and validating software components through reusable test cases, statistical aggregation, and claim-based evaluation.

PPEF is published to npm with dual ESM/CJS output and a single runtime dependency: commander.

Features

  • Type-safe: Strict TypeScript with generic SUT, Case, and Evaluator abstractions
  • Registry: Centralized registries for Systems Under Test (SUTs) and evaluation cases with role/tag filtering
  • Execution: Deterministic runs with worker threads, checkpointing, memory monitoring, and binary SUT support
  • Statistical: Mann-Whitney U test, Cohen's d, confidence intervals
  • Aggregation: Summary stats, pairwise comparisons, and rankings across runs
  • Evaluation: Four built-in evaluators — claims, robustness, metrics, and exploratory
  • Rendering: LaTeX table generation for thesis integration
  • CLI: Five commands for running, validating, planning, aggregating, and evaluating experiments

Installation

# Install as a dependency
pnpm add ppef

# Or use locally for development
git clone https://github.com/Mearman/ppef.git
cd ppef
pnpm install
pnpm build

Development

pnpm install              # Install dependencies
pnpm build                # TypeScript compile + CJS wrapper generation
pnpm typecheck            # Type-check only (tsc --noEmit)
pnpm lint                 # ESLint + Prettier with auto-fix
pnpm test                 # Run all tests with coverage (c8 + tsx + Node native test runner)

Run a single test file:

npx tsx --test src/path/to/file.test.ts
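
Test files use the Node.js native test runner with node:assert (see Conventions). A minimal sketch of such a file, using an inline stringLength helper as a stand-in rather than anything exported by PPEF:

// example.test.ts: minimal sketch for the native test runner
import { test } from 'node:test';
import assert from 'node:assert/strict';

// Inline stand-in for the code under test (not a PPEF export)
function stringLength(text: string): number {
  return text.length;
}

test('stringLength counts characters', () => {
  assert.equal(stringLength('hello world'), 11);
});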

CLI (after build):

ppef experiment.json   # Run experiment (default command)
ppef run config.json   # Explicit run command
ppef validate          # Validate configuration
ppef plan              # Dry-run execution plan
ppef aggregate         # Post-process results
ppef evaluate          # Run evaluators on results

Quick Start

Create a minimal experiment with three files and a config:

experiment.json

{
  "experiment": {
    "name": "string-length",
    "description": "Compare string length implementations"
  },
  "executor": {
    "repetitions": 3
  },
  "suts": [
    {
      "id": "builtin-length",
      "module": "./sut.mjs",
      "exportName": "createSut",
      "registration": {
        "name": "Built-in .length",
        "version": "1.0.0",
        "role": "primary"
      }
    }
  ],
  "cases": [
    {
      "id": "hello-world",
      "module": "./case.mjs",
      "exportName": "createCase"
    }
  ],
  "metricsExtractor": {
    "module": "./metrics.mjs",
    "exportName": "extract"
  },
  "output": {
    "path": "./results"
  }
}

sut.mjs — System Under Test factory

export function createSut() {
  return {
    id: "builtin-length",
    config: {},
    run: async (input) => ({ length: input.text.length }),
  };
}

case.mjs — Test case definition

export function createCase() {
  return {
    case: {
      caseId: "hello-world",
      caseClass: "basic",
      name: "Hello World",
      version: "1.0.0",
      inputs: { text: "hello world" },
    },
    getInput: async () => ({ text: "hello world" }),
    getInputs: () => ({ text: "hello world" }),
  };
}

metrics.mjs — Metrics extractor

export function extract(result) {
  return { length: result.length ?? 0 };
}

Run it:

npx ppef experiment.json

Workflows

The typical pipeline chains CLI commands: validate, run, aggregate, then evaluate.

ppef validate config.json
    → ppef run config.json
        → ppef aggregate results.json
            → ppef evaluate aggregates.json -t claims -c claims.json

1. Validate Configuration

Check an experiment config for errors before running:

ppef validate experiment.json

2. Preview Execution Plan

See what would run without executing (SUTs × cases × repetitions):

ppef plan experiment.json

3. Run an Experiment

Execute all SUTs against all cases with worker thread isolation:

ppef run experiment.json
ppef run experiment.json -o ./output -j 4 --verbose
ppef run experiment.json --unsafe-in-process  # No worker isolation (debugging only)

The output directory contains a results JSON and (by default) an aggregates JSON.

4. Aggregate Results

Compute summary statistics, pairwise comparisons, and rankings from raw results:

ppef aggregate results.json
ppef aggregate results.json -o aggregates.json --compute-comparisons

5. Evaluate Results

Run evaluators against aggregated (or raw) results. Each evaluator type takes a JSON config file.

Claims — Test Explicit Hypotheses

Test whether SUT A outperforms baseline B on a given metric with statistical significance:

ppef evaluate aggregates.json -t claims -c claims.json -v

claims.json:

{
  "claims": [
    {
      "claimId": "C001",
      "description": "Primary has greater accuracy than baseline",
      "sut": "primary-sut",
      "baseline": "baseline-sut",
      "metric": "accuracy",
      "direction": "greater",
      "scope": "global"
    }
  ],
  "significanceLevel": 0.05
}

Metrics — Threshold, Baseline, and Range Criteria

Evaluate metrics against fixed thresholds, baselines, or target ranges:

ppef evaluate aggregates.json -t metrics -c metrics-config.json

metrics-config.json:

{
  "criteria": [
    {
      "criterionId": "exec-time",
      "description": "Execution time under 1000ms",
      "type": "threshold",
      "metric": "executionTime",
      "sut": "*",
      "threshold": { "operator": "lt", "value": 1000 }
    },
    {
      "criterionId": "f1-range",
      "description": "F1 score in [0.8, 1.0]",
      "type": "target-range",
      "metric": "f1Score",
      "sut": "*",
      "targetRange": { "min": 0.8, "max": 1.0, "minInclusive": true, "maxInclusive": true }
    }
  ]
}

Robustness — Sensitivity Under Perturbations

Measure how performance degrades under perturbations at varying intensity levels:

ppef evaluate results.json -t robustness -c robustness-config.json

robustness-config.json:

{
  "metrics": ["executionTime", "accuracy"],
  "perturbations": ["edge-removal", "noise", "seed-shift"],
  "intensityLevels": [0.1, 0.2, 0.3, 0.4, 0.5],
  "runsPerLevel": 10
}

Output Formats

All evaluators support JSON and LaTeX output:

ppef evaluate aggregates.json -t claims -c claims.json -f latex
ppef evaluate aggregates.json -t metrics -c metrics.json -f json -o results.json

Inline Evaluators

Evaluator configs can be embedded directly in the experiment config via the optional evaluators field, making the config self-contained:

{
  "experiment": { "name": "my-experiment" },
  "executor": { "repetitions": 10 },
  "suts": [ ... ],
  "cases": [ ... ],
  "metricsExtractor": { ... },
  "output": { "path": "./results" },
  "evaluators": [
    {
      "type": "claims",
      "config": {
        "claims": [ ... ]
      }
    }
  ]
}

JSON Schema Validation

Experiment configs can reference the generated schema for IDE autocompletion:

{
  "$schema": "./ppef.schema.json",
  "experiment": { ... }
}

Standalone evaluator configs reference schema $defs:

{
  "$schema": "./ppef.schema.json#/$defs/ClaimsEvaluatorConfig",
  "claims": [ ... ]
}

Cross-Language Specification

PPEF is designed for cross-language interoperability. A Python runner can produce results consumable by the TypeScript aggregator, and vice versa.

The specification lives in spec/ and comprises three layers:

| Layer | Location | Purpose |
|-------|----------|---------|
| JSON Schema | ppef.schema.json | Machine-readable type definitions for all input and output types |
| Conformance Vectors | spec/conformance/ | Pinned input/output pairs that any implementation must reproduce |
| Prose Specification | spec/README.md | Execution semantics, module contracts, statistical algorithms |

All output types are available as $defs in the schema, enabling validation from any language:

ppef.schema.json#/$defs/EvaluationResult
ppef.schema.json#/$defs/ResultBatch
ppef.schema.json#/$defs/AggregationOutput
ppef.schema.json#/$defs/ClaimEvaluationSummary
ppef.schema.json#/$defs/MetricsEvaluationSummary
ppef.schema.json#/$defs/RobustnessAnalysisOutput
ppef.schema.json#/$defs/ExploratoryEvaluationSummary
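
As an illustrative sketch (not part of PPEF), a TypeScript consumer could check an output file against one of these definitions with a generic JSON Schema validator such as Ajv (use Ajv's 2020-12 build if the schema targets that draft). The file paths here are assumptions:

import { readFileSync } from 'node:fs';
import Ajv from 'ajv';

// Paths are assumptions: the generated schema plus an output file to check
const schema = JSON.parse(readFileSync('./ppef.schema.json', 'utf8'));
const result = JSON.parse(readFileSync('./results/result.json', 'utf8'));

// Register the schema, then resolve the EvaluationResult definition by JSON pointer
const ajv = new Ajv({ allErrors: true });
ajv.addSchema(schema, 'ppef');
const validate = ajv.getSchema('ppef#/$defs/EvaluationResult');

if (validate && !validate(result)) {
  console.error(validate.errors);
}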

Run ID generation uses RFC 8785 (JSON Canonicalization Scheme) for deterministic cross-language hashing. Libraries exist for Python (jcs), Rust (serde_jcs), Go (go-jcs), and others.
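
As a rough illustration of that idea (not PPEF's internal code), a deterministic ID can be derived by canonicalizing the run identity fields with a JCS library, for example the canonicalize package on npm, and hashing the result with SHA-256. The field names below are made up for the example:

import { createHash } from 'node:crypto';
import canonicalize from 'canonicalize';

// RFC 8785 canonicalization fixes key order and number formatting,
// so every language hashes byte-identical JSON.
function deterministicId(fields: Record<string, unknown>): string {
  const canonical = canonicalize(fields) ?? '';
  return createHash('sha256').update(canonical).digest('hex');
}

// Reordering keys does not change the ID (field names are illustrative)
deterministicId({ sutId: 'builtin-length', caseId: 'hello-world', repetition: 1 });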

Architecture

Data Flow Pipeline

SUTs + Cases (Registries)
    → Executor (runs SUTs against cases, deterministic runIds)
    → EvaluationResult (canonical schema)
    → ResultCollector (validates + filters)
    → Aggregation Pipeline (summary stats, comparisons, rankings)
    → Evaluators (claims, robustness, metrics, exploratory)
    → Renderers (LaTeX tables for thesis)

Module Map (src/)

| Module | Purpose |
|--------|---------|
| types/ | All canonical type definitions (result, sut, case, claims, evaluator, aggregate, perturbation) |
| registry/ | SUTRegistry and CaseRegistry — generic registries with role/tag filtering |
| executor/ | Orchestrator with worker threads, checkpointing, memory monitoring, binary SUT support |
| collector/ | Result aggregation and JSON schema validation |
| statistical/ | Mann-Whitney U test, Cohen's d, confidence intervals |
| aggregation/ | computeSummaryStats(), computeComparison(), computeRankings(), pipeline |
| evaluators/ | Four built-in evaluators + extensible registry (see below) |
| claims/ | Claim type definitions |
| robustness/ | Perturbation configs and robustness metric types |
| renderers/ | LaTeX table renderer |
| cli/ | Five commands with config loading, module loading, output writing |

Key Abstractions

SUT (SUT<TInputs, TResult>): Generic System Under Test. Has id, config, and run(inputs). Roles: primary, baseline, oracle.
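
For instance, the Quick Start's .length SUT could be written against these generics in TypeScript. The import path and exact type shape are assumptions based on the description above, not a copy of the published declarations:

import type { SUT } from 'ppef/types'; // assumed export location for the SUT type

interface LengthInputs { text: string }
interface LengthResult { length: number }

// Typed counterpart of the Quick Start's sut.mjs factory
const lengthSut: SUT<LengthInputs, LengthResult> = {
  id: 'builtin-length',
  config: {},
  run: async (inputs) => ({ length: inputs.text.length }),
};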

CaseDefinition (CaseDefinition<TInput, TInputs>): Two-phase resource factory — getInput() loads a resource once, getInputs() returns algorithm-specific inputs.

Evaluator (Evaluator<TConfig, TInput, TOutput>): Extensible evaluation with validateConfig(), evaluate(), summarize(). Four built-in types:

  • ClaimsEvaluator — tests explicit hypotheses with statistical significance
  • RobustnessEvaluator — sensitivity analysis under perturbations
  • MetricsEvaluator — multi-criterion threshold/baseline/target-range evaluation
  • ExploratoryEvaluator — hypothesis-free analysis (rankings, pairwise comparisons, correlations, case-class effects)
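
Roughly, a custom evaluator supplies the same three methods. The interface below is declared locally to show the shape only; the actual generics and method signatures in ppef/evaluators may differ:

// Local, illustrative shape of the evaluator contract (not PPEF's published types)
interface EvaluatorShape<TConfig, TInput, TOutput> {
  validateConfig(config: TConfig): boolean;
  evaluate(input: TInput, config: TConfig): TOutput;
  summarize(outputs: TOutput[]): string;
}

// Toy evaluator: passes when a metric stays under a configured limit
const underLimit: EvaluatorShape<{ limit: number }, { value: number }, boolean> = {
  validateConfig: (config) => config.limit > 0,
  evaluate: (input, config) => input.value < config.limit,
  summarize: (outputs) => `${outputs.filter(Boolean).length}/${outputs.length} passed`,
};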

EvaluationResult: Canonical output schema capturing run identity (deterministic SHA-256 runId), correctness, metrics, output artefacts, and provenance.

Subpath Exports

Each module is independently importable:

import { SUTRegistry } from 'ppef/registry';
import { EvaluationResult } from 'ppef/types';
import { computeSummaryStats } from 'ppef/aggregation';

Available subpaths: ppef/types, ppef/registry, ppef/executor, ppef/collector, ppef/statistical, ppef/aggregation, ppef/evaluators, ppef/claims, ppef/robustness, ppef/renderers.

Conventions

  • TypeScript strict mode, ES2023 target, ES modules
  • Node.js native test runner (node:test + node:assert) — not Vitest/Jest
  • Coverage via c8 (text + html + json-summary in ./coverage/)
  • Conventional commits enforced via commitlint + husky
  • Semantic release from main branch
  • No any types — use unknown with type guards
  • Executor produces deterministic runId via SHA-256 hash of RFC 8785 (JCS) canonicalized inputs

License

MIT