@agentpatterns/pench
v0.0.1
Benchmark framework for evaluating AI coding agent architecture patterns
PENCH
An Agentic Pattern Benchmark Tool
/'pɑ̃.ʃe/ (pahn-shay) — French: to lean toward, to favor — as in "pencher pour," to show preference for one thing over another
The Idea
Christopher Alexander saw architecture not as a fixed blueprint but as a living process — a pattern language where each pattern captures a recurring problem and a field-tested response, and the real art lies in how patterns compose into wholes that are greater than their parts. Good architecture isn't engineered from first principles; it emerges through the repeated application of patterns that have proven themselves across many contexts.
The same principle applies to code architecture in the age of AI agents. When you give an agent a set of architectural patterns — hexagonal boundaries, domain-driven design, BDD specs, fitness functions — the resulting code is shaped profoundly by those choices. But unlike human developers who build intuition over years, agents are nondeterministic. The same prompt, the same patterns, the same codebase can produce different results each time. This is not a flaw — it's a feature. It means we can measure how well patterns perform as generative constraints, statistically, across many runs.
pench treats code architecture as an evolutionary discipline. Instead of debating which patterns are "best practice," you run them through a benchmark harness and let the evidence speak. Which patterns produce code that passes more tests? Which ones keep fitness functions green through five rounds of amendments? Which ones resist the entropic drift toward coupling and complexity?
This is Alexander's quality-without-a-name applied to the probabilistic world of AI-generated code: patterns that consistently produce living, evolvable systems score higher than patterns that produce rigid, fragile ones. The nondeterminism of agents becomes the experimental apparatus — one generation tells you nothing, fifty tell you the truth.
Weigh the alternatives. Measure the difference. Lean toward what works.
Why pench?
- Benchmark architecture patterns as agent guardrails — DDD, hexagonal, onion, clean architecture, ports-and-adapters — measured by how they constrain nondeterministic code generation
- Evolutionary architecture scoring — testability, fitness function compliance, coupling metrics, regression resistance, and consistency across runs
- Statistical rigor for probabilistic output — one generation tells you nothing; fifty tell you the truth
- Compare scaffolding strategies — BDD vs TDD, strict linting vs relaxed, monolith vs modular — on your actual codebase
- Pattern language exploration — discover which compositions of patterns produce emergent quality, not just individual pattern compliance
Getting started
Prerequisites
- Node.js >= 18
- Bun >= 1.0
- Claude CLI (`claude` command available on PATH)
Install from npm
```
npm install -g pench
```
Install from source
```
git clone https://github.com/agentpatterns/pench.git
cd pench
bun install
```
Run locally without installing
If you've cloned the repo and want to run pench without a global install:
```
bun install
bun link       # registers the "pench" binary locally
pench --help   # now available on PATH

# when done, unregister with:
bun unlink
```
Or run commands directly via bun without linking:
```
bun run src/cli.ts run --harness ./my-bench --dry-run
```
Full workflow
```
# 1. Scaffold a new benchmark harness (creates <name>/ with DDD skeleton)
pench init my-bench --output ./my-bench

# 1b. Or scaffold with example specs pre-populated (seat-map domain)
pench init my-bench --include-examples

# 1c. Or with a specific example domain
pench init my-bench --include-examples smoke

# 2. Check the harness is structurally complete before running
pench validate ./my-bench

# 3. Write your Gherkin feature specs in my-bench/features/
#    and amendments in my-bench/amendments/

# 4. Run the benchmark (Phase 1 scaffold + Phase 2 evolve)
pench run --harness ./my-bench --runs 3

# Dry-run mode — prints all commands without executing
pench run --harness ./my-bench --dry-run

# 5. Score the results
pench score results/my-bench/run-001

# 6. Aggregate across multiple runs
pench aggregate results/my-bench
```
Architecture
For complete repository structure, benchmark phases, scoring formula, and bench anatomy, see docs/architecture.md.
How It Works
pench evaluates AI coding agents across two phases:
Phase 1: Scaffold
The agent receives Gherkin feature specifications and must scaffold a complete working implementation from scratch. The benchmark suite defines bounded contexts with cross-cutting concerns. The agent must produce code that passes all acceptance tests and architecture fitness functions.
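A feature spec handed to the agent might look like the following Gherkin sketch (the seat-map domain, scenario, and step wording here are illustrative, not taken from the benchmark suite):

```gherkin
Feature: Seat selection
  As a ticket buyer
  I want to select an available seat
  So that it is held for my order

  Scenario: Selecting an available seat
    Given a venue with an available seat "A1"
    When I select seat "A1"
    Then seat "A1" is held for my session
```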
Phase 2: Evolve
After scaffolding, the agent receives a series of amendments — incremental changes to the specification. For each amendment, the agent must evolve the codebase to satisfy new requirements without regressing previously passing tests or violating architectural constraints.
Scoring
The overall score combines both phases:
```
Overall = Phase 1 composite × 0.3 + weightedMean(Phase 2 amendment composites) × 0.7
```
See docs/architecture.md for the full scoring breakdown and src/scoring/rubric.md for detailed metric definitions.
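As a sketch, that combination can be computed like this (the field names and the equal-weight default are assumptions for illustration, not pench's actual API; the rubric defines the real metrics):

```typescript
// Sketch of the overall-score combination. Field names and the
// equal-weight default are illustrative assumptions, not pench's API.
interface Scorecard {
  phase1Composite: number;       // 0..1 composite from Phase 1
  amendmentComposites: number[]; // 0..1 composite per Phase 2 amendment
  amendmentWeights?: number[];   // optional per-amendment weights
}

function weightedMean(values: number[], weights?: number[]): number {
  const w = weights ?? values.map(() => 1); // default: equal weights
  const total = w.reduce((a, b) => a + b, 0);
  return values.reduce((acc, v, i) => acc + v * w[i], 0) / total;
}

function overallScore(card: Scorecard): number {
  return (
    card.phase1Composite * 0.3 +
    weightedMean(card.amendmentComposites, card.amendmentWeights) * 0.7
  );
}

// Example: strong scaffold, mixed evolution
const score = overallScore({
  phase1Composite: 0.9,
  amendmentComposites: [0.8, 0.6, 1.0],
});
// 0.9 × 0.3 + mean(0.8, 0.6, 1.0) × 0.7 = 0.27 + 0.56 = 0.83
console.log(score.toFixed(2));
```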
Commands
pench CLI
| Command | Description |
|---------|-------------|
| pench run [OPTIONS] | Orchestrate a full benchmark run (Phase 1 + Phase 2) against a harness |
| pench init <name> [--output <dir>] [--include-examples [domain]] | Scaffold a new DDD benchmark harness skeleton |
| pench validate <bench-dir> | Check a bench directory for structural completeness |
| pench score <results-dir> | Compute and write scorecard.json from a benchmark run |
| pench aggregate <results-root> | Aggregate scorecards across harnesses and tooling configs |
pench run options
```
pench run [OPTIONS]

Options:
  --harness <path>     Absolute or relative path to the bench harness directory (required)
  --harnesses <paths>  Comma-separated list of bench paths (overrides --harness)
  --tooling <name>     Tooling overlay identifier (e.g. tdd, zod, beads)
  --model <arn>        Claude model ARN or identifier string
  --runs <n>           Number of independent runs (default: 1)
  --parallel <n>       Maximum number of concurrent runs (default: 1)
  --scaffold-only      Run Phase 1 only; skip Phase 2 amendments
  --dry-run            Print all commands without executing them
```
Development commands
| Command | Description |
|---------|-------------|
| bun run build | Build TypeScript to dist/ |
| bun run test | Run Vitest in watch mode |
| bun run test:run | Run all tests once |
| bun run test:json | Run tests with JSON reporter output |
| bun run lint | Lint with Biome |
| bun run lint:ci | Lint with Biome (CI mode, fails on errors) |
| bunx tsc --noEmit | Type check without emitting |
| bash src/runner/run-benchmark.sh | Run the full benchmark via shell orchestrator |
| bash scripts/smoke-test.sh | Run smoke test (invokes a real claude -p agent) |
Benches
A bench is a self-contained architecture benchmark that provides:
- Fitness functions — ArchUnitTS tests that enforce structural constraints (dependency direction, naming conventions, file size limits, cohesion/coupling metrics)
- Claude skills — `.claude/skills/` files that guide the agent through scaffolding and evolution
- Toolchain docs — reference documentation for the technology stack
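For intuition, a dependency-direction fitness function boils down to something like the following self-contained sketch (the real benches use ArchUnitTS and its `toPassAsync()` matcher; the layer names and module map here are assumptions):

```typescript
// Minimal illustration of a dependency-direction fitness function:
// domain code must not import from infrastructure. The module map is
// a stand-in for what ArchUnitTS derives from the real file tree.
type ModuleImports = Record<string, string[]>;

function dependencyViolations(modules: ModuleImports): string[] {
  const violations: string[] = [];
  for (const [file, imports] of Object.entries(modules)) {
    if (!file.startsWith("src/domain/")) continue; // only check domain files
    for (const imp of imports) {
      if (imp.startsWith("src/infrastructure/")) {
        violations.push(`${file} -> ${imp}`);
      }
    }
  }
  return violations;
}

// Example: one violation — domain reaching into infrastructure
const modules: ModuleImports = {
  "src/domain/seat.ts": ["src/domain/venue.ts", "src/infrastructure/db.ts"],
  "src/application/select-seat.ts": ["src/domain/seat.ts"],
};
console.log(dependencyViolations(modules));
// -> ["src/domain/seat.ts -> src/infrastructure/db.ts"]
```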
Example specs
examples/specs/ contains example Gherkin specifications and amendments you can use as references or copy into your bench:
- `examples/specs/seat-map/` — full 10-feature DDD example with 5 amendments (venues, inventory, selection, cross-cutting)
- `examples/specs/smoke/` — minimal single-feature example for quick testing
Adding a new bench
Use pench init to generate a DDD bench skeleton:
```
# Blank skeleton — you write your own Gherkin specs
pench init my-bench --output ./my-bench

# Pre-populated with example specs from the seat-map domain
pench init my-bench --include-examples

# Pre-populated with the minimal smoke example
pench init my-bench --include-examples smoke
```
This creates the required structure:
- `package.json` — named `@pench/bench-<name>`
- `.claude/skills/scaffold/SKILL.md` — scaffold instructions for the agent
- `.claude/skills/evolve/SKILL.md` — evolution instructions for the agent
- `features/`, `tests/`, `amendments/` — empty stubs (default) or pre-populated with example specs (`--include-examples`)
The `fitness/` directory is intentionally not created — the agent generates ArchUnitTS fitness tests during Phase 1 of the benchmark run.
After scaffolding, verify the structure with pench validate ./my-bench, then add your Gherkin feature files and amendments. The runner picks up the bench via pench run --harness ./my-bench.
Development notes
- Vitest 3.x is pinned. ArchUnitTS is incompatible with Vitest 4.
- `globals: true` is required in `vitest.config.ts` for the ArchUnitTS `toPassAsync()` matcher to resolve.
- All workspace packages use `"type": "module"` for ESM.
- Workspaces: `src/*`.
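The globals requirement translates to a `vitest.config.ts` roughly like this (a minimal sketch; the project's real config likely carries more options):

```typescript
// Minimal vitest.config.ts sketch — globals: true lets ArchUnitTS
// matchers such as toPassAsync() register on the global expect.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    globals: true,
    environment: "node",
  },
});
```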
