@agentpatterns/pench
v0.0.1
Benchmark framework for evaluating AI coding agent architecture patterns
PENCH
An Agentic Pattern Benchmark Tool
/'pɑ̃.ʃe/ (pahn-shay) — French: to lean toward, to favor — as in "pencher pour," to show preference for one thing over another
The Idea
Christopher Alexander saw architecture not as a fixed blueprint but as a living process — a pattern language where each pattern captures a recurring problem and a field-tested response, and the real art lies in how patterns compose into wholes that are greater than their parts. Good architecture isn't engineered from first principles; it emerges through the repeated application of patterns that have proven themselves across many contexts.
The same principle applies to code architecture in the age of AI agents. When you give an agent a set of architectural patterns — hexagonal boundaries, domain-driven design, BDD specs, fitness functions — the resulting code is shaped profoundly by those choices. But unlike human developers who build intuition over years, agents are nondeterministic. The same prompt, the same patterns, the same codebase can produce different results each time. This is not a flaw — it's a feature. It means we can measure how well patterns perform as generative constraints, statistically, across many runs.
pench treats code architecture as an evolutionary discipline. Instead of debating which patterns are "best practice," you run them through a benchmark harness and let the evidence speak. Which patterns produce code that passes more tests? Which ones keep fitness functions green through five rounds of amendments? Which ones resist the entropic drift toward coupling and complexity?
This is Alexander's quality-without-a-name applied to the probabilistic world of AI-generated code: patterns that consistently produce living, evolvable systems score higher than patterns that produce rigid, fragile ones. The nondeterminism of agents becomes the experimental apparatus — one generation tells you nothing, fifty tell you the truth.
Weigh the alternatives. Measure the difference. Lean toward what works.
Why pench?
- Benchmark architecture patterns as agent guardrails — DDD, hexagonal, onion, clean architecture, ports-and-adapters — measured by how they constrain nondeterministic code generation
- Evolutionary architecture scoring — testability, fitness function compliance, coupling metrics, regression resistance, and consistency across runs
- Statistical rigor for probabilistic output — one generation tells you nothing; fifty tell you the truth
- Compare scaffolding strategies — BDD vs TDD, strict linting vs relaxed, monolith vs modular — on your actual codebase
- Pattern language exploration — discover which compositions of patterns produce emergent quality, not just individual pattern compliance
Getting started
Prerequisites
- Node.js >= 18
- Bun >= 1.0
- Claude CLI (`claude` command available on PATH)
Install from npm
```
npm install -g pench
```
Install from source
```
git clone https://github.com/agentpatterns/pench.git
cd pench
bun install
```
Run locally without installing
If you've cloned the repo and want to run pench without a global install:
```
bun install
bun link       # registers the "pench" binary locally
pench --help   # now available on PATH

# when done, unregister with:
bun unlink
```
Or run commands directly via bun without linking:
```
bun run src/cli.ts run --harness ./my-bench --dry-run
```
Full workflow
```
# 1. Scaffold a new benchmark harness (creates <name>/ with DDD skeleton)
pench init my-bench --output ./my-bench

# 1b. Or scaffold with example specs pre-populated (seat-map domain)
pench init my-bench --include-examples

# 1c. Or with a specific example domain
pench init my-bench --include-examples smoke

# 2. Check the harness is structurally complete before running
pench validate ./my-bench

# 3. Write your Gherkin feature specs in my-bench/features/
#    and amendments in my-bench/amendments/

# 4. Run the benchmark (Phase 1 scaffold + Phase 2 evolve)
pench run --harness ./my-bench --runs 3

# Dry-run mode — prints all commands without executing
pench run --harness ./my-bench --dry-run

# 5. Score the results
pench score results/my-bench/run-001

# 6. Aggregate across multiple runs
pench aggregate results/my-bench
```
Architecture
For complete repository structure, benchmark phases, scoring formula, and bench anatomy, see docs/architecture.md.
How It Works
pench evaluates AI coding agents across two phases:
Phase 1: Scaffold
The agent receives Gherkin feature specifications and must scaffold a complete working implementation from scratch. The benchmark suite defines bounded contexts with cross-cutting concerns. The agent must produce code that passes all acceptance tests and architecture fitness functions.
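A feature spec handed to the agent might look like the following Gherkin sketch (the seat-map domain, scenario, and step wording here are illustrative, not taken from the benchmark suite):

```gherkin
Feature: Seat selection
  As a ticket buyer
  I want to select an available seat
  So that it is held for my order

  Scenario: Selecting an available seat
    Given a venue with an available seat "A1"
    When I select seat "A1"
    Then seat "A1" is held for my session
```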
Phase 2: Evolve
After scaffolding, the agent receives a series of amendments — incremental changes to the specification. For each amendment, the agent must evolve the codebase to satisfy new requirements without regressing previously passing tests or violating architectural constraints.
Scoring
The overall score combines both phases:
```
Overall = Phase 1 composite × 0.3 + weightedMean(Phase 2 amendment composites) × 0.7
```
See docs/architecture.md for the full scoring breakdown and src/scoring/rubric.md for detailed metric definitions.
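As a sketch, that combination can be computed like this (the field names and the equal-weight default are assumptions for illustration, not pench's actual API; the rubric defines the real metrics):

```typescript
// Sketch of the overall-score combination. Field names and the
// equal-weight default are illustrative assumptions, not pench's API.
interface Scorecard {
  phase1Composite: number;       // 0..1 composite from Phase 1
  amendmentComposites: number[]; // 0..1 composite per Phase 2 amendment
  amendmentWeights?: number[];   // optional per-amendment weights
}

function weightedMean(values: number[], weights?: number[]): number {
  const w = weights ?? values.map(() => 1); // default: equal weights
  const total = w.reduce((a, b) => a + b, 0);
  return values.reduce((acc, v, i) => acc + v * w[i], 0) / total;
}

function overallScore(card: Scorecard): number {
  return (
    card.phase1Composite * 0.3 +
    weightedMean(card.amendmentComposites, card.amendmentWeights) * 0.7
  );
}

// Example: strong scaffold, mixed evolution
const score = overallScore({
  phase1Composite: 0.9,
  amendmentComposites: [0.8, 0.6, 1.0],
});
// 0.9 × 0.3 + mean(0.8, 0.6, 1.0) × 0.7 = 0.27 + 0.56 = 0.83
console.log(score.toFixed(2));
```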
Commands
pench CLI
| Command | Description |
|---------|-------------|
| pench run [OPTIONS] | Orchestrate a full benchmark run (Phase 1 + Phase 2) against a harness |
| pench init <name> [--output <dir>] [--include-examples [domain]] | Scaffold a new DDD benchmark harness skeleton |
| pench validate <bench-dir> | Check a bench directory for structural completeness |
| pench score <results-dir> | Compute and write scorecard.json from a benchmark run |
| pench aggregate <results-root> | Aggregate scorecards across harnesses and tooling configs |
pench run options
```
pench run [OPTIONS]

Options:
  --harness <path>     Absolute or relative path to the bench harness directory (required)
  --harnesses <paths>  Comma-separated list of bench paths (overrides --harness)
  --tooling <name>     Tooling overlay identifier (e.g. tdd, zod, beads)
  --model <arn>        Claude model ARN or identifier string
  --runs <n>           Number of independent runs (default: 1)
  --parallel <n>       Maximum number of concurrent runs (default: 1)
  --scaffold-only      Run Phase 1 only; skip Phase 2 amendments
  --dry-run            Print all commands without executing them
```
Development commands
| Command | Description |
|---------|-------------|
| bun run build | Build TypeScript to dist/ |
| bun run test | Run Vitest in watch mode |
| bun run test:run | Run all tests once |
| bun run test:json | Run tests with JSON reporter output |
| bun run lint | Lint with Biome |
| bun run lint:ci | Lint with Biome (CI mode, fails on errors) |
| bunx tsc --noEmit | Type check without emitting |
| bash src/runner/run-benchmark.sh | Run the full benchmark via shell orchestrator |
| bash scripts/smoke-test.sh | Run smoke test (invokes a real claude -p agent) |
Benches
A bench is a self-contained architecture benchmark that provides:
- Fitness functions — ArchUnitTS tests that enforce structural constraints (dependency direction, naming conventions, file size limits, cohesion/coupling metrics)
- Claude skills — `.claude/skills/` files that guide the agent through scaffolding and evolution
- Toolchain docs — reference documentation for the technology stack
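For intuition, a dependency-direction fitness function boils down to something like the following self-contained sketch (the real benches use ArchUnitTS and its `toPassAsync()` matcher; the layer names and module map here are assumptions):

```typescript
// Minimal illustration of a dependency-direction fitness function:
// domain code must not import from infrastructure. The module map is
// a stand-in for what ArchUnitTS derives from the real file tree.
type ModuleImports = Record<string, string[]>;

function dependencyViolations(modules: ModuleImports): string[] {
  const violations: string[] = [];
  for (const [file, imports] of Object.entries(modules)) {
    if (!file.startsWith("src/domain/")) continue; // only check domain files
    for (const imp of imports) {
      if (imp.startsWith("src/infrastructure/")) {
        violations.push(`${file} -> ${imp}`);
      }
    }
  }
  return violations;
}

// Example: one violation — domain reaching into infrastructure
const modules: ModuleImports = {
  "src/domain/seat.ts": ["src/domain/venue.ts", "src/infrastructure/db.ts"],
  "src/application/select-seat.ts": ["src/domain/seat.ts"],
};
console.log(dependencyViolations(modules));
// -> ["src/domain/seat.ts -> src/infrastructure/db.ts"]
```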
Example specs
examples/specs/ contains example Gherkin specifications and amendments you can use as references or copy into your bench:
- `examples/specs/seat-map/` — full 10-feature DDD example with 5 amendments (venues, inventory, selection, cross-cutting)
- `examples/specs/smoke/` — minimal single-feature example for quick testing
Adding a new bench
Use pench init to generate a DDD bench skeleton:
```
# Blank skeleton — you write your own Gherkin specs
pench init my-bench --output ./my-bench

# Pre-populated with example specs from the seat-map domain
pench init my-bench --include-examples

# Pre-populated with the minimal smoke example
pench init my-bench --include-examples smoke
```
This creates the required structure:
- `package.json` — named `@pench/bench-<name>`
- `.claude/skills/scaffold/SKILL.md` — scaffold instructions for the agent
- `.claude/skills/evolve/SKILL.md` — evolution instructions for the agent
- `features/`, `tests/`, `amendments/` — empty stubs (default) or pre-populated with example specs (`--include-examples`)
The `fitness/` directory is intentionally not created — the agent generates ArchUnitTS fitness tests during Phase 1 of the benchmark run.
After scaffolding, verify the structure with pench validate ./my-bench, then add your Gherkin feature files and amendments. The runner picks up the bench via pench run --harness ./my-bench.
Development notes
- Vitest 3.x is pinned. ArchUnitTS is incompatible with Vitest 4.
- `globals: true` is required in `vitest.config.ts` for the ArchUnitTS `toPassAsync()` matcher to resolve.
- All workspace packages use `"type": "module"` for ESM.
- Workspaces: `src/*`.
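The globals requirement translates to a `vitest.config.ts` roughly like this (a minimal sketch; the project's real config likely carries more options):

```typescript
// Minimal vitest.config.ts sketch — globals: true lets ArchUnitTS
// matchers such as toPassAsync() register on the global expect.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    globals: true,
    environment: "node",
  },
});
```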
