@wifo/factory-harness
v0.0.14
Published
Scenario runner for software-factory specs — runs `test:` satisfaction lines and scores `judge:` lines via LLM
The scenario runner. Executes `test:` and `judge:` lines against a parsed spec; produces a typed `HarnessReport`.
@wifo/factory-harness powers the runtime's validatePhase. Given a parsed Spec (from @wifo/factory-core), it walks each scenario's Satisfaction: block, runs bun test for test: lines and dispatches to an LLM judge for judge: lines, and returns a typed report. You usually don't reach for this package directly — the runtime does.
For AI agents: start at `AGENTS.md` (top-level). This README is detailed reference.
Install
pnpm add @wifo/factory-harness

Pre-installed via `factory init` (the scaffold's runtime depends on it).
When to reach for it
- Programmatically run a spec's scenarios without going through the full runtime. Use `runHarness({ spec, ... })` to get a `HarnessReport`.
- Build your own validate phase. Compose `runTestSatisfaction` + a custom judge client to define a domain-specific `validatePhase`.
- Implement a custom judge client. The exported `JudgeClient` interface is what the runtime + spec-reviewer + `dodPhase` all consume. Provide your own (e.g., a different LLM provider) and pass it in.
- Parse a `test:` line manually. `parseTestLine` strips the locked syntax (file path + optional `"name"` filter) and tolerates stray backticks.
What's inside
CLI
factory-harness run <spec-path> [flags]

| Flag | Default | Notes |
|---|---|---|
| --scenario <ids> | all | Comma-separated scenario id filter (e.g., S-1,S-2,H-1). |
| --visible | off | Only visible scenarios (skip holdouts). |
| --holdouts | off | Only holdout scenarios. |
| --no-judge | off | Skip judge: lines (status skipped). |
| --model <name> | claude-haiku-4-5 | Override judge model. |
| --timeout-ms <n> | 60000 | Per-judge timeout. |
The CLI is mostly used in tests + ad-hoc inspection. Production code reaches for runHarness() programmatically (or — more likely — uses the runtime's validatePhase).
Public API
import { runHarness, runTestSatisfaction, parseTestLine, formatReport }
  from '@wifo/factory-harness';
import type {
  HarnessReport, HarnessScenarioResult, HarnessSatisfactionResult,
  HarnessOptions, JudgeClient, Judgment,
  TestRunnerOptions, ParsedTestLine, ReporterKind,
} from '@wifo/factory-harness';

Concepts
Two satisfaction kinds.
- `test: <path> "<name>"`: spawns `bun test <path> [-t "<name>"]`. Pass/fail from exit code. The harness strips a leading + trailing backtick from both the path and the name (since v0.0.6); bare paths are canonical but legacy backticked paths still work.
- `judge: "<criterion>"`: calls a `JudgeClient` (default: Anthropic Claude via `@anthropic-ai/sdk` with tool-use for structured pass/score/reasoning output). The reviewer + the runtime's `validatePhase` and `dodPhase` all reuse this client interface.
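To make the `test:` syntax concrete, here is a minimal sketch of parsing such a line into a path and an optional name filter, stripping one leading and one trailing backtick from each. It is not the package's `parseTestLine`: the name `parseTestLineSketch`, the return shape, and the regex are all assumptions made for illustration.

```typescript
// Hypothetical sketch; NOT the real parseTestLine from @wifo/factory-harness.
interface ParsedTestLineSketch {
  file: string;
  name?: string; // optional "<name>" filter, the value that would go to `bun test -t`
}

// Strip at most one leading and one trailing backtick.
function stripBackticks(s: string): string {
  return s.replace(/^`/, '').replace(/`$/, '');
}

function parseTestLineSketch(line: string): ParsedTestLineSketch | null {
  // Assumed shape: test: <path> "<name>"   (the quoted name is optional)
  const m = /^test:\s*(\S+)(?:\s+"(.*)")?\s*$/.exec(line.trim());
  if (m === null) return null;
  const rawFile = m[1];
  if (rawFile === undefined) return null;
  const file = stripBackticks(rawFile);
  const rawName = m[2];
  return rawName === undefined ? { file } : { file, name: stripBackticks(rawName) };
}
```

Under these assumptions, a bare path and a legacy backticked path parse to the same `file` value, and a line that does not start with `test:` yields `null`.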
Coverage trip detection (v0.0.13+). Per-scenario `bun test --test-name-pattern <name>` runs only exercise a slice of a file, so a host repo's `bunfig.toml` coverage threshold trips on the slice and bun exits non-zero even though every scenario assertion passed. The harness parses bun's output: when bun exits non-zero AND the output contains `0 fail` AND the canonical `coverage threshold of <n> not met` marker, the satisfaction is classified as `pass` with detail prefix `harness/coverage-threshold-tripped: <marker>; <existing tail>` rather than `fail`. The match is conservative and requires both signals: a non-zero exit without the marker is still classified as `fail`. Coverage is a holistic property, meaningful only when the whole suite runs; the host's coverage gate runs separately at DoD time on the full suite. (v0.0.12 attempted the carve-out via `--coverage=false`, but bun 1.3.x rejects that flag, so v0.0.13 ships the stdout-parse path instead.)
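The two-signal rule can be sketched as a small pure function. This is an illustration only, not the harness's implementation: `classifyTestRun`, the result shape, and both regexes are hypothetical, and the exact wording of bun's output is assumed from the marker text quoted above.

```typescript
// Hypothetical sketch of the coverage-trip rule; not the package's code.
interface ClassifiedRun {
  status: 'pass' | 'fail';
  detail: string;
}

// Assumed marker shape: "coverage threshold of <n> not met".
const COVERAGE_MARKER = /coverage threshold of \S+ not met/;
// Assumed summary fragment; \b keeps "10 fail" from matching.
const ZERO_FAIL = /\b0 fail\b/;

function classifyTestRun(exitCode: number, output: string, tail: string): ClassifiedRun {
  if (exitCode === 0) return { status: 'pass', detail: tail };
  const marker = COVERAGE_MARKER.exec(output);
  // Both signals are required: zero failing assertions AND the coverage marker.
  if (marker !== null && ZERO_FAIL.test(output)) {
    return {
      status: 'pass',
      detail: `harness/coverage-threshold-tripped: ${marker[0]}; ${tail}`,
    };
  }
  // A non-zero exit without both signals stays a failure.
  return { status: 'fail', detail: tail };
}
```

Keeping the check as a conjunction means a genuine assertion failure that happens to coincide with a coverage trip is still reported as `fail`.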
Quote-char normalization in test-name patterns (v0.0.12+). Stylistic apostrophes drift between a spec's `test:` line (e.g. "v0.0.10's hash") and the test's actual `it()` name (e.g. 'v0.0.10s hash', auto-stylized during implementation), so an exact substring match misses otherwise correct work. Since v0.0.12 the harness normalizes quote-like characters (ASCII + curly apostrophes, ASCII + curly double-quotes, backticks) in the pattern before passing `-t` to bun. The companion factory spec lint rule `spec/test-name-quote-chars` catches non-ASCII quote chars at scoping time so authors can rewrite cleanly.
Test-name regex matching (v0.0.14). The v0.0.12 strip-everything carve-out caused the opposite bug: a spec that genuinely uses "slug's log" got stripped to "slugs log" before -t, while the actual test in the file kept the apostrophe → bun's regex matched 0 tests → 5 phantom no-converge iterations on the v0.0.13 BASELINE. v0.0.14 narrows the strip set:
- Apostrophes (ASCII `'` and curly `‘’`): preserved as literal characters on both sides of the comparison. Modern Claude reliably emits apostrophes in `it()` names; the strip caused false negatives.
- Curly double-quotes (`“”`): still converted to ASCII `"` (helpful when authors paste from rich-text editors).
- Backticks: still stripped (existing behavior).
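The narrowed v0.0.14 strip set could look roughly like the following. This is a sketch under assumptions: `normalizeTestNamePattern` is a hypothetical name, and "stripped" is read here as "removed from the pattern".

```typescript
// Hypothetical sketch of the v0.0.14 strip set; not the package's code.
function normalizeTestNamePattern(pattern: string): string {
  // Apostrophes (', \u2018, \u2019) are deliberately left untouched.
  return pattern
    .replace(/[\u201C\u201D]/g, '"') // curly double-quotes -> ASCII "
    .replace(/`/g, ''); // backticks removed
}
```

With this shape, a pattern such as `slug's log` reaches `-t` unchanged, which is the case the v0.0.12 behavior broke.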
A complementary safety net: when bun reports regex "<pattern>" matched 0 tests and exits non-zero, the satisfaction is classified as status: 'error' (NOT fail) with detail prefix harness/test-name-regex-no-match: <marker> in <file>; <existing tail>. The runtime treats error as a tooling-mismatch halting condition rather than re-running the implement phase trying to fix non-existent assertion failures. The detector mirrors the v0.0.13 coverage-trip shape; the coverage-trip path takes precedence when both signals appear.
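The precedence between the two detectors can be sketched as an ordered check on a non-zero exit. Everything named here (`classifyNonZeroExit`, the result type, the regexes) is hypothetical, and the marker wording is assumed from the prose rather than taken from bun's source.

```typescript
// Hypothetical sketch of detector ordering on a non-zero exit; not the package's code.
interface NonZeroResult {
  status: 'pass' | 'fail' | 'error';
  detail: string;
}

// Marker shapes assumed from the descriptions above.
const COVERAGE_TRIP = /coverage threshold of \S+ not met/;
const REGEX_NO_MATCH = /regex ".*" matched 0 tests/;

function classifyNonZeroExit(output: string, file: string, tail: string): NonZeroResult {
  // 1. The coverage-trip path takes precedence when both signals appear.
  const coverage = COVERAGE_TRIP.exec(output);
  if (coverage !== null && /\b0 fail\b/.test(output)) {
    return {
      status: 'pass',
      detail: `harness/coverage-threshold-tripped: ${coverage[0]}; ${tail}`,
    };
  }
  // 2. A pattern that matched no tests is a tooling mismatch: 'error', not 'fail'.
  const noMatch = REGEX_NO_MATCH.exec(output);
  if (noMatch !== null) {
    return {
      status: 'error',
      detail: `harness/test-name-regex-no-match: ${noMatch[0]} in ${file}; ${tail}`,
    };
  }
  // 3. Anything else is an ordinary failure.
  return { status: 'fail', detail: tail };
}
```

Returning `error` for the no-match case gives a caller something distinct to halt on, instead of retrying an implement step that has no assertion failure to fix.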
JudgeClient interface. A single method judge(args) that takes { criterion, scenario, artifact, model, timeoutMs } and returns { pass, score, reasoning }. The runtime ships claudeCliJudgeClient (subprocess-based) in @wifo/factory-spec-review; you can implement your own (e.g., for a different LLM provider).
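As a rough illustration of implementing your own client, the sketch below defines a local stand-in for the interface and a deterministic keyword-matching judge that needs no LLM. The types are assumptions: the real `JudgeClient` and `Judgment` types come from the package, and the field types shown here (including the `Promise` return and the opaque `scenario`) are guesses based on the description above.

```typescript
// Hypothetical local stand-ins; the real types are exported by @wifo/factory-harness.
interface JudgeArgsSketch {
  criterion: string;
  scenario: unknown; // real type comes from the package; left opaque here
  artifact: string; // assumed to be text
  model: string;
  timeoutMs: number;
}

interface JudgmentSketch {
  pass: boolean;
  score: number;
  reasoning: string;
}

interface JudgeClientSketch {
  judge(args: JudgeArgsSketch): Promise<JudgmentSketch>;
}

// A toy judge: scores by the fraction of criterion keywords found in the artifact.
const keywordJudge: JudgeClientSketch = {
  async judge({ criterion, artifact }: JudgeArgsSketch): Promise<JudgmentSketch> {
    const haystack = artifact.toLowerCase();
    const words = criterion.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
    const hits = words.filter((w) => haystack.includes(w));
    const score = words.length === 0 ? 0 : hits.length / words.length;
    return {
      pass: score >= 0.5,
      score,
      reasoning: `${hits.length}/${words.length} criterion keywords found`,
    };
  },
};
```

A stub like this is mainly useful for offline tests; a client for a different LLM provider would keep the same method shape and replace the body with a network call that honors `timeoutMs`.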
Status enum. Each scenario's satisfaction lines aggregate into one of pass, fail, error, skipped. runHarness aggregates per-scenario results into the report.
Worked example
import { runHarness } from '@wifo/factory-harness';
import { parseSpec } from '@wifo/factory-core';

const spec = parseSpec(await Bun.file('docs/specs/foo.md').text());
const report = await runHarness({
  spec,
  cwd: process.cwd(),
  noJudge: false,
  // optional: provide a custom judge client
  // judgeClient: myCustomJudgeClient,
});

console.log(report.summary); // { pass: 3, fail: 0, error: 0, skipped: 0 }
for (const scenario of report.scenarios) {
  console.log(scenario.id, scenario.status);
}

CLI:
$ pnpm exec factory-harness run docs/specs/foo.md --no-judge
spec=foo scenarios=3
S-1: pass
S-2: pass
S-3: pass
summary: 3 pass, 0 fail, 0 error, 0 skipped

See also
- `AGENTS.md`: single doc for AI agents using the toolchain.
- `packages/runtime/README.md`: the runtime's `validatePhase` is the primary harness consumer.
- `packages/core/README.md`: spec format + parser.
- `packages/spec-review/README.md`: the spec reviewer reuses the harness's `JudgeClient` interface.
- `CHANGELOG.md`: every release's deltas.
Status
Pre-alpha. APIs may break in point releases until v0.1.0.
