@cosmo-wise/harness
v1.0.2
Harness CLI for long-running AI agent tasks
Harness
Harness is a CLI orchestrator for long-running AI coding tasks. It implements the Harness pattern with three specialized agents:
- Planner: turns a user request into a structured plan with sprints and acceptance criteria
- Generator: executes the plan and writes code
- Evaluator: scores the result, surfaces issues, and decides whether another iteration is needed
The project is no longer just a thin proof of concept. The current codebase includes resumable runs, parallel sprint execution with Git worktrees, incremental repair strategies, prompt-fragment assets, mission-style regression tasks, and browser-based render-audit artifacts for supported frontend targets.
Current Status
The repository has moved well beyond the original single-loop implementation. The core system now includes:
- A phase-based runtime under src/core/harness/ instead of a single monolith
- Unified run, resume, retry, and replan flows backed by persisted run state
- Parallel sprint execution with isolated Git worktrees
- Incremental iteration strategies: sprint_retry, issue_fix, and full_regenerate
- Generator semantic statuses: DONE, DONE_WITH_CONCERNS, NEEDS_CONTEXT, and BLOCKED
- Stagnation stop-loss for repeated issue_fix attempts
- Artifact-aware resume that can reuse a persisted plan.json
- Safer queued state writes and per-iteration artifact persistence
- Structured verification evidence persisted into generation artifacts and evaluator prompts
- An explicit wiring gate in runtime-governance evaluation: runtime tasks should not pass without evidence for import, instantiation, invocation, and integration coverage
- Status ledgers such as output/plan/TODO/90-STATUS.md are explicitly low-trust evidence; they cannot pass evaluation alone and are deferred until root validation during harvest
- Explicitly configured provider models and fallback chains run a startup preflight, so auth / not-found failures can stop the run before planning
- Audited audit/ bundles are treated as low-trust reference inputs; rewriting them fails evaluation instead of self-certifying completion
- Generating audit/ artifacts without a real audited input bundle, or claiming success without files or structured verification evidence, fails evaluation as anti-placeholder / anti-self-certification protection
- Low-score or evaluator-failed terminal runs persist as completed_unacceptable instead of looking like clean success
- Prompt fragment assets plus synchronization tooling
- Mission assets for repeatable regression scenarios
- Browser render-audit artifacts wired into the generation → evaluation path for supported frontend targets
This also means the old README was materially outdated. The sections below describe the codebase as it exists now.
Core Workflow
Harness runs an iterative loop:
1. Planning: The Planner expands the user prompt into an overview, sprints, risks, tech spec, and optional scaffold recommendation.
2. Generation: The Generator implements the plan, either sequentially or across parallel sprint groups in isolated worktrees.
3. Render Audit (when applicable): For supported frontend targets, Harness can build the app, start a preview server, capture browser evidence, and persist screenshots plus browser errors.
4. Evaluation: The Evaluator scores the result against weighted criteria and can gate visually sensitive frontend work when browser evidence is missing.
5. Iteration Control: Harness decides whether to stop, retry a sprint, apply a focused issue fix, or fully regenerate.
The loop stops when one of these conditions is met:
- score threshold reached
- max iterations reached
- no significant improvement
- repeated issue_fix attempts fail to materially improve the score
- the Generator returns a semantic blocked state that prevents forward progress
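The stop conditions above can be sketched as a single decision function. This is a minimal illustration, not the Harness implementation: the type names, the `stalledIssueFixAttempts` counter, the order in which the checks run, and the choice to treat only `BLOCKED` as the blocking status are all assumptions made for the example. The limit names mirror the `iterations` and `incrementalFix` config keys shown later in this README.

```typescript
// Illustrative types only; these are not Harness's real internal shapes.
type GeneratorStatus = "DONE" | "DONE_WITH_CONCERNS" | "NEEDS_CONTEXT" | "BLOCKED";

type StopReason =
  | "generator_blocked"
  | "score_threshold"
  | "max_iterations"
  | "issue_fix_stagnation"
  | "no_improvement";

interface LoopSnapshot {
  scores: number[];                // evaluator score per completed iteration
  stalledIssueFixAttempts: number; // consecutive issue_fix passes without material gain (hypothetical counter)
  generatorStatus: GeneratorStatus;
}

interface LoopLimits {
  maxIterations: number;
  scoreThreshold: number;
  minImprovement: number;
  maxConsecutiveIssueFixAttempts: number;
}

// Returns the first matching stop reason, or null when the loop may continue.
// The check order is an assumption made for this sketch.
function stopReason(snapshot: LoopSnapshot, limits: LoopLimits): StopReason | null {
  const { scores } = snapshot;
  const last = scores.length > 0 ? scores[scores.length - 1] : undefined;
  const previous = scores.length > 1 ? scores[scores.length - 2] : undefined;

  if (snapshot.generatorStatus === "BLOCKED") return "generator_blocked";
  if (last !== undefined && last >= limits.scoreThreshold) return "score_threshold";
  if (scores.length >= limits.maxIterations) return "max_iterations";
  if (snapshot.stalledIssueFixAttempts >= limits.maxConsecutiveIssueFixAttempts) {
    return "issue_fix_stagnation";
  }
  if (last !== undefined && previous !== undefined && last - previous < limits.minImprovement) {
    return "no_improvement";
  }
  return null;
}
```

With limits of 5 iterations, a threshold of 90, and a minimum improvement of 3, a score history of 60 then 61 would stop the loop for lack of improvement, while a single score of 60 would let it continue.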
Major Capabilities
1. Three-agent orchestration
Harness supports claude, gemini, codex, and opencode as agent backends. Each agent has its own CLI, model, timeout, and retry policy.
2. Parallel sprint execution
When a plan contains enough independent work, Harness can:
- topologically group non-conflicting sprints
- create isolated Git worktrees
- run multiple Generator executions in parallel
- merge successful work back into the main working tree
- treat retried groups as latest-attempt-only state, so stale blocked/error residue does not leak into the final completion path
This is the main speed lever for larger tasks.
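As a rough illustration of the topological grouping step, the sketch below layers sprints into groups whose dependencies are already satisfied by earlier groups. It is an assumption-laden example: the `Sprint` shape and its `dependsOn` field are hypothetical, and it ignores file-conflict detection, worktree creation, and merging, which the real runtime also handles.

```typescript
// Hypothetical sprint shape; `dependsOn` is an illustrative field name.
interface Sprint {
  name: string;
  dependsOn: string[];
}

// Layered topological sort: each returned group contains sprints whose
// dependencies all belong to earlier groups, so a group can run in parallel.
function groupSprints(sprints: Sprint[]): string[][] {
  const pending = new Map<string, Set<string>>();
  sprints.forEach((s) => pending.set(s.name, new Set(s.dependsOn)));

  const groups: string[][] = [];
  while (pending.size > 0) {
    const ready: string[] = [];
    pending.forEach((deps, name) => {
      if (deps.size === 0) ready.push(name);
    });
    if (ready.length === 0) {
      throw new Error("Sprint dependencies contain a cycle or an unknown sprint");
    }
    // Remove the ready sprints, then mark them as satisfied for the rest.
    ready.forEach((name) => pending.delete(name));
    pending.forEach((deps) => ready.forEach((name) => deps.delete(name)));
    groups.push(ready);
  }
  return groups;
}
```

Each inner array is a candidate parallel group; the outer array order is the order in which groups would have to run.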
3. Incremental repair instead of blind full rewrites
Harness does not always regenerate everything after a failed evaluation. It can choose:
- sprint_retry: retry only specific failed or incomplete sprints
- issue_fix: target critical or major evaluator findings with a focused repair pass
- full_regenerate: rebuild the implementation when local repair is not sufficient
The Generator prompt also includes an internal fix-verify protocol on issue-fix paths.
Evaluator issues can now include a confidence score, and issue_fix can ignore low-confidence findings when incrementalFix.issueConfidenceThreshold is configured.
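One way to picture the confidence filter is the sketch below. The issue shape, the severity names, and the decision to keep findings that carry no confidence score are assumptions for illustration; only the `focusOnCritical` and `issueConfidenceThreshold` option names come from the config documented in this README, and the 0-100 scale follows the example value of 70.

```typescript
// Illustrative shapes; not the real evaluator schema.
type Severity = "critical" | "major" | "minor";

interface EvaluatorIssue {
  severity: Severity;
  description: string;
  confidence?: number; // assumed 0-100 scale
}

interface IncrementalFixOptions {
  focusOnCritical: boolean;
  issueConfidenceThreshold?: number;
}

// Picks the findings an issue_fix pass would target.
function selectIssuesForFix(
  issues: EvaluatorIssue[],
  options: IncrementalFixOptions,
): EvaluatorIssue[] {
  const threshold = options.issueConfidenceThreshold;
  return issues.filter((issue) => {
    const severeEnough = options.focusOnCritical
      ? issue.severity === "critical"
      : issue.severity !== "minor";
    if (!severeEnough) return false;
    // Assumption: findings without a confidence score are never filtered out.
    if (threshold === undefined || issue.confidence === undefined) return true;
    return issue.confidence >= threshold;
  });
}
```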
4. Resume, retry, and replan
Run state is persisted under .harness/runs/<runId>/, so interrupted runs can continue without manual state editing.
Current recovery behavior includes:
- resume for interrupted or failed runs
- retry as a convenience wrapper for rewinding the latest incomplete iteration
- replan for restarting from planning with appended or replaced guidance
- artifact-aware resume that can skip the Planner if a valid plan.json already exists
- repair of stale half-iterations before continuing
- destructive resume repair now writes a per-run repair-backups/ snapshot before trimming artifacts or rewriting damaged persisted state
- targeted sprint resume on a completed run that already hit iterations.max now reopens one additional iteration window instead of silently no-oping
- fresh runs persist a baseline Git diff snapshot so generation/evaluation/finalize only count net-new file changes instead of inherited dirty-worktree residue
5. Browser-based render audit artifacts
Harness now includes a browser render-audit pipeline for supported frontend targets.
Current behavior:
- detects a supported preview target
- runs build + preview
- waits for the preview URL to become reachable
- captures desktop and mobile screenshots
- records console errors and page errors
- records failed browser requests and HTTP 4xx/5xx responses
- can compare the captured screenshots against configured desktop/mobile baseline references
- can execute configured hover/click interaction audits on desktop or mobile surfaces
- persists the evidence into the iteration artifacts
- passes that evidence into the Evaluator prompt
- can gate visually-oriented frontend tasks when render evidence is missing, visually mismatched, or failed on runtime/resource signals
- includes the matching evidence line when it flags a long-running preview/dev server during generation
- can also consume an external render-audit JSON artifact when another tool owns the browser execution path
Current boundary:
- target detection is currently Vite-oriented
- render-audit requirement is still heuristic rather than fully planner-contract-driven
- Generator should not keep long-running preview or dev servers alive; browser verification belongs to the render-audit phase
- runtime/governance tasks now require wiring evidence in evaluation; module existence alone is not enough to pass
6. Two-stage evaluator pipeline
Harness supports an optional two-stage evaluation pipeline that splits evaluation into:
- Preliminary stage: Fast syntactic and structural checks
- Deep stage: Full semantic quality review
Configuration:

    twoStageEvaluation:
      enabled: true
      preliminaryTimeoutMs: 60000
      deepTimeoutMs: 180000

Benefits:
- Early exit on critical syntax/structure issues before expensive semantic evaluation
- Separate artifact persistence: preliminary-evaluation.json and deep-evaluation.json
- Status timestamps in run state: preliminaryEvaluationComplete and deepEvaluationComplete
- Deep-stage review now consumes the same render-audit evidence and post-review gates as single-stage evaluation, so frontend fidelity failures still downgrade the final verdict
- Graceful fallback to single-stage evaluation when not enabled
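The control flow of the two-stage pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: the `StageResult` shape and the injected stage callbacks are hypothetical, the stages are synchronous, and the per-stage timeouts and artifact writes are left out.

```typescript
// Hypothetical result shape for one evaluation stage.
interface StageResult {
  passed: boolean;
  issues: string[];
}

interface EvaluationOutcome {
  stagesRun: string[];
  result: StageResult;
}

// The stage implementations are injected so the sketch stays self-contained.
interface Stages {
  preliminary: () => StageResult;
  deep: () => StageResult;
  single: () => StageResult;
}

function runEvaluation(twoStageEnabled: boolean, stages: Stages): EvaluationOutcome {
  // Fallback: two-stage evaluation is not enabled, so run the single-stage path.
  if (!twoStageEnabled) {
    return { stagesRun: ["single"], result: stages.single() };
  }
  // Early exit: a failed preliminary stage skips the expensive deep review.
  const preliminary = stages.preliminary();
  if (!preliminary.passed) {
    return { stagesRun: ["preliminary"], result: preliminary };
  }
  return { stagesRun: ["preliminary", "deep"], result: stages.deep() };
}
```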
7. Prompt assets and documentation contracts
Prompt instructions are no longer maintained only as inline strings. The repo now contains:
- source prompt assets under src/assets/prompts/
- synced prompt-fragment docs under docs/prompt-fragments/
- a sync script: npm run sync:prompt-fragments
- tests that guard the existence and synchronization of key prompt assets
8. Mission-style regression assets
missions/ is now a formal regression asset layer, not a scratch directory. It supports repeatable scenarios such as:
- happy_path
- issue_fix
- resume
- needs_context
- blocked
Mission execution is available through:

    npm run mission:run -- missions/01-happy-path.yaml

Installation
Prerequisites
- Node.js >= 20
- Git
- At least one supported model CLI:
  - @anthropic-ai/claude-code
  - @google/gemini-cli
  - openai-codex
  - opencode-ai
Install dependencies:

    npm install

Build the CLI:

    npm run build

Link it globally if you want harness on your shell PATH:

    npm link

If you want browser render audits for frontend tasks, install Playwright Chromium once:

    npm run playwright:install

Windows note
Claude Code on Windows typically needs Git Bash configured in .env:

    CLAUDE_CODE_GIT_BASH_PATH=D:\Git\bin\bash.exe

Harness will continue without it, but Claude Code shell execution may fail.
Quick Start
Run a task in development mode without building:

    npm run dev -- run "Create an Express API with authentication"
    npm run dev -- run -c examples/harness-axle.yaml --working-dir ./output/axle-smoke

Run the built CLI:

    harness run "Create an Express API with authentication"
    harness run -c examples/harness-axle.yaml --working-dir ./output/axle-smoke
    harness run --prompt-file ./task.md

Check status:

    harness status
    harness status --follow
    harness status --watchdog

Resume the latest run:

    harness resume
    harness resume <runId> --append-file ./resume-notes.md

Check whether the current Harness worktree is bootstrapped correctly:

    harness doctor

Harvest validated files from a self-opt worktree back into the main repo:

    harness harvest ./harness-self-opt-worktree
    harness harvest ./harness-self-opt-worktree --apply

Run into a dedicated working directory:

    harness run "Build a React landing page" -w ./output/landing-page

Enable parallel execution:

    harness run "Build a web app" -p --max-concurrency 4

CLI Commands

    harness run [prompt]
    harness run --prompt-file <path>
    harness status
    harness status --follow
    harness doctor
    harness harvest <sourceDir>
    harness resume [runId]
    harness resume [runId] --append-file <path>
    harness retry [runId]
    harness replan [runId] -a "Additional guidance"
    harness replan [runId] --replace-file <path>
    harness config

Important options:
- -c, --config <path>: config file path
- -w, --working-dir <dir>: working directory
- -v, --verbose: verbose logs
- -p, --parallel: enable parallel sprint execution
- --max-concurrency <number|auto>: parallel worker limit
- --iteration-mode <mode>: force iteration strategy (sprint_retry, issue_fix, full_regenerate)
- --focus-critical: restrict incremental fixes to critical issues
- --render-audit <off|auto|always>: override render audit mode for this invocation
- -T, --test-tier <smoke|gate|full>: select test tier for this invocation
- --watchdog: enable auto-resume watchdog mode on status
- -i, --interval <ms>: polling interval on status
- --max-recovery <number>: maximum recovery attempts on status
- -f, --file <paths...> on harvest: explicit whitelist of files to harvest
- --apply on harvest: copy files and run validations instead of dry-run listing
- --source-command <command> on harvest: additional or replacement worktree validation commands
- --targeted-command <command> on harvest: targeted validation command in the target repo; supports {files}
- --root-command <command> on harvest: root validation command in the target repo
- --from-iteration <number> on resume: rewind and continue from a specific iteration
- --prompt-file <path> on run: read the task prompt from a file
- --append <text> on resume: inject extra resume guidance without resetting an in-progress run
- --append-file <path> on resume / replan: read appended guidance from a file
- --append <text> / --replace [text] on replan: change planning guidance
- --replace [text] on resume: restart from planning with a replaced prompt
- --replace-file <path> on resume / replan: read replacement planning guidance from a file
- --sprint <names...> on resume: target specific sprints without replanning
Configuration
Harness loads config from:
- --config
- ./harness.yaml
- ./harness.yml
- ./harness.json
- ./.harness/config.yaml
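A minimal sketch of how such a lookup could work is shown below. It assumes the list is a precedence order in which an explicit `--config` path wins and the first existing default file is used otherwise; the README does not state that precedence, so treat it as an assumption. The function name and the injected `exists` predicate are hypothetical and keep the example free of file-system calls.

```typescript
// Default locations, in the order they are listed in this README.
const DEFAULT_CONFIG_CANDIDATES: string[] = [
  "./harness.yaml",
  "./harness.yml",
  "./harness.json",
  "./.harness/config.yaml",
];

// Assumption: an explicit path always wins, otherwise the first existing
// candidate is chosen. `exists` is injected (for example a wrapper around
// a file-system check) so the sketch has no I/O of its own.
function resolveConfigPath(
  explicitPath: string | undefined,
  exists: (path: string) => boolean,
): string | undefined {
  if (explicitPath !== undefined) return explicitPath;
  return DEFAULT_CONFIG_CANDIDATES.find((candidate) => exists(candidate));
}
```

Returning `undefined` when nothing is found leaves the caller free to fall back to built-in defaults or to report a missing config.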
Example:

    task: Build a production-ready landing page for Harness
    agents:
      planner:
        cli: gemini
        model: gemini-2.5-pro
        fallbackModels:
          - gemini-2.5-flash
        timeout: 300000
        maxOuterRetries: 2
        retryBaseDelayMs: 2000
        retryMaxDelayMs: 8000
      generator:
        cli: gemini
        model: gemini-2.5-flash
        fallbackModels:
          - gemini-2.5-flash-lite
          - gemini-2.5-pro
        timeout: 1800000
        maxOuterRetries: 3
        retryBaseDelayMs: 2000
        retryMaxDelayMs: 8000
      evaluator:
        cli: gemini
        model: gemini-2.5-pro
        fallbackModels:
          - gemini-2.5-flash
        timeout: 600000
        maxOuterRetries: 2
        retryBaseDelayMs: 2000
        retryMaxDelayMs: 8000
    iterations:
      max: 5
      scoreThreshold: 90
      minImprovement: 3
    criteria:
      - name: functionality
        weight: 30
        description: Core requirements are implemented and work end-to-end
      - name: code_quality
        weight: 25
        description: Code is maintainable, readable, and tested appropriately
      - name: architecture
        weight: 25
        description: Modules, boundaries, and structure are coherent
      - name: originality
        weight: 20
        description: The implementation is deliberate, not boilerplate filler
    recursion:
      maxDepth: 3
      preventNested: true
    parallel:
      enabled: true
      maxConcurrency: auto
      minSprintsToParallelize: 3
      allowConcurrentFileEdits: false
      baselineCommand: npm run typecheck
      postCreateChecks:
        checkUncommittedChanges: true
        checkBaselineFiles: true
        baselineFiles:
          requiredFiles:
            - package.json
            - tsconfig.json
        checkBranchDivergence: true
        baseBranch: main
      checkpoint:
        mode: checkpoint-repo
        checkpointsDir: .harness/checkpoints
    planning:
      confidenceThreshold: 70
    routing:
      enabled: true
      defaultModel: openrouter/gpt-4o-mini
      taskRouting:
        planning: openrouter/gpt-5
        generation: openrouter/claude-3.5-sonnet
        evaluation: openrouter/gpt-4o
      strategyRouting:
        issue_fix: openrouter/gpt-4o-mini
        sprint_retry: openrouter/gpt-4o
        full_regenerate: openrouter/gpt-5
    bridge:
      enabled: true
      type: mcp
      capabilities:
        - kind: memory
          content: Prefer the external design system guidance when touching shared UI.
        - kind: skill
          name: design-bridge
          description: Uses external bridge context for design-system decisions.
          usage: Use when repo-local docs are incomplete.
        - kind: command
          name: repo-index
          description: External command capability that can surface repository metadata.
        - kind: mcp
          name: design-docs
          description: External MCP capability exposing design references.
    # Optional Axle CRUD backend guidance. Use this for Go/SQLite CRUD services
    # where standard CRUD should stay descriptor-generated and framework-owned.
    memory:
      facts:
        - For Go/SQLite CRUD backends, prefer Axle unless the user asks otherwise.
        - Axle apps must be descriptor-first and verified with axle check plus scripts/verify.sh.
    skills:
      - name: axle-crud-backend
        description: Scaffold and adapt LLM-friendly Go/SQLite CRUD backends with Axle.
        usage: Run axle app init, replace sample descriptors, regenerate descriptor/catalog output, then run scripts/verify.sh and axle check --root .
    renderAudit:
      mode: auto
      # Optional external render-audit artifact override:
      # externalArtifactPath: .trial/runs/latest/render-audit.json
      # Optional Trial provider that runs compile/run/export automatically:
      # provider: trial
      # buildCommand: npm run build
      # previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
      # previewUrl: http://127.0.0.1:4173
      # trial:
      #   command: trial
      #   bundlePath: ../prompt-bundles/site-audit
      #   suiteDir: .trial/suite
      #   runDir: .trial/run
      #   artifactPath: .trial/render-audit.json
      #   contentGate: true
      #   interactionGate: true
      #   visualGate: false
      # Optional explicit preview target override:
      # buildCommand: npm run build
      # previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
      # previewUrl: http://127.0.0.1:4173
      # Optional screenshot baseline diff:
      # baseline:
      #   desktopScreenshotPath: ./render-baselines/home-desktop.png
      #   mobileScreenshotPath: ./render-baselines/home-mobile.png
      #   maxMismatchRatio: 0.15
    incrementalFix:
      enabled: true
      maxRetries: 3
      focusOnCritical: true
      maxConsecutiveIssueFixAttempts: 3
      issueConfidenceThreshold: 70
    budget:
      maxTurnsPerRun: 1500
      maxCostUsdPerRun: 15
      maxTurnsPerIteration: 400
      maxCostUsdPerIteration: 4

Notes:
- Gemini should be configured with an explicit model
- fallback models and bounded outer retries are strongly recommended for Gemini
- OpenCode models should use the provider-qualified form expected by opencode, such as openrouter/gpt-5
- parallel mode requires a Git repository and a clean workspace
- parallel.baselineCommand runs inside each created worktree before sprint execution starts
- parallel.postCreateChecks can fail fast on uncommitted changes, missing baseline files, or branch divergence inside created worktrees
- parallel.checkpoint.mode: checkpoint-repo enables an experimental shadow-checkpoint path that gives each group its own GIT_DIR under .harness/checkpoints/
- checkpoint mode currently forces maxConcurrency = 1 for safety because groups share the same GIT_WORK_TREE; native worktrees remain the default and higher-throughput path
- planning.confidenceThreshold blocks generation when the Planner reports a lower planConfidence; use a 0-100 scale, not 0-1
- incrementalFix.issueConfidenceThreshold lets issue_fix skip low-confidence evaluator findings
- budget.maxTurnsPerRun / budget.maxCostUsdPerRun guard the whole run, while budget.maxTurnsPerIteration / budget.maxCostUsdPerIteration can stop a single overly expensive iteration
- when agents.*.model or fallbackModels are explicitly configured, Harness preflights those provider/model combinations before planning
- for long prompts on Windows, prefer --prompt-file, --append-file, and --replace-file over very long inline CLI arguments
- routing.taskRouting can override the primary model per phase without changing each agent's baseline config
- routing.strategyRouting can override the Generator model for issue_fix, sprint_retry, and full_regenerate
- when routing changes the primary model, Harness preserves the original agent model and fallbackModels as the retry chain behind the routed primary
- bridge is a capability-context layer, not an arbitrary shell passthrough
- bridge.capabilities lets you describe external memory, skill, command, and mcp affordances in one schema
- Harness normalizes bridge capabilities into runtime memory / skills so Planner, Generator, and Evaluator all see the same external context on run and resume
- renderAudit.mode: off disables audit execution and evaluation gating
- renderAudit.mode: auto keeps the default behavior
- renderAudit.mode: always requires render audit evidence whenever a target is available
- renderAudit.externalArtifactPath lets Harness load precomputed render-audit evidence from another tool such as trial
- renderAudit.externalArtifactPath cannot be combined with buildCommand, previewCommand, previewUrl, baseline, or interactions
- renderAudit.provider: trial tells Harness to call the sibling trial CLI itself instead of running the built-in Playwright auditor
- renderAudit.trial.bundlePath is required when renderAudit.provider=trial
- renderAudit.provider=trial still reuses Harness target resolution for buildCommand, previewCommand, and previewUrl
- renderAudit.provider=trial cannot be combined with externalArtifactPath, baseline, or interactions
- Trial exports now treat prompt-content and interaction failures as hard failures by default, while visual mismatch evidence stays in the evaluator prompt unless visualGate: true
- Trial route summaries are attached to render-audit evidence so repair prompts can see missing text, failed interactions, and diff image paths per route
- custom render targets must provide buildCommand, previewCommand, and previewUrl together
- renderAudit.baseline.desktopScreenshotPath and renderAudit.baseline.mobileScreenshotPath must be configured together
- renderAudit.baseline.maxMismatchRatio defaults to 0.15 when omitted
Bridge Capability Context
Harness can load external capability descriptions through bridge and fold them into the same runtime knowledge path used by built-in memory and skills.
Example:

    bridge:
      enabled: true
      type: mcp
      capabilities:
        - kind: memory
          content: Prefer the external design system guidance when touching shared UI.
        - kind: skill
          name: design-bridge
          description: Uses external bridge context for design-system decisions.
          usage: Use when repo-local docs are incomplete.
        - kind: command
          name: repo-index
          description: External command capability that can surface repository metadata.
        - kind: mcp
          name: design-docs
          description: External MCP capability exposing design references.

Current semantics:
- memory capabilities are injected as runtime memory facts
- skill capabilities become runtime skills
- command and mcp capabilities are summarized into runtime memory so agents can reason about what external context exists
- the same normalized bridge context is refreshed on resume
- direct bridge command fallback uses spawn(..., { shell: false }); this feature is not intended to be a generic shell execution tunnel
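The normalization described above could look roughly like this. The capability fields follow the YAML example, but the `RuntimeContext` shape, the function name, and the wording of the summary sentence generated for `command` and `mcp` entries are assumptions made for the sketch.

```typescript
// Capability shapes follow the YAML example above.
type BridgeCapability =
  | { kind: "memory"; content: string }
  | { kind: "skill"; name: string; description: string; usage?: string }
  | { kind: "command" | "mcp"; name: string; description: string };

// Hypothetical runtime shapes.
interface RuntimeSkill {
  name: string;
  description: string;
  usage?: string;
}

interface RuntimeContext {
  memoryFacts: string[];
  skills: RuntimeSkill[];
}

function normalizeBridge(capabilities: BridgeCapability[]): RuntimeContext {
  const context: RuntimeContext = { memoryFacts: [], skills: [] };
  for (const capability of capabilities) {
    switch (capability.kind) {
      case "memory":
        context.memoryFacts.push(capability.content);
        break;
      case "skill":
        context.skills.push({
          name: capability.name,
          description: capability.description,
          usage: capability.usage,
        });
        break;
      case "command":
      case "mcp":
        // Summarized as a memory fact; the sentence format is illustrative.
        context.memoryFacts.push(
          `External ${capability.kind} capability ${capability.name}: ${capability.description}`,
        );
        break;
    }
  }
  return context;
}
```

Running the same function on both run and resume is one simple way to keep the three agents looking at identical external context.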
Shadow Checkpoint Mode
Harness now includes an optional shadow checkpoint execution mode for parallel runtime experiments.
Example:

    parallel:
      enabled: true
      maxConcurrency: auto
      minSprintsToParallelize: 3
      allowConcurrentFileEdits: false
      checkpoint:
        mode: checkpoint-repo
        checkpointsDir: .harness/checkpoints

Current semantics:
- each group gets an isolated checkpoint repository under .harness/checkpoints/<groupId>/git
- Generator execution receives GIT_DIR and GIT_WORK_TREE so edits happen against the shadow checkpoint instead of a new native worktree
- Harness captures a baseline checkpoint before the group starts
- failed groups roll back to that baseline before cleanup
- successful groups clean up their checkpoint repo after completion
- this mode is intentionally conservative and currently runs one group at a time
Watchdog Status Mode
harness status --watchdog now uses the same resume path as harness resume instead of a no-op callback.
Examples:

    harness status --watchdog
    harness status --watchdog -c ./harness.yaml --max-recovery 2

Current semantics:
- watchdog mode implies --follow
- when watchdog mode is enabled, Harness loads config so auto-resume uses the real agent/runtime settings; if the active run already persisted a config snapshot, watchdog reuses that snapshot instead of resolving the wrong shell-local ./harness.yaml
- --max-recovery is enforced by the watchdog runtime, not just parsed by the CLI
- recovery attempts update the persisted run state via recoveryReason and recoveryCount
- generator provider progress now refreshes persisted child-process activity telemetry, so active provider output is less likely to be misclassified as a stalled/zombie run
- status surfaces one unified observed health view: running, stalled, zombie, needs-recovery, or completed
- provider/tooling crashes now persist an explicit provider_failed health snapshot with a concrete recovery reason instead of collapsing into an undifferentiated failed run
- long-running provider child-process activity with no progress change now escalates to needs-recovery as a suspected hung-provider state instead of staying forever "healthy"
Doctor Checks
harness doctor is a repo/worktree health check for Harness itself. It currently verifies:
- the working directory exists
- the working directory is writable
- node_modules is present when package.json exists
- the built Harness CLI at dist/cli.js starts successfully with --help
- explicit provider/model/fallback configuration can pass auth/model preflight before you start a long run
This is the fastest way to catch self-optimization/bootstrap failures such as “build passed but the built CLI cannot actually start”.
Harvest Protocol
harness harvest packages the closing step of a self-optimization run as one command: copy a validated file whitelist from the worktree, then revalidate in the main repo.
Examples:

    harness harvest ./harness-self-opt-worktree
    harness harvest ./harness-self-opt-worktree --apply
    harness harvest ./harness-self-opt-worktree --apply --file src/core/harness/runHealth.ts --root-command "python3 ../probe/cli.py run --target-repo-path . --suite unit"

Current semantics:
- default mode is dry-run: it lists harvestable files and the validation commands that would run
- --apply first runs worktree validation, then copies the whitelist into the target repo, then forces targeted + root validation in the target repo
- low-trust status ledgers such as output/plan/TODO/90-STATUS.md are deferred until after root validation succeeds, so failed harvest validation does not prematurely update the main-repo status ledger
- when commands are not provided explicitly, Harness infers worktree validation from package.json scripts typecheck / build
- root validation is inferred from package.json scripts test, typecheck, and build
- targeted validation always includes git diff --check -- {files} in the target repo; custom --targeted-command values are appended after that
- .git/, .harness/, and node_modules/ are never harvested from implicit changed-file discovery
- deleted files are reported as skipped because the current protocol productizes file-copy harvest, not deletion replay
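To make the `{files}` placeholder concrete, here is a small sketch of how the targeted command list could be assembled. The function name is hypothetical, and joining paths with plain spaces is a simplifying assumption that only holds for paths without whitespace; the real command construction and quoting are not documented here.

```typescript
// The built-in targeted check named in this README.
const DEFAULT_TARGETED_COMMAND = "git diff --check -- {files}";

// Builds the targeted validation commands: the default check first, then any
// custom --targeted-command values, each with {files} expanded.
function buildTargetedCommands(files: string[], customCommands: string[] = []): string[] {
  // Assumption: paths contain no whitespace, so no shell quoting is applied.
  const fileList = files.join(" ");
  return [DEFAULT_TARGETED_COMMAND, ...customCommands].map((template) =>
    template.split("{files}").join(fileList),
  );
}
```

A custom value such as `npx eslint {files}` (an illustrative command, not one this README prescribes) would then run after the built-in whitespace check against the same harvested files.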
Windows Provider Notes
- file-backed prompt inputs are the safest way to pass long task text on Windows shells
- provider failures caused by shell mismatches such as bare && or Unix-only head are classified as provider/tooling errors instead of normal code failures
- provider-side 404 / not found failures are tracked separately from auth/quota/transport failures so bad model/provider ids can be surfaced as manual-recovery issues
- status, resume, and completion summaries surface interesting parallel group retry history so repeated provider/tooling failures are visible without reading raw state files
Smart Model Routing
Harness can route different phases to different models without duplicating the whole agents block.
Example:

    routing:
      enabled: true
      defaultModel: openrouter/gpt-4o-mini
      taskRouting:
        planning: openrouter/gpt-5
        generation: openrouter/claude-3.5-sonnet
        evaluation: openrouter/gpt-4o
      strategyRouting:
        issue_fix: openrouter/gpt-4o-mini
        sprint_retry: openrouter/gpt-4o
        full_regenerate: openrouter/gpt-5

Routing semantics:
- taskRouting applies to planning, generation, and evaluation
- strategyRouting applies on generation paths when Harness already knows it is doing issue_fix, sprint_retry, or full_regenerate
- strategyRouting takes precedence over taskRouting for generation retries
- if routing swaps the primary model, the original agent model becomes the first fallback, followed by the agent's configured fallbackModels
- OpenCode model ids should stay provider-qualified, such as openrouter/gpt-5
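The routing semantics above can be read as a small model-chain resolver. The sketch below is an interpretation, not the actual implementation: the function and type names are hypothetical, `defaultModel` is deliberately left out because its exact role is not described here, and de-duplicating the routed model from the fallback chain is an added assumption.

```typescript
type Phase = "planning" | "generation" | "evaluation";
type Strategy = "issue_fix" | "sprint_retry" | "full_regenerate";

interface AgentModels {
  model: string;
  fallbackModels?: string[];
}

interface RoutingConfig {
  enabled: boolean;
  taskRouting?: Partial<Record<Phase, string>>;
  strategyRouting?: Partial<Record<Strategy, string>>;
}

// Returns the ordered list of models to try: primary first, then fallbacks.
function resolveModelChain(
  agent: AgentModels,
  routing: RoutingConfig | undefined,
  phase: Phase,
  strategy?: Strategy,
): string[] {
  const baseChain = [agent.model, ...(agent.fallbackModels ?? [])];
  if (!routing || !routing.enabled) return baseChain;

  // strategyRouting only applies on generation paths and wins over taskRouting.
  const strategyModel =
    phase === "generation" && strategy ? routing.strategyRouting?.[strategy] : undefined;
  const routed = strategyModel ?? routing.taskRouting?.[phase];
  if (routed === undefined) return baseChain;

  // Routed model first; the original model and its fallbacks follow.
  return [routed, ...baseChain.filter((model) => model !== routed)];
}
```

With the Generator configured as in the example config (gemini-2.5-flash plus two fallbacks), an issue_fix pass would try openrouter/gpt-4o-mini first and then fall back through the original Gemini chain.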
Render Audit Configuration Guide
Recommended mode selection:
- off: use this for backend-only tasks, docs-only work, or environments where browser preview is intentionally unavailable
- auto: default mode; Harness runs render audit when a target is available, and only gates evaluation when the task looks visually sensitive
- always: use this for landing pages, marketing sites, UI redesigns, responsive work, animation polish, or any task where visual claims must be backed by browser evidence
CLI override examples:

    harness run "Build a landing page" --render-audit always
    harness resume <runId> --render-audit off

Custom preview target example:

    renderAudit:
      mode: always
      buildCommand: npm run build
      previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
      previewUrl: http://127.0.0.1:4173

Optional baseline screenshot diff:

    renderAudit:
      mode: always
      buildCommand: npm run build
      previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
      previewUrl: http://127.0.0.1:4173
      baseline:
        desktopScreenshotPath: ./render-baselines/home-desktop.png
        mobileScreenshotPath: ./render-baselines/home-mobile.png
        maxMismatchRatio: 0.12
        sections:
          - id: hero
            desktop:
              y: 0
              height: 900
              maxMismatchRatio: 0.08
            mobile:
              y: 0
              height: 720
              maxMismatchRatio: 0.1
      interactions:
        - id: products-menu
          action: hover
          selector: '[data-nav-products]'
          expectedVisibleText:
            - API
            - Hosted
        - id: mobile-menu
          surface: mobile
          action: click
          selector: '[data-mobile-menu-button]'
          expectedVisibleSelector: '[data-mobile-menu-panel]'

Important constraints:
- the three custom target fields are all-or-nothing
- baseline screenshot paths are also all-or-nothing: configure both desktop and mobile references together
- each baseline section must define at least one surface crop under desktop or mobile
- each interaction audit must define at least one visible expectation via expectedVisibleText or expectedVisibleSelector
- when an interaction uses only expectedVisibleText, Harness treats that text as post-action state evidence and requires it to become newly visible after the hover/click
- when both expectedVisibleSelector and expectedVisibleText are provided, Harness checks the text inside that visible selector instead of scanning the whole page body
- always does not invent a preview target; a supported auto-detected target or a complete custom target must still exist
- if you already have a non-Vite preview flow, prefer explicit target config instead of waiting for auto-detection to guess correctly
- a nominally successful browser run can now be downgraded to success_with_warnings, resource_failed, or visual_mismatch when runtime/resource/diff evidence is bad
- section audits reuse the full-page baseline images and crop by region, so they work well with protocol-derived section heights or module bounds
- interaction audits are intentionally small and deterministic: one hover/click action plus concrete visibility expectations
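The all-or-nothing rules in this list lend themselves to a small validation pass. The sketch below checks four of them; the interface mirrors the YAML examples above, but the function name, the error messages, and the choice to return a list of strings rather than throw are assumptions for illustration.

```typescript
interface SurfaceCrop {
  y: number;
  height: number;
  maxMismatchRatio?: number;
}

// Field names mirror the renderAudit YAML examples in this README.
interface RenderAuditConfig {
  mode?: "off" | "auto" | "always";
  buildCommand?: string;
  previewCommand?: string;
  previewUrl?: string;
  baseline?: {
    desktopScreenshotPath?: string;
    mobileScreenshotPath?: string;
    maxMismatchRatio?: number;
    sections?: { id: string; desktop?: SurfaceCrop; mobile?: SurfaceCrop }[];
  };
  interactions?: {
    id: string;
    surface?: "desktop" | "mobile";
    action: "hover" | "click";
    selector: string;
    expectedVisibleText?: string[];
    expectedVisibleSelector?: string;
  }[];
}

// Returns human-readable problems; an empty array means the checks passed.
function validateRenderAudit(config: RenderAuditConfig): string[] {
  const errors: string[] = [];

  const targetFields = [config.buildCommand, config.previewCommand, config.previewUrl];
  const provided = targetFields.filter((value) => value !== undefined).length;
  if (provided !== 0 && provided !== 3) {
    errors.push("buildCommand, previewCommand, and previewUrl must be set together");
  }

  const baseline = config.baseline;
  if (baseline) {
    const hasDesktop = baseline.desktopScreenshotPath !== undefined;
    const hasMobile = baseline.mobileScreenshotPath !== undefined;
    if (hasDesktop !== hasMobile) {
      errors.push("baseline desktop and mobile screenshot paths must be set together");
    }
    (baseline.sections ?? []).forEach((section) => {
      if (!section.desktop && !section.mobile) {
        errors.push(`baseline section ${section.id} needs a desktop or mobile crop`);
      }
    });
  }

  (config.interactions ?? []).forEach((interaction) => {
    const hasText = (interaction.expectedVisibleText ?? []).length > 0;
    const hasSelector = interaction.expectedVisibleSelector !== undefined;
    if (!hasText && !hasSelector) {
      errors.push(`interaction ${interaction.id} needs a visible expectation`);
    }
  });

  return errors;
}
```

Collecting every problem before reporting makes it easier to fix a config in one pass instead of one error at a time.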
Run Artifacts
Harness persists every run under:

    .harness/runs/<runId>/
    ├── state.json
    ├── plan.json
    ├── score-history.json
    ├── logs/
    │   ├── planner.log
    │   ├── generator.log
    │   └── evaluator.log
    └── iterations/
        ├── iteration-1/
        │   ├── generation.json
        │   ├── evaluation.json
        │   └── render-audit/
        │       ├── report.json
        │       └── console.json
        └── ...

These artifacts are what make resume, debugging, and self-optimization practical.
When baseline diff is enabled, report.json includes visualDiff metadata and optional sectionAudits. When interaction audits are configured, report.json also includes interactionAudits. The render-audit directory also contains desktop-diff.png / mobile-diff.png, per-section diff images such as hero-desktop-diff.png, and interaction screenshots such as products-menu-desktop-interaction.png. console.json now includes browser failedRequests and httpFailures in addition to console and page errors.
Quality and Evaluation Model
The Evaluator is intentionally skeptical. Recent changes in the codebase include:
- zero-trust evaluation guidance
- verification-first completion guidance
- scale-aware evaluation instructions for small / standard / large tasks
- render-audit evidence injection into the prompt
- gating logic that can downgrade or fail visually sensitive frontend work when browser evidence is missing
- low-trust status-document guidance and gating, so status ledgers cannot self-certify implementation without code/tests/structured evidence
This is one of the most important design decisions in the project: the Generator and Evaluator are intentionally separate so the system does not grade its own output too generously.
Self-Optimization
Harness can optimize Harness, but it should be treated as an experiment, not a casual local run.
Read:
Key rules:
- use an isolated Git worktree outside the main repo
- copy any uncommitted local snapshot into that worktree
- build first if you want to validate the real shipped CLI path
- explicitly pin Gemini models in config
- prefer status and resume over manual state editing
- review worktree diffs before merging anything back
Repository Guide
Useful project entry points:
- Harness.md: design-pattern background and rationale
- SELF_OPTIMIZATION.md: operational playbook for self-optimization
- missions/README.md: regression mission assets
- docs/prompt-fragments/README.md: prompt asset pipeline
- examples/tasks/README.md: task examples and evaluation entry points
- examples/harness-axle.yaml: Axle Go/SQLite CRUD backend workflow example
Relevant source areas:
- src/core/harness/: runtime orchestration
- src/core/state/: persisted state and artifact storage
- src/core/renderAudit/: browser render-audit execution
- src/agents/planner/: planning pipeline
- src/agents/generator/: generation pipeline
- src/agents/evaluator/: evaluation pipeline and gating logic
Development

    python3 ../probe/cli.py run --target-repo-path . --suite unit
    npm run typecheck
    npm run build
    npm run lint:naming
    npm run lint:arch
    npm run sync:prompt-fragments
    npm run mission:run -- missions/01-happy-path.yaml

Formal unit tests for harness are managed by the private probe module. The public repository keeps only the source tree and test-facing fixtures/docs; the tracked tests/ suite lives in repos/probe/assets/harness/unit/ inside the Chariot workspace.
The repository uses strong guardrails:
- TDD is expected for feature work
- file naming is enforced
- architecture boundaries are tested
- documentation contracts exist for key operational docs
- prompt fragment synchronization is tested
Known Gaps
The project is materially more capable than the old README suggested, but there are still obvious next steps:
- render-audit target detection is currently narrow
- some evaluator-scale guidance still needs to avoid task-specific hard-coding
- browser render audit is useful evidence, but not yet a full visual design judge
Those gaps are real, but they are now layered on top of a substantially stronger orchestration core.
License
MIT
