@cosmo-wise/harness

v1.0.2

Published

a month ago

Harness CLI for long-running AI agent tasks

cosmo-wise

ai agent cli harness claude codex gemini

Harness

Harness is a CLI orchestrator for long-running AI coding tasks. It implements the Harness pattern with three specialized agents:

Planner: turns a user request into a structured plan with sprints and acceptance criteria
Generator: executes the plan and writes code
Evaluator: scores the result, surfaces issues, and decides whether another iteration is needed

The project is no longer just a thin proof of concept. The current codebase includes resumable runs, parallel sprint execution with Git worktrees, incremental repair strategies, prompt-fragment assets, mission-style regression tasks, and browser-based render-audit artifacts for supported frontend targets.

Current Status

The repository has moved well beyond the original single-loop implementation. The core system now includes:

A phase-based runtime under src/core/harness/ instead of a single monolith
Unified run, resume, retry, and replan flows backed by persisted run state
Parallel sprint execution with isolated Git worktrees
Incremental iteration strategies: sprint_retry, issue_fix, and full_regenerate
Generator semantic statuses: DONE, DONE_WITH_CONCERNS, NEEDS_CONTEXT, and BLOCKED
Stagnation stop-loss for repeated issue_fix attempts
Artifact-aware resume that can reuse a persisted plan.json
Safer queued state writes and per-iteration artifact persistence
Structured verification evidence persisted into generation artifacts and evaluator prompts
Runtime-governance evaluation now has an explicit wiring gate: runtime tasks should not pass without evidence for import, instantiation, invocation, and integration coverage
Status ledgers such as output/plan/TODO/90-STATUS.md are now explicitly low-trust evidence; they cannot pass evaluation alone and are deferred until root validation during harvest
Explicitly configured provider models and fallback chains now run a startup preflight, so auth and not-found failures can stop the run before planning
Audited audit/ bundles are treated as low-trust reference inputs; rewriting them now fails evaluation instead of self-certifying completion
Generating audit/ artifacts without a real audited input bundle, or claiming success without files or structured verification evidence, now fails evaluation as anti-placeholder / anti-self-certification protection
Low-score or evaluator-failed terminal runs now persist as completed_unacceptable instead of looking like a clean success
Prompt fragment assets plus synchronization tooling
Mission assets for repeatable regression scenarios
Browser render-audit artifacts wired into the generation → evaluation path for supported frontend targets

This also means the old README was materially outdated. The sections below describe the codebase as it exists now.

Core Workflow

Harness runs an iterative loop:

Planning
- The Planner expands the user prompt into an overview, sprints, risks, tech spec, and optional scaffold recommendation.
Generation
- The Generator implements the plan, either sequentially or across parallel sprint groups in isolated worktrees.
Render Audit when applicable
- For supported frontend targets, Harness can build the app, start a preview server, capture browser evidence, and persist screenshots plus browser errors.
Evaluation
- The Evaluator scores the result against weighted criteria and can gate visually sensitive frontend work when browser evidence is missing.
Iteration Control
- Harness decides whether to stop, retry a sprint, apply a focused issue fix, or fully regenerate.
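
The weighted scoring in the Evaluation step can be pictured with a small sketch. This is an illustration only: the Criterion and CriterionScore shapes, the 0-100 score scale, and the normalization by total weight are assumptions, not the Evaluator's actual implementation.

```typescript
// Hypothetical shapes: the real Evaluator output format is not shown in this README.
interface Criterion {
  name: string;
  weight: number; // relative weight, e.g. 30
}

interface CriterionScore {
  name: string;
  score: number; // assumed to be on a 0-100 scale
}

// Weighted average of per-criterion scores, normalized by the total weight.
function weightedScore(criteria: Criterion[], scores: CriterionScore[]): number {
  let totalWeight = 0;
  let weightedSum = 0;
  for (const criterion of criteria) {
    let score = 0; // assumption: a criterion with no reported score counts as 0
    for (const entry of scores) {
      if (entry.name === criterion.name) {
        score = entry.score;
      }
    }
    totalWeight += criterion.weight;
    weightedSum += criterion.weight * score;
  }
  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
}

const exampleCriteria: Criterion[] = [
  { name: "functionality", weight: 30 },
  { name: "code_quality", weight: 25 },
  { name: "architecture", weight: 25 },
  { name: "originality", weight: 20 },
];

const exampleTotal = weightedScore(exampleCriteria, [
  { name: "functionality", score: 90 },
  { name: "code_quality", score: 80 },
  { name: "architecture", score: 80 },
  { name: "originality", score: 70 },
]);
console.log(exampleTotal); // 81
```

With the four example criteria from the Configuration section, the result is compared against iterations.scoreThreshold.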

The loop stops when one of these conditions is met:

score threshold reached
max iterations reached
no significant improvement
repeated issue_fix attempts fail to materially improve the score
the Generator returns a semantic blocked state that prevents forward progress
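
A minimal sketch of how these stop conditions could be combined into one check. The LoopState and LoopLimits shapes, the reason names, and the order in which the conditions are tested are assumptions made for illustration; the limits mirror the iterations.* and incrementalFix.maxConsecutiveIssueFixAttempts config keys.

```typescript
// Hypothetical shapes; the real runtime state under src/core/harness/ may differ.
interface LoopLimits {
  maxIterations: number; // iterations.max
  scoreThreshold: number; // iterations.scoreThreshold
  minImprovement: number; // iterations.minImprovement
  maxConsecutiveIssueFixAttempts: number; // incrementalFix.maxConsecutiveIssueFixAttempts
}

interface LoopState {
  scores: number[]; // one evaluator score per completed iteration
  stalledIssueFixAttempts: number; // consecutive issue_fix passes without material gain
  generatorStatus: "DONE" | "DONE_WITH_CONCERNS" | "NEEDS_CONTEXT" | "BLOCKED";
}

type StopReason =
  | "generator_blocked"
  | "threshold_reached"
  | "max_iterations"
  | "issue_fix_stagnation"
  | "no_improvement";

// Returns the first matching stop reason, or null when the loop should continue.
// Assumption: the check order below is illustrative, not the documented precedence.
function stopReason(state: LoopState, limits: LoopLimits): StopReason | null {
  if (state.generatorStatus === "BLOCKED") return "generator_blocked";
  const count = state.scores.length;
  if (count === 0) return null;
  const latest = state.scores[count - 1];
  if (latest >= limits.scoreThreshold) return "threshold_reached";
  if (count >= limits.maxIterations) return "max_iterations";
  if (state.stalledIssueFixAttempts >= limits.maxConsecutiveIssueFixAttempts) {
    return "issue_fix_stagnation";
  }
  if (count >= 2 && latest - state.scores[count - 2] < limits.minImprovement) {
    return "no_improvement";
  }
  return null;
}
```

For example, with scoreThreshold 90 and minImprovement 3, scores of 70 then 71 would stop the loop for lack of improvement, while 60 then 70 would continue.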

Major Capabilities

1. Three-agent orchestration

Harness supports claude, gemini, codex, and opencode as agent backends. Each agent has its own CLI, model, timeout, and retry policy.

2. Parallel sprint execution

When a plan contains enough independent work, Harness can:

topologically group non-conflicting sprints
create isolated Git worktrees
run multiple Generator executions in parallel
merge successful work back into the main working tree
treat retried groups as latest-attempt-only state, so stale blocked/error residue does not leak into the final completion path

This is the main speed lever for larger tasks.
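
The grouping step can be sketched as a wave-by-wave topological sort that also keeps sprints touching the same files apart. The Sprint shape (dependsOn, files) and the greedy conflict rule are hypothetical; the README does not describe the planner's actual sprint schema or conflict detection.

```typescript
// Hypothetical sprint shape, for illustration only.
interface Sprint {
  name: string;
  dependsOn: string[]; // names of sprints that must finish first
  files: string[]; // files the sprint is expected to edit
}

// Build execution groups: every sprint in a group has all dependencies satisfied
// by earlier groups and shares no file with another sprint in the same group.
function groupSprints(sprints: Sprint[]): string[][] {
  const done: { [name: string]: boolean } = {};
  const groups: string[][] = [];
  let remaining = sprints.slice();

  while (remaining.length > 0) {
    const group: Sprint[] = [];
    const deferred: Sprint[] = [];
    const claimedFiles: { [path: string]: boolean } = {};

    for (const sprint of remaining) {
      const ready = sprint.dependsOn.every((dep) => done[dep] === true);
      const conflicts = sprint.files.some((file) => claimedFiles[file] === true);
      if (ready && !conflicts) {
        group.push(sprint);
        for (const file of sprint.files) claimedFiles[file] = true;
      } else {
        deferred.push(sprint);
      }
    }

    if (group.length === 0) {
      // Nothing became ready: the dependencies are cyclic or refer to unknown sprints.
      throw new Error("Cyclic or unsatisfiable sprint dependencies");
    }
    for (const sprint of group) done[sprint.name] = true;
    groups.push(group.map((sprint) => sprint.name));
    remaining = deferred;
  }
  return groups;
}
```

Each returned group would then map to one batch of worktrees, with the group size further capped by the configured concurrency limit.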

3. Incremental repair instead of blind full rewrites

Harness does not always regenerate everything after a failed evaluation. It can choose:

sprint_retry: retry only specific failed or incomplete sprints
issue_fix: target critical or major evaluator findings with a focused repair pass
full_regenerate: rebuild the implementation when local repair is not sufficient

The Generator prompt also includes an internal fix-verify protocol on issue-fix paths. Evaluator issues can now include a confidence score, and issue_fix can ignore low-confidence findings when incrementalFix.issueConfidenceThreshold is configured.
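
As a rough sketch of the confidence filter described above: the EvaluatorIssue shape, the 0-100 confidence scale, and the rule that findings without a confidence score are always kept are assumptions, not the documented evaluator schema.

```typescript
// Hypothetical issue shape; the real evaluator schema is not shown in this README.
interface EvaluatorIssue {
  severity: "critical" | "major" | "minor";
  description: string;
  confidence?: number; // assumed 0-100
}

interface IssueFixOptions {
  issueConfidenceThreshold?: number; // incrementalFix.issueConfidenceThreshold
  focusOnCritical?: boolean; // incrementalFix.focusOnCritical or --focus-critical
}

// Pick the findings a focused issue_fix pass would target.
function selectIssuesForFix(
  issues: EvaluatorIssue[],
  options: IssueFixOptions
): EvaluatorIssue[] {
  return issues.filter((issue) => {
    // issue_fix targets critical or major findings only.
    if (issue.severity === "minor") return false;
    if (options.focusOnCritical === true && issue.severity !== "critical") return false;
    // Assumption: findings without a confidence score are always kept.
    if (options.issueConfidenceThreshold !== undefined && issue.confidence !== undefined) {
      return issue.confidence >= options.issueConfidenceThreshold;
    }
    return true;
  });
}
```

When issueConfidenceThreshold is left unset, the sketch falls back to severity-only selection, matching the "when configured" wording above.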

4. Resume, retry, and replan

Run state is persisted under .harness/runs/<runId>/, so interrupted runs can continue without manual state editing.

Current recovery behavior includes:

resume for interrupted or failed runs
retry as a convenience wrapper for rewinding the latest incomplete iteration
replan for restarting from planning with appended or replaced guidance
artifact-aware resume that can skip the Planner if a valid plan.json already exists
repair of stale half-iterations before continuing
destructive resume repair now writes a per-run repair-backups/ snapshot before trimming artifacts or rewriting damaged persisted state
targeted sprint resume on a completed run that already hit iterations.max now reopens one additional iteration window instead of silently no-oping
fresh runs persist a baseline Git diff snapshot so generation/evaluation/finalize only count net-new file changes instead of inherited dirty-worktree residue
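
The baseline-snapshot idea in the last item can be illustrated with a small comparison. This sketch assumes a snapshot is a map from file path to a fingerprint of that file's diff; the actual snapshot format Harness persists is not described here.

```typescript
// Hypothetical snapshot format: file path -> fingerprint of that file's diff.
type DiffSnapshot = { [path: string]: string };

// Files count as net-new changes only when they were clean at baseline
// or when their diff has changed since the baseline was recorded.
function netNewChanges(baseline: DiffSnapshot, current: DiffSnapshot): string[] {
  return Object.keys(current)
    .filter((path) => baseline[path] !== current[path])
    .sort();
}

const baselineSnapshot: DiffSnapshot = {
  "README.md": "h1", // already dirty before the run started
  "src/old.ts": "h2",
};
const currentSnapshot: DiffSnapshot = {
  "README.md": "h1", // untouched by the run, so ignored
  "src/old.ts": "h9", // modified further by the run
  "src/new.ts": "h3", // created by the run
};
console.log(netNewChanges(baselineSnapshot, currentSnapshot));
// [ 'src/new.ts', 'src/old.ts' ]
```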

5. Browser-based render audit artifacts

Harness now includes a browser render-audit pipeline for supported frontend targets.

Current behavior:

detects a supported preview target
runs build + preview
waits for the preview URL to become reachable
captures desktop and mobile screenshots
records console errors and page errors
records failed browser requests and HTTP 4xx/5xx responses
can compare the captured screenshots against configured desktop/mobile baseline references
can execute configured hover/click interaction audits on desktop or mobile surfaces
persists the evidence into the iteration artifacts
passes that evidence into the Evaluator prompt
can gate visually-oriented frontend tasks when render evidence is missing, visually mismatched, or failed on runtime/resource signals
includes the matching evidence line when it flags a long-running preview/dev server during generation
can also consume an external render-audit JSON artifact when another tool owns the browser execution path

Current boundary:

target detection is currently Vite-oriented
render-audit requirement is still heuristic rather than fully planner-contract-driven
Generator should not keep long-running preview or dev servers alive; browser verification belongs to the render-audit phase
runtime/governance tasks now require wiring evidence in evaluation; module existence alone is not enough to pass

6. Two-stage evaluator pipeline

Harness supports an optional two-stage evaluation pipeline that splits evaluation into:

Preliminary stage: Fast syntactic and structural checks
Deep stage: Full semantic quality review

Configuration:

twoStageEvaluation:
  enabled: true
  preliminaryTimeoutMs: 60000
  deepTimeoutMs: 180000

Benefits:

Early exit on critical syntax/structure issues before expensive semantic evaluation
Separate artifact persistence: preliminary-evaluation.json and deep-evaluation.json
Status timestamps in run state: preliminaryEvaluationComplete and deepEvaluationComplete
Deep-stage review now consumes the same render-audit evidence and post-review gates as single-stage evaluation, so frontend fidelity failures still downgrade the final verdict
Graceful fallback to single-stage evaluation when not enabled

7. Prompt assets and documentation contracts

Prompt instructions are no longer maintained only as inline strings. The repo now contains:

source prompt assets under src/assets/prompts/
synced prompt-fragment docs under docs/prompt-fragments/
a sync script: npm run sync:prompt-fragments
tests that guard the existence and synchronization of key prompt assets

8. Mission-style regression assets

missions/ is now a formal regression asset layer, not a scratch directory. It supports repeatable scenarios such as:

happy_path
issue_fix
resume
needs_context
blocked

Mission execution is available through:

npm run mission:run -- missions/01-happy-path.yaml

Installation

Prerequisites

Node.js >= 20
Git
At least one supported model CLI:
- @anthropic-ai/claude-code
- @google/gemini-cli
- openai-codex
- opencode-ai

Install dependencies:

npm install

Build the CLI:

npm run build

Link it globally if you want harness on your shell PATH:

npm link

If you want browser render audits for frontend tasks, install Playwright Chromium once:

npm run playwright:install

Windows note

Claude Code on Windows typically needs Git Bash configured in .env:

CLAUDE_CODE_GIT_BASH_PATH=D:\Git\bin\bash.exe

Harness will continue without it, but Claude Code shell execution may fail.

Quick Start

Run a task in development mode without building:

npm run dev -- run "Create an Express API with authentication"
npm run dev -- run -c examples/harness-axle.yaml --working-dir ./output/axle-smoke

Run the built CLI:

harness run "Create an Express API with authentication"
harness run -c examples/harness-axle.yaml --working-dir ./output/axle-smoke
harness run --prompt-file ./task.md

Check status:

harness status
harness status --follow
harness status --watchdog

Resume the latest run:

harness resume
harness resume <runId> --append-file ./resume-notes.md

Check whether the current Harness worktree is bootstrapped correctly:

harness doctor

Harvest validated files from a self-opt worktree back into the main repo:

harness harvest ./harness-self-opt-worktree
harness harvest ./harness-self-opt-worktree --apply

Run into a dedicated working directory:

harness run "Build a React landing page" -w ./output/landing-page

Enable parallel execution:

harness run "Build a web app" -p --max-concurrency 4

CLI Commands

harness run [prompt]
harness run --prompt-file <path>
harness status
harness status --follow
harness doctor
harness harvest <sourceDir>
harness resume [runId]
harness resume [runId] --append-file <path>
harness retry [runId]
harness replan [runId] -a "Additional guidance"
harness replan [runId] --replace-file <path>
harness config

Important options:

-c, --config <path>: config file path
-w, --working-dir <dir>: working directory
-v, --verbose: verbose logs
-p, --parallel: enable parallel sprint execution
--max-concurrency <number|auto>: parallel worker limit
--iteration-mode <mode>: force iteration strategy (sprint_retry, issue_fix, full_regenerate)
--focus-critical: restrict incremental fixes to critical issues
--render-audit <off|auto|always>: override render audit mode for this invocation
-T, --test-tier <smoke|gate|full>: select test tier for this invocation
--watchdog: enable auto-resume watchdog mode on status
-i, --interval <ms>: polling interval on status
--max-recovery <number>: maximum recovery attempts on status
-f, --file <paths...> on harvest: explicit whitelist of files to harvest
--apply on harvest: copy files and run validations instead of dry-run listing
--source-command <command> on harvest: additional or replacement worktree validation commands
--targeted-command <command> on harvest: targeted validation command in the target repo; supports {files}
--root-command <command> on harvest: root validation command in the target repo
--from-iteration <number> on resume: rewind and continue from a specific iteration
--prompt-file <path> on run: read the task prompt from a file
--append <text> on resume: inject extra resume guidance without resetting an in-progress run
--append-file <path> on resume / replan: read appended guidance from a file
--append <text> / --replace [text] on replan: change planning guidance
--replace [text] on resume: restart from planning with a replaced prompt
--replace-file <path> on resume / replan: read replacement planning guidance from a file
--sprint <names...> on resume: target specific sprints without replanning

Configuration

Harness loads config from:

--config
./harness.yaml
./harness.yml
./harness.json
./.harness/config.yaml
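
A sketch of how this lookup could work, assuming the list is a precedence order in which an explicit --config path wins and the first existing default file is used otherwise. The README does not state the precedence explicitly, so treat the ordering as an assumption; the exists callback is a stand-in for a real file-system check.

```typescript
// Default candidates in the order listed above.
const DEFAULT_CONFIG_CANDIDATES: string[] = [
  "./harness.yaml",
  "./harness.yml",
  "./harness.json",
  "./.harness/config.yaml",
];

// Assumption: an explicit --config path always wins; otherwise the first
// existing candidate is used. `exists` is injected so the sketch stays
// independent of any particular file-system API.
function resolveConfigPath(
  explicitPath: string | undefined,
  exists: (path: string) => boolean
): string | null {
  if (explicitPath !== undefined) return explicitPath;
  for (const candidate of DEFAULT_CONFIG_CANDIDATES) {
    if (exists(candidate)) return candidate;
  }
  return null;
}
```

In a Node.js context, the exists argument could be backed by fs.existsSync.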

Example:

task: Build a production-ready landing page for Harness

agents:
  planner:
    cli: gemini
    model: gemini-2.5-pro
    fallbackModels:
      - gemini-2.5-flash
    timeout: 300000
    maxOuterRetries: 2
    retryBaseDelayMs: 2000
    retryMaxDelayMs: 8000

  generator:
    cli: gemini
    model: gemini-2.5-flash
    fallbackModels:
      - gemini-2.5-flash-lite
      - gemini-2.5-pro
    timeout: 1800000
    maxOuterRetries: 3
    retryBaseDelayMs: 2000
    retryMaxDelayMs: 8000

  evaluator:
    cli: gemini
    model: gemini-2.5-pro
    fallbackModels:
      - gemini-2.5-flash
    timeout: 600000
    maxOuterRetries: 2
    retryBaseDelayMs: 2000
    retryMaxDelayMs: 8000

iterations:
  max: 5
  scoreThreshold: 90
  minImprovement: 3

criteria:
  - name: functionality
    weight: 30
    description: Core requirements are implemented and work end-to-end
  - name: code_quality
    weight: 25
    description: Code is maintainable, readable, and tested appropriately
  - name: architecture
    weight: 25
    description: Modules, boundaries, and structure are coherent
  - name: originality
    weight: 20
    description: The implementation is deliberate, not boilerplate filler

recursion:
  maxDepth: 3
  preventNested: true

parallel:
  enabled: true
  maxConcurrency: auto
  minSprintsToParallelize: 3
  allowConcurrentFileEdits: false
  baselineCommand: npm run typecheck
  postCreateChecks:
    checkUncommittedChanges: true
    checkBaselineFiles: true
    baselineFiles:
      requiredFiles:
        - package.json
        - tsconfig.json
    checkBranchDivergence: true
    baseBranch: main
  checkpoint:
    mode: checkpoint-repo
    checkpointsDir: .harness/checkpoints

planning:
  confidenceThreshold: 70

routing:
  enabled: true
  defaultModel: openrouter/gpt-4o-mini
  taskRouting:
    planning: openrouter/gpt-5
    generation: openrouter/claude-3.5-sonnet
    evaluation: openrouter/gpt-4o
  strategyRouting:
    issue_fix: openrouter/gpt-4o-mini
    sprint_retry: openrouter/gpt-4o
    full_regenerate: openrouter/gpt-5

bridge:
  enabled: true
  type: mcp
  capabilities:
    - kind: memory
      content: Prefer the external design system guidance when touching shared UI.
    - kind: skill
      name: design-bridge
      description: Uses external bridge context for design-system decisions.
      usage: Use when repo-local docs are incomplete.
    - kind: command
      name: repo-index
      description: External command capability that can surface repository metadata.
    - kind: mcp
      name: design-docs
      description: External MCP capability exposing design references.

# Optional Axle CRUD backend guidance. Use this for Go/SQLite CRUD services
# where standard CRUD should stay descriptor-generated and framework-owned.
memory:
  facts:
    - For Go/SQLite CRUD backends, prefer Axle unless the user asks otherwise.
    - Axle apps must be descriptor-first and verified with axle check plus scripts/verify.sh.
  skills:
    - name: axle-crud-backend
      description: Scaffold and adapt LLM-friendly Go/SQLite CRUD backends with Axle.
      usage: Run axle app init, replace sample descriptors, regenerate descriptor/catalog output, then run scripts/verify.sh and axle check --root .

renderAudit:
  mode: auto
  # Optional external render-audit artifact override:
  # externalArtifactPath: .trial/runs/latest/render-audit.json
  # Optional Trial provider that runs compile/run/export automatically:
  # provider: trial
  # buildCommand: npm run build
  # previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
  # previewUrl: http://127.0.0.1:4173
  # trial:
  #   command: trial
  #   bundlePath: ../prompt-bundles/site-audit
  #   suiteDir: .trial/suite
  #   runDir: .trial/run
  #   artifactPath: .trial/render-audit.json
  #   contentGate: true
  #   interactionGate: true
  #   visualGate: false
  # Optional explicit preview target override:
  # buildCommand: npm run build
  # previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
  # previewUrl: http://127.0.0.1:4173
  # Optional screenshot baseline diff:
  # baseline:
  #   desktopScreenshotPath: ./render-baselines/home-desktop.png
  #   mobileScreenshotPath: ./render-baselines/home-mobile.png
  #   maxMismatchRatio: 0.15

incrementalFix:
  enabled: true
  maxRetries: 3
  focusOnCritical: true
  maxConsecutiveIssueFixAttempts: 3
  issueConfidenceThreshold: 70

budget:
  maxTurnsPerRun: 1500
  maxCostUsdPerRun: 15
  maxTurnsPerIteration: 400
  maxCostUsdPerIteration: 4

Notes:

Gemini should be configured with an explicit model
fallback models and bounded outer retries are strongly recommended for Gemini
OpenCode models should use the provider-qualified form expected by opencode, such as openrouter/gpt-5
parallel mode requires a Git repository and a clean workspace
parallel.baselineCommand runs inside each created worktree before sprint execution starts
parallel.postCreateChecks can fail fast on uncommitted changes, missing baseline files, or branch divergence inside created worktrees
parallel.checkpoint.mode: checkpoint-repo enables an experimental shadow-checkpoint path that gives each group its own GIT_DIR under .harness/checkpoints/
checkpoint mode currently forces maxConcurrency = 1 for safety because groups share the same GIT_WORK_TREE; native worktrees remain the default and higher-throughput path
planning.confidenceThreshold blocks generation when the Planner reports a lower planConfidence; use a 0-100 scale, not 0-1
incrementalFix.issueConfidenceThreshold lets issue_fix skip low-confidence evaluator findings
budget.maxTurnsPerRun / budget.maxCostUsdPerRun guard the whole run, while budget.maxTurnsPerIteration / budget.maxCostUsdPerIteration can stop a single overly expensive iteration
when agents.*.model or fallbackModels are explicitly configured, Harness preflights those provider/model combinations before planning
for long prompts on Windows, prefer --prompt-file, --append-file, and --replace-file over very long inline CLI arguments
routing.taskRouting can override the primary model per phase without changing each agent's baseline config
routing.strategyRouting can override the Generator model for issue_fix, sprint_retry, and full_regenerate
when routing changes the primary model, Harness preserves the original agent model and fallbackModels as the retry chain behind the routed primary
bridge is a capability-context layer, not an arbitrary shell passthrough
bridge.capabilities lets you describe external memory, skill, command, and mcp affordances in one schema
Harness normalizes bridge capabilities into runtime memory / skills so Planner, Generator, and Evaluator all see the same external context on run and resume
renderAudit.mode: off disables audit execution and evaluation gating
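renderAudit.mode: auto is the default: Harness runs render audit when a target is available and gates evaluation only when the task looks visually sensitive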
renderAudit.mode: always requires render audit evidence whenever a target is available
renderAudit.externalArtifactPath lets Harness load precomputed render-audit evidence from another tool such as trial
renderAudit.externalArtifactPath cannot be combined with buildCommand, previewCommand, previewUrl, baseline, or interactions
renderAudit.provider: trial tells Harness to call the sibling trial CLI itself instead of running the built-in Playwright auditor
renderAudit.trial.bundlePath is required when renderAudit.provider=trial
renderAudit.provider=trial still reuses Harness target resolution for buildCommand, previewCommand, and previewUrl
renderAudit.provider=trial cannot be combined with externalArtifactPath, baseline, or interactions
Trial exports now treat prompt-content and interaction failures as hard failures by default, while visual mismatch evidence stays in the evaluator prompt unless visualGate: true
Trial route summaries are attached to render-audit evidence so repair prompts can see missing text, failed interactions, and diff image paths per route
custom render targets must provide buildCommand, previewCommand, and previewUrl together
renderAudit.baseline.desktopScreenshotPath and renderAudit.baseline.mobileScreenshotPath must be configured together
renderAudit.baseline.maxMismatchRatio defaults to 0.15 when omitted
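
The run-level and iteration-level budget limits noted above can be sketched as one guard. The Usage shape and the strict "greater than" comparison are assumptions; only the four budget.* key names come from the configuration example.

```typescript
interface BudgetConfig {
  maxTurnsPerRun?: number;
  maxCostUsdPerRun?: number;
  maxTurnsPerIteration?: number;
  maxCostUsdPerIteration?: number;
}

// Hypothetical usage counters, for illustration only.
interface Usage {
  turns: number;
  costUsd: number;
}

// Assumption: a limit is violated only when usage is strictly greater than it.
function exceeded(limit: number | undefined, used: number): boolean {
  return limit !== undefined && used > limit;
}

// Returns the names of all budget keys that the current usage violates.
function budgetViolations(budget: BudgetConfig, run: Usage, iteration: Usage): string[] {
  const violations: string[] = [];
  if (exceeded(budget.maxTurnsPerRun, run.turns)) violations.push("maxTurnsPerRun");
  if (exceeded(budget.maxCostUsdPerRun, run.costUsd)) violations.push("maxCostUsdPerRun");
  if (exceeded(budget.maxTurnsPerIteration, iteration.turns)) {
    violations.push("maxTurnsPerIteration");
  }
  if (exceeded(budget.maxCostUsdPerIteration, iteration.costUsd)) {
    violations.push("maxCostUsdPerIteration");
  }
  return violations;
}
```

With the example budget, an iteration that has used 420 turns would trip maxTurnsPerIteration even though the run as a whole is still within maxTurnsPerRun.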

Bridge Capability Context

Harness can load external capability descriptions through bridge and fold them into the same runtime knowledge path used by built-in memory and skills.

Example:

bridge:
  enabled: true
  type: mcp
  capabilities:
    - kind: memory
      content: Prefer the external design system guidance when touching shared UI.
    - kind: skill
      name: design-bridge
      description: Uses external bridge context for design-system decisions.
      usage: Use when repo-local docs are incomplete.
    - kind: command
      name: repo-index
      description: External command capability that can surface repository metadata.
    - kind: mcp
      name: design-docs
      description: External MCP capability exposing design references.

Current semantics:

memory capabilities are injected as runtime memory facts
skill capabilities become runtime skills
command and mcp capabilities are summarized into runtime memory so agents can reason about what external context exists
the same normalized bridge context is refreshed on resume
direct bridge command fallback uses spawn(..., { shell: false }); this feature is not intended to be a generic shell execution tunnel
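
The normalization described above could look roughly like the following. The RuntimeKnowledge shape and the wording of the summary sentence for command and mcp capabilities are invented for illustration; only the four capability kinds and their fields come from the example config.

```typescript
type BridgeCapability =
  | { kind: "memory"; content: string }
  | { kind: "skill"; name: string; description: string; usage?: string }
  | { kind: "command"; name: string; description: string }
  | { kind: "mcp"; name: string; description: string };

// Hypothetical runtime shape shared by Planner, Generator, and Evaluator.
interface RuntimeKnowledge {
  facts: string[];
  skills: { name: string; description: string; usage?: string }[];
}

function normalizeBridgeCapabilities(capabilities: BridgeCapability[]): RuntimeKnowledge {
  const knowledge: RuntimeKnowledge = { facts: [], skills: [] };
  for (const capability of capabilities) {
    switch (capability.kind) {
      case "memory":
        // Memory capabilities become runtime memory facts as-is.
        knowledge.facts.push(capability.content);
        break;
      case "skill":
        knowledge.skills.push({
          name: capability.name,
          description: capability.description,
          usage: capability.usage,
        });
        break;
      case "command":
      case "mcp":
        // Summarized into memory so agents know the external context exists.
        knowledge.facts.push(
          `External ${capability.kind} capability "${capability.name}": ${capability.description}`
        );
        break;
    }
  }
  return knowledge;
}
```

Running the same function again on resume would give the refreshed context mentioned above.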

Shadow Checkpoint Mode

Harness now includes an optional shadow checkpoint execution mode for parallel runtime experiments.

Example:

parallel:
  enabled: true
  maxConcurrency: auto
  minSprintsToParallelize: 3
  allowConcurrentFileEdits: false
  checkpoint:
    mode: checkpoint-repo
    checkpointsDir: .harness/checkpoints

Current semantics:

each group gets an isolated checkpoint repository under .harness/checkpoints/<groupId>/git
Generator execution receives GIT_DIR and GIT_WORK_TREE so edits happen against the shadow checkpoint instead of a new native worktree
Harness captures a baseline checkpoint before the group starts
failed groups roll back to that baseline before cleanup
successful groups clean up their checkpoint repo after completion
this mode is intentionally conservative and currently runs one group at a time

Watchdog Status Mode

harness status --watchdog now uses the same resume path as harness resume instead of a no-op callback.

Examples:

harness status --watchdog
harness status --watchdog -c ./harness.yaml --max-recovery 2

Current semantics:

watchdog mode implies --follow
when watchdog mode is enabled, Harness loads config so auto-resume uses the real agent/runtime settings; if the active run already persisted a config snapshot, watchdog reuses that snapshot instead of resolving the wrong shell-local ./harness.yaml
--max-recovery is enforced by the watchdog runtime, not just parsed by the CLI
recovery attempts update the persisted run state via recoveryReason and recoveryCount
generator provider progress now refreshes persisted child-process activity telemetry, so active provider output is less likely to be misclassified as a stalled/zombie run
status surfaces one unified observed health view: running, stalled, zombie, needs-recovery, or completed
provider/tooling crashes now persist an explicit provider_failed health snapshot with a concrete recovery reason instead of collapsing into an undifferentiated failed run
long-running provider child-process activity with no progress change now escalates to needs-recovery as a suspected hung-provider state instead of staying forever "healthy"

Doctor Checks

harness doctor is a repo/worktree health check for Harness itself. It currently verifies:

the working directory exists
the working directory is writable
node_modules is present when package.json exists
the built Harness CLI at dist/cli.js starts successfully with --help
explicit provider/model/fallback configuration can pass auth/model preflight before you start a long run

This is the fastest way to catch self-optimization/bootstrap failures such as “build passed but the built CLI cannot actually start”.

Harvest Protocol

harness harvest packages the closing step of a self-optimization run into one command: copy a validated whitelist of files from the worktree into the main repo, then revalidate in the main repo.

Examples:

harness harvest ./harness-self-opt-worktree
harness harvest ./harness-self-opt-worktree --apply
harness harvest ./harness-self-opt-worktree --apply --file src/core/harness/runHealth.ts --root-command "python3 ../probe/cli.py run --target-repo-path . --suite unit"

Current semantics:

default mode is dry-run: it lists harvestable files and the validation commands that would run
--apply first runs worktree validation, then copies the whitelist into the target repo, then forces targeted + root validation in the target repo
low-trust status ledgers such as output/plan/TODO/90-STATUS.md are deferred until after root validation succeeds, so failed harvest validation does not prematurely update the main-repo status ledger
when commands are not provided explicitly, Harness infers worktree validation from package.json scripts typecheck / build
root validation is inferred from package.json scripts test, typecheck, and build
targeted validation always includes git diff --check -- {files} in the target repo; custom --targeted-command values are appended after that
.git/, .harness/, and node_modules/ are never harvested from implicit changed-file discovery
deleted files are reported as skipped because the current protocol only copies files; it does not replay deletions
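
Two of the rules above, the path exclusions and the {files} placeholder, can be sketched as follows. The function names are hypothetical, and the sketch joins file paths with plain spaces without shell quoting, which the real implementation may handle differently.

```typescript
// Paths that implicit changed-file discovery never harvests.
const NEVER_HARVESTED_PREFIXES: string[] = [".git/", ".harness/", "node_modules/"];

// Drop files under the excluded directories (repo-relative paths assumed).
function harvestableFiles(changedFiles: string[]): string[] {
  return changedFiles.filter((file) =>
    NEVER_HARVESTED_PREFIXES.every((prefix) => file.indexOf(prefix) !== 0)
  );
}

// The built-in whitespace check always runs first; custom --targeted-command
// values are appended after it. Every {files} placeholder is expanded.
function targetedCommands(files: string[], customCommands: string[]): string[] {
  const fileList = files.join(" ");
  const templates = ["git diff --check -- {files}"].concat(customCommands);
  return templates.map((template) => template.split("{files}").join(fileList));
}

const harvested = harvestableFiles([
  ".git/config",
  ".github/workflows/ci.yml",
  "node_modules/left-pad/index.js",
  "src/core/harness/runHealth.ts",
]);
console.log(targetedCommands(harvested, []));
// [ 'git diff --check -- .github/workflows/ci.yml src/core/harness/runHealth.ts' ]
```

Note that the prefix check keeps .github/ paths, since only the exact .git/ directory is excluded.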

Windows Provider Notes

file-backed prompt inputs are the safest way to pass long task text on Windows shells
provider failures caused by shell mismatches such as bare && or Unix-only head are classified as provider/tooling errors instead of normal code failures
provider-side 404/not found failures are tracked separately from auth/quota/transport failures so bad model/provider ids can be surfaced as manual-recovery issues
status, resume, and completion summaries surface the retry history of parallel groups, so repeated provider/tooling failures are visible without reading raw state files

Smart Model Routing

Harness can route different phases to different models without duplicating the whole agents block.

Example:

routing:
  enabled: true
  defaultModel: openrouter/gpt-4o-mini
  taskRouting:
    planning: openrouter/gpt-5
    generation: openrouter/claude-3.5-sonnet
    evaluation: openrouter/gpt-4o
  strategyRouting:
    issue_fix: openrouter/gpt-4o-mini
    sprint_retry: openrouter/gpt-4o
    full_regenerate: openrouter/gpt-5

Routing semantics:

taskRouting applies to planning, generation, and evaluation
strategyRouting applies on generation paths when Harness already knows it is doing issue_fix, sprint_retry, or full_regenerate
strategyRouting takes precedence over taskRouting for generation retries
if routing swaps the primary model, the original agent model becomes the first fallback, followed by the agent's configured fallbackModels
OpenCode model ids should stay provider-qualified, such as openrouter/gpt-5
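
These routing rules can be sketched as a function that returns the ordered model chain for one agent call. The type names and function signature are hypothetical, and defaultModel is left out because its exact role is not described in this section.

```typescript
type Phase = "planning" | "generation" | "evaluation";
type Strategy = "issue_fix" | "sprint_retry" | "full_regenerate";

interface AgentModels {
  model: string;
  fallbackModels: string[];
}

interface RoutingConfig {
  enabled: boolean;
  taskRouting?: Partial<Record<Phase, string>>;
  strategyRouting?: Partial<Record<Strategy, string>>;
}

// Returns the primary model followed by the retry chain behind it.
function resolveModelChain(
  agent: AgentModels,
  routing: RoutingConfig,
  phase: Phase,
  strategy?: Strategy
): string[] {
  const baseline = [agent.model].concat(agent.fallbackModels);
  if (!routing.enabled) return baseline;

  let routed: string | undefined;
  // strategyRouting takes precedence over taskRouting on generation paths.
  if (phase === "generation" && strategy !== undefined && routing.strategyRouting !== undefined) {
    routed = routing.strategyRouting[strategy];
  }
  if (routed === undefined && routing.taskRouting !== undefined) {
    routed = routing.taskRouting[phase];
  }
  if (routed === undefined || routed === agent.model) return baseline;

  // The original agent model becomes the first fallback, then its fallbackModels.
  const primary = routed;
  return [primary].concat(baseline.filter((model) => model !== primary));
}
```

With the example configuration, an issue_fix generation pass would try openrouter/gpt-4o-mini first, then gemini-2.5-flash, then the Generator's configured fallbackModels.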

Render Audit Configuration Guide

Recommended mode selection:

off: use this for backend-only tasks, docs-only work, or environments where browser preview is intentionally unavailable
auto: default mode; Harness runs render audit when a target is available, and only gates evaluation when the task looks visually sensitive
always: use this for landing pages, marketing sites, UI redesigns, responsive work, animation polish, or any task where visual claims must be backed by browser evidence

CLI override examples:

harness run "Build a landing page" --render-audit always
harness resume <runId> --render-audit off

Custom preview target example:

renderAudit:
  mode: always
  buildCommand: npm run build
  previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
  previewUrl: http://127.0.0.1:4173

Optional baseline screenshot diff:

renderAudit:
  mode: always
  buildCommand: npm run build
  previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
  previewUrl: http://127.0.0.1:4173
  baseline:
    desktopScreenshotPath: ./render-baselines/home-desktop.png
    mobileScreenshotPath: ./render-baselines/home-mobile.png
    maxMismatchRatio: 0.12
    sections:
      - id: hero
        desktop:
          y: 0
          height: 900
          maxMismatchRatio: 0.08
        mobile:
          y: 0
          height: 720
          maxMismatchRatio: 0.1
  interactions:
    - id: products-menu
      action: hover
      selector: '[data-nav-products]'
      expectedVisibleText:
        - API
        - Hosted
    - id: mobile-menu
      surface: mobile
      action: click
      selector: '[data-mobile-menu-button]'
      expectedVisibleSelector: '[data-mobile-menu-panel]'

Important constraints:

the three custom target fields are all-or-nothing
baseline screenshot paths are also all-or-nothing: configure both desktop and mobile references together
each baseline section must define at least one surface crop under desktop or mobile
each interaction audit must define at least one visible expectation via expectedVisibleText or expectedVisibleSelector
when an interaction uses only expectedVisibleText, Harness treats that text as post-action state evidence and requires it to become newly visible after the hover/click
when both expectedVisibleSelector and expectedVisibleText are provided, Harness checks the text inside that visible selector instead of scanning the whole page body
always does not invent a preview target; a supported auto-detected target or a complete custom target must still exist
if you already have a non-Vite preview flow, prefer explicit target config instead of waiting for auto-detection to guess correctly
a nominally successful browser run can now be downgraded to success_with_warnings, resource_failed, or visual_mismatch when runtime/resource/diff evidence is bad
section audits reuse the full-page baseline images and crop by region, so they work well with protocol-derived section heights or module bounds
interaction audits are intentionally small and deterministic: one hover/click action plus concrete visibility expectations
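
The structural constraints above lend themselves to a small validation pass. This is a sketch under assumptions: the RenderAuditConfig type covers only the fields shown in the examples, and the error messages are invented rather than copied from Harness.

```typescript
interface SurfaceCrop {
  y: number;
  height: number;
  maxMismatchRatio?: number;
}

// Partial, hypothetical view of the renderAudit config block.
interface RenderAuditConfig {
  buildCommand?: string;
  previewCommand?: string;
  previewUrl?: string;
  baseline?: {
    desktopScreenshotPath?: string;
    mobileScreenshotPath?: string;
    sections?: { id: string; desktop?: SurfaceCrop; mobile?: SurfaceCrop }[];
  };
  interactions?: {
    id: string;
    expectedVisibleText?: string[];
    expectedVisibleSelector?: string;
  }[];
}

function validateRenderAudit(config: RenderAuditConfig): string[] {
  const errors: string[] = [];

  // Custom target fields are all-or-nothing.
  const targetFields = [config.buildCommand, config.previewCommand, config.previewUrl];
  const provided = targetFields.filter((value) => value !== undefined).length;
  if (provided !== 0 && provided !== 3) {
    errors.push("buildCommand, previewCommand, and previewUrl must be set together");
  }

  const baseline = config.baseline;
  if (baseline !== undefined) {
    const hasDesktop = baseline.desktopScreenshotPath !== undefined;
    const hasMobile = baseline.mobileScreenshotPath !== undefined;
    if (hasDesktop !== hasMobile) {
      errors.push("baseline desktop and mobile screenshot paths must be set together");
    }
    for (const section of baseline.sections || []) {
      if (section.desktop === undefined && section.mobile === undefined) {
        errors.push(`baseline section "${section.id}" needs a desktop or mobile crop`);
      }
    }
  }

  for (const interaction of config.interactions || []) {
    const hasText =
      interaction.expectedVisibleText !== undefined &&
      interaction.expectedVisibleText.length > 0;
    const hasSelector = interaction.expectedVisibleSelector !== undefined;
    if (!hasText && !hasSelector) {
      errors.push(`interaction "${interaction.id}" needs a visible expectation`);
    }
  }
  return errors;
}
```

An empty result means the structural constraints hold; behavioral rules such as "text must become newly visible after the action" can only be checked in the browser.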

Run Artifacts

Harness persists every run under:

.harness/runs/<runId>/
├── state.json
├── plan.json
├── score-history.json
├── logs/
│   ├── planner.log
│   ├── generator.log
│   └── evaluator.log
└── iterations/
    ├── iteration-1/
    │   ├── generation.json
    │   ├── evaluation.json
    │   └── render-audit/
    │       ├── report.json
    │       └── console.json
    └── ...

These artifacts are what make resume, debugging, and self-optimization practical.

When baseline diff is enabled, report.json includes visualDiff metadata and optional sectionAudits. When interaction audits are configured, report.json also includes interactionAudits. The render-audit directory also contains desktop-diff.png / mobile-diff.png, per-section diff images such as hero-desktop-diff.png, and interaction screenshots such as products-menu-desktop-interaction.png. console.json now includes browser failedRequests and httpFailures in addition to console and page errors.

Quality and Evaluation Model

The Evaluator is intentionally skeptical. Recent changes in the codebase include:

zero-trust evaluation guidance
verification-first completion guidance
scale-aware evaluation instructions for small / standard / large tasks
render-audit evidence injection into the prompt
gating logic that can downgrade or fail visually sensitive frontend work when browser evidence is missing
low-trust status-document guidance and gating, so status ledgers cannot self-certify implementation without code/tests/structured evidence

This is one of the most important design decisions in the project: the Generator and Evaluator are intentionally separate so the system does not grade its own output too generously.

Self-Optimization

Harness can optimize Harness, but it should be treated as an experiment, not a casual local run.

Read:

SELF_OPTIMIZATION.md

Key rules:

use an isolated Git worktree outside the main repo
copy any uncommitted local snapshot into that worktree
build first if you want to validate the real shipped CLI path
explicitly pin Gemini models in config
prefer status and resume over manual state editing
review worktree diffs before merging anything back

Repository Guide

Useful project entry points:

Harness.md: design-pattern background and rationale
SELF_OPTIMIZATION.md: operational playbook for self-optimization
missions/README.md: regression mission assets
docs/prompt-fragments/README.md: prompt asset pipeline
examples/tasks/README.md: task examples and evaluation entry points
examples/harness-axle.yaml: Axle Go/SQLite CRUD backend workflow example

Relevant source areas:

src/core/harness/: runtime orchestration
src/core/state/: persisted state and artifact storage
src/core/renderAudit/: browser render-audit execution
src/agents/planner/: planning pipeline
src/agents/generator/: generation pipeline
src/agents/evaluator/: evaluation pipeline and gating logic

Development

python3 ../probe/cli.py run --target-repo-path . --suite unit
npm run typecheck
npm run build
npm run lint:naming
npm run lint:arch
npm run sync:prompt-fragments
npm run mission:run -- missions/01-happy-path.yaml

Formal unit tests for harness are managed by the private probe module. The public repository keeps only the source tree and test-facing fixtures/docs; the tracked tests/ suite lives in repos/probe/assets/harness/unit/ inside the Chariot workspace.

The repository uses strong guardrails:

TDD is expected for feature work
file naming is enforced
architecture boundaries are tested
documentation contracts exist for key operational docs
prompt fragment synchronization is tested

Known Gaps

The project is materially more capable than the old README suggested, but there are still obvious next steps:

render-audit target detection is currently narrow (Vite-oriented)
some evaluator-scale guidance still needs to avoid task-specific hard-coding
browser render audit is useful evidence, but not yet a full visual design judge

Those gaps are real, but they are now layered on top of a substantially stronger orchestration core.

License

MIT
