@cosmo-wise/harness

v1.0.2

Harness CLI for long-running AI agent tasks

Harness

Harness is a CLI orchestrator for long-running AI coding tasks. It implements the Harness pattern with three specialized agents:

  • Planner: turns a user request into a structured plan with sprints and acceptance criteria
  • Generator: executes the plan and writes code
  • Evaluator: scores the result, surfaces issues, and decides whether another iteration is needed

The project is no longer just a thin proof of concept. The current codebase includes resumable runs, parallel sprint execution with Git worktrees, incremental repair strategies, prompt-fragment assets, mission-style regression tasks, and browser-based render-audit artifacts for supported frontend targets.

Current Status

The repository has moved well beyond the original single-loop implementation. The core system now includes:

  • A phase-based runtime under src/core/harness/ instead of a single monolith
  • Unified run, resume, retry, and replan flows backed by persisted run state
  • Parallel sprint execution with isolated Git worktrees
  • Incremental iteration strategies: sprint_retry, issue_fix, and full_regenerate
  • Generator semantic statuses: DONE, DONE_WITH_CONCERNS, NEEDS_CONTEXT, and BLOCKED
  • Stagnation stop-loss for repeated issue_fix attempts
  • Artifact-aware resume that can reuse a persisted plan.json
  • Safer queued state writes and per-iteration artifact persistence
  • Structured verification evidence persisted into generation artifacts and evaluator prompts
  • Runtime-governance evaluation now has an explicit wiring gate: runtime tasks should not pass without evidence for import, instantiation, invocation, and integration coverage
  • Status ledgers such as output/plan/TODO/90-STATUS.md are now explicitly low-trust evidence; they cannot pass evaluation alone and are deferred until root validation during harvest
  • Explicitly configured provider models and fallback chains now run a startup preflight, so auth / not-found failures can stop the run before planning
  • Audited audit/ bundles are treated as low-trust reference inputs; rewriting them now fails evaluation instead of self-certifying completion
  • Generating audit/ artifacts without a real audited input bundle, or claiming success without files or structured verification evidence, now fails evaluation as anti-placeholder / anti-self-certification protection
  • Low-score or evaluator-failed terminal runs now persist as completed_unacceptable instead of looking like a clean success
  • Prompt fragment assets plus synchronization tooling
  • Mission assets for repeatable regression scenarios
  • Browser render-audit artifacts wired into the generation → evaluation path for supported frontend targets

This also means the old README was materially outdated. The sections below describe the codebase as it exists now.

Core Workflow

Harness runs an iterative loop:

  1. Planning
    • The Planner expands the user prompt into an overview, sprints, risks, tech spec, and optional scaffold recommendation.
  2. Generation
    • The Generator implements the plan, either sequentially or across parallel sprint groups in isolated worktrees.
  3. Render Audit when applicable
    • For supported frontend targets, Harness can build the app, start a preview server, capture browser evidence, and persist screenshots plus browser errors.
  4. Evaluation
    • The Evaluator scores the result against weighted criteria and can gate visually sensitive frontend work when browser evidence is missing.
  5. Iteration Control
    • Harness decides whether to stop, retry a sprint, apply a focused issue fix, or fully regenerate.
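
To make the "weighted criteria" step concrete, here is a minimal sketch of how a weighted score could be combined from per-criterion scores. The interface names, the 0-100 score scale, and the weighted-average formula are assumptions for illustration; the README does not state the exact formula the Evaluator uses.

```typescript
// Hypothetical shapes: `name` and `weight` mirror the `criteria` config keys,
// but the formula below is an assumed weighted average, not Harness's own code.
interface WeightedCriterion {
  name: string;
  weight: number;
}

interface CriterionScore {
  name: string;
  score: number; // assumed 0-100 scale
}

function weightedScore(criteria: WeightedCriterion[], scores: CriterionScore[]): number {
  let totalWeight = 0;
  let weightedSum = 0;
  for (const criterion of criteria) {
    const match = scores.filter((s) => s.name === criterion.name)[0];
    // Assumption: a criterion the evaluator did not score counts as 0.
    const score = match ? match.score : 0;
    totalWeight += criterion.weight;
    weightedSum += criterion.weight * score;
  }
  // Avoid dividing by zero when no criteria are configured.
  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
}
```

With weights of 30/25/25/20 and scores of 90/80/70/100, this sketch yields 84.5.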

The loop stops when one of these conditions is met:

  • score threshold reached
  • max iterations reached
  • no significant improvement
  • repeated issue_fix attempts fail to materially improve the score
  • the Generator returns a semantic blocked state that prevents forward progress
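
The stop conditions above can be sketched as a single decision function. This is an illustrative sketch only: the `LoopSnapshot` shape, the `StopReason` labels, and the order in which the checks run are assumptions, while the limit names echo the `iterations` and `incrementalFix` config keys.

```typescript
// Hypothetical state shape; not the persisted run-state schema.
interface LoopSnapshot {
  iteration: number; // number of completed iterations
  scores: number[]; // evaluator score per completed iteration
  consecutiveIssueFixStalls: number; // issue_fix attempts without material gain
  generatorStatus: "DONE" | "DONE_WITH_CONCERNS" | "NEEDS_CONTEXT" | "BLOCKED";
}

interface LoopLimits {
  max: number;
  scoreThreshold: number;
  minImprovement: number;
  maxConsecutiveIssueFixAttempts: number;
}

type StopReason =
  | "blocked"
  | "score_threshold"
  | "max_iterations"
  | "issue_fix_stagnation"
  | "no_improvement";

function stopReason(s: LoopSnapshot, l: LoopLimits): StopReason | null {
  if (s.generatorStatus === "BLOCKED") return "blocked";
  const latest = s.scores.length > 0 ? s.scores[s.scores.length - 1] : 0;
  if (latest >= l.scoreThreshold) return "score_threshold";
  if (s.iteration >= l.max) return "max_iterations";
  if (s.consecutiveIssueFixStalls >= l.maxConsecutiveIssueFixAttempts) {
    return "issue_fix_stagnation";
  }
  if (s.scores.length >= 2) {
    const previous = s.scores[s.scores.length - 2];
    // "No significant improvement": the gain is below the configured minimum.
    if (latest - previous < l.minImprovement) return "no_improvement";
  }
  return null; // keep iterating
}
```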

Major Capabilities

1. Three-agent orchestration

Harness supports claude, gemini, codex, and opencode as agent backends. Each agent has its own CLI, model, timeout, and retry policy.

2. Parallel sprint execution

When a plan contains enough independent work, Harness can:

  • topologically group non-conflicting sprints
  • create isolated Git worktrees
  • run multiple Generator executions in parallel
  • merge successful work back into the main working tree
  • treat retried groups as latest-attempt-only state, so stale blocked/error residue does not leak into the final completion path

This is the main speed lever for larger tasks.
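
As an illustration of "topologically group non-conflicting sprints", the sketch below builds ordered waves of sprints that could run in parallel. It assumes each sprint declares its dependencies and the files it expects to edit; the `dependsOn` and `files` fields and the `groupSprints` name are hypothetical and not taken from the Harness plan schema.

```typescript
// Hypothetical sprint shape for illustration only.
interface SprintNode {
  name: string;
  dependsOn: string[]; // names of sprints that must finish first
  files: string[]; // files this sprint expects to edit
}

// Returns waves in execution order. Sprints inside one wave have all
// dependencies satisfied by earlier waves and touch disjoint files.
function groupSprints(sprints: SprintNode[]): string[][] {
  const done: string[] = [];
  let pending = sprints.slice();
  const waves: string[][] = [];

  while (pending.length > 0) {
    const wave: string[] = [];
    const claimedFiles: string[] = [];
    for (const sprint of pending) {
      const ready = sprint.dependsOn.every((dep) => done.indexOf(dep) !== -1);
      const conflicts = sprint.files.some((f) => claimedFiles.indexOf(f) !== -1);
      if (ready && !conflicts) {
        wave.push(sprint.name);
        for (const f of sprint.files) claimedFiles.push(f);
      }
    }
    if (wave.length === 0) {
      // Nothing is runnable: a cycle or a dependency on an unknown sprint.
      throw new Error("Dependency cycle or unknown dependency among sprints");
    }
    waves.push(wave);
    for (const name of wave) done.push(name);
    pending = pending.filter((s) => wave.indexOf(s.name) === -1);
  }
  return waves;
}
```

A sprint that conflicts on a file is simply deferred to the next wave, which matches the spirit of `allowConcurrentFileEdits: false` without claiming to be how Harness resolves conflicts.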

3. Incremental repair instead of blind full rewrites

Harness does not always regenerate everything after a failed evaluation. It can choose:

  • sprint_retry: retry only specific failed or incomplete sprints
  • issue_fix: target critical or major evaluator findings with a focused repair pass
  • full_regenerate: rebuild the implementation when local repair is not sufficient

The Generator prompt also includes an internal fix-verify protocol on issue-fix paths. Evaluator issues can now include a confidence score, and issue_fix can ignore low-confidence findings when incrementalFix.issueConfidenceThreshold is configured.
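
A minimal sketch of how issue_fix target selection might look, assuming findings carry a severity and an optional 0-100 confidence. The `EvaluatorFinding` shape and the `selectIssueFixTargets` helper are hypothetical, and keeping findings that have no confidence value is an assumption rather than documented behavior.

```typescript
// Hypothetical finding shape; not the evaluator's actual output schema.
interface EvaluatorFinding {
  severity: "critical" | "major" | "minor";
  message: string;
  confidence?: number; // assumed 0-100 scale
}

function selectIssueFixTargets(
  findings: EvaluatorFinding[],
  issueConfidenceThreshold?: number,
  focusOnCritical: boolean = false
): EvaluatorFinding[] {
  return findings.filter((finding) => {
    // issue_fix targets critical or major findings only.
    if (finding.severity === "minor") return false;
    if (focusOnCritical && finding.severity !== "critical") return false;
    // Without a threshold (or without a confidence value) nothing is skipped.
    if (issueConfidenceThreshold === undefined || finding.confidence === undefined) {
      return true;
    }
    return finding.confidence >= issueConfidenceThreshold;
  });
}
```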

4. Resume, retry, and replan

Run state is persisted under .harness/runs/<runId>/, so interrupted runs can continue without manual state editing.

Current recovery behavior includes:

  • resume for interrupted or failed runs
  • retry as a convenience wrapper for rewinding the latest incomplete iteration
  • replan for restarting from planning with appended or replaced guidance
  • artifact-aware resume that can skip the Planner if a valid plan.json already exists
  • repair of stale half-iterations before continuing
  • destructive resume repair now writes a per-run repair-backups/ snapshot before trimming artifacts or rewriting damaged persisted state
  • targeted sprint resume on a completed run that already hit iterations.max now reopens one additional iteration window instead of silently no-oping
  • fresh runs persist a baseline Git diff snapshot so generation/evaluation/finalize only count net-new file changes instead of inherited dirty-worktree residue

5. Browser-based render audit artifacts

Harness now includes a browser render-audit pipeline for supported frontend targets.

Current behavior:

  • detects a supported preview target
  • runs build + preview
  • waits for the preview URL to become reachable
  • captures desktop and mobile screenshots
  • records console errors and page errors
  • records failed browser requests and HTTP 4xx/5xx responses
  • can compare the captured screenshots against configured desktop/mobile baseline references
  • can execute configured hover/click interaction audits on desktop or mobile surfaces
  • persists the evidence into the iteration artifacts
  • passes that evidence into the Evaluator prompt
  • can gate visually-oriented frontend tasks when render evidence is missing, visually mismatched, or failed on runtime/resource signals
  • includes the matching evidence line when it flags a long-running preview/dev server during generation
  • can also consume an external render-audit JSON artifact when another tool owns the browser execution path

Current boundary:

  • target detection is currently Vite-oriented
  • render-audit requirement is still heuristic rather than fully planner-contract-driven
  • Generator should not keep long-running preview or dev servers alive; browser verification belongs to the render-audit phase
  • runtime/governance tasks now require wiring evidence in evaluation; module existence alone is not enough to pass

6. Two-stage evaluator pipeline

Harness supports an optional two-stage evaluation pipeline that splits evaluation into:

  1. Preliminary stage: Fast syntactic and structural checks
  2. Deep stage: Full semantic quality review

Configuration:

twoStageEvaluation:
  enabled: true
  preliminaryTimeoutMs: 60000
  deepTimeoutMs: 180000

Benefits:

  • Early exit on critical syntax/structure issues before expensive semantic evaluation
  • Separate artifact persistence: preliminary-evaluation.json and deep-evaluation.json
  • Status timestamps in run state: preliminaryEvaluationComplete and deepEvaluationComplete
  • Deep-stage review now consumes the same render-audit evidence and post-review gates as single-stage evaluation, so frontend fidelity failures still downgrade the final verdict
  • Graceful fallback to single-stage evaluation when not enabled

7. Prompt assets and documentation contracts

Prompt instructions are no longer maintained only as inline strings. The repo now contains:

  • source prompt assets under src/assets/prompts/
  • synced prompt-fragment docs under docs/prompt-fragments/
  • a sync script: npm run sync:prompt-fragments
  • tests that guard the existence and synchronization of key prompt assets

8. Mission-style regression assets

missions/ is now a formal regression asset layer, not a scratch directory. It supports repeatable scenarios such as:

  • happy_path
  • issue_fix
  • resume
  • needs_context
  • blocked

Mission execution is available through:

npm run mission:run -- missions/01-happy-path.yaml

Installation

Prerequisites

  • Node.js >= 20
  • Git
  • At least one supported model CLI:
    • @anthropic-ai/claude-code
    • @google/gemini-cli
    • @openai/codex
    • opencode-ai

Install dependencies:

npm install

Build the CLI:

npm run build

Link it globally if you want harness on your shell PATH:

npm link

If you want browser render audits for frontend tasks, install Playwright Chromium once:

npm run playwright:install

Windows note

Claude Code on Windows typically needs Git Bash configured in .env:

CLAUDE_CODE_GIT_BASH_PATH=D:\Git\bin\bash.exe

Harness will continue without it, but Claude Code shell execution may fail.

Quick Start

Run a task in development mode without building:

npm run dev -- run "Create an Express API with authentication"
npm run dev -- run -c examples/harness-axle.yaml --working-dir ./output/axle-smoke

Run the built CLI:

harness run "Create an Express API with authentication"
harness run -c examples/harness-axle.yaml --working-dir ./output/axle-smoke
harness run --prompt-file ./task.md

Check status:

harness status
harness status --follow
harness status --watchdog

Resume the latest run:

harness resume
harness resume <runId> --append-file ./resume-notes.md

Check whether the current Harness worktree is bootstrapped correctly:

harness doctor

Harvest validated files from a self-opt worktree back into the main repo:

harness harvest ./harness-self-opt-worktree
harness harvest ./harness-self-opt-worktree --apply

Run into a dedicated working directory:

harness run "Build a React landing page" -w ./output/landing-page

Enable parallel execution:

harness run "Build a web app" -p --max-concurrency 4

CLI Commands

harness run [prompt]
harness run --prompt-file <path>
harness status
harness status --follow
harness doctor
harness harvest <sourceDir>
harness resume [runId]
harness resume [runId] --append-file <path>
harness retry [runId]
harness replan [runId] -a "Additional guidance"
harness replan [runId] --replace-file <path>
harness config

Important options:

  • -c, --config <path>: config file path
  • -w, --working-dir <dir>: working directory
  • -v, --verbose: verbose logs
  • -p, --parallel: enable parallel sprint execution
  • --max-concurrency <number|auto>: parallel worker limit
  • --iteration-mode <mode>: force iteration strategy (sprint_retry, issue_fix, full_regenerate)
  • --focus-critical: restrict incremental fixes to critical issues
  • --render-audit <off|auto|always>: override render audit mode for this invocation
  • -T, --test-tier <smoke|gate|full>: select test tier for this invocation
  • --watchdog: enable auto-resume watchdog mode on status
  • -i, --interval <ms>: polling interval on status
  • --max-recovery <number>: maximum recovery attempts on status
  • -f, --file <paths...> on harvest: explicit whitelist of files to harvest
  • --apply on harvest: copy files and run validations instead of dry-run listing
  • --source-command <command> on harvest: additional or replacement worktree validation commands
  • --targeted-command <command> on harvest: targeted validation command in the target repo; supports {files}
  • --root-command <command> on harvest: root validation command in the target repo
  • --from-iteration <number> on resume: rewind and continue from a specific iteration
  • --prompt-file <path> on run: read the task prompt from a file
  • --append <text> on resume: inject extra resume guidance without resetting an in-progress run
  • --append-file <path> on resume / replan: read appended guidance from a file
  • --append <text> / --replace [text] on replan: change planning guidance
  • --replace [text] on resume: restart from planning with a replaced prompt
  • --replace-file <path> on resume / replan: read replacement planning guidance from a file
  • --sprint <names...> on resume: target specific sprints without replanning

Configuration

Harness loads config from:

  1. --config
  2. ./harness.yaml
  3. ./harness.yml
  4. ./harness.json
  5. ./.harness/config.yaml
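
The lookup order above reads as "first match wins". A small sketch under that assumption follows; `resolveConfigPath` and the injected `fileExists` predicate are hypothetical names, used here so the logic stays independent of the file system.

```typescript
// Candidate paths in the documented order, after an explicit --config value.
const CONFIG_CANDIDATES: string[] = [
  "./harness.yaml",
  "./harness.yml",
  "./harness.json",
  "./.harness/config.yaml",
];

// Assumption: the first existing candidate wins and an explicit path is
// returned without probing the defaults.
function resolveConfigPath(
  explicitPath: string | undefined,
  fileExists: (path: string) => boolean
): string | null {
  if (explicitPath !== undefined) return explicitPath;
  const found = CONFIG_CANDIDATES.filter((candidate) => fileExists(candidate));
  return found.length > 0 ? found[0] : null;
}
```

In a Node.js script, `existsSync` from `node:fs` could be passed as the `fileExists` argument.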

Example:

task: Build a production-ready landing page for Harness

agents:
  planner:
    cli: gemini
    model: gemini-2.5-pro
    fallbackModels:
      - gemini-2.5-flash
    timeout: 300000
    maxOuterRetries: 2
    retryBaseDelayMs: 2000
    retryMaxDelayMs: 8000

  generator:
    cli: gemini
    model: gemini-2.5-flash
    fallbackModels:
      - gemini-2.5-flash-lite
      - gemini-2.5-pro
    timeout: 1800000
    maxOuterRetries: 3
    retryBaseDelayMs: 2000
    retryMaxDelayMs: 8000

  evaluator:
    cli: gemini
    model: gemini-2.5-pro
    fallbackModels:
      - gemini-2.5-flash
    timeout: 600000
    maxOuterRetries: 2
    retryBaseDelayMs: 2000
    retryMaxDelayMs: 8000

iterations:
  max: 5
  scoreThreshold: 90
  minImprovement: 3

criteria:
  - name: functionality
    weight: 30
    description: Core requirements are implemented and work end-to-end
  - name: code_quality
    weight: 25
    description: Code is maintainable, readable, and tested appropriately
  - name: architecture
    weight: 25
    description: Modules, boundaries, and structure are coherent
  - name: originality
    weight: 20
    description: The implementation is deliberate, not boilerplate filler

recursion:
  maxDepth: 3
  preventNested: true

parallel:
  enabled: true
  maxConcurrency: auto
  minSprintsToParallelize: 3
  allowConcurrentFileEdits: false
  baselineCommand: npm run typecheck
  postCreateChecks:
    checkUncommittedChanges: true
    checkBaselineFiles: true
    baselineFiles:
      requiredFiles:
        - package.json
        - tsconfig.json
    checkBranchDivergence: true
    baseBranch: main
  checkpoint:
    mode: checkpoint-repo
    checkpointsDir: .harness/checkpoints

planning:
  confidenceThreshold: 70

routing:
  enabled: true
  defaultModel: openrouter/gpt-4o-mini
  taskRouting:
    planning: openrouter/gpt-5
    generation: openrouter/claude-3.5-sonnet
    evaluation: openrouter/gpt-4o
  strategyRouting:
    issue_fix: openrouter/gpt-4o-mini
    sprint_retry: openrouter/gpt-4o
    full_regenerate: openrouter/gpt-5

bridge:
  enabled: true
  type: mcp
  capabilities:
    - kind: memory
      content: Prefer the external design system guidance when touching shared UI.
    - kind: skill
      name: design-bridge
      description: Uses external bridge context for design-system decisions.
      usage: Use when repo-local docs are incomplete.
    - kind: command
      name: repo-index
      description: External command capability that can surface repository metadata.
    - kind: mcp
      name: design-docs
      description: External MCP capability exposing design references.

# Optional Axle CRUD backend guidance. Use this for Go/SQLite CRUD services
# where standard CRUD should stay descriptor-generated and framework-owned.
memory:
  facts:
    - For Go/SQLite CRUD backends, prefer Axle unless the user asks otherwise.
    - Axle apps must be descriptor-first and verified with axle check plus scripts/verify.sh.
  skills:
    - name: axle-crud-backend
      description: Scaffold and adapt LLM-friendly Go/SQLite CRUD backends with Axle.
      usage: Run axle app init, replace sample descriptors, regenerate descriptor/catalog output, then run scripts/verify.sh and axle check --root .

renderAudit:
  mode: auto
  # Optional external render-audit artifact override:
  # externalArtifactPath: .trial/runs/latest/render-audit.json
  # Optional Trial provider that runs compile/run/export automatically:
  # provider: trial
  # buildCommand: npm run build
  # previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
  # previewUrl: http://127.0.0.1:4173
  # trial:
  #   command: trial
  #   bundlePath: ../prompt-bundles/site-audit
  #   suiteDir: .trial/suite
  #   runDir: .trial/run
  #   artifactPath: .trial/render-audit.json
  #   contentGate: true
  #   interactionGate: true
  #   visualGate: false
  # Optional explicit preview target override:
  # buildCommand: npm run build
  # previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
  # previewUrl: http://127.0.0.1:4173
  # Optional screenshot baseline diff:
  # baseline:
  #   desktopScreenshotPath: ./render-baselines/home-desktop.png
  #   mobileScreenshotPath: ./render-baselines/home-mobile.png
  #   maxMismatchRatio: 0.15

incrementalFix:
  enabled: true
  maxRetries: 3
  focusOnCritical: true
  maxConsecutiveIssueFixAttempts: 3
  issueConfidenceThreshold: 70

budget:
  maxTurnsPerRun: 1500
  maxCostUsdPerRun: 15
  maxTurnsPerIteration: 400
  maxCostUsdPerIteration: 4

Notes:

  • Gemini should be configured with an explicit model
  • fallback models and bounded outer retries are strongly recommended for Gemini
  • OpenCode models should use the provider-qualified form expected by opencode, such as openrouter/gpt-5
  • parallel mode requires a Git repository and a clean workspace
  • parallel.baselineCommand runs inside each created worktree before sprint execution starts
  • parallel.postCreateChecks can fail fast on uncommitted changes, missing baseline files, or branch divergence inside created worktrees
  • parallel.checkpoint.mode: checkpoint-repo enables an experimental shadow-checkpoint path that gives each group its own GIT_DIR under .harness/checkpoints/
  • checkpoint mode currently forces maxConcurrency = 1 for safety because groups share the same GIT_WORK_TREE; native worktrees remain the default and higher-throughput path
  • planning.confidenceThreshold blocks generation when the Planner reports a lower planConfidence; use a 0-100 scale, not 0-1
  • incrementalFix.issueConfidenceThreshold lets issue_fix skip low-confidence evaluator findings
  • budget.maxTurnsPerRun / budget.maxCostUsdPerRun guard the whole run, while budget.maxTurnsPerIteration / budget.maxCostUsdPerIteration can stop a single overly expensive iteration
  • when agents.*.model or fallbackModels are explicitly configured, Harness preflights those provider/model combinations before planning
  • for long prompts on Windows, prefer --prompt-file, --append-file, and --replace-file over very long inline CLI arguments
  • routing.taskRouting can override the primary model per phase without changing each agent's baseline config
  • routing.strategyRouting can override the Generator model for issue_fix, sprint_retry, and full_regenerate
  • when routing changes the primary model, Harness preserves the original agent model and fallbackModels as the retry chain behind the routed primary
  • bridge is a capability-context layer, not an arbitrary shell passthrough
  • bridge.capabilities lets you describe external memory, skill, command, and mcp affordances in one schema
  • Harness normalizes bridge capabilities into runtime memory / skills so Planner, Generator, and Evaluator all see the same external context on run and resume
  • renderAudit.mode: off disables audit execution and evaluation gating
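  • renderAudit.mode: auto keeps the default behavior: Harness runs the render audit when a target is available and gates evaluation only when the task looks visually sensitive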
  • renderAudit.mode: always requires render audit evidence whenever a target is available
  • renderAudit.externalArtifactPath lets Harness load precomputed render-audit evidence from another tool such as trial
  • renderAudit.externalArtifactPath cannot be combined with buildCommand, previewCommand, previewUrl, baseline, or interactions
  • renderAudit.provider: trial tells Harness to call the sibling trial CLI itself instead of running the built-in Playwright auditor
  • renderAudit.trial.bundlePath is required when renderAudit.provider=trial
  • renderAudit.provider=trial still reuses Harness target resolution for buildCommand, previewCommand, and previewUrl
  • renderAudit.provider=trial cannot be combined with externalArtifactPath, baseline, or interactions
  • Trial exports now treat prompt-content and interaction failures as hard failures by default, while visual mismatch evidence stays in the evaluator prompt unless visualGate: true
  • Trial route summaries are attached to render-audit evidence so repair prompts can see missing text, failed interactions, and diff image paths per route
  • custom render targets must provide buildCommand, previewCommand, and previewUrl together
  • renderAudit.baseline.desktopScreenshotPath and renderAudit.baseline.mobileScreenshotPath must be configured together
  • renderAudit.baseline.maxMismatchRatio defaults to 0.15 when omitted
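
The budget notes describe two nested guards: one for the whole run and one for a single iteration. The sketch below shows one way to express that check. The `UsageTotals` shape, the `budgetViolation` name, and the choice to trip only when usage strictly exceeds a limit are assumptions.

```typescript
interface BudgetLimits {
  maxTurnsPerRun?: number;
  maxCostUsdPerRun?: number;
  maxTurnsPerIteration?: number;
  maxCostUsdPerIteration?: number;
}

// Hypothetical usage counters for illustration.
interface UsageTotals {
  turns: number;
  costUsd: number;
}

// Returns the name of the first exceeded limit, or null when within budget.
function budgetViolation(
  limits: BudgetLimits,
  run: UsageTotals,
  iteration: UsageTotals
): string | null {
  if (limits.maxTurnsPerRun !== undefined && run.turns > limits.maxTurnsPerRun) {
    return "maxTurnsPerRun";
  }
  if (limits.maxCostUsdPerRun !== undefined && run.costUsd > limits.maxCostUsdPerRun) {
    return "maxCostUsdPerRun";
  }
  if (
    limits.maxTurnsPerIteration !== undefined &&
    iteration.turns > limits.maxTurnsPerIteration
  ) {
    return "maxTurnsPerIteration";
  }
  if (
    limits.maxCostUsdPerIteration !== undefined &&
    iteration.costUsd > limits.maxCostUsdPerIteration
  ) {
    return "maxCostUsdPerIteration";
  }
  return null;
}
```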

Bridge Capability Context

Harness can load external capability descriptions through bridge and fold them into the same runtime knowledge path used by built-in memory and skills.

Example:

bridge:
  enabled: true
  type: mcp
  capabilities:
    - kind: memory
      content: Prefer the external design system guidance when touching shared UI.
    - kind: skill
      name: design-bridge
      description: Uses external bridge context for design-system decisions.
      usage: Use when repo-local docs are incomplete.
    - kind: command
      name: repo-index
      description: External command capability that can surface repository metadata.
    - kind: mcp
      name: design-docs
      description: External MCP capability exposing design references.

Current semantics:

  • memory capabilities are injected as runtime memory facts
  • skill capabilities become runtime skills
  • command and mcp capabilities are summarized into runtime memory so agents can reason about what external context exists
  • the same normalized bridge context is refreshed on resume
  • direct bridge command fallback uses spawn(..., { shell: false }); this feature is not intended to be a generic shell execution tunnel

Shadow Checkpoint Mode

Harness now includes an optional shadow checkpoint execution mode for parallel runtime experiments.

Example:

parallel:
  enabled: true
  maxConcurrency: auto
  minSprintsToParallelize: 3
  allowConcurrentFileEdits: false
  checkpoint:
    mode: checkpoint-repo
    checkpointsDir: .harness/checkpoints

Current semantics:

  • each group gets an isolated checkpoint repository under .harness/checkpoints/<groupId>/git
  • Generator execution receives GIT_DIR and GIT_WORK_TREE so edits happen against the shadow checkpoint instead of a new native worktree
  • Harness captures a baseline checkpoint before the group starts
  • failed groups roll back to that baseline before cleanup
  • successful groups clean up their checkpoint repo after completion
  • this mode is intentionally conservative and currently runs one group at a time
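
To illustrate the GIT_DIR / GIT_WORK_TREE mechanism, here is a sketch that builds a child-process environment for one group. The `checkpointEnv` helper is hypothetical; only the two Git environment variables and the .harness/checkpoints/<groupId>/git layout come from the description above.

```typescript
type EnvMap = { [key: string]: string | undefined };

// Builds a copy of the base environment that points Git at a per-group
// checkpoint repository while keeping the shared working tree.
function checkpointEnv(
  baseEnv: EnvMap,
  checkpointsDir: string,
  groupId: string,
  workTree: string
): EnvMap {
  const root = checkpointsDir.replace(/\/+$/, ""); // drop trailing slashes
  const env: EnvMap = {};
  for (const key of Object.keys(baseEnv)) {
    env[key] = baseEnv[key];
  }
  env["GIT_DIR"] = root + "/" + groupId + "/git";
  env["GIT_WORK_TREE"] = workTree;
  return env;
}
```

The result could be handed to Node's `child_process.spawn` through its `env` option, so that Git commands run by the child operate on the checkpoint repository instead of the repository's own .git directory.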

Watchdog Status Mode

harness status --watchdog now uses the same resume path as harness resume instead of a no-op callback.

Examples:

harness status --watchdog
harness status --watchdog -c ./harness.yaml --max-recovery 2

Current semantics:

  • watchdog mode implies --follow
  • when watchdog mode is enabled, Harness loads config so auto-resume uses the real agent/runtime settings; if the active run already persisted a config snapshot, watchdog reuses that snapshot instead of resolving the wrong shell-local ./harness.yaml
  • --max-recovery is enforced by the watchdog runtime, not just parsed by the CLI
  • recovery attempts update the persisted run state via recoveryReason and recoveryCount
  • generator provider progress now refreshes persisted child-process activity telemetry, so active provider output is less likely to be misclassified as a stalled/zombie run
  • status surfaces one unified observed health view: running, stalled, zombie, needs-recovery, or completed
  • provider/tooling crashes now persist an explicit provider_failed health snapshot with a concrete recovery reason instead of collapsing into an undifferentiated failed run
  • long-running provider child-process activity with no progress change now escalates to needs-recovery as a suspected hung-provider state instead of staying forever "healthy"

Doctor Checks

harness doctor is a repo/worktree health check for Harness itself. It currently verifies:

  • the working directory exists
  • the working directory is writable
  • node_modules is present when package.json exists
  • the built Harness CLI at dist/cli.js starts successfully with --help
  • explicit provider/model/fallback configuration can pass auth/model preflight before you start a long run

This is the fastest way to catch self-optimization/bootstrap failures such as “build passed but the built CLI cannot actually start”.

Harvest Protocol

harness harvest turns the closing step of a self-optimization run into a command: copy a validated whitelist of files into the main repo, then revalidate there.

Examples:

harness harvest ./harness-self-opt-worktree
harness harvest ./harness-self-opt-worktree --apply
harness harvest ./harness-self-opt-worktree --apply --file src/core/harness/runHealth.ts --root-command "python3 ../probe/cli.py run --target-repo-path . --suite unit"

Current semantics:

  • default mode is dry-run: it lists harvestable files and the validation commands that would run
  • --apply first runs worktree validation, then copies the whitelist into the target repo, then forces targeted + root validation in the target repo
  • low-trust status ledgers such as output/plan/TODO/90-STATUS.md are deferred until after root validation succeeds, so failed harvest validation does not prematurely update the main-repo status ledger
  • when commands are not provided explicitly, Harness infers worktree validation from package.json scripts typecheck / build
  • root validation is inferred from package.json scripts test, typecheck, and build
  • targeted validation always includes git diff --check -- {files} in the target repo; custom --targeted-command values are appended after that
  • .git/, .harness/, and node_modules/ are never harvested from implicit changed-file discovery
  • deleted files are reported as skipped because the current protocol only copies files; it does not replay deletions
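
Two of the rules above, the implicit-discovery exclusions and the {files} placeholder, can be sketched as follows. The helper names and the `ChangedFile` shape are hypothetical, matching exclusions by top-level path prefix is an assumption, and the space-joined expansion does not quote paths, so it is only suitable for paths without spaces.

```typescript
// Hypothetical change record for illustration.
interface ChangedFile {
  path: string; // repo-relative, forward slashes
  deleted: boolean;
}

const NEVER_HARVESTED: string[] = [".git/", ".harness/", "node_modules/"];

// Splits discovered changes into files to copy and deletions to report.
function partitionHarvest(changes: ChangedFile[]): { harvest: string[]; skipped: string[] } {
  const harvest: string[] = [];
  const skipped: string[] = [];
  for (const change of changes) {
    const excluded = NEVER_HARVESTED.some((prefix) => change.path.indexOf(prefix) === 0);
    if (excluded) continue;
    if (change.deleted) {
      skipped.push(change.path);
    } else {
      harvest.push(change.path);
    }
  }
  return { harvest, skipped };
}

// The built-in whitespace check always runs first; custom commands follow.
function buildTargetedCommands(files: string[], customCommands: string[]): string[] {
  const templates = ["git diff --check -- {files}"].concat(customCommands);
  const fileList = files.join(" ");
  return templates.map((template) => template.split("{files}").join(fileList));
}
```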

Windows Provider Notes

  • file-backed prompt inputs are the safest way to pass long task text on Windows shells
  • provider failures caused by shell mismatches such as bare && or Unix-only head are classified as provider/tooling errors instead of normal code failures
  • provider-side 404/not found failures are tracked separately from auth/quota/transport failures so bad model/provider ids can be surfaced as manual-recovery issues
  • status, resume, and completion summaries surface notable parallel group retry history so repeated provider/tooling failures are visible without reading raw state files

Smart Model Routing

Harness can route different phases to different models without duplicating the whole agents block.

Example:

routing:
  enabled: true
  defaultModel: openrouter/gpt-4o-mini
  taskRouting:
    planning: openrouter/gpt-5
    generation: openrouter/claude-3.5-sonnet
    evaluation: openrouter/gpt-4o
  strategyRouting:
    issue_fix: openrouter/gpt-4o-mini
    sprint_retry: openrouter/gpt-4o
    full_regenerate: openrouter/gpt-5

Routing semantics:

  • taskRouting applies to planning, generation, and evaluation
  • strategyRouting applies on generation paths when Harness already knows it is doing issue_fix, sprint_retry, or full_regenerate
  • strategyRouting takes precedence over taskRouting for generation retries
  • if routing swaps the primary model, the original agent model becomes the first fallback, followed by the agent's configured fallbackModels
  • OpenCode model ids should stay provider-qualified, such as openrouter/gpt-5
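
The routing semantics above can be sketched as a function that returns the ordered model chain for one call. `resolveModelChain` and its types are hypothetical. The sketch deliberately leaves out `defaultModel`, because its exact role is not described here.

```typescript
interface AgentModelConfig {
  model: string;
  fallbackModels?: string[];
}

type RoutedPhase = "planning" | "generation" | "evaluation";
type RepairStrategy = "issue_fix" | "sprint_retry" | "full_regenerate";

interface RoutingConfig {
  enabled: boolean;
  taskRouting?: { planning?: string; generation?: string; evaluation?: string };
  strategyRouting?: { issue_fix?: string; sprint_retry?: string; full_regenerate?: string };
}

function resolveModelChain(
  agent: AgentModelConfig,
  routing: RoutingConfig,
  phase: RoutedPhase,
  strategy?: RepairStrategy
): string[] {
  const fallbacks = agent.fallbackModels || [];
  let routed: string | undefined;
  if (routing.enabled) {
    // strategyRouting takes precedence over taskRouting on generation paths.
    if (phase === "generation" && strategy !== undefined && routing.strategyRouting) {
      routed = routing.strategyRouting[strategy];
    }
    if (routed === undefined && routing.taskRouting) {
      routed = routing.taskRouting[phase];
    }
  }
  if (routed === undefined || routed === agent.model) {
    return [agent.model].concat(fallbacks);
  }
  // The original agent model becomes the first fallback behind the routed one.
  return [routed, agent.model].concat(fallbacks);
}
```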

Render Audit Configuration Guide

Recommended mode selection:

  • off: use this for backend-only tasks, docs-only work, or environments where browser preview is intentionally unavailable
  • auto: default mode; Harness runs render audit when a target is available, and only gates evaluation when the task looks visually sensitive
  • always: use this for landing pages, marketing sites, UI redesigns, responsive work, animation polish, or any task where visual claims must be backed by browser evidence

CLI override examples:

harness run "Build a landing page" --render-audit always
harness resume <runId> --render-audit off

Custom preview target example:

renderAudit:
  mode: always
  buildCommand: npm run build
  previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
  previewUrl: http://127.0.0.1:4173

Optional baseline screenshot diff:

renderAudit:
  mode: always
  buildCommand: npm run build
  previewCommand: npm run preview -- --host 127.0.0.1 --port 4173
  previewUrl: http://127.0.0.1:4173
  baseline:
    desktopScreenshotPath: ./render-baselines/home-desktop.png
    mobileScreenshotPath: ./render-baselines/home-mobile.png
    maxMismatchRatio: 0.12
    sections:
      - id: hero
        desktop:
          y: 0
          height: 900
          maxMismatchRatio: 0.08
        mobile:
          y: 0
          height: 720
          maxMismatchRatio: 0.1
  interactions:
    - id: products-menu
      action: hover
      selector: '[data-nav-products]'
      expectedVisibleText:
        - API
        - Hosted
    - id: mobile-menu
      surface: mobile
      action: click
      selector: '[data-mobile-menu-button]'
      expectedVisibleSelector: '[data-mobile-menu-panel]'
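
To make the `maxMismatchRatio` thresholds in this config concrete, here is a rough sketch of how a section crop (`y`, `height`) could be compared against a baseline. It assumes the ratio is mismatched pixels divided by total pixels inside the crop; the README does not specify the diff algorithm Harness uses, and `PixelGrid`, `mismatchRatio`, and `withinThreshold` are hypothetical names.

```typescript
// Hypothetical pixel grid: one number per pixel, row by row.
// Real screenshots would be decoded PNG data.
type PixelGrid = number[][];

// Same fields as a baseline section surface in the config above.
interface Crop {
  y: number;
  height: number;
}

// Fraction of pixels that differ inside the crop (whole image when no crop is given).
function mismatchRatio(baseline: PixelGrid, actual: PixelGrid, crop?: Crop): number {
  const start = crop ? crop.y : 0;
  const end = crop ? Math.min(crop.y + crop.height, baseline.length) : baseline.length;
  let total = 0;
  let mismatched = 0;
  for (let row = start; row < end; row++) {
    const expectedRow = baseline[row] ?? [];
    const actualRow = actual[row] ?? [];
    for (let col = 0; col < expectedRow.length; col++) {
      total++;
      if (expectedRow[col] !== actualRow[col]) mismatched++;
    }
  }
  return total === 0 ? 0 : mismatched / total;
}

function withinThreshold(ratio: number, maxMismatchRatio: number): boolean {
  return ratio <= maxMismatchRatio;
}

// 4 rows x 2 columns; one pixel differs in the first row.
const baselineImage: PixelGrid = [[1, 1], [1, 1], [1, 1], [1, 1]];
const actualImage: PixelGrid = [[1, 0], [1, 1], [1, 1], [1, 1]];

const fullRatio = mismatchRatio(baselineImage, actualImage);                       // 1 of 8 pixels
const heroRatio = mismatchRatio(baselineImage, actualImage, { y: 0, height: 2 }); // 1 of 4 pixels
console.log(fullRatio, withinThreshold(fullRatio, 0.12));
console.log(heroRatio, withinThreshold(heroRatio, 0.08));
```

Because a section crop covers fewer pixels, the same defect produces a higher ratio there than on the full page, which is one reason per-section thresholds are configured separately.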

Important constraints:

  • the three custom target fields (buildCommand, previewCommand, previewUrl) are all-or-nothing: set all three or none
  • baseline screenshot paths are also all-or-nothing: configure both desktop and mobile references together
  • each baseline section must define at least one surface crop under desktop or mobile
  • each interaction audit must define at least one visible expectation via expectedVisibleText or expectedVisibleSelector
  • when an interaction uses only expectedVisibleText, Harness treats that text as post-action state evidence and requires it to become newly visible after the hover/click
  • when both expectedVisibleSelector and expectedVisibleText are provided, Harness checks the text inside that visible selector instead of scanning the whole page body
  • the always mode does not invent a preview target; a supported auto-detected target or a complete custom target must still exist
  • if you already have a non-Vite preview flow, prefer explicit target config instead of waiting for auto-detection to guess correctly
  • a nominally successful browser run can now be downgraded to success_with_warnings, resource_failed, or visual_mismatch when runtime, resource, or diff evidence shows problems
  • section audits reuse the full-page baseline images and crop by region, so they work well with protocol-derived section heights or module bounds
  • interaction audits are intentionally small and deterministic: one hover/click action plus concrete visibility expectations
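
The structural constraints above lend themselves to a pre-flight check. The following is a sketch under stated assumptions, not Harness's own validator: the field names mirror the YAML examples, while the interfaces, `validateRenderAudit`, the error messages, and the `"desktop"` value for `surface` are illustrative.

```typescript
interface SurfaceCrop { y: number; height: number; maxMismatchRatio?: number }
interface BaselineSection { id: string; desktop?: SurfaceCrop; mobile?: SurfaceCrop }
interface Baseline {
  desktopScreenshotPath?: string;
  mobileScreenshotPath?: string;
  maxMismatchRatio?: number;
  sections?: BaselineSection[];
}
interface Interaction {
  id: string;
  surface?: "desktop" | "mobile"; // "desktop" is assumed; the README only shows "mobile"
  action: "hover" | "click";
  selector: string;
  expectedVisibleText?: string[];
  expectedVisibleSelector?: string;
}
interface RenderAuditConfig {
  mode?: "off" | "auto" | "always";
  buildCommand?: string;
  previewCommand?: string;
  previewUrl?: string;
  baseline?: Baseline;
  interactions?: Interaction[];
}

// Returns a list of human-readable problems; an empty list means the config passes.
function validateRenderAudit(config: RenderAuditConfig): string[] {
  const errors: string[] = [];

  // Custom target fields are all-or-nothing.
  const targetFields = [config.buildCommand, config.previewCommand, config.previewUrl];
  const setCount = targetFields.filter((value) => value !== undefined).length;
  if (setCount !== 0 && setCount !== 3) {
    errors.push("buildCommand, previewCommand, and previewUrl must be set together");
  }

  const baseline = config.baseline;
  if (baseline) {
    // Baseline screenshot paths are all-or-nothing as well.
    const hasDesktop = baseline.desktopScreenshotPath !== undefined;
    const hasMobile = baseline.mobileScreenshotPath !== undefined;
    if (hasDesktop !== hasMobile) {
      errors.push("baseline needs both desktopScreenshotPath and mobileScreenshotPath");
    }
    // Each section needs at least one surface crop.
    for (const section of baseline.sections ?? []) {
      if (!section.desktop && !section.mobile) {
        errors.push(`section "${section.id}" needs a desktop or mobile crop`);
      }
    }
  }

  // Each interaction needs at least one visibility expectation.
  for (const interaction of config.interactions ?? []) {
    const hasText = (interaction.expectedVisibleText?.length ?? 0) > 0;
    const hasSelector = interaction.expectedVisibleSelector !== undefined;
    if (!hasText && !hasSelector) {
      errors.push(`interaction "${interaction.id}" needs expectedVisibleText or expectedVisibleSelector`);
    }
  }
  return errors;
}

const problems = validateRenderAudit({
  mode: "always",
  previewUrl: "http://127.0.0.1:4173", // build and preview commands missing
  interactions: [{ id: "products-menu", action: "hover", selector: "[data-nav-products]" }],
});
console.log(problems);
```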

Run Artifacts

Harness persists every run under:

.harness/runs/<runId>/
├── state.json
├── plan.json
├── score-history.json
├── logs/
│   ├── planner.log
│   ├── generator.log
│   └── evaluator.log
└── iterations/
    ├── iteration-1/
    │   ├── generation.json
    │   ├── evaluation.json
    │   └── render-audit/
    │       ├── report.json
    │       └── console.json
    └── ...

These artifacts are what make resume, debugging, and self-optimization practical.

When baseline diff is enabled, report.json includes visualDiff metadata and optional sectionAudits. When interaction audits are configured, report.json also includes interactionAudits. The render-audit directory also contains desktop-diff.png / mobile-diff.png, per-section diff images such as hero-desktop-diff.png, and interaction screenshots such as products-menu-desktop-interaction.png. console.json now includes browser failedRequests and httpFailures in addition to console and page errors.
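
As an illustration of how this evidence could feed the downgraded statuses mentioned earlier, the sketch below maps a few counters to a status. The status names `success_with_warnings`, `resource_failed`, and `visual_mismatch` and the evidence names `failedRequests` and `httpFailures` come from this README; the `success` status, the `AuditEvidence` shape, and the order of the checks are assumptions made for the example.

```typescript
type AuditStatus =
  | "success" // assumed name for a clean run
  | "success_with_warnings"
  | "resource_failed"
  | "visual_mismatch";

// Hypothetical summary of what report.json and console.json record.
interface AuditEvidence {
  consoleErrors: number;     // console and page errors
  failedRequests: number;    // browser requests that failed outright
  httpFailures: number;      // responses with failing HTTP status
  mismatchRatio?: number;    // present only when baseline diff is enabled
  maxMismatchRatio?: number; // threshold from the baseline config
}

function classifyAudit(evidence: AuditEvidence): AuditStatus {
  // Assumed precedence: resource problems, then visual diff, then warnings.
  if (evidence.failedRequests > 0 || evidence.httpFailures > 0) {
    return "resource_failed";
  }
  if (
    evidence.mismatchRatio !== undefined &&
    evidence.maxMismatchRatio !== undefined &&
    evidence.mismatchRatio > evidence.maxMismatchRatio
  ) {
    return "visual_mismatch";
  }
  if (evidence.consoleErrors > 0) {
    return "success_with_warnings";
  }
  return "success";
}

console.log(
  classifyAudit({ consoleErrors: 0, failedRequests: 0, httpFailures: 0, mismatchRatio: 0.2, maxMismatchRatio: 0.12 }),
);
```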

Quality and Evaluation Model

The Evaluator is intentionally skeptical. Recent changes in the codebase include:

  • zero-trust evaluation guidance
  • verification-first completion guidance
  • scale-aware evaluation instructions for small / standard / large tasks
  • render-audit evidence injection into the prompt
  • gating logic that can downgrade or fail visually sensitive frontend work when browser evidence is missing
  • low-trust status-document guidance and gating, so status ledgers cannot self-certify implementation without code/tests/structured evidence

This is one of the most important design decisions in the project: the Generator and Evaluator are intentionally separate so the system does not grade its own output too generously.

Self-Optimization

Harness can be used to optimize Harness itself, but treat such a run as an experiment, not a casual local run.

Key rules:

  • use an isolated Git worktree outside the main repo
  • copy any uncommitted local snapshot into that worktree
  • build first if you want to validate the real shipped CLI path
  • explicitly pin Gemini models in config
  • prefer status and resume over manual state editing
  • review worktree diffs before merging anything back

Repository Guide

Relevant source areas:

  • src/core/harness/: runtime orchestration
  • src/core/state/: persisted state and artifact storage
  • src/core/renderAudit/: browser render-audit execution
  • src/agents/planner/: planning pipeline
  • src/agents/generator/: generation pipeline
  • src/agents/evaluator/: evaluation pipeline and gating logic

Development

python3 ../probe/cli.py run --target-repo-path . --suite unit
npm run typecheck
npm run build
npm run lint:naming
npm run lint:arch
npm run sync:prompt-fragments
npm run mission:run -- missions/01-happy-path.yaml

Formal unit tests for harness are managed by the private probe module. The public repository keeps only the source tree and test-facing fixtures/docs; the tracked tests/ suite lives in repos/probe/assets/harness/unit/ inside the Chariot workspace.

The repository uses strong guardrails:

  • TDD is expected for feature work
  • file naming is enforced
  • architecture boundaries are tested
  • documentation contracts exist for key operational docs
  • prompt fragment synchronization is tested

Known Gaps

The project is materially more capable than the old README suggested, but there are still obvious next steps:

  • render-audit target detection is currently narrow
  • some evaluator-scale guidance still needs to avoid task-specific hard-coding
  • browser render audit is useful evidence, but not yet a full visual design judge

Those gaps are real, but they are now layered on top of a substantially stronger orchestration core.

License

MIT