agents-harness
v0.3.2
Multi-agent orchestrator for autonomous software development
agents-harness
A multi-agent orchestrator for autonomous software development. Three AI agents — Planner, Generator, and Evaluator — work together in a loop to turn your feature spec into working code.
Built on the architecture described in Anthropic's engineering blog post: Harness Design for Long-Running Apps. The core idea: separate generation from evaluation (like a GAN), reset context between agent invocations to prevent degradation, and use file-based handoffs so each agent starts fresh.
How It Works
You write a spec
|
v
[Planner] -----> Expands spec, breaks it into sprints, writes contracts
|
v
[Generator] ----> Implements the sprint contract (reads/writes/edits code)
|
v
[Evaluator] ----> Critically tests the implementation against the contract
|
PASS? ----yes--> Next sprint (or done)
|
no
|
v
[Generator] ----> Tries again with evaluator feedback
|
v
(loop up to max attempts)
Each agent gets a fresh context on every invocation — no accumulated confusion from long conversations. State is passed between agents via files in the .harness/ directory, not conversation history.
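The control flow in the diagram can be sketched roughly as follows. This is an illustrative sketch, not the harness's actual source: the `Agents` interface, `Verdict`, `SprintResult`, and `runLoop` are hypothetical names, the agents are modeled as synchronous functions for brevity, and stopping the run after a failed sprint is an assumption.

```typescript
// Hypothetical types; the real harness internals are not shown in this README.
type Verdict = { passed: boolean; feedback: string };

interface Agents {
  plan(spec: string): string[]; // spec -> one contract per sprint
  generate(contract: string, feedback?: string): string; // contract -> implementation
  evaluate(contract: string, work: string): Verdict; // implementation -> verdict
}

type SprintResult = { contract: string; passed: boolean; attempts: number };

function runLoop(spec: string, agents: Agents, maxAttempts = 3): SprintResult[] {
  const results: SprintResult[] = [];
  for (const contract of agents.plan(spec)) {
    let feedback: string | undefined;
    let passed = false;
    let attempts = 0;
    // Generate, evaluate, and retry with evaluator feedback until the
    // sprint passes or the attempt limit is reached.
    while (!passed && attempts < maxAttempts) {
      attempts += 1;
      const work = agents.generate(contract, feedback);
      const verdict = agents.evaluate(contract, work);
      passed = verdict.passed;
      feedback = verdict.feedback;
    }
    results.push({ contract, passed, attempts });
    if (!passed) break; // assumption: a failed sprint ends the run
  }
  return results;
}
```

Only the evaluator's feedback is carried from one attempt to the next, which mirrors the fresh-context idea: nothing else survives between invocations in this sketch.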
Quick Start
1. Install
npm install -g agents-harness
2. Set your API key
# Option A: Environment variable
export ANTHROPIC_API_KEY=sk-ant-...
# Option B: Global config
agents-harness config set api-key sk-ant-...
3. Run
cd your-project
agents-harness run "Add user authentication with email/password login and JWT tokens"
That's it. The harness will plan, implement, and test the feature across multiple sprints.
Commands
run — Start a new run
agents-harness run "<spec>"
Give it a feature description and it handles the rest.
Options:
| Flag | Description | Default |
|------|-------------|---------|
| -s, --scope <workspaces...> | Limit to specific workspaces (monorepo) | All |
| --max-attempts <n> | Max retry attempts per sprint | 3 |
| --max-budget <n> | Max total spend in USD | 50 |
| --no-dashboard | Disable the live web dashboard | Dashboard enabled |
| --port <n> | Dashboard port | 3117 |
| --model <model> | Claude model for all agents | — |
| --planner-model <model> | Claude model for the planner agent | opus |
| --generator-model <model> | Claude model for the generator agent | opus |
| --evaluator-model <model> | Claude model for the evaluator agent | sonnet |
Available models: opus, sonnet, haiku (or full model IDs like claude-opus-4-6)
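One plausible way to read the model flags together is sketched below. The README does not state how `--model` interacts with the per-agent flags, so the precedence here (per-agent flag, then `--model`, then the documented default) is an assumption, and `ModelFlags` and `resolveModels` are hypothetical names.

```typescript
type AgentName = "planner" | "generator" | "evaluator";

// Defaults taken from the options table above.
const DEFAULT_MODELS: Record<AgentName, string> = {
  planner: "opus",
  generator: "opus",
  evaluator: "sonnet",
};

// Hypothetical shape for the parsed CLI flags.
interface ModelFlags {
  model?: string;
  plannerModel?: string;
  generatorModel?: string;
  evaluatorModel?: string;
}

// Assumed precedence: per-agent flag, then --model, then the default.
function resolveModels(flags: ModelFlags): Record<AgentName, string> {
  return {
    planner: flags.plannerModel ?? flags.model ?? DEFAULT_MODELS.planner,
    generator: flags.generatorModel ?? flags.model ?? DEFAULT_MODELS.generator,
    evaluator: flags.evaluatorModel ?? flags.model ?? DEFAULT_MODELS.evaluator,
  };
}
```

Under this reading, `--planner-model sonnet --generator-model opus` leaves the evaluator on its default, `sonnet`.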
Examples:
# Simple feature (dashboard opens at http://localhost:3117)
agents-harness run "Add a /health endpoint that returns 200 OK"
# With budget limit
agents-harness run "Refactor the auth module to use OAuth2" --max-budget 20
# Monorepo — only touch the backend
agents-harness run "Add pagination to the users API" --scope packages/api
# Disable the dashboard for CI or headless environments
agents-harness run "Build a notification system" --no-dashboard
# Use Sonnet for all agents (cheaper)
agents-harness run "Add a /health endpoint" --model sonnet
# Mix models — Opus for generation, Sonnet for planning/evaluation
agents-harness run "Build auth system" --planner-model sonnet --generator-model opus
init — Initialize project config (optional)
agents-harness init
Creates a .harness/ directory with:
- config.yaml — agent models, budget limits, attempt limits
- criteria.md — custom evaluation criteria template
The harness works without init — it auto-detects your stack. Only run this if you want to customize settings.
Example output:
Detected project:
Repository type: single
Workspace: .
Language: typescript
Framework: next.js
Test runner: vitest
Test command: npx vitest run
CLAUDE.md: found
Created .harness/config.yaml
Created .harness/criteria.md
status — Check run progress
agents-harness status
Shows the current state of a run — which sprint you're on, pass/fail status, and cost.
Example output:
Status: RUNNING
Spec: Add user authentication with email/password login...
Started: 2025-03-28T10:30:00.000Z
Phase: evaluate
Cost: $2.45 / $50.00
Sprints: 2 / 3
[PASS] Sprint 1 — 1 attempt, $0.85
[....] Sprint 2 — 2 attempts, $1.60
[ ] Sprint 3
resume — Resume a stopped run
agents-harness resume
Picks up where a stopped or failed run left off. Skips completed sprints.
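As a rough illustration of "skips completed sprints", the selection step might look like the sketch below. The `SprintState` shape, its status values, and `sprintsToRun` are hypothetical; the README does not show how run state is stored on disk.

```typescript
// Hypothetical per-sprint state record.
type SprintStatus = "passed" | "in_progress" | "failed" | "pending";

interface SprintState {
  id: number;
  status: SprintStatus;
}

// Keep every sprint that has not passed yet, in its original order.
function sprintsToRun(sprints: SprintState[]): SprintState[] {
  return sprints.filter((s) => s.status !== "passed");
}

// State matching the status example above: sprint 1 passed,
// sprint 2 was interrupted, sprint 3 has not started.
const remaining = sprintsToRun([
  { id: 1, status: "passed" },
  { id: 2, status: "in_progress" },
  { id: 3, status: "pending" },
]);
```

Here `remaining` holds sprints 2 and 3, so a resumed run would restart at sprint 2.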
Options:
| Flag | Description | Default |
|------|-------------|---------|
| --max-budget <n> | Max total spend in USD | 50 |
| --no-dashboard | Disable the live web dashboard | Dashboard enabled |
| --port <n> | Dashboard port | 3117 |
| --model <model> | Claude model for all agents | — |
| --planner-model <model> | Claude model for the planner agent | opus |
| --generator-model <model> | Claude model for the generator agent | opus |
| --evaluator-model <model> | Claude model for the evaluator agent | sonnet |
Example:
# Hit Ctrl+C during a run, then later:
agents-harness resume
# Resume with a higher budget
agents-harness resume --max-budget 100
# Resume without dashboard
agents-harness resume --no-dashboard
# Resume with different models
agents-harness resume --model sonnet
config — Manage global settings
agents-harness config set <key> <value>
agents-harness config get <key>
Examples:
# Save your API key globally
agents-harness config set api-key sk-ant-api03-...
# Check what's set
agents-harness config get api-key
Config is stored at ~/.agents-harness/config.yaml.
Configuration
Zero-config (default)
The harness auto-detects your project:
- Language — TypeScript, Python, Rust, Go
- Framework — Next.js, Django, etc.
- Test runner — vitest, jest, pytest, cargo test, go test
- Repo type — single repo or monorepo (npm workspaces, pnpm, lerna)
- CLAUDE.md — reads project conventions if present
Custom config (optional)
Run agents-harness init, then edit .harness/config.yaml:
agents:
  planner:
    model: sonnet
  generator:
    model: opus
    maxTurns: 100
  evaluator:
    model: sonnet
max_attempts_per_sprint: 3
max_budget_per_sprint_usd: 5
max_total_budget_usd: 50
Available models: opus, sonnet, haiku
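To make the three limit keys concrete, here is a minimal sketch of a guard that decides whether another attempt may start. The key names and values come from the example config; the `Limits` interface, the `canStartAttempt` function, and the rule that a limit is exhausted once it is reached are assumptions, since the README does not describe how the harness enforces them.

```typescript
// Limit keys as they appear in the example config.
interface Limits {
  max_attempts_per_sprint: number;
  max_budget_per_sprint_usd: number;
  max_total_budget_usd: number;
}

// Values copied from the example config.
const exampleLimits: Limits = {
  max_attempts_per_sprint: 3,
  max_budget_per_sprint_usd: 5,
  max_total_budget_usd: 50,
};

// Hypothetical guard: true only while all three limits have headroom.
function canStartAttempt(
  limits: Limits,
  attemptsSoFar: number,
  sprintCostUsd: number,
  totalCostUsd: number,
): boolean {
  return (
    attemptsSoFar < limits.max_attempts_per_sprint &&
    sprintCostUsd < limits.max_budget_per_sprint_usd &&
    totalCostUsd < limits.max_total_budget_usd
  );
}
```

For instance, a sprint on its third attempt with $1.60 spent would still be allowed under these limits, while a sprint that has already used three attempts would not.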
Custom evaluation criteria
Edit .harness/criteria.md to add project-specific rules:
# Custom Evaluation Criteria
- All API endpoints must return proper HTTP status codes
- Database migrations must be reversible
- All user-facing strings must be internationalized
These are checked in addition to the built-in defaults (correctness, testing, code quality, integration).
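The "in addition to" behavior can be pictured as appending the bullets from criteria.md to the built-in list. The sketch below assumes one criterion per `- ` bullet; `BUILT_IN`, `parseCriteria`, and `allCriteria` are hypothetical stand-ins, and the real `DEFAULT_CRITERIA` export may have a different shape.

```typescript
// Placeholder for the built-in defaults named above; not the real export.
const BUILT_IN: string[] = ["correctness", "testing", "code quality", "integration"];

// Collect every "- " bullet from a criteria file, ignoring headings and prose.
function parseCriteria(markdown: string): string[] {
  return markdown
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^- /.test(line))
    .map((line) => line.replace(/^- /, "").trim());
}

// Built-in criteria first, then the project-specific ones.
function allCriteria(markdown: string): string[] {
  return BUILT_IN.concat(parseCriteria(markdown));
}

const exampleCriteria = [
  "# Custom Evaluation Criteria",
  "- All API endpoints must return proper HTTP status codes",
  "- Database migrations must be reversible",
  "- All user-facing strings must be internationalized",
].join("\n");

const merged = allCriteria(exampleCriteria);
```

With the example file, `merged` contains the four built-in entries followed by the three custom rules.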
Live Dashboard
The dashboard starts automatically at http://localhost:3117 on every run. It provides a split-panel UI for monitoring the entire harness lifecycle.
# Dashboard is on by default
agents-harness run "Build a feature"
# Dashboard: http://localhost:3117
# Disable for CI or headless environments
agents-harness run "Build a feature" --no-dashboard
Left panel:
- Phase pipeline (Plan → Decompose → Contract → Generate → Evaluate → Handoff) with active/done states
- Sprint cards with status, attempt count, cost, and evaluation criteria
Right panel:
- File viewer with tabs (Spec, Sprints, Contract, Evaluation, Handoff)
- Live file updates via WebSocket as agents write to .harness/ files
- Auto-switches to the relevant tab when the phase changes
Bottom:
- Collapsible activity stream (every file read, edit, bash command)
- Cost tracking with budget progress bar
- Auto-reconnects if the connection drops
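The active/done states in the phase pipeline can be derived from the current phase alone, as in this sketch. The lowercase identifiers are an assumption modeled on the `Phase: evaluate` line in the status output, and `pipelineStates` is a hypothetical helper, not part of the dashboard's code.

```typescript
// Phase order as listed for the left panel; identifier casing is assumed.
const PHASES = ["plan", "decompose", "contract", "generate", "evaluate", "handoff"] as const;
type Phase = (typeof PHASES)[number];
type PhaseState = "done" | "active" | "pending";

// Phases before the current one are done, the current one is active,
// and everything after it is still pending.
function pipelineStates(current: Phase): Record<Phase, PhaseState> {
  const idx = PHASES.indexOf(current);
  const out = {} as Record<Phase, PhaseState>;
  PHASES.forEach((p, i) => {
    out[p] = i < idx ? "done" : i === idx ? "active" : "pending";
  });
  return out;
}

const states = pipelineStates("evaluate");
```

In this example `states.generate` is `"done"`, `states.evaluate` is `"active"`, and `states.handoff` is `"pending"`.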
Programmatic API
Use agents-harness as a library in your own tools:
import { Harness } from "agents-harness";
const harness = new Harness({
  apiKey: process.env.ANTHROPIC_API_KEY!,
  root: "/path/to/project",
  maxTotalBudgetUsd: 20,
});
harness.on("phase:start", (data) => {
  console.log(`Phase: ${data.phase}, Sprint: ${data.sprint}`);
});
harness.on("evaluation", (data) => {
  console.log(`Sprint ${data.sprint}: ${data.result.passed ? "PASS" : "FAIL"}`);
});
harness.on("run:complete", (data) => {
  console.log(`Done — ${data.status}, cost: $${data.totalCostUsd.toFixed(2)}`);
});
await harness.run("Add a REST API for managing todos");
Exported classes and functions
| Export | Description |
|--------|-------------|
| Harness | Main orchestrator class |
| ContextManager | Wraps Agent SDK with fresh context per call |
| FileProtocol | Manages .harness/ directory state |
| DashboardServer | HTTP + WebSocket dashboard server |
| buildProjectContext | Auto-detect project stack and config |
| detectStack | Detect language, framework, test runner |
| buildSystemPrompt | Build agent system prompts |
| DEFAULT_CRITERIA | Built-in evaluation criteria |
The Three Agents
| Agent | Model | Role | Tools |
|-------|-------|------|-------|
| Planner | Opus | Writes specs, decomposes into sprints, writes contracts | Read, Write |
| Generator | Opus | Implements code based on the contract | Read, Edit, Write, Bash, Glob, Grep |
| Evaluator | Sonnet | Critically tests implementation against contract | Read, Bash, Grep, Glob |
Key design principle from the Anthropic article: the generator never evaluates its own work. A separate evaluator with fresh context provides unbiased assessment.
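The tool column of the table can be read as a per-agent allowlist. The sketch below encodes it as data; `Role`, `TOOLS`, and `canUse` are hypothetical names for illustration, and how the harness actually passes tool permissions to the Agent SDK is not shown in this README.

```typescript
type Role = "planner" | "generator" | "evaluator";

// Tool lists copied from the table above.
const TOOLS: Record<Role, readonly string[]> = {
  planner: ["Read", "Write"],
  generator: ["Read", "Edit", "Write", "Bash", "Glob", "Grep"],
  evaluator: ["Read", "Bash", "Grep", "Glob"],
};

// True when the given role is allowed to call the given tool.
function canUse(role: Role, tool: string): boolean {
  return TOOLS[role].indexOf(tool) !== -1;
}
```

Note that the evaluator's list omits the Edit and Write tools, which fits its role of testing the generator's work instead of changing it.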
Requirements
- Node.js 18+
- An Anthropic API key
- The @anthropic-ai/claude-agent-sdk package (peer dependency)
Credits
This project is built on the harness architecture described in Anthropic's engineering article: Harness Design for Long-Running Apps. The article introduces the pattern of separating generation from evaluation, using fresh context windows per agent invocation, and file-based state handoffs to enable reliable multi-hour autonomous coding sessions.
License
ISC
