agent-execution-harness
v0.14.4
A transactional execution harness for AI coding agents with evidence-backed reports.
Agent Execution Harness
Agent Execution Harness helps AI coding agents work like disciplined software engineers instead of improvising through your codebase.
It gives the agent a repeatable operating system for software work:
It gives the agent a repeatable operating system for software work:

```
understand -> plan -> read relevant context -> execute one task -> verify -> record evidence -> report honestly -> remember useful lessons
```

The goal is simple: make AI-assisted development more reliable, auditable, and cheaper in tokens.
What's New In v0.14.4
This patch lets agents reuse an approved plan whether it was saved as a file or produced in chat.
- added `plan import --from -` for pasted or piped chat plans
- kept `plan import --from backlog.md` for file-based plans
- protected existing `plan.json` files from silent overwrite
- replacing an existing plan now requires explicit `--overwrite`
In plain language: if Codex or OpenCode already wrote the plan, the harness can turn that approved text into plan.json without recreating or overwriting it silently.
What's New In v0.14.3
This patch adds dispatch guidance for agents that may have subagents.
- added `dispatch plan` and `dispatch next --batch`
- dispatch returns a safe serial fallback or parallel handoff packets
- worker JSON validation stays with `handoff validate`
- dispatch refuses to create a new batch while a run already has active work
- runnable tasks are no longer also reported as blocked
In plain language: the harness can now tell an agent when to use subagents and when to stay serial, without inventing worker validation commands.
What's New In v0.14.2
This patch helps agents turn an approved text backlog into an executable plan.json.
- added `plan import` for the atomic Markdown backlog format
- missing-plan errors now show the import -> lint -> session flow
- the harness still does not guess from chat history; the backlog must be saved to a file
In plain language: a plan in chat can now become a real Harness plan without the agent recreating it freely.
What's New In v0.14.1
This patch removes ambiguity from the prompt users give to coding agents.
- added a copy-paste execution prompt with the real harness command
- clarified that Stetix-style projects use `pnpm agent:harness`
- added a test so public docs/templates do not reintroduce missing-command placeholders
In plain language: weaker agents should receive the exact command to run, not a placeholder they need to interpret.
What's New In v0.14.0
This release makes the harness more token-first for weaker agents.
- `doctor --coverage`: shows compact gaps in project safety controls.
- `doctor --architecture`: checks lightweight boundary rules without new dependencies.
- topology detection helps recommend controls for CLI, web, API and Supabase projects.
- `plan lint` and templates now reinforce surgical coding discipline with short rules.
- token benchmarks cap the new doctor outputs so routine agent loops stay cheap.
In plain language: weaker agents get clearer rails and shorter diagnostics before they guess, over-edit or claim success without proof.
What's New In v0.13.2
This patch improves how agents react to repeated failures.
- repeated failures now tell agents to inspect local code/history first
- docs or web research is reserved for external dependency behavior
- agents should compare two possible fixes instead of guessing repeatedly
- the rule stays compact and keeps the install token budget intact
In plain language: when the same error appears again, the agent should stop guessing, look locally first, research only when needed, then choose the smallest safe fix.
What's New In v0.13.1
This patch improves the instructions installed into AGENTS.md.
- agents are reminded to read before writing
- risky ambiguity should stop or ask instead of guessing
- changes should stay surgical and avoid unrelated refactors
- success criteria and evidence come before claiming completion
- the rules stay compact so the install does not waste tokens
In plain language: new installs and updates give the coding agent clearer rails without turning AGENTS.md into a long manifesto.
What's New In v0.13.0
This release hardens the harness for weaker agents and safer public use.
- stable test timeout for slower machines
- dependency audit clean at moderate severity
- CI/release now run secret scan and dependency audit
- stronger dangerous-command detection
- safer command guidance: prefer `--exec` + `--args-json` over free shell
- richer optional plan controls: forbidden files, expected diff, required checks and rollback command
- stricter high-confidence memory: source files and main-agent validation are required
- benchmark smoke now fails if false success appears or out-of-plan diff is not blocked
In plain language: the harness now does more to stop weaker agents from guessing, touching the wrong files, or claiming success without proof.
What's New In v0.12.4
This patch fixes install/update scripts for existing projects.
- `agent:harness` now points to the root CLI: `agent-harness`
- existing projects with the legacy `agent-harness run` script are upgraded safely
- `pnpm agent:harness doctor`, `--help`, `--version` and session commands work from the same script

In plain language: after installing or updating, users can run all harness commands from `pnpm agent:harness ...` without command routing errors.
What's New In v0.12.3
This patch adds plain version output.
- `agent-harness --version`
- `agent-harness -v`
- `agent-harness version`
In plain language: after installing, users can check the installed version without seeing a JSON error.
What's New In v0.12.2
This patch polishes install output for first-time users.
- `init --apply` now says "Files updated safely" instead of exposing technical action labels.
- The JSON hint is shorter and marked as advanced.
In plain language: installation output is less noisy and easier to understand.
What's New In v0.12.1
This patch makes install and readiness output easier for humans.
- `init` prints a short success message by default.
- `doctor` prints a readable readiness report by default.
- `--json` keeps structured output available for CI, scripts and advanced automation.
In plain language: beginners see "installed successfully" plus the next command to run. Machines can still ask for JSON.
What's New In v0.12.0
This release adds lightweight learning-memory health checks.
- `learn health --compact`: tells the agent when memory needs cleanup.
- `learn audit --compact`: lists stale, duplicate or low-confidence lessons in a short read-only report.
- `session start` can return `learning_health=needs_audit`, so agents can audit memory without the user remembering commands.
In plain language: the harness can notice when its lesson notebook is getting noisy and ask the agent to do a compact review. It does not delete lessons automatically.
What's New In v0.11.1
This patch makes installation easier to understand.
- `init --apply` now says clearly when the harness was installed successfully.
- The output explains what happened to `AGENTS.md`: appended, overwritten, created, or left unchanged.
- The next steps show exact doctor and rollback commands.
In plain language: after installing, you should no longer need to guess whether it worked.
What's New In v0.11.0
This release improves the harness learning loop without adding embeddings, extra AI agents, or long reports.
- `learn validate`: lessons must be validated before promotion.
- Smarter `learn query`: rank lessons by touched files and failure signature.
- Repeated failure hint: `verify` can suggest a short learning action after equivalent failures.
- Token budgets: validation output and learning hints are capped so routine agent output stays compact.
In plain language: the harness remembers useful lessons more safely, but still talks to the agent in short, cheap messages.
Works With Non-Frontier Agents Too
You do not need a frontier model to benefit from this harness.
Agent Execution Harness is designed to help weaker, cheaper, local, junior, or low-context coding agents execute software work more safely. In weak mode, the harness turns broad implementation work into small deterministic steps:
- one exact next command with `next --exact`
- fewer files per task
- typed evidence instead of vague status updates
- short repair hints when the agent gets stuck
- blocked completion when the agent changes files outside the declared plan
- compact artifacts so the agent does not need to reread the whole repository
In plain language: a strong model may use the harness as discipline. A weaker model uses it as rails.
You can also use a strong model as the planner/reviewer and a cheaper or weaker model as the worker. The handoff flow gives that worker one compact task capsule, then validates its JSON output before the work can be trusted.
AI agents are useful, but they often fail in the same ways:
- they change files before understanding the task
- they skip steps from the plan
- they say tests passed when no test was run
- they invent files, commands, APIs, or validations
- they declare "done" without proof
This project adds a small execution system around the agent.
It does not try to make the model smarter. It makes the agent easier to guide, audit, and stop when the work becomes unsafe.
In plain language: it is a checklist, memory, learning notebook, and flight recorder for AI-assisted software development.
It helps an AI agent execute software plans in a more organized way by forcing the agent to:
- follow a plan task by task
- declare which files it expects to touch
- run explicit checks
- record evidence
- verify claims before saying "done"
- stop instead of guessing when work becomes unsafe
Why Use This?
Use this repo when you want an AI coding agent to:
- create a clear plan before risky work
- execute that plan step by step
- keep a record of what happened
- run checks and attach evidence
- remember useful codebase context for future tasks
- avoid rereading the whole project every time
- avoid claiming success without proof
The harness is especially useful for:
- bug fixes
- refactors
- multi-step features
- AI-assisted code review
- teams experimenting with autonomous coding agents
- projects where "trust me, it works" is not good enough
The most important benefit is not speed. It is controlled speed.
Without a harness, an agent can move fast and still leave you unsure whether it understood the task, ran the right checks, or changed the right files. With the harness, every important step leaves an artifact: the plan, touched files, commands, evidence, verified claims, and rollback notes.
That turns AI coding from a chat conversation into an engineering workflow you can inspect.
The Full Harness Flow
This is the day-to-day flow the harness tries to enforce:
```mermaid
flowchart TD
  A["User asks for a bugfix, feature, or review"] --> B["Agent classifies risk and creates or reads a plan"]
  B --> C{"Simple low-risk work?"}
  C -->|Yes| D["Read the touched file directly"]
  C -->|No: broad, risky, or unclear| E["Query codebase memory with map"]
  E --> F["Query learned lessons with learn"]
  D --> G["Declare expected files"]
  F --> G
  G --> H["Execute one task at a time"]
  H --> I["Run a real gate: test, typecheck, lint, build, smoke, or custom command"]
  I --> J["Store evidence: command, exit code, output excerpt, log ref, sha256"]
  J --> K["Verify claims before final report"]
  K --> L{"Required evidence complete?"}
  L -->|Yes| M["Status: completed"]
  L -->|No| N["Status: partial_validated or halt"]
  M --> O["Update map and capture useful lessons"]
  N --> O
  O --> P["Next agent starts with better context and fewer repeated mistakes"]
```

The important part: the agent does not get to say "done" just because it feels confident.
It must prove the work.
The Three Memory Layers
The harness now separates memory into three practical layers:
| Layer | What it answers | Example |
|---|---|---|
| Plan artifact | What was supposed to happen? | "Fix login bug, touch src/auth/session.ts, run focused auth tests." |
| Codebase memory | Where does this logic live? | "Auth session contracts live in src/auth and affect guards." |
| Learning memory | What did we learn from previous failures? | "When session state changes, also test authorization guards." |
This matters because agents waste tokens and make mistakes when they rediscover the same project structure or repeat the same bug pattern. The harness stores compact, evidence-backed context so the next run starts from better information without loading the whole repository.
Truth still has a strict order:
```
source code > current tests/runtime > canonical docs > evidence > promoted lessons > old chat
```

Memory helps the agent. It never replaces checking the real code.
What You Get
After installation, your project gets:
- `AGENTS.md` rules that tell the AI agent how to behave
- `agent-harness.config.json` for local policy and artifact settings
- plan validation
- execution artifacts
- evidence-backed final reports
- codebase memory commands
- safety checks for risky commands
- compact output modes to reduce token usage
- governed learning loop for evidence-backed lessons
- control catalog showing which risks each harness control covers
- harnessability scoring to show how ready a project is for AI-agent execution
- coverage and architecture diagnostics for compact risk gaps
- repeated-failure steering to suggest small controls after recurring mistakes
- optional approved fixtures for critical behavior that must not be guessed
The intended day-to-day experience is simple:
```
You: Find this bug.
Agent: Investigates and proposes a plan.
You: Execute the plan using the harness.
Agent: Executes step by step, records evidence, and reports the artifact.
You: Show me proof.
Agent: Shows run_id, artifact, checks, evidence, claims, and rollback.
```

If the agent cannot show evidence, the work is not complete.
Table Of Contents
- Why Use This?
- What You Get
- Quick Start
- Which Mode Should I Use?
- Codebase Memory Diagram
- Learning Loop
- What Problem Does This Solve?
- For Non-Technical Users
- Installation Options
- Common Confusions
- Troubleshooting
- Why Not Just Prompts?
- Core Concepts For Developers
- CLI Reference
- Configuration
- Safety Model
- Local Development
Useful Links
Quick Start
Use this if you want to try the harness in an existing project.
AI agents should read docs/agent-runtime.md for the short runtime protocol. This README is for humans.
Text Backlog To Plan
If Codex or another planner already produced an approved atomic Markdown backlog, save it as backlog.md, then run:
```
agent-harness plan import --from backlog.md --out plan.json --plan-id my-plan --risk L2 --rollback "Delete generated files."
agent-harness plan-lint --plan plan.json
agent-harness session start --plan plan.json --run-id my-plan --mode weak
```

If the approved plan exists only in a chat response, paste or pipe that text through stdin:

```
agent-harness plan import --from - --out plan.json --plan-id my-plan --risk L2 --rollback "Delete generated files."
```

`plan import` does not overwrite an existing output file by default. Use `--overwrite` only when replacing the previous `plan.json` is intentional. Once `plan.json` exists, the harness reuses it through `--plan plan.json`; dispatch, handoff and session commands do not recreate the plan.
Supported task format:
```
- [ ] **Tarefa [1]**: Ajustar arquivo em `src/file.ts`.
  - **Dependência:** Nenhum
  - **DoD:** `pnpm test:run tests/unit/file.test.ts` passa.
```

The importer is intentionally narrow. It converts the known backlog format from a file or stdin; it does not infer plans from free-form chat.
Dependencies must be `Nenhum` or `Tarefa N`; invalid dependency text fails instead of being ignored.
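The fail-instead-of-ignore dependency rule can be sketched as a tiny validator. This is an illustrative sketch only, not the importer's actual code: the function name and error message are invented here, and the real importer may accept more variants.

```python
import re

def parse_dependency(text: str) -> list[int]:
    """Parse a backlog 'Dependência' value: 'Nenhum' or 'Tarefa N'.

    Returns the task numbers the task depends on.
    Raises ValueError for anything else instead of silently ignoring it.
    """
    text = text.strip().rstrip(".")
    if text == "Nenhum":
        return []
    match = re.fullmatch(r"Tarefa (\d+)", text)
    if match is None:
        raise ValueError(f"invalid dependency text: {text!r}")
    return [int(match.group(1))]

print(parse_dependency("Nenhum"))    # []
print(parse_dependency("Tarefa 3"))  # [3]
```

The point of the sketch is the last branch: free-form dependency text raises an error rather than being dropped, which is what keeps the imported plan trustworthy.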
Which Mode Should I Use?
| Mode | Use when | What it optimizes |
|---|---|---|
| standard | normal AI coding agent, normal task | balanced speed and evidence |
| weak | cheaper model, local model, junior agent, or low-context executor | smaller steps, compact output, repair hints |
| strict | sensitive work or less trusted executor | only declared structured commands can pass |
| handoff | strong model plans/reviews while another model executes one task | compact delegation with JSON validation |
Simple rule: use standard by default, weak when the agent drifts, strict when command control matters, and handoff when you want one model to plan and another model to execute.
Control Catalog
The harness keeps a small catalog of controls so users can see what is protected and what is not.
Examples:
- `plan_lint`: catches invalid plans before execution.
- `scope_guard`: blocks finish when files outside the plan changed.
- `evidence_policy`: blocks success claims without required proof.
- `strict_command_policy`: blocks undeclared shell commands in strict mode.
- `handoff_validate`: checks JSON returned by weak or external workers.
This is intentionally metadata, not a heavy policy engine. The goal is auditability with near-zero token cost.
Harnessability
`doctor --harnessability` checks whether a project is easy for AI agents to work in safely:

```
agent-harness doctor --harnessability --cwd .
```

By default, doctor prints a human-readable report. For JSON output:

```
agent-harness doctor --json --harnessability --cwd .
```

It scores cheap local signals such as scripts, AGENTS.md, harness config, tests, runtime docs, artifact policy and command policy. A low score does not mean the project is bad. It means agents have fewer rails and may need smaller plans or stronger review.
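Conceptually, this kind of score is a sum over cheap presence checks. The toy sketch below illustrates the idea only: the real signals and weights are internal to the harness, and the file list and weights here are invented for the example.

```python
from pathlib import Path

# Invented signal list: file to look for and a made-up weight.
SIGNALS = {
    "AGENTS.md": 25,
    "agent-harness.config.json": 25,
    "package.json": 25,
    "docs/agent-runtime.md": 25,
}

def harnessability(root: str) -> int:
    """Sum the weights of signals present under the project root."""
    base = Path(root)
    return sum(w for rel, w in SIGNALS.items() if (base / rel).exists())

print(harnessability("."))  # 0-100 depending on which files exist
```

The design point is that every check is a local filesystem lookup, so the score stays cheap enough to run before any task.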
Coverage And Architecture
Use these before broad, risky or weak-agent work:
```
agent-harness doctor --coverage --architecture --cwd .
```

`--coverage` shows which cheap controls are present and which are missing. `--architecture` checks optional boundary rules from `agent-harness.config.json`, such as client code importing server-only secrets.

Both outputs are short by default. Use `--json` only for automation.
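As an illustration only, a boundary rule in `agent-harness.config.json` might look like the fragment below. Every key name here is invented for the example; check the harness's own configuration reference for the real schema before copying this.

```json
{
  "architecture": {
    "boundaries": [
      {
        "from": "src/client/**",
        "forbidImports": ["src/server/secrets/**"],
        "reason": "client code must not import server-only secrets"
      }
    ]
  }
}
```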
Repeated Failure Steering
`doctor --steering` scans recent harness artifacts and suggests the smallest control when the same failure keeps happening:

```
agent-harness doctor --steering --cwd .
```

It does not auto-rewrite your rules. It only points to repeated evidence, such as out-of-plan edits or missing evidence, so a human or senior agent can decide whether to add a rule, test or checklist item.
Approved Fixtures
Approved fixtures are optional. Use them only for critical behavior where generated tests are not enough, such as auth, billing, clinical AI, data transforms or structured AI output.
```
agent-harness fixtures validate --file tests/fixtures/approved/basic-approved-fixture.json
```

A fixture must be explicitly owner-approved. This keeps the feature useful without turning every small task into a heavyweight validation process.
The Simple Path
Copy and paste these commands inside the project where you want to use the harness.
Step 1: open your project folder:
```
cd C:\Projetos\my-app
```

Step 2: preview what will be installed:

```
npx agent-execution-harness@latest init --adapter generic --cwd .
```

This preview should not change your project.
Step 3: install the harness:
```
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
```

This is the recommended command for most projects.
It adds harness rules to AGENTS.md without replacing your current instructions.
Step 4: check that installation worked:
```
npx agent-execution-harness@latest doctor --harnessability --cwd .
```

Expected result:

```
Agent Execution Harness doctor passed.
Harnessability score: 90/100
```

Step 5: tell your AI coding agent to use it:
```
Use the agent harness for approved plans, multi-step work, risky changes, and any task where you need to prove completion.
For L2/L3 tasks, run the harness automatically. The user should not need to remember to ask for it.
Read docs/agent-runtime.md first.
Do not claim success unless the harness artifact is completed and includes evidence plus verified claims.
```

Then talk normally:

```
Find this bug.
Create a plan.
Execute the approved plan using the harness.
Show me the evidence.
```

What Files Get Added?
The installer may add or update harness setup files such as:
- `AGENTS.md`
- `agent-harness.config.json`
- package scripts
- harness artifact folders
- runtime docs for the agent
If your project already has `AGENTS.md`, the recommended command uses `--agents-mode append`, so it adds a harness block instead of replacing the file.
The agent should use the harness underneath.
You do not need to memorize the commands below. They show what the agent should run behind the scenes.
Token-light flow for agents:
```
agent-harness session start --plan plan.json --run-id fix-id --summary "ctx"
agent-harness next
agent-harness files declare --files src/file.ts
agent-harness task start --task-id task-id --files src/file.ts
agent-harness verify --task-id task-id --type focused_tests --cmd "pnpm test"
agent-harness claim auto
agent-harness finish --summary "Validated."
agent-harness report --run-id fix-id --format compact
```

For weak, local, low-context or cost-sensitive executors, use the micro/compact variants:
```
agent-harness next --exact --micro
agent-harness dispatch next --batch --runtime subagents
agent-harness handoff --compact --plan plan.json --task-id task-id
agent-harness map query --surface auth --compact
agent-harness learn query --surface auth --top-k 3 --compact
```

These commands remove duplicate transport metadata from the chat output while preserving the full audit trail in artifacts and full commands. Use the normal output when a human needs to debug; use compact output when an agent only needs the next action.
For strict execution, prefer structured commands instead of shell strings:
```
agent-harness session start --plan plan.json --run-id fix-id --mode strict
agent-harness verify --task-id task-id --type focused_tests --exec pnpm --args-json "[\"test\"]"
```

`strict` mode blocks shell-style `--cmd` by default and requires the command to match the task `allowed_commands`.
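One way to produce the `--args-json` value without quoting mistakes is to serialize the argv array with a JSON encoder rather than hand-escaping it. The sketch below only assumes that the flag takes a JSON array of strings, which is what the example above shows; the argument values are invented.

```python
import json

# Structured command: a program plus an argv array, no shell string.
program = "pnpm"
args = ["test", "--filter", "auth session"]  # the space survives as ONE argument

# json.dumps produces the exact string to pass to --args-json.
args_json = json.dumps(args)
print(args_json)  # ["test", "--filter", "auth session"]
```

This is the same reason the harness prefers structured commands: an argv array has no shell parsing step, so there is nothing for an agent to mis-quote or inject.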
In weak mode, `claim auto` automatically batches claims when a plan has many tasks. The agent still runs one simple command, while the harness keeps each internal action small enough for low-context executors.
For low-context agents, use `next --exact --micro`. It returns the exact next harness command plus the stop condition, reducing ordering mistakes such as claiming early, skipping file declaration, or forgetting the active task.
For multi-step plans, tasks can declare `depends_on`. Run `agent-harness plan waves --plan plan.json` to preview safe execution order. `next --exact` then guides the agent only to tasks whose dependencies already passed evidence.
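The wave preview can be thought of as repeatedly grouping the tasks whose dependencies are already satisfied. The sketch below illustrates that grouping only; the task IDs are invented and the real `plan waves` output format may differ.

```python
def waves(tasks: dict[str, list[str]]) -> list[list[str]]:
    """Group tasks into waves: each wave depends only on earlier waves."""
    done: set[str] = set()
    result: list[list[str]] = []
    while len(done) < len(tasks):
        ready = sorted(
            t for t, deps in tasks.items()
            if t not in done and all(d in done for d in deps)
        )
        if not ready:  # nothing runnable: cycle or unknown dependency
            raise ValueError("cycle or unknown dependency in plan")
        result.append(ready)
        done.update(ready)
    return result

# Hypothetical plan: task-3 depends on task-1 and task-2.
plan = {"task-1": [], "task-2": [], "task-3": ["task-1", "task-2"]}
print(waves(plan))  # [['task-1', 'task-2'], ['task-3']]
```

Tasks inside one wave are safe to hand to parallel workers; tasks in later waves must wait for evidence from earlier ones.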
The scope guard also checks the real git diff before `finish`. If the agent changed a product/source file outside the plan, the run stops with a `repair_hint` instead of pretending success. In plain language: the agent can only finish if the files it touched match the files it declared.
Dispatch Guidance
Use dispatch when the agent may have subagents but should not guess what can run in parallel.
```
agent-harness dispatch plan --plan plan.json
agent-harness dispatch plan --plan plan.json --runtime subagents
agent-harness dispatch next --batch --runtime subagents
```

Dispatch does not spawn workers itself. It inspects the plan, dependencies and task metadata, then returns either a safe parallel batch with handoff packets or a serial fallback task. If the runtime has no subagents, omit `--runtime subagents` and continue with the normal serial `next --exact` flow.
After a worker returns JSON, validate that output with the existing handoff validator:
```
agent-harness handoff validate --plan plan.json --task-id task-id --input worker-output.json
```

The optional task isolation field is advisory metadata in this version. It documents the intended worker isolation model, but the harness does not automatically create worktrees, fork workspaces, or sandboxes for dispatch.
Weak Worker Handoff
Use handoff when a strong model creates the plan and a weaker model, local model, junior agent, or external chat does the implementation work.
```
agent-harness handoff --compact --plan plan.json --task-id task-id
```

Paste the prompt into the weak worker. It tells the worker exactly which files and commands are allowed, when to stop, and what JSON to return. After the worker responds, save its JSON and validate it:

```
agent-harness handoff validate --plan plan.json --task-id task-id --input worker-output.json
```

This keeps the flow token-light: the weak worker receives one compact task capsule, not the whole repository or a long instruction manual. If it invents a file, command, placeholder, or success without evidence, validation fails.
Codebase memory flow for agents:
```
agent-harness map init
agent-harness map query --surface auth --compact
agent-harness map update --files src/auth/session.ts
agent-harness map record --surface auth --files src/auth/session.ts --summary "Auth session owns login state contracts and must be checked before authorization edits."
```

Use this selectively. Simple one-file work does not need a full map. Risky or unclear work should query the affected surface first, then update memory after code changes.
Learning loop for repeated bugs or known-risk areas:
```
agent-harness learn query --surface auth --top-k 3 --compact
agent-harness learn capture --surface auth --kind failure_pattern --summary "Auth fixes must verify authorization guards after session edits." --files src/auth/session.ts --evidence-ref .agent-harness/runs/fix.full.json
agent-harness learn promote --lesson-id auth-failure-pattern-20260502
```

This does not train the model. It stores short, evidence-backed lessons that future agents can query without loading the whole history.
Codebase Memory Diagram
This feature gives the agent a compact memory of the project without forcing it to reread the whole codebase on every request.
The idea is practical:
- first, build a small map of the project
- then, query only the area related to the task
- after a real change, update the memory
- next time, the agent starts with better context
```mermaid
flowchart TD
  A["First setup in a project"] --> B["agent-harness map init"]
  B --> C["Creates compact file and surface index"]
  C --> D["User asks for bugfix or feature"]
  D --> E{"Simple low-risk change?"}
  E -->|Yes| F["Read touched file directly"]
  E -->|No: risky or unclear| G["agent-harness map query --surface <surface>"]
  G --> H{"Memory fresh?"}
  H -->|Yes| I["Use compact memory + read changed files"]
  H -->|No: stale or unknown| J["Read real source code and canonical docs"]
  I --> K["Implement with harness plan and evidence"]
  J --> K
  F --> K
  K --> L["agent-harness verify records evidence"]
  L --> M["agent-harness map update --files <files>"]
  M --> N["agent-harness map record --surface <surface> --summary <durable fact>"]
  N --> O["Next agent starts with better context and fewer tokens"]
```

Step by step:
- `map init` creates the first compact index of important project files.
- For simple work, the agent should read the touched file directly and skip extra mapping.
- For risky or unclear work, the agent runs `map query --surface <surface>` before editing.
- If memory is `fresh`, it uses the compact summary plus the real files it is changing.
- If memory is `stale` or `unknown`, it must read the real source code and canonical docs before trusting the cache.
- After implementation, `verify` records evidence that checks actually ran.
- `map update --files <files>` refreshes file hashes.
- `map record` saves only durable facts: contracts, flows, invariants, known traps, and key files.
- The next agent spends fewer tokens because it can start from compact memory instead of rediscovering the same context.
Truth priority:
```
real source code > canonical docs > harness memory > chat history
```

The memory is a cache. It helps the agent move faster, but it never replaces reading the real code when the risk is high.
Good memory entry:
```
Auth session owns login state contracts and must be checked before authorization edits.
```

Bad memory entry:

```
Code updated.
```

The harness rejects vague memory because vague memory makes future agents worse.
Learning Loop
The learning loop is a governed notebook for hard-won lessons.
It is useful when the agent finds a recurring bug, fixes a fragile area, or discovers a verification rule that should not be rediscovered next time.
Flow:
```
capture -> validate -> promote -> query -> health/audit -> prune
```

- `capture`: save a candidate lesson from evidence.
- `validate`: prove the lesson has evidence, existing files, safe text, and required failure details.
- `promote`: allow a specific lesson to appear in future queries.
- `query`: return only the most relevant lessons for one surface, optionally ranked by touched files and failure signature.
- `health`: cheap check that tells the agent when memory needs a compact audit.
- `audit`: short read-only report of stale, duplicate or low-confidence lessons.
- `prune`: retire expired or noisy lessons.
Lessons are intentionally small. The default query returns only `top_k = 3`, so the agent gets useful context without spending tokens on old history.
This is not model training. It is an evidence-backed memory notebook. Routine output stays compact: no embeddings, no extra reviewing agent, no automatic deletion, and no long learning report unless a human asks for audit detail.
For non-technical users, this should feel automatic: during L2/L3 work, `session start` may tell the agent `learning_health=needs_audit`; the agent then runs `learn audit --compact` and reports the result in plain language.
Truth priority:
```
source code > current tests/runtime > canonical docs > evidence > promoted lessons > old chat
```

The learning loop improves reuse, but it does not replace reading real code for risky work.
Copy-Paste Prompt For Your Agent
After installing the harness, give your AI coding agent this instruction:
```
Use the agent harness for approved plans, multi-step work, risky changes, and any task where you need to prove completion.
For L2/L3 tasks, run the harness automatically. The user should not need to remember to ask for it.
Read docs/agent-runtime.md first; do not load the full README for routine execution.
Before editing, validate the plan.
During execution, keep the harness artifact updated.
Prefer token-light commands: session start, next, verify, claim auto, finish.
For risky or unclear work, query codebase memory before editing and update it after changing durable structure.
Do not claim success unless the artifact is completed and includes evidence plus verified claims.
In the final answer, include run_id, artifact path, status, gates, evidence, verified claims, and rollback notes.
```

What Problem Does This Solve?
AI coding agents can write code quickly, but speed is not the same as reliable delivery.
Without a harness, an agent can:
- edit before understanding the task
- skip plan steps
- say tests passed without running tests
- invent files, commands, APIs, or validations
- expand scope without noticing
- keep going after dangerous ambiguity
- declare success without proof
This harness reduces those failures by creating an execution contract.
The agent can still reason and write code, but the harness requires a structured artifact that records what actually happened.
That artifact becomes the difference between:
```
"I think it is fixed."
```

and:

```
"This run completed. Here is the plan, the changed files, the checks, the evidence, the verified claims, and the rollback path."
```

Explain It Like I Am New To This
Think of the harness as three things:
- a checklist: what the agent must do
- a flight recorder: what the agent actually did
- a memory notebook: what the agent should remember next time
The flight recorder saves proof:
- what task was executed
- what files were involved
- what command was run
- whether the command passed or failed
- what evidence supports the final answer
So when the agent says "done", you can ask:
```
Where is the artifact?
What evidence proves it?
Which claims were verified?
```

If the agent cannot answer, the work is not truly complete.
For Non-Technical Users
Do I Need To Understand The Commands?
Usually, no.
The intended experience is conversational:
```
User: Create a plan.
Agent: Here is the plan.
User: Execute the plan using the harness.
Agent: Runs the harness, edits code, records evidence, and reports the artifact.
```

You only need to know the high-level rule:
Do not trust "done" unless the agent gives evidence from the harness artifact.
What Should I Ask The Agent?
Use prompts like these:
- Investigate this bug. Do not edit files yet.
- Create a plan with files, risks, tests, and rollback.
- Execute this approved plan using the harness.
- Do not say it is done unless the harness artifact is completed.
- Show me the run_id, artifact path, final status, evidence, tests, and verified claims.
How Do I Know It Worked?
A strong final answer should include:
- run_id
- artifact path
- final status
- evidence
- tests or gates executed
- verified claims
- rollback notes when relevant
The safest completion signal is:
status: completed
phase: completed
verified claims: present
evidence: present
If those fields are missing, treat the work as partial.
Good Final Answer Example
run_id: fix-login-20260428
artifact: .agent-harness/runs/fix-login-20260428.json
status: completed
gates: pnpm test:run tests/login.test.ts
evidence: exit_code 0, affected login tests passed
verified claims: bug_reproduced_before_fix, bug_fixed_after_fix, acceptance_criteria_met
rollback: revert commit abc123 or restore files listed in the artifact
Weak Final Answer Example
Done. It should work now.
Do not trust this. It has no artifact, no evidence, and no verified claims.
Installation Options
You can use the harness without becoming an npm expert.
If you are new, use npx. It downloads and runs the latest package for you.
Recommended Install
Use this for most projects:
cd C:\Projetos\my-app
npx agent-execution-harness@latest init --adapter generic --cwd .
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
npx agent-execution-harness@latest doctor --harnessability --cwd .
What each command does:
- cd C:\Projetos\my-app: opens your project folder
- init --adapter generic --cwd .: previews the installation
- init --adapter generic --cwd . --apply --agents-mode append: installs the harness and appends rules to AGENTS.md
- doctor --harnessability --cwd .: checks if everything is configured and gives a readiness score
Expected doctor result:
Agent Execution Harness doctor passed.
Harnessability score: 90/100
AGENTS.md Options
AGENTS.md is the instruction file your coding agent reads.
Choose one mode:
# safest: keep your existing AGENTS.md unchanged
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode skip
# recommended: add harness rules to your existing AGENTS.md
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
# advanced: replace AGENTS.md after creating a backup
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode overwrite
Use append if you are not sure.
Preview Only
Run this when you only want to see what would happen:
npx agent-execution-harness@latest init --adapter generic --cwd .
Preview mode does not apply the installation.
Stetix-Style Project
For projects that want the Stetix adapter:
npx agent-execution-harness@latest init --adapter stetix --cwd . --apply --agents-mode append
Install As A Dev Dependency
Use this when you want the harness pinned in package.json:
npm install --save-dev agent-execution-harness
Then commands are available as:
agent-harness doctor --harnessability --cwd .
agent-harness run
agent-harness report
Updating An Existing Installation
Use this when you already installed the harness and want the newest version.
If you used npx, run the same installer with @latest:
cd C:\Projetos\my-app
npx agent-execution-harness@latest init --adapter generic --cwd .
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
npx agent-execution-harness@latest doctor --harnessability --cwd .
If you installed it in package.json, update the package first:
cd C:\Projetos\my-app
npm install --save-dev agent-execution-harness@latest
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
npx agent-execution-harness@latest doctor --harnessability --cwd .For projects using pnpm:
pnpm add -D agent-execution-harness@latest
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
npx agent-execution-harness@latest doctor --harnessability --cwd .
Simple explanation: updating means downloading the new harness package, running the installer again, and checking the project with doctor.
Use --agents-mode append unless you are sure you want to replace your existing AGENTS.md.
Safe update behavior:
- existing harness history is preserved
- existing run reports in .agent-harness/runs/ are preserved
- existing map/learning memory in .agent-harness/ is preserved
- existing project history such as docs/historico.md is preserved
- existing agent-harness.config.json is not replaced automatically
- existing runtime docs are not replaced automatically
- existing package.json scripts are kept; missing harness scripts are added
- .gitignore receives harness lines only once
- AGENTS.md is appended only when you choose --agents-mode append
- AGENTS.md is replaced only when you choose --agents-mode overwrite
- every applied install creates a backup under .agent-harness/backups/
Think of update like installing a new tool version beside your project rules. It should improve the harness commands, not erase your project memory.
After Installing
Use natural language with your AI coding agent:
Create a plan for this bug.
Execute the plan using the harness.
Show the run_id, artifact path, status, evidence, verified claims, and rollback.
Good final signal:
status: completed
evidence: present
verified claims: present
Weak final signal:
Done. It should work.
Common Confusions
This section explains the common terms without assuming you are a developer.
What Is npm?
npm is the package registry where this tool is published.
GitHub stores the source code. npm distributes the installable package.
What Is npx?
npx runs a package from npm without requiring you to install it manually first.
This command:
npx agent-execution-harness@latest doctor --harnessability --cwd .
means:
Download the latest harness package, run its doctor command, and check this project.
Is The Harness Automatic?
Only when the project and agent are configured to use it.
The harness is not hidden magic inside every AI tool. It works when:
- the project has harness files installed
- the project has clear AGENTS.md rules
- the agent reads and follows those rules
- the agent can run local commands
- the task matches a rule requiring the harness
For example:
For approved multi-step plans, use the agent harness.
Do not declare success without a completed artifact, evidence, and verified claims.
Can A Bad Agent Ignore It?
Yes, if the surrounding tool lets it ignore project instructions.
The harness makes correct behavior easier to enforce and audit, but it cannot physically control every possible model or coding tool unless that tool invokes it.
That is why the final answer must include artifact evidence.
Practical rule:
No artifact, no evidence, no trust.
Troubleshooting
npx asks whether to install the package
That is normal. Accept it.
doctor does not pass
Read the findings. Usually this means one of these is missing:
- AGENTS.md
- agent-harness.config.json
- package scripts
- ignored artifact folder
Fix the reported item and run doctor again.
The agent says it used the harness but gives no artifact
Treat the work as incomplete. Ask:
Show the run_id, artifact path, final status, evidence, and verified claims.
The agent refuses or forgets to use the harness
Add a stronger project instruction in AGENTS.md:
For approved plans, multi-step work, risky changes, and delegated execution, use agent-harness.
Do not declare success without a completed artifact, evidence, and verified claims.
I only want to try it without changing my project
Run the init command without --apply:
npx agent-execution-harness@latest init --adapter generic --cwd .
This is a preview. It does not apply the installation.
Why Not Just Prompts?
Prompts are useful, but prompts are memory and intention. They can be ignored, forgotten, or interpreted differently by different models.
Agent Execution Harness turns the most important parts of the workflow into explicit runtime artifacts:
- the plan is structured
- the current phase is recorded
- allowed actions are constrained
- evidence must match a gate
- claims must be verified
- final reports are derived from artifacts
The harness does not replace prompts. It gives prompts something harder to drift away from.
Practical difference:
Prompt only:
"Please be careful and run tests."
Harness-backed:
"Record the gate, record exit_code, attach output excerpt, verify the claim, and only then report completion."
That is why this project focuses on execution evidence, not just better wording.
What Installation Adds To A Project
The installer is designed to configure harness-related files such as:
- AGENTS.md
- agent-harness.config.json
- package scripts
- ignored artifact folders
- plan/checklist templates
- adapter-specific rules
If the project already has AGENTS.md, the installer is conservative:
- default behavior: keep the existing file unchanged
- --agents-mode append: add a marked harness block once
- --agents-mode overwrite: replace the file, with backup created first
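As a rough sketch, a marked harness block appended to AGENTS.md might look like this; the marker comments are illustrative assumptions, not the exact markers the installer writes, while the rule text is taken from this README:

```markdown
<!-- agent-execution-harness:start -->
For approved multi-step plans, use the agent harness.
Do not declare success without a completed artifact, evidence, and verified claims.
<!-- agent-execution-harness:end -->
```

Because the block is marked, append mode can detect it and avoid adding it twice.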
The harness artifact folder is usually:
.agent-harness/runs/
Those artifacts prove what the agent actually did.
Core Concepts For Developers
Plan
A plan is a JSON document with:
- schema_version
- plan_id
- risk_level
- rollback_expectation
- gates
- tasks
Each task must include:
- task_id
- acceptance_criteria
Example:
{
"schema_version": "agent_harness_plan_v1",
"plan_id": "basic-plan",
"risk_level": "L2",
"rollback_expectation": "Delete generated test files.",
"gates": ["node --version"],
"tasks": [
{
"task_id": "basic-task",
"depends_on": [],
"acceptance_criteria": "node --version evidence passes."
}
]
}
Action
An action is one state transition requested by the agent.
Supported actions:
- read_context
- declare_files
- edit_file_ready
- run_gate
- record_evidence
- verify_claims
- final_report
- halt_for_risk
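The run command later in this README shows the action envelope for read_context. As an illustration only, a run_gate action would reuse the same envelope; only schema_version, type, and summary are confirmed by that example, and any gate-specific fields are assumptions not shown here:

```json
{
  "schema_version": "agent_harness_action_v1",
  "type": "run_gate",
  "summary": "Declare the login test gate before recording evidence."
}
```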
Artifact
The artifact is the source of truth for execution.
Default path:
.agent-harness/runs/<run_id>.json
If the artifact is not completed, the work is not complete.
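The full artifact schema is not documented in this README. As a hypothetical sketch built only from the fields this README mentions (run_id, status, phase, evidence, verified claims), an artifact excerpt might look like:

```json
{
  "run_id": "fix-login-20260428",
  "status": "completed",
  "phase": "completed",
  "evidence": [],
  "verified_claims": []
}
```

Treat the exact field names as assumptions; the real artifact on disk is authoritative.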
Evidence
Evidence records proof that a gate or check happened.
Each evidence item records:
- evidence_id
- optional evidence_type
- optional evidence_types
- check
- result
- exit_code
- output_excerpt
- scope_covered
- optional residual_gap
- optional output_ref and sha256 for long logs stored outside chat
Evidence types let the harness compare proof against the plan:
{
"evidence_id": "ui-verification",
"evidence_types": ["focused_tests", "scoped_lint", "scoped_typecheck", "visual_assertion"],
"check": "pnpm agent:verify:ui",
"result": "pass",
"exit_code": 0,
"output_excerpt": "Focused tests, lint, typecheck and visual assertion passed.",
"scope_covered": "sidebar UI verification"
}
Evidence Policy
Plans may declare required evidence per task:
{
"task_id": "verify-sidebar-layout",
"surface": "ui_layout",
"files": ["src/components/AppLayout.tsx"],
"required_evidence": [
"focused_tests",
"scoped_lint",
"scoped_typecheck",
"browser_smoke|visual_assertion"
],
"acceptance_criteria": "Sidebar layout has no overlap."
}
If required_evidence is not provided, the harness infers requirements from the task surface and files. UI/layout work requires focused tests, scoped lint, scoped typecheck, and browser smoke or visual assertion before a run can be completed.
If required proof is missing, the run becomes partial_validated instead of completed, and the report shows the evidence score plus missing requirements.
Claims
A claim is something the agent wants to assert as true.
Supported claim kinds include:
- file_exists
- command_ran
- gate_passed
- gate_failed
- dangerous_command_blocked
- task_reconciled
- bug_reproduced_before_fix
- bug_fixed_after_fix
- acceptance_criteria_met
- contract_preserved
- rollback_defined
- no_product_code_changed
Final reports should be derived from verified claims, not from agent confidence.
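As an illustration only, a verified-claims section inside a run artifact might look like the fragment below; the claim kinds come from the list above, but the surrounding field names are assumptions, not the published schema:

```json
{
  "verified_claims": [
    { "kind": "bug_reproduced_before_fix", "verified": true },
    { "kind": "gate_passed", "verified": true },
    { "kind": "acceptance_criteria_met", "verified": true }
  ]
}
```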
State Machine
The run follows this lifecycle:
init -> preflight -> task_start -> gate -> evidence -> report -> completed
A run may also enter:
halt
Meaning:
- init: run created
- preflight: context read, files must be declared
- task_start: task execution can begin
- gate: validation command must be declared
- evidence: validation result must be recorded
- report: claims must be verified and final report produced
- completed: run is complete
- halt: execution stopped for safety or invalid state
CLI Reference
When installed from npm, use:
agent-harness <command>
When developing from this repository, use:
node bin/agent-harness.mjs <command>
init
Prepares a target project using templates.
agent-harness init --adapter generic --cwd .
By default, init is a dry run. Applying changes must be explicit:
agent-harness init --adapter generic --cwd . --apply
Existing AGENTS.md handling:
agent-harness init --adapter generic --cwd . --apply --agents-mode skip
agent-harness init --adapter generic --cwd . --apply --agents-mode append
agent-harness init --adapter generic --cwd . --apply --agents-mode overwrite
Without --agents-mode, interactive terminals ask what to do. Non-interactive runs use skip to avoid overwriting user instructions.
doctor
Checks whether a project is configured correctly.
agent-harness doctor --harnessability --cwd .
plan-lint
Validates a plan before execution.
agent-harness plan-lint --plan plan.json
plan waves
Shows dependency waves from optional task depends_on fields.
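Waves come from depends_on. As a sketch using the plan schema shown earlier, the two independent tasks below would form wave 1 and the dependent task wave 2; the task ids are hypothetical:

```json
{
  "tasks": [
    { "task_id": "update-api", "depends_on": [], "acceptance_criteria": "API tests pass." },
    { "task_id": "update-ui", "depends_on": [], "acceptance_criteria": "UI tests pass." },
    { "task_id": "integration", "depends_on": ["update-api", "update-ui"], "acceptance_criteria": "Integration tests pass." }
  ]
}
```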
agent-harness plan waves --plan plan.json
dispatch
Guides serial fallback or safe subagent batches without spawning workers directly.
agent-harness dispatch plan --plan plan.json
agent-harness dispatch next --batch --runtime subagents
Validate returned worker JSON with agent-harness handoff validate --plan plan.json --task-id task-id --input worker-output.json. Dispatch does not have a separate validate command.
execute
Initializes or resumes a run.
agent-harness execute --plan plan.json --run-id fix-login
session start
Starts an active low-token session so later commands do not need to repeat --plan, --run-id and --mode.
agent-harness session start --plan plan.json --run-id fix-login
next
Returns only the next actionable step from the current artifact.
agent-harness next
verify
Runs a policy-checked command, stores long output as a referenced log, records sha256, and creates evidence automatically.
agent-harness verify --task-id fix-login --type focused_tests --cmd "pnpm test"
Strict structured command:
agent-harness verify --task-id fix-login --type focused_tests --exec pnpm --args-json "[\"test\"]"
map
Maintains compact codebase memory for selective reuse.
agent-harness map init
agent-harness map status
agent-harness map query --surface auth
agent-harness map update --files src/auth/session.ts
agent-harness map record --surface auth --files src/auth/session.ts --summary "Auth session owns login state contracts and must be checked before authorization edits."
Use map query before risky or unclear work. Use map update after changing files. Use map record only for durable facts: contracts, flows, invariants, known traps, and key files.
learn
Maintains governed lessons from real evidence.
agent-harness learn capture --surface auth --kind failure_pattern --summary "Auth fixes must verify authorization guards after session edits." --files src/auth/session.ts --evidence-ref .agent-harness/runs/fix.full.json
agent-harness learn validate --lesson-id auth-failure-pattern-20260502
agent-harness learn review --surface auth
agent-harness learn promote --lesson-id auth-failure-pattern-20260502
agent-harness learn query --surface auth --top-k 3 --compact --files src/auth/session.ts --failure-signature "guard failed"
agent-harness learn prune
Use learn query for repeated failures or known-risk surfaces. Use learn capture only after evidence exists, learn validate before promotion, and compact queries for weak agents. Do not store secrets or generic notes.
run
Applies one low-level action to the run state.
agent-harness run \
--plan plan.json \
--run-id fix-login \
--action '{"schema_version":"agent_harness_action_v1","type":"read_context","summary":"Read plan and repo context."}'
This is the transactional command used by autonomous agents.
report
Generates a final report from a completed artifact.
agent-harness report --run-id fix-login
benchmark
Runs an offline benchmark over captured scenarios.
agent-harness benchmark --mode smoke
The benchmark does not call model APIs.
Configuration
Projects can define:
agent-harness.config.json
Example:
{
"schema_version": "agent_harness_config_v1",
"artifact_dir": ".agent-harness/runs",
"product_paths": ["src/", "supabase/"],
"required_scripts": ["agent:harness", "agent:plan:lint"],
"doctor_profile": "generic",
"command_policy": {
"allow": [],
"deny": ["DROP", "TRUNCATE", "git reset --hard", "push --force"]
},
"token_budget": {
"observation_format": "ultra_compact",
"summary_max_chars": 240,
"output_excerpt_max_chars": 600,
"report_compact_max_chars": 1600
},
"codebase_memory": {
"enabled": true,
"memory_dir": ".agent-harness/memory",
"default_strategy": "query",
"stale_after_days": 14,
"max_summary_chars": 1200,
"surface_budgets": {
"auth": 1800,
"db": 1800,
"api": 1400,
"ai": 1400,
"ui": 900,
"ui_layout": 900,
"docs": 500,
"generic": 700
},
"high_risk_surfaces": ["auth", "db", "api", "ai"]
},
"learning_memory": {
"enabled": true,
"memory_dir": ".agent-harness/learning",
"top_k": 3,
"ttl_days": 60,
"max_summary_chars": 500,
"max_lessons_per_surface": 20
},
"architecture_rules": [
{
"id": "no_client_secret_import",
"from": "src/**/*.ts",
"forbid_import": "**/service-role**",
"reason": "client code must not import server-only secrets"
}
]
}
Important fields:
- artifact_dir: where run artifacts are stored
- product_paths: paths treated as product code
- required_scripts: scripts expected by doctor
- doctor_profile: validation profile for the project
- command_policy: allow/deny rules for commands
- token_budget: controls compact output and log excerpt limits
- codebase_memory: controls selective repository mapping and memory freshness
- learning_memory: controls evidence-backed lessons from fixes and failures
- architecture_rules: optional compact path/import boundary checks for doctor --architecture
Deny rules take priority over allow rules.
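For example, with the fragment below, a command like git push --force matches the allow entry but also matches the deny pattern, so the harness refuses it; the allow entry here is hypothetical and only illustrates the precedence rule:

```json
{
  "command_policy": {
    "allow": ["git push"],
    "deny": ["push --force"]
  }
}
```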
Safety Model
The harness blocks or halts when it sees unsafe behavior, including:
- destructive database commands
- destructive Git commands
- force push
- recursive forced deletion
- evidence that does not match the pending gate
- final report before verified claims
- completion before all tasks are reconciled
- shell commands in strict mode when structured --exec is required
- validation commands that are not listed in the task allowed_commands in strict mode
This does not replace human judgment. It creates mechanical pressure against unsafe automation.
What This Harness Does Not Promise
This harness does not guarantee that:
- the model will understand the product perfectly
- every bug will be found
- every generated test is meaningful
- every architectural decision is correct
- no human review is needed
It does provide stronger operating discipline:
- work is decomposed into tasks
- state is recorded
- evidence is required
- claims are checked
- dangerous operations are blocked or halted
- final reports are derived from artifacts
Local Development
Use this section only if you want to edit or contribute to the harness itself.
Install dependencies:
pnpm install
Run typecheck:
pnpm typecheck
Run tests:
pnpm test
Build:
pnpm build
Run integration tests:
pnpm test:integration
Run smoke benchmark:
pnpm benchmark:smoke
Run release readiness audit:
pnpm audit:release-readiness
Project Status
Current version:
0.14.4
Package:
agent-execution-harness
Treat this as an early public foundation for structured AI-assisted development.
It is useful today if you want agents to work with plans, evidence, safety stops, compact memory, and audit-friendly reports.
Weak Model Mode
Use --mode weak when the coding agent has low reasoning power, small context, or keeps drifting from the plan. The harness then behaves like guardrails on a narrow road: one compact next action, fewer files per task, typed evidence, shorter summaries, and repair hints when a gate fails.
Practical flow:
agent-harness plan-lint --plan plan.json
agent-harness session start --plan plan.json --run-id my-fix --mode weak
agent-harness next
agent-harness verify --task-id task-1 --type focused_tests --cmd "pnpm test"
agent-harness claim auto
agent-harness finish --summary "validated"
Weak mode is not for every request. Use normal mode for simple, trusted agents; use weak mode for risky work, junior agents, local LLMs, or repeated failures.
Strict Mode
Use --mode strict when the agent is weak, the work is sensitive, or you want the strongest local enforcement currently available.
Strict mode adds three important rules:
- validation commands must be declared in the plan task allowed_commands
- shell-style --cmd is blocked by default
- use --exec plus --args-json so the harness runs a structured command without shell parsing
Example:
agent-harness session start --plan plan.json --run-id strict-fix --mode strict
agent-harness verify --task-id task-1 --type focused_tests --exec pnpm --args-json "[\"test:run\",\"tests/login.test.ts\"]"
Strict mode is safer, but less flexible. If a command is not declared in the plan, the harness stops instead of guessing.
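As a sketch, a strict-mode task would carry its validation commands in allowed_commands; the exact placement of this field inside the task object is inferred from this README, not confirmed by a published schema:

```json
{
  "task_id": "task-1",
  "depends_on": [],
  "allowed_commands": ["pnpm test:run tests/login.test.ts"],
  "acceptance_criteria": "Login tests pass."
}
```

With this task, the strict-mode verify command shown above is permitted, and any other validation command is stopped.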
