agent-execution-harness
v0.14.4
A transactional execution harness for AI coding agents with evidence-backed reports.
Agent Execution Harness
Agent Execution Harness helps AI coding agents work like disciplined software engineers instead of improvising through your codebase.
It gives the agent a repeatable operating system for software work:
It gives the agent a repeatable operating system for software work:

```
understand -> plan -> read relevant context -> execute one task -> verify -> record evidence -> report honestly -> remember useful lessons
```

The goal is simple: make AI-assisted development more reliable, auditable, and cheaper in tokens.
What's New In v0.14.4
This patch lets agents reuse an approved plan whether it was saved as a file or produced in chat.
- added `plan import --from -` for pasted or piped chat plans
- kept `plan import --from backlog.md` for file-based plans
- protected existing `plan.json` files from silent overwrite
- replacing an existing plan now requires explicit `--overwrite`
In plain language: if Codex or OpenCode already wrote the plan, the harness can turn that approved text into plan.json without recreating or overwriting it silently.
What's New In v0.14.3
This patch adds dispatch guidance for agents that may have subagents.
- added `dispatch plan` and `dispatch next --batch`
- dispatch returns a safe serial fallback or parallel handoff packets
- worker JSON validation stays with `handoff validate`
- dispatch refuses to create a new batch while a run already has active work
- runnable tasks are no longer also reported as blocked
In plain language: the harness can now tell an agent when to use subagents and when to stay serial, without inventing worker validation commands.
What's New In v0.14.2
This patch helps agents turn an approved text backlog into an executable plan.json.
- added `plan import` for the atomic Markdown backlog format
- missing-plan errors now show the import -> lint -> session flow
- the harness still does not guess from chat history; the backlog must be saved to a file
In plain language: a plan in chat can now become a real Harness plan without the agent recreating it freely.
What's New In v0.14.1
This patch removes ambiguity from the prompt users give to coding agents.
- added a copy-paste execution prompt with the real harness command
- clarified that Stetix-style projects use `pnpm agent:harness`
- added a test so public docs/templates do not reintroduce missing-command placeholders
In plain language: weaker agents should receive the exact command to run, not a placeholder they need to interpret.
What's New In v0.14.0
This release makes the harness more token-first for weaker agents.
- `doctor --coverage`: shows compact gaps in project safety controls.
- `doctor --architecture`: checks lightweight boundary rules without new dependencies.
- topology detection helps recommend controls for CLI, web, API and Supabase projects.
- `plan lint` and templates now reinforce surgical coding discipline with short rules.
- token benchmarks cap the new doctor outputs so routine agent loops stay cheap.
In plain language: weaker agents get clearer rails and shorter diagnostics before they guess, over-edit or claim success without proof.
What's New In v0.13.2
This patch improves how agents react to repeated failures.
- repeated failures now tell agents to inspect local code/history first
- docs or web research is reserved for external dependency behavior
- agents should compare two possible fixes instead of guessing repeatedly
- the rule stays compact and keeps the install token budget intact
In plain language: when the same error appears again, the agent should stop guessing, look locally first, research only when needed, then choose the smallest safe fix.
What's New In v0.13.1
This patch improves the instructions installed into AGENTS.md.
- agents are reminded to read before writing
- risky ambiguity should stop or ask instead of guessing
- changes should stay surgical and avoid unrelated refactors
- success criteria and evidence come before claiming completion
- the rules stay compact so the install does not waste tokens
In plain language: new installs and updates give the coding agent clearer rails without turning AGENTS.md into a long manifesto.
What's New In v0.13.0
This release hardens the harness for weaker agents and safer public use.
- stable test timeout for slower machines
- dependency audit clean at moderate severity
- CI/release now run secret scan and dependency audit
- stronger dangerous-command detection
- safer command guidance: prefer `--exec` + `--args-json` over free shell
- richer optional plan controls: forbidden files, expected diff, required checks and rollback command
- stricter high-confidence memory: source files and main-agent validation are required
- benchmark smoke now fails if false success appears or out-of-plan diff is not blocked
In plain language: the harness now does more to stop weaker agents from guessing, touching the wrong files, or claiming success without proof.
What's New In v0.12.4
This patch fixes install/update scripts for existing projects.
- `agent:harness` now points to the root CLI: `agent-harness`
- existing projects with the legacy `agent-harness run` script are upgraded safely
- `pnpm agent:harness doctor`, `--help`, `--version` and session commands work from the same script

In plain language: after installing or updating, users can run all harness commands from `pnpm agent:harness ...` without command routing errors.
What's New In v0.12.3
This patch adds plain version output.
- `agent-harness --version`
- `agent-harness -v`
- `agent-harness version`
In plain language: after installing, users can check the installed version without seeing a JSON error.
What's New In v0.12.2
This patch polishes install output for first-time users.
- `init --apply` now says "Files updated safely" instead of exposing technical action labels.
- The JSON hint is shorter and marked as advanced.
In plain language: installation output is less noisy and easier to understand.
What's New In v0.12.1
This patch makes install and readiness output easier for humans.
- `init` prints a short success message by default.
- `doctor` prints a readable readiness report by default.
- `--json` keeps structured output available for CI, scripts and advanced automation.
In plain language: beginners see "installed successfully" plus the next command to run. Machines can still ask for JSON.
What's New In v0.12.0
This release adds lightweight learning-memory health checks.
- `learn health --compact`: tells the agent when memory needs cleanup.
- `learn audit --compact`: lists stale, duplicate or low-confidence lessons in a short read-only report.
- `session start` can return `learning_health=needs_audit`, so agents can audit memory without the user remembering commands.
In plain language: the harness can notice when its lesson notebook is getting noisy and ask the agent to do a compact review. It does not delete lessons automatically.
What's New In v0.11.1
This patch makes installation easier to understand.
- `init --apply` now says clearly when the harness was installed successfully.
- The output explains what happened to `AGENTS.md`: appended, overwritten, created, or left unchanged.
- The next steps show exact doctor and rollback commands.
In plain language: after installing, you should no longer need to guess whether it worked.
What's New In v0.11.0
This release improves the harness learning loop without adding embeddings, extra AI agents, or long reports.
- `learn validate`: lessons must be validated before promotion.
- Smarter `learn query`: rank lessons by touched files and failure signature.
- Repeated failure hint: `verify` can suggest a short learning action after equivalent failures.
- Token budgets: validation output and learning hints are capped so routine agent output stays compact.
In plain language: the harness remembers useful lessons more safely, but still talks to the agent in short, cheap messages.
Works With Non-Frontier Agents Too
You do not need a frontier model to benefit from this harness.
Agent Execution Harness is designed to help weaker, cheaper, local, junior, or low-context coding agents execute software work more safely. In weak mode, the harness turns broad implementation work into small deterministic steps:
- one exact next command with `next --exact`
- fewer files per task
- typed evidence instead of vague status updates
- short repair hints when the agent gets stuck
- blocked completion when the agent changes files outside the declared plan
- compact artifacts so the agent does not need to reread the whole repository
In plain language: a strong model may use the harness as discipline. A weaker model uses it as rails.
You can also use a strong model as the planner/reviewer and a cheaper or weaker model as the worker. The handoff flow gives that worker one compact task capsule, then validates its JSON output before the work can be trusted.
AI agents are useful, but they often fail in the same ways:
- they change files before understanding the task
- they skip steps from the plan
- they say tests passed when no test was run
- they invent files, commands, APIs, or validations
- they declare "done" without proof
This project adds a small execution system around the agent.
It does not try to make the model smarter. It makes the agent easier to guide, audit, and stop when the work becomes unsafe.
In plain language: it is a checklist, memory, learning notebook, and flight recorder for AI-assisted software development.
It helps an AI agent execute software plans in a more organized way by forcing the agent to:
- follow a plan task by task
- declare which files it expects to touch
- run explicit checks
- record evidence
- verify claims before saying "done"
- stop instead of guessing when work becomes unsafe
Why Use This?
Use this repo when you want an AI coding agent to:
- create a clear plan before risky work
- execute that plan step by step
- keep a record of what happened
- run checks and attach evidence
- remember useful codebase context for future tasks
- avoid rereading the whole project every time
- avoid claiming success without proof
The harness is especially useful for:
- bug fixes
- refactors
- multi-step features
- AI-assisted code review
- teams experimenting with autonomous coding agents
- projects where "trust me, it works" is not good enough
The most important benefit is not speed. It is controlled speed.
Without a harness, an agent can move fast and still leave you unsure whether it understood the task, ran the right checks, or changed the right files. With the harness, every important step leaves an artifact: the plan, touched files, commands, evidence, verified claims, and rollback notes.
That turns AI coding from a chat conversation into an engineering workflow you can inspect.
The Full Harness Flow
This is the day-to-day flow the harness tries to enforce:
```mermaid
flowchart TD
  A["User asks for a bugfix, feature, or review"] --> B["Agent classifies risk and creates or reads a plan"]
  B --> C{"Simple low-risk work?"}
  C -->|Yes| D["Read the touched file directly"]
  C -->|No: broad, risky, or unclear| E["Query codebase memory with map"]
  E --> F["Query learned lessons with learn"]
  D --> G["Declare expected files"]
  F --> G
  G --> H["Execute one task at a time"]
  H --> I["Run a real gate: test, typecheck, lint, build, smoke, or custom command"]
  I --> J["Store evidence: command, exit code, output excerpt, log ref, sha256"]
  J --> K["Verify claims before final report"]
  K --> L{"Required evidence complete?"}
  L -->|Yes| M["Status: completed"]
  L -->|No| N["Status: partial_validated or halt"]
  M --> O["Update map and capture useful lessons"]
  N --> O
  O --> P["Next agent starts with better context and fewer repeated mistakes"]
```

The important part: the agent does not get to say "done" just because it feels confident.
It must prove the work.
The Three Memory Layers
The harness now separates memory into three practical layers:
| Layer | What it answers | Example |
|---|---|---|
| Plan artifact | What was supposed to happen? | "Fix login bug, touch src/auth/session.ts, run focused auth tests." |
| Codebase memory | Where does this logic live? | "Auth session contracts live in src/auth and affect guards." |
| Learning memory | What did we learn from previous failures? | "When session state changes, also test authorization guards." |
This matters because agents waste tokens and make mistakes when they rediscover the same project structure or repeat the same bug pattern. The harness stores compact, evidence-backed context so the next run starts from better information without loading the whole repository.
Truth still has a strict order:
```
source code > current tests/runtime > canonical docs > evidence > promoted lessons > old chat
```

Memory helps the agent. It never replaces checking the real code.
What You Get
After installation, your project gets:
- `AGENTS.md` rules that tell the AI agent how to behave
- `agent-harness.config.json` for local policy and artifact settings
- plan validation
- execution artifacts
- evidence-backed final reports
- codebase memory commands
- safety checks for risky commands
- compact output modes to reduce token usage
- governed learning loop for evidence-backed lessons
- control catalog showing which risks each harness control covers
- harnessability scoring to show how ready a project is for AI-agent execution
- coverage and architecture diagnostics for compact risk gaps
- repeated-failure steering to suggest small controls after recurring mistakes
- optional approved fixtures for critical behavior that must not be guessed
The intended day-to-day experience is simple:
```
You: Find this bug.
Agent: Investigates and proposes a plan.
You: Execute the plan using the harness.
Agent: Executes step by step, records evidence, and reports the artifact.
You: Show me proof.
Agent: Shows run_id, artifact, checks, evidence, claims, and rollback.
```

If the agent cannot show evidence, the work is not complete.
Table Of Contents
- Why Use This?
- What You Get
- Quick Start
- Which Mode Should I Use?
- Codebase Memory Diagram
- Learning Loop
- What Problem Does This Solve?
- For Non-Technical Users
- Installation Options
- Common Confusions
- Troubleshooting
- Why Not Just Prompts?
- Core Concepts For Developers
- CLI Reference
- Configuration
- Safety Model
- Local Development
Useful Links
Quick Start
Use this if you want to try the harness in an existing project.
AI agents should read docs/agent-runtime.md for the short runtime protocol. This README is for humans.
Text Backlog To Plan
If Codex or another planner already produced an approved atomic Markdown backlog, save it as backlog.md, then run:
```
agent-harness plan import --from backlog.md --out plan.json --plan-id my-plan --risk L2 --rollback "Delete generated files."
agent-harness plan-lint --plan plan.json
agent-harness session start --plan plan.json --run-id my-plan --mode weak
```

If the approved plan exists only in a chat response, paste or pipe that text through stdin:

```
agent-harness plan import --from - --out plan.json --plan-id my-plan --risk L2 --rollback "Delete generated files."
```

`plan import` does not overwrite an existing output file by default. Use `--overwrite` only when replacing the previous `plan.json` is intentional. Once `plan.json` exists, the harness reuses it through `--plan plan.json`; dispatch, handoff and session commands do not recreate the plan.
Supported task format:
```
- [ ] **Tarefa [1]**: Ajustar arquivo em `src/file.ts`.
  - **Dependência:** Nenhum
  - **DoD:** `pnpm test:run tests/unit/file.test.ts` passa.
```

The importer is intentionally narrow. It converts the known backlog format from a file or stdin; it does not infer plans from free-form chat.
Dependencies must be `Nenhum` or `Tarefa N`; invalid dependency text fails instead of being ignored.
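The fail-instead-of-ignore dependency rule can be sketched as a tiny validator. This is an illustrative sketch only, not the importer's actual code: the function name and error message are invented here, and the real importer may accept more variants.

```python
import re

def parse_dependency(text: str) -> list[int]:
    """Parse a backlog 'Dependência' value: 'Nenhum' or 'Tarefa N'.

    Returns the task numbers the task depends on.
    Raises ValueError for anything else instead of silently ignoring it.
    """
    text = text.strip().rstrip(".")
    if text == "Nenhum":
        return []
    match = re.fullmatch(r"Tarefa (\d+)", text)
    if match is None:
        raise ValueError(f"invalid dependency text: {text!r}")
    return [int(match.group(1))]

print(parse_dependency("Nenhum"))    # []
print(parse_dependency("Tarefa 3"))  # [3]
```

The point of the sketch is the last branch: free-form dependency text raises an error rather than being dropped, which is what keeps the imported plan trustworthy.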
Which Mode Should I Use?
| Mode | Use when | What it optimizes |
|---|---|---|
| standard | normal AI coding agent, normal task | balanced speed and evidence |
| weak | cheaper model, local model, junior agent, or low-context executor | smaller steps, compact output, repair hints |
| strict | sensitive work or less trusted executor | only declared structured commands can pass |
| handoff | strong model plans/reviews while another model executes one task | compact delegation with JSON validation |
Simple rule: use standard by default, weak when the agent drifts, strict when command control matters, and handoff when you want one model to plan and another model to execute.
Control Catalog
The harness keeps a small catalog of controls so users can see what is protected and what is not.
Examples:
- `plan_lint`: catches invalid plans before execution.
- `scope_guard`: blocks finish when files outside the plan changed.
- `evidence_policy`: blocks success claims without required proof.
- `strict_command_policy`: blocks undeclared shell commands in strict mode.
- `handoff_validate`: checks JSON returned by weak or external workers.
This is intentionally metadata, not a heavy policy engine. The goal is auditability with near-zero token cost.
Harnessability
`doctor --harnessability` checks whether a project is easy for AI agents to work in safely:

```
agent-harness doctor --harnessability --cwd .
```

By default, doctor prints a human-readable report. For JSON output:

```
agent-harness doctor --json --harnessability --cwd .
```

It scores cheap local signals such as scripts, AGENTS.md, harness config, tests, runtime docs, artifact policy and command policy. A low score does not mean the project is bad. It means agents have fewer rails and may need smaller plans or stronger review.
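Conceptually, this kind of score is a sum over cheap presence checks. The toy sketch below illustrates the idea only: the real signals and weights are internal to the harness, and the file list and weights here are invented for the example.

```python
from pathlib import Path

# Invented signal list: file to look for and a made-up weight.
SIGNALS = {
    "AGENTS.md": 25,
    "agent-harness.config.json": 25,
    "package.json": 25,
    "docs/agent-runtime.md": 25,
}

def harnessability(root: str) -> int:
    """Sum the weights of signals present under the project root."""
    base = Path(root)
    return sum(w for rel, w in SIGNALS.items() if (base / rel).exists())

print(harnessability("."))  # 0-100 depending on which files exist
```

The design point is that every check is a local filesystem lookup, so the score stays cheap enough to run before any task.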
Coverage And Architecture
Use these before broad, risky or weak-agent work:
```
agent-harness doctor --coverage --architecture --cwd .
```

`--coverage` shows which cheap controls are present and which are missing. `--architecture` checks optional boundary rules from `agent-harness.config.json`, such as client code importing server-only secrets.

Both outputs are short by default. Use `--json` only for automation.
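As an illustration only, a boundary rule in `agent-harness.config.json` might look like the fragment below. Every key name here is invented for the example; check the harness's own configuration reference for the real schema before copying this.

```json
{
  "architecture": {
    "boundaries": [
      {
        "from": "src/client/**",
        "forbidImports": ["src/server/secrets/**"],
        "reason": "client code must not import server-only secrets"
      }
    ]
  }
}
```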
Repeated Failure Steering
`doctor --steering` scans recent harness artifacts and suggests the smallest control when the same failure keeps happening:

```
agent-harness doctor --steering --cwd .
```

It does not auto-rewrite your rules. It only points to repeated evidence, such as out-of-plan edits or missing evidence, so a human or senior agent can decide whether to add a rule, test or checklist item.
Approved Fixtures
Approved fixtures are optional. Use them only for critical behavior where generated tests are not enough, such as auth, billing, clinical AI, data transforms or structured AI output.
```
agent-harness fixtures validate --file tests/fixtures/approved/basic-approved-fixture.json
```

A fixture must be explicitly owner-approved. This keeps the feature useful without turning every small task into a heavyweight validation process.
The Simple Path
Copy and paste these commands inside the project where you want to use the harness.
Step 1: open your project folder:
```
cd C:\Projetos\my-app
```

Step 2: preview what will be installed:

```
npx agent-execution-harness@latest init --adapter generic --cwd .
```

This preview should not change your project.
Step 3: install the harness:
```
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
```

This is the recommended command for most projects.
It adds harness rules to AGENTS.md without replacing your current instructions.
Step 4: check that installation worked:
```
npx agent-execution-harness@latest doctor --harnessability --cwd .
```

Expected result:

```
Agent Execution Harness doctor passed.
Harnessability score: 90/100
```

Step 5: tell your AI coding agent to use it:
```
Use the agent harness for approved plans, multi-step work, risky changes, and any task where you need to prove completion.
For L2/L3 tasks, run the harness automatically. The user should not need to remember to ask for it.
Read docs/agent-runtime.md first.
Do not claim success unless the harness artifact is completed and includes evidence plus verified claims.
```

Then talk normally:

```
Find this bug.
Create a plan.
Execute the approved plan using the harness.
Show me the evidence.
```

What Files Get Added?
The installer may add or update harness setup files such as:
- `AGENTS.md`
- `agent-harness.config.json`
- package scripts
- harness artifact folders
- runtime docs for the agent
If your project already has `AGENTS.md`, the recommended command uses `--agents-mode append`, so it adds a harness block instead of replacing the file.
The agent should use the harness underneath.
You do not need to memorize the commands below. They show what the agent should run behind the scenes.
Token-light flow for agents:
```
agent-harness session start --plan plan.json --run-id fix-id --summary "ctx"
agent-harness next
agent-harness files declare --files src/file.ts
agent-harness task start --task-id task-id --files src/file.ts
agent-harness verify --task-id task-id --type focused_tests --cmd "pnpm test"
agent-harness claim auto
agent-harness finish --summary "Validated."
agent-harness report --run-id fix-id --format compact
```

For weak, local, low-context or cost-sensitive executors, use the micro/compact variants:
```
agent-harness next --exact --micro
agent-harness dispatch next --batch --runtime subagents
agent-harness handoff --compact --plan plan.json --task-id task-id
agent-harness map query --surface auth --compact
agent-harness learn query --surface auth --top-k 3 --compact
```

These commands remove duplicate transport metadata from the chat output while preserving the full audit trail in artifacts and full commands. Use the normal output when a human needs to debug; use compact output when an agent only needs the next action.
For strict execution, prefer structured commands instead of shell strings:
```
agent-harness session start --plan plan.json --run-id fix-id --mode strict
agent-harness verify --task-id task-id --type focused_tests --exec pnpm --args-json "[\"test\"]"
```

`strict` mode blocks shell-style `--cmd` by default and requires the command to match the task `allowed_commands`.
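One way to produce the `--args-json` value without quoting mistakes is to serialize the argv array with a JSON encoder rather than hand-escaping it. The sketch below only assumes that the flag takes a JSON array of strings, which is what the example above shows; the argument values are invented.

```python
import json

# Structured command: a program plus an argv array, no shell string.
program = "pnpm"
args = ["test", "--filter", "auth session"]  # the space survives as ONE argument

# json.dumps produces the exact string to pass to --args-json.
args_json = json.dumps(args)
print(args_json)  # ["test", "--filter", "auth session"]
```

This is the same reason the harness prefers structured commands: an argv array has no shell parsing step, so there is nothing for an agent to mis-quote or inject.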
In weak mode, `claim auto` automatically batches claims when a plan has many tasks. The agent still runs one simple command, while the harness keeps each internal action small enough for low-context executors.
For low-context agents, use `next --exact --micro`. It returns the exact next harness command plus the stop condition, reducing ordering mistakes such as claiming early, skipping file declaration, or forgetting the active task.
For multi-step plans, tasks can declare `depends_on`. Run `agent-harness plan waves --plan plan.json` to preview safe execution order. `next --exact` then guides the agent only to tasks whose dependencies already passed evidence.
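The wave preview can be thought of as repeatedly grouping the tasks whose dependencies are already satisfied. The sketch below illustrates that grouping only; the task IDs are invented and the real `plan waves` output format may differ.

```python
def waves(tasks: dict[str, list[str]]) -> list[list[str]]:
    """Group tasks into waves: each wave depends only on earlier waves."""
    done: set[str] = set()
    result: list[list[str]] = []
    while len(done) < len(tasks):
        ready = sorted(
            t for t, deps in tasks.items()
            if t not in done and all(d in done for d in deps)
        )
        if not ready:  # nothing runnable: cycle or unknown dependency
            raise ValueError("cycle or unknown dependency in plan")
        result.append(ready)
        done.update(ready)
    return result

# Hypothetical plan: task-3 depends on task-1 and task-2.
plan = {"task-1": [], "task-2": [], "task-3": ["task-1", "task-2"]}
print(waves(plan))  # [['task-1', 'task-2'], ['task-3']]
```

Tasks inside one wave are safe to hand to parallel workers; tasks in later waves must wait for evidence from earlier ones.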
The scope guard also checks the real git diff before `finish`. If the agent changed a product/source file outside the plan, the run stops with a `repair_hint` instead of pretending success. In plain language: the agent can only finish if the files it touched match the files it declared.
Dispatch Guidance
Use dispatch when the agent may have subagents but should not guess what can run in parallel.
```
agent-harness dispatch plan --plan plan.json
agent-harness dispatch plan --plan plan.json --runtime subagents
agent-harness dispatch next --batch --runtime subagents
```

Dispatch does not spawn workers itself. It inspects the plan, dependencies and task metadata, then returns either a safe parallel batch with handoff packets or a serial fallback task. If the runtime has no subagents, omit `--runtime subagents` and continue with the normal serial `next --exact` flow.
After a worker returns JSON, validate that output with the existing handoff validator:
```
agent-harness handoff validate --plan plan.json --task-id task-id --input worker-output.json
```

The optional task isolation field is advisory metadata in this version. It documents the intended worker isolation model, but the harness does not automatically create worktrees, fork workspaces, or sandboxes for dispatch.
Weak Worker Handoff
Use handoff when a strong model creates the plan and a weaker model, local model, junior agent, or external chat does the implementation work.
```
agent-harness handoff --compact --plan plan.json --task-id task-id
```

Paste the prompt into the weak worker. It tells the worker exactly which files and commands are allowed, when to stop, and what JSON to return. After the worker responds, save its JSON and validate it:

```
agent-harness handoff validate --plan plan.json --task-id task-id --input worker-output.json
```

This keeps the flow token-light: the weak worker receives one compact task capsule, not the whole repository or a long instruction manual. If it invents a file, command, placeholder, or success without evidence, validation fails.
Codebase memory flow for agents:
```
agent-harness map init
agent-harness map query --surface auth --compact
agent-harness map update --files src/auth/session.ts
agent-harness map record --surface auth --files src/auth/session.ts --summary "Auth session owns login state contracts and must be checked before authorization edits."
```

Use this selectively. Simple one-file work does not need a full map. Risky or unclear work should query the affected surface first, then update memory after code changes.
Learning loop for repeated bugs or known-risk areas:
```
agent-harness learn query --surface auth --top-k 3 --compact
agent-harness learn capture --surface auth --kind failure_pattern --summary "Auth fixes must verify authorization guards after session edits." --files src/auth/session.ts --evidence-ref .agent-harness/runs/fix.full.json
agent-harness learn promote --lesson-id auth-failure-pattern-20260502
```

This does not train the model. It stores short, evidence-backed lessons that future agents can query without loading the whole history.
Codebase Memory Diagram
This feature gives the agent a compact memory of the project without forcing it to reread the whole codebase on every request.
The idea is practical:
- first, build a small map of the project
- then, query only the area related to the task
- after a real change, update the memory
- next time, the agent starts with better context
```mermaid
flowchart TD
  A["First setup in a project"] --> B["agent-harness map init"]
  B --> C["Creates compact file and surface index"]
  C --> D["User asks for bugfix or feature"]
  D --> E{"Simple low-risk change?"}
  E -->|Yes| F["Read touched file directly"]
  E -->|No: risky or unclear| G["agent-harness map query --surface <surface>"]
  G --> H{"Memory fresh?"}
  H -->|Yes| I["Use compact memory + read changed files"]
  H -->|No: stale or unknown| J["Read real source code and canonical docs"]
  I --> K["Implement with harness plan and evidence"]
  J --> K
  F --> K
  K --> L["agent-harness verify records evidence"]
  L --> M["agent-harness map update --files <files>"]
  M --> N["agent-harness map record --surface <surface> --summary <durable fact>"]
  N --> O["Next agent starts with better context and fewer tokens"]
```

Step by step:
- `map init` creates the first compact index of important project files.
- For simple work, the agent should read the touched file directly and skip extra mapping.
- For risky or unclear work, the agent runs `map query --surface <surface>` before editing.
- If memory is `fresh`, it uses the compact summary plus the real files it is changing.
- If memory is `stale` or `unknown`, it must read the real source code and canonical docs before trusting the cache.
- After implementation, `verify` records evidence that checks actually ran.
- `map update --files <files>` refreshes file hashes.
- `map record` saves only durable facts: contracts, flows, invariants, known traps, and key files.
- The next agent spends fewer tokens because it can start from compact memory instead of rediscovering the same context.
Truth priority:
```
real source code > canonical docs > harness memory > chat history
```

The memory is a cache. It helps the agent move faster, but it never replaces reading the real code when the risk is high.
Good memory entry:
```
Auth session owns login state contracts and must be checked before authorization edits.
```

Bad memory entry:

```
Code updated.
```

The harness rejects vague memory because vague memory makes future agents worse.
Learning Loop
The learning loop is a governed notebook for hard-won lessons.
It is useful when the agent finds a recurring bug, fixes a fragile area, or discovers a verification rule that should not be rediscovered next time.
Flow:
```
capture -> validate -> promote -> query -> health/audit -> prune
```

- `capture`: save a candidate lesson from evidence.
- `validate`: prove the lesson has evidence, existing files, safe text, and required failure details.
- `promote`: allow a specific lesson to appear in future queries.
- `query`: return only the most relevant lessons for one surface, optionally ranked by touched files and failure signature.
- `health`: cheap check that tells the agent when memory needs a compact audit.
- `audit`: short read-only report of stale, duplicate or low-confidence lessons.
- `prune`: retire expired or noisy lessons.
Lessons are intentionally small. The default query returns only `top_k = 3`, so the agent gets useful context without spending tokens on old history.
This is not model training. It is an evidence-backed memory notebook. Routine output stays compact: no embeddings, no extra reviewing agent, no automatic deletion, and no long learning report unless a human asks for audit detail.
For non-technical users, this should feel automatic: during L2/L3 work, `session start` may tell the agent `learning_health=needs_audit`; the agent then runs `learn audit --compact` and reports the result in plain language.
Truth priority:
```
source code > current tests/runtime > canonical docs > evidence > promoted lessons > old chat
```

The learning loop improves reuse, but it does not replace reading real code for risky work.
Copy-Paste Prompt For Your Agent
After installing the harness, give your AI coding agent this instruction:
```
Use the agent harness for approved plans, multi-step work, risky changes, and any task where you need to prove completion.
For L2/L3 tasks, run the harness automatically. The user should not need to remember to ask for it.
Read docs/agent-runtime.md first; do not load the full README for routine execution.
Before editing, validate the plan.
During execution, keep the harness artifact updated.
Prefer token-light commands: session start, next, verify, claim auto, finish.
For risky or unclear work, query codebase memory before editing and update it after changing durable structure.
Do not claim success unless the artifact is completed and includes evidence plus verified claims.
In the final answer, include run_id, artifact path, status, gates, evidence, verified claims, and rollback notes.
```

What Problem Does This Solve?
AI coding agents can write code quickly, but speed is not the same as reliable delivery.
Without a harness, an agent can:
- edit before understanding the task
- skip plan steps
- say tests passed without running tests
- invent files, commands, APIs, or validations
- expand scope without noticing
- keep going after dangerous ambiguity
- declare success without proof
This harness reduces those failures by creating an execution contract.
The agent can still reason and write code, but the harness requires a structured artifact that records what actually happened.
That artifact becomes the difference between:
```
"I think it is fixed."
```

and:

```
"This run completed. Here is the plan, the changed files, the checks, the evidence, the verified claims, and the rollback path."
```

Explain It Like I Am New To This
Think of the harness as three things:
- a checklist: what the agent must do
- a flight recorder: what the agent actually did
- a memory notebook: what the agent should remember next time
The flight recorder saves proof:
- what task was executed
- what files were involved
- what command was run
- whether the command passed or failed
- what evidence supports the final answer
So when the agent says "done", you can ask:
```
Where is the artifact?
What evidence proves it?
Which claims were verified?
```

If the agent cannot answer, the work is not truly complete.
For Non-Technical Users
Do I Need To Understand The Commands?
Usually, no.
The intended experience is conversational:
```
User: Create a plan.
Agent: Here is the plan.
User: Execute the plan using the harness.
Agent: Runs the harness, edits code, records evidence, and reports the artifact.
```

You only need to know the high-level rule:
Do not trust "done" unless the agent gives evidence from the harness artifact.
What Should I Ask The Agent?
Use prompts like these:
- Investigate this bug. Do not edit files yet.
- Create a plan with files, risks, tests, and rollback.
- Execute this approved plan using the harness.
- Do not say it is done unless the harness artifact is completed.
- Show me the run_id, artifact path, final status, evidence, tests, and verified claims.
How Do I Know It Worked?
A strong final answer should include:
- run_id
- artifact path
- final status
- evidence
- tests or gates executed
- verified claims
- rollback notes when relevant
The safest completion signal is:
status: completed
phase: completed
verified claims: present
evidence: present
If those fields are missing, treat the work as partial.
Good Final Answer Example
run_id: fix-login-20260428
artifact: .agent-harness/runs/fix-login-20260428.json
status: completed
gates: pnpm test:run tests/login.test.ts
evidence: exit_code 0, affected login tests passed
verified claims: bug_reproduced_before_fix, bug_fixed_after_fix, acceptance_criteria_met
rollback: revert commit abc123 or restore files listed in the artifact
Weak Final Answer Example
Done. It should work now.
Do not trust this. It has no artifact, no evidence, and no verified claims.
Installation Options
You can use the harness without becoming an npm expert.
If you are new, use npx. It downloads and runs the latest package for you.
Recommended Install
Use this for most projects:
cd C:\Projetos\my-app
npx agent-execution-harness@latest init --adapter generic --cwd .
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
npx agent-execution-harness@latest doctor --harnessability --cwd .
What each command does:
- cd C:\Projetos\my-app: opens your project folder
- init --adapter generic --cwd .: previews the installation
- init --adapter generic --cwd . --apply --agents-mode append: installs the harness and appends rules to AGENTS.md
- doctor --harnessability --cwd .: checks if everything is configured and gives a readiness score
Expected doctor result:
Agent Execution Harness doctor passed.
Harnessability score: 90/100
AGENTS.md Options
AGENTS.md is the instruction file your coding agent reads.
Choose one mode:
# safest: keep your existing AGENTS.md unchanged
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode skip
# recommended: add harness rules to your existing AGENTS.md
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
# advanced: replace AGENTS.md after creating a backup
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode overwrite
Use append if you are not sure.
Preview Only
Run this when you only want to see what would happen:
npx agent-execution-harness@latest init --adapter generic --cwd .
Preview mode does not apply the installation.
Stetix-Style Project
For projects that want the Stetix adapter:
npx agent-execution-harness@latest init --adapter stetix --cwd . --apply --agents-mode append
Install As A Dev Dependency
Use this when you want the harness pinned in package.json:
npm install --save-dev agent-execution-harness
Then commands are available as:
agent-harness doctor --harnessability --cwd .
agent-harness run
agent-harness report
Updating An Existing Installation
Use this when you already installed the harness and want the newest version.
If you used npx, run the same installer with @latest:
cd C:\Projetos\my-app
npx agent-execution-harness@latest init --adapter generic --cwd .
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
npx agent-execution-harness@latest doctor --harnessability --cwd .
If you installed it in package.json, update the package first:
cd C:\Projetos\my-app
npm install --save-dev agent-execution-harness@latest
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
npx agent-execution-harness@latest doctor --harnessability --cwd .For projects using pnpm:
pnpm add -D agent-execution-harness@latest
npx agent-execution-harness@latest init --adapter generic --cwd . --apply --agents-mode append
npx agent-execution-harness@latest doctor --harnessability --cwd .
Simple explanation: updating means downloading the new harness package, running the installer again, and checking the project with doctor.
Use --agents-mode append unless you are sure you want to replace your existing AGENTS.md.
Safe update behavior:
- existing harness history is preserved
- existing run reports in .agent-harness/runs/ are preserved
- existing map/learning memory in .agent-harness/ is preserved
- existing project history such as docs/historico.md is preserved
- existing agent-harness.config.json is not replaced automatically
- existing runtime docs are not replaced automatically
- existing package.json scripts are kept; missing harness scripts are added
- .gitignore receives harness lines only once
- AGENTS.md is appended only when you choose --agents-mode append
- AGENTS.md is replaced only when you choose --agents-mode overwrite
- every applied install creates a backup under .agent-harness/backups/
Think of update like installing a new tool version beside your project rules. It should improve the harness commands, not erase your project memory.
After Installing
Use natural language with your AI coding agent:
Create a plan for this bug.
Execute the plan using the harness.
Show the run_id, artifact path, status, evidence, verified claims, and rollback.
Good final signal:
status: completed
evidence: present
verified claims: present
Weak final signal:
Done. It should work.
Common Confusions
This section explains the common terms without assuming you are a developer.
What Is npm?
npm is the package registry where this tool is published.
GitHub stores the source code. npm distributes the installable package.
What Is npx?
npx runs a package from npm without requiring you to install it manually first.
This command:
npx agent-execution-harness@latest doctor --harnessability --cwd .
means:
Download the latest harness package, run its doctor command, and check this project.
Is The Harness Automatic?
Only when the project and agent are configured to use it.
The harness is not hidden magic inside every AI tool. It works when:
- the project has harness files installed
- the project has clear AGENTS.md rules
- the agent reads and follows those rules
- the agent can run local commands
- the task matches a rule requiring the harness
For example:
For approved multi-step plans, use the agent harness.
Do not declare success without a completed artifact, evidence, and verified claims.
Can A Bad Agent Ignore It?
Yes, if the surrounding tool lets it ignore project instructions.
The harness makes correct behavior easier to enforce and audit, but it cannot physically control every possible model or coding tool unless that tool invokes it.
That is why the final answer must include artifact evidence.
Practical rule:
No artifact, no evidence, no trust.
Troubleshooting
npx asks whether to install the package
That is normal. Accept it.
doctor does not pass
Read the findings. Usually this means one of these is missing:
- AGENTS.md
- agent-harness.config.json
- package scripts
- ignored artifact folder
Fix the reported item and run doctor again.
The agent says it used the harness but gives no artifact
Treat the work as incomplete. Ask:
Show the run_id, artifact path, final status, evidence, and verified claims.
The agent refuses or forgets to use the harness
Add a stronger project instruction in AGENTS.md:
For approved plans, multi-step work, risky changes, and delegated execution, use agent-harness.
Do not declare success without a completed artifact, evidence, and verified claims.
I only want to try it without changing my project
Run the init command without --apply:
npx agent-execution-harness@latest init --adapter generic --cwd .
This is a preview. It does not apply the installation.
Why Not Just Prompts?
Prompts are useful, but prompts are memory and intention. They can be ignored, forgotten, or interpreted differently by different models.
Agent Execution Harness turns the most important parts of the workflow into explicit runtime artifacts:
- the plan is structured
- the current phase is recorded
- allowed actions are constrained
- evidence must match a gate
- claims must be verified
- final reports are derived from artifacts
The harness does not replace prompts. It gives prompts something harder to drift away from.
Practical difference:
Prompt only:
"Please be careful and run tests."
Harness-backed:
"Record the gate, record exit_code, attach output excerpt, verify the claim, and only then report completion."
That is why this project focuses on execution evidence, not just better wording.
What Installation Adds To A Project
The installer is designed to configure harness-related files such as:
- AGENTS.md
- agent-harness.config.json
- package scripts
- ignored artifact folders
- plan/checklist templates
- adapter-specific rules
If the project already has AGENTS.md, the installer is conservative:
- default behavior: keep the existing file unchanged
- --agents-mode append: add a marked harness block once
- --agents-mode overwrite: replace the file, with backup created first
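As a rough sketch, a marked harness block appended to AGENTS.md might look like this; the marker comments are illustrative assumptions, not the exact markers the installer writes, while the rule text is taken from this README:

```markdown
<!-- agent-execution-harness:start -->
For approved multi-step plans, use the agent harness.
Do not declare success without a completed artifact, evidence, and verified claims.
<!-- agent-execution-harness:end -->
```

Because the block is marked, append mode can detect it and avoid adding it twice.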
The harness artifact folder is usually:
.agent-harness/runs/
Those artifacts prove what the agent actually did.
Core Concepts For Developers
Plan
A plan is a JSON document with:
- schema_version
- plan_id
- risk_level
- rollback_expectation
- gates
- tasks
Each task must include:
- task_id
- acceptance_criteria
Example:
{
"schema_version": "agent_harness_plan_v1",
"plan_id": "basic-plan",
"risk_level": "L2",
"rollback_expectation": "Delete generated test files.",
"gates": ["node --version"],
"tasks": [
{
"task_id": "basic-task",
"depends_on": [],
"acceptance_criteria": "node --version evidence passes."
}
]
}
Action
An action is one state transition requested by the agent.
Supported actions:
- read_context
- declare_files
- edit_file_ready
- run_gate
- record_evidence
- verify_claims
- final_report
- halt_for_risk
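The run command later in this README shows the action envelope for read_context. As an illustration only, a run_gate action would reuse the same envelope; only schema_version, type, and summary are confirmed by that example, and any gate-specific fields are assumptions not shown here:

```json
{
  "schema_version": "agent_harness_action_v1",
  "type": "run_gate",
  "summary": "Declare the login test gate before recording evidence."
}
```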
Artifact
The artifact is the source of truth for execution.
Default path:
.agent-harness/runs/<run_id>.json
If the artifact is not completed, the work is not complete.
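The full artifact schema is not documented in this README. As a hypothetical sketch built only from the fields this README mentions (run_id, status, phase, evidence, verified claims), an artifact excerpt might look like:

```json
{
  "run_id": "fix-login-20260428",
  "status": "completed",
  "phase": "completed",
  "evidence": [],
  "verified_claims": []
}
```

Treat the exact field names as assumptions; the real artifact on disk is authoritative.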
Evidence
Evidence records proof that a gate or check happened.
Each evidence item records:
- evidence_id
- optional evidence_type
- optional evidence_types
- check
- result
- exit_code
- output_excerpt
- scope_covered
- optional residual_gap
- optional output_ref and sha256 for long logs stored outside chat
Evidence types let the harness compare proof against the plan:
{
"evidence_id": "ui-verification",
"evidence_types": ["focused_tests", "scoped_lint", "scoped_typecheck", "visual_assertion"],
"check": "pnpm agent:verify:ui",
"result": "pass",
"exit_code": 0,
"output_excerpt": "Focused tests, lint, typecheck and visual assertion passed.",
"scope_covered": "sidebar UI verification"
}
Evidence Policy
Plans may declare required evidence per task:
{
"task_id": "verify-sidebar-layout",
"surface": "ui_layout",
"files": ["src/components/AppLayout.tsx"],
"required_evidence": [
"focused_tests",
"scoped_lint",
"scoped_typecheck",
"browser_smoke|visual_assertion"
],
"acceptance_criteria": "Sidebar layout has no overlap."
}
If required_evidence is not provided, the harness infers requirements from the task surface and files. UI/layout work requires focused tests, scoped lint, scoped typecheck, and browser smoke or visual assertion before a run can be completed.
If required proof is missing, the run becomes partial_validated instead of completed, and the report shows the evidence score plus missing requirements.
Claims
A claim is something the agent wants to assert as true.
Supported claim kinds include:
- file_exists
- command_ran
- gate_passed
- gate_failed
- dangerous_command_blocked
- task_reconciled
- bug_reproduced_before_fix
- bug_fixed_after_fix
- acceptance_criteria_met
- contract_preserved
- rollback_defined
- no_product_code_changed
Final reports should be derived from verified claims, not from agent confidence.
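As an illustration only, a verified-claims section inside a run artifact might look like the fragment below; the claim kinds come from the list above, but the surrounding field names are assumptions, not the published schema:

```json
{
  "verified_claims": [
    { "kind": "bug_reproduced_before_fix", "verified": true },
    { "kind": "gate_passed", "verified": true },
    { "kind": "acceptance_criteria_met", "verified": true }
  ]
}
```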
State Machine
The run follows this lifecycle:
init -> preflight -> task_start -> gate -> evidence -> report -> completed
A run may also enter:
halt
Meaning:
- init: run created
- preflight: context read, files must be declared
- task_start: task execution can begin
- gate: validation command must be declared
- evidence: validation result must be recorded
- report: claims must be verified and final report produced
- completed: run is complete
- halt: execution stopped for safety or invalid state
CLI Reference
When installed from npm, use:
agent-harness <command>
When developing from this repository, use:
node bin/agent-harness.mjs <command>
init
Prepares a target project using templates.
agent-harness init --adapter generic --cwd .
By default, init is a dry run. Applying changes must be explicit:
agent-harness init --adapter generic --cwd . --apply
Existing AGENTS.md handling:
agent-harness init --adapter generic --cwd . --apply --agents-mode skip
agent-harness init --adapter generic --cwd . --apply --agents-mode append
agent-harness init --adapter generic --cwd . --apply --agents-mode overwrite
Without --agents-mode, interactive terminals ask what to do. Non-interactive runs use skip to avoid overwriting user instructions.
doctor
Checks whether a project is configured correctly.
agent-harness doctor --harnessability --cwd .
plan-lint
Validates a plan before execution.
agent-harness plan-lint --plan plan.json
plan waves
Shows dependency waves from optional task depends_on fields.
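Waves come from depends_on. As a sketch using the plan schema shown earlier, the two independent tasks below would form wave 1 and the dependent task wave 2; the task ids are hypothetical:

```json
{
  "tasks": [
    { "task_id": "update-api", "depends_on": [], "acceptance_criteria": "API tests pass." },
    { "task_id": "update-ui", "depends_on": [], "acceptance_criteria": "UI tests pass." },
    { "task_id": "integration", "depends_on": ["update-api", "update-ui"], "acceptance_criteria": "Integration tests pass." }
  ]
}
```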
agent-harness plan waves --plan plan.json
dispatch
Guides serial fallback or safe subagent batches without spawning workers directly.
agent-harness dispatch plan --plan plan.json
agent-harness dispatch next --batch --runtime subagents
Validate returned worker JSON with agent-harness handoff validate --plan plan.json --task-id task-id --input worker-output.json. Dispatch does not have a separate validate command.
execute
Initializes or resumes a run.
agent-harness execute --plan plan.json --run-id fix-login
session start
Starts an active low-token session so later commands do not need to repeat --plan, --run-id and --mode.
agent-harness session start --plan plan.json --run-id fix-login
next
Returns only the next actionable step from the current artifact.
agent-harness next
verify
Runs a policy-checked command, stores long output as a referenced log, records sha256, and creates evidence automatically.
agent-harness verify --task-id fix-login --type focused_tests --cmd "pnpm test"
Strict structured command:
agent-harness verify --task-id fix-login --type focused_tests --exec pnpm --args-json "[\"test\"]"
map
Maintains compact codebase memory for selective reuse.
agent-harness map init
agent-harness map status
agent-harness map query --surface auth
agent-harness map update --files src/auth/session.ts
agent-harness map record --surface auth --files src/auth/session.ts --summary "Auth session owns login state contracts and must be checked before authorization edits."
Use map query before risky or unclear work. Use map update after changing files. Use map record only for durable facts: contracts, flows, invariants, known traps, and key files.
learn
Maintains governed lessons from real evidence.
agent-harness learn capture --surface auth --kind failure_pattern --summary "Auth fixes must verify authorization guards after session edits." --files src/auth/session.ts --evidence-ref .agent-harness/runs/fix.full.json
agent-harness learn validate --lesson-id auth-failure-pattern-20260502
agent-harness learn review --surface auth
agent-harness learn promote --lesson-id auth-failure-pattern-20260502
agent-harness learn query --surface auth --top-k 3 --compact --files src/auth/session.ts --failure-signature "guard failed"
agent-harness learn prune
Use learn query for repeated failures or known-risk surfaces. Use learn capture only after evidence exists, learn validate before promotion, and compact queries for weak agents. Do not store secrets or generic notes.
run
Applies one low-level action to the run state.
agent-harness run \
--plan plan.json \
--run-id fix-login \
--action '{"schema_version":"agent_harness_action_v1","type":"read_context","summary":"Read plan and repo context."}'
This is the transactional command used by autonomous agents.
report
Generates a final report from a completed artifact.
agent-harness report --run-id fix-login
benchmark
Runs an offline benchmark over captured scenarios.
agent-harness benchmark --mode smoke
The benchmark does not call model APIs.
Configuration
Projects can define:
agent-harness.config.json
Example:
{
"schema_version": "agent_harness_config_v1",
"artifact_dir": ".agent-harness/runs",
"product_paths": ["src/", "supabase/"],
"required_scripts": ["agent:harness", "agent:plan:lint"],
"doctor_profile": "generic",
"command_policy": {
"allow": [],
"deny": ["DROP", "TRUNCATE", "git reset --hard", "push --force"]
},
"token_budget": {
"observation_format": "ultra_compact",
"summary_max_chars": 240,
"output_excerpt_max_chars": 600,
"report_compact_max_chars": 1600
},
"codebase_memory": {
"enabled": true,
"memory_dir": ".agent-harness/memory",
"default_strategy": "query",
"stale_after_days": 14,
"max_summary_chars": 1200,
"surface_budgets": {
"auth": 1800,
"db": 1800,
"api": 1400,
"ai": 1400,
"ui": 900,
"ui_layout": 900,
"docs": 500,
"generic": 700
},
"high_risk_surfaces": ["auth", "db", "api", "ai"]
},
"learning_memory": {
"enabled": true,
"memory_dir": ".agent-harness/learning",
"top_k": 3,
"ttl_days": 60,
"max_summary_chars": 500,
"max_lessons_per_surface": 20
},
"architecture_rules": [
{
"id": "no_client_secret_import",
"from": "src/**/*.ts",
"forbid_import": "**/service-role**",
"reason": "client code must not import server-only secrets"
}
]
}
Important fields:
- artifact_dir: where run artifacts are stored
- product_paths: paths treated as product code
- required_scripts: scripts expected by doctor
- doctor_profile: validation profile for the project
- command_policy: allow/deny rules for commands
- token_budget: controls compact output and log excerpt limits
- codebase_memory: controls selective repository mapping and memory freshness
- learning_memory: controls evidence-backed lessons from fixes and failures
- architecture_rules: optional compact path/import boundary checks for doctor --architecture
Deny rules take priority over allow rules.
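For example, with the fragment below, a command like git push --force matches the allow entry but also matches the deny pattern, so the harness refuses it; the allow entry here is hypothetical and only illustrates the precedence rule:

```json
{
  "command_policy": {
    "allow": ["git push"],
    "deny": ["push --force"]
  }
}
```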
Safety Model
The harness blocks or halts when it sees unsafe behavior, including:
- destructive database commands
- destructive Git commands
- force push
- recursive forced deletion
- evidence that does not match the pending gate
- final report before verified claims
- completion before all tasks are reconciled
- shell commands in strict mode when structured --exec is required
- validation commands that are not listed in the task allowed_commands in strict mode
This does not replace human judgment. It creates mechanical pressure against unsafe automation.
What This Harness Does Not Promise
This harness does not guarantee that:
- the model will understand the product perfectly
- every bug will be found
- every generated test is meaningful
- every architectural decision is correct
- no human review is needed
It does provide stronger operating discipline:
- work is decomposed into tasks
- state is recorded
- evidence is required
- claims are checked
- dangerous operations are blocked or halted
- final reports are derived from artifacts
Local Development
Use this section only if you want to edit or contribute to the harness itself.
Install dependencies:
pnpm install
Run typecheck:
pnpm typecheck
Run tests:
pnpm test
Build:
pnpm build
Run integration tests:
pnpm test:integration
Run smoke benchmark:
pnpm benchmark:smoke
Run release readiness audit:
pnpm audit:release-readiness
Project Status
Current version:
0.14.4
Package:
agent-execution-harness
Treat this as an early public foundation for structured AI-assisted development.
It is useful today if you want agents to work with plans, evidence, safety stops, compact memory, and audit-friendly reports.
Weak Model Mode
Use --mode weak when the coding agent has low reasoning power, small context, or keeps drifting from the plan. The harness then behaves like guardrails on a narrow road: one compact next action, fewer files per task, typed evidence, shorter summaries, and repair hints when a gate fails.
Practical flow:
agent-harness plan-lint --plan plan.json
agent-harness session start --plan plan.json --run-id my-fix --mode weak
agent-harness next
agent-harness verify --task-id task-1 --type focused_tests --cmd "pnpm test"
agent-harness claim auto
agent-harness finish --summary "validated"
Weak mode is not for every request. Use normal mode for simple, trusted agents; use weak mode for risky work, junior agents, local LLMs, or repeated failures.
Strict Mode
Use --mode strict when the agent is weak, the work is sensitive, or you want the strongest local enforcement currently available.
Strict mode adds three important rules:
- validation commands must be declared in the plan task allowed_commands
- shell-style --cmd is blocked by default
- use --exec plus --args-json so the harness runs a structured command without shell parsing
Example:
agent-harness session start --plan plan.json --run-id strict-fix --mode strict
agent-harness verify --task-id task-1 --type focused_tests --exec pnpm --args-json "[\"test:run\",\"tests/login.test.ts\"]"
Strict mode is safer, but less flexible. If a command is not declared in the plan, the harness stops instead of guessing.
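As a sketch, a strict-mode task would carry its validation commands in allowed_commands; the exact placement of this field inside the task object is inferred from this README, not confirmed by a published schema:

```json
{
  "task_id": "task-1",
  "depends_on": [],
  "allowed_commands": ["pnpm test:run tests/login.test.ts"],
  "acceptance_criteria": "Login tests pass."
}
```

With this task, the strict-mode verify command shown above is permitted, and any other validation command is stopped.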
