agent-harness-kit
v0.15.0
Solo-dev harness engineering kit for Claude Code. Layered architecture, structural tests, garbage-collection ritual, review subagents — without the enterprise overhead.
agent-harness-kit
The infrastructure layer that makes AI agents production-ready.
Solo-dev harness engineering kit for Claude Code. One command, ~30 minutes, and your hobby project gets the patterns that took OpenAI from prototype to 1M lines of agent-generated code: layered architecture, structural tests, garbage collection, review subagents, JSON feature tracking, and pre-completion checklists — without the enterprise overhead.
The Harness Engineering Shift
February 2026: OpenAI published "Harness engineering: leveraging Codex in an agent-first world" documenting how their Frontier Product Exploration team built an internal product with ~1 million lines of code over 5 months — with zero lines manually written by humans.
The results:
- 3 engineers → 7 engineers
- ~1,500 PRs merged (3.5 PRs per engineer per day)
- Each engineer operating at 3-10x capacity through agent delegation
- Agents running autonomously for 6+ hours per task
- ~1 billion tokens processed per day
The insight: The work shifted from writing code to engineering the harness — the infrastructure, constraints, and feedback loops that make agents reliable at scale.
March 2026: LangChain demonstrated this principle empirically. By improving their agent harness alone (no model changes), they jumped from 52.8% → 66.5% on Terminal-Bench 2.0, climbing 25 spots on the leaderboard.
The pattern: with the model held fixed, harness changes alone moved the benchmark result, which suggests harness quality can matter at least as much as model choice for production outcomes.
Why This Kit Exists
You're a solo developer or small team. You don't have OpenAI's infrastructure budget or Stripe's agent platform team, but you can adopt the same patterns at hobby-project scale.
What you get:
- Proven patterns from production harnesses — OpenAI's two-fold initializer/coding-agent split, Anthropic's CLAUDE.md table-of-contents approach, Mitchell Hashimoto's "engineer the harness" discipline
- 30 skills that codify rituals from teams shipping agent-generated code at scale (/add-feature, /garbage-collection, /harness-improvement-loop, /review-this-pr, etc.)
- 9 read-only review subagents for cheap second-opinion passes (architecture, security, reliability, performance, API consistency, trace failure, eval rubric, adapter compatibility, release readiness)
- Structural enforcement via TypeScript, Python, Go, Rust, Swift, and Kotlin adapters — catch layer violations before they compound
- Cost guardrails and attribution — default budget plus provider-call cost by skill, task, and cache read/write bucket
- JSON feature tracking (not Markdown) — Anthropic's pattern for machine-readable planning
- Pre-completion checklists — OpenAI's golden-principles garbage collection ritual, scaled to top-3 fixes per week
What this kit does NOT claim:
- Structural tests don't differentiate on happy-path 1-shot tasks. When seed code shows the pattern, Claude follows it: we measured 0/6 layer violations across bare and kit arms on our ts-layered fixture (5 consecutive null benches, May 2026).
- Any value is expected in long sessions, adversarial pressure, greenfield code, and weaker models, where pattern context drifts and shortcuts become tempting. Those cases are untested. Use the lint as a safety net, not as the reason you adopted the kit.
Installation
Option A: One-line install (recommended)
curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash
If the interactive prompt exits with "aborted by user" at "Project name" in a piped shell, rerun with defaults:
curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash -s -- --yes
Or run the initializer directly so the prompt owns the terminal input:
npx agent-harness-kit init
Upgrade an existing installation:
curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash -s -- --upgrade
Option B: Scaffold into existing repo
npx agent-harness-kit init
Option C: Install as Claude Code plugin
/plugin marketplace add tuanle96/agent-harness-kit
/plugin install agent-harness-kit@agent-harness-kit-marketplace
What Ships
Skills (30)
Slash commands that codify production harness rituals:
| Command | Purpose |
| -------------------------------- | -------------------------------------------------------- |
| /add-feature <description> | Implement one item from .harness/feature_list.json |
| /add-adr | Add a numbered Architecture Decision Record |
| /benchmark-suite | Run Mini SWE-bench style harness regression tasks |
| /context-health | Inspect context usage, token budget, and compaction risk |
| /create-story | Create an acceptance-tested Story Packet |
| /debug-flow | Run the failing flow before fixing it |
| /deliver-html | Ship an analysis/audit/plan as a self-contained HTML |
| /doc-drift-scan | Find stale path/command references in docs/ |
| /eval-rubric-author | Add deterministic checks plus evidence-backed rubrics |
| /eval-runner | Regression-test the harness itself |
| /feature-intake | Classify new work before implementation |
| /garbage-collection | Friday cleanup (top-3 fixes only at solo scale) |
| /harness-improvement-loop | Turn trace-backed failures into measured harness changes |
| /i18n-add-locale <code> | Scaffold a new translation locale for skills + CLAUDE.md |
| /inspect-app | Boot dev server + drive the failing flow before edits |
| /inspect-module <path> | Map a module before editing |
| /map-domain | Render layer config + flag config-vs-filesystem drift |
| /middleware-pipeline | Use retry/cache/timeout/telemetry/budget middleware |
| /model-profile | Compare model profiles by pass rate, cost, and latency |
| /orchestrate | Select or run a multi-agent workflow pattern |
| /propose-harness-improvement | Convert an agent failure into a permanent prevention |
| /refactor-feature | Restructure .harness/feature_list.json with proof gate |
| /regression-benchmark | Run Tier 2 isolated and multi-session regression benchmarks |
| /review-this-pr | Deterministic diff review against the current base |
| /setup-nightly-eval | Enable the nightly eval GitHub Actions workflow |
| /skill-discovery | Index skills and load full instructions on demand |
| /structural-test-author | Codify a new architectural rule mechanically |
| /trace-analyzer | Classify eval/session failures from trace evidence |
| /verify-ui | Run browser validation with screenshots and network logs |
| /write-skill | Create a new SKILL.md with valid frontmatter |
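The /add-feature row above works on one item from .harness/feature_list.json at a time. The file's schema is not shown in this README, so the following is only a sketch: the field names (id, description, passes) and both helper functions are hypothetical, chosen to illustrate why a machine-readable list is easy for an agent to pick from and update.

```typescript
// Hypothetical shape for one entry in .harness/feature_list.json.
// The real schema is not documented here; these field names are assumptions.
interface Feature {
  id: string;
  description: string;
  passes: boolean; // flipped to true only after the feature's checks succeed
}

// Return the first feature that has not passed yet, or undefined when all are done.
function nextFeature(features: readonly Feature[]): Feature | undefined {
  return features.find((f) => !f.passes);
}

// Mark one feature as passing without mutating the input list.
function markPassed(features: readonly Feature[], id: string): Feature[] {
  return features.map((f) => (f.id === id ? { ...f, passes: true } : f));
}

const backlog: Feature[] = [
  { id: "F-001", description: "Login form", passes: true },
  { id: "F-002", description: "Password reset", passes: false },
  { id: "F-003", description: "Audit log", passes: false },
];

const current = nextFeature(backlog); // the F-002 entry
const updated = markPassed(backlog, "F-002");
const after = nextFeature(updated); // the F-003 entry
```

Because the state is a boolean per item rather than prose, "what is left" is a filter instead of a reading-comprehension task.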
Review Subagents (9)
Read-only personas for second-opinion passes:
- architecture-reviewer: layering, coupling, cohesion
- adapter-compatibility-reviewer: adapter claims, render paths, tests
- api-consistency-reviewer: naming, versioning, breaking changes
- eval-rubric-reviewer: deterministic checks and evidence-backed rubrics
- security-reviewer: OWASP Top 10, auth, secrets
- reliability-reviewer: error handling, retries, observability
- performance-reviewer: N+1 queries, caching, indexing
- release-harness-reviewer: package, installer, npm, and release truth
- trace-failure-analyst: eval, regression, hook, and session failure triage
Hooks (9 event groups)
- SessionStart: Inject compact project context on startup/resume/compact.
- UserPromptSubmit: Block prompt patterns that bypass harness safety.
- PreToolUse: Guard risky Bash/edit operations and enforce per-skill permission policy before tools run.
- Notification: Notify on blocking states.
- PostToolUse: Run structural checks after edits and record skill telemetry.
- PreCompact: Snapshot state before context compaction.
- Stop: Pre-completion checklist with stop_hook_active loop guard.
- SubagentStop: Re-check structural state after subagent work.
- SessionEnd: Roll up session telemetry.
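The Stop entry in the list above mentions a stop_hook_active loop guard. A minimal TypeScript sketch of that decision logic follows. It is not the content of the kit's precompletion-checklist.sh; the failures list is a hypothetical input, and the stop_hook_active field and the { decision: "block", reason } reply reflect Claude Code's hook contract as understood here.

```typescript
// Subset of the JSON a Claude Code Stop hook receives on stdin.
interface StopHookInput {
  stop_hook_active?: boolean; // true when Claude is already continuing because of a Stop hook
}

// JSON a Stop hook can print to keep the agent working.
interface StopDecision {
  decision: "block";
  reason: string;
}

// `failures` is a hypothetical list of unmet checklist items.
// Returning undefined means "allow the stop".
function decideStop(
  input: StopHookInput,
  failures: readonly string[],
): StopDecision | undefined {
  // Loop guard: this stop was already blocked once, so let it through
  // instead of blocking forever.
  if (input.stop_hook_active === true) return undefined;
  if (failures.length === 0) return undefined;
  return {
    decision: "block",
    reason: `Pre-completion checklist failed: ${failures.join("; ")}`,
  };
}

const first = decideStop({ stop_hook_active: false }, ["structural check not run"]);
const second = decideStop({ stop_hook_active: true }, ["structural check not run"]);
// first is a block decision; second is undefined because the guard fired.
```

Without the guard, a checklist item the agent cannot satisfy would block every stop attempt and the session would never end.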
Adapters (6)
- TypeScript adapter: ts-morph + eslint-plugin-boundaries + dependency-cruiser
- Python adapter: libcst + import-linter
- Go adapter: go-parser structural checks + shared eval runner
- Rust adapter: rust-lexer structural checks + shared eval runner
- Swift adapter: swift-lexer structural checks + shared eval runner
- Kotlin adapter: kotlin-lexer structural checks + shared eval runner
Ownership policy
User-owned files are never clobbered on init or upgrade: CLAUDE.md, AGENTS.md, .harness/docs/architecture.md, .harness/docs/core-beliefs.md, .harness/docs/golden-principles.md, .harness/docs/tech-debt-tracker.md, .harness/feature_list.json, .harness/config.json.
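A sketch of how such a "never clobber" rule could be checked before a write. This is an illustration, not the kit's implementation: normalize and mayOverwrite are hypothetical helpers, extraUserOwned stands in for any project-specific additions to the protected set, and treating a missing file as safe to create is an assumption.

```typescript
// Files this README lists as user-owned by default.
const DEFAULT_USER_OWNED: readonly string[] = [
  "CLAUDE.md",
  "AGENTS.md",
  ".harness/docs/architecture.md",
  ".harness/docs/core-beliefs.md",
  ".harness/docs/golden-principles.md",
  ".harness/docs/tech-debt-tracker.md",
  ".harness/feature_list.json",
  ".harness/config.json",
];

// Hypothetical helper: make "./a//b" and "a\b" compare equal to "a/b".
function normalize(path: string): string {
  return path.replace(/\\/g, "/").replace(/^\.\//, "").replace(/\/+/g, "/");
}

// Decide whether init/upgrade may write `path`.
// `exists` says whether the file is already on disk.
function mayOverwrite(
  path: string,
  exists: boolean,
  extraUserOwned: readonly string[] = [],
): boolean {
  if (!exists) return true; // assumption: creating a missing file clobbers nothing
  const owned = new Set([...DEFAULT_USER_OWNED, ...extraUserOwned].map(normalize));
  return !owned.has(normalize(path));
}

const canReplaceClaudeMd = mayOverwrite("./CLAUDE.md", true); // false
const canReplaceScript = mayOverwrite(".harness/scripts/dev-up.sh", true); // true
```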
Projects can extend that protected set in .harness/config.json:
{
"ownership": {
"userOwnedFiles": [".harness/docs/local-runbook.md"],
"generatedMutableFiles": [".harness/custom-state.json"]
}
}
Eval Harness
Four dimensions: outcome / process / style / efficiency
Directory Structure
your-repo/
├── CLAUDE.md # 50–80 line table of contents
├── AGENTS.md # symlink → CLAUDE.md
├── .claude/
│ ├── settings.json
│ ├── skills/ # 30 skills with SKILL.md + skill.json contracts
│ ├── agents/ # 9 reviewer personas
│ └── hooks/hooks.json
├── .harness/
│ ├── config.json
│ ├── permissions.json # per-skill tool allow/deny matrix
│ ├── skill-registry.json # version/capability registry
│ ├── feature_list.json # JSON, not Markdown — Anthropic pattern
│ ├── docs/
│ │ ├── architecture.md
│ │ ├── core-beliefs.md
│ │ ├── golden-principles.md
│ │ ├── telemetry-schema.md
│ │ ├── tech-debt-tracker.md
│ │ └── adr/
│ │     └── 0001-use-agent-harness-kit.md
│ ├── installed.json # kit lockfile (sha-tracked)
│ ├── PROGRESS.md # session log
│ ├── scripts/
│ │ ├── structural-test-on-edit.sh # PostToolUse hook target
│ │ ├── precompletion-checklist.sh # Stop hook target
│ │ ├── pretooluse-skill-permission-guard.mjs
│ │ ├── check-skill-contracts.mjs
│ │ ├── orchestration-schema-check.mjs
│ │ ├── session-replay.mjs
│ │ ├── cost-tracker.mjs
│ │ ├── dev-up.sh
│ │ ├── pre-push.sh
│ │ └── install-git-hooks.sh
│ └── structural-baseline.json # existing-violation baseline
Configuration (.harness/config.json)
{
"version": "0.1.0",
"language": "typescript",
"framework": "nextjs",
"preset": "nextjs",
"domains": [
{
"name": "default",
"root": "src",
"layers": ["types", "config", "repo", "service", "runtime", "ui"]
}
],
"providers": ["auth", "telemetry", "feature-flags"],
"ownership": {
"userOwnedFiles": [],
"generatedMutableFiles": []
},
"models": {
"main": "claude-sonnet-4-6",
"reviewers": "claude-sonnet-4-6",
"explore": "claude-haiku-4-5"
},
"budgets": { "perRunUsd": 2.0, "perDayUsd": 10.0 }
}
Philosophy (5 Axioms)
1. CLAUDE.md is a table of contents, not an encyclopedia
HumanLayer measured ~150–200 instructions as the reliable cap; OpenAI's own root file is ~100 lines. This kit's CLAUDE.md is 50–80 lines.
2. Every agent failure becomes a permanent harness change
Mitchell Hashimoto's "engineer the harness" discipline. The /propose-harness-improvement skill enforces this.
3. Computational sensors as safety net
Fowler/Böckeler's architectural fitness functions. The TypeScript, Python, Go, Rust, Swift, and Kotlin adapters ship deterministic structural checks; LLM subagents are reserved for semantic judgment.
Note: In our 1-shot bench (n=3, ts-layered), the agent already followed visible seed patterns and produced 0 boundary violations without enforcement. Treat structural tests as a safety net for drift in long sessions, not as a happy-path differentiator.
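To make "deterministic structural check" concrete, here is a small TypeScript sketch of a layer-boundary test over already-extracted import edges. The layer names come from the example .harness/config.json. Everything else is an assumption: the ALLOWED map is hypothetical (picked so that ui → repo is rejected, the violation the bench counts), and the real adapters use ts-morph and related tools to find imports rather than a hand-built edge list.

```typescript
// Layer names taken from the example .harness/config.json.
type Layer = "types" | "config" | "repo" | "service" | "runtime" | "ui";

// HYPOTHETICAL allow-map: which layers each layer may import from.
// Chosen so that ui -> repo is rejected; the kit's real rule set may differ.
const ALLOWED: Record<Layer, readonly Layer[]> = {
  types: [],
  config: ["types"],
  repo: ["types", "config"],
  service: ["types", "config", "repo"],
  runtime: ["types", "config", "service"],
  ui: ["types", "config", "runtime"],
};

interface ImportEdge {
  fromFile: string;
  toFile: string;
}

const LAYERS = Object.keys(ALLOWED) as Layer[];

// Map "src/<layer>/..." to its layer; undefined for files outside the layered root.
function layerOf(file: string, root = "src"): Layer | undefined {
  const prefix = `${root}/`;
  if (!file.startsWith(prefix)) return undefined;
  const segment = file.slice(prefix.length).split("/")[0];
  return LAYERS.find((l) => l === segment);
}

// Return every cross-layer edge that the allow-map does not permit.
function findViolations(edges: readonly ImportEdge[]): ImportEdge[] {
  return edges.filter((e) => {
    const from = layerOf(e.fromFile);
    const to = layerOf(e.toFile);
    if (from === undefined || to === undefined || from === to) return false;
    return ALLOWED[from].indexOf(to) === -1;
  });
}

const violations = findViolations([
  { fromFile: "src/ui/Profile.tsx", toFile: "src/runtime/session.ts" }, // allowed
  { fromFile: "src/ui/Profile.tsx", toFile: "src/repo/userRepo.ts" }, // ui -> repo: flagged
  { fromFile: "src/service/user.ts", toFile: "src/repo/userRepo.ts" }, // allowed
]);
// violations holds exactly the ui -> repo edge.
```

The check is a pure function of the import graph, so it gives the same answer on every run and costs no tokens. That is the property that separates a computational sensor from an LLM reviewer.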
4. Garbage collection over Friday cleanup, scaled to solo
OpenAI's golden-principles ritual, shrunk to top-3 fixes per week.
5. HTML for human deliverables, Markdown for agent files
- Markdown is the right format for files an agent reads-and-edits (CLAUDE.md, SKILL.md, ADRs)
- HTML is the right format for documents a HUMAN reads-and-decides (audit reports, analyses, plans, decision docs)
A long Markdown deliverable invites the human to scroll, miss the conclusion, and ask the agent to clarify — burning more tokens than the HTML markup costs. The /deliver-html skill writes self-contained HTML at repo root with a shared dark-theme CSS; the rule is documented in golden principle #11 and ADR-0002.
CLI Commands
agent-harness-kit init # scaffold a repo (interactive)
agent-harness-kit init --yes # accept all detected defaults
agent-harness-kit upgrade # non-destructive upgrade, preserves user edits
agent-harness-kit doctor # diagnose installed kit + Claude Code env
agent-harness-kit --version
Token / Cost Expectations
A typical day with the default model split (Sonnet 4.6 main + Haiku 4.5 explore + Sonnet 4.6 reviewers) stays under ~$2 of API traffic for a single developer.
The eval-runner skill enforces a per-run budget set in .harness/config.json.
OpenAI's harness processed ~1 billion tokens per day with 7 engineers. At solo scale, you're looking at ~10-50M tokens/day depending on session intensity.
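A sketch of what a budget gate against the budgets block of .harness/config.json might look like. The perRunUsd and perDayUsd names come from the example config; checkBudget, its inputs, and the verdict shape are hypothetical, and how the kit actually measures spend is not covered here.

```typescript
// Mirrors the "budgets" block of the example .harness/config.json.
interface Budgets {
  perRunUsd: number;
  perDayUsd: number;
}

type BudgetVerdict =
  | { ok: true }
  | { ok: false; limit: "perRunUsd" | "perDayUsd"; overBy: number };

// runCostUsd: cost of the current run so far.
// spentTodayUsd: cost of earlier runs today.
// Both are assumed to come from some cost tracker.
function checkBudget(
  budgets: Budgets,
  runCostUsd: number,
  spentTodayUsd: number,
): BudgetVerdict {
  if (runCostUsd > budgets.perRunUsd) {
    return { ok: false, limit: "perRunUsd", overBy: runCostUsd - budgets.perRunUsd };
  }
  const dayTotal = spentTodayUsd + runCostUsd;
  if (dayTotal > budgets.perDayUsd) {
    return { ok: false, limit: "perDayUsd", overBy: dayTotal - budgets.perDayUsd };
  }
  return { ok: true };
}

const budgets: Budgets = { perRunUsd: 2.0, perDayUsd: 10.0 };
const withinLimits = checkBudget(budgets, 1.5, 6.0); // under both limits
const runTooCostly = checkBudget(budgets, 2.5, 0); // exceeds perRunUsd
const dayExhausted = checkBudget(budgets, 1.5, 9.0); // exceeds perDayUsd
```

Checking the per-run limit first means a single runaway run is reported as such, even on a day that is already over budget.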
Support Matrix
| Stack | Adapter | Preset | Dev command | Status |
| ------------------------------ | ------------ | ----------- | -------------------------------------- | ------ |
| Next.js 14 + TypeScript | typescript | nextjs | npm run dev | v0.1 |
| Express | typescript | node-api | node ./src/server.js | v0.1 |
| Fastify | typescript | node-api | node ./src/server.js | v0.1 |
| NestJS | typescript | node-api | npm run start:dev | v0.1 |
| FastAPI | python | fastapi | uvicorn app.main:app --reload | v0.1 |
| Django | python | django | python manage.py runserver | v0.1 |
| Flask | python | flask | flask --app app run --debug | v0.1 |
| Go | go | none | go run ./cmd/... | v0.4 |
| Rust | rust | none | cargo run | v0.4 |
| Swift | swift | none | swift run | v0.7 |
| Kotlin | kotlin | none | ./gradlew run | v0.7 |
Dependency Footprint
Runtime dependencies are intentionally split by surface:
| Dependency | Why it is present | Impact if missing |
| ---------- | ----------------- | ----------------- |
| commander | CLI command routing (init, upgrade, doctor) | CLI cannot start |
| @inquirer/prompts | Interactive init/upgrade prompts | Interactive mode fails; --yes paths still avoid most prompts |
| @clack/prompts | Polyglot setup selector with cancel handling | Polyglot-root setup falls back poorly |
| react + ink | Rich polyglot onboarding renderer only, not the hot scaffold path | Smart setup loses the app map UI; core render/upgrade logic still does not depend on React state |
| handlebars | Template rendering | init/upgrade cannot render scaffold files |
| picocolors | CLI diagnostics | Output loses structured color but behavior is otherwise unchanged |
Optional peer dependencies are adapter tooling, not core runtime:
| Peer dependency | Used by | When missing |
| --------------- | ------- | ------------ |
| ts-morph | TypeScript structural runner | npm run harness:check fails with an explicit install message |
| eslint-plugin-boundaries | TypeScript ESLint defense-in-depth config | ESLint boundary config cannot run, but the ts-morph runner remains the primary gate |
| dependency-cruiser | Optional TypeScript dependency graph checks | Dependency-cruiser reports are unavailable; structural runner still enforces layer direction |
The TypeScript init path patches these peer tools into the target repo's devDependencies non-destructively.
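One way to read "non-destructively" is: add what is missing, never rewrite what the repo already pins. The sketch below illustrates that reading and is not the kit's actual patch logic. patchDevDependencies is a hypothetical function, the "*" ranges are placeholders because the kit's pinned versions are not listed here, and the sample package.json content is illustrative.

```typescript
interface PackageJson {
  devDependencies?: Record<string, string>;
  [key: string]: unknown;
}

// Add each wanted tool only when the repo does not already list it;
// existing version ranges are left untouched and the input is not mutated.
function patchDevDependencies(
  pkg: PackageJson,
  wanted: Record<string, string>,
): { pkg: PackageJson; added: string[] } {
  const merged: Record<string, string> = { ...(pkg.devDependencies ?? {}) };
  const added: string[] = [];
  for (const tool of Object.keys(wanted).sort()) {
    const range = wanted[tool];
    const alreadyPresent = Object.prototype.hasOwnProperty.call(merged, tool);
    if (range !== undefined && !alreadyPresent) {
      merged[tool] = range;
      added.push(tool);
    }
  }
  return { pkg: { ...pkg, devDependencies: merged }, added };
}

const result = patchDevDependencies(
  // Illustrative package.json; the version ranges are sample values.
  { name: "my-app", devDependencies: { "ts-morph": "^22.0.0", vitest: "^1.6.0" } },
  // Placeholder ranges; the kit's real pins are not shown in this README.
  { "ts-morph": "*", "eslint-plugin-boundaries": "*", "dependency-cruiser": "*" },
);
// result.added lists dependency-cruiser and eslint-plugin-boundaries;
// the existing ts-morph range is kept as-is.
```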
CI: Real-Claude E2E Test (v0.7+)
The kit ships a CI job that spawns the real claude binary against a fresh init of itself and asserts that the SessionStart hook actually fires (with the expected additionalContext payload).
This catches the class of bug that v0.6's silent-no-op hooks fell into — every synthetic test passed for seven releases while not a single hook ever triggered inside a real Claude Code session.
The release gate also runs a real /orchestrate --run E2E against a freshly
initialized kit. That path verifies fanout/fanin runtime output, schema
validation, transcript capture, telemetry export, session replay, cost
attribution, and cache read/write bucket closure.
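The SessionStart assertion described above comes down to checking that the hook printed a well-formed payload with real context in it. A minimal sketch of such a check, assuming the hook's stdout has already been captured as a string. How the kit's CI job captures it is not shown here; checkSessionStart and mustMention are hypothetical, and the hookSpecificOutput.additionalContext layout reflects Claude Code's hook output format as understood here.

```typescript
// Shape of the JSON a SessionStart hook can print to inject context.
interface SessionStartOutput {
  hookSpecificOutput?: { hookEventName?: unknown; additionalContext?: unknown };
}

// Return a list of problems; an empty list means the payload looks healthy.
// `mustMention` is a hypothetical list of strings the context should contain.
function checkSessionStart(rawStdout: string, mustMention: readonly string[]): string[] {
  let parsed: SessionStartOutput | null;
  try {
    parsed = JSON.parse(rawStdout) as SessionStartOutput | null;
  } catch {
    return ["hook output is not valid JSON"];
  }
  const out = parsed?.hookSpecificOutput;
  if (!out) return ["hookSpecificOutput is missing"];

  const problems: string[] = [];
  if (out.hookEventName !== "SessionStart") {
    problems.push("hookEventName is not SessionStart");
  }
  const context = typeof out.additionalContext === "string" ? out.additionalContext : "";
  if (context.trim() === "") problems.push("additionalContext is empty or missing");
  for (const needle of mustMention) {
    if (context.indexOf(needle) === -1) {
      problems.push(`additionalContext does not mention "${needle}"`);
    }
  }
  return problems;
}

const healthy = checkSessionStart(
  JSON.stringify({
    hookSpecificOutput: {
      hookEventName: "SessionStart",
      additionalContext: "Project map: see CLAUDE.md",
    },
  }),
  ["CLAUDE.md"],
);
// A hook that never fired leaves empty output, which fails the check.
const silentNoOp = checkSessionStart("", ["CLAUDE.md"]);
```

The empty-output case is the important one: a synthetic test that only inspects hooks.json would pass it, while a check on real output does not.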
Behavior:
- Locally: npm test runs the real-Claude E2E case. The machine must have the claude binary installed and authenticated through either local Claude Code auth or ANTHROPIC_API_KEY.
- CI: the normal test job runs non-Claude tests; the required e2e-claude job installs @anthropic-ai/claude-code globally and exercises one claude turn (~$0.01–0.05). Missing auth is a failed environment, not a skipped test.
For GitHub Actions, configure the ANTHROPIC_API_KEY repository secret before enabling the E2E job.
Local run (uses whatever auth the claude binary already has):
node scripts/e2e-claude-cli.mjs
node scripts/e2e-orchestrate-claude.mjs
Honest Expectations
What this kit DOES differentiate from bare claude-cli (anecdotal + design-level):
- ✅ Opinionated CLAUDE.md template (50–80 lines) so context isn't blown on style
- ✅ 30 skills that codify Hashimoto/OpenAI rituals
- ✅ 9 read-only review subagents for cheap second-opinion passes
- ✅ .harness/feature_list.json + ADR template + GC ritual for solo-scale planning hygiene
- ✅ Solo-dev cost defaults (~$2/day) and per-run budget enforcement
What it does NOT measurably differentiate (5 consecutive null benches, May 2026):
- ❌ Structural enforcement on happy-path 1-shot tasks. When seed code shows the layer pattern, claude-cli follows it, so the boundaries lint has nothing to catch. We measured 0/6 ui→repo violations across bare and kit arms on the ts-layered fixture.
Where the structural test MIGHT still earn its keep (untested, listed for honesty, not as a claim):
- Long multi-turn sessions where pattern context drifts
- Adversarial "make it fast" pressure that tempts shortcuts
- Greenfield code with no existing pattern to follow
- Weaker model substrates (haiku, gpt-4o-mini)
Use the lint as a safety net, not as the reason you adopted the kit.
The Harness Engineering Trend (2025-2026)
Timeline:
- August 2025: OpenAI's Frontier Product Exploration team starts the 1M-LOC experiment
- February 2026: OpenAI publishes "Harness engineering: leveraging Codex in an agent-first world"
- February 2026: Mitchell Hashimoto publishes "My AI Adoption Journey" coining "engineer the harness" as Step 5
- March 2026: LangChain demonstrates +13.7pp Terminal-Bench improvement via harness changes alone
- Q1 2026: Anthropic, Stripe, and other teams publish details about their agent harnesses
Key Insight:
"The work moved from writing code to building infrastructure that makes agents reliable at scale." — OpenAI
Industry Adoption:
Within 90 days of Hashimoto's post, "harness engineering" became the standard term for the infrastructure layer around AI agents. Teams at Anthropic, OpenAI, LangChain, Stripe, and others published their patterns.
Why It Matters for Solo Devs:
You don't need a 7-person team or a billion-token-per-day budget to benefit. The patterns scale down:
- CLAUDE.md as table of contents (not encyclopedia) — Anthropic pattern
- JSON feature tracking — machine-readable planning
- Garbage collection ritual — top-3 fixes per week instead of enterprise-scale cleanup
- Review subagents — cheap second opinions without human bottlenecks
- Structural tests — safety net for long sessions
This kit is those patterns, packaged for hobby-project scale.
References
- OpenAI: Harness engineering: leveraging Codex in an agent-first world
- Mitchell Hashimoto: My AI Adoption Journey
- LangChain: Improving Deep Agents with harness engineering
- Anthropic: Harness design for long-running application development
- HumanLayer: CLAUDE.md best practices
- Martin Fowler: Architectural Fitness Functions
License
MIT
Contributing
Issues and PRs welcome at github.com/tuanle96/agent-harness-kit
Found a bug? Open an issue.
Have a pattern from your own harness? Submit a PR with the skill or hook.
Want to add a language adapter? Check docs/adding-an-adapter.md and the existing TypeScript/Python/Go/Rust/Swift/Kotlin adapters.
