agent-harness-kit

v0.15.0

Solo-dev harness engineering kit for Claude Code. Layered architecture, structural tests, garbage-collection ritual, review subagents — without the enterprise overhead.

justin96

claude-code harness-engineering layered-architecture structural-tests solo-dev code-review agents skills

agent-harness-kit

The infrastructure layer that makes AI agents production-ready.

Solo-dev harness engineering kit for Claude Code. One command, ~30 minutes, and your hobby project gets the patterns that took OpenAI from prototype to 1M lines of agent-generated code: layered architecture, structural tests, garbage collection, review subagents, JSON feature tracking, and pre-completion checklists — without the enterprise overhead.

The Harness Engineering Shift

February 2026: OpenAI published "Harness engineering: leveraging Codex in an agent-first world" documenting how their Frontier Product Exploration team built an internal product with ~1 million lines of code over 5 months — with zero lines manually written by humans.

The results:

3 engineers → 7 engineers
~1,500 PRs merged (3.5 PRs per engineer per day)
Each engineer operating at 3-10x capacity through agent delegation
Agents running autonomously for 6+ hours per task
~1 billion tokens processed per day

The insight: The work shifted from writing code to engineering the harness — the infrastructure, constraints, and feedback loops that make agents reliable at scale.

March 2026: LangChain demonstrated this principle empirically. By improving their agent harness alone (no model changes), they jumped from 52.8% → 66.5% on Terminal-Bench 2.0, climbing 25 spots on the leaderboard.

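The pattern is clear: with the model held constant, harness changes alone moved results by double digits, which is why this kit treats harness quality as the main lever for production outcomes.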

Why This Kit Exists

You're a solo developer or small team. You don't have OpenAI's infrastructure budget or Stripe's agent platform team. But you can adopt the same patterns at hobby-project scale:

What you get:

Proven patterns from production harnesses — OpenAI's two-fold initializer/coding-agent split, Anthropic's CLAUDE.md table-of-contents approach, Mitchell Hashimoto's "engineer the harness" discipline
30 skills that codify rituals from teams shipping agent-generated code at scale (/add-feature, /garbage-collection, /harness-improvement-loop, /review-this-pr, etc.)
9 read-only review subagents for cheap second-opinion passes (architecture, security, reliability, performance, API consistency, trace failure, eval rubric, adapter compatibility, release readiness)
Structural enforcement via TypeScript, Python, Go, Rust, Swift, and Kotlin adapters — catch layer violations before they compound
Cost guardrails and attribution — default budget plus provider-call cost by skill, task, and cache read/write bucket
JSON feature tracking (not Markdown) — Anthropic's pattern for machine-readable planning
Pre-completion checklists — OpenAI's golden-principles garbage collection ritual, scaled to top-3 fixes per week

What this kit does NOT claim:

Structural tests don't differentiate on happy-path 1-shot tasks. When seed code shows the pattern, Claude follows it — we measured 0/6 layer violations across bare and kit arms on our ts-layered fixture (5 consecutive null benches, May 2026).
The value is in long sessions, adversarial pressure, greenfield code, and weaker models — where pattern context drifts and shortcuts become tempting. Use the lint as a safety net, not as the reason you adopted the kit.

Installation

Option A: One-line install (recommended)

curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash

If the interactive prompt exits with "aborted by user" at the "Project name" step when the script is piped into bash, rerun with defaults:

curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash -s -- --yes

Or run the initializer directly so the prompt owns the terminal input:

npx agent-harness-kit init

Upgrade existing installation:

curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash -s -- --upgrade

Option B: Scaffold into existing repo

npx agent-harness-kit init

Option C: Install as Claude Code plugin

/plugin marketplace add tuanle96/agent-harness-kit
/plugin install agent-harness-kit@agent-harness-kit-marketplace

What Ships

Skills (30)

Slash commands that codify production harness rituals:

| Command | Purpose |
| --- | --- |
| /add-feature <description> | Implement one item from .harness/feature_list.json |
| /add-adr | Add a numbered Architecture Decision Record |
| /benchmark-suite | Run Mini SWE-bench style harness regression tasks |
| /context-health | Inspect context usage, token budget, and compaction risk |
| /create-story | Create an acceptance-tested Story Packet |
| /debug-flow | Run the failing flow before fixing it |
| /deliver-html | Ship an analysis/audit/plan as a self-contained HTML |
| /doc-drift-scan | Find stale path/command references in docs/ |
| /eval-rubric-author | Add deterministic checks plus evidence-backed rubrics |
| /eval-runner | Regression-test the harness itself |
| /feature-intake | Classify new work before implementation |
| /garbage-collection | Friday cleanup (top-3 fixes only at solo scale) |
| /harness-improvement-loop | Turn trace-backed failures into measured harness changes |
| /i18n-add-locale <code> | Scaffold a new translation locale for skills + CLAUDE.md |
| /inspect-app | Boot dev server + drive the failing flow before edits |
| /inspect-module <path> | Map a module before editing |
| /map-domain | Render layer config + flag config-vs-filesystem drift |
| /middleware-pipeline | Use retry/cache/timeout/telemetry/budget middleware |
| /model-profile | Compare model profiles by pass rate, cost, and latency |
| /orchestrate | Select or run a multi-agent workflow pattern |
| /propose-harness-improvement | Convert an agent failure into a permanent prevention |
| /refactor-feature | Restructure .harness/feature_list.json with proof gate |
| /regression-benchmark | Run Tier 2 isolated and multi-session regression benchmarks |
| /review-this-pr | Deterministic diff review against the current base |
| /setup-nightly-eval | Enable the nightly eval GitHub Actions workflow |
| /skill-discovery | Index skills and load full instructions on demand |
| /structural-test-author | Codify a new architectural rule mechanically |
| /trace-analyzer | Classify eval/session failures from trace evidence |
| /verify-ui | Run browser validation with screenshots and network logs |
| /write-skill | Create a new SKILL.md with valid frontmatter |
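
Since /add-feature implements one item from .harness/feature_list.json at a time, a machine-readable list makes "which item next?" a deterministic question. The sketch below is illustrative only: the `Feature` shape, its `status` values, and `nextFeature` are assumptions, because this README does not show the real schema.

```typescript
// Hypothetical item shape; the real feature_list.json schema is not shown here.
interface Feature {
  id: string;
  title: string;
  status: "pending" | "in_progress" | "done";
}

// Pick one item to work on: resume in-progress work first, else the first pending item.
function nextFeature(features: Feature[]): Feature | undefined {
  return (
    features.find((f) => f.status === "in_progress") ??
    features.find((f) => f.status === "pending")
  );
}

const example: Feature[] = [
  { id: "F-1", title: "Login form", status: "done" },
  { id: "F-2", title: "Password reset", status: "pending" },
];
console.log(nextFeature(example)?.id);
```

Keeping the list in JSON means a selection like this needs no Markdown parsing, which is the point of the "JSON, not Markdown" choice.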

Review Subagents (9)

Read-only personas for second-opinion passes:

architecture-reviewer — layering, coupling, cohesion
adapter-compatibility-reviewer — adapter claims, render paths, tests
api-consistency-reviewer — naming, versioning, breaking changes
eval-rubric-reviewer — deterministic checks and evidence-backed rubrics
security-reviewer — OWASP Top 10, auth, secrets
reliability-reviewer — error handling, retries, observability
performance-reviewer — N+1 queries, caching, indexing
release-harness-reviewer — package, installer, npm, and release truth
trace-failure-analyst — eval, regression, hook, and session failure triage

Hooks (9 event groups)

SessionStart: Inject compact project context on startup/resume/compact.
UserPromptSubmit: Block prompt patterns that bypass harness safety.
PreToolUse: Guard risky Bash/edit operations and enforce per-skill permission policy before tools run.
Notification: Notify on blocking states.
PostToolUse: Run structural checks after edits and record skill telemetry.
PreCompact: Snapshot state before context compaction.
Stop: Pre-completion checklist with stop_hook_active loop guard.
SubagentStop: Re-check structural state after subagent work.
SessionEnd: Roll up session telemetry.
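
The Stop hook's loop guard can be sketched as a pure decision function. Claude Code passes a Stop hook a JSON payload that includes a `stop_hook_active` flag, and the hook can answer with `{"decision": "block", "reason": "..."}` to keep the agent working. The `decideStop` function and the `failures` list are hypothetical names for illustration; this is not the kit's actual precompletion-checklist.sh.

```typescript
// Hypothetical sketch of a Stop-hook decision with a loop guard.

interface StopHookInput {
  // True when the agent is already continuing because a Stop hook blocked earlier.
  stop_hook_active?: boolean;
}

interface StopDecision {
  decision?: "block"; // absent means "allow the stop"
  reason?: string;
}

function decideStop(input: StopHookInput, failures: string[]): StopDecision {
  // Loop guard: never block twice in a row, otherwise a checklist item that
  // cannot be satisfied would keep the session running forever.
  if (input.stop_hook_active) return {};
  if (failures.length === 0) return {};
  return {
    decision: "block",
    reason: `Pre-completion checklist failed: ${failures.join("; ")}`,
  };
}

console.log(JSON.stringify(decideStop({ stop_hook_active: false }, ["tests not run"])));
```

A real hook script would read the payload from stdin and print the decision to stdout; keeping the decision pure makes the guard easy to unit test.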

Adapters (6)

TypeScript adapter: ts-morph + eslint-plugin-boundaries + dependency-cruiser
Python adapter: libcst + import-linter
Go adapter: go-parser structural checks + shared eval runner
Rust adapter: rust-lexer structural checks + shared eval runner
Swift adapter: swift-lexer structural checks + shared eval runner
Kotlin adapter: kotlin-lexer structural checks + shared eval runner

Ownership policy

User-owned files are never clobbered on init or upgrade: CLAUDE.md, AGENTS.md, .harness/docs/architecture.md, .harness/docs/core-beliefs.md, .harness/docs/golden-principles.md, .harness/docs/tech-debt-tracker.md, .harness/feature_list.json, .harness/config.json.

Projects can extend that protected set in .harness/config.json:

{
  "ownership": {
    "userOwnedFiles": [".harness/docs/local-runbook.md"],
    "generatedMutableFiles": [".harness/custom-state.json"]
  }
}
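
One way to picture the policy is a lookup that init and upgrade consult before writing a file. This is a minimal sketch under two assumptions: paths are matched exactly (no globs), and both `userOwnedFiles` and `generatedMutableFiles` extend the protected set. The `isProtected` helper is hypothetical and the kit's real matching rules may differ.

```typescript
// Hypothetical sketch: may init/upgrade overwrite this path?

// Built-in protected files, as listed in this README.
const DEFAULT_USER_OWNED: string[] = [
  "CLAUDE.md",
  "AGENTS.md",
  ".harness/docs/architecture.md",
  ".harness/docs/core-beliefs.md",
  ".harness/docs/golden-principles.md",
  ".harness/docs/tech-debt-tracker.md",
  ".harness/feature_list.json",
  ".harness/config.json",
];

interface OwnershipConfig {
  userOwnedFiles?: string[];
  generatedMutableFiles?: string[];
}

function isProtected(path: string, ownership: OwnershipConfig = {}): boolean {
  // Assumption: both config lists add to the defaults; exact-path match only.
  const protectedSet = new Set<string>([
    ...DEFAULT_USER_OWNED,
    ...(ownership.userOwnedFiles ?? []),
    ...(ownership.generatedMutableFiles ?? []),
  ]);
  return protectedSet.has(path);
}

console.log(isProtected("CLAUDE.md"));
```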

Eval Harness

Four dimensions: outcome / process / style / efficiency

Directory Structure

your-repo/
├── CLAUDE.md                          # 50–80 line table of contents
├── AGENTS.md                          # symlink → CLAUDE.md
├── .claude/
│   ├── settings.json
│   ├── skills/                        # 30 skills with SKILL.md + skill.json contracts
│   ├── agents/                        # 9 reviewer personas
│   └── hooks/hooks.json
├── .harness/
│   ├── config.json
│   ├── permissions.json               # per-skill tool allow/deny matrix
│   ├── skill-registry.json            # version/capability registry
│   ├── feature_list.json              # JSON, not Markdown — Anthropic pattern
│   ├── docs/
│   │   ├── architecture.md
│   │   ├── core-beliefs.md
│   │   ├── golden-principles.md
│   │   ├── telemetry-schema.md
│   │   ├── tech-debt-tracker.md
│   │   └── adr/
│   │       └── 0001-use-agent-harness-kit.md
│   ├── installed.json                 # kit lockfile (sha-tracked)
│   ├── PROGRESS.md                    # session log
│   ├── scripts/
│   │   ├── structural-test-on-edit.sh # PostToolUse hook target
│   │   ├── precompletion-checklist.sh # Stop hook target
│   │   ├── pretooluse-skill-permission-guard.mjs
│   │   ├── check-skill-contracts.mjs
│   │   ├── orchestration-schema-check.mjs
│   │   ├── session-replay.mjs
│   │   ├── cost-tracker.mjs
│   │   ├── dev-up.sh
│   │   ├── pre-push.sh
│   │   └── install-git-hooks.sh
│   └── structural-baseline.json       # existing-violation baseline
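
An existing-violation baseline lets a repo adopt structural checks without first fixing every old violation: only violations absent from the baseline fail the check. The sketch below illustrates that idea; the `Violation` shape and the `newViolations` helper are assumptions, since the real structural-baseline.json schema is not shown in this README.

```typescript
// Hypothetical sketch of baseline filtering.

interface Violation {
  file: string;   // importing file
  target: string; // imported module path
}

// Stable identity for a violation, used to compare runs.
const violationKey = (v: Violation): string => `${v.file} -> ${v.target}`;

// Return only the violations that are not already recorded in the baseline.
function newViolations(current: Violation[], baseline: Violation[]): Violation[] {
  const known = new Set(baseline.map(violationKey));
  return current.filter((v) => !known.has(violationKey(v)));
}

const baseline: Violation[] = [{ file: "src/ui/legacy.tsx", target: "src/repo/users.ts" }];
const current: Violation[] = [
  ...baseline,
  { file: "src/ui/new-page.tsx", target: "src/repo/orders.ts" },
];
console.log(newViolations(current, baseline).length);
```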

Configuration (`.harness/config.json`)

{
  "version": "0.1.0",
  "language": "typescript",
  "framework": "nextjs",
  "preset": "nextjs",
  "domains": [
    {
      "name": "default",
      "root": "src",
      "layers": ["types", "config", "repo", "service", "runtime", "ui"]
    }
  ],
  "providers": ["auth", "telemetry", "feature-flags"],
  "ownership": {
    "userOwnedFiles": [],
    "generatedMutableFiles": []
  },
  "models": {
    "main": "claude-sonnet-4-6",
    "reviewers": "claude-sonnet-4-6",
    "explore": "claude-haiku-4-5"
  },
  "budgets": { "perRunUsd": 2.0, "perDayUsd": 10.0 }
}
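
A few sanity checks on this file catch mistakes before a session starts. The sketch below types only the fields it inspects; the `validateConfig` helper and the specific rules (non-empty, non-repeating layers; positive budgets; per-run no larger than per-day) are illustrative assumptions, not documented kit behavior.

```typescript
// Hypothetical sketch: minimal validation of .harness/config.json.

interface HarnessConfig {
  domains: { name: string; root: string; layers: string[] }[];
  budgets: { perRunUsd: number; perDayUsd: number };
}

function validateConfig(cfg: HarnessConfig): string[] {
  const errors: string[] = [];
  for (const d of cfg.domains) {
    if (d.layers.length === 0) errors.push(`domain "${d.name}" has no layers`);
    if (new Set(d.layers).size !== d.layers.length) {
      errors.push(`domain "${d.name}" repeats a layer`);
    }
  }
  const { perRunUsd, perDayUsd } = cfg.budgets;
  if (perRunUsd <= 0 || perDayUsd <= 0) errors.push("budgets must be positive");
  if (perRunUsd > perDayUsd) errors.push("perRunUsd exceeds perDayUsd");
  return errors;
}

const sample: HarnessConfig = {
  domains: [
    { name: "default", root: "src", layers: ["types", "config", "repo", "service", "runtime", "ui"] },
  ],
  budgets: { perRunUsd: 2.0, perDayUsd: 10.0 },
};
console.log(validateConfig(sample));
```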

Philosophy (5 Axioms)

1. CLAUDE.md is a table of contents, not an encyclopedia

HumanLayer measured ~150–200 instructions as the reliable cap; OpenAI's own root file is ~100 lines. This kit's CLAUDE.md is 50–80 lines.

2. Every agent failure becomes a permanent harness change

Mitchell Hashimoto's "engineer the harness" discipline. The /propose-harness-improvement skill enforces this.

3. Computational sensors as safety net

Fowler/Böckeler's architectural fitness functions. The TypeScript, Python, Go, Rust, Swift, and Kotlin adapters ship deterministic structural checks; LLM subagents are reserved for semantic judgment.

Note: In our 1-shot bench (n=3, ts-layered), the agent already followed visible seed patterns and produced 0 boundary violations without enforcement. Treat structural tests as a safety net for drift in long sessions, not as a happy-path differentiator.
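
To make "deterministic structural check" concrete, here is a minimal sketch of a layer-direction test over import edges. The `ALLOWED` table is an assumption: this README does not spell out the kit's rules, and the only violation it names is ui→repo, which the table also rejects. The `layerOf` and `isViolation` helpers are hypothetical; the real adapters use ts-morph, import-linter, and similar tools.

```typescript
// Hypothetical allow-list: which layers each layer may import from.
const ALLOWED: Record<string, string[]> = {
  types: [],
  config: ["types"],
  repo: ["types", "config"],
  service: ["types", "config", "repo"],
  runtime: ["types", "config", "service"],
  ui: ["types", "config", "service", "runtime"],
};

// Extract the layer from a path such as "src/ui/page.tsx" (domain root "src").
function layerOf(path: string, root = "src"): string | undefined {
  const parts = path.split("/");
  const i = parts.indexOf(root);
  if (i < 0) return undefined;
  const layer = parts[i + 1];
  return layer !== undefined && Object.prototype.hasOwnProperty.call(ALLOWED, layer)
    ? layer
    : undefined;
}

// True when an import crosses layers in a direction the table does not allow.
function isViolation(fromPath: string, toPath: string): boolean {
  const from = layerOf(fromPath);
  const to = layerOf(toPath);
  if (from === undefined || to === undefined || from === to) return false;
  return !(ALLOWED[from] ?? []).includes(to);
}

console.log(isViolation("src/ui/page.tsx", "src/repo/users.ts"));
```

Because the check is a table lookup, it gives the same answer on every run, which is what separates it from an LLM reviewer's semantic judgment.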

4. Garbage collection over Friday cleanup, scaled to solo

OpenAI's golden-principles ritual, shrunk to top-3 fixes per week.

5. HTML for human deliverables, Markdown for agent files

Markdown is the right format for files an agent reads-and-edits (CLAUDE.md, SKILL.md, ADRs)
HTML is the right format for documents a HUMAN reads-and-decides (audit reports, analyses, plans, decision docs)

A long Markdown deliverable invites the human to scroll, miss the conclusion, and ask the agent to clarify — burning more tokens than the HTML markup costs. The /deliver-html skill writes self-contained HTML at repo root with a shared dark-theme CSS; the rule is documented in golden principle #11 and ADR-0002.

CLI Commands

agent-harness-kit init        # scaffold a repo (interactive)
agent-harness-kit init --yes  # accept all detected defaults
agent-harness-kit upgrade     # non-destructive upgrade, preserves user edits
agent-harness-kit doctor      # diagnose installed kit + Claude Code env
agent-harness-kit --version
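
A sha-tracked lockfile (.harness/installed.json) is what makes a non-destructive upgrade possible: comparing the hash recorded at install time with the file on disk shows whether the user has edited it. The sketch below illustrates that three-way decision. The `planUpgrade` function, its action names, and the use of SHA-256 are assumptions, not the kit's documented algorithm.

```typescript
import { createHash } from "node:crypto";

type UpgradeAction = "overwrite" | "skip-unchanged" | "keep-user-edit";

const sha256 = (text: string): string =>
  createHash("sha256").update(text, "utf8").digest("hex");

// installedSha: hash recorded when the kit last wrote the file (hypothetical lockfile field).
// onDisk: current file contents. incoming: contents the new kit version would write.
function planUpgrade(installedSha: string, onDisk: string, incoming: string): UpgradeAction {
  if (sha256(onDisk) !== installedSha) return "keep-user-edit"; // edited since install
  if (sha256(incoming) === installedSha) return "skip-unchanged"; // template did not change
  return "overwrite"; // untouched by the user and the kit has a newer template
}

const original = "# template v1\n";
console.log(planUpgrade(sha256(original), original, "# template v2\n"));
```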

Token / Cost Expectations

A typical day with the default model split (Sonnet 4.6 main + Haiku 4.5 explore + Sonnet 4.6 reviewers) stays under ~$2 of API traffic for a single developer.

The eval-runner skill enforces a per-run budget set in .harness/config.json.
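
Per-run enforcement can be pictured as a running total that refuses to go past `budgets.perRunUsd`. This is a minimal sketch, not the eval-runner's implementation: the `RunBudget` class and its methods are hypothetical, and it assumes each provider call reports its cost in USD.

```typescript
// Hypothetical sketch of a per-run budget guard.
class RunBudget {
  private spentUsd = 0;
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  // Record the cost of one provider call; throw once the run total exceeds the limit.
  charge(costUsd: number): void {
    this.spentUsd += costUsd;
    if (this.spentUsd > this.limitUsd) {
      throw new Error(
        `run budget exceeded: spent $${this.spentUsd.toFixed(2)} of $${this.limitUsd.toFixed(2)}`,
      );
    }
  }

  get remainingUsd(): number {
    return Math.max(0, this.limitUsd - this.spentUsd);
  }
}

const budget = new RunBudget(2.0); // perRunUsd from the sample config
budget.charge(1.5);
console.log(budget.remainingUsd);
```

Throwing after the charge means the call that crosses the limit is still counted, so the recorded spend stays accurate even when the run is aborted.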

OpenAI's harness processed ~1 billion tokens per day with 7 engineers. At solo scale, you're looking at ~10-50M tokens/day depending on session intensity.

Support Matrix

| Stack | Adapter | Preset | Dev command | Status |
| --- | --- | --- | --- | --- |
| Next.js 14 + TypeScript | typescript | nextjs | npm run dev | v0.1 |
| Express | typescript | node-api | node ./src/server.js | v0.1 |
| Fastify | typescript | node-api | node ./src/server.js | v0.1 |
| NestJS | typescript | node-api | npm run start:dev | v0.1 |
| FastAPI | python | fastapi | uvicorn app.main:app --reload | v0.1 |
| Django | python | django | python manage.py runserver | v0.1 |
| Flask | python | flask | flask --app app run --debug | v0.1 |
| Go | go | none | go run ./cmd/... | v0.4 |
| Rust | rust | none | cargo run | v0.4 |
| Swift | swift | none | swift run | v0.7 |
| Kotlin | kotlin | none | ./gradlew run | v0.7 |

Dependency Footprint

Runtime dependencies are intentionally split by surface:

| Dependency | Why it is present | Impact if missing |
| --- | --- | --- |
| commander | CLI command routing (init, upgrade, doctor) | CLI cannot start |
| @inquirer/prompts | Interactive init/upgrade prompts | Interactive mode fails; --yes paths still avoid most prompts |
| @clack/prompts | Polyglot setup selector with cancel handling | Polyglot-root setup falls back poorly |
| react + ink | Rich polyglot onboarding renderer only, not the hot scaffold path | Smart setup loses the app map UI; core render/upgrade logic still does not depend on React state |
| handlebars | Template rendering | init/upgrade cannot render scaffold files |
| picocolors | CLI diagnostics | Output loses structured color but behavior is otherwise unchanged |

Optional peer dependencies are adapter tooling, not core runtime:

| Peer dependency | Used by | When missing |
| --- | --- | --- |
| ts-morph | TypeScript structural runner | npm run harness:check fails with an explicit install message |
| eslint-plugin-boundaries | TypeScript ESLint defense-in-depth config | ESLint boundary config cannot run, but the ts-morph runner remains the primary gate |
| dependency-cruiser | Optional TypeScript dependency graph checks | Dependency-cruiser reports are unavailable; structural runner still enforces layer direction |

The TypeScript init path patches these peer tools into the target repo's devDependencies non-destructively.

CI: Real-Claude E2E Test (v0.7+)

The kit ships a CI job that spawns the real claude binary against a fresh init of itself and asserts that the SessionStart hook actually fires (with the expected additionalContext payload).

This catches the class of bug that v0.6's silent-no-op hooks fell into — every synthetic test passed for seven releases while not a single hook ever triggered inside a real Claude Code session.

The release gate also runs a real /orchestrate --run E2E against a freshly initialized kit. That path verifies fanout/fanin runtime output, schema validation, transcript capture, telemetry export, session replay, cost attribution, and cache read/write bucket closure.

Behavior:

Locally: npm test runs the real-Claude E2E case. The machine must have the claude binary installed and authenticated through either local Claude Code auth or ANTHROPIC_API_KEY.
CI: the normal test job runs non-Claude tests; the required e2e-claude job installs @anthropic-ai/claude-code globally and exercises one claude turn (~$0.01–0.05). Missing auth is a failed environment, not a skipped test.

For GitHub Actions, configure the ANTHROPIC_API_KEY repository secret before enabling the E2E job.

Local run (uses whatever auth the claude binary already has):

node scripts/e2e-claude-cli.mjs
node scripts/e2e-orchestrate-claude.mjs

Honest Expectations

What this kit DOES differentiate from bare claude-cli (anecdotal + design-level):

✅ Opinionated CLAUDE.md template (50–80 lines) so context isn't blown on style
✅ 30 skills that codify Hashimoto/OpenAI rituals
✅ 9 read-only review subagents for cheap second-opinion passes
✅ .harness/feature_list.json + ADR template + GC ritual for solo-scale planning hygiene
✅ Solo-dev cost defaults (~$2/day) and per-run budget enforcement

What it does NOT measurably differentiate (5 consecutive null benches, May 2026):

❌ Structural enforcement on happy-path 1-shot tasks. When seed code shows the layer pattern, claude-cli follows it — the boundaries lint has nothing to catch. We measured 0/6 ui→repo violations across bare and kit arms on the ts-layered fixture.

Where the structural test MIGHT still earn its keep (untested, listed for honesty, not as a claim):

Long multi-turn sessions where pattern context drifts
Adversarial "make it fast" pressure that tempts shortcuts
Greenfield code with no existing pattern to follow
Weaker model substrates (haiku, gpt-4o-mini)

Use the lint as a safety net, not as the reason you adopted the kit.

The Harness Engineering Trend (2025-2026)

Timeline:

August 2025: OpenAI's Frontier Product Exploration team starts the 1M-LOC experiment
February 2026: OpenAI publishes "Harness engineering: leveraging Codex in an agent-first world"
February 2026: Mitchell Hashimoto publishes "My AI Adoption Journey" coining "engineer the harness" as Step 5
March 2026: LangChain demonstrates +13.7pp Terminal-Bench improvement via harness changes alone
Q1 2026: Anthropic, Stripe, and other teams publish details about their agent harnesses

Key Insight:

"The work moved from writing code to building infrastructure that makes agents reliable at scale." — OpenAI

Industry Adoption:

Within roughly 90 days of Hashimoto's post, "harness engineering" had become a common term for the infrastructure layer around AI agents. Teams at Anthropic, OpenAI, LangChain, Stripe, and others published their patterns.

Why It Matters for Solo Devs:

You don't need a 7-person team or a billion-token-per-day budget to benefit. The patterns scale down:

CLAUDE.md as table of contents (not encyclopedia) — Anthropic pattern
JSON feature tracking — machine-readable planning
Garbage collection ritual — top-3 fixes per week instead of enterprise-scale cleanup
Review subagents — cheap second opinions without human bottlenecks
Structural tests — safety net for long sessions

This kit is those patterns, packaged for hobby-project scale.

License

MIT

Contributing

Issues and PRs welcome at github.com/tuanle96/agent-harness-kit

Found a bug? Open an issue.
Have a pattern from your own harness? Submit a PR with the skill or hook.
Want to add a language adapter? Check docs/adding-an-adapter.md and the existing TypeScript/Python/Go/Rust/Swift/Kotlin adapters.
