agent-harness-kit

v0.15.0

Solo-dev harness engineering kit for Claude Code. Layered architecture, structural tests, garbage-collection ritual, review subagents — without the enterprise overhead.

Downloads

2,734

Readme

agent-harness-kit

The infrastructure layer that makes AI agents production-ready.

Solo-dev harness engineering kit for Claude Code. One command, ~30 minutes, and your hobby project gets the patterns that took OpenAI from prototype to 1M lines of agent-generated code: layered architecture, structural tests, garbage collection, review subagents, JSON feature tracking, and pre-completion checklists — without the enterprise overhead.


The Harness Engineering Shift

February 2026: OpenAI published "Harness engineering: leveraging Codex in an agent-first world" documenting how their Frontier Product Exploration team built an internal product with ~1 million lines of code over 5 months — with zero lines manually written by humans.

The results:

  • 3 engineers → 7 engineers
  • ~1,500 PRs merged (3.5 PRs per engineer per day)
  • Each engineer operating at 3-10x capacity through agent delegation
  • Agents running autonomously for 6+ hours per task
  • ~1 billion tokens processed per day

The insight: The work shifted from writing code to engineering the harness — the infrastructure, constraints, and feedback loops that make agents reliable at scale.

March 2026: LangChain demonstrated this principle empirically. By improving their agent harness alone (no model changes), they jumped from 52.8% → 66.5% on Terminal-Bench 2.0, climbing 25 spots on the leaderboard.

The pattern is clear: Harness quality matters more than model choice for production outcomes.


Why This Kit Exists

You're a solo developer or small team. You don't have OpenAI's infrastructure budget or Stripe's agent platform team. But you can adopt the same patterns at hobby-project scale:

What you get:

  • Proven patterns from production harnesses — OpenAI's two-fold initializer/coding-agent split, Anthropic's CLAUDE.md table-of-contents approach, Mitchell Hashimoto's "engineer the harness" discipline
  • 30 skills that codify rituals from teams shipping agent-generated code at scale (/add-feature, /garbage-collection, /harness-improvement-loop, /review-this-pr, etc.)
  • 9 read-only review subagents for cheap second-opinion passes (architecture, security, reliability, performance, API consistency, trace failure, eval rubric, adapter compatibility, release readiness)
  • Structural enforcement via TypeScript, Python, Go, Rust, Swift, and Kotlin adapters — catch layer violations before they compound
  • Cost guardrails and attribution — default budget plus provider-call cost by skill, task, and cache read/write bucket
  • JSON feature tracking (not Markdown) — Anthropic's pattern for machine-readable planning
  • Pre-completion checklists — OpenAI's golden-principles garbage collection ritual, scaled to top-3 fixes per week

What this kit does NOT claim:

  • Structural tests don't differentiate on happy-path 1-shot tasks. When seed code shows the pattern, Claude follows it — we measured 0/6 layer violations across bare and kit arms on our ts-layered fixture (5 consecutive null benches, May 2026).
  • The value is in long sessions, adversarial pressure, greenfield code, and weaker models — where pattern context drifts and shortcuts become tempting. Use the lint as a safety net, not as the reason you adopted the kit.

Installation

Option A: One-line install (recommended)

curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash

If the interactive prompt exits with "aborted by user" at the "Project name" step when run in a piped shell, rerun with defaults:

curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash -s -- --yes

Or run the initializer directly so the prompt owns the terminal input:

npx agent-harness-kit init

Upgrade existing installation:

curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash -s -- --upgrade

Option B: Scaffold into existing repo

npx agent-harness-kit init

Option C: Install as Claude Code plugin

/plugin marketplace add tuanle96/agent-harness-kit
/plugin install agent-harness-kit@agent-harness-kit-marketplace

What Ships

Skills (30)

Slash commands that codify production harness rituals:

| Command | Purpose |
| -------------------------------- | -------------------------------------------------------- |
| /add-feature <description> | Implement one item from .harness/feature_list.json |
| /add-adr | Add a numbered Architecture Decision Record |
| /benchmark-suite | Run Mini SWE-bench style harness regression tasks |
| /context-health | Inspect context usage, token budget, and compaction risk |
| /create-story | Create an acceptance-tested Story Packet |
| /debug-flow | Run the failing flow before fixing it |
| /deliver-html | Ship an analysis/audit/plan as a self-contained HTML |
| /doc-drift-scan | Find stale path/command references in docs/ |
| /eval-rubric-author | Add deterministic checks plus evidence-backed rubrics |
| /eval-runner | Regression-test the harness itself |
| /feature-intake | Classify new work before implementation |
| /garbage-collection | Friday cleanup (top-3 fixes only at solo scale) |
| /harness-improvement-loop | Turn trace-backed failures into measured harness changes |
| /i18n-add-locale <code> | Scaffold a new translation locale for skills + CLAUDE.md |
| /inspect-app | Boot dev server + drive the failing flow before edits |
| /inspect-module <path> | Map a module before editing |
| /map-domain | Render layer config + flag config-vs-filesystem drift |
| /middleware-pipeline | Use retry/cache/timeout/telemetry/budget middleware |
| /model-profile | Compare model profiles by pass rate, cost, and latency |
| /orchestrate | Select or run a multi-agent workflow pattern |
| /propose-harness-improvement | Convert an agent failure into a permanent prevention |
| /refactor-feature | Restructure .harness/feature_list.json with proof gate |
| /regression-benchmark | Run Tier 2 isolated and multi-session regression benchmarks |
| /review-this-pr | Deterministic diff review against the current base |
| /setup-nightly-eval | Enable the nightly eval GitHub Actions workflow |
| /skill-discovery | Index skills and load full instructions on demand |
| /structural-test-author | Codify a new architectural rule mechanically |
| /trace-analyzer | Classify eval/session failures from trace evidence |
| /verify-ui | Run browser validation with screenshots and network logs |
| /write-skill | Create a new SKILL.md with valid frontmatter |
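The /add-feature command works on one item from .harness/feature_list.json at a time. The README does not show that file's schema, so the sketch below assumes a hypothetical shape (id, description, passes) purely to illustrate why machine-readable tracking makes "pick the next item" a trivial, deterministic step.

```typescript
// Hypothetical shape: the real schema of .harness/feature_list.json is not
// documented in this README, so these field names are assumptions.
interface Feature {
  id: string;
  description: string;
  passes: boolean; // true once the feature's acceptance check has passed
}

// Return the first feature that has not passed yet, or undefined when all are done.
function nextPendingFeature(features: Feature[]): Feature | undefined {
  for (const feature of features) {
    if (!feature.passes) return feature;
  }
  return undefined;
}

// Example data standing in for a parsed feature_list.json.
const exampleFeatures: Feature[] = [
  { id: "F-001", description: "User can sign in", passes: true },
  { id: "F-002", description: "User can reset password", passes: false },
  { id: "F-003", description: "User can delete account", passes: false },
];

// The next item to hand to the agent is F-002 in this example.
const pendingFeature = nextPendingFeature(exampleFeatures);
```

Because the list is plain data, the same selection can run in a hook, a CI job, or a skill without asking the model to interpret prose.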

Review Subagents (9)

Read-only personas for second-opinion passes:

  • architecture-reviewer — layering, coupling, cohesion
  • adapter-compatibility-reviewer — adapter claims, render paths, tests
  • api-consistency-reviewer — naming, versioning, breaking changes
  • eval-rubric-reviewer — deterministic checks and evidence-backed rubrics
  • security-reviewer — OWASP Top 10, auth, secrets
  • reliability-reviewer — error handling, retries, observability
  • performance-reviewer — N+1 queries, caching, indexing
  • release-harness-reviewer — package, installer, npm, and release truth
  • trace-failure-analyst — eval, regression, hook, and session failure triage

Hooks (9 event groups)

  • SessionStart: Inject compact project context on startup/resume/compact.
  • UserPromptSubmit: Block prompt patterns that bypass harness safety.
  • PreToolUse: Guard risky Bash/edit operations and enforce per-skill permission policy before tools run.
  • Notification: Notify on blocking states.
  • PostToolUse: Run structural checks after edits and record skill telemetry.
  • PreCompact: Snapshot state before context compaction.
  • Stop: Pre-completion checklist with stop_hook_active loop guard.
  • SubagentStop: Re-check structural state after subagent work.
  • SessionEnd: Roll up session telemetry.
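The Stop entry above mentions a stop_hook_active loop guard. A minimal sketch of that decision logic follows, assuming Claude Code's Stop hook contract (a stop_hook_active flag on the input, and a "block" decision with a reason on the output). The failedChecks argument is a hypothetical stand-in for whatever the kit's checklist script actually computes.

```typescript
// Subset of the Stop hook input this sketch relies on.
interface StopHookInput {
  stop_hook_active?: boolean; // true when the agent is already continuing because of a Stop hook
}

// Subset of the hook output: an empty object lets the agent stop.
interface StopHookOutput {
  decision?: "block";
  reason?: string;
}

// failedChecks is hypothetical: a list of checklist items that did not pass.
function decideStop(input: StopHookInput, failedChecks: string[]): StopHookOutput {
  // Loop guard: if a previous Stop hook already forced a continuation,
  // allow the stop so the session cannot block itself forever.
  if (input.stop_hook_active) return {};

  // Nothing failed, so the agent may stop.
  if (failedChecks.length === 0) return {};

  // Otherwise ask the agent to keep working and say why.
  return {
    decision: "block",
    reason: "Pre-completion checklist failed: " + failedChecks.join("; "),
  };
}

// First stop attempt with a failing item is blocked; the retry is let through.
const firstAttempt = decideStop({ stop_hook_active: false }, ["tests not run"]);
const secondAttempt = decideStop({ stop_hook_active: true }, ["tests not run"]);
```

Without the guard, a checklist item that can never pass would keep the agent running indefinitely, which is exactly the cost risk the budget settings are meant to cap.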

Adapters (6)

  • TypeScript adapter: ts-morph + eslint-plugin-boundaries + dependency-cruiser
  • Python adapter: libcst + import-linter
  • Go adapter: go-parser structural checks + shared eval runner
  • Rust adapter: rust-lexer structural checks + shared eval runner
  • Swift adapter: swift-lexer structural checks + shared eval runner
  • Kotlin adapter: kotlin-lexer structural checks + shared eval runner

Ownership policy

User-owned files are never clobbered on init or upgrade: CLAUDE.md, AGENTS.md, .harness/docs/architecture.md, .harness/docs/core-beliefs.md, .harness/docs/golden-principles.md, .harness/docs/tech-debt-tracker.md, .harness/feature_list.json, .harness/config.json.

Projects can extend that protected set in .harness/config.json:

{
  "ownership": {
    "userOwnedFiles": [".harness/docs/local-runbook.md"],
    "generatedMutableFiles": [".harness/custom-state.json"]
  }
}
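One way to read the ownership policy is as a set-membership check that init and upgrade run before writing a file. The sketch below assumes the configured userOwnedFiles are simply added to the default list and matched by exact relative path; the kit may well use globs or normalization that this does not model.

```typescript
// Default protected paths, copied from the ownership policy list above.
const DEFAULT_USER_OWNED: string[] = [
  "CLAUDE.md",
  "AGENTS.md",
  ".harness/docs/architecture.md",
  ".harness/docs/core-beliefs.md",
  ".harness/docs/golden-principles.md",
  ".harness/docs/tech-debt-tracker.md",
  ".harness/feature_list.json",
  ".harness/config.json",
];

// Mirrors the "ownership" object in .harness/config.json.
interface OwnershipConfig {
  userOwnedFiles?: string[];
  generatedMutableFiles?: string[];
}

// Assumption: extra entries extend the defaults and match by exact relative path.
function isUserOwned(path: string, ownership: OwnershipConfig = {}): boolean {
  const protectedPaths = DEFAULT_USER_OWNED.concat(ownership.userOwnedFiles ?? []);
  return protectedPaths.indexOf(path) !== -1;
}

// Config taken from the example above.
const exampleOwnership: OwnershipConfig = {
  userOwnedFiles: [".harness/docs/local-runbook.md"],
  generatedMutableFiles: [".harness/custom-state.json"],
};

// The runbook is protected only because the project config adds it.
const runbookProtected = isUserOwned(".harness/docs/local-runbook.md", exampleOwnership);
```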

Eval Harness

Four dimensions: outcome / process / style / efficiency


Directory Structure

your-repo/
├── CLAUDE.md                          # 50–80 line table of contents
├── AGENTS.md                          # symlink → CLAUDE.md
├── .claude/
│   ├── settings.json
│   ├── skills/                        # 30 skills with SKILL.md + skill.json contracts
│   ├── agents/                        # 9 reviewer personas
│   └── hooks/hooks.json
├── .harness/
│   ├── config.json
│   ├── permissions.json               # per-skill tool allow/deny matrix
│   ├── skill-registry.json            # version/capability registry
│   ├── feature_list.json              # JSON, not Markdown — Anthropic pattern
│   ├── docs/
│   │   ├── architecture.md
│   │   ├── core-beliefs.md
│   │   ├── golden-principles.md
│   │   ├── telemetry-schema.md
│   │   ├── tech-debt-tracker.md
│   │   └── adr/
│   │       └── 0001-use-agent-harness-kit.md
│   ├── installed.json                 # kit lockfile (sha-tracked)
│   ├── PROGRESS.md                    # session log
│   ├── scripts/
│   │   ├── structural-test-on-edit.sh # PostToolUse hook target
│   │   ├── precompletion-checklist.sh # Stop hook target
│   │   ├── pretooluse-skill-permission-guard.mjs
│   │   ├── check-skill-contracts.mjs
│   │   ├── orchestration-schema-check.mjs
│   │   ├── session-replay.mjs
│   │   ├── cost-tracker.mjs
│   │   ├── dev-up.sh
│   │   ├── pre-push.sh
│   │   └── install-git-hooks.sh
│   └── structural-baseline.json       # existing-violation baseline
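The structural-baseline.json entry records violations that existed before the kit was adopted. A common way to use such a baseline is to fail only on violations that are not already in it. The sketch below shows that idea with a hypothetical Violation record; the file's real format is not shown in this README.

```typescript
// Hypothetical violation record, for illustration only.
interface Violation {
  file: string;
  rule: string;
}

// Build a stable key so two records for the same file and rule compare equal.
function violationKey(v: Violation): string {
  return v.file + "::" + v.rule;
}

// Return only the violations that are absent from the baseline.
function newViolations(current: Violation[], baseline: Violation[]): Violation[] {
  const known = baseline.map(violationKey);
  return current.filter((v) => known.indexOf(violationKey(v)) === -1);
}

// Example: one legacy violation is tolerated, one fresh violation is reported.
const baselineViolations: Violation[] = [
  { file: "src/ui/legacy.ts", rule: "ui-imports-repo" },
];
const currentViolations: Violation[] = [
  { file: "src/ui/legacy.ts", rule: "ui-imports-repo" },
  { file: "src/ui/cart.ts", rule: "ui-imports-repo" },
];
const freshViolations = newViolations(currentViolations, baselineViolations);
```

This lets an existing repo adopt the checks without fixing every old violation first, while still stopping new ones.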

Configuration (.harness/config.json)

{
  "version": "0.1.0",
  "language": "typescript",
  "framework": "nextjs",
  "preset": "nextjs",
  "domains": [
    {
      "name": "default",
      "root": "src",
      "layers": ["types", "config", "repo", "service", "runtime", "ui"]
    }
  ],
  "providers": ["auth", "telemetry", "feature-flags"],
  "ownership": {
    "userOwnedFiles": [],
    "generatedMutableFiles": []
  },
  "models": {
    "main": "claude-sonnet-4-6",
    "reviewers": "claude-sonnet-4-6",
    "explore": "claude-haiku-4-5"
  },
  "budgets": { "perRunUsd": 2.0, "perDayUsd": 10.0 }
}
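The ordered layers array is what a structural check needs to reason about import direction. The sketch below is one plausible reading, not the adapters' actual rule set: it assumes a layer may import from itself or from any earlier layer, and additionally forbids explicit pairs. The ui to repo pair is listed because this README names it as the violation it measured.

```typescript
// Layer order copied from the sample config above.
const LAYERS: string[] = ["types", "config", "repo", "service", "runtime", "ui"];

// Assumption: explicitly forbidden (from, to) pairs on top of the ordering rule.
const FORBIDDEN_PAIRS: Array<[string, string]> = [["ui", "repo"]];

// Decide whether code in layer `from` may import code in layer `to`.
function isImportAllowed(from: string, to: string, layers: string[] = LAYERS): boolean {
  const fromIndex = layers.indexOf(from);
  const toIndex = layers.indexOf(to);

  // Unknown layers fail closed so typos in config do not silently pass.
  if (fromIndex === -1 || toIndex === -1) return false;

  // A layer must not reach "up" into a layer declared after it.
  if (toIndex > fromIndex) return false;

  // Reject pairs that are forbidden even though the direction is fine.
  return !FORBIDDEN_PAIRS.some(([f, t]) => f === from && t === to);
}

// service may use repo, but repo must not depend on service.
const serviceUsesRepo = isImportAllowed("service", "repo");
const repoUsesService = isImportAllowed("repo", "service");
```

Keeping the rule a pure function of the config means the same check can back the post-edit hook and a CI job.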

Philosophy (5 Axioms)

1. CLAUDE.md is a table of contents, not an encyclopedia

HumanLayer measured ~150–200 instructions as the reliable cap; OpenAI's own root file is ~100 lines. This kit's CLAUDE.md is 50–80 lines.

2. Every agent failure becomes a permanent harness change

Mitchell Hashimoto's "engineer the harness" discipline. The /propose-harness-improvement skill enforces this.

3. Computational sensors as safety net

Fowler/Böckeler's architectural fitness functions. The TypeScript, Python, Go, Rust, Swift, and Kotlin adapters ship deterministic structural checks; LLM subagents are reserved for semantic judgment.

Note: In our 1-shot bench (n=3, ts-layered), the agent already followed visible seed patterns and produced 0 boundary violations without enforcement. Treat structural tests as a safety net for drift in long sessions, not as a happy-path differentiator.

4. Garbage collection over Friday cleanup, scaled to solo

OpenAI's golden-principles ritual, shrunk to top-3 fixes per week.

5. HTML for human deliverables, Markdown for agent files

  • Markdown is the right format for files an agent reads-and-edits (CLAUDE.md, SKILL.md, ADRs)
  • HTML is the right format for documents a HUMAN reads-and-decides (audit reports, analyses, plans, decision docs)

A long Markdown deliverable invites the human to scroll, miss the conclusion, and ask the agent to clarify — burning more tokens than the HTML markup costs. The /deliver-html skill writes self-contained HTML at repo root with a shared dark-theme CSS; the rule is documented in golden principle #11 and ADR-0002.


CLI Commands

agent-harness-kit init        # scaffold a repo (interactive)
agent-harness-kit init --yes  # accept all detected defaults
agent-harness-kit upgrade     # non-destructive upgrade, preserves user edits
agent-harness-kit doctor      # diagnose installed kit + Claude Code env
agent-harness-kit --version

Token / Cost Expectations

A typical day with the default model split (Sonnet 4.6 main + Haiku 4.5 explore + Sonnet 4.6 reviewers) stays under ~$2 of API traffic for a single developer.

The eval-runner skill enforces a per-run budget set in .harness/config.json.
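A per-run budget check can be as small as the sketch below. It reuses the perRunUsd and perDayUsd field names from the sample config; how the kit actually measures cost and at which point it rejects a run are assumptions here.

```typescript
// Mirrors the "budgets" object in .harness/config.json.
interface Budgets {
  perRunUsd: number;
  perDayUsd: number;
}

type BudgetVerdict = { ok: true } | { ok: false; reason: string };

// Assumption: a run is rejected when it alone exceeds the per-run cap, or when
// adding it to today's spend would exceed the per-day cap.
function checkBudget(budgets: Budgets, runCostUsd: number, spentTodayUsd: number): BudgetVerdict {
  if (runCostUsd > budgets.perRunUsd) {
    return { ok: false, reason: `run cost $${runCostUsd.toFixed(2)} exceeds perRunUsd` };
  }
  if (spentTodayUsd + runCostUsd > budgets.perDayUsd) {
    return { ok: false, reason: "daily total would exceed perDayUsd" };
  }
  return { ok: true };
}

// Values taken from the sample config.
const sampleBudgets: Budgets = { perRunUsd: 2.0, perDayUsd: 10.0 };

// A $1.50 run fits early in the day but not after $9.00 has been spent.
const earlyRun = checkBudget(sampleBudgets, 1.5, 3.0);
const lateRun = checkBudget(sampleBudgets, 1.5, 9.0);
```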

OpenAI's harness processed ~1 billion tokens per day with 7 engineers. At solo scale, you're looking at ~10-50M tokens/day depending on session intensity.


Support Matrix

| Stack | Adapter | Preset | Dev command | Status |
| ------------------------------ | ------------ | ----------- | -------------------------------------- | ------ |
| Next.js 14 + TypeScript | typescript | nextjs | npm run dev | v0.1 |
| Express | typescript | node-api | node ./src/server.js | v0.1 |
| Fastify | typescript | node-api | node ./src/server.js | v0.1 |
| NestJS | typescript | node-api | npm run start:dev | v0.1 |
| FastAPI | python | fastapi | uvicorn app.main:app --reload | v0.1 |
| Django | python | django | python manage.py runserver | v0.1 |
| Flask | python | flask | flask --app app run --debug | v0.1 |
| Go | go | none | go run ./cmd/... | v0.4 |
| Rust | rust | none | cargo run | v0.4 |
| Swift | swift | none | swift run | v0.7 |
| Kotlin | kotlin | none | ./gradlew run | v0.7 |


Dependency Footprint

Runtime dependencies are intentionally split by surface:

| Dependency | Why it is present | Impact if missing |
| ---------- | ----------------- | ----------------- |
| commander | CLI command routing (init, upgrade, doctor) | CLI cannot start |
| @inquirer/prompts | Interactive init/upgrade prompts | Interactive mode fails; --yes paths still avoid most prompts |
| @clack/prompts | Polyglot setup selector with cancel handling | Polyglot-root setup falls back poorly |
| react + ink | Rich polyglot onboarding renderer only, not the hot scaffold path | Smart setup loses the app map UI; core render/upgrade logic still does not depend on React state |
| handlebars | Template rendering | init/upgrade cannot render scaffold files |
| picocolors | CLI diagnostics | Output loses structured color but behavior is otherwise unchanged |

Optional peer dependencies are adapter tooling, not core runtime:

| Peer dependency | Used by | When missing |
| --------------- | ------- | ------------ |
| ts-morph | TypeScript structural runner | npm run harness:check fails with an explicit install message |
| eslint-plugin-boundaries | TypeScript ESLint defense-in-depth config | ESLint boundary config cannot run, but the ts-morph runner remains the primary gate |
| dependency-cruiser | Optional TypeScript dependency graph checks | Dependency-cruiser reports are unavailable; structural runner still enforces layer direction |

The TypeScript init path patches these peer tools into the target repo's devDependencies non-destructively.


CI: Real-Claude E2E Test (v0.7+)

The kit ships a CI job that spawns the real claude binary against a fresh init of itself and asserts that the SessionStart hook actually fires (with the expected additionalContext payload).

This catches the class of bug that v0.6's silent-no-op hooks fell into — every synthetic test passed for seven releases while not a single hook ever triggered inside a real Claude Code session.

The release gate also runs a real /orchestrate --run E2E against a freshly initialized kit. That path verifies fanout/fanin runtime output, schema validation, transcript capture, telemetry export, session replay, cost attribution, and cache read/write bucket closure.

Behavior:

  • Locally: npm test runs the real-Claude E2E case. The machine must have the claude binary installed and authenticated through either local Claude Code auth or ANTHROPIC_API_KEY.
  • CI: the normal test job runs non-Claude tests; the required e2e-claude job installs @anthropic-ai/claude-code globally and exercises one claude turn (~$0.01–0.05). Missing auth is a failed environment, not a skipped test.

For GitHub Actions, configure the ANTHROPIC_API_KEY repository secret before enabling the E2E job.

Local run (uses whatever auth the claude binary already has):

node scripts/e2e-claude-cli.mjs
node scripts/e2e-orchestrate-claude.mjs

Honest Expectations

What this kit DOES differentiate from bare claude-cli (anecdotal + design-level):

  • ✅ Opinionated CLAUDE.md template (50–80 lines) so context isn't blown on style
  • ✅ 30 skills that codify Hashimoto/OpenAI rituals
  • ✅ 9 read-only review subagents for cheap second-opinion passes
  • ✅ .harness/feature_list.json + ADR template + GC ritual for solo-scale planning hygiene
  • ✅ Solo-dev cost defaults (~$2/day) and per-run budget enforcement

What it does NOT measurably differentiate (5 consecutive null benches, May 2026):

  • Structural enforcement on happy-path 1-shot tasks. When seed code shows the layer pattern, claude-cli follows it — the boundaries lint has nothing to catch. We measured 0/6 ui→repo violations across bare and kit arms on the ts-layered fixture.

Where the structural test MIGHT still earn its keep (untested, listed for honesty, not as a claim):

  • Long multi-turn sessions where pattern context drifts
  • Adversarial "make it fast" pressure that tempts shortcuts
  • Greenfield code with no existing pattern to follow
  • Weaker model substrates (haiku, gpt-4o-mini)

Use the lint as a safety net, not as the reason you adopted the kit.


The Harness Engineering Trend (2025-2026)

Timeline:

  • August 2025: OpenAI's Frontier Product Exploration team starts the 1M-LOC experiment
  • February 2026: OpenAI publishes "Harness engineering: leveraging Codex in an agent-first world"
  • February 2026: Mitchell Hashimoto publishes "My AI Adoption Journey" coining "engineer the harness" as Step 5
  • March 2026: LangChain demonstrates +13.7pp Terminal-Bench improvement via harness changes alone
  • Q1 2026: Anthropic, Stripe, and other teams publish details about their agent harnesses

Key Insight:

"The work moved from writing code to building infrastructure that makes agents reliable at scale." — OpenAI

Industry Adoption:

Within 90 days of Hashimoto's post, "harness engineering" became the standard term for the infrastructure layer around AI agents. Teams at Anthropic, OpenAI, LangChain, Stripe, and others published their patterns.

Why It Matters for Solo Devs:

You don't need a 7-person team or a billion-token-per-day budget to benefit. The patterns scale down:

  • CLAUDE.md as table of contents (not encyclopedia) — Anthropic pattern
  • JSON feature tracking — machine-readable planning
  • Garbage collection ritual — top-3 fixes per week instead of enterprise-scale cleanup
  • Review subagents — cheap second opinions without human bottlenecks
  • Structural tests — safety net for long sessions

This kit is those patterns, packaged for hobby-project scale.


License

MIT


Contributing

Issues and PRs welcome at github.com/tuanle96/agent-harness-kit

Found a bug? Open an issue.
Have a pattern from your own harness? Submit a PR with the skill or hook.
Want to add a language adapter? Check docs/adding-an-adapter.md and the existing TypeScript/Python/Go/Rust/Swift/Kotlin adapters.