@marcusbarcelos/spec-agent

v0.1.4

Published

19 days ago

Stop your coding agent from shipping broken code — a verification gate, project rules, and durable learning for Claude Code, Copilot, Cursor, and Codex. Installs via npx; runs alongside your agent.

Downloads

493

0High
0Medium
0Low

marcusbarcelos

ai-agent agent-harness llm claude copilot codex cursor vendor-agnostic scaffold governance context-economy skill-forge

spec-agent

Stop letting your coding agent ship broken code. spec-agent adds verification gates, project rules, and durable learning to Claude Code, Copilot, Cursor, and Codex — without replacing your agent. Install via npx.

It's not another prompt framework. It's the governance layer that's left when the model is already capable: a deterministic gate that blocks what the model can't see, project rules it actually applies, and a learning loop that turns recurring mistakes into durable, reusable knowledge — running inside the agent you already use.

spec-agent is the quality contract between your repository and your coding agent. The gate verifies the contract, skill-forge learns its rules, council calibrates risky calls — and your agent can't say done until the contract holds. An agent without a gate is an overconfident junior; spec-agent is the automatic tech lead that says "No. This doesn't pass."

See it work

The agent says "done." The gate says "not yet." It's a deterministic checkpoint between your agent finishing and the code landing — not a prompt.

agent finishes a change
        │
        ▼
  spec-agent gate  ──  lint · typecheck · tests · project invariants
        │
   ┌────┴─────┐
 pass        fail
   │           │
 ✓ lands   error handed back to the agent ─→ agent fixes ─→ gate re-runs

A concrete before/after — your project's invariant is "each whitespace becomes one dash":

  // the agent's final diff — looks fine, ships wrong:
- const slug = name.trim().replace(/\s+/g, "-")   // "a  b" → "a-b"  ✗ collapses runs

  // spec-agent gate:
  //   ✗ FAILED — slug: whitespace invariant — finish blocked, error returned to the agent

  // the agent re-runs and corrects it:
+ const slug = name.trim().replace(/\s/g, "-")    // "a  b" → "a--b"  ✓
  //   ✓ gate passed — change can land

Without spec-agent, that first diff ships. The agent thought it was done; the gate disagreed — and the agent fixed it before you ever saw the PR.

▶ Run it yourself in 30 seconds: examples/idempotency-demo/ — a real domain contract (a commission ledger that must be idempotent), the gate blocking the bug, the fix, the gate passing.

Quickstart

# scaffold .spec/ + adapters into the current repo
npx @marcusbarcelos/spec-agent init --id my-project --agents claude,agents-md

# re-project adapters when the engine evolves (never touches your durable state)
npx @marcusbarcelos/spec-agent sync

# add an agent later (re-projects everything, preserves your durable state)
npx @marcusbarcelos/spec-agent sync --agents copilot

# run the gate yourself (CI / PR) — verdict + non-zero exit if blocked
npx @marcusbarcelos/spec-agent verify

Requires Node ≥ 20.

CI / pull requests

spec-agent verify runs your project's gate and prints a verdict that reads like a code review — green in CI, blocking on a PR:

# .github/workflows/spec-agent.yml
- run: npx @marcusbarcelos/spec-agent verify

SPEC-AGENT VERDICT: BLOCKED

  ✗ domain contract  node --test
      idempotency invariant failed: duplicate ledger entry for sale-1

Blocked: 1 check(s) failed. Fix and re-run `spec-agent verify`.

The checks live in .spec/manifest.yaml (checks: [{ name, cmd }]) — your tests, lint, typecheck, or any command. This is where spec-agent stops being just an assistant and becomes the guardian of the repo.

What's inside

| Mechanism | What it does | Priority | |---|---|---| | verification gate | A deterministic Stop-hook that blocks "done" while lint/typecheck/tests fail on what was touched. Catches the error the model can't see in itself. | core — clearest ROI | | project learning & skill-forge | Turns a recurring project mistake into an imperative, reusable rule the model actually applies — where, without it, it knows the rule and breaks it anyway. | the long-term differentiator | | context-economy | Token discipline in prompt, tool input, and output. A code graph instead of re-reading files. | core | | agent-council | Multi-perspective review reserved for high-risk, ambiguous calls (DB migrations, contract breaks, security, architecture). Not for everyday turns. | advanced |

skill-forge — durable learning, in action

  // without the learned rule — the model even cites it, then breaks it:
- const d = dto.expirationDate ?? dto.expiration_date   // defensive fallback, violates the project rule

  // with the skill in context:
+ const d = dto.expirationDate                          // fix the type at the source, no fallback

What we measured

A small, reproducible, tamper-isolated benchmark (the agent never sees the checkers). It's a method plus a first signal, not proof — small N, stated openly.

The gate is the concrete win. It recovered an objective failure the model shipped — targeted tasks went 80% → 100% via the fix-loop. Prompt rules alone moved ~0 on a capable model.
Durable knowledge changes behavior. A learned project rule flipped a wrong answer to right — the model knew the rule and violated it without the skill.
Council's niche is calibration. On ambiguous-but-sound trade-offs it never false-blocked (0/4), where a single pass did. Real, but narrow — hence "advanced."

With a capable base model, the differential shows at the margins — in the objective gate and durable project knowledge, not in expensive orchestration by default.

Full method, numbers, and caveats: RESULTS · SKILLFORGE-RESULTS · COUNCIL-RESULTS. Small N — indicative, not statistical proof.

thin + sync model

The engine (rules + governance skills) ships inside the package. Your repo only holds:

the adapters per agent (CLAUDE.md, AGENTS.md, .github/copilot-instructions.md) — marked GENERATED, regenerable;
the engine's skills in the agent's native location (.claude/skills/) where supported;
your project's durable state in .spec/ (manifest.yaml, learning/, skills/).

sync re-projects the adapters from the updated engine and never touches .spec/learning/ or .spec/skills/ — the knowledge you accumulate is yours.

your-repo/
├─ CLAUDE.md                       # GENERATED adapter (Claude Code)
├─ AGENTS.md                       # GENERATED adapter (Codex / generic)
├─ .github/copilot-instructions.md # GENERATED adapter (Copilot)
├─ .claude/skills/                 # engine skills, in the agent's native location
└─ .spec/                          # YOUR durable state (sync never touches)
   ├─ manifest.yaml
   ├─ learning/                    # lessons, pitfalls, patterns, glossary, _pending/
   └─ skills/                      # skills generated by skill-forge

Works with your agent (honest loss-model)

Full harness on Claude Code; on other agents it runs degraded-but-functional, with the gaps written down in the manifest's loss_report.

| capability | Claude Code | other agents | |---|---|---| | verification gate | Stop hook | git pre-commit / CI | | durable learning | native skills + memory | .spec/learning/ | | multi-agent (council) | native subagents | single-thread simulation | | code graph | graphify CLI | graphify CLI |

Agent-specific enhancers (claude-mem, superpowers, rtk, graphify) are optional — never dependencies.

Status

v1 — init / sync (Claude Code / AGENTS.md / Copilot) + reproducible benchmark + landing page. See examples/hello-project/ for a scaffolded tree.

Roadmap — Cursor / Codex / Gemini adapters; auto-promotion of _pending → learning; proof on larger real repos (TS + tests, Python + mypy/pytest).

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme