evalgate

v3.2.0

Published

21 days ago

Eval-gated todos for agentic coding. Your agents can't tick the checkbox until the verifier passes.

0High
0Medium
0Low

jorgejac

claude-code codex mcp ai-agents agentic-workflow todo evals verifier coding-agent

evalgate

Eval-gated todos for agentic coding. Your agents can't tick the checkbox until the verifier passes.

Multi-agent coding is getting real. Claude Code subagents, Codex, Octogent, OpenHarness — they all let you run more agents in parallel. The problem nobody has solved is that more agents means more plausible-looking output that's actually wrong. The completion signal today is "the agent said it was done," which scales terribly.

evalgate fixes that with one primitive: a todo item can't be flipped to [x] until its attached verifier exits 0.

Zero runtime dependencies. Plain markdown. Plugs into any agent or CI pipeline.

30-second demo

Your todo.md:

- [ ] Implement add(a, b)
  - eval: `npm run test:add`
  - retries: 2

- [ ] Implement subtract(a, b)
  - eval: `npm run test:subtract`
  - retries: 3

Run evalgate check:

evalgate · checking 2 contracts in todo.md

  ▸ Implement add(a, b) (implement-add) ... ✓ passed (412ms)
  ▸ Implement subtract(a, b) (implement-subtract) ... ✗ failed (exit 1, 388ms)
    │ subtract
    │   ✖ expected 2, got 8
    │   at file:///.../subtract.test.js:6:10

Summary: 1 passed, 1 failed

add flips to [x]. subtract stays [ ] — and the failure output is right there for the agent to read and retry.

Why this matters

Today's agent orchestrators give you more workers. evalgate gives those workers a contract: each todo is a unit of work with a built-in quality gate. That unlocks:

Auto-retries with context. The agent sees the verifier output and fixes the root cause instead of re-asking the human.
Honest progress bars. [x] means the tests actually passed.
Safe parallelism. Spawn 8 workers on 8 todos; only the ones that actually work commit their checkboxes.
Budget enforcement per task. Each contract declares its token budget; evalgate tracks spend and can report overruns.
24/7 autonomous operation. Trigger contracts on a schedule, file change, or webhook — no human needed to kick things off.

Install

# Global CLI
npm install -g evalgate

# Or clone + link for development
git clone https://github.com/jorgejac1/evalgate
cd evalgate
pnpm install && pnpm build && npm link

Quick start

cd examples/basic
npm install
evalgate list          # show contracts + verifiers
evalgate check         # run pending verifiers, flip checkboxes

Contract format

Contracts live in any markdown file (convention: todo.md). A contract is a GFM task-list item with indented sub-bullet fields:

- [ ] Task title
  - eval: `shell command`
  - eval.all: `cmd1` | `cmd2`
  - eval.any: `cmd1` | `cmd2`
  - eval.llm: judge prompt as plain text
  - eval.diff: src/file.ts contains "new pattern"
  - eval.http: http://localhost:3000/health
  - eval.http.status: 200
  - eval.http.contains: "healthy"
  - eval.http.timeout: 5000
  - eval.schema: output.json {"type":"object","required":["id"]}
  - retries: 3
  - retry-if: exit-code > 1
  - budget: 50k
  - id: stable-slug
  - on: schedule: "0 * * * *"
  - on: watch: "src/**/*.ts"
  - on: webhook: "/deploy-done"

Field reference

| Field | | ----------- | | eval | eval.all | yes* | eval.any | yes* | eval.llm | yes* | eval.diff | eval.http | eval.http.status | eval.http.contains | eval.http.timeout | eval.schema | retries | retry-if | budget | id | provider | no | role | no | mcp | on Required | Description | -------- | ----------- | | yes* | Shell command. Exit 0 = pass, anything else = fail. | | Pipe-separated commands — all must exit 0. | | Pipe-separated commands — any one must exit 0. | | Natural-language prompt judged by Claude. Answers PASS or FAIL. | | yes* | Assert a structural change in a file: contains, not contains, deleted, created, changed. Zero deps. | | yes* | HTTP health check — GET request; passes if status matches (default 200). Uses built-in fetch (Node 18+). | | no | Expected HTTP status code. Defaults to 200. | | no | Substring that must appear in the response body. | | no | Request timeout in ms. Defaults to 10000. | | yes* | JSON schema validator — <file> <inline-json-schema>. Checks type, required[], properties[].type. Zero deps. | | no | Max retry attempts hint for orchestrators. | | no | Only consume a retry slot when condition is true: exit-code <op> <n>. Operators: == != > < >= <=. | | no | Token budget: 50k, 1.5m, or raw integer. | | no | Stable slug for references and logs. Defaults to slugified title. | | Preferred model: opus, sonnet, or haiku. | | Agent role hint: coordinator, worker, or linter. | | no | Comma-separated list of MCP servers this contract may use. | | no | Trigger: schedule: "<cron>", watch: "<glob>", or webhook: "<path>". |

* Exactly one eval variant is required for a gated contract. Items without any eval are ungated — they behave like normal checkboxes.

Verifier variants

Shell — a single command:

- [ ] Tests pass
  - eval: `npm test`

Composite all — every step must exit 0:

- [ ] Build and lint pass
  - eval.all: `npm run build` | `npm run lint` | `npm test`

Composite any — at least one step must exit 0:

- [ ] At least one mirror is up
  - eval.any: `curl -f https://mirror-1/health` | `curl -f https://mirror-2/health`

LLM judge — Claude evaluates the output; useful for prose, API contracts, or anything hard to assert mechanically:

- [ ] README explains the feature clearly
  - eval.llm: Does the README at ./README.md explain the auth flow in plain English?

Requires ANTHROPIC_API_KEY. Defaults to claude-haiku-4-5-20251001.

Semantic diff — assert that a specific structural change appeared in a file. Passes if the pattern matches the diff; fails if the file is unchanged or the pattern is absent. Zero external dependencies:

- [ ] Add rate-limit header to responses
  - eval.diff: src/middleware.ts contains "X-RateLimit-Remaining"

- [ ] Remove legacy auth module
  - eval.diff: src/auth-legacy.ts deleted

Supported assertions: contains "<text>", not contains "<text>", deleted, created, changed.

HTTP verifier — issues a GET request and checks status code + optional body substring. Uses Node's built-in fetch (requires Node 18+). Zero external dependencies:

- [ ] Service is healthy
  - eval.http: http://localhost:3000/health

- [ ] API returns correct status
  - eval.http: http://localhost:3000/api/status
  - eval.http.status: 201
  - eval.http.contains: "ready"
  - eval.http.timeout: 5000

| Sub-field | Description | | --------- | ----------- | | eval.http: <url> | GET request to <url>. Passes if status matches (default 200). | | eval.http.status: <n> | Expected HTTP status code. | | eval.http.contains: <text> | Response body must contain this substring. | | eval.http.timeout: <ms> | Request timeout in ms (default 10000). |

Schema verifier — reads a JSON file and validates its structure against an inline schema. Zero external dependencies:

- [ ] Agent produced valid output
  - eval.schema: dist/output.json {"type":"object","required":["id","score"]}

- [ ] Response is an array
  - eval.schema: response.json {"type":"array"}

- [ ] Fields have correct types
  - eval.schema: result.json {"type":"object","properties":{"id":{"type":"string"},"score":{"type":"number"}}}

Inline schema supports: type (object, array, string, number, boolean), required (array of field names), properties with per-field type checks.

Conditional retry — only consume a retry slot when the exit-code condition is met. Useful when certain exit codes indicate a transient failure (worth retrying) vs a permanent one (not worth retrying):

- [ ] Flaky network test — retry only on timeout (exit code 2)
  - eval: `./scripts/smoke-test.sh`
  - retries: 3
  - retry-if: exit-code == 2

- [ ] Build — retry on any non-zero exit code
  - eval: `npm run build`
  - retries: 2
  - retry-if: exit-code != 0

Supported operators: == != > < >= <=. Without retry-if, every failure unconditionally consumes a retry slot (existing behaviour).

Trigger variants

# Run on a cron schedule
- on: schedule: "*/30 * * * *"

# Re-check when source files change
- on: watch: "src/**/*.ts"

# Fire when a webhook hits the daemon
- on: webhook: "/deploy-done"

CLI reference

| Command | Description | | ------- | ----------- | | evalgate check [path] | Run verifiers on all pending contracts. | | evalgate list [path] | List all contracts with status and verifier. | | evalgate retry <id> [path] | Rerun a single contract, injecting last failure as context. | | evalgate log [path] | Show run history. Flags: --contract=<id>, --failed, --limit=N. | | evalgate msg send <from> <to> <kind> [payload-json] [path] | Send a structured message between agents. | | evalgate msg list [path] | List messages. Flags: --to=<agent>, --kind=<kind>. | | evalgate serve [cwd] | Start the MCP server on stdio. | | evalgate watch [path] | Start the trigger daemon (schedule / watch / webhook). | | evalgate ui [path] [--port=N] | Launch web dashboard at localhost:7777. | | evalgate dash [path] | ANSI terminal dashboard — live contract status. | | evalgate budget [path] | Show token spend vs budget per contract. | | evalgate budget <id> <tokens> [path] | Record token usage for a contract. | | evalgate suggest "<title>" [path] | Find similar past completions for a new contract. | | evalgate patterns [path] | Analyse failure patterns across all contracts. | | evalgate export [path] [--format=json\|md] | Export full project snapshot. | | evalgate diff <snap1.json> <snap2.json> [--format=text\|json\|md] | Compare two snapshots. | | evalgate swarm [path] [--concurrency=N] [--resume] [--agent=cmd] | Spawn parallel agent workers. | | evalgate swarm status [path] | Show last swarm run status. |

Swarm Cockpit

evalgate swarm spawns parallel agent workers — one per pending contract — each in its own git worktree. Workers implement their contract independently, then evalgate runs the verifier in the worktree. Only workers whose verifier passes get merged back.

# Run swarm with up to 3 parallel workers (default)
evalgate swarm todo.md

# Custom concurrency
evalgate swarm todo.md --concurrency=5

# Resume a previous run (skip already-done workers)
evalgate swarm todo.md --resume

# Use a custom agent command
evalgate swarm todo.md --agent="claude --model opus"

Output during a swarm run:

evalgate swarm · todo.md · concurrency 3

  ✓ Implement add(a, b) (implement-add) done
  ✗ Implement subtract(a, b) (implement-subtract) failed
  ✓ Add TypeScript types (add-typescript-types) done

Swarm summary: 2 merged, 1 failed, 0 skipped

Checking swarm status

After a run, inspect each worker's outcome:

evalgate swarm status todo.md

evalgate swarm status · swarm-a1b2c3d4 (2024-01-15 10:32:00)

✓ Implement add(a, b)  done (implement-add)
  duration: 8420ms
  verifier: passed  log: .evalgate/swarm/logs/implement-add.log

✗ Implement subtract(a, b)  failed (implement-subtract)
  duration: 12300ms
  verifier: failed  log: .evalgate/swarm/logs/implement-subtract.log

Retrying a failed worker

# Retry a single failed contract (shows last failure output first)
evalgate retry implement-subtract todo.md

The retry command pulls the last failure output from the durable run log and displays it before re-running the verifier, giving the agent concrete context to fix the root cause.

Web UI

# Browser dashboard at localhost:7777
evalgate ui todo.md

# Custom port
evalgate ui todo.md --port=8080

The web UI shows live contract status (auto-refreshing via SSE), run history, failure output per contract, and budget gauges. It serves from Node's built-in http module — no framework, no external dependencies.

MCP server

evalgate serve exposes 15 tools over stdio MCP. Any MCP client (Claude Desktop, Cursor, Windsurf) can invoke contracts as tools without touching the CLI.

Add to Claude Desktop

In ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "evalgate": {
      "command": "evalgate",
      "args": ["serve", "/path/to/your/project"]
    }
  }
}

Named workspaces (v0.11+)

Expose multiple todo.md files as named workspaces — each tool accepts an optional workspace parameter to select which file to operate on:

{
  "mcpServers": {
    "evalgate": {
      "command": "evalgate",
      "args": ["serve", "/path/to/project",
        "--workspace", "auth=/path/to/auth/todo.md",
        "--workspace", "payments=/path/to/payments/todo.md"
      ]
    }
  }
}

Or programmatically:

import { startMcpServer } from "evalgate";

startMcpServer(process.cwd(), {
  workspaces: {
    auth:     "/project/.conductor/tracks/auth/todo.md",
    payments: "/project/.conductor/tracks/payments/todo.md",
  },
});

Call list_workspaces from any MCP client to enumerate configured workspaces. All 15 tools accept workspace: "<name>" to target a specific file.

MCP tools

| Tool | Description | | ---- | ----------- | | list_workspaces | List all configured named workspaces. | | list_all | All contracts with status. | | list_pending | Contracts not yet passing. | | list_triggers | Contracts with on: triggers and their next fire time. | | run_eval | Run a single contract by id. | | check_all | Run all pending contracts. | | get_retry_context | Get last failure output formatted as a retry prompt. | | get_run_history | Full run log, filterable by contract or status. | | get_last_failure | Last failure details for a contract. | | send_message | Send a structured agent message. | | list_messages | List messages, filterable by recipient or kind. | | get_provider_hints | Provider and role hints for all contracts. | | report_token_usage | Record token spend for a contract. | | suggest_template | Find similar past completions for a new task. | | get_patterns | Failure pattern analysis across all contracts. | | export_state | Full project snapshot as JSON or markdown. |

Claude Code integration

Wire the provided PostToolUse hook to auto-check todo.md whenever Claude Code edits it:

chmod +x hooks/claude-code-posttooluse.sh

Add to ~/.claude/settings.json (or .claude/settings.json in your repo):

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write|MultiEdit",
        "hooks": [
          { "type": "command", "command": "/abs/path/to/hooks/claude-code-posttooluse.sh" }
        ]
      }
    ]
  }
}

Now whenever Claude Code edits todo.md, evalgate check runs automatically. If a verifier fails, the hook exits 2 and Claude Code feeds the failure output back into the agent's next turn.

Triggers

Start the trigger daemon with evalgate watch. It handles three trigger kinds:

Schedule — cron expression, runs the contract verifier at the specified interval:

- [ ] Sync exchange rates
  - eval: `node scripts/sync-rates.js`
  - on: schedule: "0 * * * *"

Watch — glob pattern, re-checks when matching files change:

- [ ] Auth tests must pass
  - eval: `pnpm test src/auth`
  - on: watch: "src/auth/**"

Webhook — HTTP endpoint, fires when a POST hits the daemon:

- [ ] Deploy smoke test
  - eval: `./scripts/smoke-test.sh staging`
  - on: webhook: "/deploy-done"

The webhook server listens on localhost:7778 by default.

Agent messaging

Agents in a multi-worker setup can exchange structured messages through evalgate's message bus. Messages persist to .evalgate/messages.ndjson.

# Coordinator tells worker-2 the schema is ready
evalgate msg send coordinator worker-2 schema_ready '{"table":"users"}'

# Worker lists its inbox
evalgate msg list --to=worker-2

# Or via MCP
send_message({ from: "coordinator", to: "worker-2", kind: "schema_ready", payload: {...} })
list_messages({ to: "worker-2" })

Message envelopes carry from, to, kind, payload, and a correlation_id for tracing request/response pairs across turns.

Budget tracking

Declare a token budget on any contract. Workers report their spend; evalgate tracks cumulative usage and warns on overrun.

- [ ] Generate migration SQL
  - eval: `pnpm db:validate`
  - budget: 50k
  - id: gen-migration

# Worker reports spend after its turn
evalgate budget gen-migration 12400

# See all contracts vs budget
evalgate budget
#   gen-migration   12,400 / 50,000   24%
#   auth-refactor   61,200 / 50,000   122%  ⚠ over budget

The report_token_usage MCP tool lets agents self-report without shelling out.

Terminal dashboard

# ANSI live dashboard in the terminal
evalgate dash todo.md

The ANSI dashboard is useful inside tmux or when you want a heads-up display without leaving the terminal. It refreshes on every run event via the durable log.

Memory and learning

evalgate indexes successful contract completions so future contracts can get a head start:

# Find templates similar to a new task
evalgate suggest "migrate users table to Postgres"
#   85% match: migrate_products_table — passed in 1 attempt
#   72% match: migrate_orders_table   — passed in 2 attempts, retries=3

# See failure patterns across all contracts
evalgate patterns
#   Most common failure: exit 1 in test:subtract — 8 occurrences
#   Fastest to fix after failure: lint contracts — avg 1.2 retries

The suggest_template and get_patterns MCP tools expose the same data to agents directly, so they can self-tune without human intervention.

Persistence

All state lives in .evalgate/ at the project root:

.evalgate/
  runs.ndjson           — full run history (contract id, exit code, output, duration)
  messages.ndjson       — agent message log
  budget.ndjson         — token spend per contract
  swarm-state.json      — last swarm run state
  swarm/logs/           — per-worker agent session logs

NDJSON format — human-readable, grep-friendly, no database required.

Swarm hooks (v2.3+)

runSwarm accepts optional inline callbacks alongside the existing swarmEvents emitter. Hooks fire per-worker and are useful for orchestrators that need inline reactions without setting up a global listener.

import { runSwarm, estimateUsd } from "evalgate";
import type { BudgetExceededEvent } from "evalgate";

const controller = new AbortController();

await runSwarm({
  todoPath: ".evalgate/todo.md",
  concurrency: 4,

  // Called when a worker starts; worker.id / worker.contractTitle are populated.
  onWorkerStart(worker) {
    console.log(`→ started ${worker.contractTitle}`);
  },

  // Called when a worker reaches a terminal state (done / failed / timeout).
  onWorkerComplete(worker, { status, failureKind }) {
    console.log(`← ${status} ${worker.contractTitle}${failureKind ? ` (${failureKind})` : ""}`);
  },

  // Called when a contract's cumulative spend crosses its declared budget.
  // Abort the controller here to stop spawning new workers.
  onBudgetExceeded(evt: BudgetExceededEvent) {
    console.warn(`budget exceeded: ${evt.totalTokens} tok / $${evt.estimatedUsd.toFixed(4)}`);
    controller.abort();
  },

  // Pass an AbortSignal to stop new-worker spawning without killing in-flight workers.
  // In-flight workers run to completion (or their agentTimeoutMs).
  signal: controller.signal,
});

resumeSwarm is a thin convenience wrapper that picks up pending workers from a prior run:

import { resumeSwarm } from "evalgate";

// stateFile is the path to .evalgate/swarm-state.json
await resumeSwarm(".evalgate/swarm-state.json", { concurrency: 2 });

estimateUsd replaces hardcoded pricing math in downstream consumers:

import { estimateUsd } from "evalgate";

const usd = estimateUsd(inputTokens, outputTokens);            // default: sonnet4 rates
const usd2 = estimateUsd(inputTokens, outputTokens, "opus4");  // opus4 rates
const usd3 = estimateUsd(inputTokens, outputTokens, "haiku4"); // haiku4 rates

Programmatic API

import { parseTodo, runContract, updateTodo } from "evalgate";
import { readFileSync, writeFileSync } from "node:fs";

const src = readFileSync("todo.md", "utf8");
const contracts = parseTodo(src);

const results = [];
for (const c of contracts.filter((c) => !c.checked && c.verifier)) {
  results.push(await runContract(c, process.cwd()));
}

writeFileSync("todo.md", updateTodo(src, results));

Full export surface:

// Core
import { parseTodo, runContract, runShell, updateTodo } from "evalgate";

// Swarm orchestration
import { runSwarm, resumeSwarm, retryWorker, loadState, swarmEvents } from "evalgate";
import type { SwarmOptions, SwarmResult, SwarmState, SwarmEvent, BudgetExceededEvent } from "evalgate";

// Persistence
import { appendRun, queryRuns, getLastFailure, getLastRun, onRun } from "evalgate";
import { sendMessage, listMessages } from "evalgate";
import { reportTokenUsage, queryBudgetRecords, getTotalTokens, getBudgetSummary, estimateUsd } from "evalgate";

// Memory + analysis
import { suggest, detectPatterns, exportSnapshot, snapshotToMarkdown, diffSnapshots, diffToMarkdown } from "evalgate";

// Servers
import { startMcpServer, startUiServer, startWatcher, startDash } from "evalgate";
import type { McpServerOptions } from "evalgate";

// Cron helpers
import { parseCron, matchesCron, nextFireMs } from "evalgate";

Other integrations

Codex / any CLI / git hooks

evalgate check is provider-agnostic. Wire it anywhere a shell command runs:

# post-commit hook inside a worktree
evalgate check todo.md || echo "Contracts failed — review before merging."

CI (GitHub Actions)

- name: Check evalgate contracts
  run: npx evalgate check

Roadmap

| Version | Feature | Status | | ------- | ------- | ------ | | v0.1 | Parser, shell verifier, CLI (check, list), Claude Code hook | Shipped | | v0.2 | MCP server (15 tools), evalgate serve | Shipped | | v0.3 | Triggers (schedule, watch, webhook), evalgate watch | Shipped | | v0.4 | Durable run log, structured agent messaging, evalgate retry | Shipped | | v0.5 | Web UI (evalgate ui), ANSI dashboard (evalgate dash) | Shipped | | v0.6 | Budget tracking, provider/role hints, MCP-scoped contracts | Shipped | | v0.7 | Memory/learning (suggest, patterns), export, failure analysis | Shipped | | v0.8 | Composite verifiers (eval.all, eval.any), LLM-judge (eval.llm), diff, GitHub Actions CI, Biome linter | Shipped | | v0.9 | Swarm orchestrator — parallel workers, git worktrees, verifier-gated merge | Shipped | | v0.10 | Export swarm/worktree/spawn APIs for orchestrator consumers, retryWorker with failure-context injection | Shipped | | v0.11 | MCP named workspaces — expose multiple todo.md files as a single MCP server with workspace routing | Shipped | | v0.12 | Structured swarm events — "eval-result", "cost", "task-complete" typed events on swarmEvents; SwarmEvent discriminated union exported | Shipped | | v0.13 | Re-check watch mode — evalgate check --watch re-runs failing contracts on file change; TDD inner loop for agents | Shipped | | v0.14 | Semantic-diff verifier — eval.diff kind: assert a structural change happened in a file (pattern/hash-based, zero deps) | Shipped | | v1.0 | API stability declaration — stable public surface, VERSION export, coordinated with conductor v1.0. Agent-agnostic context injection (taskContext on SpawnOpts/SwarmOptions), {task} placeholder in agentArgs for non-Claude CLIs, concurrent merge fix (mutex serializes commit+merge to eliminate todo.md conflicts at any concurrency) | Shipped | | v2.0 | Repo-level merge mutex, -X theirs conflict resolution, DiffVerifier, VERSION re-export | Shipped | | v2.1 | FailureKind typed errors (worktree-create, agent-crash, agent-timeout, verifier-fail, verifier-timeout, merge-conflict), failureKind on WorkerState + TaskCompleteEvent, agent timeout (agentTimeoutMs on SpawnOpts), verifier timeouts (shell + LLM + composite aggregate timeoutMs), "worker-start" + "worker-retry" events on swarmEvents | Shipped | | v2.2 | Eval type expansion — eval.http: (HTTP health check, uses built-in fetch), eval.schema: (JSON schema validation), conditional retry (retry-if: exit-code > 1) | Shipped | | v2.3 | Plugin hooks + Resume API — onWorkerStart / onWorkerComplete / onBudgetExceeded hooks in SwarmOptions; AbortSignal support (signal option stops new-worker spawning cleanly); resumeSwarm(stateFile) public API; estimateUsd(input, output, model?) replaces hardcoded pricing math; log pagination (offset, from, to on queryRuns) | Shipped |

v1.0 Stability

As of v1.0.0 the public API surface exported from evalgate (src/index.ts) is stable and follows semantic versioning:

Breaking changes require a major version bump
New exports are added in minor releases
Bug fixes ship as patches

Stable public exports: runSwarm, resumeSwarm, retryWorker, swarmEvents, estimateUsd, parseTodo, runContract, runShell, budget.*, log.*, telegram.*, startMcpServer, startCheckWatch, parseCron, matchesCron, nextFireMs, worktree.*, and all exported types from types.ts

Also exported: VERSION — the current package version as a string, useful for downstream consumers that want to display or validate the evalgate version.

import { VERSION } from "evalgate";
console.log(VERSION); // "1.0.0"

Prior art and positioning

conductor — multi-agent orchestrator built on top of evalgate. If you want track-scoped parallel workers with a web dashboard and CLI, use conductor. evalgate is its quality gate engine.
Octogent / OpenHarness — orchestrate multiple Claude Code sessions. evalgate slots underneath them as the quality-gate layer they're missing.
promptfoo / braintrust — eval LLM outputs at the prompt level. evalgate evals agent-produced changes at the task level.
Claude Code native subagents — invisible spawning, no quality gate. evalgate contracts are plain markdown you can read, edit, and version.

conductor           ← orchestrator built on evalgate (tracks, UI, retry)
    ↓
evalgate            ← primitive, no deps, quality gate layer
    ↑
Octogent            ← orchestrator, Claude Code only, no quality gate
OpenHarness         ← full harness, multi-provider

Contributing

PRs welcome. Keep the zero-dependency constraint — it's a hard rule, not a preference. New verifier kinds belong in types.ts first, then verifier.ts, then a test. The parser is the most critical file; edge cases matter more than features.

Development setup

pnpm install       # install dev deps (typescript, tsx, biome)
pnpm build         # compile TypeScript → dist/
pnpm test          # run all unit tests
pnpm typecheck     # TypeScript strict check, no emit
pnpm lint          # biome lint check (src/ and test/)
pnpm lint:fix      # auto-fix lint issues

A pre-commit hook runs typecheck + lint automatically on every commit. If it blocks, run pnpm lint:fix to auto-fix, then re-stage.

Adding a verifier kind

Add the interface to src/types.ts and union it into Verifier
Handle it in src/parser.ts (buildVerifier)
Handle it in src/verifier.ts (runContract)
Add tests in test/parser.test.ts and a new test/<kind>.test.ts
Export from src/index.ts

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

evalgate

30-second demo

Why this matters

Install

Quick start

Contract format

Field reference

Verifier variants

Trigger variants

CLI reference

Swarm Cockpit

Checking swarm status

Retrying a failed worker

Web UI

MCP server

Add to Claude Desktop

Named workspaces (v0.11+)

MCP tools

Claude Code integration

Triggers

Agent messaging

Budget tracking

Terminal dashboard

Memory and learning

Persistence

Swarm hooks (v2.3+)

Programmatic API

Other integrations

Codex / any CLI / git hooks

CI (GitHub Actions)

Roadmap

v1.0 Stability

Prior art and positioning

Contributing

Development setup

Adding a verifier kind

License