@color4pen/specrunner

v0.3.3

Published

4 days ago

AI CI/CD runner — submit request.md, get a PR back

0High
0Medium
0Low

color4pen

SpecRunner

request.md in, pull request out — a self-hosted AI CI/CD runner powered by Anthropic Claude.

Verdicts are derived, not self-reported. Review agents return findings; the CLI derives approved / needs-fix from them, verifies that every referenced file:line actually exists, and owns all loop budgets and transitions. Agents are never asked to judge their own work.
State lives in your repository, not in a process. Job history is branch-borne, decisions live on GitHub issues, knowledge is committed files. Kill the process, reboot the machine, close the laptop — the next scheduled run picks up exactly where things stood.
Runs anywhere Node runs. npm install -D @color4pen/specrunner and one crontab line. No daemon, no Docker, no SaaS contract, no IDE switch.

The reasoning behind these choices is in docs/design-philosophy.md.

Built by itself

Every feature here was implemented, reviewed, and merged by this pipeline running unattended on its own repository — including declarative reviewer activation, the post-review regression gate, and step output contracts.

Stability

SpecRunner is 0.x. While it is used in production for this project's own development, the state and config file formats may receive breaking changes between any two releases.

Migrations are provided when formats change, but they ship in minor releases — not majors. Upgrade notes are included in each release changelog.

How the Pipeline Works

SpecRunner reads a request.md file and drives a multi-step pipeline that produces a GitHub PR.

Happy path

request-review — validates the request; escalates if the request is unclear or rejected
design — creates the branch and generates specification files
spec-review — reviews the spec; loops with spec-fixer until approved
test-case-gen — generates test case definitions from the approved spec
implementer — writes the implementation
verification — runs build / typecheck / test / lint; loops with build-fixer until passed
code-review — reviews the implementation; loops with code-fixer until approved
conformance — checks architecture conformance; returns to implementer if fixes are needed
adr-gen — generates an ADR when request.adr is true; passes through otherwise
pr-create — opens the GitHub PR

Judge loops and escalation

Each judge step (spec-review, code-review) returns either approved or needs-fix. A needs-fix verdict routes to the paired fixer step and back to the judge, repeating until the judge approves or the iteration budget is exhausted.

verification works the same way: a failed result routes to build-fixer, then back to verification. A conformance needs-fix returns execution to implementer (full impl-phase re-entry).

Escalation is not a failure. It means an agent reached a point that requires human judgment — an ambiguous request, unresolved findings, or a build it cannot repair. When a job escalates, its state is preserved and can be resumed:

specrunner job resume <slug>

request-review is the front gate: needs-discussion and reject verdicts escalate immediately without looping, signalling that the request needs human clarification before the pipeline can proceed.

Installation

# As a dev dependency (recommended for project use)
npm install -D @color4pen/specrunner

# Or globally
npm install -g @color4pen/specrunner

Provider SDKs (@anthropic-ai/claude-agent-sdk for the local runtime, @openai/codex-sdk for Codex) ship as optional dependencies and install by default, so a standard install runs out of the box. To slim the install (skip the unused provider's SDK), install with --omit=optional and add only the SDK for the provider you use:

npm install -D --omit=optional @color4pen/specrunner
npm install -D @anthropic-ai/claude-agent-sdk   # Claude (local runtime, default)
# or
npm install -D @openai/codex-sdk                # Codex

If a required provider SDK is missing at run time, specrunner stops the step with the exact install command.

Quick Start

# 1. Initialize config scaffold + project directories
npx specrunner init

# 2. Authenticate with GitHub
npx specrunner login

# 3. Create a new request from template
npx specrunner request new my-feature

# 4. Edit the generated request file
#    specrunner/drafts/my-feature/request.md

# 5. Start the pipeline
npx specrunner run my-feature

# 6. Archive when awaiting-archive (merge + archive in one step)
npx specrunner job archive --with-merge my-feature

Failure / resume flow

npx specrunner job ls                    # Find the failed job
npx specrunner job resume my-feature     # Resume from last checkpoint

Environment Variables

| Variable | Required | Description | |---|---|---| | SPECRUNNER_API_KEY | Managed runtime only | Anthropic API key. Not needed for local runtime. | | GH_TOKEN | See GitHub Authentication | GitHub token (highest priority). Used for automation contexts (cron, CI). Overrides GITHUB_TOKEN and stored credentials. | | GITHUB_TOKEN | See GitHub Authentication | GitHub token (second priority). Automatically injected by GitHub Actions. |

GitHub Authentication

SpecRunner resolves a GitHub token in order of priority: GH_TOKEN env → GITHUB_TOKEN env → gh auth token (gh CLI) → credentials.json.

Three authentication paths are supported:

| Path | Context | Token type | Setup | |---|---|---|---| | specrunner login | Interactive (device flow) | User access token (ghu_) | Run specrunner login; token is stored in ~/.config/specrunner/credentials.json. | | GitHub Actions | Unattended CI | Installation token (GITHUB_TOKEN) | Set env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} in your workflow step. Token is injected automatically per run with scoped permissions. | | Self-hosted server / cron | Unattended automation | Fine-grained PAT | Create a fine-grained PAT in GitHub Settings with the minimum required repository permissions. Set GH_TOKEN=<pat> in the environment. Note: fine-grained PATs expire after at most 1 year and must be rotated. |

Automation contexts (cron, CI, always-on schedulers) cannot run device flow and typically cannot reach the interactive keychain. Use the GH_TOKEN env var path for these contexts — it is independent of specrunner login.

Run specrunner doctor to see which source is currently resolved.

Command Reference

Request commands (static document operations)

specrunner request new <slug>              Create request.md from template
specrunner request generate "<text>"       Generate request.md via LLM
specrunner request ls                      List active requests
specrunner request validate <file|slug>    Validate request.md syntax (static, no LLM)
specrunner request template                Print scaffold template to stdout

See docs/request-authoring.md for how to write effective requests (premise verification, granularity, splitting).

Job commands (stateful execution)

specrunner job start <request-slug|file>   Start pipeline, issue jobId
specrunner job ls                          List all jobs
specrunner job show <jobId|slug>           Show job state details
specrunner job cancel <jobId>              Cancel job and cleanup
specrunner job resume <slug>               Resume a halted job
specrunner job archive <slug>              Archive change folder, teardown worktree, update status

Environment commands

specrunner init                            Initialize config scaffold
specrunner login                           GitHub Device Flow OAuth
specrunner doctor                          Diagnose environment / config / auth
specrunner runtime setup                   Set up Anthropic Managed Agents (managed runtime)
specrunner runtime status                  Show managed runtime status
specrunner runtime reset                   Reset managed runtime config

Inbox commands (GitHub issue automation)

specrunner inbox run                       Poll approved issues, start / resume jobs

Extension commands (rules / custom reviewers)

specrunner rules new <step> <slug>         Scaffold a rules file (extra discipline injected into a step)
specrunner reviewers new <name>            Scaffold a custom reviewer definition

Aliases

specrunner run <slug|file>                 Alias for: job start <slug|file>

Inbox — Automated Issue-to-Job Routing

specrunner inbox run polls your GitHub repository for issues with the approval label (default: specrunner-approved) and:

Starts new jobs from unlinked issues whose body is a valid request.md
Resumes jobs in awaiting-resume state when a qualifying /resume comment is found
Rejects issues whose body fails request.md validation (posts a comment with the error)

Idempotency

Each run is safe to call any number of times without side-effects:

Issue linkage: once a job has been started for an issue, that issue is skipped on every subsequent run regardless of job status
Resume gating: only /resume comments posted strictly after the escalation marker (the comment SpecRunner posts when a job escalates) are considered; earlier comments and bot-generated comments are ignored

Approval label workflow

Create a GitHub issue whose body follows the request.md format (see specrunner request template)
Apply the approval label (specrunner-approved by default) to the issue
On the next inbox run, SpecRunner writes the issue body as a draft request.md and starts the pipeline

To stop an issue from being processed, remove the label before the next run.

`/resume` workflow

When a job escalates and requires human input, SpecRunner posts an escalation comment on the linked issue. To resume:

Read the escalation comment to understand what decision is needed

Post a new issue comment starting with /resume followed by your instructions:

/resume Use option B. Skip the cache layer and go with the simpler approach.

On the next inbox run, SpecRunner resumes the job with your prompt as context

Only comments by users with OWNER, MEMBER, or COLLABORATOR association on the repository are accepted.

Scheduling

Run inbox run on a schedule so issues are processed automatically.

cron (Linux/macOS)

*/5 * * * * cd /path/to/repo && npx specrunner inbox run --quiet

launchd (macOS)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.yourteam.specrunner-inbox</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/npx</string>
    <string>specrunner</string>
    <string>inbox</string>
    <string>run</string>
  </array>
  <key>WorkingDirectory</key>
  <string>/path/to/repo</string>
  <key>StartInterval</key>
  <integer>300</integer>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>

GitHub Actions

Three complementary triggers cover the common automation patterns. The concurrency group prevents overlapping runs.

GITHUB_TOKEN is injected automatically by GitHub Actions for each run (see GitHub Authentication), so no manual secret configuration is needed.

name: SpecRunner Inbox

on:
  schedule:
    - cron: "*/10 * * * *"         # poll every 10 minutes
  issues:
    types: [labeled]               # fire immediately when label is applied
  issue_comment:
    types: [created]               # fire immediately on new comments

concurrency:
  group: specrunner-inbox
  cancel-in-progress: false        # let in-flight runs finish; queue the next

jobs:
  inbox-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run inbox
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: npx specrunner inbox run

For the issues.labeled trigger you may add a filter so the workflow only fires on the approval label:

  issues:
    types: [labeled]
# in the job:
    if: github.event.label.name == 'specrunner-approved'

Trust boundary

Issue bodies and /resume comment text are passed directly to agent prompts.

The approval label is the entry gate: only issues a repository member explicitly labels are processed
/resume comments are gated to OWNER, MEMBER, and COLLABORATOR associations; external contributors cannot inject prompts

Running inbox run on repositories with untrusted issue content is not recommended. A repository member with label-apply permission could craft a malicious issue body.

See docs/operations.md for the full unattended-loop runbook: the three authentication layers required for cron (API token / git transport / agent runtime), crontab pitfalls, failure-resilience behavior, and diagnostics.

Extending the Review Chain

The review side of the pipeline is extensible without code changes:

Rules (specrunner/rules/<step>/*.md) add extra discipline to an existing step's prompt. Cheap (no extra session), but shares the step's convergence loop.
Custom reviewers (specrunner/reviewers/<name>.md) add an independent review lens with its own convergence loop, budget, and optional model override. Declared as data (purpose / criteria / judgment sections in markdown), validated at job start, and run serially after code-review. Activation can be scoped declaratively with paths globs and requestTypes so a lens only runs when the diff touches its domain.
When custom reviewers are present, a regression gate runs automatically after the chain: it re-checks every finding that was reported and fixed during review against the final code, catching fixes that were silently undone by later changes.

Scaffold a definition with specrunner reviewers new <name> — the template documents the format.

The data-extensible surface is the review chain. The pipeline's shape itself — which steps exist and in what order — is code, not configuration: changing it means changing SpecRunner.

Where SpecRunner Sits

Nearby tools solve adjacent problems at different layers:

| Layer | What it manages | Tools | |---|---|---| | Context & task management | Shared knowledge bases and task lists for interactive coding agents | Archon | | Spec authoring frameworks | How to write specs and hand them to an agent | GitHub Spec Kit, BMAD | | Execution pipeline | The unattended run from spec to verified PR — judge loops, budgets, escalation, state | SpecRunner, GitHub Copilot coding agent |

SpecRunner is the execution layer: it assumes a spec (request.md) and owns everything between it and an open PR. Compared with platform-bound agents, it is self-hosted, model-configurable per step, and leaves an auditable state trail inside your repository.

Configuration

User global config

SpecRunner stores its configuration at ~/.config/specrunner/config.json (XDG_CONFIG_HOME). Run specrunner init to create the initial scaffold.

Project local config (per-repo override)

Place a partial config at <repo-root>/.specrunner/config.json to override settings for a specific repository. The project local config is deep-merged on top of the user global config — you only need to specify the fields you want to change.

specrunner init configures .gitignore with .specrunner/* + !.specrunner/config.json, so config.json can be committed and shared with your team while machine-generated state (jobs/, logs/, etc.) stays ignored.

// <repo-root>/.specrunner/config.json
{
  "version": 1,
  "steps": {
    "defaults": { "model": "claude-sonnet-4-6" },
    "design": {
      "byRequestType": {
        "spec-change": { "model": "claude-opus-4-6[1m]" },
        "new-feature": { "model": "claude-opus-4-6[1m]" }
      }
    }
  }
}

This example uses opus for design on spec-change / new-feature requests and sonnet for everything else (from user global).

Inbox configuration

// <repo-root>/.specrunner/config.json
{
  "inbox": {
    "approveLabel": "specrunner-approved", // label to poll for; default: "specrunner-approved"
    "maxStartsPerRun": 3                   // max new jobs started per run; 0 = resume-only; default: 3
  }
}

| Key | Default | Description | |---|---|---| | inbox.approveLabel | "specrunner-approved" | GitHub label name that marks an issue as ready to start | | inbox.maxStartsPerRun | 3 | Maximum number of new jobs started in one inbox run invocation. 0 disables new starts (resume-only mode). |

byRequestType — per-request-type model selection

Each step config supports a byRequestType object to select a different model based on the request type:

{
  "steps": {
    "code-review": {
      "model": "claude-sonnet-4-6",
      "byRequestType": {
        "spec-change": { "model": "claude-opus-4-6[1m]" }
      }
    }
  }
}

Resolution order (first defined wins):

steps.<step>.byRequestType.<requestType>.<field>
steps.<step>.<field>
steps.defaults.byRequestType.<requestType>.<field>
steps.defaults.<field>
Step hardcoded default
SDK default

Note: under the managed runtime, model / byRequestType.model are ignored — managed agents use their pre-registered model. These fields are effective only under the local runtime.

Runtime Modes

Local runtime (default)

Runs agents locally via the Claude Agent SDK. No additional API key needed beyond the GitHub token.

specrunner init
specrunner login
specrunner job start my-feature

Managed runtime (Anthropic Managed Agents)

Runs agents in Anthropic's cloud. Requires SPECRUNNER_API_KEY (Anthropic API key).

specrunner init
specrunner login
export SPECRUNNER_API_KEY=sk-ant-...
specrunner runtime setup
specrunner job start my-feature

Cost

The model used by each pipeline step is configurable (see Configuration). Actual cost depends on request complexity, the number of fixer iterations, and the model selected; cache reads dominate token volume, so applying the cache-read discount is essential for accurate projection. Measured figures from this project's own runs are in docs/cost.md.

Assumptions & Supported Scope

Trust model

request.md is treated as trusted input. SpecRunner is designed for solo use where the person who writes request.md also reviews and merges the resulting PR. Feeding request.md files authored by untrusted third parties is outside the supported use case.

Verification gate coverage

By default (no verification.commands set), the verification step detects and runs the build, typecheck, test, and lint scripts from your package.json. Node.js / Bun projects are the primary supported target for this default mode. If no matching scripts are found and verification.commands is also unset, the verification gate is a no-op and code quality relies entirely on the review agents' judgment.

For projects in other languages (Python, Go, Rust, etc.), set verification.commands in your project config to run arbitrary verification commands:

// .specrunner/config.json
{
  "verification": {
    "commands": ["ruff check", { "name": "type", "run": "mypy" }, "pytest -v"]
  }
}

Test file placement

By default (no tests.placement set), the implementer agent follows the existing test placement pattern it finds in the project. This works well for projects where the LLM reliably infers the convention, but can produce "dead" test files in projects with strict include patterns (e.g. pnpm monorepos with vitest configured to only pick up files under a specific directory).

Set tests.placement in your project config to declare the convention explicitly:

// .specrunner/config.json — sibling (test next to source)
{
  "tests": {
    "placement": {
      "style": "sibling"
      // optional: "suffix": ".spec.ts"   (default: ".test.ts")
    }
  }
}

// .specrunner/config.json — mirror (tests/ tree mirrors src/)
{
  "tests": {
    "placement": {
      "style": "mirror",
      "testsRoot": "tests",
      "sourceRoot": "src"
      // optional: "suffix": ".spec.ts"   (default: ".test.ts")
    }
  }
}

| style | Placement rule | Example | |-------|---------------|---------| | sibling | Same directory as the source file | src/foo/bar.ts → src/foo/bar.test.ts | | mirror | Under testsRoot/, stripping sourceRoot/ prefix | src/foo/bar.ts → tests/foo/bar.test.ts |

When sourceRoot is omitted from a mirror config, the full source path is preserved under testsRoot/ (e.g. src/foo/bar.ts → tests/src/foo/bar.test.ts).

Commit history trust

In repositories with external contributors, git log and git diff output is included in agent prompts. Running SpecRunner on repositories with untrusted commit history is not recommended, as malicious content in commit messages or diff output could influence agent behavior.

Troubleshooting

Lint failure in verification pipeline

If bun run lint (or a custom lint command in verification.commands) fails during verification:

Run auto-fix to resolve mechanical issues automatically:
```
bun run lint --fix
```
Review remaining warnings manually — these require human judgment (e.g. intentional any usage, complex control flow).
Prefix intentionally unused variables with _ to suppress no-unused-vars warnings (e.g. _unused).
Re-run bun run lint to confirm 0 warnings / 0 errors before committing.

Silent exit (process exits without error)

If specrunner run or specrunner job resume exits unexpectedly without error output:

Enable pipeline diagnostic logging:

SPECRUNNER_DEBUG=pipeline specrunner run <request>

Check which boundary log point was the last one emitted — this identifies where the event loop exited prematurely.
The job state will have been transitioned to awaiting-resume by the exit guard. Run specrunner job resume <slug> to continue.