review-orchestra

v0.1.2

Published

17 days ago

Multi-model automated code review orchestration for Claude Code

ckailash

code-review claude codex orchestration

review-orchestra

Multi-model code review orchestration for Claude Code. Runs multiple AI reviewers (Claude + Codex by default) in parallel, consolidates findings, and presents them to the user. The orchestrator Claude fixes code directly with user guidance in a supervised loop.

How it works

          +-------------------+
          |  Detect Scope     |  git diff / branch diff / PR diff
          +--------+----------+
                   |
          +--------v----------+
          |  Parallel Review   |  Claude + Codex (headless, concurrent)
          +--------+----------+
                   |
          +--------v----------+
          |  Consolidate       |  Dedup, classify confidence × impact, compute P-level
          +--------+----------+
                   |
          +--------v----------+
          |  Return findings   |  ReviewResult JSON → skill → user
          +-------------------+

Scope detection — auto-detects uncommitted changes, branch diff vs main, or open PR diff.
Parallel review — launches all configured reviewers as headless CLI processes concurrently.
Consolidation — deduplicates findings across reviewers, classifies each on two axes (confidence × impact), computes a P-level (P0–P3), tags pre-existing issues outside the diff, and compares findings against previous rounds.
Return findings — the CLI returns a ReviewResult JSON on stdout. The skill presents findings to the user, who decides what to fix.

After receiving findings, the orchestrator Claude (who wrote the code and has full context) fixes issues directly with user guidance. The user controls what gets fixed and when to re-review.

Installation

Prerequisites

Node.js >= 22
Claude Code CLI (claude)
Codex CLI (codex) — optional, can be disabled

From npm (recommended)

npm install -g review-orchestra
review-orchestra setup

From source (contributors)

git clone https://github.com/ckailash/review-orchestra.git
cd review-orchestra
npm install
npm run build
review-orchestra setup

review-orchestra setup validates your environment (Node version, required CLIs), creates the skill symlink for Claude Code, and configures .gitignore. Run review-orchestra doctor anytime to diagnose issues.

Usage

Invoke via the /review-orchestra skill in Claude Code. Arguments are natural language — no flags.

/review-orchestra                                  # auto-detect scope, all defaults
/review-orchestra src/auth/ src/api/               # only review these paths
/review-orchestra fix quality issues too            # extend threshold to P2
/review-orchestra only use claude                   # single reviewer
/review-orchestra skip codex                       # disable a specific reviewer

Defaults (when no arguments are provided):

Auto-detect diff scope
Both Claude + Codex reviewers
Stop-at threshold: P1 (all critical + functional issues recommended for fixing)

CLI subcommands

review-orchestra review                            # run reviewers + consolidate, return findings (default)
review-orchestra review src/services/              # with scope args
review-orchestra review only claude                # natural language filtering
review-orchestra stale                             # check if worktree changed since last review (exit 0=fresh, 1=stale, 2=no session)
review-orchestra reset                             # clear session state
review-orchestra setup                             # first-time install + repair broken install
review-orchestra doctor                            # diagnose issues without modifying anything

Running with no subcommand (just scope args) is equivalent to review-orchestra review.
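
A script that wraps the stale subcommand only needs to branch on the three documented exit codes. The sketch below shows one way to do that; the helper names interpretStaleExit and shouldReview are hypothetical and not part of the package, and the "re-review unless fresh" policy is an assumption.

```typescript
// Status labels for the exit codes documented for `review-orchestra stale`.
type StaleStatus = "fresh" | "stale" | "no-session";

// Hypothetical helper: map the documented exit code to a status label.
function interpretStaleExit(code: number): StaleStatus {
  switch (code) {
    case 0:
      return "fresh";
    case 1:
      return "stale";
    case 2:
      return "no-session";
    default:
      throw new Error(`unexpected exit code from review-orchestra stale: ${code}`);
  }
}

// Assumed policy: review again when the worktree changed or nothing was reviewed yet.
function shouldReview(status: StaleStatus): boolean {
  return status !== "fresh";
}
```

Treating unknown exit codes as errors, rather than as "stale", keeps a crashed CLI from silently triggering a new review.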

Supervised workflow

The CLI runs review + consolidation and returns a ReviewResult JSON. The skill then enters the supervised loop:

Review — run review-orchestra review (spawns reviewers, consolidates, returns findings)
Present — skill shows findings grouped by severity with progressive disclosure
User decides — user selects which findings to fix (by ID, severity band, or "all")
Confirm — skill echoes back planned actions, waits for confirmation
Fix — orchestrator Claude fixes code directly using Edit/Write tools
Re-review — if requested, run review-orchestra review again (back to step 1)
Done — summarize session, suggest next action (commit, push, PR)

The user controls the loop at every step. Escalation is implicit: the user sees all findings and decides.
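
The "User decides" step above accepts a selection by ID, by severity band, or "all". A minimal sketch of that filtering follows, assuming each finding carries an id and a P-level and that a severity band means a single P-level; the Finding shape, the Selection type, and selectFindings are illustrative names, not the skill's actual implementation.

```typescript
type PLevel = "P0" | "P1" | "P2" | "P3";

// Assumed minimal finding shape for illustration only.
interface Finding {
  id: string;
  pLevel: PLevel;
  title: string;
}

// A selection is "all", one P-level band, or an explicit list of finding IDs.
type Selection =
  | { kind: "all" }
  | { kind: "band"; pLevel: PLevel }
  | { kind: "ids"; ids: string[] };

// Return the findings the user asked to fix, preserving their original order.
function selectFindings(findings: Finding[], selection: Selection): Finding[] {
  switch (selection.kind) {
    case "all":
      return findings;
    case "band":
      return findings.filter((f) => f.pLevel === selection.pLevel);
    case "ids": {
      const wanted = new Set(selection.ids);
      return findings.filter((f) => wanted.has(f.id));
    }
  }
}
```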

Session artifacts

All session data lives in .review-orchestra/ in the project root. Reviewer outputs are persisted as flat files (no per-round subdirectories) so debug evidence survives parse failures and crashes:

.review-orchestra/
├── session.json                      # Session state (sessionId, status, scope, rounds[], reviews, consolidated, reviewerErrors)
├── state.lock                        # PID file for concurrent-run prevention
├── progress.json                     # Live reviewer status during a run (deleted on round complete)
├── round-1-claude-raw.txt            # Raw stdout from each reviewer, written before parsing
├── round-1-codex-raw.txt             # On parse failure, renamed to *.raw.txt; on Codex failure, renamed to *.failed
└── round-2-claude-raw.txt

Apart from the transient progress.json, nothing in this directory is auto-deleted. Remove it with rm -rf .review-orchestra/ or review-orchestra reset when you are done. setup adds .review-orchestra/ to .gitignore automatically.

Concurrent runs of review-orchestra review in the same project are blocked by state.lock (atomic-rename release protocol with PID re-check) so two terminals don't trample each other's session.
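
Because the raw outputs are flat files named round-N-<name>-raw.txt, tooling that inspects the directory can recover the round and reviewer from the file name alone. The sketch below assumes only that naming pattern; both helper names are hypothetical, and the renamed failure variants are deliberately not handled.

```typescript
// Build the flat file name used for a reviewer's raw stdout in a given round.
function rawOutputFileName(round: number, reviewer: string): string {
  return `round-${round}-${reviewer}-raw.txt`;
}

interface RawOutputRef {
  round: number;
  reviewer: string;
}

// Parse a file name back into its parts. Returns undefined for any other
// file in the directory, such as session.json or state.lock.
function parseRawOutputFileName(fileName: string): RawOutputRef | undefined {
  const match = /^round-(\d+)-(.+)-raw\.txt$/.exec(fileName);
  if (match === null) return undefined;
  const round = match[1];
  const reviewer = match[2];
  if (round === undefined || reviewer === undefined) return undefined;
  return { round: Number(round), reviewer };
}
```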

Session state (`session.json`)

{
  "sessionId": "20260315-143022",
  "status": "active",
  "scope": { "type": "branch", "baseBranch": "main", "description": "branch feat/auth vs main" },
  "currentRound": 2,
  "worktreeHash": "abc123",
  "rounds": [
    {
      "number": 1,
      "phase": "complete",
      "worktreeHash": "def456",
      "reviews": { "claude": { "findings": [], "metadata": { } }, "codex": { } },
      "consolidated": [],
      "reviewerErrors": [],
      "findingsPersisted": true,
      "startedAt": "2026-03-15T14:30:22Z",
      "completedAt": "2026-03-15T14:32:11Z"
    }
  ],
  "startedAt": "2026-03-15T14:30:22Z",
  "completedAt": null
}

Key fields:

sessionId — timestamp-based unique ID (e.g. 20260315-143022)
status — active, expired (scope base changed, requires reset), or completed
scope — diff scope detected at session creation (with pathFilters if any)
currentRound — most recent round number
worktreeHash — SHA-256 over HEAD + staged + unstaged + untracked files; per-round snapshot for stale detection
rounds[] — per-round artifacts: phase, reviews (per-reviewer findings + metadata), consolidated, reviewerErrors (preserved across crash recovery), findingsPersisted, timestamps
startedAt — session creation timestamp
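
For scripts that read session.json, the fields above can be modeled roughly as follows. These shapes are inferred from the example and the field list, not copied from src/types.ts, so the real definitions may differ. The staleness check is likewise an assumption about how the per-round worktreeHash snapshots are used.

```typescript
// Shapes inferred from the session.json example; the package's own types may differ.
type SessionStatus = "active" | "expired" | "completed";

interface RoundSummary {
  number: number;
  phase: string;
  worktreeHash: string;
  startedAt: string;
  completedAt: string | null;
}

interface SessionSummary {
  sessionId: string;
  status: SessionStatus;
  currentRound: number;
  worktreeHash: string;
  rounds: RoundSummary[];
}

// Return the most recent round whose phase is "complete", if any.
function latestCompletedRound(session: SessionSummary): RoundSummary | undefined {
  const completed = session.rounds.filter((r) => r.phase === "complete");
  return completed[completed.length - 1];
}

// Assumed check: the worktree has changed when its current hash differs from
// the hash captured for the last completed round. With no completed round
// there is nothing to compare against, so this returns false.
function hasWorktreeChanged(session: SessionSummary, currentHash: string): boolean {
  const last = latestCompletedRound(session);
  return last !== undefined && last.worktreeHash !== currentHash;
}
```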

Session lifecycle:

review-orchestra review → creates or continues session, runs reviewers + consolidation, returns findings
review-orchestra reset → clears the session (equivalent to rm -rf .review-orchestra/)
The session auto-expires (status becomes expired) if the scope base changes (e.g., new commits on main). This surfaces as a stale-session error, and the user must reset and start fresh
Detached HEAD: baseBranch becomes detached@<sha7> (not the literal "HEAD") so cross-round comparison stays stable

Crash recovery: if a round was interrupted mid-reviewing or mid-consolidating, the next review invocation resumes from the persisted phase. Saved reviewer findings and reviewer errors carry forward.
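
One way to picture the resume decision is as a small function over the persisted phase. The sketch below is an assumption about how such logic could be structured: the phase strings other than "complete", the finishedReviewers field, and planResume are all hypothetical, and rerunning only the reviewers without saved output is one plausible reading of "carry forward".

```typescript
// "complete" appears in session.json; the other two phase names are assumed.
type RoundPhase = "reviewing" | "consolidating" | "complete";
type Step = "run-reviewers" | "consolidate" | "start-new-round";

interface PersistedRound {
  phase: RoundPhase;
  // Reviewers whose findings or errors were already saved before the crash.
  finishedReviewers: string[];
}

interface ResumePlan {
  step: Step;
  reviewersToRun: string[];
}

// Decide where the next `review` invocation should pick up.
function planResume(round: PersistedRound | undefined, configured: string[]): ResumePlan {
  if (round === undefined || round.phase === "complete") {
    return { step: "start-new-round", reviewersToRun: configured };
  }
  if (round.phase === "consolidating") {
    // All reviewer output is already on disk; only consolidation remains.
    return { step: "consolidate", reviewersToRun: [] };
  }
  const done = new Set(round.finishedReviewers);
  return { step: "run-reviewers", reviewersToRun: configured.filter((n) => !done.has(n)) };
}
```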

Configuration

Configuration lives in config/default.json:

{
  "reviewers": {
    "claude": {
      "enabled": true,
      "command": "claude -p - --allowedTools \"Read,Grep,Glob,Bash\" --output-format json",
      "outputFormat": "json"
    },
    "codex": {
      "enabled": true,
      "command": "codex exec - --output-last-message {outputFile} --json",
      "outputFormat": "json"
    }
  },
  "thresholds": {
    "stopAt": "p1"
  }
}

Thresholds

| Setting | Default | Description |
|---|---|---|
| stopAt | p1 | Suggests which findings to fix. p0 = critical only, p1 = critical + functional, p2 = + quality, p3 = fix everything. The user controls the loop and decides when to stop. |
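
The stopAt setting amounts to an ordered cutoff over P-levels. As a rough sketch (isRecommended is a hypothetical name, and the preExisting flag reflects the rule that pre-existing findings are excluded from recommendations):

```typescript
type PLevel = "p0" | "p1" | "p2" | "p3";

// P-levels from most to least urgent.
const P_ORDER: PLevel[] = ["p0", "p1", "p2", "p3"];

// A finding is recommended for fixing when it is at or above the stopAt
// cutoff and is not a pre-existing issue outside the diff.
function isRecommended(pLevel: PLevel, stopAt: PLevel, preExisting: boolean = false): boolean {
  if (preExisting) return false;
  return P_ORDER.indexOf(pLevel) <= P_ORDER.indexOf(stopAt);
}
```

With the default stopAt of p1, P0 and P1 findings are recommended while P2 and P3 are shown but not suggested.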

Finding comparison

Cross-round finding comparison uses LLM-based semantic matching by default (via Claude Haiku). This handles renamed files, shifted line numbers, and reworded descriptions across review rounds. On LLM failure or timeout, it falls back to heuristic matching (file + title.toLowerCase()). Configure via findingComparison in config/default.json:

| Setting | Default | Description |
|---|---|---|
| method | "llm" | Comparison method: "llm" for semantic matching, "heuristic" for file+title matching |
| model | "claude-haiku-4-5" | Model to use for LLM comparison |
| timeoutMs | 60000 | Timeout for LLM comparison call (ms) |
| fallback | "heuristic" | Fallback method when LLM fails |
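
The heuristic fallback is simple enough to sketch. The version below keys each finding on file plus lowercased title and buckets findings into new, persisting, and resolved. It illustrates the idea and is not the code in src/finding-comparison.ts; the Finding shape and function names are assumptions.

```typescript
// Assumed minimal finding shape for illustration only.
interface Finding {
  file: string;
  title: string;
}

interface RoundComparison {
  newFindings: Finding[];
  persisting: Finding[];
  resolved: Finding[];
}

// Heuristic identity: same file and same title, ignoring case.
function matchKey(f: Finding): string {
  return JSON.stringify([f.file, f.title.toLowerCase()]);
}

// Compare the previous round's findings against the current round's.
function compareRounds(previous: Finding[], current: Finding[]): RoundComparison {
  const prevKeys = new Set(previous.map(matchKey));
  const currKeys = new Set(current.map(matchKey));
  return {
    newFindings: current.filter((f) => !prevKeys.has(matchKey(f))),
    persisting: current.filter((f) => prevKeys.has(matchKey(f))),
    resolved: previous.filter((f) => !currKeys.has(matchKey(f))),
  };
}
```

This key breaks as soon as a file is renamed or a title is reworded, which is the gap the LLM-based default is described as covering.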

Severity model

Findings are classified on two independent axes, then a P-level is derived:

| Confidence / Impact | Critical | Functional | Quality | Nitpick |
|---|---|---|---|---|
| Verified | P0 | P1 | P2 | P3 |
| Likely | P0 | P1 | P2 | P3 |
| Possible | P1 | P2 | P3 | P3 |
| Speculative | P2 | P3 | P3 | P3 |

Pre-existing findings (outside the diff hunks) are tagged and excluded from recommendations. In supervised mode, the user can explicitly request fixing a pre-existing finding.
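
The matrix above translates directly into a lookup table. The sketch below encodes it cell for cell; the type and function names are illustrative and not taken from the package source.

```typescript
type Confidence = "verified" | "likely" | "possible" | "speculative";
type Impact = "critical" | "functional" | "quality" | "nitpick";
type PLevel = "P0" | "P1" | "P2" | "P3";

// One row per confidence level, one column per impact level.
const P_LEVELS: Record<Confidence, Record<Impact, PLevel>> = {
  verified:    { critical: "P0", functional: "P1", quality: "P2", nitpick: "P3" },
  likely:      { critical: "P0", functional: "P1", quality: "P2", nitpick: "P3" },
  possible:    { critical: "P1", functional: "P2", quality: "P3", nitpick: "P3" },
  speculative: { critical: "P2", functional: "P3", quality: "P3", nitpick: "P3" },
};

// Derive the P-level for a finding from its two axes.
function derivePLevel(confidence: Confidence, impact: Impact): PLevel {
  return P_LEVELS[confidence][impact];
}
```

For example, a critical issue that a reviewer flags as only possible lands at P1 rather than P0, so lower confidence demotes a finding by at most two levels.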

Component overview

review-orchestra/
├── src/
│   ├── orchestrator.ts               # Main orchestration: preflight → reviewers → consolidate → return ReviewResult
│   ├── reviewers/
│   │   ├── types.ts                  # Reviewer interface (+ ReviewerCallContext)
│   │   ├── claude.ts                 # Claude headless reviewer
│   │   ├── codex.ts                  # Codex headless reviewer
│   │   ├── command.ts                # Command template parsing (handles \" and \\ escapes)
│   │   ├── prompt.ts                 # Review prompt builder
│   │   ├── raw-output.ts             # persistRawOutput helper — saves reviewer stdout before parsing
│   │   └── index.ts                  # Registry / factory (GenericReviewer for custom reviewers)
│   ├── consolidator.ts              # Dedup, classify, merge findings
│   ├── findings-store.ts            # Persistent cross-session finding storage (~/.review-orchestra/findings.jsonl)
│   ├── fuzzy-match.ts               # Fuzzy matching for cross-reviewer dedup (tokenize, Jaccard similarity, isFuzzyMatch)
│   ├── scope.ts                     # Diff scope auto-detection
│   ├── config.ts                    # Configuration loading & defaults
│   ├── types.ts                     # Shared types (Finding, Round, SessionState, ReviewResult, etc.)
│   ├── state.ts                     # SessionManager: session-based state tracking
│   ├── worktree-hash.ts             # Worktree hash computation and stale detection
│   ├── finding-comparison.ts        # Cross-round finding comparison (new/persisting/resolved)
│   ├── reviewer-parser.ts           # Parse/normalize reviewer output
│   ├── parse-args.ts                # Natural language CLI argument parsing
│   ├── process.ts                   # Process spawning with streaming
│   ├── toolchain.ts                 # Project tech stack detection
│   ├── progress.ts                  # Progress file (reviewer status during review, progress.json)
│   ├── preflight.ts                 # Validates required binaries
│   ├── checks.ts                    # Shared check functions for setup/doctor
│   ├── setup.ts                     # Setup command (runs checks + fixes)
│   ├── doctor.ts                    # Doctor command (runs checks + reports)
│   ├── json-utils.ts                # JSON extraction & envelope unwrapping
│   └── log.ts                       # Logging utilities
├── config/
│   └── default.json                  # Default configuration (reviewers, thresholds)
├── skill/
│   └── SKILL.md                      # Claude Code skill entry point (supervised flow)
├── prompts/
│   └── review.md                     # Template for reviewer agents
├── schemas/
│   └── findings.schema.json          # JSON schema for structured findings output
├── test/                             # Unit/integration tests (Vitest)
└── evals/                            # LLM eval harness
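
The tree lists fuzzy-match.ts as providing tokenize, Jaccard similarity, and isFuzzyMatch for cross-reviewer dedup. A minimal sketch of that technique follows. The tokenization rule and the 0.6 threshold are assumptions chosen for illustration; the package's actual values are not documented here.

```typescript
// Split text into a set of lowercase alphanumeric tokens.
function tokenize(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((token) => token.length > 0),
  );
}

// Jaccard similarity: size of the intersection divided by size of the union.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let intersection = 0;
  a.forEach((token) => {
    if (b.has(token)) intersection += 1;
  });
  return intersection / (a.size + b.size - intersection);
}

// Treat two finding titles as duplicates when their token sets overlap enough.
// The default threshold is an assumed value.
function isFuzzyMatch(x: string, y: string, threshold: number = 0.6): boolean {
  return jaccard(tokenize(x), tokenize(y)) >= threshold;
}
```

Two reviewers reporting "SQL injection in login query" and "SQL injection in the login query" share five of six distinct tokens, so the titles would be merged under this threshold.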

The orchestrator (src/orchestrator.ts) runs a single pass: preflight → parallel reviewers → consolidation → finding comparison → return ReviewResult. There is no loop — the user controls re-review decisions via the skill.
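
The parallel-reviewers step of that single pass could be sketched as below. It assumes that one reviewer failing should not abort the others and that failures are collected into something like reviewerErrors; the Reviewer shape here is simplified and runReviewers is a hypothetical name, not the orchestrator's actual function.

```typescript
// Simplified stand-ins for the package's types, for illustration only.
interface Finding {
  title: string;
}

interface Reviewer {
  name: string;
  review: () => Promise<Finding[]>;
}

interface ReviewerError {
  reviewer: string;
  message: string;
}

interface ReviewOutcome {
  reviews: Record<string, Finding[]>;
  reviewerErrors: ReviewerError[];
}

// Start every reviewer at once and wait for all of them. Each promise is
// converted into a success or failure record so one rejection cannot
// short-circuit Promise.all.
async function runReviewers(reviewers: Reviewer[]): Promise<ReviewOutcome> {
  const settled = await Promise.all(
    reviewers.map((r) =>
      r.review().then(
        (findings) => ({ ok: true as const, reviewer: r.name, findings }),
        (err: unknown) => ({
          ok: false as const,
          reviewer: r.name,
          message: err instanceof Error ? err.message : String(err),
        }),
      ),
    ),
  );

  const outcome: ReviewOutcome = { reviews: {}, reviewerErrors: [] };
  for (const s of settled) {
    if (s.ok) {
      outcome.reviews[s.reviewer] = s.findings;
    } else {
      outcome.reviewerErrors.push({ reviewer: s.reviewer, message: s.message });
    }
  }
  return outcome;
}
```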

State is managed by SessionManager (src/state.ts), which tracks sessions across multiple CLI invocations. Each invocation creates or continues a session, persisting round artifacts and worktree hashes for stale-detection and finding comparison.

Adding custom reviewers

Any CLI tool that accepts a prompt and produces output can be a reviewer. Implement the Reviewer interface:

// src/reviewers/types.ts
interface Reviewer {
  name: string;
  review(
    prompt: string,
    scope: DiffScope,
    context: ReviewerCallContext, // { roundNumber } — used to write round-N-<name>-raw.txt before parsing
  ): Promise<{ findings: Finding[]; rawOutput: string; elapsedMs?: number }>;
}

Or add a generic reviewer via config — any command with a {prompt} placeholder works:

{
  "reviewers": {
    "my-linter": {
      "enabled": true,
      "command": "my-tool review \"{prompt}\"",
      "outputFormat": "json"
    }
  }
}

The consolidator normalizes different output formats into the standard findings schema. Custom reviewers registered in config are handled by the GenericReviewer class, which executes the command and parses the output.
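
To make the {prompt} placeholder concrete, here is a sketch of how a command template could be turned into an argument vector. It splits on spaces, honors double quotes and the \" and \\ escapes, and substitutes placeholders only after splitting, so a prompt containing spaces or quotes stays a single argument. This is an illustration under those assumptions, not the code in src/reviewers/command.ts; splitCommand and buildArgv are hypothetical names.

```typescript
// Split a command template into arguments, honoring double quotes and the
// escapes \" and \\ (other backslashes are kept literally).
function splitCommand(template: string): string[] {
  const args: string[] = [];
  let current = "";
  let inQuotes = false;
  let started = false; // true once the current argument has begun (allows "")

  for (let i = 0; i < template.length; i++) {
    const ch = template.charAt(i);
    const next = template.charAt(i + 1);
    if (ch === "\\" && (next === '"' || next === "\\")) {
      current += next;
      started = true;
      i++; // skip the escaped character
    } else if (ch === '"') {
      inQuotes = !inQuotes;
      started = true;
    } else if (ch === " " && !inQuotes) {
      if (started) {
        args.push(current);
        current = "";
        started = false;
      }
    } else {
      current += ch;
      started = true;
    }
  }
  if (inQuotes) throw new Error("unterminated quote in command template");
  if (started) args.push(current);
  return args;
}

// Replace {name} placeholders inside each argument; unknown names are left as-is.
function buildArgv(template: string, values: Record<string, string>): string[] {
  return splitCommand(template).map((arg) =>
    arg.replace(/\{(\w+)\}/g, (match: string, name: string) => values[name] ?? match),
  );
}
```

Passing the resulting array to a process-spawning API without a shell avoids having the prompt text interpreted as shell syntax.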

Tests and evals

# Lint (type-check)
npm run lint

# Unit/integration tests (Vitest)
npm test

# Run all evals (LLM-as-judge against synthetic repos)
npm run eval

# Run a single eval fixture
npm run eval -- sql-injection

# Override the judge model
npm run eval -- --judge-model claude-sonnet-4-6

Tests cover the deterministic components (scope detection, consolidator, config, session management, reviewer parser, finding comparison, worktree hashing, CLI subcommands), which were developed test-first (TDD). LLM-facing components (reviewer adapters, orchestrator wiring) have integration tests. The eval harness uses synthetic repos with planted bugs and LLM-as-judge scoring for precision, recall, and severity accuracy.

License

MIT
