review-orchestra
v0.1.2
Multi-model automated code review orchestration for Claude Code
Multi-model code review orchestration for Claude Code. Runs multiple AI reviewers (Claude + Codex by default) in parallel, consolidates findings, and presents them to the user. The orchestrator Claude fixes code directly with user guidance in a supervised loop.
How it works
+-------------------+
|   Detect Scope    |  git diff / branch diff / PR diff
+--------+----------+
         |
+--------v----------+
|  Parallel Review  |  Claude + Codex (headless, concurrent)
+--------+----------+
         |
+--------v----------+
|    Consolidate    |  Dedup, classify confidence × impact, compute P-level
+--------+----------+
         |
+--------v----------+
|  Return findings  |  ReviewResult JSON → skill → user
+-------------------+
- Scope detection: auto-detects uncommitted changes, branch diff vs main, or open PR diff.
- Parallel review: launches all configured reviewers as headless CLI processes concurrently.
- Consolidation: deduplicates findings across reviewers, classifies each on two axes (confidence × impact), computes a P-level (P0–P3), tags pre-existing issues outside the diff, and compares findings against previous rounds.
- Return findings: the CLI returns a ReviewResult JSON on stdout. The skill presents findings to the user, who decides what to fix.
After receiving findings, the orchestrator Claude (who wrote the code and has full context) fixes issues directly with user guidance. The user controls what gets fixed and when to re-review.
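The parallel review step can be pictured as a fan-out in which one reviewer's failure is recorded instead of aborting the run. The following is a minimal sketch under stated assumptions, not the project's implementation: SketchFinding, SketchReviewer, runOne, and runAll are hypothetical names, and the idea that failures end up in a reviewerErrors list is inferred from the session.json fields described later.

```typescript
// Hypothetical shapes for illustration only; the real types live in src/types.ts.
interface SketchFinding { file: string; title: string; }
interface SketchReviewer { name: string; review(prompt: string): Promise<SketchFinding[]>; }
interface ReviewerError { reviewer: string; message: string; }
interface RunAllResult {
  reviews: Record<string, SketchFinding[]>;
  reviewerErrors: ReviewerError[];
}

type Outcome =
  | { name: string; ok: true; findings: SketchFinding[] }
  | { name: string; ok: false; message: string };

// Run one reviewer and turn a thrown error into a value.
async function runOne(reviewer: SketchReviewer, prompt: string): Promise<Outcome> {
  try {
    return { name: reviewer.name, ok: true, findings: await reviewer.review(prompt) };
  } catch (err) {
    return { name: reviewer.name, ok: false, message: String(err) };
  }
}

async function runAll(reviewers: SketchReviewer[], prompt: string): Promise<RunAllResult> {
  // map() starts every review before any of them is awaited, so they run concurrently.
  const outcomes = await Promise.all(reviewers.map((r) => runOne(r, prompt)));
  const result: RunAllResult = { reviews: {}, reviewerErrors: [] };
  for (const outcome of outcomes) {
    if (outcome.ok) {
      result.reviews[outcome.name] = outcome.findings;
    } else {
      result.reviewerErrors.push({ reviewer: outcome.name, message: outcome.message });
    }
  }
  return result;
}
```

Because errors are converted to values inside runOne, the surviving reviewers' findings are still available for consolidation when one reviewer fails.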
Installation
Prerequisites
- Node.js >= 22
- Claude Code CLI (claude)
- Codex CLI (codex): optional, can be disabled
From npm (recommended)
npm install -g review-orchestra
review-orchestra setup
From source (contributors)
git clone https://github.com/ckailash/review-orchestra.git
cd review-orchestra
npm install
npm run build
review-orchestra setup
review-orchestra setup validates your environment (Node version, required CLIs), creates the skill symlink for Claude Code, and configures .gitignore. Run review-orchestra doctor anytime to diagnose issues.
Usage
Invoke via the /review-orchestra skill in Claude Code. Arguments are natural language — no flags.
/review-orchestra # auto-detect scope, all defaults
/review-orchestra src/auth/ src/api/ # only review these paths
/review-orchestra fix quality issues too # extend threshold to P2
/review-orchestra only use claude # single reviewer
/review-orchestra skip codex # disable a specific reviewer
Defaults (when no arguments are provided):
- Auto-detect diff scope
- Both Claude + Codex reviewers
- Stop-at threshold: P1 (all critical + functional issues recommended for fixing)
CLI subcommands
review-orchestra review # run reviewers + consolidate, return findings (default)
review-orchestra review src/services/ # with scope args
review-orchestra review only claude # natural language filtering
review-orchestra stale # check if worktree changed since last review (exit 0=fresh, 1=stale, 2=no session)
review-orchestra reset # clear session state
review-orchestra setup # first-time install + repair broken install
review-orchestra doctor # diagnose issues without modifying anything
Running with no subcommand (just scope args) is equivalent to review-orchestra review.
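A script that wraps review-orchestra stale only needs the three documented exit codes. The sketch below maps them to labels; interpretStaleExit and shouldReReview are hypothetical helper names, and the re-review policy is an assumption for illustration. The exit code itself would come from whatever launches the CLI, for example the status field returned by Node's child_process.spawnSync.

```typescript
type Freshness = "fresh" | "stale" | "no-session";

// Map the documented exit codes of `review-orchestra stale` to a label:
// 0 = fresh, 1 = stale, 2 = no session.
function interpretStaleExit(code: number): Freshness {
  if (code === 0) return "fresh";
  if (code === 1) return "stale";
  if (code === 2) return "no-session";
  throw new Error(`unexpected exit code from review-orchestra stale: ${code}`);
}

// Assumed policy for this sketch: run a review unless the last one is still fresh.
function shouldReReview(code: number): boolean {
  return interpretStaleExit(code) !== "fresh";
}
```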
Supervised workflow
The CLI runs review + consolidation and returns a ReviewResult JSON. The skill then enters the supervised loop:
1. Review: run review-orchestra review (spawns reviewers, consolidates, returns findings)
2. Present: skill shows findings grouped by severity with progressive disclosure
3. User decides: user selects which findings to fix (by ID, severity band, or "all")
4. Confirm: skill echoes back planned actions, waits for confirmation
5. Fix: orchestrator Claude fixes code directly using Edit/Write tools
6. Re-review: if requested, run review-orchestra review again (back to step 1)
7. Done: summarize session, suggest next action (commit, push, PR)
The user controls the loop at every step. Escalation is implicit: the user sees all findings and decides.
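The "user decides" step accepts a selection by ID, by severity band, or "all". In the skill this is handled conversationally; the sketch below only shows one mechanical reading of the three forms. The ID format ("F1"), the pLevel field name, and the choice that a band such as "p1" selects exactly that level are assumptions, not documented behavior.

```typescript
type PLevel = "p0" | "p1" | "p2" | "p3";

interface SketchFinding {
  id: string;      // hypothetical ID format, e.g. "F1"
  pLevel: PLevel;
}

// Resolve a user selection to findings. Accepted forms in this sketch:
// "all", a single severity band such as "p1", or a comma-separated list of IDs.
function selectFindings(findings: SketchFinding[], selection: string): SketchFinding[] {
  const text = selection.trim().toLowerCase();
  if (text === "all") return findings.slice();
  if (/^p[0-3]$/.test(text)) return findings.filter((f) => f.pLevel === text);
  const ids = text.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
  return findings.filter((f) => ids.indexOf(f.id.toLowerCase()) !== -1);
}
```

An empty result is a useful signal for the confirm step: nothing matched, so the skill would ask again instead of fixing anything.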
Session artifacts
All session data lives in .review-orchestra/ in the project root. Reviewer outputs are persisted as flat files (no per-round subdirectories) so debug evidence survives parse failures and crashes:
.review-orchestra/
├── session.json # Session state (sessionId, status, scope, rounds[], reviews, consolidated, reviewerErrors)
├── state.lock # PID file for concurrent-run prevention
├── progress.json # Live reviewer status during a run (deleted on round complete)
├── round-1-claude-raw.txt # Raw stdout from each reviewer, written before parsing
├── round-1-codex-raw.txt # On parse failure renamed to *.raw.txt; on codex failure renamed *.failed
└── round-2-claude-raw.txt
Session state and reviewer outputs are never auto-deleted; only progress.json is removed when a round completes. Remove everything with rm -rf .review-orchestra/ or review-orchestra reset when done. setup adds .review-orchestra/ to .gitignore automatically.
Concurrent runs of review-orchestra review in the same project are blocked by state.lock (atomic-rename release protocol with PID re-check) so two terminals don't trample each other's session.
Session state (session.json)
{
  "sessionId": "20260315-143022",
  "status": "active",
  "scope": { "type": "branch", "baseBranch": "main", "description": "branch feat/auth vs main" },
  "currentRound": 2,
  "worktreeHash": "abc123",
  "rounds": [
    {
      "number": 1,
      "phase": "complete",
      "worktreeHash": "def456",
      "reviews": { "claude": { "findings": [], "metadata": { } }, "codex": { } },
      "consolidated": [],
      "reviewerErrors": [],
      "findingsPersisted": true,
      "startedAt": "2026-03-15T14:30:22Z",
      "completedAt": "2026-03-15T14:32:11Z"
    }
  ],
  "startedAt": "2026-03-15T14:30:22Z",
  "completedAt": null
}
Key fields:
- sessionId: timestamp-based unique ID (e.g. 20260315-143022)
- status: active, expired (scope base changed, requires reset), or completed
- scope: diff scope detected at session creation (with pathFilters if any)
- currentRound: most recent round number
- worktreeHash: SHA-256 over HEAD + staged + unstaged + untracked files; per-round snapshot for stale detection
- rounds[]: per-round artifacts: phase, reviews (per-reviewer findings + metadata), consolidated, reviewerErrors (preserved across crash recovery), findingsPersisted, timestamps
- startedAt: session creation timestamp
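The worktreeHash field combines four inputs into one SHA-256 digest. The sketch below shows one way such a hash could be built and compared for stale detection. It is an illustration under assumptions: WorktreeSnapshot, hashWorktree, and isStale are hypothetical names, and the way the inputs are gathered and framed (length prefixes, sorted paths) is a choice made here, not the layout used by src/worktree-hash.ts.

```typescript
import { createHash } from "node:crypto";

// Hypothetical snapshot of the four inputs named in the field description.
interface WorktreeSnapshot {
  head: string;                      // commit SHA of HEAD
  stagedDiff: string;                // e.g. output of `git diff --cached`
  unstagedDiff: string;              // e.g. output of `git diff`
  untracked: Record<string, string>; // path -> file contents
}

function hashWorktree(snapshot: WorktreeSnapshot): string {
  const hash = createHash("sha256");
  // Length-prefix each part so adjacent parts cannot blur into each other.
  const add = (part: string): void => {
    hash.update(`${Buffer.byteLength(part)}:`);
    hash.update(part);
  };
  add(snapshot.head);
  add(snapshot.stagedDiff);
  add(snapshot.unstagedDiff);
  // Sort paths so the digest does not depend on enumeration order.
  for (const path of Object.keys(snapshot.untracked).sort()) {
    add(path);
    add(snapshot.untracked[path]);
  }
  return hash.digest("hex");
}

// A review is stale when the worktree no longer matches the stored hash.
function isStale(storedHash: string, current: WorktreeSnapshot): boolean {
  return hashWorktree(current) !== storedHash;
}
```

Storing one such digest per round lets a later invocation decide cheaply whether the code has changed since the findings were produced.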
Session lifecycle:
- review-orchestra review → creates or continues session, runs reviewers + consolidation, returns findings
- review-orchestra reset → clears the session (equivalent to rm -rf .review-orchestra/)
- Session auto-expires if the scope base changes (e.g., new commits on main): the next run reports a stale session error, and the user must reset and start fresh
- Detached HEAD: baseBranch becomes detached@<sha7> (not the literal "HEAD") so cross-round comparison stays stable
Crash recovery: if a round was interrupted mid-reviewing or mid-consolidating, the next review invocation resumes from the persisted phase. Saved reviewer findings and reviewer errors carry forward.
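Resuming from a persisted phase amounts to deciding which steps are still outstanding. The sketch below is one possible reading, not the SessionManager logic: only the phase value "complete" appears in the session.json example, so the names "reviewing" and "consolidating" are assumptions, as are planResume and the rule that a reviewer with a saved result is not run again.

```typescript
// "complete" is taken from the session.json example; the other two names are assumed.
type Phase = "reviewing" | "consolidating" | "complete";

interface SketchRound {
  phase: Phase;
  reviews: Record<string, unknown>; // reviewer name -> saved result
}

interface ResumePlan {
  reviewersToRun: string[];
  consolidate: boolean;
}

function planResume(round: SketchRound, configured: string[]): ResumePlan {
  if (round.phase === "complete") {
    return { reviewersToRun: [], consolidate: false };
  }
  if (round.phase === "consolidating") {
    // All reviewer output is already saved; only consolidation is left.
    return { reviewersToRun: [], consolidate: true };
  }
  // Interrupted while reviewing: saved results carry forward, the rest run again.
  const saved = round.reviews;
  return {
    reviewersToRun: configured.filter(
      (name) => !Object.prototype.hasOwnProperty.call(saved, name),
    ),
    consolidate: true,
  };
}
```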
Configuration
Configuration lives in config/default.json:
{
  "reviewers": {
    "claude": {
      "enabled": true,
      "command": "claude -p - --allowedTools \"Read,Grep,Glob,Bash\" --output-format json",
      "outputFormat": "json"
    },
    "codex": {
      "enabled": true,
      "command": "codex exec - --output-last-message {outputFile} --json",
      "outputFormat": "json"
    }
  },
  "thresholds": {
    "stopAt": "p1"
  }
}
Thresholds
| Setting | Default | Description |
|---|---|---|
| stopAt | p1 | Suggests which findings to fix. p0 = critical only, p1 = critical + functional, p2 = + quality, p3 = fix everything. The user controls the loop and decides when to stop. |
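The stopAt setting is an ordered cutoff: a finding is suggested for fixing when its P-level is at least as severe as the threshold. A minimal sketch of that comparison follows; ORDER and isRecommended are hypothetical names used only for this illustration.

```typescript
type PLevel = "p0" | "p1" | "p2" | "p3";

// Most severe first.
const ORDER: PLevel[] = ["p0", "p1", "p2", "p3"];

// A finding is recommended when it is at least as severe as the threshold,
// i.e. its position in ORDER is not after the threshold's position.
function isRecommended(pLevel: PLevel, stopAt: PLevel): boolean {
  return ORDER.indexOf(pLevel) <= ORDER.indexOf(stopAt);
}
```

With the default stopAt of p1, P0 and P1 findings are suggested while P2 and P3 are shown without a fix recommendation.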
Finding comparison
Cross-round finding comparison uses LLM-based semantic matching by default (via Claude Haiku). This handles renamed files, shifted line numbers, and reworded descriptions across review rounds. On LLM failure or timeout, it falls back to heuristic matching (file + title.toLowerCase()). Configure via findingComparison in config/default.json:
| Setting | Default | Description |
|---|---|---|
| method | "llm" | Comparison method: "llm" for semantic matching, "heuristic" for file+title matching |
| model | "claude-haiku-4-5" | Model to use for LLM comparison |
| timeoutMs | 60000 | Timeout for LLM comparison call (ms) |
| fallback | "heuristic" | Fallback method when LLM fails |
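The heuristic fallback and the timeout behavior can be sketched together. The key below follows the description above (file plus lowercased title); everything else is an assumption for illustration: SketchFinding, compareRounds, compareWithFallback, the "::" separator, and the result field names are hypothetical, and the real logic lives in src/finding-comparison.ts.

```typescript
interface SketchFinding { file: string; title: string; }

interface Comparison {
  newFindings: SketchFinding[];
  persisting: SketchFinding[];
  resolved: SketchFinding[];
}

// Heuristic identity: same file and same title, ignoring case.
function heuristicKey(f: SketchFinding): string {
  return `${f.file}::${f.title.toLowerCase()}`;
}

function compareRounds(previous: SketchFinding[], current: SketchFinding[]): Comparison {
  const prevKeys = previous.map(heuristicKey);
  const currKeys = current.map(heuristicKey);
  return {
    newFindings: current.filter((f) => prevKeys.indexOf(heuristicKey(f)) === -1),
    persisting: current.filter((f) => prevKeys.indexOf(heuristicKey(f)) !== -1),
    resolved: previous.filter((f) => currKeys.indexOf(heuristicKey(f)) === -1),
  };
}

// Try the primary (e.g. LLM) comparison; on error or timeout use the fallback.
async function compareWithFallback<T>(
  primary: () => Promise<T>,
  fallback: () => T,
  timeoutMs: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error("comparison timed out")), timeoutMs);
  });
  try {
    return await Promise.race([primary(), timeout]);
  } catch (err) {
    return fallback();
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

The heuristic misses renamed files and reworded titles, which is the gap the semantic method is described as covering.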
Severity model
Findings are classified on two independent axes, then a P-level is derived:
| Confidence / Impact | Critical | Functional | Quality | Nitpick |
|---|---|---|---|---|
| Verified | P0 | P1 | P2 | P3 |
| Likely | P0 | P1 | P2 | P3 |
| Possible | P1 | P2 | P3 | P3 |
| Speculative | P2 | P3 | P3 | P3 |
Pre-existing findings (outside the diff hunks) are tagged and excluded from recommendations. In supervised mode, the user can explicitly request fixing a pre-existing finding.
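The severity matrix is a plain two-key lookup. The sketch below transcribes the table as a nested record; the type names, the lowercase axis labels, and the pLevel function are hypothetical and may not match the identifiers used in src/consolidator.ts.

```typescript
type Confidence = "verified" | "likely" | "possible" | "speculative";
type Impact = "critical" | "functional" | "quality" | "nitpick";
type PLevel = "P0" | "P1" | "P2" | "P3";

// Rows are confidence, columns are impact, exactly as in the table above.
const P_LEVELS: Record<Confidence, Record<Impact, PLevel>> = {
  verified:    { critical: "P0", functional: "P1", quality: "P2", nitpick: "P3" },
  likely:      { critical: "P0", functional: "P1", quality: "P2", nitpick: "P3" },
  possible:    { critical: "P1", functional: "P2", quality: "P3", nitpick: "P3" },
  speculative: { critical: "P2", functional: "P3", quality: "P3", nitpick: "P3" },
};

function pLevel(confidence: Confidence, impact: Impact): PLevel {
  return P_LEVELS[confidence][impact];
}
```

Typing the table as a Record over both unions means the compiler rejects a matrix with a missing row or column.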
Component overview
review-orchestra/
├── src/
│ ├── orchestrator.ts # Main orchestration: preflight → reviewers → consolidate → return ReviewResult
│ ├── reviewers/
│ │ ├── types.ts # Reviewer interface (+ ReviewerCallContext)
│ │ ├── claude.ts # Claude headless reviewer
│ │ ├── codex.ts # Codex headless reviewer
│ │ ├── command.ts # Command template parsing (handles \" and \\ escapes)
│ │ ├── prompt.ts # Review prompt builder
│ │ ├── raw-output.ts # persistRawOutput helper — saves reviewer stdout before parsing
│ │ └── index.ts # Registry / factory (GenericReviewer for custom reviewers)
│ ├── consolidator.ts # Dedup, classify, merge findings
│ ├── findings-store.ts # Persistent cross-session finding storage (~/.review-orchestra/findings.jsonl)
│ ├── fuzzy-match.ts # Fuzzy matching for cross-reviewer dedup (tokenize, Jaccard similarity, isFuzzyMatch)
│ ├── scope.ts # Diff scope auto-detection
│ ├── config.ts # Configuration loading & defaults
│ ├── types.ts # Shared types (Finding, Round, SessionState, ReviewResult, etc.)
│ ├── state.ts # SessionManager: session-based state tracking
│ ├── worktree-hash.ts # Worktree hash computation and stale detection
│ ├── finding-comparison.ts # Cross-round finding comparison (new/persisting/resolved)
│ ├── reviewer-parser.ts # Parse/normalize reviewer output
│ ├── parse-args.ts # Natural language CLI argument parsing
│ ├── process.ts # Process spawning with streaming
│ ├── toolchain.ts # Project tech stack detection
│ ├── progress.ts # Progress file (reviewer status during review, progress.json)
│ ├── preflight.ts # Validates required binaries
│ ├── checks.ts # Shared check functions for setup/doctor
│ ├── setup.ts # Setup command (runs checks + fixes)
│ ├── doctor.ts # Doctor command (runs checks + reports)
│ ├── json-utils.ts # JSON extraction & envelope unwrapping
│ └── log.ts # Logging utilities
├── config/
│ └── default.json # Default configuration (reviewers, thresholds)
├── skill/
│ └── SKILL.md # Claude Code skill entry point (supervised flow)
├── prompts/
│ └── review.md # Template for reviewer agents
├── schemas/
│ └── findings.schema.json # JSON schema for structured findings output
├── test/ # Unit/integration tests (Vitest)
└── evals/ # LLM eval harness
The orchestrator (src/orchestrator.ts) runs a single pass: preflight → parallel reviewers → consolidation → finding comparison → return ReviewResult. There is no loop; the user controls re-review decisions via the skill.
State is managed by SessionManager (src/state.ts), which tracks sessions across multiple CLI invocations. Each invocation creates or continues a session, persisting round artifacts and worktree hashes for stale-detection and finding comparison.
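The cross-reviewer dedup listed under fuzzy-match.ts rests on token overlap. The sketch below shows the general technique (Jaccard similarity over word sets). The function names echo the ones mentioned in the component tree, but their signatures, the tokenization rule, and the 0.6 threshold are assumptions made for this example.

```typescript
// Lowercase and split on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  const seen: Record<string, boolean> = {};
  const tokens: string[] = [];
  for (const token of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (token.length > 0 && !seen[token]) {
      seen[token] = true;
      tokens.push(token); // keep each distinct token once (set semantics)
    }
  }
  return tokens;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the distinct tokens.
function jaccard(a: string[], b: string[]): number {
  if (a.length === 0 && b.length === 0) return 1;
  const intersection = a.filter((token) => b.indexOf(token) !== -1).length;
  const union = a.length + b.length - intersection;
  return intersection / union;
}

// The threshold is an assumed value; tune it against real reviewer output.
function isFuzzyMatch(a: string, b: string, threshold: number = 0.6): boolean {
  return jaccard(tokenize(a), tokenize(b)) >= threshold;
}
```

For example, "SQL injection in login query" and "sql injection in the login query" share 5 of 6 distinct tokens, a similarity of about 0.83, so two reviewers wording the same issue slightly differently would be merged under this threshold.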
Adding custom reviewers
Any CLI tool that accepts a prompt and produces output can be a reviewer. Implement the Reviewer interface:
// src/reviewers/types.ts
interface Reviewer {
  name: string;
  review(
    prompt: string,
    scope: DiffScope,
    context: ReviewerCallContext, // { roundNumber }, used to write round-N-<name>-raw.txt before parsing
  ): Promise<{ findings: Finding[]; rawOutput: string; elapsedMs?: number }>;
}
Or add a generic reviewer via config. Any command with a {prompt} placeholder works:
{
  "reviewers": {
    "my-linter": {
      "enabled": true,
      "command": "my-tool review \"{prompt}\"",
      "outputFormat": "json"
    }
  }
}
The consolidator normalizes different output formats into the standard findings schema. Custom reviewers registered in config are handled by the GenericReviewer class, which executes the command and parses the output.
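A command template such as the one above has to be split into arguments and have its placeholders filled. The sketch below shows one way to do that safely, under assumptions: splitCommand and buildArgv are hypothetical names, only double quotes and the \" and \\ escapes are handled (the two escapes the component overview attributes to command.ts), and it assumes the command is run as an argument vector without a shell, which the source does not state.

```typescript
// Split a command template into argv-style tokens.
// Supports double-quoted segments and the escapes \" and \\.
function splitCommand(template: string): string[] {
  const tokens: string[] = [];
  let current = "";
  let inQuotes = false;
  let hasToken = false;
  for (let i = 0; i < template.length; i++) {
    const ch = template.charAt(i);
    const next = template.charAt(i + 1);
    if (ch === "\\" && (next === '"' || next === "\\")) {
      current += next;
      hasToken = true;
      i++; // skip the escaped character
    } else if (ch === '"') {
      inQuotes = !inQuotes;
      hasToken = true; // "" counts as an empty argument
    } else if (ch === " " && !inQuotes) {
      if (hasToken) tokens.push(current);
      current = "";
      hasToken = false;
    } else {
      current += ch;
      hasToken = true;
    }
  }
  if (inQuotes) throw new Error("unterminated quote in command template");
  if (hasToken) tokens.push(current);
  return tokens;
}

// Fill {name} placeholders after splitting, so a substituted value always
// stays a single argument no matter what characters it contains.
function buildArgv(template: string, values: Record<string, string>): string[] {
  return splitCommand(template).map((token) =>
    token.replace(/\{(\w+)\}/g, (match: string, name: string) =>
      Object.prototype.hasOwnProperty.call(values, name) ? values[name] : match,
    ),
  );
}
```

Substituting after splitting is the design point: a prompt containing quotes or spaces cannot change how the command line is tokenized.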
Tests and evals
# Lint (type-check)
npm run lint
# Unit/integration tests (Vitest)
npm test
# Run all evals (LLM-as-judge against synthetic repos)
npm run eval
# Run a single eval fixture
npm run eval -- sql-injection
# Override the judge model
npm run eval -- --judge-model claude-sonnet-4-6
Tests cover the deterministic components (scope detection, consolidator, config, session management, reviewer parser, finding comparison, worktree hashing, CLI subcommands) and were written test-first (TDD). LLM-facing components (reviewer adapters, orchestrator wiring) have integration tests. The eval harness uses synthetic repos with planted bugs and LLM-as-judge scoring for precision, recall, and severity accuracy.
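For a fixture with planted bugs, precision and recall reduce to three counts. The sketch below uses the standard definitions; EvalCounts and the convention of returning 0 for an empty denominator are assumptions for this example, not the eval harness's actual data model.

```typescript
interface EvalCounts {
  truePositives: number;  // planted bugs matched by a reported finding
  falsePositives: number; // reported findings that match no planted bug
  falseNegatives: number; // planted bugs that no finding matched
}

// Share of reported findings that were real.
function precision(c: EvalCounts): number {
  const reported = c.truePositives + c.falsePositives;
  return reported === 0 ? 0 : c.truePositives / reported;
}

// Share of planted bugs that were found.
function recall(c: EvalCounts): number {
  const planted = c.truePositives + c.falseNegatives;
  return planted === 0 ? 0 : c.truePositives / planted;
}
```

For instance, a run that reports 4 findings, 3 of which match the 5 planted bugs, has a precision of 0.75 and a recall of 0.6.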
License
MIT
