@shpitdev/codexharness

v0.0.5

Published

3 months ago

Codex conductor, nanny, and TUI runtime harness.


dimethylpant

codex-orchestration

Browser-observable multi-agent orchestration on Codex app-server.

What This Repo Does

codex-conductor: runs multi-agent implementation loops (solutionlead -> engineer -> tester) and captures full run telemetry.
codex-nanny: separate thread watcher for human Codex sessions; sends idle nudges and follow-up prompts.
Browser monitor: live + rewind visualization of run events, handoffs, messages, and checkpoints.

The runtime is app-server only (no SDK backend path).

How Conductor Works

Start a run from a spec file or --prompt.
Runner executes role turns through Codex app-server threads.
Every event is persisted to run artifacts (events.jsonl, per-role logs, turn snapshots, reports).
A local monitor service serves the web viewer from apps/web build artifacts for live state and rewind.
Monitor stays alive after completion so runs can be reviewed later.
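The flow above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the actual runner: the RunEvent shape, the TurnFn callback, and the in-memory event list are hypothetical, and a single pass through the roles is shown where the real runner may iterate. The real runner drives Codex app-server threads and writes events.jsonl to disk.

```typescript
// Hypothetical sketch of one pass through the role chain.
// Event shape and function names are assumptions, not the package's API.
type Role = "solutionlead" | "engineer" | "tester";

interface RunEvent {
  seq: number;
  role: Role;
  kind: "turn_started" | "turn_completed";
  output?: string;
}

// Stand-in for a real app-server turn: takes a role and the previous output.
type TurnFn = (role: Role, input: string) => string;

const ROLE_ORDER: Role[] = ["solutionlead", "engineer", "tester"];

function runLoop(prompt: string, runTurn: TurnFn): { events: RunEvent[]; jsonl: string } {
  const events: RunEvent[] = [];
  let input = prompt;
  for (const role of ROLE_ORDER) {
    events.push({ seq: events.length, role, kind: "turn_started" });
    const output = runTurn(role, input);
    events.push({ seq: events.length, role, kind: "turn_completed", output });
    input = output; // hand the result off to the next role
  }
  // One JSON object per line, the layout a JSONL timeline file uses.
  const jsonl = events.map((e) => JSON.stringify(e)).join("\n");
  return { events, jsonl };
}

const demo = runLoop("implement X", (role, input) => `${role} handled: ${input}`);
console.log(demo.jsonl);
```

Because every event is appended before the next turn starts, the timeline can be replayed in order later, which is what the monitor's rewind view relies on.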

Getting Started

bun install
bun link

After bun link, commands are available globally:

codex-conductor
codex-nanny
codex-tui

Build host-native CLI binaries:

bun run build:cli

Outputs:

dist/codex-conductor
dist/codex-nanny
dist/codex-tui
dist/monitor-web/ (staged copy of apps/web/dist used by monitor serving)

Compiled binaries include runner/nanny/TUI internals (no runtime dependency on src/*.ts paths).

Fresh-machine binary flow check (isolated bundle, no source-path dependency):

bun run test:e2e:fresh-binary -- --dist-dir ./dist

npm packaging plan and publish gates:

docs/npm-packaging-plan.md
RELEASE.md

Target public package: @shpitdev/codexharness.

Monorepo Scaffold

The repo now includes first-pass app/package boundaries for the long-term split:

apps/web — Solid web monitor viewer (bun run dev:web)
apps/tui — OpenTUI run list + status badges + live events tail + turn detail inspector + final gate panel (bun run dev:tui)
packages/core — core runner/state/policy/audit runtime
packages/cli — conductor/nanny/monitor command and daemon surfaces
packages/monitor-api — monitor run/event discovery + API response shaping

Runtime now lives in packages/core + packages/cli.

Current extracted core modules:

packages/core/src/state.ts
packages/core/src/threadTypes.ts
packages/core/src/audit.ts
packages/core/src/policy.ts
packages/core/src/runDirs.ts
packages/core/src/threadBackend.ts
packages/core/src/appServerClient.ts
packages/core/src/threadEvents.ts
packages/core/src/threadBackendAppServer.ts
packages/core/src/evidence.ts
packages/core/src/artifacts.ts
packages/core/src/io.ts
packages/core/src/report.ts
packages/core/src/runner.ts
packages/core/src/agentDocs.ts
packages/core/src/chime.ts
packages/core/src/cliArgs.ts
packages/core/src/env.ts
packages/core/src/gitignore.ts
packages/core/src/schema.ts
packages/core/src/threadState.ts
packages/core/src/todos.ts

Current extracted CLI modules:

packages/cli/src/codexConductor.ts
packages/cli/src/codexNanny.ts
packages/cli/src/conductorMonitor.ts
packages/cli/src/nanny.ts
packages/cli/src/nannyPolicy.ts
packages/cli/src/nannyState.ts
packages/cli/src/thread-cli.ts
packages/cli/src/report-cli.ts

Current extracted monitor API modules:

packages/monitor-api/src/index.ts

Source-root compatibility shims have been removed; scripts/tests now import package modules directly.

Quickstart

Run conductor in current folder:

codex-conductor -p "implement X in this repo"

Run from spec:

codex-conductor specs/v1/example.md --fresh

Defaults:

workdir: current directory
monitor: auto-start detached local daemon process
model: gpt-5.3-codex-spark
reasoning effort: xhigh
verification gate: tester must provide verification evidence when should_test=true
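The verification gate default can be read as a simple predicate. The sketch below is an assumption about its shape: only should_test appears in this README, while the GateInput type and the verification_evidence field name are hypothetical.

```typescript
// Hypothetical gate input; only `should_test` is named in the README.
interface GateInput {
  should_test: boolean;
  verification_evidence?: string[]; // assumed field name
}

// Returns true when the run may pass the gate.
function passesVerificationGate(input: GateInput): boolean {
  if (!input.should_test) return true; // gate applies only when testing is required
  const evidence = input.verification_evidence ?? [];
  // Require at least one non-blank evidence entry.
  return evidence.some((item) => item.trim().length > 0);
}

const gateResults = {
  noTestNeeded: passesVerificationGate({ should_test: false }),
  missingEvidence: passesVerificationGate({ should_test: true }),
  blankEvidence: passesVerificationGate({ should_test: true, verification_evidence: ["  "] }),
  withEvidence: passesVerificationGate({
    should_test: true,
    verification_evidence: ["bun run test:unit passed"],
  }),
};
console.log(gateResults);
```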

At run end, conductor prints the monitor URL for that run.

Usage:

codex-conductor <spec-file|-p|--prompt ...> [runner flags]
codex-conductor monitor <start|status|stop|open> [--workdir <path>] [--port <n>]
codex-conductor tui [--workdir <path>] [--monitor-port <n>] [--monitor-base-url <url>]
codex-tui [--workdir <path>] [--monitor-port <n>] [--monitor-base-url <url>]

Model override example:

codex-conductor --prompt "implement X" --model gpt-5.3-codex-spark --model-reasoning-effort xhigh

Reviewing A Run

Start a run with codex-conductor.
Open the printed monitor URL (or run codex-conductor monitor open --workdir .).
Use the rewind slider for event-by-event replay.
Select stage nodes or trace cards to inspect parsed turn detail (trigger/actions/output/next).
Switch Trace/Raw to compare consolidated turn cards vs raw event rows.
Expand Todo Overview to inspect latest per-role todos and todo update history.
Review final artifacts under .runner-state/runs/<runId>/.
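Event-by-event replay amounts to rebuilding state from the first N events of the timeline. The following is a rough sketch of that idea, assuming a hypothetical event shape (role, kind) and a hypothetical ReplayState; the real viewer and the actual events.jsonl fields may differ.

```typescript
// Hypothetical event shape; real events.jsonl fields may differ.
interface TimelineEvent {
  role: string;
  kind: string;
}

interface ReplayState {
  eventsApplied: number;
  activeRole: string | null;
  turnsCompleted: number;
}

// Rebuild state as of slider position `position` (number of events applied).
function replayTo(events: TimelineEvent[], position: number): ReplayState {
  const clamped = Math.max(0, Math.min(position, events.length));
  const state: ReplayState = { eventsApplied: 0, activeRole: null, turnsCompleted: 0 };
  for (const event of events.slice(0, clamped)) {
    state.eventsApplied += 1;
    if (event.kind === "turn_started") state.activeRole = event.role;
    if (event.kind === "turn_completed") {
      state.turnsCompleted += 1;
      state.activeRole = null;
    }
  }
  return state;
}

const timeline: TimelineEvent[] = [
  { role: "solutionlead", kind: "turn_started" },
  { role: "solutionlead", kind: "turn_completed" },
  { role: "engineer", kind: "turn_started" },
  { role: "engineer", kind: "turn_completed" },
];
console.log(replayTo(timeline, 3));
```

Replaying from the start on every slider move is the simplest approach; a viewer handling long runs might cache intermediate states instead.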

Monitor Commands

codex-conductor monitor status
codex-conductor monitor start --workdir . --port 42427
codex-conductor monitor open --workdir .
codex-conductor monitor stop --workdir .

Notes:

Monitor home lists recent run metadata with stage/status and prompt previews.

TUI Command

codex-conductor tui --workdir .
codex-tui --workdir .
codex-tui --monitor-base-url http://127.0.0.1:42427

Notes:

codex-conductor tui and codex-tui auto-start monitor for --workdir unless --monitor-base-url is provided.
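The note above describes a two-way choice that can be sketched as follows. This is an illustration only: the option and result types are hypothetical, and the fallback port 42427 and host 127.0.0.1 are taken from the examples in this README, not from a documented default.

```typescript
// Hypothetical option shape mirroring --monitor-base-url and --monitor-port.
interface TuiOptions {
  monitorBaseUrl?: string;
  monitorPort?: number;
}

interface MonitorTarget {
  baseUrl: string;
  autoStart: boolean; // whether the TUI would start a local monitor itself
}

// Assumption: port copied from the README examples, not a confirmed default.
const ASSUMED_PORT = 42427;

function resolveMonitor(options: TuiOptions): MonitorTarget {
  if (options.monitorBaseUrl) {
    // An explicit URL means an existing monitor is used as-is.
    return { baseUrl: options.monitorBaseUrl.replace(/\/+$/, ""), autoStart: false };
  }
  const port = options.monitorPort ?? ASSUMED_PORT;
  return { baseUrl: `http://127.0.0.1:${port}`, autoStart: true };
}

const explicitTarget = resolveMonitor({ monitorBaseUrl: "http://127.0.0.1:42427/" });
const localTarget = resolveMonitor({ monitorPort: 5000 });
console.log(explicitTarget, localTarget);
```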

Nanny (Separate Interaction)

Thread watcher examples:

bun run nanny -- --dry-run --once
bun run nanny -- --idle-seconds 240 --cooldown-seconds 900
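One plausible reading of --idle-seconds and --cooldown-seconds is sketched below: nudge only after the thread has been idle long enough, and never more often than the cooldown allows. The types and the shouldNudge function are hypothetical, and the exact semantics of the real flags are an assumption.

```typescript
// Hypothetical policy mirroring --idle-seconds and --cooldown-seconds.
interface NudgePolicy {
  idleSeconds: number;
  cooldownSeconds: number;
}

// Timestamps are epoch seconds; lastNudgeAt is null if no nudge was sent yet.
interface ThreadActivity {
  lastActivityAt: number;
  lastNudgeAt: number | null;
}

function shouldNudge(policy: NudgePolicy, activity: ThreadActivity, now: number): boolean {
  const idleFor = now - activity.lastActivityAt;
  if (idleFor < policy.idleSeconds) return false; // thread is still active
  if (activity.lastNudgeAt === null) return true; // first nudge for this thread
  return now - activity.lastNudgeAt >= policy.cooldownSeconds; // respect cooldown
}

const policy: NudgePolicy = { idleSeconds: 240, cooldownSeconds: 900 };
console.log(shouldNudge(policy, { lastActivityAt: 700, lastNudgeAt: null }, 1000));
```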

Tmux launcher examples:

codex-nanny .
codex-nanny --workdir . -- --model gpt-5.3-codex

codex-nanny starts a tmux session with two panes:

left pane: nanny monitor process
right pane: Codex interactive session

Local State And Artifacts

Per target repo:

<workdir>/.runner-state/
<workdir>/.runner-state/runs/<runId>/

Per-run artifacts:

manifest.json
state.json
events.jsonl (canonical timeline)
events-by-role/*.jsonl
turn-*.json
report.md / report.json / mermaid outputs
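To show how the canonical timeline relates to the per-role files, here is a minimal sketch that parses JSONL text and groups events by role. It assumes each event object carries a role field, which this README does not confirm; the helper names are hypothetical.

```typescript
// Assumption: every event has a string `role`; other fields are left opaque.
interface RoleEvent {
  role: string;
  [key: string]: unknown;
}

// Parse JSONL: one JSON object per non-empty line.
function parseJsonl(text: string): RoleEvent[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as RoleEvent);
}

// Group events by role, preserving timeline order within each group.
function groupByRole(events: RoleEvent[]): Map<string, RoleEvent[]> {
  const groups = new Map<string, RoleEvent[]>();
  for (const event of events) {
    const bucket = groups.get(event.role) ?? [];
    bucket.push(event);
    groups.set(event.role, bucket);
  }
  return groups;
}

const sampleJsonl = [
  '{"role":"engineer","kind":"turn_started"}',
  '{"role":"tester","kind":"turn_started"}',
  '{"role":"engineer","kind":"turn_completed"}',
  "",
].join("\n");

const byRole = groupByRole(parseJsonl(sampleJsonl));
console.log([...byRole.keys()]);
```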

Web monitor assets are built once under apps/web/dist (and staged to dist/monitor-web by bun run build:cli).

Regenerate report:

bun run report -- --artifacts .runner-state --svg

Validation

Typecheck:

bunx tsc -p tsconfig.json --noEmit

Tests:

bun run test:unit

Unit tests are deliberately limited to unit/component scope; they do not verify real runs end to end.

Unit test files follow *.unit.test.ts under tests/.

Manual end-to-end scenario stub:

docs/scenarios/real-run-e2e.scenario.stub.md

Real-run e2e harness (executes a real prompt, then asserts high-level scenario checks):

bun run test:e2e:real -- --workdir . --prompt "implement X and verify"

Run the real e2e harness against a compiled conductor binary artifact:

bun run test:e2e:real -- --workdir . --prompt "implement X and verify" --conductor-bin ./dist/codex-conductor

Optional expected output artifact check:

bun run test:e2e:real -- --workdir . --run-id <runId> --expected-output-path output/result.json

PTY-driven TUI e2e harness (real monitor API + real OpenTUI process in a pseudo-terminal):

bun run test:e2e:tui

Keep the seeded workdir for debugging:

bun run test:e2e:tui -- --keep-workdir

Hard real-generation validation:

scripts/validate-real-generations.sh /path/to/target/repo

CI notes:

CI / build-cli builds dist/codex-conductor + dist/codex-nanny + dist/codex-tui and uploads them as workflow artifacts.
CI / build-cli also runs bun run test:e2e:fresh-binary -- --dist-dir ./dist before artifact upload.
CI / e2e-tui-pty runs PTY-driven TUI e2e (bun run test:e2e:tui) and uploads .memory/tui-pty-e2e logs/artifacts.
CI / e2e-real-binary runs on every PR using --conductor-bin ./dist/codex-conductor with cost-tuned model candidates (codex-mini-latest first), low reasoning effort, and bounded retries; it publishes full command output and run artifacts.
CI / e2e-real-binary requires repository/org secret OPENAI_API_KEY so codex app-server can run in CI.
CodeQL must stay green (no new code scanning alerts on changed code).
Required checks for branch protection are listed in docs/required-checks.md.
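The "model candidates with bounded retries" approach mentioned for e2e-real-binary can be sketched as an ordered fallback. This is an assumption about how such a strategy might look, not the CI script itself: the function, its parameters, and the second model name are hypothetical.

```typescript
// Stand-in for one real e2e attempt; returns true on success.
type Attempt = (model: string) => boolean;

interface Outcome {
  model: string | null; // model that succeeded, or null if all failed
  attempts: number; // total attempts made across all candidates
}

// Try each candidate in order, with at most `maxAttemptsPerModel` tries each.
function runWithCandidates(
  candidates: string[],
  maxAttemptsPerModel: number,
  attempt: Attempt,
): Outcome {
  let attempts = 0;
  for (const model of candidates) {
    for (let i = 0; i < maxAttemptsPerModel; i++) {
      attempts += 1;
      if (attempt(model)) return { model, attempts };
    }
  }
  return { model: null, attempts };
}

// "fallback-model" is a hypothetical name used only for illustration.
const outcome = runWithCandidates(
  ["codex-mini-latest", "fallback-model"],
  2,
  (model) => model === "fallback-model",
);
console.log(outcome);
```

Putting the cheapest candidate first keeps the common case inexpensive, while the attempt cap bounds the worst-case cost of a failing run.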

CLI productization roadmap:

docs/cli-roadmap.md

Advanced Direct Runner

bun run runner -- specs/v1/example.md --workdir .

Screenshots

Conductor monitor: [screenshot: Codex Conductor monitor]

Nanny interaction: [screenshot: Codex Nanny interaction]

Thread Lifecycle CLI

bun run threads -- list --limit 20
bun run threads -- read --thread-id <threadId> --include-turns
bun run threads -- archive --thread-id <threadId>
bun run threads -- unarchive --thread-id <threadId>
bun run threads -- compact --thread-id <threadId>