ralph-research
v0.1.6
Local-first runtime for recursive research improvement over real artifacts.
ralph-research ships an actual CLI and stdio MCP server that run a bounded loop:
1. load a manifest
2. generate one candidate change
3. evaluate it with trusted signals
4. persist the run, decision, and frontier state
5. promote only verified improvements
```mermaid
flowchart LR
  M[Manifest<br/>ralph.yaml] --> P[Proposer]
  P -->|candidate change<br/>in worktree| E[Experiment]
  E -->|outputs| X[Metric extractor]
  X --> R{Ratchet}
  R -->|wins frontier| A[Accept → main]
  R -->|else| J[Reject]
  A -.->|persists| S[(.ralph/<br/>runs · decisions · frontier)]
  J -.->|persists| S
```

If your viewer does not render Mermaid: the diagram is just the five numbered steps above, with every transition writing to durable state under `.ralph/`. That's the bit that makes the loop resumable.
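To make the loop concrete, here is a minimal in-memory sketch of one cycle. It is not the runtime's code: `Candidate`, `Decision`, `Store`, `ratchet`, and `runCycle` are hypothetical names, the store is a plain object standing in for `.ralph/`, and "strictly higher score wins" is an assumed acceptance rule.

```typescript
// Hypothetical types; the real runtime's names and shapes may differ.
type Candidate = { id: string; score: number };
type Decision = {
  runId: string;
  candidateId: string;
  accepted: boolean;
  frontierScore: number;
};

interface Store {
  decisions: Decision[]; // stands in for persisted decisions under .ralph/
  frontier: Candidate | null; // currently accepted best candidate
}

// Assumed ratchet rule: accept only a strict improvement over the frontier.
function ratchet(frontier: Candidate | null, candidate: Candidate): boolean {
  return frontier === null || candidate.score > frontier.score;
}

// One bounded cycle: propose, evaluate, decide, persist.
function runCycle(
  store: Store,
  runId: string,
  propose: () => string,
  evaluate: (candidateId: string) => number,
): Decision {
  const candidateId = propose();
  const candidate: Candidate = { id: candidateId, score: evaluate(candidateId) };
  const accepted = ratchet(store.frontier, candidate);
  if (accepted) store.frontier = candidate;
  const decision: Decision = {
    runId,
    candidateId,
    accepted,
    frontierScore: (store.frontier ?? candidate).score,
  };
  store.decisions.push(decision); // rejected candidates are recorded too
  return decision;
}
```

The point of the sketch is that both branches write a decision, so an inspector can always explain why the frontier did or did not move.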
The current product bar is reliability, not breadth. The bundled success path is the writing template, while the runtime itself is manifest-driven and reusable for other local workflows.
Trust Signals
- Actual shipped surfaces: CLI binary `rrx` and stdio MCP server
- Development verification commands: `npm test`, `npm run typecheck`, `npm run build`
- Persisted runtime evidence: runs, decisions, frontier, and lock metadata
- Recovery semantics are enforced by code and persisted state, not described only in prompts
- Supported onboarding path is intentionally narrower than the full manifest surface
What It Is
- A Node/TypeScript runtime with a real CLI: `rrx`
- A stdio MCP server backed by the same service layer as the CLI
- A Git-aware candidate execution loop with persisted run, decision, and frontier state
- A local-first system designed to be resumed, inspected, and trusted after interruptions
What It Is Not
- Not a no-config autonomous agent for arbitrary domains out of the box
- Not a hosted service
- Not a prompt-only protocol with undocumented runtime behavior
- Not broader than the shipped contract: two bundled templates (`writing` and `code`) and three MCP tools
Quick Decision Guide
| If you want to... | Use |
| --- | --- |
| Check whether a repo is runnable | rrx validate then rrx doctor |
| Materialize the bundled example project | rrx init --template writing (or --template code) |
| Run a disposable end-to-end demo | rrx demo writing (or rrx demo code) |
| Launch the v1 goal-driven orchestrator | rrx "improve the holdout top-3 model" |
| Launch the v1 goal-driven orchestrator explicitly | rrx launch "improve the holdout top-3 model" |
| Resume a persisted TUI research session | rrx resume latest |
| Execute one cycle | rrx run --json |
| Resume the latest recoverable run | rrx run |
| Force a fresh run id | rrx run --fresh |
| Inspect runtime and recovery state | rrx status --json |
| Inspect why one run was accepted or rejected | rrx inspect <runId> --json |
| Review the current accepted frontier | rrx frontier --json |
| Serve the same contract over MCP stdio | rrx serve-mcp --stdio |
Five-Minute Start
Option A: disposable demo
```bash
npx ralph-research demo writing
```

This creates a temporary Git repo, runs one accepted cycle, and prints the temp path plus the run id.
Option B: initialize a local repo
```bash
npx ralph-research init --template writing
npx ralph-research doctor
npx ralph-research run --json
npx ralph-research status --json
npx ralph-research inspect run-0001 --json
```

This is the current truth contract for the bundled template: init -> run -> inspect should succeed quickly on a local machine.
rrx "goal" now creates or refreshes the launch draft session and drops into the v1 TUI shell. Initial launch does not start an autonomous research cycle until the shell tells it to continue.
When you submit the review step, the shell materializes a real research session and hands control to the selected agent runtime. Once that interactive run returns, launch-draft is removed. The remaining persisted session is the only runtime record you need to inspect or resume.
Resume semantics are intentionally narrow:
- `rrx resume <sessionId>` only works for sessions that ended after a completed cycle checkpoint
- interrupted sessions are resumable because the runtime has durable evidence for the next cycle boundary
- a clean agent exit without `goal_achieved` or a completed checkpoint is treated as terminal and is not resumable
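The three rules above can be read as a small classifier. The sketch below is an illustration only: `SessionEnd`, its field names, and `classifyResume` are hypothetical and are not taken from the shipped recovery classifier. The README does not say what happens after a clean exit with `goal_achieved` but no completed checkpoint, so that case is returned as `"unspecified"` rather than guessed.

```typescript
// Hypothetical summary of how a session ended; field names are illustrative.
type SessionEnd = {
  endedBy: "interrupt" | "clean_exit";
  completedCheckpoint: boolean; // a cycle checkpoint was completed
  goalAchieved: boolean; // the agent reported goal_achieved
};

type ResumeClass = "resumable" | "terminal" | "unspecified";

function classifyResume(end: SessionEnd): ResumeClass {
  // Interrupted sessions: durable evidence exists for the next cycle boundary.
  if (end.endedBy === "interrupt") return "resumable";
  // Clean exit after a completed cycle checkpoint can be resumed.
  if (end.completedCheckpoint) return "resumable";
  // Clean exit with neither goal_achieved nor a checkpoint is terminal.
  if (!end.goalAchieved) return "terminal";
  // goal_achieved without a checkpoint is not covered by the rules above.
  return "unspecified";
}
```

For example, `classifyResume({ endedBy: "clean_exit", completedCheckpoint: false, goalAchieved: false })` falls through to the terminal branch, which matches the third rule.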
Runtime Model
The runtime is manifest-driven. ralph.yaml defines the project, proposer, experiment, metrics, ratchet, and storage root. The service layer then:
- loads and validates the manifest
- acquires a durable lock
- classifies recovery against the latest persisted run
- executes or resumes a candidate
- writes run, decision, and frontier state under the storage root
See docs/operation-model.md for the full lifecycle and recovery model.
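As a rough illustration of the "loads and validates the manifest" step, the sketch below checks that a parsed manifest object has the six sections named above. The key names (in particular `storage` for the storage root) and the helper `missingSections` are assumptions for illustration; the authoritative schema is the one in the repository, and it will check far more than presence.

```typescript
// Section names follow the prose above; the exact keys are assumed.
const REQUIRED_SECTIONS = [
  "project",
  "proposer",
  "experiment",
  "metrics",
  "ratchet",
  "storage",
] as const;

// Returns the required sections that are absent (or null) in a parsed manifest.
function missingSections(manifest: Record<string, unknown>): string[] {
  return REQUIRED_SECTIONS.filter(
    (key) => manifest[key] === undefined || manifest[key] === null,
  );
}

// A manifest is treated as loadable here only when nothing is missing.
function isLoadable(manifest: Record<string, unknown>): boolean {
  return missingSections(manifest).length === 0;
}
```

Failing fast on a missing section before acquiring the lock keeps a malformed `ralph.yaml` from leaving partial state behind, which is the kind of ordering the step list suggests.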
Current Scope
- Bundled templates: `writing` (prose ratchet) and `code` (test-pass ratchet over a tiny calculator module)
- Default template metric: local command metric, no API key required
- Optional judge path: pairwise LLM judge packs
- MCP tools: `run_research_cycle`, `get_research_status`, `get_frontier`
The runtime supports broader manifests than the bundled template demonstrates, but the shipped onboarding path is intentionally narrow until those flows are equally reliable.
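For readers wiring a client by hand, this is roughly what a call to one of the three MCP tools looks like on the wire. It follows the generic MCP convention of a JSON-RPC 2.0 `tools/call` request written as one line of JSON on stdio. The tool arguments are not documented in this README, so the sketch sends an empty object as a placeholder, and it leaves out the `initialize` handshake a real client performs first. `buildToolCall` and `frame` are hypothetical helper names.

```typescript
// The three tool names come from the list above.
type ToolName = "run_research_cycle" | "get_research_status" | "get_frontier";

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: ToolName; arguments: Record<string, unknown> };
}

// Builds a tools/call request; `args` defaults to {} because the real
// argument shapes are not specified here.
function buildToolCall(
  id: number,
  name: ToolName,
  args: Record<string, unknown> = {},
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// stdio framing: one JSON message per line, no embedded newlines.
function frame(request: ToolCallRequest): string {
  return JSON.stringify(request) + "\n";
}
```

A client would write `frame(buildToolCall(1, "get_frontier"))` to the stdin of `rrx serve-mcp --stdio` and read the matching response line from its stdout.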
Bundled Templates
Writing template
Self-contained prose improvement loop:
- `docs/draft.md`: sample draft
- `scripts/propose.mjs`: bounded rewrite
- `scripts/experiment.mjs`: output materialization
- `scripts/metric.mjs`: local heuristic metric
- `prompts/judge.md`: pairwise judge prompt starter
templates/writing/ralph.yaml uses a local command metric by default, so the first run works without model credentials.
Code template
Self-contained test-pass ratchet over a tiny calculator module:
- `src/calculator.mjs`: deliberately broken `sum`/`multiply`
- `tests/calculator.test.mjs`: four assertions using the built-in `node:test` runner
- `scripts/propose.mjs`: writes the fixed calculator implementation
- `scripts/experiment.mjs`: runs `node --test --test-reporter=tap` and persists the pass/fail counts
- `scripts/metric.mjs`: emits the pass count as the `tests_passed` metric
rrx demo code materializes the template, runs one cycle, and shows the ratchet promoting the candidate from tests_passed: 0 to tests_passed: 4.
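The metric in this template boils down to counting passing tests in TAP output. The function below is a sketch of that idea, not the template's actual `scripts/metric.mjs`: `countTap` is a hypothetical name, and it only looks at top-level `ok` / `not ok` lines, ignoring indented subtest lines and `#` comments.

```typescript
// Counts top-level TAP results. Indented lines (subtests, YAML diagnostics)
// and "# ..." comment lines do not start with "ok" or "not ok", so they are skipped.
function countTap(tap: string): { passed: number; failed: number } {
  let passed = 0;
  let failed = 0;
  for (const line of tap.split(/\r?\n/)) {
    if (/^ok\b/.test(line)) passed += 1;
    else if (/^not ok\b/.test(line)) failed += 1;
  }
  return { passed, failed };
}
```

Under this sketch, the `passed` value is what would be reported as `tests_passed`. Note that a test marked with a `# SKIP` directive still begins with `ok`, so it would count as a pass here.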
Progressive Runs
rrx run executes one cycle by default and auto-resumes the latest recoverable run when one exists.
Progressive stop modes are opt-in:
- `--fresh`: start a new `runId` instead of auto-resuming the latest recoverable run
- `--until-target`: keep iterating until `manifest.stopping.target` is met
- `--until-no-improve N`: stop after `N` consecutive cycles without frontier improvement
- `--cycles N` with a progressive flag: treat `N` as a max-cycle cap instead of an exact count
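One way to read how these flags combine is as a single stop predicate evaluated after each cycle. This is a sketch under assumptions: `StopOptions`, `LoopState`, and `shouldStop` are hypothetical names, and stopping on whichever condition fires first when both progressive flags are set is an assumed behaviour, not one the README states.

```typescript
interface StopOptions {
  untilTarget?: boolean; // --until-target
  untilNoImprove?: number; // --until-no-improve N
  cycles?: number; // --cycles N
}

interface LoopState {
  cyclesRun: number; // cycles completed so far
  targetMet: boolean; // manifest.stopping.target is satisfied
  consecutiveNoImprove: number; // cycles in a row without frontier improvement
}

function shouldStop(opts: StopOptions, state: LoopState): boolean {
  const progressive = opts.untilTarget === true || opts.untilNoImprove !== undefined;
  if (!progressive) {
    // No progressive flag: run an exact count, one cycle by default.
    return state.cyclesRun >= (opts.cycles ?? 1);
  }
  if (opts.untilTarget === true && state.targetMet) return true;
  if (
    opts.untilNoImprove !== undefined &&
    state.consecutiveNoImprove >= opts.untilNoImprove
  ) {
    return true;
  }
  // With a progressive flag, --cycles is only an upper bound.
  return opts.cycles !== undefined && state.cyclesRun >= opts.cycles;
}
```

Treating `--cycles` as a cap in progressive mode is what keeps an `--until-target` run bounded when the target is never reached.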
The bundled writing template ships with stopping.target commented out, so enable that block in ralph.yaml before using --until-target.
```bash
npx ralph-research run --until-target --until-no-improve 3 --json
```

More Docs
- docs/quickstart.md: five-minute walkthrough from `npx ralph-research demo writing` to inspecting the persisted decision evidence
- docs/operation-model.md: lifecycle, persisted state, recovery classes
- docs/playbook.md: situation-to-command operator guide
- docs/examples.md: quickstart and manifest examples pulled from shipped templates and fixtures
- docs/examples-catalog.md: broader scenario catalog grounded in shipped templates and test fixtures
- docs/comparison.md: why this runtime is narrower and more stateful than prompt-only loop systems
- docs/faq.md: common runtime, recovery, and inspection questions
- docs/knowledge/INDEX.md: project knowledge log
CLI
```bash
rrx "improve the holdout top-3 model"
rrx launch "improve the holdout top-3 model"
rrx resume latest
rrx validate
rrx doctor
rrx init --template writing
rrx demo writing
rrx run
rrx run --fresh
rrx run --until-target
rrx run --until-no-improve 3
rrx run --until-target --until-no-improve 3
rrx status
rrx frontier
rrx inspect <runId>
rrx accept <runId>
rrx reject <runId>
rrx serve-mcp --stdio
```

Core Concepts
- Manifest: `ralph.yaml` defines the research program
- Metric: how candidate quality is measured
- Frontier: the currently accepted best candidate set
- Ratchet: the acceptance policy that decides whether the frontier advances
- Proposer: how a bounded candidate change is generated
- Judge: how qualitative outputs are compared when numeric metrics are not enough
Development
```bash
npm install
npm test
npm run typecheck
npm run build
```

Support the Project
If ralph-research saves you from wiring up your own write-evaluate-accept loop:
- Star the repo on GitHub. It is the single clearest signal that the runtime is worth maintaining and helps surface it to other people who need the same shape of tool.
- File issues with concrete reproductions. The issue templates ask for the version, OS, and exact commands so they convert quickly into fixes.
- Open a PR for the gaps you actually hit. `CONTRIBUTING.md` covers the local loop; the bar is a Vitest regression that fails against the previous code.
- If you want to talk shape and direction rather than file an issue, the manifest schema (`src/core/manifest/schema.ts`) and the recovery classifier (`src/core/state/research-session-recovery-classifier.ts`) are the two surfaces I most want feedback on.
License
MIT
