@mittalsuraj18/opencode-auto-research
v0.2.0
Autoresearch plugin for OpenCode - automated benchmark-driven optimization loop
An OpenCode plugin that implements an automated benchmark-driven optimization loop — a lightweight, OpenCode-native take on the autoresearch pattern popularized by Andrej Karpathy.
What is Autoresearch?
The autoresearch pattern — introduced by karpathy/autoresearch (83K+ stars) — is a simple but powerful idea: give an AI agent a measurable goal and let it experiment autonomously. The agent modifies code, runs a benchmark, checks if the result improved, keeps or discards the change, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better codebase.
This plugin brings that pattern directly into OpenCode as a first-class plugin — no external scripts, no manual orchestration. The loop runs inside your existing OpenCode session with built-in git isolation, auto-compaction, and scope enforcement.
How it differs from other tools
| Tool | Approach | Key Difference |
|------|----------|----------------|
| This plugin | OpenCode-native plugin | Runs inside OpenCode session; auto-compaction; git branch isolation; scope enforcement |
| karpathy/autoresearch | Standalone script + program.md | Single-file optimization (ML training); agent reads markdown instructions |
| ratchet | CLI orchestrator | Generates agent prompts; handles git/benchmark externally; multi-armed bandit strategy selection |
| auto-optimize | Claude Code skill | Structured reasoning pipeline (Opus planner); noise-floor validation; disassembly analysis |
| darwin-derby | CLI orchestrator | Swarm mode; git-push-based proposals; evaluation hidden from agents |
| Artificial General Research | Claude Code skill | Fresh context per iteration; Metric+Guard+Rework pattern; stuck detection protocol |
| VeRO | Python evaluation harness | Optimizes LLM-based agent code; subprocess-isolated evaluations |
| Maleick/AutoResearch | OpenCode + Hermes plugin | Subagent-first; multi-runtime (OpenCode + Hermes); recursive self-improvement; 15+ slash commands |
Overview
This plugin enables OpenCode agents to systematically optimize code performance through:
- Benchmark harness (autoresearch.sh): measures target metrics
- Experiment loop: modify code, run benchmark, evaluate, keep or discard
- Auto-compaction: context is compacted after every iteration to prevent overflow
- Git integration: commits on keep, resets on discard, dedicated branches
- Scope enforcement: restrict which files the agent can and cannot modify
- Confidence scoring: MAD-based statistical confidence in improvements
Installation
From npm (recommended)
npm install @mittalsuraj18/opencode-auto-research
Add to your opencode.json:
{
  "plugin": ["@mittalsuraj18/opencode-auto-research"]
}
Restart OpenCode. The /autoresearch command and all four tools are now available.
From local files
Place the built plugin in .opencode/plugins/ or ~/.config/opencode/plugins/. See OpenCode plugin docs for details.
Quick Start
Option 1: The /autoresearch command (easiest)
Just type in OpenCode:
/autoresearch optimize compile time
What happens
- If an experiment is already active → resumes it with the new goal
- If no experiment is active → starts a new one:
  - Creates autoresearch.sh if missing
  - Calls init_experiment with an appropriate benchmark name and metric
  - Runs the baseline with run_experiment
  - Logs the baseline with log_experiment keep
  - The auto-iteration loop begins
Resume behavior
/autoresearch
Without a goal, it continues the existing experiment. With a goal, it updates the experiment's goal and continues.
No manual command configuration in opencode.json is needed: the command is registered automatically and shows up in autocomplete.
Option 2: Direct tool usage
1. Create autoresearch.sh in your project root that prints metrics:
   #!/bin/bash
   # Run your benchmark here, then print the results...
   echo "METRIC compile_time_ms=1200"
   echo "METRIC bundle_size_bytes=45000"
   echo "ASI hypothesis=reduced_loop_iterations"
   echo "ASI next_action_hint=try_unrolling_factor_4"
2. Call init_experiment with your benchmark name and metric
3. Call run_experiment to run the baseline
4. Call log_experiment with status: keep to establish the baseline
5. The agent will auto-iterate, optimizing the metric
Tools
init_experiment
Initialize a new autoresearch experiment session.
| Parameter | Type | Description |
|-----------|------|-------------|
| name | string | Name of the experiment |
| goal | string? | What to optimize |
| primary_metric | string | Main metric to track |
| metric_unit | string? | Unit (ms, bytes, etc.) |
| direction | "lower" \| "higher" | Which direction is better |
| scope_paths | string[]? | Files agent may modify |
| off_limits | string[]? | Files agent must NOT modify |
| max_iterations | number? | Max experiments per segment |
| new_segment | boolean? | Start fresh segment |
Auto-behaviors:
- Creates autoresearch.md if missing
- Creates autoresearch/* branch if not on one
- Auto-commits harness as baseline on autoresearch branch
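The parameter table above can be read as the following TypeScript shape. This is a sketch inferred from the table, not the plugin's actual type definitions: the interface name InitExperimentParams and all example values are assumptions.

```typescript
// Hypothetical shape inferred from the parameter table; the plugin's real
// types live in src/types.ts and may differ.
interface InitExperimentParams {
  name: string;                  // Name of the experiment
  goal?: string;                 // What to optimize
  primary_metric: string;        // Main metric to track
  metric_unit?: string;          // Unit (ms, bytes, etc.)
  direction: "lower" | "higher"; // Which direction is better
  scope_paths?: string[];        // Files the agent may modify
  off_limits?: string[];         // Files the agent must NOT modify
  max_iterations?: number;       // Max experiments per segment
  new_segment?: boolean;         // Start a fresh segment
}

// Example arguments for a compile-time experiment (values are illustrative).
const exampleInitParams: InitExperimentParams = {
  name: "compile-time",
  goal: "optimize compile time",
  primary_metric: "compile_time_ms",
  metric_unit: "ms",
  direction: "lower",
  scope_paths: ["src/"],
  off_limits: ["package-lock.json"],
  max_iterations: 20,
};
```

Only name, primary_metric, and direction are required according to the table; everything marked with ? can be omitted.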
run_experiment
Run the benchmark harness (bash autoresearch.sh).
| Parameter | Type | Description |
|-----------|------|-------------|
| timeout_seconds | number? | Max runtime (default: 600) |
Output:
- Parsed metrics from METRIC name=value lines
- ASI (Agent State Info) from ASI key=value lines
- Truncated raw output (4KB / 10 lines max)
- Full log saved to ~/.opencode-autoresearch/<project>/runs/<id>/benchmark.log
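To illustrate the "4KB / 10 lines max" limit, the sketch below truncates raw benchmark output. It assumes the limit keeps the tail of the output and counts characters rather than bytes; the plugin may do either differently, and truncateOutput is a hypothetical name.

```typescript
// Hypothetical helper: keep at most `maxLines` trailing lines and `maxChars`
// trailing characters of benchmark output. Keeping the tail (rather than the
// head) and counting characters (rather than bytes) are assumptions.
function truncateOutput(raw: string, maxLines: number = 10, maxChars: number = 4096): string {
  const lines = raw.split("\n");
  let text = lines.slice(Math.max(0, lines.length - maxLines)).join("\n");
  if (text.length > maxChars) {
    text = text.slice(text.length - maxChars);
  }
  return text;
}
```

Keeping the tail is a common choice for build and benchmark logs, since errors and final metrics usually appear last; the full log is still available on disk at the path listed above.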
log_experiment
Log the result and update experiment state.
| Parameter | Type | Description |
|-----------|------|-------------|
| metric | number | Primary metric value |
| status | "keep" \| "discard" \| "crash" \| "checks_failed" | Run outcome (determines whether changes are kept) |
| description | string | What this run tested |
| metrics | Record<string, number>? | Additional metrics |
| asi | Record<string, unknown>? | Agent state info |
| justification | string? | Why this status |
Auto-behaviors:
- keep on autoresearch branch → commits changes
- discard/crash on autoresearch branch → git reset --hard HEAD + git clean -fd
- discard/crash on other branch → only reverts run-modified files
- Detects scope deviations (modified files outside scope_paths or in off_limits)
- Updates autoresearch.md
- Checks max_iterations; disables mode if reached
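The branch-dependent git behavior in the list above can be sketched as a small planner that returns the commands to run. This is an illustration only: planGitActions is a hypothetical name, the exact commit and per-file revert commands are assumptions, and treating checks_failed like discard/crash is also an assumption, since the list does not mention it.

```typescript
type RunStatus = "keep" | "discard" | "crash" | "checks_failed";

// Hypothetical planner that maps a logged status and the current branch to
// git commands. It only returns command strings; it does not run git.
function planGitActions(runStatus: RunStatus, branch: string, runModifiedFiles: string[]): string[] {
  const onAutoresearchBranch = branch.indexOf("autoresearch/") === 0;
  if (runStatus === "keep") {
    // The README only describes committing on an autoresearch branch;
    // doing nothing elsewhere is an assumption.
    return onAutoresearchBranch ? ["git add -A", "git commit -m <formatted message>"] : [];
  }
  if (onAutoresearchBranch) {
    // Dedicated branch: safe to throw away the whole worktree state.
    return ["git reset --hard HEAD", "git clean -fd"];
  }
  // Any other branch: revert only the files the run touched (assumed command).
  return runModifiedFiles.map((file) => "git checkout -- " + file);
}
```

The asymmetry is the point: a hard reset plus clean is only safe on a branch the plugin owns, so on other branches the revert is limited to files the run itself modified.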
update_notes
Update experiment notes.
| Parameter | Type | Description |
|-----------|------|-------------|
| body | string? | Replace entire notes |
| append_idea | string? | Append bullet to ideas |
Benchmark Harness Format
Your autoresearch.sh must print metrics in this format:
#!/bin/bash
# Run your benchmark here, then print the results...
echo "METRIC compile_time_ms=1200"
echo "METRIC bundle_size_bytes=45000"
echo "ASI hypothesis=reduced_loop_iterations"
echo "ASI next_action_hint=try_unrolling_factor_4"
- METRIC <name>=<value>: one per line, numeric values only
- ASI <key>=<value>: optional, any string value (hypothesis, next_action_hint, rollback_reason, etc.)
- Exit code 0 = success, non-zero = failure
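A minimal sketch of how the line protocol described above could be parsed. It is not the plugin's parser (that lives in src/tools/run-experiment.ts): parseHarnessOutput and ParsedHarnessOutput are hypothetical names, and skipping non-numeric METRIC values instead of failing is an assumption.

```typescript
interface ParsedHarnessOutput {
  metrics: Record<string, number>;
  asi: Record<string, string>;
}

// Hypothetical parser: lines that match neither prefix are ignored, and
// METRIC lines whose value is not a finite number are skipped.
function parseHarnessOutput(output: string): ParsedHarnessOutput {
  const parsed: ParsedHarnessOutput = { metrics: {}, asi: {} };
  output.split(/\r?\n/).forEach((line) => {
    const match = /^(METRIC|ASI)\s+([^=\s]+)=(.*)$/.exec(line.trim());
    if (match === null) return;
    const kind = match[1];
    const key = match[2];
    const rawValue = match[3].trim();
    if (kind === "METRIC") {
      const value = Number(rawValue);
      if (rawValue !== "" && isFinite(value)) parsed.metrics[key] = value;
    } else {
      parsed.asi[key] = rawValue; // ASI values are kept as plain strings
    }
  });
  return parsed;
}
```

Because unrelated lines are ignored, the harness can print ordinary build or test output alongside the METRIC and ASI lines.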
How the Loop Works
┌─────────────────────────────────────────────────────────┐
│ AUTORESEARCH LOOP │
│ │
│ 1. init_experiment → create branch, set goal & metric │
│ 2. run_experiment → execute autoresearch.sh │
│ 3. Agent analyzes results │
│ 4. log_experiment (keep/discard/crash) │
│ ├─ keep → commit changes, update best │
│ ├─ discard → git reset --hard HEAD + clean │
│ └─ crash → git reset --hard HEAD + clean │
│ 5. Auto-compact session context │
│ 6. Continue from step 2 │
└─────────────────────────────────────────────────────────┘
Each iteration is fully automated. The agent modifies code within the configured scope, runs the benchmark, evaluates the result, and decides whether to keep or discard. After logging, the session is compacted to prevent context overflow, and the loop continues until max_iterations is reached or the user interrupts.
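The keep/discard bookkeeping of the loop can be sketched as a pure simulation. In the plugin the agent makes the keep/discard call and git plus compaction run on each step; here a strict "better than the best so far" comparison stands in for that decision, and simulateLoop, isImprovement, and IterationRecord are hypothetical names.

```typescript
type Direction = "lower" | "higher";

interface IterationRecord {
  iteration: number;
  metric: number;
  decision: "keep" | "discard";
}

// True when `candidate` beats `best` in the configured direction.
function isImprovement(direction: Direction, best: number, candidate: number): boolean {
  return direction === "lower" ? candidate < best : candidate > best;
}

// Hypothetical, simplified loop: `measurements` stands in for successive
// run_experiment results (the first one is the baseline). Git operations and
// compaction are omitted; only the keep/discard bookkeeping is shown.
function simulateLoop(direction: Direction, measurements: number[], maxIterations: number): IterationRecord[] {
  const history: IterationRecord[] = [];
  if (measurements.length === 0) return history;
  let best = measurements[0]; // the baseline is always logged as keep
  history.push({ iteration: 0, metric: best, decision: "keep" });
  for (let i = 1; i < measurements.length && i <= maxIterations; i++) {
    const metric = measurements[i];
    const keep = isImprovement(direction, best, metric);
    if (keep) best = metric;
    history.push({ iteration: i, metric: metric, decision: keep ? "keep" : "discard" });
  }
  return history;
}
```

For example, with direction "lower", measurements [1200, 1250, 1100, 1100, 900] and maxIterations 3, the sketch keeps the baseline, discards 1250, keeps 1100, discards the repeated 1100, and stops before the fifth measurement.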
Git Workflow
- init_experiment creates branch: autoresearch/<goal>-<YYYYMMDD>
- run_experiment runs benchmark, records modified files
- log_experiment keep commits changes with formatted message
- log_experiment discard resets worktree to HEAD
- At any point: git log shows the experiment history
This mirrors the git workflow from karpathy/autoresearch's program.md, but is handled automatically by the plugin rather than requiring the agent to manually manage git commands.
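As an illustration of the autoresearch/<goal>-<YYYYMMDD> branch format, the sketch below builds such a name. How the plugin turns a goal into a branch-safe string is not documented here, so the slug rules (lowercase, non-alphanumerics collapsed to "-") and the use of UTC dates are assumptions, and buildBranchName is a hypothetical name.

```typescript
// Hypothetical helper that builds a branch name of the form
// autoresearch/<goal>-<YYYYMMDD>. The slug rules and UTC date are assumed.
function buildBranchName(goal: string, date: Date): string {
  const slug = goal
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric
    .replace(/^-+|-+$/g, "");    // trim leading and trailing dashes
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  const stamp = String(date.getUTCFullYear()) + pad(date.getUTCMonth() + 1) + pad(date.getUTCDate());
  return "autoresearch/" + slug + "-" + stamp;
}
```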
Auto-Compaction
After every log_experiment, the plugin automatically triggers a session compaction via client.summarize(). This:
- Summarizes the conversation history
- Preserves experiment context (goal, baseline, best result)
- Injects a synthetic "continue" message via experimental.compaction.autocontinue
- Keeps the agent loop running indefinitely without context overflow
Without auto-compaction, each iteration adds context until the model's context window fills up and the loop degrades. This plugin solves that by compacting after every iteration while preserving the essential experiment state through experimental.session.compacting hooks.
Scope Enforcement
The plugin tracks which files the agent modifies during each experiment:
- scope_paths: restrict modifications to only these paths
- off_limits: explicitly forbid modifications to these paths
- Deviation detection: log_experiment flags any modifications outside the declared scope
This prevents the agent from accidentally modifying critical files (e.g., test fixtures, config files, lock files) during its optimization loop.
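A minimal sketch of deviation detection under stated assumptions: paths are matched by plain directory prefix (the plugin may use globs or another scheme), an empty scope_paths means no restriction, and findScopeDeviations, pathMatches, and ScopeConfig are hypothetical names.

```typescript
interface ScopeConfig {
  scope_paths?: string[];
  off_limits?: string[];
}

// True when `file` equals `entry` or lies under it as a directory.
// Plain prefix matching is an assumption; the plugin may use globs instead.
function pathMatches(file: string, entry: string): boolean {
  const base = entry.replace(/\/+$/, "");
  return file === base || file.indexOf(base + "/") === 0;
}

// Hypothetical deviation check: returns modified files that are in off_limits,
// or outside scope_paths when scope_paths is set.
function findScopeDeviations(modified: string[], scope: ScopeConfig): string[] {
  const allowed = scope.scope_paths || [];
  const forbidden = scope.off_limits || [];
  return modified.filter((file) => {
    const isForbidden = forbidden.some((entry) => pathMatches(file, entry));
    const isOutside = allowed.length > 0 && !allowed.some((entry) => pathMatches(file, entry));
    return isForbidden || isOutside;
  });
}
```

In this sketch off_limits takes precedence over scope_paths, so a forbidden subdirectory inside an allowed directory is still reported.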
Confidence Scoring
The plugin uses a Median Absolute Deviation (MAD)-based confidence score to evaluate whether an improvement is statistically meaningful:
- Low confidence → the improvement may be within measurement noise
- High confidence → the improvement is likely real
This helps the agent make informed decisions about whether to keep aggressive changes or revert to the baseline.
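The README does not give the exact formula, so the following is only one plausible reading of a MAD-based score: the size of the improvement divided by the Median Absolute Deviation of earlier measurements, used as a noise estimate. The function names, the minimum of three samples, and the ratio itself are assumptions.

```typescript
// Median of a non-empty list of numbers.
function median(values: number[]): number {
  const sorted = values.slice().sort((x, y) => x - y);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Median Absolute Deviation: median of |x - median(values)|.
function medianAbsoluteDeviation(values: number[]): number {
  const center = median(values);
  return median(values.map((x) => Math.abs(x - center)));
}

// Hypothetical confidence ratio: size of the improvement over the baseline
// divided by the MAD of earlier measurements (a proxy for noise). Returns null
// when there are too few samples or the MAD is zero. The formula and the
// minimum of 3 samples are assumptions, not the plugin's documented behavior.
function confidenceRatio(baseline: number, candidate: number, noiseSamples: number[]): number | null {
  if (noiseSamples.length < 3) return null;
  const mad = medianAbsoluteDeviation(noiseSamples);
  if (mad === 0) return null;
  return Math.abs(baseline - candidate) / mad;
}
```

Under this reading, a ratio well above 1 suggests the change exceeds typical run-to-run variation, while a ratio near or below 1 is within noise. MAD is used rather than standard deviation because a single outlier run has far less influence on it.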
Storage
- SQLite: ~/.opencode-autoresearch/<encoded-project-path>.db
  - sessions table: experiment configuration
  - runs table: benchmark results
- Logs: ~/.opencode-autoresearch/<project>/runs/<id>/benchmark.log
- Markdown: ./autoresearch.md in project root
Configuration
No configuration required. The plugin auto-detects:
- Current git branch
- Project directory
- Available models (for compaction)
Source Layout
| File / Dir | Role |
|------------|------|
| src/index.ts | Plugin entry point. Registers 4 tools, the /autoresearch command, system-prompt injection, and compaction hooks. |
| src/types.ts | Central type definitions (ExperimentState, AutoresearchRuntime, etc.). |
| src/state.ts | Runtime state helpers (createRuntimeStore, buildExperimentState). |
| src/storage.ts | SQLite persistence layer. |
| src/git.ts | Git branch detection, commit/reset helpers. |
| src/helpers.ts | Shared formatting and parsing utilities. |
| src/tools/init-experiment.ts | init_experiment tool — creates branch, session, baseline. |
| src/tools/run-experiment.ts | run_experiment tool — executes autoresearch.sh, parses METRIC/ASI lines. |
| src/tools/log-experiment.ts | log_experiment tool — commits on keep, resets on discard/crash, updates autoresearch.md. |
| src/tools/update-notes.ts | update_notes tool — appends to experiment notes. |
| src/prompts/system.md | Template for injected system prompt when autoresearch mode is active. |
| src/prompts/setup.md | Template used during experiment setup. |
Features
- Automated experiment loop with keep/discard decisions
- Git branch isolation (autoresearch/*): experiment safely without touching main
- Auto-commit on keep / auto-reset on discard: clean state after every iteration
- Scope deviation detection: restrict what the agent can modify
- Confidence scoring (MAD-based): distinguish real improvements from noise
- Max iteration enforcement: prevents runaway loops
- Auto-compaction after every iteration: unlimited iterations without context overflow
- Secondary metric tracking: monitor additional metrics alongside the primary one
- ASI (Agent State Info) logging: pass hypotheses and hints between iterations
- Persistent experiment notes: update_notes persists across compactions
- autoresearch.md auto-generation and updates: experiment log in your repo
- /autoresearch slash command: start or resume experiments with one command
- OpenCode-native: no external scripts, runs inside your existing session
Comparison with the Original Autoresearch Pattern
Karpathy's autoresearch introduced a simple loop: modify code → run benchmark → keep or revert → repeat. The agent reads a program.md file with instructions and manages git manually. This plugin builds on that foundation with several OpenCode-native improvements:
| Aspect | karpathy/autoresearch | This Plugin |
|--------|----------------------|-------------|
| Runtime | Standalone (any agent) | OpenCode plugin |
| Git management | Manual by agent | Automatic (plugin handles commit/reset) |
| Context management | Agent-dependent (often degrades) | Auto-compaction with state preservation |
| Metric parsing | Agent reads raw output | Structured METRIC/ASI protocol |
| Scope control | Single-file convention | Explicit scope_paths / off_limits |
| Confidence | None (manual threshold) | MAD-based statistical confidence |
| Session persistence | Git log only | SQLite + git + markdown |
| Branch isolation | Manual by agent | Automatic (autoresearch/* branch) |
| Command interface | Prompt-based | /autoresearch slash command + 4 tools |
Limitations
- No TUI dashboard widget (OpenCode server plugins cannot render UI)
- No synthetic auto-resume messages (mitigated by strong system prompt + compaction auto-continue)
- No custom tool renderers (standard OpenCode tool output)
- No multi-agent swarm mode (see darwin-derby for swarm experiments)
- No multi-armed bandit strategy selection (see ratchet for bandit-based approaches)
- No built-in noise-floor validation (see auto-optimize for variance checks)
License
MIT
