harness-evolver
v3.2.1
LangSmith-native autonomous agent optimization for Claude Code
Harness Evolver
LangSmith-native autonomous agent optimization. Point it at any LLM agent codebase and Harness Evolver will evolve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.
Inspired by Meta-Harness (Lee et al., 2026): the scaffolding around an LLM can account for a 6x performance gap on the same benchmark. This plugin automates the search for better scaffolding.
Install
Claude Code Plugin (recommended)
```
/plugin marketplace add raphaelchristi/harness-evolver-marketplace
/plugin install harness-evolver
```

Updates are automatic. Python dependencies (`langsmith`, `langsmith-cli`) are installed on first session start via a hook.
npx (first-time setup or non-Claude Code runtimes)
```
npx harness-evolver@latest
```

An interactive installer that configures the LangSmith API key, creates a Python venv, and installs all dependencies. Works with Claude Code, Cursor, Codex, and Windsurf.
Both install paths work together. Use npx for initial setup (API key, venv), then the plugin marketplace handles updates automatically.
Quick Start
```
cd my-llm-project
export LANGSMITH_API_KEY="lsv2_pt_..."
claude

/evolver:setup   # explores project, configures LangSmith
/evolver:evolve  # runs the optimization loop
/evolver:status  # check progress
/evolver:deploy  # tag, push, finalize
```

How It Works
Commands
| Command | What it does |
|---|---|
| /evolver:setup | Explore project, configure LangSmith (dataset, evaluators), run baseline |
| /evolver:evolve | Run the optimization loop (5 parallel proposers in worktrees) |
| /evolver:status | Show progress, scores, history |
| /evolver:deploy | Tag, push, clean up temporary files |
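The heart of /evolver:evolve is experiment comparison: pick an overall winner plus a champion per task. A minimal sketch of that selection logic; the `{candidate: {task_id: score}}` format is a simplifying assumption for illustration, not the plugin's actual data model:

```python
def select_winner(experiments: dict[str, dict[str, float]]) -> tuple[str, dict[str, str]]:
    """Pick the candidate with the best mean score, plus per-task champions.

    experiments maps candidate name -> {task_id: score}. (Hypothetical format.)
    """
    # Overall winner: highest mean score across all tasks.
    winner = max(
        experiments,
        key=lambda c: sum(experiments[c].values()) / len(experiments[c]),
    )
    # Per-task champion: best candidate for each individual task.
    tasks = next(iter(experiments.values())).keys()
    champions = {t: max(experiments, key=lambda c: experiments[c][t]) for t in tasks}
    return winner, champions
```

Tracking per-task champions alongside the overall winner lets later iterations borrow task-specific ideas from candidates that lost on average.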
Agents
| Agent | Role | Color |
|---|---|---|
| Proposer | Modifies agent code in isolated worktrees based on trace analysis | Green |
| Evaluator | LLM-as-judge: reads outputs via langsmith-cli, scores correctness | Yellow |
| Architect | Recommends multi-agent topology changes | Blue |
| Critic | Validates evaluator quality, detects gaming | Red |
| TestGen | Generates test inputs for LangSmith datasets | Cyan |
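Besides the LLM-as-judge Evaluator agent, the loop also runs code-based evaluators through client.evaluate(). A minimal sketch of one such evaluator, using plain dicts in place of the SDK's Run and Example objects:

```python
def exact_match(run: dict, example: dict) -> dict:
    """Code-based evaluator sketch: score 1.0 when the candidate's
    answer matches the reference answer stored in the dataset example.
    (Dict shapes here stand in for LangSmith's Run/Example objects.)"""
    predicted = (run.get("outputs") or {}).get("answer")
    expected = (example.get("outputs") or {}).get("answer")
    return {"key": "exact_match", "score": float(predicted == expected)}
```

With the real SDK, a function with this `(run, example) -> {"key": ..., "score": ...}` shape can be passed to `client.evaluate()` as an evaluator.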
Evolution Loop
```
/evolver:evolve
|
+- 1.   Read state (.evolver.json + LangSmith experiments)
+- 1.5  Gather trace insights (cluster errors, tokens, latency)
+- 1.8  Analyze per-task failures (adaptive briefings)
+- 2.   Spawn 5 proposers in parallel (each in a git worktree)
+- 3.   Run target for each candidate (client.evaluate() -> code-based evaluators)
+- 3.5  Spawn evaluator agent (reads outputs via langsmith-cli, judges, writes scores)
+- 4.   Compare experiments -> select winner + per-task champion
+- 5.   Merge winning worktree into main branch
+- 5.5  Test suite growth (add regression examples to dataset)
+- 6.   Report results
+- 6.5  Auto-trigger Critic (if score jumped >0.3)
+- 7.   Auto-trigger Architect (if stagnation or regression)
+- 8.   Check stop conditions
```

Architecture
```
Plugin hook (SessionStart)
└→ Creates venv, installs langsmith + langsmith-cli, exports env vars

Skills (markdown)
├── /evolver:setup  → explores project, runs setup.py
├── /evolver:evolve → orchestrates the evolution loop
├── /evolver:status → reads .evolver.json + LangSmith
└── /evolver:deploy → tags and pushes

Agents (markdown)
├── Proposer (x5) → modifies code in git worktrees
├── Evaluator     → LLM-as-judge via langsmith-cli
├── Critic        → detects evaluator gaming
├── Architect     → recommends topology changes
└── TestGen       → generates test inputs

Tools (Python + langsmith SDK)
├── setup.py            → creates datasets, configures evaluators
├── run_eval.py         → runs target against dataset
├── read_results.py     → compares experiments
├── trace_insights.py   → clusters errors from traces
└── seed_from_traces.py → imports production traces
```

Requirements
- LangSmith account + `LANGSMITH_API_KEY`
- Python 3.10+
- Git (for worktree-based isolation)
- Claude Code (or Cursor/Codex/Windsurf)

Dependencies (`langsmith`, `langsmith-cli`) are installed automatically by the plugin hook or the npx installer.
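The Git requirement comes from worktree-based isolation: each proposer edits code in its own worktree on its own branch, so five candidates can evolve in parallel without touching the main checkout. A sketch of how such isolation can be set up; the branch names and directory layout are hypothetical, not the plugin's actual scheme:

```python
import os
import subprocess
import tempfile

def make_worktrees(repo: str, n: int = 5) -> list[str]:
    """Create n disposable git worktrees, one per proposer (sketch).

    Each worktree gets its own candidate branch, so proposers can
    modify files in parallel; the winner's branch is later merged back.
    """
    paths = []
    for i in range(n):
        # Hypothetical layout: one temp dir per proposer.
        path = os.path.join(tempfile.mkdtemp(), f"proposer-{i}")
        subprocess.run(
            ["git", "-C", repo, "worktree", "add", "-b", f"candidate-{i}", path],
            check=True, capture_output=True,
        )
        paths.append(path)
    return paths
```

After evaluation, losing worktrees can be discarded with `git worktree remove`, leaving the repository clean.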
Framework Support
LangSmith traces any AI framework. The evolver works with all of them:
| Framework | LangSmith Tracing |
|---|---|
| LangChain / LangGraph | Auto (env vars only) |
| OpenAI SDK | wrap_openai() (2 lines) |
| Anthropic SDK | wrap_anthropic() (2 lines) |
| CrewAI / AutoGen | OpenTelemetry (~10 lines) |
| Any Python code | @traceable decorator |
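The last row refers to langsmith's `@traceable` decorator, which records a function's inputs and outputs as a run. A toy stand-in (not the real langsmith implementation) that illustrates the mechanism:

```python
import functools

TRACES = []  # stand-in for the LangSmith backend

def traceable(fn):
    """Toy version of langsmith's @traceable: log each call's
    name, inputs, and outputs to a list instead of a tracing server."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        TRACES.append({
            "name": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "outputs": result,
        })
        return result
    return wrapper

@traceable
def answer(question: str) -> str:
    return question.upper()  # placeholder for a real LLM call

answer("hello")
```

With the real SDK you would `from langsmith import traceable`, and runs would land in your LangSmith project rather than a local list.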
References
- Meta-Harness: End-to-End Optimization of Model Harnesses — Lee et al., 2026
- Darwin Gödel Machine — Sakana AI
- AlphaEvolve — DeepMind
- LangSmith Evaluation — LangChain
- Traces Start the Agent Improvement Loop — LangChain
License
MIT
