harness-evolver
v3.2.1
LangSmith-native autonomous agent optimization for Claude Code
Harness Evolver
LangSmith-native autonomous agent optimization. Point it at any LLM agent codebase and Harness Evolver will evolve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.
Inspired by Meta-Harness (Lee et al., 2026): the scaffolding around an LLM can account for a 6x performance gap on the same benchmark. This plugin automates the search for better scaffolding.
Install
Claude Code Plugin (recommended)
```
/plugin marketplace add raphaelchristi/harness-evolver-marketplace
/plugin install harness-evolver
```

Updates are automatic. Python dependencies (`langsmith`, `langsmith-cli`) are installed on first session start via a hook.
npx (first-time setup or non-Claude Code runtimes)
```
npx harness-evolver@latest
```

An interactive installer that configures the LangSmith API key, creates a Python venv, and installs all dependencies. Works with Claude Code, Cursor, Codex, and Windsurf.
Both install paths work together. Use npx for initial setup (API key, venv), then the plugin marketplace handles updates automatically.
Quick Start
```
cd my-llm-project
export LANGSMITH_API_KEY="lsv2_pt_..."
claude

/evolver:setup   # explores project, configures LangSmith
/evolver:evolve  # runs the optimization loop
/evolver:status  # check progress
/evolver:deploy  # tag, push, finalize
```

How It Works
Commands
| Command | What it does |
|---|---|
| /evolver:setup | Explore project, configure LangSmith (dataset, evaluators), run baseline |
| /evolver:evolve | Run the optimization loop (5 parallel proposers in worktrees) |
| /evolver:status | Show progress, scores, history |
| /evolver:deploy | Tag, push, clean up temporary files |
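The heart of /evolver:evolve is experiment comparison: pick an overall winner plus a champion per task. A minimal sketch of that selection logic; the `{candidate: {task_id: score}}` format is a simplifying assumption for illustration, not the plugin's actual data model:

```python
def select_winner(experiments: dict[str, dict[str, float]]) -> tuple[str, dict[str, str]]:
    """Pick the candidate with the best mean score, plus per-task champions.

    experiments maps candidate name -> {task_id: score}. (Hypothetical format.)
    """
    # Overall winner: highest mean score across all tasks.
    winner = max(
        experiments,
        key=lambda c: sum(experiments[c].values()) / len(experiments[c]),
    )
    # Per-task champion: best candidate for each individual task.
    tasks = next(iter(experiments.values())).keys()
    champions = {t: max(experiments, key=lambda c: experiments[c][t]) for t in tasks}
    return winner, champions
```

Tracking per-task champions alongside the overall winner lets later iterations borrow task-specific ideas from candidates that lost on average.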
Agents
| Agent | Role | Color |
|---|---|---|
| Proposer | Modifies agent code in isolated worktrees based on trace analysis | Green |
| Evaluator | LLM-as-judge: reads outputs via langsmith-cli, scores correctness | Yellow |
| Architect | Recommends multi-agent topology changes | Blue |
| Critic | Validates evaluator quality, detects gaming | Red |
| TestGen | Generates test inputs for LangSmith datasets | Cyan |
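Besides the LLM-as-judge Evaluator agent, the loop also runs code-based evaluators through client.evaluate(). A minimal sketch of one such evaluator, using plain dicts in place of the SDK's Run and Example objects:

```python
def exact_match(run: dict, example: dict) -> dict:
    """Code-based evaluator sketch: score 1.0 when the candidate's
    answer matches the reference answer stored in the dataset example.
    (Dict shapes here stand in for LangSmith's Run/Example objects.)"""
    predicted = (run.get("outputs") or {}).get("answer")
    expected = (example.get("outputs") or {}).get("answer")
    return {"key": "exact_match", "score": float(predicted == expected)}
```

With the real SDK, a function with this `(run, example) -> {"key": ..., "score": ...}` shape can be passed to `client.evaluate()` as an evaluator.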
Evolution Loop
```
/evolver:evolve
|
+- 1.   Read state (.evolver.json + LangSmith experiments)
+- 1.5  Gather trace insights (cluster errors, tokens, latency)
+- 1.8  Analyze per-task failures (adaptive briefings)
+- 2.   Spawn 5 proposers in parallel (each in a git worktree)
+- 3.   Run target for each candidate (client.evaluate() -> code-based evaluators)
+- 3.5  Spawn evaluator agent (reads outputs via langsmith-cli, judges, writes scores)
+- 4.   Compare experiments -> select winner + per-task champion
+- 5.   Merge winning worktree into main branch
+- 5.5  Test suite growth (add regression examples to dataset)
+- 6.   Report results
+- 6.5  Auto-trigger Critic (if score jumped >0.3)
+- 7.   Auto-trigger Architect (if stagnation or regression)
+- 8.   Check stop conditions
```

Architecture
```
Plugin hook (SessionStart)
└→ Creates venv, installs langsmith + langsmith-cli, exports env vars

Skills (markdown)
├── /evolver:setup  → explores project, runs setup.py
├── /evolver:evolve → orchestrates the evolution loop
├── /evolver:status → reads .evolver.json + LangSmith
└── /evolver:deploy → tags and pushes

Agents (markdown)
├── Proposer (x5) → modifies code in git worktrees
├── Evaluator     → LLM-as-judge via langsmith-cli
├── Critic        → detects evaluator gaming
├── Architect     → recommends topology changes
└── TestGen       → generates test inputs

Tools (Python + langsmith SDK)
├── setup.py            → creates datasets, configures evaluators
├── run_eval.py         → runs target against dataset
├── read_results.py     → compares experiments
├── trace_insights.py   → clusters errors from traces
└── seed_from_traces.py → imports production traces
```

Requirements
- LangSmith account + `LANGSMITH_API_KEY`
- Python 3.10+
- Git (for worktree-based isolation)
- Claude Code (or Cursor/Codex/Windsurf)

Dependencies (`langsmith`, `langsmith-cli`) are installed automatically by the plugin hook or the npx installer.
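The Git requirement comes from worktree-based isolation: each proposer edits code in its own worktree on its own branch, so five candidates can evolve in parallel without touching the main checkout. A sketch of how such isolation can be set up; the branch names and directory layout are hypothetical, not the plugin's actual scheme:

```python
import os
import subprocess
import tempfile

def make_worktrees(repo: str, n: int = 5) -> list[str]:
    """Create n disposable git worktrees, one per proposer (sketch).

    Each worktree gets its own candidate branch, so proposers can
    modify files in parallel; the winner's branch is later merged back.
    """
    paths = []
    for i in range(n):
        # Hypothetical layout: one temp dir per proposer.
        path = os.path.join(tempfile.mkdtemp(), f"proposer-{i}")
        subprocess.run(
            ["git", "-C", repo, "worktree", "add", "-b", f"candidate-{i}", path],
            check=True, capture_output=True,
        )
        paths.append(path)
    return paths
```

After evaluation, losing worktrees can be discarded with `git worktree remove`, leaving the repository clean.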
Framework Support
LangSmith traces any AI framework. The evolver works with all of them:
| Framework | LangSmith Tracing |
|---|---|
| LangChain / LangGraph | Auto (env vars only) |
| OpenAI SDK | wrap_openai() (2 lines) |
| Anthropic SDK | wrap_anthropic() (2 lines) |
| CrewAI / AutoGen | OpenTelemetry (~10 lines) |
| Any Python code | @traceable decorator |
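The last row refers to langsmith's `@traceable` decorator, which records a function's inputs and outputs as a run. A toy stand-in (not the real langsmith implementation) that illustrates the mechanism:

```python
import functools

TRACES = []  # stand-in for the LangSmith backend

def traceable(fn):
    """Toy version of langsmith's @traceable: log each call's
    name, inputs, and outputs to a list instead of a tracing server."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        TRACES.append({
            "name": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "outputs": result,
        })
        return result
    return wrapper

@traceable
def answer(question: str) -> str:
    return question.upper()  # placeholder for a real LLM call

answer("hello")
```

With the real SDK you would `from langsmith import traceable`, and runs would land in your LangSmith project rather than a local list.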
References
- Meta-Harness: End-to-End Optimization of Model Harnesses — Lee et al., 2026
- Darwin Gödel Machine — Sakana AI
- AlphaEvolve — DeepMind
- LangSmith Evaluation — LangChain
- Traces Start the Agent Improvement Loop — LangChain
License
MIT
