# claude-skill-autoresearch

A Claude Code skill that applies the autoresearch methodology to any LLM-powered project — API routes, prompt iteration, agents, evaluation pipelines, or full model training.
## Install

```sh
npm install -g claude-skill-autoresearch
```

The skill is automatically copied to `~/.claude/skills/autoresearch.md` on install.
## Usage

In any Claude Code session:

```
/autoresearch
```

Claude will apply the autoresearch discipline to whatever AI work you're doing.
## What it does
The autoresearch methodology was originally designed for autonomous LLM training experiments (run overnight, keep/discard based on a single metric). This skill generalizes those principles to any AI work.
### The universal structure

Every AI task maps to the same three parts:
| Role | LLM Application | Model Training |
|---|---|---|
| Modifiable | Prompt, context, schema, model params | Architecture, optimizer, hyperparameters |
| Locked | Evaluation function, test set, metric | Dataloader, tokenizer, time budget |
| Findings log | `findings.json` | `results.tsv` |
### Core principles

These apply to everything:

- Single metric, chosen upfront, never changed mid-experiment
- One change at a time — isolate what caused the improvement
- Keep/discard via `git reset` — no exceptions
- Simplicity criterion — equal metric with less code is a win
- Autonomous loop — never stop to ask, run until manually halted
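The principles above compose into a single loop. A minimal sketch, assuming hypothetical `run_experiment`, `propose_change`, `commit`, and `revert` hooks (these names are illustrative, not part of the skill):

```python
import time

def autoresearch_loop(run_experiment, propose_change, revert, commit,
                      baseline, budget_s=8 * 3600, clock=time.time):
    """Autonomous keep/discard loop: one change per iteration, judged by a
    single pre-chosen metric; worse results are reverted, never argued with."""
    best = baseline
    findings = []
    deadline = clock() + budget_s
    while clock() < deadline:          # run until the time budget, never pause to ask
        change = propose_change(findings)  # exactly one modification at a time
        if change is None:                 # proposer has nothing left to try
            break
        metric = run_experiment()
        kept = metric >= best              # simplicity criterion: ties count as wins
        if kept:
            best = metric
            commit(change)                 # e.g. git commit -am "<change>"
        else:
            revert()                       # e.g. git reset --hard, no exceptions
        findings.append({"change": change, "metric": metric, "kept": kept})
    return findings
```

In practice `commit` and `revert` would shell out to git; they are injected here so the loop itself stays testable.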
### Mode A: LLM Application

For API routes, prompt chains, and agents:

- `findings.json` pattern: log every run's change, metric, status, and observations
- History injection: inject prior runs into the prompt so the LLM tracks its own progress
- Context accumulation: failures → system prompt constraints; successes → few-shot examples
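A sketch of the `findings.json` pattern and history injection, assuming an illustrative record shape (the field names here are an assumption, not a schema the skill mandates):

```python
import json
from pathlib import Path

def log_run(path, change, metric, status, observations):
    """Append one run to findings.json so every experiment leaves a record."""
    runs = json.loads(path.read_text()) if path.exists() else []
    runs.append({"change": change, "metric": metric,
                 "status": status, "observations": observations})
    path.write_text(json.dumps(runs, indent=2))
    return runs

def inject_history(system_prompt, runs, limit=5):
    """History injection: prepend recent runs so the LLM sees its own progress."""
    lines = [f"- {r['change']}: {r['metric']} ({r['status']})" for r in runs[-limit:]]
    return system_prompt + "\n\nPrior runs:\n" + "\n".join(lines)
```

Discarded runs stay in the log on purpose: a failure recorded as an observation can later become a system prompt constraint.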
### Mode B: Model Training

For PyTorch/JAX training loops:

- Modern transformer architecture defaults (RoPE, Flash Attention, GQA, RMSNorm, softcap)
- Heterogeneous optimizer (Muon for matrices, AdamW for embeddings/scalars)
- Training loop hygiene (GC management, loss-explosion fast-fail, time budget)
## Origin
Based on Andrej Karpathy's autoresearch project — a single-GPU LLM training setup designed for autonomous overnight experimentation by AI agents.
## License
MIT
