pi-prompt-autoresearch · v0.1.1
# pi prompt autoresearch
A pi extension that iteratively improves prompts using execution-based evaluation, blind A/B comparison, and keep/discard decisions.
- Generates an eval suite from your goal
- Runs each prompt candidate across the suite and scores actual outputs
- Performs blind A/B comparisons between incumbent and candidate
- Keeps or discards each iteration based on eval scores and comparator preference
- Benchmarks repeated runs and reports variance
## Install
```
pi install npm:pi-prompt-autoresearch
```

From the public git repo:

```
pi install git:github.com/NicoAvanzDev/pi-prompt-autoresearch
```

From a local clone:

```
pi install .
```

Load without installing:

```
pi --no-extensions -e ./index.ts
```

## Quick start
```
/autoresearch Write a prompt that produces a concise, factual summary of a long technical article.
```

That single command kicks off the full optimization loop. The extension will:
- Generate an initial prompt from your goal
- Build an eval suite tailored to the task
- Iterate — rewrite, evaluate, compare, keep or discard — for 10 rounds (configurable)
- Write the best prompt to `AUTORESEARCH_PROMPT.md` in your working directory
A live progress widget shows iteration count, scores, elapsed time, and ETA while it runs. When a new best prompt is found you get a milestone update in chat.
## Example session
```
> /autoresearch Write a prompt that turns raw meeting transcripts into structured JSON notes with attendees, action items, and decisions.

Autoresearch ━━━━━━━━━━━━━━━━━━━━ 100% 10/10 iterations
Goal    Turn meeting transcripts into structured JSON notes
Score   0.92 (best) — +38% vs baseline
Status  Completed in 4m 12s

✓ Best prompt written to AUTORESEARCH_PROMPT.md
```

You can also benchmark an existing prompt to measure consistency:

```
> /autoresearch-benchmark --runs 5 Write a prompt that extracts structured meeting notes as JSON.

Benchmark complete — 5 runs
Mean 0.88 · Min 0.84 · Max 0.91 · StdDev 0.03
```

## How it works
### Improve mode
For each /autoresearch run, the extension:
- generates an initial prompt from the user goal
- generates a small eval suite for the user goal
- runs the initial prompt on every eval case
- scores each case and computes an aggregate score
- generates a revised prompt candidate
- runs that candidate on every eval case
- evaluates the candidate across the full suite
- performs a blind A/B comparison between incumbent and candidate outputs
- keeps the candidate only if:
  - the eval says `keep`
  - the aggregate score beats the current best
  - the blind comparator prefers the candidate
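The three-way gate at the end of each iteration can be sketched as a single predicate. This is illustrative only, assuming hypothetical type and field names (`IterationResult`, `shouldKeep`); the extension's actual internals may differ.

```typescript
// Hypothetical shape of one iteration's evaluation results.
interface IterationResult {
  aggregateScore: number;               // mean score across the eval suite
  evalVerdict: "keep" | "discard";      // the eval's own keep/discard call
  comparatorPrefersCandidate: boolean;  // blind A/B preference for the candidate
}

// A candidate replaces the incumbent only when all three gates pass:
// eval verdict, score improvement, and blind-comparator preference.
function shouldKeep(candidate: IterationResult, bestScore: number): boolean {
  return (
    candidate.evalVerdict === "keep" &&
    candidate.aggregateScore > bestScore &&
    candidate.comparatorPrefersCandidate
  );
}
```

Requiring all three conditions makes the loop conservative: a candidate that scores higher but loses the blind comparison is still discarded.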
### Benchmark mode
The benchmark workflow:
- generates an eval suite
- runs the prompt multiple times across that suite
- records per-run aggregate scores
- reports:
  - mean score
  - min/max score
  - variance
  - standard deviation
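The statistics in the benchmark report can be sketched as follows. This is an assumption-laden illustration (the function names `mean`, `variance`, and `benchmarkReport` are invented here), not the extension's actual code; it shows how per-run aggregate scores reduce to the reported figures.

```typescript
function mean(scores: number[]): number {
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

// Population variance of the per-run aggregate scores.
function variance(scores: number[]): number {
  const m = mean(scores);
  return scores.reduce((acc, s) => acc + (s - m) ** 2, 0) / scores.length;
}

// Reduce per-run scores to the benchmark summary: mean, min/max,
// variance, and standard deviation (sqrt of variance).
function benchmarkReport(runScores: number[]) {
  const v = variance(runScores);
  return {
    mean: mean(runScores),
    min: Math.min(...runScores),
    max: Math.max(...runScores),
    variance: v,
    stdDev: Math.sqrt(v),
  };
}
```

A low standard deviation across runs indicates the prompt produces consistent quality, which is the point of benchmark mode.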
## Commands

### Run autoresearch
```
/autoresearch <goal>
```

Example:
```
/autoresearch Write a prompt that produces a concise, factual summary of a long technical article.
```

Override iterations for one run:
```
/autoresearch --iterations 20 Write a prompt that generates a JSON API migration checklist.
```

### Benchmark a prompt
```
/autoresearch-benchmark <goal>
```

Example:
```
/autoresearch-benchmark --runs 5 Write a prompt that extracts structured meeting notes as JSON.
```

### Change the default iteration count
```
/autoresearch-iterations 20
```

### Control a running job
```
/autoresearch-pause
/autoresearch-resume
/autoresearch-kill
/autoresearch-status
```

The interactive extension shows:
- a persistent progress widget above the editor
- an AI-generated goal summary
- iteration and case progress
- elapsed time and ETA, refreshed live while a job is running
- current score, best score, and percentage improvement vs baseline
- milestone updates in chat when a new best prompt is found, or when the job is paused/resumed/completed
During a run, the extension writes AUTORESEARCH_PROMPT.md in the current working directory with the raw best prompt text, updated at each iteration. Progress state is kept internal to the extension (pi session entries and the live UI widget).
Pause takes effect at the next safe checkpoint between long-running steps.
## Tools
The extension exposes LLM-callable tools:
- `run_prompt_autoresearch`
- `benchmark_prompt_autoresearch`
### `run_prompt_autoresearch`
Parameters:
- `goal: string`
- `iterations?: number`
- `evalCases?: number`
### `benchmark_prompt_autoresearch`
Parameters:
- `goal: string`
- `runs?: number`
- `evalCases?: number`
## Notes
- default improve iterations: 10
- users can increase iterations up to 100
- default benchmark runs: 3
- benchmark runs can go up to 10
- default eval cases: 5
- eval cases can go up to 8
- in interactive mode, `/autoresearch` copies the best prompt into the editor when finished
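The defaults and caps above amount to clamping each optional parameter into its allowed range. The sketch below illustrates this under assumed names (`resolveOptions`, `clamp` are invented for this example and are not the extension's API):

```typescript
// Clamp a value into [min, max].
const clamp = (value: number, min: number, max: number): number =>
  Math.min(Math.max(value, min), max);

// Apply the documented defaults and upper limits to tool parameters.
function resolveOptions(opts: { iterations?: number; runs?: number; evalCases?: number }) {
  return {
    iterations: clamp(opts.iterations ?? 10, 1, 100), // improve mode: default 10, up to 100
    runs: clamp(opts.runs ?? 3, 1, 10),               // benchmark: default 3, up to 10
    evalCases: clamp(opts.evalCases ?? 5, 1, 8),      // eval suite: default 5, up to 8
  };
}
```

So an out-of-range request like `iterations: 500` would be capped at 100 rather than rejected, under this reading of the limits.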
