modelab
v0.6.3
Autonomous AI research engine — run experiments, score outputs, remember lessons, recall history, and plan what to test next
Built by DARKSOL 🌑
modelab 🌑 — Repeatable Multi-Model Research
Run the same research question across multiple models, score the results, remember what worked, and get a concrete recommendation for what to test next.
modelab is used internally at DARKSOL for systematic research — comparing model strategies, pressure-testing prompts, auditing model behavior, and generating reproducible reports.
It is intentionally model-agnostic underneath and workflow-opinionated on top. The point is not just to run prompts. The point is to make research runs repeatable, comparable, and easier to improve over time.
What it does best
modelab is strongest when you want to:
- compare multiple model approaches to the same task
- keep a memory of what worked on similar past questions
- inspect routing choices instead of blindly trusting a default model
- get a suggested next experiment instead of just a pile of logs
If you only want a single prompt runner, this is overkill. If you want a lightweight research loop with memory and planning, this is the lane.
Product Thesis
modelab is built around one loop:
- Run parallel experiments
- Score outputs against a rubric
- Remember lessons and summaries
- Recall similar prior work
- Plan the next best experiment
If it only ran experiments, it would be another eval harness. The goal is to become an engine that learns from prior runs and gets sharper every cycle.
Install
npm install modelab
modelab config --wizard # guided setup for preset + export defaults
Quick Start
# Run a research question across 3 model strategies in parallel
modelab run --goal "What causes migraines and what do the best treatments look like?"
# Check setup health before a real run
modelab doctor
# Ask modelab what it should test next
modelab plan --goal "Compare Postgres vs DynamoDB for a startup"
# Let modelab propose the next run and execute it immediately
modelab next --goal "Compare Postgres vs DynamoDB for a startup" --run
# Explain why routing picked a model
modelab route --task "Compare Postgres vs DynamoDB for a startup"
# Compare specific models head-to-head
modelab run --goal "Explain ZK rollup economics" --arms balanced,reasoning
# Use a built-in template
modelab run --goal "Review my REST API design" --template code-review
# Watch streaming tokens arrive in real-time
modelab run --goal "Write a birthday card" --template creative --stream
# Export a finished run to a shareable HTML report
modelab export <run-id> --format html --share
Why it exists
Most AI tooling stops at execution. modelab is for when you want a system that can:
- compare multiple approaches instead of trusting one answer
- preserve lessons instead of starting fresh every run
- route work using learned performance, not just keywords
- surface similar past work when a new question looks familiar
- recommend the next experiment instead of leaving you with raw logs
- turn that recommendation into an actual run without manual copy/paste
That makes it useful for:
- prompt and workflow R&D
- model comparison and provider audits
- research-heavy product decisions
- repeatable internal evaluation loops
First-run checklist
npm install modelab
modelab config --wizard
modelab doctor
modelab plan --goal "Review my API design"
modelab run --goal "Review my API design"
If modelab doctor shows missing auth, add the right provider key and run it again before blaming the package.
If you want a different starting lane without prompts, use a preset directly:
modelab config --presets
modelab config --init --preset cheap
# or
modelab config --wizard --yes --preset research-heavy --default-format html --output-dir shared-reports
What it does not do yet
- It does not automatically guarantee better research outcomes without a decent scoring setup.
- It does not replace a human reviewer for high-stakes decisions.
- It does not magically infer perfect provider credentials or budgets.
- Its learned routing is only as good as the run history you have built.
What’s new in v0.6
Research Planning
modelab plan turns routing, run history, lessons, model insight profiles, and semantic recall into a recommended next experiment. Instead of only telling you what happened, modelab now suggests what to run next: template, iterations, threshold, arms, a copy-paste command, and where to inspect/share the run after it finishes.
One-Step Next Run
modelab next closes the gap between planning and execution. It uses the same routing, lessons, history, and semantic recall as plan, but can immediately launch the recommended experiment with --run. That makes the core workflow feel more like a real research loop instead of a planning screen plus manual follow-up.
Guided Setup + Shareable Reports
modelab now has a guided first-run path via modelab config --wizard and a cleaner report handoff via modelab export <run-id> --format html --share. That means you can go from setup → plan → run → share without stitching together a bunch of one-off commands.
Actionable Self-Iteration (lesson_engine)
Previous versions stored "lessons" as advisory notes nobody read. v0.4 closes the loop: after each run, the lesson_engine parses the scorer output, writes actual router adjustments to the DB, and the routing layer applies them before the next run. Score below 4? Model gets a -1 penalty. Score above 8? It gets boosted for that task type. Lessons become config changes — automatically.
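The score-to-adjustment rule above can be sketched roughly as follows. This is a minimal illustration, not the lesson_engine source: the `RouterAdjustment` shape, the function names, and the size of the boost (+1) are assumptions; only the -1 penalty below 4 and the boost above 8 come from the description.

```typescript
// Hypothetical shape of a router adjustment; not modelab's actual schema.
interface RouterAdjustment {
  model: string;
  taskType: string;
  delta: number; // added to the model's routing score for this task type
}

// Map a scorer result (0-10) to an adjustment, or null when the score is unremarkable.
// The -1 penalty below 4 comes from the text; the +1 boost size is an assumption.
function adjustmentForScore(model: string, taskType: string, score: number): RouterAdjustment | null {
  if (score < 4) return { model, taskType, delta: -1 };
  if (score > 8) return { model, taskType, delta: 1 };
  return null;
}

// Sum the active adjustments for one model and task type.
function netAdjustment(adjustments: RouterAdjustment[], model: string, taskType: string): number {
  return adjustments
    .filter((a) => a.model === model && a.taskType === taskType)
    .reduce((sum, a) => sum + a.delta, 0);
}
```

A router could add `netAdjustment(...)` to a model's base score before picking an arm, which is one way lessons turn into config changes.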
Model Profiles
Every model maintains a performance profile updated after each run: avg_score, avg_latency_ms, avg_cost_usd, strengths[], weaknesses[]. The router reads this, not just keyword matching.
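One plausible way to keep such a profile current is an incremental mean, sketched below. The field names follow the list above; the `runs` counter, the `RunResult` type, and `updateProfile` are assumptions added for illustration and may not match how modelab stores profiles.

```typescript
// Field names mirror the profile fields listed above; `runs` is an added
// assumption needed to maintain running averages.
interface ModelProfile {
  avg_score: number;
  avg_latency_ms: number;
  avg_cost_usd: number;
  strengths: string[];
  weaknesses: string[];
  runs: number;
}

// Hypothetical per-run measurements.
interface RunResult {
  score: number;
  latencyMs: number;
  costUsd: number;
}

// Incremental mean: newAvg = oldAvg + (x - oldAvg) / n, so no history is re-read.
function updateProfile(profile: ModelProfile, result: RunResult): ModelProfile {
  const n = profile.runs + 1;
  const mean = (old: number, x: number): number => old + (x - old) / n;
  return {
    ...profile,
    avg_score: mean(profile.avg_score, result.score),
    avg_latency_ms: mean(profile.avg_latency_ms, result.latencyMs),
    avg_cost_usd: mean(profile.avg_cost_usd, result.costUsd),
    runs: n,
  };
}
```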
Semantic Memory (embedding_store)
Run summaries and lessons are embedded using TF-IDF vectors (Ollama nomic-embed-text when available) and stored in SQLite. Query past experiments semantically: modelab recall "what did we learn about coding tasks?"
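For intuition, here is a small TF-IDF recall sketch in plain TypeScript. It is not the embedding_store implementation: the tokenizer, the smoothed IDF formula, and the `recall` function name are assumptions, and the real store persists vectors in SQLite rather than recomputing them per query.

```typescript
type Vector = Map<string, number>;

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// One TF-IDF vector per document, with smoothed IDF: ln((1 + N) / (1 + df)) + 1.
function tfidfVectors(docs: string[]): Vector[] {
  const tokenized = docs.map(tokenize);
  const df = new Map<string, number>();
  tokenized.forEach((tokens) => {
    new Set(tokens).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1));
  });
  return tokenized.map((tokens) => {
    const counts = new Map<string, number>();
    tokens.forEach((term) => counts.set(term, (counts.get(term) ?? 0) + 1));
    const vec: Vector = new Map();
    counts.forEach((count, term) => {
      const idf = Math.log((1 + docs.length) / (1 + (df.get(term) ?? 0))) + 1;
      vec.set(term, (count / tokens.length) * idf);
    });
    return vec;
  });
}

function cosine(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((w, term) => {
    dot += w * (b.get(term) ?? 0);
    normA += w * w;
  });
  b.forEach((w) => {
    normB += w * w;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Rank stored summaries against a query; the query is vectorized with the corpus.
function recall(query: string, summaries: string[]): { text: string; score: number }[] {
  const vectors = tfidfVectors([...summaries, query]);
  const queryVector = vectors[vectors.length - 1];
  return summaries
    .map((text, i) => ({ text, score: cosine(queryVector, vectors[i]) }))
    .sort((x, y) => y.score - x.score);
}
```

TF-IDF only matches shared terms, which is presumably why a neural embedding model is preferred when Ollama is available.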
Learned Routing (routing_v2)
Replaces the keyword router with a performance-based router that considers: model strengths for the task type, historical scores, active adjustments, and similar past runs.
Campaign Layer
Multi-run research campaigns: modelab campaign new "My Hypothesis" --runs 10 --synthesize. Coordinates a series of runs, synthesizes findings, and generates reports.
Task Complexity Profiling
Goals are analyzed for complexity (question marks, structure, length) before routing. High-complexity tasks get more iterations; low-complexity get faster models.
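A heuristic along these lines could look like the sketch below. The cue list, the weights, the cutoffs, and the iteration counts are illustrative assumptions, not values taken from complexity.ts.

```typescript
type Complexity = "low" | "medium" | "high";

// Count the signals the text names: question marks, structure, and length.
// The cue list, weights, and cutoffs are illustrative assumptions.
function profileComplexity(goal: string): Complexity {
  const questionMarks = (goal.match(/\?/g) ?? []).length;
  const structureCues = (goal.match(/\b(vs|versus|compare|and|then|because|trade-?offs?)\b|[;:,]/gi) ?? []).length;
  const words = goal.trim().split(/\s+/).filter(Boolean).length;
  const points = questionMarks + structureCues + Math.floor(words / 15);
  if (points <= 1) return "low";
  if (points <= 4) return "medium";
  return "high";
}

// Hypothetical mapping from complexity to run settings:
// more iterations for hard goals, a faster arm for easy ones.
function runSettingsFor(complexity: Complexity): { iterations: number; arm: string } {
  if (complexity === "high") return { iterations: 3, arm: "reasoning" };
  if (complexity === "medium") return { iterations: 2, arm: "balanced" };
  return { iterations: 1, arm: "fast" };
}
```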
LCM Memory v2 — Cross-Run Persistence + Cross-Iteration Learning
Every run writes iteration summaries and full run summaries to SQLite. Before running, modelab loads prior context for the same goal ID — so lessons from last week actually influence today's experiment. The iteration_context template variable carries these lessons into new prompts automatically.
Tiktoken Token Counting
Replaced the rough length/4 estimate with the Tiktoken encoder using the GPT-2 vocabulary. Token counts and cost estimates are now much closer to real tokenization, though they remain approximate for models whose tokenizer differs from GPT-2's.
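To make the cost arithmetic concrete, the sketch below shows the old length/4 heuristic next to a per-million-token cost calculation. The `costPerMillionInput` and `costPerMillionOutput` names match the config format in this README; the function names are hypothetical, and the tokenizer itself is left out.

```typescript
interface Pricing {
  costPerMillionInput: number;
  costPerMillionOutput: number;
}

// The rough heuristic the text mentions: about four characters per token.
function roughTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

// Cost in USD for one call, given token counts and per-million-token prices.
function estimateCostUsd(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  return (inputTokens * pricing.costPerMillionInput + outputTokens * pricing.costPerMillionOutput) / 1e6;
}
```

With prices of 3 and 15 per million tokens, a call with 1,000,000 input tokens and 500,000 output tokens comes to 3 + 7.5 = 10.5 USD.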
Proactive Rate-Limit Backoff
RateLimitTracker tracks 429 responses and their Retry-After headers. Before making a call, the system checks whether the endpoint is currently throttled and waits proactively — not just retrying after failure but preventing it.
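The proactive check can be sketched as below. Only the class name RateLimitTracker comes from the text; the method names, the per-endpoint map, and the 1-second fallback when no usable Retry-After header is present are assumptions.

```typescript
// Hypothetical sketch; only the class name comes from the text.
class RateLimitTracker {
  private blockedUntil = new Map<string, number>();

  // Record a 429. Retry-After is either a delay in seconds or an HTTP date.
  record429(endpoint: string, retryAfter: string | null, now: number = Date.now()): void {
    let waitMs = 1000; // assumed fallback when the header is missing or unparseable
    if (retryAfter !== null && retryAfter.trim() !== "") {
      const seconds = Number(retryAfter);
      if (Number.isFinite(seconds)) {
        waitMs = Math.max(0, seconds * 1000);
      } else {
        const date = Date.parse(retryAfter);
        if (!Number.isNaN(date)) waitMs = Math.max(0, date - now);
      }
    }
    this.blockedUntil.set(endpoint, now + waitMs);
  }

  // Milliseconds to wait before calling the endpoint; 0 when it is not throttled.
  waitMsFor(endpoint: string, now: number = Date.now()): number {
    return Math.max(0, (this.blockedUntil.get(endpoint) ?? 0) - now);
  }
}

// Call this before each request so throttled endpoints are not hit again early.
async function waitIfThrottled(tracker: RateLimitTracker, endpoint: string): Promise<void> {
  const ms = tracker.waitMsFor(endpoint);
  if (ms > 0) await new Promise<void>((resolve) => setTimeout(resolve, ms));
}
```

Passing `now` explicitly keeps the tracker easy to test without real timers.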
TTFT Latency Stats
Time-to-first-token is measured per arm per call and aggregated into p50/p95 stats. These are shown in comparison tables and stored in run logs. Arms can be configured with a latencyTargetMs to skip models that are historically slow.
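As a rough illustration of the aggregation, the sketch below computes p50/p95 with the nearest-rank method. The choice of nearest-rank, the function names, and comparing latencyTargetMs against p95 specifically are assumptions; the text does not say which statistic the skip decision uses.

```typescript
// Nearest-rank percentile over a list of samples (p in the range 1-100).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function ttftStats(ttftMs: number[]): { p50: number; p95: number } {
  return { p50: percentile(ttftMs, 50), p95: percentile(ttftMs, 95) };
}

// latencyTargetMs comes from the text; using p95 as the comparison is an assumption.
function shouldSkipArm(ttftMs: number[], latencyTargetMs: number): boolean {
  return ttftMs.length > 0 && percentile(ttftMs, 95) > latencyTargetMs;
}
```

An arm with no history is never skipped in this sketch, so new models still get a first run.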
GLM-5.0 Routing
The keyword router now recognizes glm, glm-5, glm5, 智谱, zhipu and routes to the GLM-5.0 model, enabling research on Chinese-language models and frontier Chinese providers.
Cross-Run Learning System
- outputPreview (first 200 chars) and outputTruncated (boolean) stored in DB and cache
- experiments command: modelab experiments --sort score|cost|date for at-a-glance run history
- review command: modelab review <run-id> for detailed latency + lesson breakdown
- migration hardening for older local SQLite installs so new memory fields upgrade cleanly instead of breaking existing environments
Structured Scorer with Optional Sub-Fields
The LLM judge returns { score, reasoning, clarity?, correctness?, completeness? }. Sub-fields are optional — the scorer still works if the judge skips them.
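The "required core, optional sub-fields" contract can be sketched as follows. The architecture notes mention Zod validation; this sketch uses plain TypeScript checks instead so it runs without dependencies, and the `JudgeResult` and `parseJudgeResult` names are assumptions.

```typescript
interface JudgeResult {
  score: number;
  reasoning: string;
  clarity?: number;
  correctness?: number;
  completeness?: number;
}

// Parse raw judge output: score and reasoning are required,
// the three sub-fields are copied only when present and numeric.
function parseJudgeResult(raw: string): JudgeResult {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("judge output is not an object");
  }
  const obj = data as Record<string, unknown>;
  const score = obj["score"];
  const reasoning = obj["reasoning"];
  if (typeof score !== "number" || typeof reasoning !== "string") {
    throw new Error("judge output is missing score or reasoning");
  }
  const result: JudgeResult = { score, reasoning };
  const optionalKeys = ["clarity", "correctness", "completeness"] as const;
  for (const key of optionalKeys) {
    const value = obj[key];
    if (typeof value === "number") result[key] = value;
  }
  return result;
}
```

A caller would typically retry the judge call when this throws, rather than recording a partial score.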
CLI Commands
modelab run --goal "..." [--iterations N] [--threshold N] [--arms m1,m2]
[--template id] [--format json|md|html] [--output path]
[--stream] [--no-cache]
Run a research experiment
modelab experiments [--sort score|cost|date] View all runs at a glance
modelab experiments --goal-id "..." --status completed
modelab review <run-id> Deep-dive: latency + lesson breakdown
modelab report <run-id> Winner summary + per-arm breakdown
modelab recall "what did we learn about..." Semantic search across runs + lessons
modelab plan --goal "..." Recommend the next best experiment
modelab next --goal "..." [--run] Generate the next best experiment and optionally run it now
modelab memory inspect --goal-id "..." Inspect stored runs + lessons for a goal
modelab history Show run history
modelab best [--goal-id] Show best result for a goal
modelab templates List built-in prompt templates
modelab export <run-id> [--format json|md|html] [--share] Re-export a past run with executive summary
modelab route --task "..." Show model routing decision
modelab cache --clear Clear the result cache
modelab config --init Create ~/.modelab/config.json
modelab config --wizard Guided setup for preset + export defaults
modelab config --presets Show starter config presets
modelab config --list Show current config
Built-in Templates
| Template | Use case |
|----------|----------|
| research | Deep multi-perspective research |
| code-review | Bugs, security, performance review |
| architecture | System design and trade-offs |
| bug-hunt | Adversarial failure-mode analysis |
| compare | A/B decisions with scoring |
| quick-answer | Fast, concise responses |
| creative | Brainstorming and ideation |
Model Routing
Routing is no longer just static keyword matching. modelab combines:
- task-shape heuristics
- learned model profiles
- active lesson adjustments
- prior run performance
- similar historical runs
The keyword layer still exists as a fallback/default shape:
| Task | Keywords | Routes to |
|------|----------|-----------|
| coding | code, refactor, bug, build, PR, function, class | coding |
| reasoning | proof, logic, analysis, theorem, 证明,推理 | reasoning |
| glm | glm, glm-5, glm5, 智谱, zhipu | glm |
| quick | quick, summary, what is, define, 什么是 | fast |
| default | everything else | balanced |
Override with --arms fast,balanced,reasoning or configure explicitly in config.
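The fallback table can be read as a first-match keyword lookup, sketched below. The keywords are copied from the table; the rule order, the word-boundary matching for ASCII keywords, and the function names are assumptions rather than the behavior of router.ts.

```typescript
type Arm = "coding" | "reasoning" | "glm" | "fast" | "balanced";

// Keywords copied from the table above; first matching rule wins (an assumption).
const RULES: { arm: Arm; keywords: string[] }[] = [
  { arm: "coding", keywords: ["code", "refactor", "bug", "build", "pr", "function", "class"] },
  { arm: "reasoning", keywords: ["proof", "logic", "analysis", "theorem", "证明", "推理"] },
  { arm: "glm", keywords: ["glm", "glm-5", "glm5", "智谱", "zhipu"] },
  { arm: "fast", keywords: ["quick", "summary", "what is", "define", "什么是"] },
];

// ASCII keywords match on word boundaries so "pr" does not fire inside "compare";
// CJK keywords use substring matching because \b does not apply to them.
function matchesKeyword(text: string, keyword: string): boolean {
  if (/^[\x20-\x7e]+$/.test(keyword)) {
    const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return new RegExp("\\b" + escaped + "\\b").test(text);
  }
  return text.includes(keyword);
}

function routeByKeyword(task: string): Arm {
  const text = task.toLowerCase();
  for (const rule of RULES) {
    if (rule.keywords.some((keyword) => matchesKeyword(text, keyword))) return rule.arm;
  }
  return "balanced"; // the table's default row
}
```

In the learned router this result would only be a starting point, which profiles and lesson adjustments can override.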
Opinionated by workflow, not locked by provider
modelab should stay broad at the infrastructure layer:
- bring your own providers
- bring your own models
- bring your own scoring budgets
But at the product layer, it is intentionally opinionated:
- experiments should be comparable
- lessons should persist
- memory should influence future runs
- planning should be explicit
- research quality should compound over time
Config
~/.modelab/config.json:
{
"models": {
"fast": { "provider": "openai", "model": "gpt-4o-mini", "costPerMillionInput": 0.15, "costPerMillionOutput": 0.60 },
"balanced": { "provider": "anthropic", "model": "claude-sonnet-4-6", "costPerMillionInput": 3, "costPerMillionOutput": 15 },
"reasoning": { "provider": "openai", "model": "o1", "costPerMillionInput": 15, "costPerMillionOutput": 60 },
"coding": { "provider": "ollama", "model": "qwen3-coder", "baseUrl": "http://localhost:11434" },
"glm": { "provider": "openai", "model": "glm-z1-air", "baseUrl": "https://open.bigmodel.cn/api/paas/v4" },
"groq": { "provider": "groq", "model": "llama-3.3-70b-versatile", "costPerMillionInput": 0.2, "costPerMillionOutput": 0.8 }
},
"evalModel": "balanced",
"budget": { "maxPerRun": 2.0, "maxPerExperiment": 0.5, "trackCosts": true },
"parallelism": 3
}
Use as a Library
import { ResearchOrchestrator } from 'modelab';
const orch = new ResearchOrchestrator({
models: {
balanced: { provider: 'anthropic', model: 'claude-sonnet-4-6', costPerMillionInput: 3, costPerMillionOutput: 15 },
reasoning: { provider: 'openai', model: 'o1', costPerMillionInput: 15, costPerMillionOutput: 60 },
},
budget: { maxPerRun: 2.0, maxPerExperiment: 0.5, trackCosts: true },
evalModel: 'balanced',
parallelism: 3,
onProgress: msg => console.log(msg),
onArmComplete: r => console.log(`Arm done: ${r.armId} → ${r.score}/10`),
});
const log = await orch.run({
id: 'my-goal',
question: 'What is the optimal block time for Ethereum L2s?',
goal: 'Provide a technically rigorous analysis',
qualityThreshold: 7.5,
maxIterations: 3,
arms: [
{ id: 'arm-1', name: 'balanced', model: 'balanced', promptTemplate: '...' },
{ id: 'arm-2', name: 'reasoning', model: 'reasoning', promptTemplate: '...' },
],
});
console.log(log.bestResult);
console.log(`Total cost: $${log.totalCostUsd}`);
You can also use the planning surface directly:
import { buildResearchPlan } from 'modelab';
Environment Variables
| Variable | Provider |
|---------|---------|
| OPENAI_API_KEY | OpenAI, Groq, OpenRouter, Perplexity |
| ANTHROPIC_API_KEY | Anthropic |
| MINIMAX_API_KEY | MiniMax |
| GROQ_API_KEY | Groq |
| GEMINI_API_KEY | Google Gemini |
| PERPLEXITY_API_KEY | Perplexity |
Architecture
ResearchOrchestrator
├── router.ts — keyword heuristic → best-fit model (GLM-5.0 aware)
├── routing_v2.ts — learned routing: performance-based, reads from lesson_engine
├── evaluator.ts — streaming calls across 10 providers, rate-limit tracking
├── scorer.ts — LLM judge: structured rubric, Zod validation, retry
├── orchestrator.ts — parallel arms, quality gate, budget guard, TTFT tracking
├── planner.ts — recommends the next best experiment from routing + memory + recall
├── lesson_engine.ts — self-iteration: parses lessons, writes router adjustments
├── embedding_store.ts — TF-IDF / Ollama embeddings for semantic memory
├── complexity.ts — task complexity profiling
├── memory.ts — SQLite: ~/.modelab/memory.db (LCM Memory v2)
├── cache.ts — SHA-256 hash cache: ~/.modelab/cache.json
├── templates.ts — 7 built-in prompt templates
└── export.ts — json / markdown / html reports
Links + License
- npm: https://www.npmjs.com/package/modelab
- Issues: https://github.com/darks0l/modelab/issues
- Changelog: CHANGELOG.md
MIT License
Built with teeth. 🌑
