modelab

v0.6.3

Published

5 days ago

Autonomous AI research engine — run experiments, score outputs, remember lessons, recall history, and plan what to test next


darksol

ai research agent llm openai anthropic evaluation experimentation memory routing

Built by DARKSOL 🌑

modelab 🌑 — Repeatable Multi-Model Research

Run the same research question across multiple models, score the results, remember what worked, and get a concrete recommendation for what to test next.

modelab is used internally at DARKSOL for systematic research — comparing model strategies, pressure-testing prompts, auditing model behavior, and generating reproducible reports.

It is intentionally model-agnostic underneath and workflow-opinionated on top. The point is not just to run prompts. The point is to make research runs repeatable, comparable, and easier to improve over time.

What it does best

modelab is strongest when you want to:

compare multiple model approaches to the same task
keep a memory of what worked on similar past questions
inspect routing choices instead of blindly trusting a default model
get a suggested next experiment instead of just a pile of logs

If you only want a single prompt runner, this is overkill. If you want a lightweight research loop with memory and planning, this is the lane.

Product Thesis

modelab is built around one loop:

Run parallel experiments
Score outputs against a rubric
Remember lessons and summaries
Recall similar prior work
Plan the next best experiment

If it only ran experiments, it would be another eval harness. The goal is to become an engine that learns from prior runs and gets sharper every cycle.

Install

npm install modelab
modelab config --wizard   # guided setup for preset + export defaults

Quick Start

# Run a research question across 3 model strategies in parallel
modelab run --goal "What causes migraines and what do the best treatments look like?"

# Check setup health before a real run
modelab doctor

# Ask modelab what it should test next
modelab plan --goal "Compare Postgres vs DynamoDB for a startup"

# Let modelab propose the next run and execute it immediately
modelab next --goal "Compare Postgres vs DynamoDB for a startup" --run

# Explain why routing picked a model
modelab route --task "Compare Postgres vs DynamoDB for a startup"

# Compare specific models head-to-head
modelab run --goal "Explain ZK rollup economics" --arms balanced,reasoning

# Use a built-in template
modelab run --goal "Review my REST API design" --template code-review

# Watch streaming tokens arrive in real-time
modelab run --goal "Write a birthday card" --template creative --stream

# Export a finished run to a shareable HTML report
modelab export <run-id> --format html --share

Why it exists

Most AI tooling stops at execution. modelab is for when you want a system that can:

compare multiple approaches instead of trusting one answer
preserve lessons instead of starting fresh every run
route work using learned performance, not just keywords
surface similar past work when a new question looks familiar
recommend the next experiment instead of leaving you with raw logs
turn that recommendation into an actual run without manual copy/paste

That makes it useful for:

prompt and workflow R&D
model comparison and provider audits
research-heavy product decisions
repeatable internal evaluation loops

First-run checklist

npm install modelab
modelab config --wizard
modelab doctor
modelab plan --goal "Review my API design"
modelab run --goal "Review my API design"

If modelab doctor shows missing auth, add the right provider key and run it again before blaming the package.

If you want a different starting lane without prompts, use a preset directly:

modelab config --presets
modelab config --init --preset cheap
# or
modelab config --wizard --yes --preset research-heavy --default-format html --output-dir shared-reports

What it does not do yet

It does not automatically guarantee better research outcomes without a decent scoring setup.
It does not replace a human reviewer for high-stakes decisions.
It does not magically infer perfect provider credentials or budgets.
Its learned routing is only as good as the run history you have built.

What’s new in v0.6

Research Planning

modelab plan turns routing, run history, lessons, model insight profiles, and semantic recall into a recommended next experiment. Instead of only telling you what happened, modelab now suggests what to run next: template, iterations, threshold, arms, a copy-paste command, and where to inspect/share the run after it finishes.

One-Step Next Run

modelab next closes the gap between planning and execution. It uses the same routing, lessons, history, and semantic recall as plan, but can immediately launch the recommended experiment with --run. That makes the core workflow feel more like a real research loop instead of a planning screen plus manual follow-up.

Guided Setup + Shareable Reports

modelab now has a guided first-run path via modelab config --wizard and a cleaner report handoff via modelab export <run-id> --format html --share. That means you can go from setup → plan → run → share without stitching together a bunch of one-off commands.

Actionable Self-Iteration (lesson_engine)

Previous versions stored "lessons" as advisory notes nobody read. v0.4 closes the loop: after each run, the lesson_engine parses the scorer output, writes actual router adjustments to the DB, and the routing layer applies them before the next run. Score below 4? Model gets a -1 penalty. Score above 8? It gets boosted for that task type. Lessons become config changes — automatically.
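The threshold rule above can be sketched as a small pure function. This is only an illustration: `adjustmentForScore` and the `RouterAdjustment` shape are hypothetical names, not modelab's schema, and the size of the boost (+1) is an assumption, since the text states only the -1 penalty.

```typescript
// Hypothetical shape for a router adjustment; not modelab's actual DB schema.
interface RouterAdjustment {
  model: string;
  taskType: string;
  delta: number; // negative = penalty, positive = boost
}

// Map a scorer result to an adjustment, or null when the score sits in the neutral band.
function adjustmentForScore(model: string, taskType: string, score: number): RouterAdjustment | null {
  if (score < 4) return { model, taskType, delta: -1 }; // penalty stated in the text
  if (score > 8) return { model, taskType, delta: 1 };  // boost size is an assumption
  return null;
}
```

Scores from 4 through 8 produce no adjustment in this sketch, which matches the text's silence about the middle band.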

Model Profiles

Every model maintains a performance profile updated after each run: avg_score, avg_latency_ms, avg_cost_usd, strengths[], weaknesses[]. The router reads this, not just keyword matching.
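One way such a profile could be kept current is an incremental mean, sketched below. The field names come from the list above; the `runs` counter, the `RunResult` type, and `updateProfile` are assumptions added for illustration and may not match modelab's internals.

```typescript
interface ModelProfile {
  avg_score: number;
  avg_latency_ms: number;
  avg_cost_usd: number;
  strengths: string[];
  weaknesses: string[];
  runs: number; // assumed counter, needed to compute an incremental mean
}

// Hypothetical per-run measurements fed into the profile.
interface RunResult {
  score: number;
  latencyMs: number;
  costUsd: number;
}

// Return a new profile with each average folded forward by one run.
function updateProfile(profile: ModelProfile, result: RunResult): ModelProfile {
  const n = profile.runs + 1;
  const mean = (old: number, x: number): number => old + (x - old) / n;
  return {
    ...profile,
    runs: n,
    avg_score: mean(profile.avg_score, result.score),
    avg_latency_ms: mean(profile.avg_latency_ms, result.latencyMs),
    avg_cost_usd: mean(profile.avg_cost_usd, result.costUsd),
  };
}
```

The incremental form avoids storing every past score just to recompute an average.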

Semantic Memory (embedding_store)

Run summaries and lessons are embedded using TF-IDF vectors (Ollama nomic-embed-text when available) and stored in SQLite. Query past experiments semantically: modelab recall "what did we learn about coding tasks?"
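For readers unfamiliar with TF-IDF recall, here is a minimal in-memory sketch of the idea. It is not the embedding_store implementation: the tokenizer, the smoothed IDF formula, and the names `tfidfVectors`, `cosine`, and `recallSketch` are all assumptions, and the SQLite storage and Ollama path are left out.

```typescript
// Lowercase alphanumeric word tokens; a deliberately simple tokenizer.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Build one sparse TF-IDF vector (term -> weight) per document.
function tfidfVectors(docs: string[]): Map<string, number>[] {
  const tokenized = docs.map(tokenize);
  const docFreq = new Map<string, number>();
  tokenized.forEach(tokens => {
    new Set(tokens).forEach(term => docFreq.set(term, (docFreq.get(term) ?? 0) + 1));
  });
  return tokenized.map(tokens => {
    const counts = new Map<string, number>();
    tokens.forEach(term => counts.set(term, (counts.get(term) ?? 0) + 1));
    const vec = new Map<string, number>();
    counts.forEach((count, term) => {
      // Smoothed inverse document frequency, so widely shared terms keep a small weight.
      const idf = Math.log((1 + docs.length) / (1 + (docFreq.get(term) ?? 0))) + 1;
      vec.set(term, (count / tokens.length) * idf);
    });
    return vec;
  });
}

// Cosine similarity between two sparse vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((w, term) => {
    dot += w * (b.get(term) ?? 0);
    normA += w * w;
  });
  b.forEach(w => {
    normB += w * w;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Rank stored summaries against a query; the query is vectorized alongside the corpus.
function recallSketch(query: string, docs: string[]): { index: number; similarity: number }[] {
  const vectors = tfidfVectors([query, ...docs]);
  const queryVec = vectors[0];
  if (!queryVec) return [];
  return vectors
    .slice(1)
    .map((vec, index) => ({ index, similarity: cosine(queryVec, vec) }))
    .sort((x, y) => y.similarity - x.similarity);
}
```

Documents that share no terms with the query score 0, which is the main limitation of TF-IDF compared with a learned embedding model.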

Learned Routing (routing_v2)

Replaces the keyword router with a performance-based router that considers: model strengths for the task type, historical scores, active adjustments, and similar past runs.
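A rough sketch of how those four signals might be folded into one ranking score follows. The `Candidate` shape, the weights, and the function names are assumptions chosen for illustration; the actual routing_v2 formula is not described in this document.

```typescript
// Hypothetical per-model inputs to the router.
interface Candidate {
  model: string;
  strengths: string[];        // task types the profile lists as strengths
  avgScore: number;           // historical average score, 0 to 10
  adjustment: number;         // sum of active lesson adjustments for this task type
  similarRunScores: number[]; // scores this model earned on similar past runs
}

// Combine the signals into one number; all weights are illustrative assumptions.
function routingScore(c: Candidate, taskType: string): number {
  const strengthBonus = c.strengths.indexOf(taskType) !== -1 ? 1 : 0;
  const similarAvg =
    c.similarRunScores.length > 0
      ? c.similarRunScores.reduce((sum, s) => sum + s, 0) / c.similarRunScores.length
      : c.avgScore;
  return c.avgScore + strengthBonus + c.adjustment + 0.5 * (similarAvg - c.avgScore);
}

// Pick the highest-scoring candidate, or undefined when there are none.
function pickModel(candidates: Candidate[], taskType: string): string | undefined {
  let best: Candidate | undefined;
  let bestScore = -Infinity;
  for (const c of candidates) {
    const s = routingScore(c, taskType);
    if (s > bestScore) {
      best = c;
      bestScore = s;
    }
  }
  return best?.model;
}
```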

Campaign Layer

Multi-run research campaigns: modelab campaign new "My Hypothesis" --runs 10 --synthesize. Coordinates a series of runs, synthesizes findings, and generates reports.

Task Complexity Profiling

Goals are analyzed for complexity (question marks, structure, length) before routing. High-complexity tasks get more iterations; low-complexity tasks get faster models.
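A heuristic along these lines could look like the sketch below. The three signals are the ones named above, but the structure markers, the weights, and the cut-offs are assumptions, and `profileComplexity` is a hypothetical name.

```typescript
type Complexity = "low" | "medium" | "high";

// Score a goal on question marks, structural markers, and length.
function profileComplexity(goal: string): Complexity {
  const questionMarks = (goal.match(/\?/g) ?? []).length;
  // Structure markers are an assumed proxy: connectives and punctuation that split a goal into parts.
  const structureMarkers = (goal.match(/\b(and|versus|vs|compare|then)\b|[;:\n]/gi) ?? []).length;
  const words = goal.trim().split(/\s+/).filter(Boolean).length;
  // One point per signal, plus one point per 15 words; thresholds are illustrative.
  const points = questionMarks + structureMarkers + Math.floor(words / 15);
  if (points >= 4) return "high";
  if (points >= 2) return "medium";
  return "low";
}
```

A caller could then map "high" to a larger iteration count and "low" to the fast model, as the paragraph above describes.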

LCM Memory v2 — Cross-Run Persistence + Cross-Iteration Learning

Every run writes iteration summaries and full run summaries to SQLite. Before running, modelab loads prior context for the same goal ID — so lessons from last week actually influence today's experiment. The iteration_context template variable carries these lessons into new prompts automatically.

Tiktoken Token Counting

Replaced the rough length/4 estimate with the GPT-2 vocabulary Tiktoken encoder. Token counts and cost estimates are now based on real BPE tokenization rather than character length, though they remain approximate for models whose tokenizer differs from the GPT-2 vocabulary.
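To make the cost arithmetic concrete, the sketch below shows the older length/4 heuristic next to a per-million-token cost calculation. The pricing field names match the config shown later in this README; `roughTokenEstimate` and `estimateCostUsd` are hypothetical helpers, and a real tokenizer would stand in for the heuristic.

```typescript
interface ModelPricing {
  costPerMillionInput: number;  // USD per one million input tokens
  costPerMillionOutput: number; // USD per one million output tokens
}

// The rough heuristic mentioned in the text: about four characters per token.
function roughTokenEstimate(text: string): number {
  return Math.ceil(text.length / 4);
}

// Convert token counts into an estimated cost in USD.
function estimateCostUsd(inputTokens: number, outputTokens: number, pricing: ModelPricing): number {
  return (inputTokens * pricing.costPerMillionInput + outputTokens * pricing.costPerMillionOutput) / 1e6;
}
```

For example, with pricing of 3 and 15 USD per million tokens, one million input tokens plus one million output tokens comes to 18 USD.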

Proactive Rate-Limit Backoff

RateLimitTracker tracks 429 responses and their Retry-After headers. Before making a call, the system checks whether the endpoint is currently throttled and, if so, waits first, so it avoids the failed request instead of only retrying after one.
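The proactive check can be sketched as a small class. This is an assumption-laden illustration, not the real RateLimitTracker: the class and method names are hypothetical, Retry-After is handled only in its seconds form (the HTTP-date form is ignored), and the one-second fallback is invented for the example.

```typescript
class RateLimitTrackerSketch {
  // endpoint -> epoch milliseconds until which calls should be held back
  private blockedUntil = new Map<string, number>();

  // Record a 429 response and its Retry-After header (seconds form only).
  record429(endpoint: string, retryAfterHeader: string | null, now: number = Date.now()): void {
    const seconds = Number(retryAfterHeader);
    const usable = retryAfterHeader !== null && retryAfterHeader !== "" && isFinite(seconds) && seconds >= 0;
    const waitMs = usable ? seconds * 1000 : 1000; // fallback delay is an assumption
    this.blockedUntil.set(endpoint, now + waitMs);
  }

  // How long a caller should wait before the endpoint is expected to accept calls again.
  msUntilReady(endpoint: string, now: number = Date.now()): number {
    return Math.max(0, (this.blockedUntil.get(endpoint) ?? 0) - now);
  }

  // Sleep first if the endpoint is known to be throttled, then let the caller proceed.
  async waitIfThrottled(endpoint: string): Promise<void> {
    const ms = this.msUntilReady(endpoint);
    if (ms > 0) await new Promise<void>(resolve => setTimeout(resolve, ms));
  }
}
```

A caller would invoke `waitIfThrottled` before each request and `record429` whenever a response comes back with status 429.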

TTFT Latency Stats

Time-to-first-token is measured per arm per call and aggregated into p50/p95 stats. These are shown in comparison tables and stored in run logs. Arms can be configured with a latencyTargetMs to skip models that are historically slow.
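As an illustration of the aggregation, the sketch below computes p50 and p95 with the nearest-rank method and applies a latency target. The percentile method and the choice to compare `latencyTargetMs` against the historical p95 are assumptions; the document does not say which statistic modelab uses for skipping.

```typescript
// Nearest-rank percentile over TTFT samples in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? NaN;
}

// Aggregate raw samples into the two stats named in the text.
function ttftStats(samples: number[]): { p50: number; p95: number } {
  return { p50: percentile(samples, 50), p95: percentile(samples, 95) };
}

// Assumed policy: skip an arm when its historical p95 exceeds the target.
function shouldSkipArm(historicalTtftMs: number[], latencyTargetMs: number): boolean {
  return historicalTtftMs.length > 0 && percentile(historicalTtftMs, 95) > latencyTargetMs;
}
```

An arm with no history is never skipped in this sketch, so new models still get a first measurement.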

GLM-5.0 Routing

The keyword router now recognizes glm, glm-5, glm5, 智谱, zhipu and routes them to the configured glm model, which enables research on Chinese-language models and frontier Chinese providers.

Cross-Run Learning System

outputPreview (first 200 chars) and outputTruncated (boolean) stored in DB and cache
experiments command: modelab experiments --sort score|cost|date for at-a-glance run history
review command: modelab review <run-id> for detailed latency + lesson breakdown
migration hardening for older local SQLite installs so new memory fields upgrade cleanly instead of breaking existing environments
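The two stored preview fields can be derived in one step, as in this small sketch. The field names and the 200-character limit come from the list above; `previewOutput` is a hypothetical helper name.

```typescript
// Derive the stored preview fields from a full model output.
function previewOutput(
  output: string,
  maxChars: number = 200,
): { outputPreview: string; outputTruncated: boolean } {
  return {
    outputPreview: output.slice(0, maxChars),
    outputTruncated: output.length > maxChars,
  };
}
```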

Structured Scorer with Optional Sub-Fields

The LLM judge returns { score, reasoning, clarity?, correctness?, completeness? }. Sub-fields are optional — the scorer still works if the judge skips them.
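The optional sub-fields can be handled as in the sketch below. It uses hand-written checks so the example stays dependency-free; the Architecture section says the real scorer validates with Zod, and `parseJudgeScore` is a hypothetical name.

```typescript
interface JudgeScore {
  score: number;
  reasoning: string;
  clarity?: number;
  correctness?: number;
  completeness?: number;
}

// Parse the judge's JSON reply, requiring score and reasoning and tolerating missing sub-fields.
function parseJudgeScore(raw: string): JudgeScore {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("judge output is not an object");
  }
  const obj = data as Record<string, unknown>;
  const score = obj["score"];
  const reasoning = obj["reasoning"];
  if (typeof score !== "number" || typeof reasoning !== "string") {
    throw new Error("judge output is missing score or reasoning");
  }
  const result: JudgeScore = { score, reasoning };
  for (const key of ["clarity", "correctness", "completeness"] as const) {
    const value = obj[key];
    if (typeof value === "number") result[key] = value; // copied only when the judge supplied it
  }
  return result;
}
```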

CLI Commands

modelab run --goal "..." [--iterations N] [--threshold N] [--arms m1,m2]
              [--template id] [--format json|md|html] [--output path]
              [--stream] [--no-cache]
              Run a research experiment

modelab experiments [--sort score|cost|date]    View all runs at a glance
modelab experiments --goal-id "..." --status completed
modelab review <run-id>                         Deep-dive: latency + lesson breakdown
modelab report <run-id>                         Winner summary + per-arm breakdown
modelab recall "what did we learn about..."    Semantic search across runs + lessons
modelab plan --goal "..."                      Recommend the next best experiment
modelab next --goal "..." [--run]             Generate the next best experiment and optionally run it now
modelab memory inspect --goal-id "..."         Inspect stored runs + lessons for a goal
modelab history                                 Show run history
modelab best [--goal-id]                        Show best result for a goal
modelab templates                               List built-in prompt templates
modelab export <run-id> [--format json|md|html] [--share] Re-export a past run with executive summary
modelab route --task "..."                      Show model routing decision
modelab cache --clear                           Clear the result cache
modelab config --init                           Create ~/.modelab/config.json
modelab config --wizard                         Guided setup for preset + export defaults
modelab config --presets                        Show starter config presets
modelab config --list                           Show current config

Built-in Templates

| Template | Use case |
|----------|----------|
| research | Deep multi-perspective research |
| code-review | Bugs, security, performance review |
| architecture | System design and trade-offs |
| bug-hunt | Adversarial failure-mode analysis |
| compare | A/B decisions with scoring |
| quick-answer | Fast, concise responses |
| creative | Brainstorming and ideation |

Model Routing

Routing is no longer just static keyword matching. modelab combines:

task-shape heuristics
learned model profiles
active lesson adjustments
prior run performance
similar historical runs

The keyword layer still exists as a fallback/default shape:

| Task | Keywords | Routes to |
|------|----------|-----------|
| coding | code, refactor, bug, build, PR, function, class | coding |
| reasoning | proof, logic, analysis, theorem, 证明, 推理 | reasoning |
| glm | glm, glm-5, glm5, 智谱, zhipu | glm |
| quick | quick, summary, what is, define, 什么是 | fast |
| default | everything else | balanced |

Override with --arms fast,balanced,reasoning or configure explicitly in config.
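The keyword fallback table can be read as a first-match lookup, sketched below. The keywords and targets are copied from the table; the matching rules are assumptions (whole-word, case-insensitive matching for ASCII keywords, plain substring matching for Chinese ones, rows checked in table order), and `routeByKeyword` is a hypothetical name.

```typescript
const KEYWORD_ROUTES: { keywords: string[]; model: string }[] = [
  { keywords: ["code", "refactor", "bug", "build", "PR", "function", "class"], model: "coding" },
  { keywords: ["proof", "logic", "analysis", "theorem", "证明", "推理"], model: "reasoning" },
  { keywords: ["glm", "glm-5", "glm5", "智谱", "zhipu"], model: "glm" },
  { keywords: ["quick", "summary", "what is", "define", "什么是"], model: "fast" },
];

// ASCII keywords match as whole words so "PR" does not fire on "compare".
function matchesKeyword(text: string, keyword: string): boolean {
  const isAscii = /^[\x00-\x7f]+$/.test(keyword);
  if (!isAscii) return text.indexOf(keyword) !== -1; // Chinese keywords: substring match
  const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("\\b" + escaped + "\\b", "i").test(text);
}

// Return the model key for the first matching row, or "balanced" as the default.
function routeByKeyword(task: string): string {
  for (const route of KEYWORD_ROUTES) {
    if (route.keywords.some(keyword => matchesKeyword(task, keyword))) return route.model;
  }
  return "balanced";
}
```

Whole-word matching matters here: with naive substring matching, a goal such as "Compare Postgres vs DynamoDB" would hit the "PR" keyword and be sent to the coding model.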

Opinionated by workflow, not locked by provider

modelab should stay broad at the infrastructure layer:

bring your own providers
bring your own models
bring your own scoring budgets

But at the product layer, it is intentionally opinionated:

experiments should be comparable
lessons should persist
memory should influence future runs
planning should be explicit
research quality should compound over time

Config

~/.modelab/config.json:

{
  "models": {
    "fast":      { "provider": "openai",   "model": "gpt-4o-mini",             "costPerMillionInput": 0.15,  "costPerMillionOutput": 0.60 },
    "balanced":  { "provider": "anthropic", "model": "claude-sonnet-4-6",        "costPerMillionInput": 3,     "costPerMillionOutput": 15 },
    "reasoning": { "provider": "openai",   "model": "o1",                       "costPerMillionInput": 15,    "costPerMillionOutput": 60 },
    "coding":    { "provider": "ollama",   "model": "qwen3-coder",              "baseUrl": "http://localhost:11434" },
    "glm":       { "provider": "openai",   "model": "glm-z1-air",               "baseUrl": "https://open.bigmodel.cn/api/paas/v4" },
    "groq":      { "provider": "groq",     "model": "llama-3.3-70b-versatile",  "costPerMillionInput": 0.2,   "costPerMillionOutput": 0.8 }
  },
  "evalModel": "balanced",
  "budget": { "maxPerRun": 2.0, "maxPerExperiment": 0.5, "trackCosts": true },
  "parallelism": 3
}

Use as a Library

import { ResearchOrchestrator } from 'modelab';

const orch = new ResearchOrchestrator({
  models: {
    balanced: { provider: 'anthropic', model: 'claude-sonnet-4-6', costPerMillionInput: 3, costPerMillionOutput: 15 },
    reasoning: { provider: 'openai', model: 'o1', costPerMillionInput: 15, costPerMillionOutput: 60 },
  },
  budget: { maxPerRun: 2.0, maxPerExperiment: 0.5, trackCosts: true },
  evalModel: 'balanced',
  parallelism: 3,
  onProgress: msg => console.log(msg),
  onArmComplete: r => console.log(`Arm done: ${r.armId} → ${r.score}/10`),
});

const log = await orch.run({
  id: 'my-goal',
  question: 'What is the optimal block time for Ethereum L2s?',
  goal: 'Provide a technically rigorous analysis',
  qualityThreshold: 7.5,
  maxIterations: 3,
  arms: [
    { id: 'arm-1', name: 'balanced', model: 'balanced', promptTemplate: '...' },
    { id: 'arm-2', name: 'reasoning', model: 'reasoning', promptTemplate: '...' },
  ],
});

console.log(log.bestResult);
console.log(`Total cost: $${log.totalCostUsd}`);

You can also use the planning surface directly:

import { buildResearchPlan } from 'modelab';

Environment Variables

| Variable | Provider |
|---------|---------|
| OPENAI_API_KEY | OpenAI, Groq, OpenRouter, Perplexity |
| ANTHROPIC_API_KEY | Anthropic |
| MINIMAX_API_KEY | MiniMax |
| GROQ_API_KEY | Groq |
| GEMINI_API_KEY | Google Gemini |
| PERPLEXITY_API_KEY | Perplexity |

Architecture

ResearchOrchestrator
  ├── router.ts         — keyword heuristic → best-fit model (GLM-5.0 aware)
  ├── routing_v2.ts    — learned routing: performance-based, reads from lesson_engine
  ├── evaluator.ts     — streaming calls across 10 providers, rate-limit tracking
  ├── scorer.ts        — LLM judge: structured rubric, Zod validation, retry
  ├── orchestrator.ts  — parallel arms, quality gate, budget guard, TTFT tracking
  ├── planner.ts       — recommends the next best experiment from routing + memory + recall
  ├── lesson_engine.ts — self-iteration: parses lessons, writes router adjustments
  ├── embedding_store.ts — TF-IDF / Ollama embeddings for semantic memory
  ├── complexity.ts    — task complexity profiling
  ├── memory.ts        — SQLite: ~/.modelab/memory.db (LCM Memory v2)
  ├── cache.ts         — SHA-256 hash cache: ~/.modelab/cache.json
  ├── templates.ts     — 7 built-in prompt templates
  └── export.ts        — json / markdown / html reports

Links + License

npm: https://www.npmjs.com/package/modelab
Issues: https://github.com/darks0l/modelab/issues
Changelog: CHANGELOG.md

MIT License

Built with teeth. 🌑