modelab

v0.6.3

Autonomous AI research engine — run experiments, score outputs, remember lessons, recall history, and plan what to test next

Built by DARKSOL 🌑

modelab 🌑 — Repeatable Multi-Model Research

License: MIT · Node >= 18

Run the same research question across multiple models, score the results, remember what worked, and get a concrete recommendation for what to test next.

modelab is used internally at DARKSOL for systematic research — comparing model strategies, pressure-testing prompts, auditing model behavior, and generating reproducible reports.

It is intentionally model-agnostic underneath and workflow-opinionated on top. The point is not just to run prompts. The point is to make research runs repeatable, comparable, and easier to improve over time.


What it does best

modelab is strongest when you want to:

  • compare multiple model approaches to the same task
  • keep a memory of what worked on similar past questions
  • inspect routing choices instead of blindly trusting a default model
  • get a suggested next experiment instead of just a pile of logs

If you only want a single prompt runner, this is overkill. If you want a lightweight research loop with memory and planning, this is the lane.


Product Thesis

modelab is built around one loop:

  1. Run parallel experiments
  2. Score outputs against a rubric
  3. Remember lessons and summaries
  4. Recall similar prior work
  5. Plan the next best experiment

If it only ran experiments, it would be another eval harness. The goal is to become an engine that learns from prior runs and gets sharper every cycle.


Install

npm install modelab
modelab config --wizard   # guided setup for preset + export defaults

Quick Start

# Run a research question across 3 model strategies in parallel
modelab run --goal "What causes migraines and what do the best treatments look like?"

# Check setup health before a real run
modelab doctor

# Ask modelab what it should test next
modelab plan --goal "Compare Postgres vs DynamoDB for a startup"

# Let modelab propose the next run and execute it immediately
modelab next --goal "Compare Postgres vs DynamoDB for a startup" --run

# Explain why routing picked a model
modelab route --task "Compare Postgres vs DynamoDB for a startup"

# Compare specific models head-to-head
modelab run --goal "Explain ZK rollup economics" --arms balanced,reasoning

# Use a built-in template
modelab run --goal "Review my REST API design" --template code-review

# Watch streaming tokens arrive in real-time
modelab run --goal "Write a birthday card" --template creative --stream

# Export a finished run to a shareable HTML report
modelab export <run-id> --format html --share

Why it exists

Most AI tooling stops at execution. modelab is for when you want a system that can:

  • compare multiple approaches instead of trusting one answer
  • preserve lessons instead of starting fresh every run
  • route work using learned performance, not just keywords
  • surface similar past work when a new question looks familiar
  • recommend the next experiment instead of leaving you with raw logs
  • turn that recommendation into an actual run without manual copy/paste

That makes it useful for:

  • prompt and workflow R&D
  • model comparison and provider audits
  • research-heavy product decisions
  • repeatable internal evaluation loops

First-run checklist

npm install modelab
modelab config --wizard
modelab doctor
modelab plan --goal "Review my API design"
modelab run --goal "Review my API design"

If modelab doctor shows missing auth, add the right provider key and run it again before blaming the package.

If you want a different starting lane without prompts, use a preset directly:

modelab config --presets
modelab config --init --preset cheap
# or
modelab config --wizard --yes --preset research-heavy --default-format html --output-dir shared-reports

What it does not do yet

  • It does not automatically guarantee better research outcomes without a decent scoring setup.
  • It does not replace a human reviewer for high-stakes decisions.
  • It does not magically infer perfect provider credentials or budgets.
  • Its learned routing is only as good as the run history you have built.

What’s new in v0.6

Research Planning

modelab plan turns routing, run history, lessons, model insight profiles, and semantic recall into a recommended next experiment. Instead of only telling you what happened, modelab now suggests what to run next: template, iterations, threshold, arms, a copy-paste command, and where to inspect/share the run after it finishes.

One-Step Next Run

modelab next closes the gap between planning and execution. It uses the same routing, lessons, history, and semantic recall as plan, but can immediately launch the recommended experiment with --run. That makes the core workflow feel more like a real research loop instead of a planning screen plus manual follow-up.

Guided Setup + Shareable Reports

modelab now has a guided first-run path via modelab config --wizard and a cleaner report handoff via modelab export <run-id> --format html --share. That means you can go from setup → plan → run → share without stitching together a bunch of one-off commands.

Actionable Self-Iteration (lesson_engine)

Previous versions stored "lessons" as advisory notes nobody read. v0.4 closes the loop: after each run, the lesson_engine parses the scorer output, writes actual router adjustments to the DB, and the routing layer applies them before the next run. Score below 4? Model gets a -1 penalty. Score above 8? It gets boosted for that task type. Lessons become config changes — automatically.
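
As a rough illustration of the score-to-adjustment rule described above, here is a minimal sketch. The thresholds (below 4, above 8) and the -1 penalty come from the text. The `RouterAdjustment` shape, the function name, and the +1 boost size are assumptions made for the example, not modelab's actual schema.

```typescript
// Hypothetical shape of a stored router adjustment; field names are assumptions.
interface RouterAdjustment {
  model: string;
  taskType: string;
  delta: number; // negative = penalty, positive = boost
}

// Thresholds and the -1 penalty follow the README text.
// The +1 boost size is an assumption; the text only says "boosted".
function adjustmentForScore(
  model: string,
  taskType: string,
  score: number
): RouterAdjustment | null {
  if (score < 4) return { model, taskType, delta: -1 };
  if (score > 8) return { model, taskType, delta: 1 };
  return null; // mid-range scores leave routing unchanged
}
```

Returning `null` for mid-range scores keeps the adjustment table small: only clear wins and clear failures change future routing.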

Model Profiles

Every model maintains a performance profile that is updated after each run: avg_score, avg_latency_ms, avg_cost_usd, strengths[], weaknesses[]. The router reads these profiles instead of relying on keyword matching alone.
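
One plausible way to keep such a profile current is an incremental mean, sketched below. The five profile fields are the ones named above. The `runs` counter, the `RunObservation` type, and `updateProfile` are assumptions added so the averages can be maintained without rereading history.

```typescript
interface ModelProfileSketch {
  avg_score: number;
  avg_latency_ms: number;
  avg_cost_usd: number;
  strengths: string[];
  weaknesses: string[];
  runs: number; // assumption: a sample count is needed for running averages
}

// Hypothetical per-run measurement fed into the profile.
interface RunObservation {
  score: number;
  latencyMs: number;
  costUsd: number;
}

// Fold one observation into the profile using an incremental mean.
function updateProfile(
  profile: ModelProfileSketch,
  obs: RunObservation
): ModelProfileSketch {
  const n = profile.runs + 1;
  const runningMean = (prev: number, x: number): number => prev + (x - prev) / n;
  return {
    ...profile,
    runs: n,
    avg_score: runningMean(profile.avg_score, obs.score),
    avg_latency_ms: runningMean(profile.avg_latency_ms, obs.latencyMs),
    avg_cost_usd: runningMean(profile.avg_cost_usd, obs.costUsd),
  };
}
```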

Semantic Memory (embedding_store)

Run summaries and lessons are embedded using TF-IDF vectors (Ollama nomic-embed-text when available) and stored in SQLite. Query past experiments semantically: modelab recall "what did we learn about coding tasks?"
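
To make the TF-IDF path concrete, here is a small self-contained sketch of semantic recall over stored summaries using TF-IDF weights and cosine similarity. It is not modelab's embedding_store: the function names, the tokenizer, and the smoothed IDF formula are all assumptions, and the SQLite storage and Ollama path are left out.

```typescript
type SparseVec = Map<string, number>;

// Lowercase alphanumeric tokenizer (an assumption; real tokenization may differ).
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Smoothed inverse document frequency over the stored corpus.
function fitIdf(corpus: string[]): SparseVec {
  const df: SparseVec = new Map();
  for (const doc of corpus) {
    new Set(tokenize(doc)).forEach(term => df.set(term, (df.get(term) || 0) + 1));
  }
  const idf: SparseVec = new Map();
  df.forEach((count, term) =>
    idf.set(term, Math.log((1 + corpus.length) / (1 + count)) + 1)
  );
  return idf;
}

// Term frequency times IDF; terms never seen in the corpus are dropped.
function vectorize(text: string, idf: SparseVec): SparseVec {
  const tokens = tokenize(text);
  const vec: SparseVec = new Map();
  for (const term of tokens) {
    const weight = idf.get(term);
    if (weight === undefined) continue;
    vec.set(term, (vec.get(term) || 0) + weight / tokens.length);
  }
  return vec;
}

function cosine(a: SparseVec, b: SparseVec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((v, k) => {
    dot += v * (b.get(k) || 0);
    normA += v * v;
  });
  b.forEach(v => {
    normB += v * v;
  });
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// Rank stored texts by similarity to the query, best first.
function recallSketch(query: string, corpus: string[], topK = 3) {
  const idf = fitIdf(corpus);
  const q = vectorize(query, idf);
  return corpus
    .map(text => ({ text, similarity: cosine(q, vectorize(text, idf)) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, topK);
}
```

TF-IDF needs no model download, which fits its role as the always-available fallback; its limitation is that it only matches shared vocabulary, not paraphrases.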

Learned Routing (routing_v2)

Replaces the keyword router with a performance-based router that considers: model strengths for the task type, historical scores, active adjustments, and similar past runs.
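
The four signals listed above could be combined into a single ranking score along these lines. This is a sketch under stated assumptions: the `RoutingCandidate` type, the function names, and every weight are illustrative, and modelab's real routing_v2 may weigh these inputs quite differently.

```typescript
// Hypothetical per-model routing inputs; field names are assumptions.
interface RoutingCandidate {
  model: string;
  strengths: string[]; // task types the profile marks as strong
  avgScore: number; // historical average score, 0-10
  adjustment: number; // sum of active lesson adjustments for this task type
  similarRunScore?: number; // average score on similar past runs, if any
}

// All weights are illustrative assumptions, not modelab's values.
function routingScore(c: RoutingCandidate, taskType: string): number {
  const strengthBonus = c.strengths.indexOf(taskType) >= 0 ? 1 : 0;
  // With no similar runs on record, fall back to the overall average.
  const similar = c.similarRunScore === undefined ? c.avgScore : c.similarRunScore;
  return 0.5 * c.avgScore + 0.3 * similar + strengthBonus + c.adjustment;
}

function pickModel(candidates: RoutingCandidate[], taskType: string): string {
  if (candidates.length === 0) throw new Error("no routing candidates");
  return candidates.reduce((best, c) =>
    routingScore(c, taskType) > routingScore(best, taskType) ? c : best
  ).model;
}
```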

Campaign Layer

Multi-run research campaigns: modelab campaign new "My Hypothesis" --runs 10 --synthesize. Coordinates a series of runs, synthesizes findings, and generates reports.

Task Complexity Profiling

Goals are analyzed for complexity (question marks, structure, length) before routing. High-complexity tasks get more iterations; low-complexity tasks get faster models.
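
A heuristic of this kind might look like the sketch below. The three signals (question marks, structure, length) are the ones named above. The regular expressions, thresholds, point values, and iteration counts are assumptions picked for illustration.

```typescript
type ComplexityLevel = "low" | "medium" | "high";

// Signals come from the README text; every threshold here is an assumption.
function profileComplexity(goal: string): ComplexityLevel {
  const questionMarks = (goal.match(/\?/g) || []).length;
  const structureMarkers = (goal.match(/\n|;|\b(and|versus|vs|compare)\b/gi) || []).length;
  const words = goal.trim().split(/\s+/).filter(Boolean).length;

  let points = 0;
  if (questionMarks > 1) points += 1; // several sub-questions
  if (structureMarkers > 1) points += 1; // multi-part or comparative phrasing
  if (words > 25) points += 1; // long goal
  if (words > 60) points += 1; // very long goal

  if (points >= 2) return "high";
  if (points === 1) return "medium";
  return "low";
}

// Hypothetical mapping from complexity to iteration count.
function suggestedIterations(level: ComplexityLevel): number {
  return level === "high" ? 3 : level === "medium" ? 2 : 1;
}
```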

LCM Memory v2 — Cross-Run Persistence + Cross-Iteration Learning

Every run writes iteration summaries and full run summaries to SQLite. Before running, modelab loads prior context for the same goal ID — so lessons from last week actually influence today's experiment. The iteration_context template variable carries these lessons into new prompts automatically.

Tiktoken Token Counting

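Replaced the rough length/4 estimate with a Tiktoken encoder that uses the GPT-2 vocabulary. Token counts and cost estimates now come from real BPE tokenization instead of a character-count heuristic. For models that use a different tokenizer, treat the counts as close approximations rather than exact values.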
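
For reference, the arithmetic behind a cost estimate is simple once a token count exists. The sketch below shows the old length/4 heuristic mentioned above next to a per-million-token cost calculation. The pricing field names match the config format; the function names are assumptions, and no Tiktoken call is shown because that requires an external package.

```typescript
// Pricing fields as they appear in ~/.modelab/config.json.
interface ModelPricing {
  costPerMillionInput: number;
  costPerMillionOutput: number;
}

// The rough heuristic the README says was replaced: about 4 characters per token.
function roughTokenEstimate(text: string): number {
  return Math.ceil(text.length / 4);
}

// Convert token counts to USD using per-million-token prices.
function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  pricing: ModelPricing
): number {
  return (
    (inputTokens / 1e6) * pricing.costPerMillionInput +
    (outputTokens / 1e6) * pricing.costPerMillionOutput
  );
}
```

With the `fast` pricing from the config (0.15 in, 0.60 out), a call with 2,000 input tokens and 500 output tokens works out to about $0.0006.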

Proactive Rate-Limit Backoff

RateLimitTracker tracks 429 responses and their Retry-After headers. Before making a call, the system checks whether the endpoint is currently throttled and waits proactively — not just retrying after failure but preventing it.
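
The proactive part can be sketched as a per-endpoint "blocked until" timestamp that is checked before each call. The class below is an illustration only: its name, method names, and default wait are assumptions, and it handles only the delta-seconds form of Retry-After (HTTP-date values fall back to the default).

```typescript
class RateLimitTrackerSketch {
  private blockedUntil = new Map<string, number>();

  // Record a 429 response. Retry-After is parsed as delta-seconds; anything
  // else (missing, empty, HTTP-date) uses defaultWaitMs, an assumed fallback.
  record429(
    endpoint: string,
    retryAfterHeader: string | null,
    nowMs: number = Date.now(),
    defaultWaitMs = 1000
  ): void {
    const seconds = retryAfterHeader ? Number(retryAfterHeader) : NaN;
    const waitMs = isFinite(seconds) && seconds >= 0 ? seconds * 1000 : defaultWaitMs;
    const previous = this.blockedUntil.get(endpoint) || 0;
    this.blockedUntil.set(endpoint, Math.max(previous, nowMs + waitMs));
  }

  // How long a caller should wait before hitting this endpoint (0 = go ahead).
  msUntilAllowed(endpoint: string, nowMs: number = Date.now()): number {
    return Math.max(0, (this.blockedUntil.get(endpoint) || 0) - nowMs);
  }

  // Call before each request: sleeps only while the endpoint is throttled.
  async waitIfThrottled(endpoint: string): Promise<void> {
    const ms = this.msUntilAllowed(endpoint);
    if (ms > 0) await new Promise<void>(resolve => setTimeout(resolve, ms));
  }
}
```

Keeping the state per endpoint means a throttled provider does not slow down arms that talk to a different one.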

TTFT Latency Stats

Time-to-first-token is measured per arm per call and aggregated into p50/p95 stats. These are shown in comparison tables and stored in run logs. Arms can be configured with a latencyTargetMs to skip models that are historically slow.
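
For readers unfamiliar with p50/p95, here is a minimal nearest-rank percentile over TTFT samples. `latencyTargetMs` is the option named above; the function names, the nearest-rank method, and the choice to compare the target against p95 are assumptions of this sketch.

```typescript
// Nearest-rank percentile over TTFT samples in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function ttftStats(samplesMs: number[]): { p50: number; p95: number } {
  return { p50: percentile(samplesMs, 50), p95: percentile(samplesMs, 95) };
}

// Assumption: an arm counts as "historically slow" when its p95 exceeds the target.
function shouldSkipArm(samplesMs: number[], latencyTargetMs: number): boolean {
  return samplesMs.length > 0 && percentile(samplesMs, 95) > latencyTargetMs;
}
```

Reporting p95 alongside p50 matters because a single slow outlier barely moves the median but dominates how a streaming run feels.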

GLM-5.0 Routing

The keyword router now recognizes glm, glm-5, glm5, 智谱, and zhipu, and routes matching tasks to the glm arm (whichever GLM model is configured under that key). This enables research on Chinese-language models and frontier Chinese providers.

Cross-Run Learning System

  • outputPreview (first 200 chars) and outputTruncated (boolean) stored in DB and cache
  • experiments command: modelab experiments --sort score|cost|date for at-a-glance run history
  • review command: modelab review <run-id> for detailed latency + lesson breakdown
  • migration hardening for older local SQLite installs so new memory fields upgrade cleanly instead of breaking existing environments

Structured Scorer with Optional Sub-Fields

The LLM judge returns { score, reasoning, clarity?, correctness?, completeness? }. Sub-fields are optional — the scorer still works if the judge skips them.
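
The optional sub-field behaviour can be illustrated with a small parser. The Architecture section says the real scorer uses Zod validation; the version below is a dependency-free stand-in written for this example, and `parseJudgeScore` is a hypothetical name.

```typescript
interface JudgeScore {
  score: number;
  reasoning: string;
  clarity?: number;
  correctness?: number;
  completeness?: number;
}

// Validate raw judge output: score and reasoning are required,
// sub-fields are copied only when present and numeric.
function parseJudgeScore(raw: string): JudgeScore {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("judge output must be a JSON object");
  }
  const obj = data as Record<string, unknown>;
  const score = obj["score"];
  const reasoning = obj["reasoning"];
  if (typeof score !== "number" || typeof reasoning !== "string") {
    throw new Error("judge output needs a numeric score and a string reasoning");
  }
  const result: JudgeScore = { score, reasoning };
  for (const key of ["clarity", "correctness", "completeness"] as const) {
    const value = obj[key];
    if (typeof value === "number") result[key] = value;
  }
  return result;
}
```

Dropping malformed sub-fields instead of rejecting the whole response keeps a run alive when the judge is sloppy about the optional parts, while a missing score still fails and can trigger a retry.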


CLI Commands

modelab run --goal "..." [--iterations N] [--threshold N] [--arms m1,m2]
              [--template id] [--format json|md|html] [--output path]
              [--stream] [--no-cache]
              Run a research experiment

modelab experiments [--sort score|cost|date]    View all runs at a glance
modelab experiments --goal-id "..." --status completed
modelab review <run-id>                         Deep-dive: latency + lesson breakdown
modelab report <run-id>                         Winner summary + per-arm breakdown
modelab recall "what did we learn about..."    Semantic search across runs + lessons
modelab plan --goal "..."                      Recommend the next best experiment
modelab next --goal "..." [--run]             Generate the next best experiment and optionally run it now
modelab memory inspect --goal-id "..."         Inspect stored runs + lessons for a goal
modelab history                                 Show run history
modelab best [--goal-id]                        Show best result for a goal
modelab templates                               List built-in prompt templates
modelab export <run-id> [--format json|md|html] [--share] Re-export a past run with executive summary
modelab route --task "..."                      Show model routing decision
modelab cache --clear                           Clear the result cache
modelab config --init                           Create ~/.modelab/config.json
modelab config --wizard                         Guided setup for preset + export defaults
modelab config --presets                        Show starter config presets
modelab config --list                           Show current config

Built-in Templates

| Template | Use case |
|----------|----------|
| research | Deep multi-perspective research |
| code-review | Bugs, security, performance review |
| architecture | System design and trade-offs |
| bug-hunt | Adversarial failure-mode analysis |
| compare | A/B decisions with scoring |
| quick-answer | Fast, concise responses |
| creative | Brainstorming and ideation |


Model Routing

Routing is no longer just static keyword matching. modelab combines:

  • task-shape heuristics
  • learned model profiles
  • active lesson adjustments
  • prior run performance
  • similar historical runs

The keyword layer still exists as a fallback/default shape:

| Task | Keywords | Routes to |
|------|----------|-----------|
| coding | code, refactor, bug, build, PR, function, class | coding |
| reasoning | proof, logic, analysis, theorem, 证明, 推理 | reasoning |
| glm | glm, glm-5, glm5, 智谱, zhipu | glm |
| quick | quick, summary, what is, define, 什么是 | fast |
| default | everything else | balanced |

Override with --arms fast,balanced,reasoning or configure explicitly in config.
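
The fallback table above can be read as a first-match lookup, sketched here. The keywords and target arms are copied from the table. The matching rules are assumptions: rows are tried top to bottom, ASCII keywords must match as whole words (so "PR" does not fire on "proof"), and non-ASCII keywords match as substrings.

```typescript
// Keyword rows copied from the routing table; order of evaluation is an assumption.
const KEYWORD_ROUTES: { arm: string; keywords: string[] }[] = [
  { arm: "coding", keywords: ["code", "refactor", "bug", "build", "PR", "function", "class"] },
  { arm: "reasoning", keywords: ["proof", "logic", "analysis", "theorem", "证明", "推理"] },
  { arm: "glm", keywords: ["glm", "glm-5", "glm5", "智谱", "zhipu"] },
  { arm: "fast", keywords: ["quick", "summary", "what is", "define", "什么是"] },
];

function matchesKeyword(text: string, keyword: string): boolean {
  const k = keyword.toLowerCase();
  if (/^[\x00-\x7f]+$/.test(k)) {
    // ASCII keyword: require non-alphanumeric boundaries on both sides.
    const escaped = k.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return new RegExp("(^|[^a-z0-9])" + escaped + "($|[^a-z0-9])").test(text);
  }
  // Non-ASCII keyword (e.g. Chinese): plain substring match.
  return text.indexOf(k) >= 0;
}

function routeByKeyword(task: string): string {
  const text = task.toLowerCase();
  for (const route of KEYWORD_ROUTES) {
    if (route.keywords.some(k => matchesKeyword(text, k))) return route.arm;
  }
  return "balanced"; // default row: everything else
}
```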


Opinionated by workflow, not locked by provider

modelab should stay broad at the infrastructure layer:

  • bring your own providers
  • bring your own models
  • bring your own scoring budgets

But at the product layer, it is intentionally opinionated:

  • experiments should be comparable
  • lessons should persist
  • memory should influence future runs
  • planning should be explicit
  • research quality should compound over time

Config

~/.modelab/config.json:

{
  "models": {
    "fast":      { "provider": "openai",   "model": "gpt-4o-mini",             "costPerMillionInput": 0.15,  "costPerMillionOutput": 0.60 },
    "balanced":  { "provider": "anthropic", "model": "claude-sonnet-4-6",        "costPerMillionInput": 3,     "costPerMillionOutput": 15 },
    "reasoning": { "provider": "openai",   "model": "o1",                       "costPerMillionInput": 15,    "costPerMillionOutput": 60 },
    "coding":    { "provider": "ollama",   "model": "qwen3-coder",              "baseUrl": "http://localhost:11434" },
    "glm":       { "provider": "openai",   "model": "glm-z1-air",               "baseUrl": "https://open.bigmodel.cn/api/paas/v4" },
    "groq":      { "provider": "groq",     "model": "llama-3.3-70b-versatile",  "costPerMillionInput": 0.2,   "costPerMillionOutput": 0.8 }
  },
  "evalModel": "balanced",
  "budget": { "maxPerRun": 2.0, "maxPerExperiment": 0.5, "trackCosts": true },
  "parallelism": 3
}
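
To show how the budget block might gate spending, here is a minimal pre-call check. The three fields are taken from the config above. How modelab actually interprets them is not documented here, so this sketch assumes maxPerExperiment caps a single arm call, maxPerRun caps the cumulative spend of one invocation, and trackCosts: false disables the guard. `canSpend` is a hypothetical helper.

```typescript
// Budget fields as they appear in ~/.modelab/config.json.
interface BudgetConfig {
  maxPerRun: number;
  maxPerExperiment: number;
  trackCosts: boolean;
}

// Assumed semantics: see the paragraph above. All amounts are in USD.
function canSpend(
  budget: BudgetConfig,
  spentThisRunUsd: number,
  estimatedCallUsd: number
): { ok: boolean; reason?: string } {
  if (!budget.trackCosts) return { ok: true };
  if (estimatedCallUsd > budget.maxPerExperiment) {
    return { ok: false, reason: "call exceeds maxPerExperiment" };
  }
  if (spentThisRunUsd + estimatedCallUsd > budget.maxPerRun) {
    return { ok: false, reason: "run would exceed maxPerRun" };
  }
  return { ok: true };
}
```
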

Use as a Library

import { ResearchOrchestrator } from 'modelab';

const orch = new ResearchOrchestrator({
  models: {
    balanced: { provider: 'anthropic', model: 'claude-sonnet-4-6', costPerMillionInput: 3, costPerMillionOutput: 15 },
    reasoning: { provider: 'openai', model: 'o1', costPerMillionInput: 15, costPerMillionOutput: 60 },
  },
  budget: { maxPerRun: 2.0, maxPerExperiment: 0.5, trackCosts: true },
  evalModel: 'balanced',
  parallelism: 3,
  onProgress: msg => console.log(msg),
  onArmComplete: r => console.log(`Arm done: ${r.armId} → ${r.score}/10`),
});

const log = await orch.run({
  id: 'my-goal',
  question: 'What is the optimal block time for Ethereum L2s?',
  goal: 'Provide a technically rigorous analysis',
  qualityThreshold: 7.5,
  maxIterations: 3,
  arms: [
    { id: 'arm-1', name: 'balanced', model: 'balanced', promptTemplate: '...' },
    { id: 'arm-2', name: 'reasoning', model: 'reasoning', promptTemplate: '...' },
  ],
});

console.log(log.bestResult);
console.log(`Total cost: $${log.totalCostUsd}`);

You can also use the planning surface directly:

import { buildResearchPlan } from 'modelab';

Environment Variables

| Variable | Provider |
|---------|---------|
| OPENAI_API_KEY | OpenAI, Groq, OpenRouter, Perplexity |
| ANTHROPIC_API_KEY | Anthropic |
| MINIMAX_API_KEY | MiniMax |
| GROQ_API_KEY | Groq |
| GEMINI_API_KEY | Google Gemini |
| PERPLEXITY_API_KEY | Perplexity |


Architecture

ResearchOrchestrator
  ├── router.ts         — keyword heuristic → best-fit model (GLM-5.0 aware)
  ├── routing_v2.ts    — learned routing: performance-based, reads from lesson_engine
  ├── evaluator.ts     — streaming calls across 10 providers, rate-limit tracking
  ├── scorer.ts        — LLM judge: structured rubric, Zod validation, retry
  ├── orchestrator.ts  — parallel arms, quality gate, budget guard, TTFT tracking
  ├── planner.ts       — recommends the next best experiment from routing + memory + recall
  ├── lesson_engine.ts — self-iteration: parses lessons, writes router adjustments
  ├── embedding_store.ts — TF-IDF / Ollama embeddings for semantic memory
  ├── complexity.ts    — task complexity profiling
  ├── memory.ts        — SQLite: ~/.modelab/memory.db (LCM Memory v2)
  ├── cache.ts         — SHA-256 hash cache: ~/.modelab/cache.json
  ├── templates.ts     — 7 built-in prompt templates
  └── export.ts        — json / markdown / html reports

Links + License

MIT License

Built with teeth. 🌑