
thinktank-ai

v0.1.2

Ensemble AI coding — run N parallel Claude Code agents, select the best result via Copeland pairwise scoring

Run N parallel Claude Code agents on the same task, then select the best result via test execution and Copeland pairwise scoring. Based on the principle that the aggregate of independent attempts outperforms any single attempt — proven in ensemble ML, superforecasting, and LLM code generation research.

Quick start

# Install globally
npm install -g thinktank-ai

# Or run without installing
npx thinktank-ai run "fix the authentication bypass"

# Run 3 parallel agents on a task
thinktank run "fix the authentication bypass"

# Run 5 agents with test verification
thinktank run "fix the race condition" -n 5 -t "npm test"

# Read prompt from a file (avoids shell expansion issues)
thinktank run -f task.md -n 5 -t "npm test"

# Pipe prompt from stdin
echo "refactor the parser" | thinktank run -n 3

# Apply the best result
thinktank apply

# Set persistent defaults
thinktank config set attempts 5
thinktank config set model opus

Requires Claude Code CLI installed and authenticated.

Models

Use --model to select a Claude model: sonnet (default), opus, haiku, or a full model ID like claude-opus-4-6.

Amazon Bedrock: Pass a Bedrock model ID such as anthropic.claude-opus-4-6-v1 and set the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, etc.). See .env.example for details.

How it works

            ┌───────────────┐
            │   Your task   │
            └───────┬───────┘
                    │
         ┌──────────┼──────────┐
         │          │          │
         ▼          ▼          ▼
    ┌─────────┐┌─────────┐┌─────────┐
    │Agent #1 ││Agent #2 ││Agent #3 │
    │worktree ││worktree ││worktree │
    └────┬────┘└────┬────┘└────┬────┘
         │          │          │
         ▼          ▼          ▼
    ┌──────────────────────────────┐
    │     Test & Convergence       │
    │ ┌────────┐┌────────────────┐ │
    │ │npm test││Agents 1,3 agree│ │
    │ └────────┘└────────────────┘ │
    └──────────────┬───────────────┘
                   │
                   ▼
          ┌─────────────────┐
          │   Best result   │
          │   recommended   │
          └─────────────────┘
  1. Spawns N parallel Claude Code agents, each in an isolated git worktree
  2. Each agent independently solves the task (no shared context = true independence)
  3. Runs your test suite on each result
  4. Analyzes convergence — did the agents agree on an approach?
  5. Recommends the best candidate via Copeland pairwise scoring
  6. You review the diff and run thinktank apply

Scoring

The default scoring method is Copeland pairwise ranking. Every agent is compared head-to-head against every other agent across four criteria: tests passed, convergence group size, minimal file scope, and test files contributed. The agent that wins the most pairwise matchups is recommended.
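Copeland scoring is simple to state. A minimal sketch, assuming higher values are better on every criterion and that ties in a matchup score zero; the criterion names and toy values are illustrative, not the tool's implementation:

```python
from itertools import combinations

def copeland(agents: dict) -> dict:
    """agents: name -> {criterion: value}, higher is better on every criterion.
    Each head-to-head matchup is won by whoever wins more criteria;
    the matchup winner gets +1, the loser -1, a tied matchup scores 0."""
    scores = {name: 0 for name in agents}
    for a, b in combinations(agents, 2):
        wins_a = sum(agents[a][c] > agents[b][c] for c in agents[a])
        wins_b = sum(agents[b][c] > agents[a][c] for c in agents[a])
        if wins_a != wins_b:
            winner, loser = (a, b) if wins_a > wins_b else (b, a)
            scores[winner] += 1
            scores[loser] -= 1
    return scores

scores = copeland({
    "#1": {"tests": 1, "converged": 1, "scope": 2, "testcov": 1},
    "#2": {"tests": 1, "converged": 1, "scope": 2, "testcov": 0},
    "#3": {"tests": 1, "converged": 0, "scope": 1, "testcov": 0},
})
best = max(scores, key=scores.get)  # "#1" wins every matchup here
```

Because the score is wins minus losses over all matchups, it can go negative, as in the example output further down.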

An alternative method, --scoring weighted, assigns point values to tests (100), convergence (50), and diff size (10). A third method, Borda count (rank aggregation), is available for comparison via thinktank evaluate.
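With the point values above, the weighted method reduces to a few lines. This sketch assumes each criterion is judged as a boolean and that the diff-size points go to the smallest diff; neither detail is specified here, so treat both as assumptions:

```python
def weighted_score(passed_tests: bool, in_majority_cluster: bool,
                   has_smallest_diff: bool) -> int:
    """Point values from the scoring description: tests 100,
    convergence 50, diff size 10. Boolean criteria are an assumption."""
    return (100 * passed_tests) + (50 * in_majority_cluster) + (10 * has_smallest_diff)

score = weighted_score(True, True, False)  # 150
```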

Use thinktank evaluate to compare how all three scoring methods rank your results. See docs/scoring-evaluation.md for the full analysis.

Why this works

Every major code model with published benchmarks shows pass@5 well above pass@1. The gap between "one attempt" and "best of five" is one of the largest free reliability gains in AI coding, yet almost no tooling exposes it.

| Metric | Single attempt | 5 parallel attempts |
|--------|----------------|---------------------|
| Reliability | Whatever pass@1 gives you | Approaches pass@5 |
| Confidence | "Did it get it right?" | "4/5 agents agree — high confidence" |
| Coverage | One approach explored | Multiple approaches, pick the best |

The key insight: parallel attempts cost more tokens but not more time. All agents run simultaneously.
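For background, pass@k figures like these are conventionally computed with the unbiased estimator from Chen et al. (2021): from n sampled attempts of which c are correct, estimate the probability that a draw of k contains at least one correct attempt. This is standard benchmark math, not part of thinktank:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    the probability that k samples out of n include a correct one."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(10, 3, 1)  # 0.3
p5 = pass_at_k(10, 3, 5)  # much higher than pass@1 for the same model
```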

When to use it

  • High-stakes changes — auth, payments, security, data migrations
  • Ambiguous tasks — multiple valid approaches, need to see the spread
  • Complex refactors — many files, easy to miss something
  • Unfamiliar codebases — multiple attempts reduce the chance of going down the wrong path

Recommended workflows

Two-phase: generate tests, then implement

A single agent can write a wrong test that becomes a false oracle. Use the ensemble to validate your test suite before using it to judge implementations.

Phase 1 — generate tests:

thinktank run "write unit tests for grid.py pathfinding" -n 5 -t "bash run-collect-tests.sh"
thinktank compare 1 2  # compare assertions across agents

If all agents assert the same expected values, the tests are likely correct. If they disagree on a specific assertion (e.g., 3 agents say path length 9, 1 says 13), investigate before proceeding.

Phase 2 — implement:

thinktank apply           # apply the converged test suite
thinktank run "implement A* pathfinding in grid.py" -n 5 -t "python -m pytest"

Why this matters: During development, a single agent wrote a test asserting a shortest path of 13 steps when the correct answer was 9. This wrong test caused 13+ ensemble runs to show 0% pass rate — every agent was right, but the oracle was wrong. Using ensemble test generation would have caught the bad assertion via convergence analysis before it became the ground truth.

Commands

thinktank run [prompt]

Run N parallel agents on a task.

| Flag | Description |
|------|-------------|
| -n, --attempts <N> | Number of parallel agents (default: 3, max: 20) |
| -f, --file <path> | Read prompt from a file |
| -t, --test-cmd <cmd> | Test command to verify results |
| --test-timeout <sec> | Timeout for test command in seconds (default: 120, max: 600) |
| --timeout <sec> | Timeout per agent in seconds (default: 600, max: 1800) |
| --model <model> | Claude model: sonnet, opus, haiku, or full ID |
| -r, --runner <name> | AI coding tool to use (default: claude-code) |
| --scoring <method> | Scoring method: copeland (default) or weighted |
| --threshold <number> | Convergence clustering similarity threshold, 0.0–1.0 (default: 0.3) |
| --whitespace-insensitive | Ignore whitespace in convergence comparison |
| --retry | Re-run only failed/timed-out agents from the last run |
| --no-timeout | Disable agent timeout entirely |
| --output-format <fmt> | Output format: text (default), json, or diff |
| --no-color | Disable colored output |
| --verbose | Show detailed agent output |

thinktank init

Set up thinktank in the current project — checks prerequisites and detects your test command.

thinktank apply

Apply the recommended agent's changes to your working tree.

| Flag | Description |
|------|-------------|
| -a, --agent <N> | Apply a specific agent's result instead of the recommended one |
| -p, --preview | Show the diff without applying |
| -d, --dry-run | Same as --preview (alias) |

thinktank undo

Reverse the last applied diff.

thinktank list [run-number]

List all past runs, or show details for a specific run.

thinktank compare <agentA> <agentB>

Compare two agents' results side by side.

thinktank stats

Show aggregate statistics across all runs.

| Flag | Description |
|------|-------------|
| --model <name> | Filter to runs using a specific model |
| --since <date> | Show runs from this date onward (ISO 8601) |
| --until <date> | Show runs up to this date (ISO 8601) |
| --passed-only | Only runs where at least one agent passed tests |

thinktank evaluate

Compare scoring methods (weighted vs Copeland vs Borda) across all runs to see how they differ in recommendations.

thinktank clean

Remove thinktank worktrees and branches. Add --all to also delete .thinktank/ run history.

thinktank config set|get|list

View and update persistent configuration (stored in .thinktank/config.json).

thinktank config set attempts 5    # persistent default
thinktank config set model opus
thinktank config get attempts
thinktank config list              # show all values

Available keys: attempts, model, timeout, runner, threshold, testTimeout.

Pre-flight checks

Before spawning agents, thinktank validates the environment:

  1. Disk space — warns if there isn't enough room for N worktrees
  2. Test suite — if --test-cmd is set, runs the tests once on the main branch to verify the suite passes before spending tokens on parallel agents
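Both checks above are cheap to sketch. A hedged illustration only: the function name, the per-worktree size estimate, and the warning strings are invented here, not thinktank's actual behavior:

```python
import shutil
import subprocess

def preflight(n_agents: int, bytes_per_worktree: int, test_cmd=None) -> list:
    """Return a list of warning strings; empty means all checks passed."""
    warnings = []
    # 1. Disk space: room for N worktrees (size estimate is hypothetical).
    free = shutil.disk_usage(".").free
    if free < n_agents * bytes_per_worktree:
        warnings.append(f"only {free} bytes free for {n_agents} worktrees")
    # 2. Test suite: run once on the current checkout before spending tokens.
    if test_cmd is not None:
        result = subprocess.run(test_cmd, capture_output=True)
        if result.returncode != 0:
            warnings.append("test suite fails on the main branch")
    return warnings
```

Verifying that the suite passes before spawning agents matters because a broken baseline makes every agent look like a failure, wasting the entire run.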

Example output

thinktank — ensemble AI coding

  Task:     fix the authentication bypass
  Agents:   5 parallel attempts
  Model:    sonnet

Results
────────────────────────────────────────────────────────────

  Agent    Status    Tests   Files   +/-          Time
  ──────────────────────────────────────────────────────────
>> #1      ok        pass    2       +15/-3       45s
  #2      ok        pass    2       +18/-3       52s
  #3      ok        pass    3       +22/-5       61s
  #4      ok        fail    1       +8/-2        38s
  #5      ok        pass    2       +14/-3       47s

Convergence
────────────────────────────────────────────────────────────
  Agents [1, 2, 5]: ████████████████░░░░ 60%
  Strong consensus — 3/5 agents changed the same files
  Files: src/middleware/auth.ts, tests/auth.test.ts

Copeland Pairwise Scoring
────────────────────────────────────────────────────────────
  Agent   Tests     Converge  Scope     TestCov   Copeland
  ──────────────────────────────────────────────────────────
>> #1     +1        +2        0         +1        +4
  #2      +1        +2        0         0         +3
  #3      +1        -3        -4        0         -6
  #4      -4        -3        +4        -1        -4
  #5      +1        +2        0         0         +3

  Recommended: Agent #1 (Copeland winner)

How it compares

| Approach | Reliability | Cost | Speed | Selection |
|----------|-------------|------|-------|-----------|
| Single Claude Code run | pass@1 | 1x | Fastest | N/A |
| thinktank (N=3) | ~pass@3 | 3x | Same wall time | Copeland pairwise |
| thinktank (N=5) | ~pass@5 | 5x | Same wall time | Copeland pairwise |
| Manual retry loop | pass@k (sequential) | kx | k × slower | Manual |

References

Ensemble coding research

  • AlphaCode — DeepMind, 2022. Massive parallel generation + clustering + test-based filtering.
  • CodeT — Microsoft, 2022. Dual execution agreement: generate N solutions + N tests, cross-validate.
  • MBR-Exec — 2022. Minimum Bayes Risk via execution consensus.
  • Self-Consistency — Wang et al., 2022. Majority voting across samples improves over single-pass.

Ensemble theory

  • Superforecasting — Tetlock & Gardner. The aggregate of independent forecasters consistently beats individuals.
  • The Wisdom of Crowds — Surowiecki. Independent estimates, when aggregated, converge on truth.

Technical reports

  • Scoring Method Evaluation — Copeland vs Weighted vs Borda across 21 runs. Key finding: Copeland and Borda agree 86%, weighted disagrees ~40%.