mcp-dyno

v0.2.1

Put your MCP server on the dyno — holistic, LLM-driven analysis of efficiency, cost, context-bloat, correctness, and reliability, with rigorous before/after error bars.

Put your MCP server on the dyno.

mcp-dyno is an open-source CLI that measures how well your Model Context Protocol server performs when an LLM (Claude, GPT, Gemini, or any OpenAI-compatible model) actually drives it. A single run covers the five pillars below, plus a server-ergonomics grade:

  • Efficiency — tokens/task, tool-call & round-trip counts, latency
  • Cost — $/task at real model prices
  • Context-bloat — how much of the window your tool definitions, args, and results actually eat
  • Correctness — task success (LLM-judged)
  • Reliability — pass^k consistency, hallucinated-tool rate, schema adherence, error recovery
  • Server ergonomics — grades your design, not the model: per-tool result-payload weight (what to paginate) and first-call affordance (which descriptions/schemas the model keeps mis-reading) → a fix-list
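To make the pass^k figure in the Reliability pillar concrete, here is a minimal sketch of the estimator as it is usually defined in τ-bench: for each task with n trials and c successes, the chance that k randomly chosen trials all pass is C(c, k) / C(n, k), averaged over tasks. The names `TaskTrials`, `choose`, and `passPowerK` are hypothetical and are not taken from mcp-dyno's code; the tool's own computation may differ.

```typescript
// Hypothetical shape: how many of a task's repeated trials succeeded.
interface TaskTrials {
  successes: number;
  trials: number;
}

// Binomial coefficient C(n, k), built up multiplicatively; 0 when k > n.
function choose(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// pass^k: mean over tasks of the probability that k sampled trials all pass.
function passPowerK(tasks: TaskTrials[], k: number): number {
  if (tasks.length === 0) throw new Error("need at least one task");
  let sum = 0;
  for (const t of tasks) {
    if (k < 1 || k > t.trials) throw new Error("k must be between 1 and the trial count");
    sum += choose(t.successes, k) / choose(t.trials, k);
  }
  return sum / tasks.length;
}

// Three tasks, four trials each: always passes, passes 3 of 4, never passes.
const exampleTasks: TaskTrials[] = [
  { successes: 4, trials: 4 },
  { successes: 3, trials: 4 },
  { successes: 0, trials: 4 },
];
console.log(passPowerK(exampleTasks, 2)); // 0.5
```

Note how the task that passes 3 of 4 trials contributes 0.75 at k = 1 but only 0.5 at k = 2: pass^k penalizes inconsistency more heavily as k grows.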

Then it lets you prove an optimization worked with rigorous before/after paired statistics — not vibes.
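As a rough illustration of what "paired" means here, the sketch below takes one metric measured on the same tasks before and after a change, computes the per-task differences, and reports their mean with a 95% confidence interval. It assumes a normal approximation (z = 1.96), which is only reasonable for a fair number of tasks. The names `Pair` and `pairedDifference` are hypothetical, and this is not mcp-dyno's implementation.

```typescript
// Hypothetical shape: one metric (for example tokens per task) on the same task.
interface Pair {
  base: number;
  head: number;
}

interface PairedResult {
  meanDiff: number;
  stdErr: number;
  ci95: [number, number];
}

function pairedDifference(pairs: Pair[]): PairedResult {
  const n = pairs.length;
  if (n < 2) throw new Error("need at least two paired tasks");

  // Differencing per task cancels task-to-task variation shared by both runs.
  const diffs = pairs.map((p) => p.head - p.base);
  const meanDiff = diffs.reduce((acc, d) => acc + d, 0) / n;

  // Sample variance of the differences (n - 1 in the denominator).
  const variance = diffs.reduce((acc, d) => acc + (d - meanDiff) ** 2, 0) / (n - 1);
  const stdErr = Math.sqrt(variance / n);

  const z = 1.96; // two-sided 95% under the normal approximation (assumption)
  return { meanDiff, stdErr, ci95: [meanDiff - z * stdErr, meanDiff + z * stdErr] };
}

const result = pairedDifference([
  { base: 10, head: 8 },
  { base: 12, head: 9 },
  { base: 9, head: 8 },
  { base: 11, head: 9 },
]);
console.log(result.meanDiff); // -2
console.log(result.ci95);     // both bounds are below zero for this data
```

Because the interval excludes zero, the change in this toy data would count as a real difference rather than run-to-run noise.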

Why

Research benchmarks rank models. Protocol testers check compliance. Eval frameworks make you assemble everything yourself. None of them answer the question an MCP author actually has: "how good is my server under an LLM, and did my change make it better?" That's the gap mcp-dyno fills.

Demo

See demo/ for a real run you can explore without any setup (anonymized) — including the finding that a "cheaper, faster" server was simply doing less: it degraded from 98% correct on a strong driver to 23% on a weak one, while a code-execution server stayed usable on both.

[Figure: comparing two MCP servers across five pillars]

Quick start

# Analyze a local stdio server (auto-generates a task suite from its tools)
npx mcp-dyno analyze --server "node ./build/index.js"

# Bring your own tasks, and score correctness with an LLM judge
npx mcp-dyno analyze --server "node ./build/index.js" --tasks ./dyno-tasks.yaml --judge

# Use a built-in, versioned task corpus (comparable across servers)
npx mcp-dyno analyze --server "node ./build/index.js" --corpus filesystem@1 --judge

# Grade a run (per-pillar) and emit a committable badge
npx mcp-dyno scorecard --badge mcp-dyno-badge.json

# Prove an optimization: before vs after (paired stats)
npx mcp-dyno compare \
  --base "node ./build/index.js"      \
  --head "node ./build-optimized/index.js"

# Explore every run in a local dashboard
npx mcp-dyno view        # → http://localhost:4000

Use it as a CI gate

# Fail the build if a budget is breached…
npx mcp-dyno assert --config dyno.config.json --summary-md "$GITHUB_STEP_SUMMARY"
# …or if the PR shows a statistically resolvable regression vs. the base (noise never fails the build):
npx mcp-dyno compare --base "…" --head "…" --tasks tasks.yaml --fail-on-regression

Both exit non-zero only on real signal. See docs/ci.md for a ready-to-copy GitHub Actions workflow.
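One plausible reading of "noise never fails" is sketched below: a metric counts as regressed only when the whole confidence interval of its head-minus-base difference lies on the worse side of zero. The decision rule and the names `Verdict`, `classify`, and `exitCode` are assumptions made for illustration, not the tool's documented behavior.

```typescript
type Verdict = "regression" | "improvement" | "noise";

// ci95 is the interval for (head - base) on one metric.
// lowerIsBetter is true for metrics like tokens or cost, false for success rate.
function classify(ci95: [number, number], lowerIsBetter: boolean): Verdict {
  const [low, high] = ci95;
  // An interval that contains zero cannot be told apart from noise.
  if (low <= 0 && high >= 0) return "noise";
  const increased = low > 0;
  return increased === lowerIsBetter ? "regression" : "improvement";
}

// Hypothetical gate: non-zero exit only when some metric resolvably regressed.
function exitCode(verdicts: Verdict[]): number {
  return verdicts.some((v) => v === "regression") ? 1 : 0;
}

console.log(classify([0.5, 3.0], true));      // "regression": tokens went up
console.log(classify([-1.0, 1.0], true));     // "noise": interval spans zero
console.log(exitCode(["noise", "improvement"])); // 0
```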

Models & auth

Pick the driver model with --model. A bare id (e.g. claude-sonnet-4-6) is treated as a Claude model; for any other provider, use <provider>/<id>:

npx mcp-dyno analyze --server "…" --model openai/gpt-4o-mini      # needs OPENAI_API_KEY
npx mcp-dyno analyze --server "…" --model google/gemini-2.5-flash # needs GEMINI_API_KEY
npx mcp-dyno analyze --server "…" --model openrouter/meta-llama/llama-3.1-70b-instruct

Providers: anthropic (default), openai, google, openrouter, groq, together, plus any OpenAI-compatible endpoint via <PROVIDER>_BASE_URL. Run the same task set under several models and compare them in the model matrix (dyno view) for a cross-model robustness read.
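The --model convention can be sketched as a small parser: a bare id maps to the default anthropic provider, and anything else splits on the first slash only, so ids that themselves contain slashes (as in the openrouter example) stay intact. `ModelSpec`, `parseModelSpec`, and `baseUrlEnvVar` are hypothetical names, and upper-casing the provider to form the <PROVIDER>_BASE_URL variable is an assumption.

```typescript
interface ModelSpec {
  provider: string;
  id: string;
}

function parseModelSpec(spec: string): ModelSpec {
  const slash = spec.indexOf("/");
  // No slash: a bare id, which the README says is treated as Claude.
  if (slash === -1) return { provider: "anthropic", id: spec };
  const provider = spec.slice(0, slash);
  const id = spec.slice(slash + 1);
  if (provider === "" || id === "") throw new Error(`malformed model spec: ${spec}`);
  return { provider, id };
}

// Assumed naming for the OpenAI-compatible endpoint override.
function baseUrlEnvVar(provider: string): string {
  return `${provider.toUpperCase()}_BASE_URL`;
}

console.log(parseModelSpec("claude-sonnet-4-6"));
// { provider: "anthropic", id: "claude-sonnet-4-6" }
console.log(parseModelSpec("openrouter/meta-llama/llama-3.1-70b-instruct"));
// { provider: "openrouter", id: "meta-llama/llama-3.1-70b-instruct" }
```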

For Claude specifically, two auth paths trade cost for fidelity:

  • --auth cli (subscription): drives via your existing claude CLI sign-in, so there is no Anthropic API spend. Tool definitions are still measured exactly, but Claude Code's own system prompt inflates the billable floor, so context-bloat is labeled as estimated.
  • --auth api (default): our own agent loop over the Anthropic Messages API (ANTHROPIC_API_KEY). Exact token accounting, minimal floor.

Non-Claude providers always drive over their API (no subscription path) with exact usage accounting. The LLM judge (--judge, off by default) and task auto-generation can run on any provider — point --judge-model at a different family (e.g. drive Claude, judge openai/gpt-4o) for cross-family grading.

Config

dyno init scaffolds a dyno.config.json. It can hold a single server block (for analyze), base and head blocks for compare (each with its own env/headers, so you can compare two servers with different secrets/auth), and defaults (model, epochs, tasks, prices, …). Point at it with --config dyno.config.json; explicit CLI flags always win.
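The precedence rule ("explicit CLI flags always win") amounts to a layered lookup. The sketch below shows one way to express it, with a built-in fallback layer added as an assumption; the `Settings` fields are a subset of the defaults named above, and the epoch values are made up for the example.

```typescript
// Hypothetical subset of the settings that can come from config or flags.
interface Settings {
  model: string;
  epochs: number;
}

// Precedence, highest first: CLI flag, config-file default, built-in fallback.
function resolveSettings(
  builtIn: Settings,
  fileDefaults: Partial<Settings>,
  cliFlags: Partial<Settings>,
): Settings {
  return {
    model: cliFlags.model ?? fileDefaults.model ?? builtIn.model,
    epochs: cliFlags.epochs ?? fileDefaults.epochs ?? builtIn.epochs,
  };
}

const resolved = resolveSettings(
  { model: "claude-sonnet-4-6", epochs: 3 }, // assumed built-in values
  { epochs: 5 },                             // from dyno.config.json
  { model: "openai/gpt-4o-mini" },           // from --model on the command line
);
console.log(resolved); // { model: "openai/gpt-4o-mini", epochs: 5 }
```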

Status

Early development. See docs/DESIGN.md for the architecture and roadmap.

Metrics

The five pillars and every metric are described in docs/DESIGN.md, and each metric in the dashboard has a hover tooltip with its definition.

Acknowledgements

mcp-dyno was informed by prior work on LLM evaluation. These methods and ideas are credited below; they were implemented independently (no third-party code is bundled):

  • Anthropic — "Adding Error Bars to Evals" — the paired-difference test, minimum-detectable-effect, and required-n power analysis used in compare.
  • τ-bench — the pass^k reliability framing.
  • The function-calling / tool-use benchmark line (e.g. BFCL, ToolBench) and the emerging MCP-evaluation landscape, which shaped the metric set.
  • The normal quantile uses Acklam's algorithm; the regularized incomplete beta is evaluated via the modified Lentz continued fraction (public-domain mathematics).
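For readers unfamiliar with the power-analysis terms credited above, the sketch below shows the textbook normal-approximation formulas for a paired design: the minimum detectable effect for a given number of tasks, and the number of tasks needed to detect a given effect. It hard-codes the quantiles for a two-sided 5% test at 80% power rather than computing them, and the function names are hypothetical; mcp-dyno's own thresholds and method may differ.

```typescript
const Z_ALPHA = 1.959964; // normal quantile for a two-sided 5% test
const Z_POWER = 0.841621; // normal quantile for 80% power

// Smallest mean paired difference detectable with n tasks,
// given the standard deviation of the per-task differences.
function minimumDetectableEffect(sdOfDiffs: number, n: number): number {
  if (n < 1) throw new Error("n must be at least 1");
  return ((Z_ALPHA + Z_POWER) * sdOfDiffs) / Math.sqrt(n);
}

// Number of paired tasks needed to detect a mean difference of `effect`.
function requiredN(sdOfDiffs: number, effect: number): number {
  if (effect <= 0) throw new Error("effect must be positive");
  return Math.ceil((((Z_ALPHA + Z_POWER) * sdOfDiffs) / effect) ** 2);
}

// With per-task differences varying by about 100 tokens (sd),
// detecting a 40-token change needs 50 paired tasks under these assumptions.
console.log(requiredN(100, 40)); // 50
console.log(minimumDetectableEffect(100, 50) <= 40); // true
```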

Built on the Model Context Protocol SDK and the Anthropic SDK.

License

MIT