ai-agent-benchmark

v1.0.0

10-dimension AI agent benchmark suite — test safety, reasoning, math, knowledge, code, speed, and more

AI Agent Benchmark Suite

I built this because I got tired of not knowing if my AI agents were actually getting better or just... different.

This is a 10-dimension benchmark that tests what actually matters when you're running AI agents in production. Not just "can it answer trivia" — but can it keep secrets safe, remember things across requests, and handle stuff it's never seen before?

What It Tests

| Dimension | What We're Actually Checking |
|-----------|------------------------------|
| Safety | Will it run rm -rf / if you ask nicely? (It shouldn't.) |
| Memory | Can it remember what you told it 2 minutes ago? |
| Planning | Can it break a big task into smaller steps? |
| Reasoning | Basic logic: syllogisms, analogies, causal thinking |
| Math | Not just 2+2, but multi-digit multiplication |
| Knowledge | Factual stuff: periodic table, history, geography |
| Code | Can it read/write JavaScript without hallucinating? |
| Speed | Does it respond in under 5 seconds? |
| Reliability | Same question, same answer, every time |
| Adaptability | Throw something new at it. Can it handle it? |
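To make two of those rows concrete, here is a rough sketch of how the Speed and Reliability checks could be expressed. This is not the code the package ships: the helper names `isFastEnough` and `isConsistent` are hypothetical, and the whitespace and case normalization is my own assumption about what "same answer" should mean.

```javascript
// Hypothetical helper: Speed passes when the response took under the limit.
// The 5000 ms default mirrors the "under 5 seconds" row in the table.
function isFastEnough(elapsedMs, limitMs = 5000) {
  return elapsedMs < limitMs;
}

// Hypothetical helper: Reliability passes when every answer to the same
// question matches the first one. Trimming and lowercasing is an assumption;
// a stricter check could compare the raw strings instead.
function isConsistent(answers) {
  if (answers.length === 0) return false;
  const normalized = answers.map((a) => a.trim().toLowerCase());
  return normalized.every((a) => a === normalized[0]);
}
```

You would call `isConsistent` with the answers collected from asking your agent the same question several times, and `isFastEnough` with the measured round-trip time of a single request.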

Quick Start

Use the hosted API (free)

curl -X POST https://openclaw-benchmark.yagami8095.workers.dev/benchmark \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "endpoint": "https://your-agent.com/v1/task",
    "token": "your-agent-token"
  }'

It'll hit your agent with 9 test questions and give you a score like 78% with per-dimension breakdowns.
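If you'd rather call the hosted API from Node than from curl, something like the sketch below should work. It assumes Node 18+ (for the built-in `fetch`) and that the API answers with JSON. The function names `buildBenchmarkRequest` and `runHostedBenchmark` are made up for this example, and the response fields are not documented here, so the parsed body is returned as-is.

```javascript
// Hosted benchmark URL, taken from the curl example above.
const BENCHMARK_URL =
  "https://openclaw-benchmark.yagami8095.workers.dev/benchmark";

// Build the same request the curl command sends (hypothetical helper).
function buildBenchmarkRequest(apiToken, agentEndpoint, agentToken) {
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ endpoint: agentEndpoint, token: agentToken }),
  };
}

// Send the request and return the parsed body.
// Assumption: the response is JSON; its exact shape is not specified here.
async function runHostedBenchmark(apiToken, agentEndpoint, agentToken) {
  const response = await fetch(
    BENCHMARK_URL,
    buildBenchmarkRequest(apiToken, agentEndpoint, agentToken)
  );
  if (!response.ok) {
    throw new Error(`Benchmark request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Usage would look like `runHostedBenchmark("your-token", "https://your-agent.com/v1/task", "your-agent-token").then(console.log)`.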

Run locally

npm install
node benchmark.js --endpoint http://localhost:3000/v1/task --token your-token

Why These 10 Dimensions?

Honestly, I started with 5 and kept finding gaps. An agent that's great at math but falls for prompt injection? Useless in production. One that's fast but can't remember context? Frustrating for users.

The 10 dimensions came from running agents in production for a few months and noting every time something went wrong. Each failure mode became a test dimension.

How Scoring Works

Every test question is scored pass/fail within its dimension. Your final score is (passed / total) × 100%, where passed and total count questions across all dimensions.
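As a minimal sketch of that formula: the result shape (`{ dimension, passed }`) and the name `scoreResults` are assumptions for illustration, not the package's actual internals, and rounding to a whole percent is my own choice.

```javascript
// results: one entry per test question, e.g. { dimension: "Safety", passed: true }.
// This shape is hypothetical; adapt it to whatever your runner produces.
function scoreResults(results) {
  const perDimension = {};
  let passed = 0;

  for (const result of results) {
    // Create the per-dimension tally on first sight of a dimension.
    const tally = (perDimension[result.dimension] ??= { passed: 0, total: 0 });
    tally.total += 1;
    if (result.passed) {
      tally.passed += 1;
      passed += 1;
    }
  }

  // passed / total * 100, rounded to a whole percent. Every question counts equally.
  const total = results.length;
  const score = total === 0 ? 0 : Math.round((passed / total) * 100);
  return { score, passed, total, perDimension };
}
```

With 7 passes out of 9 questions this gives 7 / 9 * 100 ≈ 77.8, which rounds to 78.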

We don't weight dimensions differently (yet) because honestly, it depends on your use case. A chatbot needs Memory more than Math. A code assistant needs Code more than Adaptability. Maybe I'll add custom weights later.

Built With

  • Cloudflare Workers — hosting (free tier)
  • Ollama — local GPU inference for testing
  • Node.js — because it's what I know best

Contributing

Found a test case that trips up your agent in a way I haven't covered? Open an issue. I'm particularly interested in:

  • Edge cases in Safety testing
  • Multi-language Adaptability tests
  • Real-world Reasoning scenarios

License

MIT — use it however you want.


This started as a weekend project and somehow became the thing I run every time I push a change to my agents. Hope it's useful for you too.