ai-agent-benchmark

v1.0.0

Published

3 months ago

10-dimension AI agent benchmark suite — test safety, reasoning, math, knowledge, code, speed, and more

Vulnerabilities: 0 High, 0 Medium, 0 Low

yedanyagami

ai benchmark agent llm mcp testing evaluation agi safety

AI Agent Benchmark Suite

I built this because I got tired of not knowing if my AI agents were actually getting better or just... different.

This is a 10-dimension benchmark that tests what actually matters when you're running AI agents in production. Not just "can it answer trivia" — but can it keep secrets safe, remember things across requests, and handle stuff it's never seen before?

What It Tests

| Dimension | What We're Actually Checking |
|-----------|------------------------------|
| Safety | Will it run `rm -rf /` if you ask nicely? (It shouldn't.) |
| Memory | Can it remember what you told it 2 minutes ago? |
| Planning | Can it break a big task into smaller steps? |
| Reasoning | Basic logic: syllogisms, analogies, causal thinking |
| Math | Not just 2+2, but multi-digit multiplication |
| Knowledge | Factual stuff: periodic table, history, geography |
| Code | Can it read/write JavaScript without hallucinating? |
| Speed | Does it respond in under 5 seconds? |
| Reliability | Same question, same answer, every time |
| Adaptability | Throw something new at it. Can it handle it? |
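To make the Speed and Reliability rows concrete, here is a rough sketch of how checks like these could be written. It is not the code in `benchmark.js`: `askAgent` is a hypothetical async function that sends one question to your agent and resolves to its answer as a string, and the 3-run default is an assumption. Only the 5-second limit comes from the table.

```javascript
// Hypothetical helpers, not taken from benchmark.js.
// askAgent(question) is assumed to return a Promise<string>.

// Speed: time a single call and compare it to the 5-second limit from the table.
async function timedCall(askAgent, question, limitMs = 5000) {
  const start = Date.now();
  const answer = await askAgent(question);
  const elapsedMs = Date.now() - start;
  return { answer, elapsedMs, withinLimit: elapsedMs < limitMs };
}

// Reliability: ask the same question several times and require identical answers
// (after trimming whitespace). The number of runs is an assumption.
async function isReliable(askAgent, question, runs = 3) {
  const answers = [];
  for (let i = 0; i < runs; i++) {
    answers.push((await askAgent(question)).trim());
  }
  return answers.every((a) => a === answers[0]);
}
```

Exact string matching is the strictest way to read "same answer"; a real check might normalize case or compare only the extracted final answer.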

Quick Start

Use the hosted API (free)

curl -X POST https://openclaw-benchmark.yagami8095.workers.dev/benchmark \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "endpoint": "https://your-agent.com/v1/task",
    "token": "your-agent-token"
  }'

It'll hit your agent with 9 test questions and give you a score like 78% with per-dimension breakdowns.
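If you'd rather call the hosted API from Node.js than from curl, the same request could look roughly like this. It's a sketch that assumes Node.js 18+ (for the global `fetch`) and assumes the response body is JSON; the response shape isn't documented here, so it is returned as-is. The function names are made up for illustration.

```javascript
// Same URL as the curl example above.
const BENCHMARK_URL = "https://openclaw-benchmark.yagami8095.workers.dev/benchmark";

// Builds the fetch options. Kept separate so it can be inspected without a network call.
function buildBenchmarkRequest(apiToken, agentEndpoint, agentToken) {
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ endpoint: agentEndpoint, token: agentToken }),
  };
}

// Sends the request. Assumes the API answers with JSON (not confirmed by this README).
async function runBenchmark(apiToken, agentEndpoint, agentToken) {
  const res = await fetch(
    BENCHMARK_URL,
    buildBenchmarkRequest(apiToken, agentEndpoint, agentToken)
  );
  if (!res.ok) {
    throw new Error(`Benchmark request failed: HTTP ${res.status}`);
  }
  return res.json();
}
```

Usage would be something like `runBenchmark("your-token", "https://your-agent.com/v1/task", "your-agent-token").then(console.log)`.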

Run locally

npm install
node benchmark.js --endpoint http://localhost:3000/v1/task --token your-token

Why These 10 Dimensions?

Honestly, I started with 5 and kept finding gaps. An agent that's great at math but falls for prompt injection? Useless in production. One that's fast but can't remember context? Frustrating for users.

The 10 dimensions came from running agents in production for a few months and noting every time something went wrong. Each failure mode became a test dimension.

How Scoring Works

Each test question is graded pass/fail and counted toward its dimension. Your final score is (passed / total) × 100, so 7 passes out of 9 questions comes out to about 78%.
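The scoring rule above is simple enough to sketch in a few lines. The result shape (`{ dimension, passed }`) and the function name are assumptions for illustration, not the actual internals of `benchmark.js`, and rounding to a whole percent is a guess based on scores like 78%.

```javascript
// results: one entry per test question, e.g. { dimension: "Safety", passed: true }.
// This shape is hypothetical; it only illustrates passed / total * 100.
function computeScore(results) {
  const perDimension = {};
  let passed = 0;
  for (const r of results) {
    // Create the per-dimension counter on first sight of a dimension.
    const d = (perDimension[r.dimension] ??= { passed: 0, total: 0 });
    d.total += 1;
    if (r.passed) {
      d.passed += 1;
      passed += 1;
    }
  }
  const total = results.length;
  // Avoid dividing by zero when there are no results.
  const score = total === 0 ? 0 : Math.round((passed / total) * 100);
  return { score, passed, total, perDimension };
}
```

With 9 results of which 7 passed, `computeScore(results).score` is 78, since 7 / 9 is roughly 77.8%.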

We don't weight dimensions differently (yet) because honestly, it depends on your use case. A chatbot needs Memory more than Math. A code assistant needs Code more than Adaptability. Maybe I'll add custom weights later.
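Until custom weights exist, you can apply your own on top of the per-dimension results. This is a hypothetical sketch, not a feature of the benchmark: the `perDimension` shape, the `weights` object, and the default weight of 1 are all assumptions.

```javascript
// perDimension: { Memory: { passed: 1, total: 1 }, Math: { passed: 0, total: 1 } }
// weights:      { Memory: 3 }  (dimensions not listed fall back to defaultWeight)
// Both shapes are made up for this example.
function weightedScore(perDimension, weights, defaultWeight = 1) {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const [name, { passed, total }] of Object.entries(perDimension)) {
    if (total === 0) continue; // skip dimensions with no questions
    const w = weights[name] ?? defaultWeight;
    weightedSum += w * (passed / total); // weight times the dimension's pass rate
    weightTotal += w;
  }
  return weightTotal === 0 ? 0 : Math.round((weightedSum / weightTotal) * 100);
}
```

A chatbot owner might pass `{ Memory: 3 }` so a Memory failure costs three times as much as a Math failure. Note that this averages per-dimension pass rates, so with equal weights it only matches the unweighted score when every dimension has the same number of questions.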

Built With

Cloudflare Workers — hosting (free tier)
Ollama — local GPU inference for testing
Node.js — because it's what I know best

Contributing

Found a test case that trips up your agent in a way I haven't covered? Open an issue. I'm particularly interested in:

Edge cases in Safety testing
Multi-language Adaptability tests
Real-world Reasoning scenarios

License

MIT — use it however you want.

This started as a weekend project and somehow became the thing I run every time I push a change to my agents. Hope it's useful for you too.
