ai-agent-benchmark
v1.0.0
10-dimension AI agent benchmark suite — test safety, reasoning, math, knowledge, code, speed, and more
AI Agent Benchmark Suite
I built this because I got tired of not knowing if my AI agents were actually getting better or just... different.
This is a 10-dimension benchmark that tests what actually matters when you're running AI agents in production. Not just "can it answer trivia" — but can it keep secrets safe, remember things across requests, and handle stuff it's never seen before?
What It Tests
| Dimension | What We're Actually Checking |
|-----------|------------------------------|
| Safety | Will it run `rm -rf /` if you ask nicely? (It shouldn't.) |
| Memory | Can it remember what you told it 2 minutes ago? |
| Planning | Can it break a big task into smaller steps? |
| Reasoning | Basic logic — syllogisms, analogies, causal thinking |
| Math | Not just 2+2, but multi-digit multiplication |
| Knowledge | Factual stuff — periodic table, history, geography |
| Code | Can it read/write JavaScript without hallucinating? |
| Speed | Does it respond in under 5 seconds? |
| Reliability | Same question, same answer, every time |
| Adaptability | Throw something new at it — can it handle it? |
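
To make two of the rows above concrete, here's a minimal sketch of how Speed and Reliability checks could be written. This is not the suite's actual test code: `askAgent` is a hypothetical function standing in for whatever calls your agent, and comparing trimmed, lowercased strings is just one assumed way to decide that two answers are "the same". The 5-second limit comes from the table.

```javascript
// Hypothetical sketch: `askAgent` is any async function (question) => answer.
const SPEED_LIMIT_MS = 5000; // "under 5 seconds" from the Speed row

// Speed: time one call and pass only if it finishes under the limit.
async function checkSpeed(askAgent, question, limitMs = SPEED_LIMIT_MS) {
  const start = Date.now();
  await askAgent(question);
  const elapsedMs = Date.now() - start;
  return { passed: elapsedMs < limitMs, elapsedMs };
}

// Reliability: ask the same question several times and require identical answers.
// Assumption: answers are compared as trimmed, lowercased strings.
async function checkReliability(askAgent, question, runs = 3) {
  const answers = [];
  for (let i = 0; i < runs; i++) {
    const answer = await askAgent(question);
    answers.push(String(answer).trim().toLowerCase());
  }
  const passed = answers.every((a) => a === answers[0]);
  return { passed, answers };
}
```

Strict string equality is a harsh bar for agents that rephrase their answers, so a real check would probably need something looser for free-form replies.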
Quick Start
Use the hosted API (free)
curl -X POST https://openclaw-benchmark.yagami8095.workers.dev/benchmark \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "endpoint": "https://your-agent.com/v1/task",
    "token": "your-agent-token"
  }'

It'll hit your agent with 9 test questions and give you a score like 78% with per-dimension breakdowns.
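
If you'd rather call the hosted API from Node than from curl, the same request could look roughly like this. It's a sketch that assumes Node 18+ (for the built-in `fetch`) and that the response body is JSON. The response shape isn't documented here, so the code just returns whatever comes back. `buildBenchmarkRequest` and `runHostedBenchmark` are names made up for this example.

```javascript
const BENCHMARK_URL =
  "https://openclaw-benchmark.yagami8095.workers.dev/benchmark";

// Build the same POST request the curl command sends.
function buildBenchmarkRequest(apiToken, agentEndpoint, agentToken) {
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ endpoint: agentEndpoint, token: agentToken }),
  };
}

// Send the request and return the parsed body (assumed to be JSON).
async function runHostedBenchmark(apiToken, agentEndpoint, agentToken) {
  const request = buildBenchmarkRequest(apiToken, agentEndpoint, agentToken);
  const res = await fetch(BENCHMARK_URL, request);
  if (!res.ok) {
    throw new Error(`Benchmark request failed: HTTP ${res.status}`);
  }
  return res.json();
}
```

Calling `runHostedBenchmark("your-token", "https://your-agent.com/v1/task", "your-agent-token").then(console.log)` would then print the result.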
Run locally
npm install
node benchmark.js --endpoint http://localhost:3000/v1/task --token your-token

Why These 10 Dimensions?
Honestly, I started with 5 and kept finding gaps. An agent that's great at math but falls for prompt injection? Useless in production. One that's fast but can't remember context? Frustrating for users.
The 10 dimensions came from running agents in production for a few months and noting every time something went wrong. Each failure mode became a test dimension.
How Scoring Works
Every test question is graded pass/fail within its dimension. Your final score is (passed / total) × 100, shown as a percentage: passing 7 of 9 questions gives about 78%.
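
As a rough illustration of that formula, the sketch below computes the overall score and a per-dimension breakdown from a list of pass/fail results. The result shape (`{ dimension, passed }`) and the function name `scoreResults` are assumptions for the example, not the suite's actual data format.

```javascript
// Assumed input: one entry per test question,
// e.g. { dimension: "Math", passed: true }.
function scoreResults(results) {
  const byDimension = {};
  let passed = 0;

  for (const r of results) {
    if (!byDimension[r.dimension]) {
      byDimension[r.dimension] = { passed: 0, total: 0 };
    }
    byDimension[r.dimension].total += 1;
    if (r.passed) {
      byDimension[r.dimension].passed += 1;
      passed += 1;
    }
  }

  // passed / total * 100, rounded to a whole percent; 0 if there are no results.
  const total = results.length;
  const score = total ? Math.round((passed / total) * 100) : 0;
  return { score, passed, total, byDimension };
}
```

Rounding to a whole percent is a choice made here for readability. Keep the raw ratio if you need to compare runs that differ by less than a point.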
We don't weight dimensions differently (yet) because honestly, it depends on your use case. A chatbot needs Memory more than Math. A code assistant needs Code more than Adaptability. Maybe I'll add custom weights later.
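
Custom weights aren't implemented, but if you want to experiment with them in your own scripts, one possible approach is a weighted average of per-dimension pass rates. Everything here is hypothetical: the `weights` map, the default weight of 1, and the `weightedScore` name are choices made for this sketch.

```javascript
// Hypothetical: byDimension maps a dimension name to { passed, total };
// weights maps a dimension name to a non-negative number (missing = 1).
function weightedScore(byDimension, weights = {}) {
  let weightedSum = 0;
  let weightTotal = 0;

  for (const [name, { passed, total }] of Object.entries(byDimension)) {
    if (total === 0) continue; // skip dimensions with no questions
    const w = weights[name] ?? 1;
    weightedSum += w * (passed / total);
    weightTotal += w;
  }

  // Weighted mean of pass rates as a whole percent; 0 if nothing was weighted.
  return weightTotal ? Math.round((weightedSum / weightTotal) * 100) : 0;
}
```

For a chatbot you might pass something like `{ Memory: 2, Math: 1 }`. Note that even with equal weights this averages pass rates per dimension, so it can differ from the plain passed / total score when dimensions have different numbers of questions.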
Built With
- Cloudflare Workers — hosting (free tier)
- Ollama — local GPU inference for testing
- Node.js — because it's what I know best
Contributing
Found a test case that trips up your agent in a way I haven't covered? Open an issue. I'm particularly interested in:
- Edge cases in Safety testing
- Multi-language Adaptability tests
- Real-world Reasoning scenarios
License
MIT — use it however you want.
This started as a weekend project and somehow became the thing I run every time I push a change to my agents. Hope it's useful for you too.
