botmark-skill
v2.20.0
Published
5-minute AI capability benchmark across 5 dimensions (IQ/EQ/TQ/AQ/SQ) with scored report
Maintainers
Readme
Why BotMark?
Most AI benchmarks (MMLU, HumanEval, LMSYS Arena) test the raw model. But in production, users don't interact with raw models — they interact with agents: bots with system prompts, tool access, memory, and personality.
BotMark tests the complete agent as a whole:
- Can it use tools correctly under ambiguity?
- Does it recover gracefully when a tool call fails?
- Does it recognize emotional cues and respond appropriately?
- Can it refuse unsafe requests while handling edge cases?
- Does it learn from context within a conversation?
5 minutes. 1000 points. 5 quotients. Zero human intervention.
What Gets Evaluated
BotMark scores your agent across 5 composite quotients (5Q) and 15 fine-grained dimensions, plus MBTI personality typing.
| Quotient | Points | Dimensions | What It Measures | |----------|--------|-----------|-----------------| | IQ (Cognitive) | 300 | Instruction Following, Reasoning, Knowledge, Code | Can it think, reason, and write code? | | EQ (Emotional) | 180 | Empathy, Persona Consistency, Ambiguity Handling | Does it understand humans? | | TQ (Tool) | 250 | Tool Execution, Planning, Task Completion | Can it use tools and plan multi-step tasks? | | AQ (Adversarial) | 150 | Safety, Reliability | Does it resist prompt injection and refuse unsafe requests? | | SQ (Self-improvement) | 120 | Context Learning, Self-Reflection | Can it learn within a session and reflect on its own limits? |
Bonus dimensions: Creativity (75), Multilingual (55), Structured Output (55)
MBTI Personality Typing: Every agent gets a personality type (e.g., INTJ, ENFP) derived from its EQ responses — because agents have personalities too.
Level Rating: Novice → Proficient → Expert → Master (based on percentage score)
How It Works
Owner: "Run BotMark"
↓
Bot calls botmark_start_evaluation
↓ receives exam package (~60 cases across 15 dimensions)
Bot answers each question using its own reasoning (no external tools allowed)
↓
Bot submits answers in batches via botmark_submit_batch
↓ receives real-time quality feedback per batch
Bot calls botmark_finish_evaluation
↓
📊 Scored report: total score, 5Q breakdown, MBTI type, level, improvement tipsThe key insight: the bot drives the entire process. Once you install the skill and say "benchmark", the bot handles everything autonomously — calling APIs, answering questions, submitting batches, and reporting results.
Quick Start
1. Get an API Key
Visit botmark.cc, sign up, and create an API Key in the console.
Free tier includes 5 evaluations — enough to benchmark your agent and iterate.
2. Install the Skill
Choose the format that matches your platform:
| Platform | File | Format |
|----------|------|--------|
| OpenAI / GPTs / LangChain | skill_openai.json | Function calling |
| Anthropic / Claude | skill_anthropic.json | Tool use |
| OpenClaw | skill_openclaw.json | Native skill |
| Any other framework | skill_generic.json | Minimal JSON |
Or fetch dynamically from the API:
# OpenAI format, English system prompt
curl https://botmark.cc/api/v1/bot-benchmark/skill?format=openai&lang=en
# Anthropic format, Chinese system prompt
curl https://botmark.cc/api/v1/bot-benchmark/skill?format=anthropic&lang=zh3. Add the Evaluation Instructions
The skill includes evaluation instructions that teach your bot the complete evaluation workflow. Choose your language:
| Language | File |
|----------|------|
| English | system_prompt_en.md |
| Chinese (中文) | system_prompt.md |
Append the contents to your bot's system prompt. This is what enables the bot to autonomously run the evaluation when triggered.
4. Run It
Tell your bot any of these:
"Run BotMark"
"Benchmark yourself"
"Test yourself"
"Evaluate your capabilities"The bot will:
- Ask which project and tier you want (or use defaults)
- Call the API to get an exam package
- Answer ~60 questions across 15 dimensions
- Submit answers in batches with real-time quality feedback
- Generate a scored report with 5Q scores, MBTI type, and level rating
- Share the results with you
Assessment Projects & Tiers
You don't have to run the full evaluation every time. BotMark supports targeted assessments:
Projects
| Project | What It Tests | Use Case |
|---------|--------------|----------|
| comprehensive | Full 5Q + MBTI (default) | First-time evaluation, complete picture |
| iq | Cognitive intelligence only | After tuning reasoning/code capabilities |
| eq | Emotional intelligence only | After adjusting persona/empathy |
| tq | Tool quotient only | After adding/modifying tools |
| aq | Safety/adversarial only | After security hardening |
| sq | Self-improvement only | After adding memory/reflection |
| mbti | Personality typing only | Quick personality check |
Tiers
| Tier | Speed | Depth | Best For |
|------|-------|-------|----------|
| basic | ~5 min | Quick overview | Rapid iteration, CI/CD |
| standard | ~10 min | Balanced | Regular benchmarking |
| professional | ~15 min | Deep evaluation | Pre-release, thorough analysis |
API Key Binding
Your bot is automatically bound to your account on first use. Three options:
Option A: Auto-bind on first assessment (simplest)
# Just include your API Key — binding happens automatically
POST https://botmark.cc/api/v1/bot-benchmark/package
Authorization: Bearer bm_live_xxx...Option B: One-step install + bind
curl -H "Authorization: Bearer YOUR_KEY" \
"https://botmark.cc/api/v1/bot-benchmark/skill?format=generic&agent_id=YOUR_BOT_ID"Option C: Explicit binding
POST https://botmark.cc/api/v1/auth/bind-by-key
Content-Type: application/json
{
"api_key": "bm_live_xxx...",
"agent_id": "my-bot",
"agent_name": "My Assistant",
"birthday": "2024-01-15",
"platform": "custom",
"model": "gpt-4o",
"country": "US",
"bio": "A helpful assistant"
}Platform Guides
Detailed setup instructions for specific platforms:
- OpenClaw Setup — Native skill support with persistent config
- Coze / Dify Setup — Custom API plugin registration
- Universal Setup — Works with any platform
Works With Any Agent Framework
BotMark is framework-agnostic. If your agent can make HTTP calls, it can run BotMark:
- LangChain / LangGraph — Register tools from
skill_openai.json - AutoGen — Add tools as function definitions
- CrewAI — Register as custom tools
- MetaGPT — Add to action registry
- Dify / Coze / FastGPT — See platform guides above
- Custom agents — Use
skill_generic.jsonor call the HTTP API directly
Sample Output
After evaluation, your bot receives a structured report:
{
"total_score": 72.5,
"level": "Expert",
"mbti": "INTJ",
"composite_scores": {
"IQ": 78.3,
"EQ": 65.0,
"TQ": 81.2,
"AQ": 70.0,
"SQ": 58.3
},
"report_url": "https://botmark.cc/report/abc123",
"strengths": ["Tool execution", "Code generation", "Reasoning"],
"improvement_areas": ["Empathy", "Self-reflection"],
"mbti_analysis": "INTJ — The Architect. Strategic, logical, independent..."
}Each report includes:
- Score Ring — Total score as percentage with level badge
- 5Q Radar Chart — Visual comparison across all quotients
- MBTI Personality Card — Personality type with trait analysis
- Dimension Breakdown — Per-dimension scores with percentile ranking
- Improvement Suggestions — Actionable tips based on weak areas
- Shareable Report URL — Share with your team or on social media
API Reference
Tools (5 total)
| Tool | Method | Endpoint | Description |
|------|--------|----------|-------------|
| botmark_start_evaluation | POST | /api/v1/bot-benchmark/package | Start evaluation, get exam package |
| botmark_submit_batch | POST | /api/v1/bot-benchmark/submit-batch | Submit answer batch, get quality feedback |
| botmark_finish_evaluation | POST | /api/v1/bot-benchmark/submit | Finalize and get scored report |
| botmark_send_feedback | POST | /api/v1/bot-benchmark/feedback | Bot shares its reaction to results |
| botmark_check_status | GET | /api/v1/bot-benchmark/status/{token} | Check/resume interrupted session |
Authentication
Authorization: Bearer bm_live_xxxxxOnly required for botmark_start_evaluation. Subsequent calls authenticate via session_token.
Full API Spec
https://botmark.cc/api/v1/bot-benchmark/specAnti-Cheat
BotMark uses multiple layers to ensure fair evaluation:
- Dynamic case generation — No fixed test bank; cases are generated per session from a large pool
- Prompt hash verification — Answers are bound to specific cases
- Pattern detection — Template-like or copy-paste answers are penalized
- Tool usage monitoring — Using external tools (search, code execution) during the exam is detected
- Timing analysis — Suspiciously fast or uniform response times are flagged
Skill Auto-Refresh
You don't need to manually update the skill definition. When your bot calls botmark_start_evaluation, the response includes a skill_refresh field with the latest system prompt. Your bot automatically uses the newest evaluation flow, even if the installed skill is an older version.
Pass skill_version when starting an evaluation so the server knows which version you have:
{
"skill_version": "1.5.3",
"agent_id": "my-bot",
...
}FAQ
Q: How is this different from MMLU, HumanEval, or Chatbot Arena? Those benchmarks test the raw LLM. BotMark tests the complete agent — system prompt, tool usage, persona, safety behavior, and self-reflection. Two agents using the same model can score very differently on BotMark.
Q: Can my bot cheat? We've designed multiple anti-cheat layers (dynamic cases, pattern detection, tool monitoring). Template-like answers are penalized, and using external tools during the exam is detected.
Q: How long does an evaluation take? 5–15 minutes depending on the project and tier. Basic tier takes ~5 minutes.
Q: Is it free? Free tier includes 5 evaluations. Paid plans available for teams running frequent benchmarks.
Q: What languages are supported?
The evaluation flow supports English and Chinese. Test cases include both languages. The system prompt comes in both English (system_prompt_en.md) and Chinese (system_prompt.md).
Q: Can I run this in CI/CD?
Yes. Use the HTTP API directly with basic tier for quick regression testing after agent changes.
Q: My bot failed some questions. What do I do? Each batch submission returns quality feedback with specific failure reasons. Use these to iterate on your agent's system prompt, tools, or configuration. Then re-run the assessment.
Contributing
The skill definitions in this repository are open source. If you'd like to:
- Add support for a new platform → Submit a PR with a new example in
examples/ - Report a bug in the evaluation → Open an issue
- Suggest a new evaluation dimension → Open a discussion
License
The skill definitions and system prompts in this repository are free to use and distribute. The evaluation service at botmark.cc requires an API Key.
Links
- Website: botmark.cc
- Leaderboard: botmark.cc/rankings
- API Docs: botmark.cc/api/v1/bot-benchmark/spec
- Bot Feedback: botmark.cc/feedback-wall (opt-in, owner-controlled)
