botmark-skill

v2.20.0

Published

3 months ago

5-minute AI capability benchmark across 5 dimensions (IQ/EQ/TQ/AQ/SQ) with scored report


botmark

botmark benchmark ai agent evaluation openclaw skill

Why BotMark?

Most AI benchmarks (MMLU, HumanEval, LMSYS Arena) test the raw model. But in production, users don't interact with raw models — they interact with agents: bots with system prompts, tool access, memory, and personality.

BotMark tests the complete agent as a whole:

Can it use tools correctly under ambiguity?
Does it recover gracefully when a tool call fails?
Does it recognize emotional cues and respond appropriately?
Can it refuse unsafe requests while handling edge cases?
Does it learn from context within a conversation?

5 minutes. 1000 points. 5 quotients. Zero human intervention.

What Gets Evaluated

BotMark scores your agent across 5 composite quotients (5Q) and 15 fine-grained dimensions, plus MBTI personality typing.

| Quotient | Points | Dimensions | What It Measures |
|----------|--------|------------|------------------|
| IQ (Cognitive) | 300 | Instruction Following, Reasoning, Knowledge, Code | Can it think, reason, and write code? |
| EQ (Emotional) | 180 | Empathy, Persona Consistency, Ambiguity Handling | Does it understand humans? |
| TQ (Tool) | 250 | Tool Execution, Planning, Task Completion | Can it use tools and plan multi-step tasks? |
| AQ (Adversarial) | 150 | Safety, Reliability | Does it resist prompt injection and refuse unsafe requests? |
| SQ (Self-improvement) | 120 | Context Learning, Self-Reflection | Can it learn within a session and reflect on its own limits? |

Bonus dimensions: Creativity (75), Multilingual (55), Structured Output (55)
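The point allocation above can be checked with a few lines of arithmetic. The sketch below only restates the numbers from the table and the bonus list; how bonus points combine with the 1000 core points is not stated here, so the two totals are kept separate.

```python
# Maximum points per quotient, copied from the table above.
CORE_POINTS = {"IQ": 300, "EQ": 180, "TQ": 250, "AQ": 150, "SQ": 120}
# Bonus dimensions, copied from the list above.
BONUS_POINTS = {"Creativity": 75, "Multilingual": 55, "Structured Output": 55}

core_total = sum(CORE_POINTS.values())    # 1000
bonus_total = sum(BONUS_POINTS.values())  # 185


def share(quotient: str) -> float:
    """Fraction of the core total carried by one quotient."""
    return CORE_POINTS[quotient] / core_total


for name in CORE_POINTS:
    print(f"{name}: {CORE_POINTS[name]} points, {share(name):.0%} of core")
```

IQ and TQ together carry more than half of the core points, so changes to reasoning and tool handling move the total the most.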

MBTI Personality Typing: Every agent gets a personality type (e.g., INTJ, ENFP) derived from its EQ responses — because agents have personalities too.

Level Rating: Novice → Proficient → Expert → Master (based on percentage score)

How It Works

Owner: "Run BotMark"
    ↓
Bot calls botmark_start_evaluation
    ↓ receives exam package (~60 cases across 15 dimensions)
Bot answers each question using its own reasoning (no external tools allowed)
    ↓
Bot submits answers in batches via botmark_submit_batch
    ↓ receives real-time quality feedback per batch
Bot calls botmark_finish_evaluation
    ↓
📊 Scored report: total score, 5Q breakdown, MBTI type, level, improvement tips

The key insight: the bot drives the entire process. Once you install the skill and say "benchmark", the bot handles everything autonomously — calling APIs, answering questions, submitting batches, and reporting results.
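The flow above can be sketched as a small driver loop. This is a minimal illustration, not the skill's actual implementation: the endpoint paths come from the API Reference, but the field names `cases`, `id`, `answers`, and `case_id`, the placement of `session_token` in the request body, and the batch size are all assumptions. The HTTP transport is passed in as a function so the logic can be exercised without network access.

```python
from typing import Callable, Dict, List

BASE = "https://botmark.cc/api/v1/bot-benchmark"

# Transport signature: (url, json_body, headers) -> parsed JSON response.
Post = Callable[[str, dict, Dict[str, str]], dict]


def run_evaluation(post: Post, api_key: str,
                   answer: Callable[[dict], str],
                   batch_size: int = 10) -> dict:
    """Drive one evaluation: start, submit answers in batches, finish."""
    # Only the first call carries the API key.
    package = post(f"{BASE}/package", {},
                   {"Authorization": f"Bearer {api_key}"})
    token = package["session_token"]      # field name assumed
    cases: List[dict] = package["cases"]  # field name assumed

    for start in range(0, len(cases), batch_size):
        # Hypothetical answer shape: one entry per case in the batch.
        batch = [{"case_id": case["id"], "answer": answer(case)}
                 for case in cases[start:start + batch_size]]
        post(f"{BASE}/submit-batch",
             {"session_token": token, "answers": batch}, {})

    # Finalize and return the scored report.
    return post(f"{BASE}/submit", {"session_token": token}, {})
```

In a real agent, `answer` would be the bot's own reasoning step and `post` a thin wrapper around an HTTP client.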

Quick Start

1. Get an API Key

Visit botmark.cc, sign up, and create an API Key in the console.

Free tier includes 5 evaluations — enough to benchmark your agent and iterate.

2. Install the Skill

Choose the format that matches your platform:

| Platform | File | Format |
|----------|------|--------|
| OpenAI / GPTs / LangChain | skill_openai.json | Function calling |
| Anthropic / Claude | skill_anthropic.json | Tool use |
| OpenClaw | skill_openclaw.json | Native skill |
| Any other framework | skill_generic.json | Minimal JSON |

Or fetch dynamically from the API:

# OpenAI format, English system prompt
# (quote the URL so the shell does not treat & as a background operator)
curl "https://botmark.cc/api/v1/bot-benchmark/skill?format=openai&lang=en"

# Anthropic format, Chinese system prompt
curl "https://botmark.cc/api/v1/bot-benchmark/skill?format=anthropic&lang=zh"
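If you fetch the skill from code rather than from a shell, building the query string programmatically avoids quoting mistakes. A minimal sketch, using only the `format` and `lang` parameters shown above; the helper name `skill_url` is made up for this example.

```python
from urllib.parse import urlencode

SKILL_ENDPOINT = "https://botmark.cc/api/v1/bot-benchmark/skill"


def skill_url(fmt: str = "openai", lang: str = "en") -> str:
    """Build the skill download URL for a given format and prompt language."""
    return f"{SKILL_ENDPOINT}?{urlencode({'format': fmt, 'lang': lang})}"


print(skill_url("anthropic", "zh"))
# https://botmark.cc/api/v1/bot-benchmark/skill?format=anthropic&lang=zh
```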

3. Add the Evaluation Instructions

The skill includes evaluation instructions that teach your bot the complete evaluation workflow. Choose your language:

| Language | File |
|----------|------|
| English | system_prompt_en.md |
| Chinese (中文) | system_prompt.md |

Append the contents to your bot's system prompt. This is what enables the bot to autonomously run the evaluation when triggered.
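Appending can be done with plain string concatenation. The sketch below assumes your bot's system prompt is available as a string and that the instruction file sits next to your script; both function names are hypothetical.

```python
from pathlib import Path


def append_instructions(base_prompt: str, instructions: str) -> str:
    """Return the base prompt followed by the evaluation instructions."""
    return base_prompt.rstrip() + "\n\n" + instructions.strip() + "\n"


def load_and_append(base_prompt: str,
                    path: str = "system_prompt_en.md") -> str:
    """Read the instruction file from disk and append it to the base prompt."""
    return append_instructions(base_prompt,
                               Path(path).read_text(encoding="utf-8"))
```

Keeping your own prompt first and the BotMark instructions last mirrors the "append" wording above and leaves your existing persona text untouched.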

4. Run It

Tell your bot any of these:

"Run BotMark"
"Benchmark yourself"
"Test yourself"
"Evaluate your capabilities"

The bot will:

Ask which project and tier you want (or use defaults)
Call the API to get an exam package
Answer ~60 questions across 15 dimensions
Submit answers in batches with real-time quality feedback
Generate a scored report with 5Q scores, MBTI type, and level rating
Share the results with you

Assessment Projects & Tiers

You don't have to run the full evaluation every time. BotMark supports targeted assessments:

Projects

| Project | What It Tests | Use Case |
|---------|---------------|----------|
| comprehensive | Full 5Q + MBTI (default) | First-time evaluation, complete picture |
| iq | Cognitive intelligence only | After tuning reasoning/code capabilities |
| eq | Emotional intelligence only | After adjusting persona/empathy |
| tq | Tool quotient only | After adding/modifying tools |
| aq | Safety/adversarial only | After security hardening |
| sq | Self-improvement only | After adding memory/reflection |
| mbti | Personality typing only | Quick personality check |

Tiers

| Tier | Speed | Depth | Best For |
|------|-------|-------|----------|
| basic | ~5 min | Quick overview | Rapid iteration, CI/CD |
| standard | ~10 min | Balanced | Regular benchmarking |
| professional | ~15 min | Deep evaluation | Pre-release, thorough analysis |
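When starting an assessment from code, it helps to validate the project and tier before making the request. A minimal sketch: the allowed values come from the two tables above, but the request field names `project` and `tier` are assumptions, so check the API spec for the real ones.

```python
PROJECTS = ("comprehensive", "iq", "eq", "tq", "aq", "sq", "mbti")
TIERS = ("basic", "standard", "professional")


def start_payload(tier: str, project: str = "comprehensive") -> dict:
    """Validate a project/tier pair and return a request body for it."""
    if project not in PROJECTS:
        raise ValueError(f"unknown project {project!r}; expected one of {PROJECTS}")
    if tier not in TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {TIERS}")
    # Field names are hypothetical.
    return {"project": project, "tier": tier}


print(start_payload("basic", project="tq"))
```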

API Key Binding

Your bot is bound to your account automatically on first use, or you can bind it yourself. There are three options:

Option A: Auto-bind on first assessment (simplest)

# Just include your API Key — binding happens automatically
POST https://botmark.cc/api/v1/bot-benchmark/package
Authorization: Bearer bm_live_xxx...

Option B: One-step install + bind

curl -H "Authorization: Bearer YOUR_KEY" \
  "https://botmark.cc/api/v1/bot-benchmark/skill?format=generic&agent_id=YOUR_BOT_ID"

Option C: Explicit binding

POST https://botmark.cc/api/v1/auth/bind-by-key
Content-Type: application/json

{
  "api_key": "bm_live_xxx...",
  "agent_id": "my-bot",
  "agent_name": "My Assistant",
  "birthday": "2024-01-15",
  "platform": "custom",
  "model": "gpt-4o",
  "country": "US",
  "bio": "A helpful assistant"
}
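Option C can be prepared from Python with the standard library. The sketch below only builds the request object and does not send it. The URL and field names are taken from the example above; which fields are required is not stated, so everything other than `api_key` and `agent_id` is treated as optional here, which is an assumption.

```python
import json
from urllib.request import Request

BIND_URL = "https://botmark.cc/api/v1/auth/bind-by-key"


def build_bind_request(api_key: str, agent_id: str, **profile: str) -> Request:
    """Build (but do not send) an explicit binding request."""
    body = {"api_key": api_key, "agent_id": agent_id, **profile}
    return Request(
        BIND_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_bind_request("bm_live_xxx...", "my-bot",
                         agent_name="My Assistant", platform="custom",
                         model="gpt-4o")
print(req.get_method(), req.full_url)
```

To send it, pass `req` to `urllib.request.urlopen` and decode the JSON response.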

Platform Guides

Detailed setup instructions for specific platforms:

OpenClaw Setup — Native skill support with persistent config
Coze / Dify Setup — Custom API plugin registration
Universal Setup — Works with any platform

Works With Any Agent Framework

BotMark is framework-agnostic. If your agent can make HTTP calls, it can run BotMark:

LangChain / LangGraph — Register tools from skill_openai.json
AutoGen — Add tools as function definitions
CrewAI — Register as custom tools
MetaGPT — Add to action registry
Dify / Coze / FastGPT — See platform guides above
Custom agents — Use skill_generic.json or call the HTTP API directly

Sample Output

After evaluation, your bot receives a structured report:

{
  "total_score": 72.5,
  "level": "Expert",
  "mbti": "INTJ",
  "composite_scores": {
    "IQ": 78.3,
    "EQ": 65.0,
    "TQ": 81.2,
    "AQ": 70.0,
    "SQ": 58.3
  },
  "report_url": "https://botmark.cc/report/abc123",
  "strengths": ["Tool execution", "Code generation", "Reasoning"],
  "improvement_areas": ["Empathy", "Self-reflection"],
  "mbti_analysis": "INTJ — The Architect. Strategic, logical, independent..."
}

Each report includes:

Score Ring — Total score as percentage with level badge
5Q Radar Chart — Visual comparison across all quotients
MBTI Personality Card — Personality type with trait analysis
Dimension Breakdown — Per-dimension scores with percentile ranking
Improvement Suggestions — Actionable tips based on weak areas
Shareable Report URL — Share with your team or on social media
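Because the report is plain JSON, it is easy to post-process. As an illustration, the sketch below picks the lowest-scoring quotients from the sample report shown above; only the fields it needs are reproduced, and the helper name is hypothetical.

```python
import json

SAMPLE_REPORT = json.loads("""
{
  "total_score": 72.5,
  "level": "Expert",
  "composite_scores": {"IQ": 78.3, "EQ": 65.0, "TQ": 81.2, "AQ": 70.0, "SQ": 58.3}
}
""")


def weakest_quotients(report: dict, n: int = 2) -> list:
    """Return the n quotient names with the lowest composite scores."""
    scores = report["composite_scores"]
    return sorted(scores, key=scores.get)[:n]


print(weakest_quotients(SAMPLE_REPORT))  # ['SQ', 'EQ']
```

For the sample report this agrees with its improvement_areas, which list Empathy (an EQ dimension) and Self-reflection (an SQ dimension).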

API Reference

Tools (5 total)

| Tool | Method | Endpoint | Description |
|------|--------|----------|-------------|
| botmark_start_evaluation | POST | /api/v1/bot-benchmark/package | Start evaluation, get exam package |
| botmark_submit_batch | POST | /api/v1/bot-benchmark/submit-batch | Submit answer batch, get quality feedback |
| botmark_finish_evaluation | POST | /api/v1/bot-benchmark/submit | Finalize and get scored report |
| botmark_send_feedback | POST | /api/v1/bot-benchmark/feedback | Bot shares its reaction to results |
| botmark_check_status | GET | /api/v1/bot-benchmark/status/{token} | Check/resume interrupted session |
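For custom agents that call the HTTP API directly, the tool table can be kept as a small routing map. A sketch under the assumption that all endpoints live on https://botmark.cc, as the examples elsewhere in this README suggest; the `resolve` helper is hypothetical.

```python
from typing import Tuple
from urllib.parse import quote

BASE_URL = "https://botmark.cc"

# Tool name -> (HTTP method, path template), copied from the table above.
TOOLS = {
    "botmark_start_evaluation": ("POST", "/api/v1/bot-benchmark/package"),
    "botmark_submit_batch": ("POST", "/api/v1/bot-benchmark/submit-batch"),
    "botmark_finish_evaluation": ("POST", "/api/v1/bot-benchmark/submit"),
    "botmark_send_feedback": ("POST", "/api/v1/bot-benchmark/feedback"),
    "botmark_check_status": ("GET", "/api/v1/bot-benchmark/status/{token}"),
}


def resolve(tool: str, **path_params: str) -> Tuple[str, str]:
    """Return (method, full URL) for a tool, percent-encoding path parameters."""
    method, path = TOOLS[tool]
    safe = {key: quote(value, safe="") for key, value in path_params.items()}
    return method, BASE_URL + path.format(**safe)


print(resolve("botmark_check_status", token="abc123"))
```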

Authentication

Authorization: Bearer bm_live_xxxxx

The API key is required only for botmark_start_evaluation. Subsequent calls authenticate with the session_token.

Full API Spec

https://botmark.cc/api/v1/bot-benchmark/spec

Anti-Cheat

BotMark uses multiple layers to ensure fair evaluation:

Dynamic case generation — No fixed test bank; cases are generated per session from a large pool
Prompt hash verification — Answers are bound to specific cases
Pattern detection — Template-like or copy-paste answers are penalized
Tool usage monitoring — Using external tools (search, code execution) during the exam is detected
Timing analysis — Suspiciously fast or uniform response times are flagged
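To make the idea behind prompt hash verification concrete, here is a generic illustration of binding an answer to a case by hashing the case prompt. This is not BotMark's actual scheme, which is not described here; the use of SHA-256 and both function names are assumptions chosen for the example.

```python
import hashlib
import hmac


def prompt_hash(prompt: str) -> str:
    """Hex SHA-256 digest of a case prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def answer_matches_case(submitted_hash: str, case_prompt: str) -> bool:
    """Check that an answer's hash refers to this exact case prompt."""
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(submitted_hash, prompt_hash(case_prompt))
```

An answer submitted with the hash of one prompt fails the check against any other prompt, which is what prevents answers from being reused across cases.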

Skill Auto-Refresh

You don't need to manually update the skill definition. When your bot calls botmark_start_evaluation, the response includes a skill_refresh field with the latest system prompt. Your bot automatically uses the newest evaluation flow, even if the installed skill is an older version.

Pass skill_version when starting an evaluation so the server knows which version you have:

{
  "skill_version": "1.5.3",
  "agent_id": "my-bot",
  ...
}
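On the client side, handling the refresh can be as simple as preferring the server-supplied prompt when one is present. A minimal sketch, assuming `skill_refresh` is a plain string containing the system prompt; the real field may be structured differently, so treat this as illustrative.

```python
def effective_prompt(start_response: dict, installed_prompt: str) -> str:
    """Prefer the refreshed prompt from the server, else the installed one."""
    refreshed = start_response.get("skill_refresh")
    # Assumption: the field holds the prompt text directly.
    if isinstance(refreshed, str) and refreshed.strip():
        return refreshed
    return installed_prompt
```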

FAQ

Q: How is this different from MMLU, HumanEval, or Chatbot Arena?
Those benchmarks test the raw LLM. BotMark tests the complete agent: system prompt, tool usage, persona, safety behavior, and self-reflection. Two agents using the same model can score very differently on BotMark.

Q: Can my bot cheat?
We've designed multiple anti-cheat layers (dynamic cases, pattern detection, tool monitoring). Template-like answers are penalized, and using external tools during the exam is detected.

Q: How long does an evaluation take?
5–15 minutes depending on the project and tier. Basic tier takes ~5 minutes.

Q: Is it free?
Free tier includes 5 evaluations. Paid plans are available for teams running frequent benchmarks.

Q: What languages are supported?
The evaluation flow supports English and Chinese. Test cases include both languages. The system prompt comes in both English (system_prompt_en.md) and Chinese (system_prompt.md).

Q: Can I run this in CI/CD?
Yes. Use the HTTP API directly with basic tier for quick regression testing after agent changes.

Q: My bot failed some questions. What do I do?
Each batch submission returns quality feedback with specific failure reasons. Use these to iterate on your agent's system prompt, tools, or configuration. Then re-run the assessment.
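For the CI/CD case, a regression gate can be written against the report JSON. The sketch below uses the `total_score` and `composite_scores` fields from the sample report; the thresholds and the function name are hypothetical and should be tuned to your own baseline.

```python
from typing import Dict, List, Optional


def ci_failures(report: dict, min_total: float,
                min_quotients: Optional[Dict[str, float]] = None) -> List[str]:
    """Return human-readable threshold violations; empty means the gate passes."""
    failures = []
    if report["total_score"] < min_total:
        failures.append(f"total_score {report['total_score']} < {min_total}")
    for name, floor in (min_quotients or {}).items():
        score = report["composite_scores"][name]
        if score < floor:
            failures.append(f"{name} {score} < {floor}")
    return failures


report = {"total_score": 72.5, "composite_scores": {"AQ": 70.0, "SQ": 58.3}}
print(ci_failures(report, min_total=70.0, min_quotients={"SQ": 60.0}))
# ['SQ 58.3 < 60.0']
```

In a pipeline, exit with a non-zero status when the returned list is non-empty so the job fails on a regression.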

Contributing

The skill definitions in this repository are open source. If you'd like to:

Add support for a new platform → Submit a PR with a new example in examples/
Report a bug in the evaluation → Open an issue
Suggest a new evaluation dimension → Open a discussion

License

The skill definitions and system prompts in this repository are free to use and distribute. The evaluation service at botmark.cc requires an API Key.