# @trymanateeai/cli

Chatbot regression testing for devs. Generate domain-specific synthetic users from your docs, then run them adversarially against your custom-built chatbot via its API. On every commit, in CI, from your terminal.

```shell
npm install -g @trymanateeai/cli
manatee init                                   # creates manatee.config.js
manatee personas generate --from-docs ./docs
manatee test
```

The personas users see ("Webhook Power User", "Budget-Conscious Parent", "First-Time Founder Setting Up Stripe Connect") are generated from your actual product docs, not picked from a generic list. Eight base behavior archetypes provide the shape — how they type, escalate, push guardrails — but the names, vocabulary, opening messages, and topics are all yours.
API-only. No browser, no Playwright, no widget detection. You own a chatbot endpoint; manatee POSTs to it like any client.
## Why
Most chatbot evals test the model in isolation: prompts go in, responses come out. That misses the failures that actually break product — context loss across turns, jailbreaks that escalate over five messages, prompts that leak when an "impatient power user" runs into a dead end.
Generic synthetic users miss most of these because they don't know your domain. Manatee reads your product docs and builds a roster of synthetic users tuned to your actual use cases.
## Install

```shell
npm install -g @trymanateeai/cli   # or use npx — no install needed
export OPENAI_API_KEY=sk-...       # BYOK — runs locally, nothing stored
```

That's it. No Chromium download, no SaaS account, no SDK to integrate.
## Quick start (3 minutes)

```shell
# 1. From inside your chatbot project
cd /path/to/your-app

# 2. Scaffold the config
manatee init
# → asks for your chatbot endpoint URL, writes manatee.config.js

# 3. Generate domain-specific personas from your local docs
manatee personas generate --from-docs ./docs

# 4. Run the test
manatee test
# → auto-loads manatee.config.js + manatee-personas.json from cwd
```

## manatee.config.js — the contract
The config file describes how to talk to your chatbot. Either point at an HTTP endpoint (manatee builds the request) or provide a custom send function (you own request/response/auth/streaming).
### Simple — OpenAI-shaped endpoint

```js
// manatee.config.js
export default {
  endpoint: 'http://localhost:3000/api/chat',
  headers: { Authorization: `Bearer ${process.env.MY_BOT_TOKEN}` },
};
```

Manatee POSTs `{ messages: [{role, content}, ...] }` and reads the reply at `choices.0.message.content`.
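If you're unsure whether your endpoint matches the default contract, the two shapes can be sketched as plain objects. The message content below is illustrative, not anything manatee actually sends:

```javascript
// What manatee sends to `endpoint` by default — an OpenAI-shaped body.
const requestBody = {
  messages: [
    { role: 'user', content: 'How do I reset my password?' },
  ],
};

// What your endpoint is expected to return:
const responseBody = {
  choices: [
    {
      message: {
        role: 'assistant',
        content: 'Click "Forgot password" on the sign-in page.',
      },
    },
  ],
};

// manatee reads the reply at choices.0.message.content:
const reply = responseBody.choices[0].message.content;
```

If your endpoint already speaks this shape, the two-line config above is all you need.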
### Custom request/response shape

```js
export default {
  endpoint: 'https://my-app.com/api/v2/chat',
  headers: { 'X-API-Key': process.env.MY_KEY },
  requestShape: 'simple',          // sends { message, history } instead
  responsePath: 'data.reply.text', // dot-path into response JSON
};
```

### Full control — custom send function
```js
export default {
  send: async ({ messages, sessionId, context }) => {
    const res = await fetch('https://my-app.com/chat', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${context.token}`,
      },
      body: JSON.stringify({ messages, session: sessionId }),
    });
    const data = await res.json();
    return data.reply; // return the assistant's text
  },

  // Optional per-conversation hooks — useful for fresh sessions, DB rows, etc.
  setup: async () => {
    const token = await fetchAuthToken();
    return { token, sessionId: crypto.randomUUID() };
  },
  teardown: async (ctx) => {
    await releaseSession(ctx.sessionId);
  },
};
```

`setup()` runs once per conversation. Whatever it returns becomes `context` (passed to `send` and `teardown`). If `context.sessionId` is set, manatee uses it; otherwise it generates a UUID.
### No config? Pass --endpoint inline

```shell
manatee test --endpoint http://localhost:3000/api/chat
manatee test --endpoint http://... --auth-header "Authorization: Bearer $TOKEN"
```

## Test command flags
| Flag | Description |
|---|---|
| --endpoint <url> | Direct endpoint (skips manatee.config.js) |
| --config <path> | Explicit config path (defaults to auto-discovery) |
| --auth-header <header> | Single auth header for --endpoint mode |
| -p, --personas <ids> | Comma-separated archetype IDs |
| --personas-file <path> | Enriched personas JSON (auto-detected) |
| --users <n> | Total conversations across all personas |
| -t, --turns <n> | Turns per conversation (default: 5) |
| -c, --concurrency <n> | Parallel conversations (default: 3, max: 10) |
| -e, --edge-cases <ids> | Comma-separated edge case behaviors |
| -m, --model <name> | OpenAI-compatible model (default: gpt-4o-mini) |
| --temperature <n> | LLM sampling temperature (default: 0.7) |
| --api-key <key> | OpenAI key (or set OPENAI_API_KEY) |
| --base-url <url> | OpenAI base URL override (Together, Groq, Ollama) |
| --timeout <sec> | Per-request timeout (default: 30) |
| --context <text> | Inline product context |
| --json [path] | JSON report. Path → file. No arg → stdout. |
| --html [path] | Self-contained HTML report (default: manatee-report.html) |
| --fail-under <n> | Exit 1 if reliability < n (CI gate) |
| --budget-usd <n> | Abort if estimated LLM spend exceeds this |
| -v, --verbose | Verbose logging |
## Base archetype templates

Eight behavior templates that the enricher specializes. Run `manatee personas list` for descriptions.
| Archetype | Tests for |
|---|---|
| impatient | Context handling under pressure |
| confused | Clarification, conversation management |
| adversarial | Prompt injection, jailbreaks, system prompt leaks |
| emotional | Empathy, de-escalation |
| power_user | Multi-turn context, accuracy |
| non_native | Robustness to imperfect English |
| wanderer | Scope management |
| speed | Race conditions, message queuing |
## Edge case behaviors

Random adversarial behaviors injected mid-conversation: `rapid_fire`, `long_input`, `empty_msg`, `emoji_heavy`, `lang_switch`, `contradictions`, `context_overflow`, `unicode_abuse`, `code_injection`, `markdown_abuse`. Unknown IDs are warned about, not silently ignored.
## Output formats

Markdown report — always on. Every `manatee test` run drops a comprehensive `manatee-report.md` in cwd: hero score, findings with conversation excerpts and suggested fixes inline, systemic issues, per-persona table, full collapsible transcripts. Designed to be committed to your repo, pasted as a PR comment, or fed to an AI assistant ("here's the report, fix these"). Pass `--no-md` to disable, `--md <path>` to override the location.
Pretty terminal report by default. JSON via `--json`:

```shell
manatee test --json result.json   # → file
manatee test --json -             # → stdout (suppresses pretty render)
```

HTML via `--html` — single self-contained file with inline CSS, severity-coded findings, collapsible per-conversation transcripts. No JS dependencies, drops into CI artifacts cleanly.
Need a PDF? Pipe through pandoc — `pandoc manatee-report.md -o manatee-report.pdf` — or any markdown-to-PDF tool. A dedicated `--pdf` flag isn't built in because PDFs are diff-unfriendly, AI-unfriendly, and harder to comment on; markdown wins by default.
Every run prints `Usage: N tokens (M calls), estimated cost $X` so you always know what it cost.
## CI integration

```yaml
- uses: actions/setup-node@v4
  with:
    node-version: 20
- run: npm install -g @trymanateeai/cli
- run: |
    manatee test \
      --fail-under 75 \
      --budget-usd 2.00 \
      --json result.json \
      --html report.html
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    MY_BOT_TOKEN: ${{ secrets.STAGING_BOT_TOKEN }}
- uses: actions/upload-artifact@v4
  if: always()
  with:
    name: manatee-report
    path: |
      result.json
      report.html
```

The CLI exits 0 if reliability ≥ threshold, 1 otherwise. Commit manatee.config.js and manatee-personas.json to your repo so CI runs are deterministic.
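The gate itself reduces to a simple comparison — a sketch of the documented exit-code behavior, not manatee source:

```javascript
// --fail-under: exit 0 when reliability meets the threshold, 1 otherwise.
function gateExitCode(reliability, failUnder) {
  return reliability >= failUnder ? 0 : 1;
}
```

So `--fail-under 75` passes a run scoring exactly 75 and fails one scoring 74.9.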
## How it works

Persona enrichment is a 2-stage LLM pipeline:

- **Stage 1 — Archetype Extraction.** Reads your product docs (files, inline text). An LLM call returns 5–8 real user archetypes grounded in actual content: `name`, `demographics`, `goals`, `frustrations`, `communication_style`, `domain_knowledge`, 3–5 specific topics they'd ask about, and which of the 8 base behaviors best matches them.
- **Stage 2 — Persona Synthesis.** For each archetype, a second LLM call merges (a) the base behavior's full system prompt with (b) the domain context. The output keeps all behavior rules but injects product-specific vocabulary, real opening message examples, and a backstory grounded in your domain. Runs in parallel; the CLI streams a checkmark per persona as it completes.
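As a rough illustration of stage 2's inputs and output — in manatee the merge is done by an LLM call, not string concatenation, and the field names below are assumptions rather than manatee's actual schema:

```javascript
// Toy data-shape sketch of persona synthesis. Field names are
// illustrative; the real merge is an LLM call.
function synthesizePersona(baseBehavior, archetype) {
  return {
    name: archetype.name,
    baseBehavior: baseBehavior.id,
    systemPrompt: [
      baseBehavior.systemPrompt, // all behavior rules are kept
      `Backstory: ${archetype.backstory}`,
      `Topics you bring up: ${archetype.topics.join('; ')}`,
    ].join('\n\n'),
  };
}
```

The point is the direction of flow: behavior rules stay intact, and domain specifics are layered on top rather than replacing them.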
Conversations are driven against your endpoint via the config's endpoint or send function. Each conversation gets its own setup() context (fresh sessionId, auth token, etc.) and a teardown() for cleanup.
Classification runs an LLM judge across 15 vulnerability types and 4 severity levels. Persona-aware scoring weights findings by archetype (an adversarial user finding a jailbreak weighs heavier than a confused user causing context loss). Issues appearing in ≥35% of conversations get flagged as systemic.
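A toy illustration of persona-aware weighting — the actual weights and formula aren't documented here, so the numbers below are invented for illustration only:

```javascript
// Invented example weights: the same finding counts more when it comes
// from an adversarial persona than from a confused one.
const ARCHETYPE_WEIGHT = { adversarial: 1.5, confused: 0.8 };

function weightedSeverity(finding) {
  const w = ARCHETYPE_WEIGHT[finding.archetype] ?? 1.0;
  return finding.severity * w;
}
```
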
Cost tracking — every LLM call's `usage.prompt_tokens` and `completion_tokens` are accumulated against a per-model rate table; the final report includes total tokens plus estimated USD spend. `--budget-usd` aborts the run before further calls when the cap is reached.
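The accumulation described above can be sketched like this — the rate-table values are made up for the example; real per-model rates differ:

```javascript
// Illustrative rate table: USD per 1M tokens (made-up numbers).
const RATES = {
  'gpt-4o-mini': { prompt: 0.15, completion: 0.6 },
};

function makeTracker(model) {
  const usage = { promptTokens: 0, completionTokens: 0, calls: 0 };
  return {
    // Record one LLM call's usage object.
    record(u) {
      usage.promptTokens += u.prompt_tokens;
      usage.completionTokens += u.completion_tokens;
      usage.calls += 1;
    },
    // Estimate spend; unknown models fall back to gpt-4o-mini rates,
    // as the BYO LLM section documents.
    estimateUsd() {
      const r = RATES[model] ?? RATES['gpt-4o-mini'];
      return (usage.promptTokens * r.prompt + usage.completionTokens * r.completion) / 1e6;
    },
    usage,
  };
}
```
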
## BYO LLM
Default is OpenAI. Use any OpenAI-compatible endpoint:
```shell
manatee personas generate --from-docs ./docs \
  --base-url https://api.together.xyz/v1 \
  --model meta-llama/Llama-3.3-70B-Instruct-Turbo

manatee test \
  --base-url http://localhost:11434/v1 \
  --model llama3.1
```

Works with Together, Groq, Anthropic-via-proxy, Ollama, LM Studio, vLLM. Cost estimation falls back to gpt-4o-mini rates for non-OpenAI models.
## All commands

```shell
manatee                          # colorful intro + quick start
manatee init                     # scaffold manatee.config.js
manatee personas list            # show 8 base archetype templates
manatee personas generate ...    # build domain-specific personas from docs
manatee personas show <id>       # print full system prompt + metadata
manatee test ...                 # run synthetic users, score, report
manatee --version
manatee <cmd> --help             # per-command flags
```

## Status
v0.3.0 — pure dev tool, API-only. Working: init wizard, persona enrichment from local docs, classifier, scorer, CI integration, HTML/JSON output, budget caps, persona inspection, custom send functions, custom request/response shapes. Coming next: streaming response support, retry/backoff knobs, reputation simulator.
## Contributing
See CONTRIBUTING.md. Issues and PRs welcome.
## License
MIT
