@agentic-evals/mcp
MCP server for running deterministic and non-deterministic UI/UX evaluations using multi-agent swarms. Combines automated tooling (Lighthouse, axe-core, ESLint) with LLM-judged rubric evaluations, synthesizes findings through a deliberation protocol, and outputs prioritized GitHub issues.
Quick Start
npm install
npm run build
npm start        # starts MCP server on stdio
Use with VS Code / Copilot
Add to .vscode/mcp.json:
{
"servers": {
"agentic-evals": {
"type": "stdio",
"command": "node",
"args": ["dist/mcp/server.js"]
}
}
}
Then ask Copilot: "Run a quick review of http://localhost:3000"
Architecture
User / Copilot
│ MCP tool call (run_swarm, run_eval, …)
▼
MCP Server (17 tools, 2 resources, 3 prompts)
│
▼
SwarmOrchestrator.execute()
│
├─ Phase 1: FAN OUT (parallel batches)
│ ├─ Deterministic agents (Lighthouse, axe, ESLint, Stylelint, Prettier)
│ └─ Non-deterministic agents (LLM + rubric + evidence)
│
├─ Phase 1.5: RE-EVALUATION (confidence-gated)
│ └─ High-severity + low-confidence findings → targeted evidence
│ (zoom, hover, focus states, element-level capture)
│
├─ Phase 1.75: CROSS-RUN MEMORY
│ └─ Annotate findings: new | persistent | regression | resolved
│
└─ Phase 2: DELIBERATION (LLM synthesis)
├─ Merge overlapping findings
├─ Challenge across agents
├─ Prioritize (severity × effort → P0–P3)
└─ Actionize into concrete fixes
│
▼
GitHub Issue Creator → [P0] [Domain] Title with evidence & labels
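The phased flow above can be pictured roughly as follows. This is a hypothetical sketch for orientation only, not the library's actual SwarmOrchestrator API; the Agent and Finding shapes and the executeSwarm function are illustrative.
// Hypothetical sketch of the phased flow; names are illustrative only.
interface Finding { severity: "high" | "medium" | "low"; confidence: number; summary: string; }
interface Agent { name: string; run(url: string): Promise<Finding[]>; }

async function executeSwarm(agents: Agent[], url: string): Promise<Finding[]> {
  // Phase 1: fan out, running every agent in parallel.
  const findings = (await Promise.all(agents.map((agent) => agent.run(url)))).flat();

  // Phase 1.5: confidence-gated re-evaluation of risky findings.
  const recheck = findings.filter((f) => f.severity === "high" && f.confidence < 0.6);
  console.log(`${recheck.length} findings queued for targeted evidence capture`);

  // Phase 1.75: cross-run memory would annotate each finding here as
  // new | persistent | regression | resolved, using the previous run's results.

  // Phase 2: deliberation merges, challenges, prioritizes, and actionizes findings.
  return findings;
}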
Wizard Flow
The wizard provides a session-based pipeline that guides the evaluation from start to finish. Each step auto-populates the next — the agent never needs to figure out what to call next.
📋 Plan → ✅ Approve → 🔍 Run → 🤔 Deliberate → 🎯 Act → 📝 Confirm → 🎉 Done
| Step | What Happens | User Action |
|------|-------------|-------------|
| Plan | wizard_start assembles an eval team based on project signals | Review team, add/remove evals |
| Approve | wizard_advance(approve) locks the team and launches the swarm | Confirm the roster |
| Run | Swarm executes in parallel, returns phased narrative | — (automatic) |
| Deliberate | Agent follows deliberation protocol to merge/prioritize findings | Review prioritized findings |
| Act | wizard_advance(submit_findings) formats findings as issues | Choose: issues, actionize, or fix |
| Confirm | wizard_advance(choose_action) shows dry-run preview | Confirm to create |
| Done | wizard_advance(confirm) creates GitHub issues/PRs | — |
Session state persists across tool calls — findings, issues, and evidence are never lost between steps.
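For clients other than Copilot, the same flow can be driven directly over MCP. The sketch below uses the standard MCP TypeScript SDK; the argument shapes passed to wizard_start and wizard_advance (url, action) are assumptions, so check each tool's input schema for the exact fields.
// Drive the wizard from a plain MCP client (sketch; argument shapes assumed).
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "wizard-demo", version: "0.1.0" });
await client.connect(new StdioClientTransport({ command: "node", args: ["dist/mcp/server.js"] }));

// Plan: create a session and assemble the eval team.
const plan = await client.callTool({ name: "wizard_start", arguments: { url: "http://localhost:3000" } });
console.log(plan.content);

// Approve: lock the roster and launch the swarm; the Run phase happens server-side.
await client.callTool({ name: "wizard_advance", arguments: { action: "approve" } });

// Inspect the session before moving on to Deliberate / Act / Confirm.
const status = await client.callTool({ name: "wizard_status", arguments: {} });
console.log(status.content);

await client.close();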
MCP Tools
Wizard Flow (Recommended)
| Tool | Description |
|------|-------------|
| wizard_start | Start a guided evaluation wizard — creates a session, auto-assembles team, walks through each phase |
| wizard_advance | Advance the wizard to the next phase (approve → run → deliberate → act → confirm) |
| wizard_status | Check wizard session status or list all active sessions |
Individual Tools
| Tool | Description |
|------|-------------|
| list_evals | List available evaluation plugins |
| scan_project | Discover project tooling (linters, tests, formatters) and register each tool as an eval plugin |
| run_eval | Run a single evaluation plugin against a URL or project |
| run_suite | Run multiple evaluations as a suite |
| run_swarm | Execute a full multi-agent swarm evaluation |
| list_presets | List swarm presets (quick-scan, full-review, deep-dive, a11y-focus) |
| capture_evidence | Capture screenshots, DOM, and styles at multiple viewports |
| capture_user_flow | Execute a user flow (click, fill, navigate, hover) and capture evidence at each step |
| reevaluate_findings | Re-evaluate high-severity/low-confidence findings with targeted evidence (hover, focus, zoom) |
| get_run_history | View evaluation run history, trend data, regressions, and resolutions |
| get_rubric | Load a rubric with knowledge context |
| list_rubrics | List all available rubrics |
| create_issues | Convert findings into GitHub issues |
| review_page | One-shot page review (evidence + eval + report) |
| register_plugin | Register a custom eval plugin at runtime |
| scaffold_deterministic_eval | Create a custom deterministic eval in .evals/ (plugin + command + config) |
| scaffold_non_deterministic_eval | Create a custom non-deterministic eval in .evals/ (rubric + plugin + config) |
Swarm Presets
| Preset | Agents | Rounds | Use Case |
|--------|--------|--------|----------|
| quick-scan | 3 | 1 | Fast smoke test |
| full-review | 6 | 2 | Standard review |
| deep-dive | 8 | 3 | Comprehensive audit |
| a11y-focus | 4 | 2 | Accessibility-focused |
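Presets map directly onto run_swarm. A minimal sketch, again using the MCP TypeScript SDK and assuming run_swarm accepts url and preset arguments:
// Run the quick-scan preset (3 agents, 1 round) as a fast smoke test.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "swarm-demo", version: "0.1.0" });
await client.connect(new StdioClientTransport({ command: "node", args: ["dist/mcp/server.js"] }));

const result = await client.callTool({
  name: "run_swarm",
  arguments: { url: "http://localhost:3000", preset: "quick-scan" },
});
console.log(result.content);
await client.close();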
Knowledge & Rubrics
Non-deterministic evals are powered by rubrics in knowledge/rubrics/. Override them by placing project-specific rubrics in .evals/rubrics/ in the target project.
Available rubrics: accessibility, content/copy, information architecture, performance perception, responsive design, UX heuristics, visual design.
Supporting knowledge: knowledge/principles/ (Gestalt, Fitts' law, cognitive load, color theory, typography) and knowledge/standards/ (WCAG 2.2, design systems).
See knowledge/README.md for details on the rubric resolution system.
Per-project .evals/ structure
Run scan_project or call initProject() to scaffold a fully customizable .evals/ directory:
.evals/
├── config.json                  # enable/disable evals, adjust settings
├── commands.json                # custom CLI-based evals
├── deliberation-protocol.md     # override deliberation prompts
├── issue-template.md            # override GitHub issue format
├── rubrics/                     # override or extend rubric scoring
│   ├── visual-design.md
│   └── ...
├── knowledge/
│   ├── principles/              # override or add design principles
│   │   ├── gestalt.md
│   │   ├── brand-guidelines.md  # (your own)
│   │   └── ...
│   └── standards/               # override or add design standards
│       ├── wcag-2.2.md
│       ├── your-design-system.md  # (your own)
│       └── ...
└── plugins/                     # override or extend eval plugins
    ├── deterministic/
    ├── non-deterministic/
    └── README.md
Resolution order: project .evals/ overrides → library defaults. Project files always win on name collisions, and you can add new files that don't exist in the library.
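The override behavior amounts to a simple lookup rule. A hypothetical sketch (the resolveKnowledgeFile helper is illustrative, not part of the library):
// Illustrative resolution rule: the project's .evals/ copy wins, otherwise
// fall back to the library default. Paths and the helper name are hypothetical.
import { existsSync } from "node:fs";
import { join } from "node:path";

function resolveKnowledgeFile(projectRoot: string, libraryRoot: string, relativePath: string): string {
  const projectFile = join(projectRoot, ".evals", relativePath);   // e.g. .evals/rubrics/visual-design.md
  if (existsSync(projectFile)) return projectFile;                 // project override wins
  return join(libraryRoot, "knowledge", relativePath);             // library default, e.g. knowledge/rubrics/...
}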
Custom Evals
Command-based (deterministic)
Add .evals/commands.json to your project:
{
"commands": [
{
"name": "storybook-build",
"command": "npm run build-storybook",
"kind": "build",
"failOn": "exit-code"
}
]
}
Plugin-based (non-deterministic)
See examples/example-plugin.ts for a template.
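If you just want a sense of the shape before opening the template, here is a rough sketch of a non-deterministic plugin. The field names and the evaluate signature are assumptions; treat examples/example-plugin.ts as the source of truth and register the result with register_plugin.
// Rough sketch of a custom non-deterministic plugin; field names and the
// evaluate signature are assumed, so follow examples/example-plugin.ts.
interface EvalFinding { severity: "high" | "medium" | "low"; summary: string; evidence?: string; }

export const brandCompliancePlugin = {
  name: "brand-compliance",
  kind: "non-deterministic" as const,
  rubric: ".evals/rubrics/brand-compliance.md",   // rubric the LLM judge scores against
  async evaluate(ctx: { url: string; screenshots: string[] }): Promise<EvalFinding[]> {
    // Hand the rubric and captured evidence to the LLM judge, then map its
    // scored criteria back into findings. A placeholder finding is returned here.
    return [{ severity: "medium", summary: `Primary CTA color on ${ctx.url} deviates from the brand palette` }];
  },
};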
Guided creation (via Copilot)
Use the create-eval prompt — Copilot will interview you, then scaffold everything:
"Create a non-deterministic eval for brand compliance"
Or call the tools directly: scaffold_deterministic_eval / scaffold_non_deterministic_eval.
Development
npm run dev # watch mode
npm run lint
npm test         # vitest
License
MIT
