@agentic-evals/mcp
MCP server for running deterministic and non-deterministic UI/UX evaluations using multi-agent swarms. Combines automated tooling (Lighthouse, axe-core, ESLint) with LLM-judged rubric evaluations, synthesizes findings through a deliberation protocol, and outputs prioritized GitHub issues.
Quick Start
npm install
npm run build
npm start        # starts MCP server on stdio
Use with VS Code / Copilot
Add to .vscode/mcp.json:
{
"servers": {
"agentic-evals": {
"type": "stdio",
"command": "node",
"args": ["dist/mcp/server.js"]
}
}
}
Then ask Copilot: "Run a quick review of http://localhost:3000"
Architecture
User / Copilot
│ MCP tool call (run_swarm, run_eval, …)
▼
MCP Server (17 tools, 2 resources, 3 prompts)
│
▼
SwarmOrchestrator.execute()
│
├─ Phase 1: FAN OUT (parallel batches)
│ ├─ Deterministic agents (Lighthouse, axe, ESLint, Stylelint, Prettier)
│ └─ Non-deterministic agents (LLM + rubric + evidence)
│
├─ Phase 1.5: RE-EVALUATION (confidence-gated)
│ └─ High-severity + low-confidence findings → targeted evidence
│ (zoom, hover, focus states, element-level capture)
│
├─ Phase 1.75: CROSS-RUN MEMORY
│ └─ Annotate findings: new | persistent | regression | resolved
│
└─ Phase 2: DELIBERATION (LLM synthesis)
├─ Merge overlapping findings
├─ Challenge across agents
├─ Prioritize (severity × effort → P0–P3)
└─ Actionize into concrete fixes
│
▼
GitHub Issue Creator → [P0] [Domain] Title with evidence & labels
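The phased flow above can be pictured roughly as follows. This is a hypothetical sketch for orientation only, not the library's actual SwarmOrchestrator API; the Agent and Finding shapes and the executeSwarm function are illustrative.
// Hypothetical sketch of the phased flow; names are illustrative only.
interface Finding { severity: "high" | "medium" | "low"; confidence: number; summary: string; }
interface Agent { name: string; run(url: string): Promise<Finding[]>; }

async function executeSwarm(agents: Agent[], url: string): Promise<Finding[]> {
  // Phase 1: fan out, running every agent in parallel.
  const findings = (await Promise.all(agents.map((agent) => agent.run(url)))).flat();

  // Phase 1.5: confidence-gated re-evaluation of risky findings.
  const recheck = findings.filter((f) => f.severity === "high" && f.confidence < 0.6);
  console.log(`${recheck.length} findings queued for targeted evidence capture`);

  // Phase 1.75: cross-run memory would annotate each finding here as
  // new | persistent | regression | resolved, using the previous run's results.

  // Phase 2: deliberation merges, challenges, prioritizes, and actionizes findings.
  return findings;
}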
Wizard Flow
The wizard provides a session-based pipeline that guides the evaluation from start to finish. Each step auto-populates the next — the agent never needs to figure out what to call next.
📋 Plan → ✅ Approve → 🔍 Run → 🤔 Deliberate → 🎯 Act → 📝 Confirm → 🎉 Done
| Step | What Happens | User Action |
|------|-------------|-------------|
| Plan | wizard_start assembles an eval team based on project signals | Review team, add/remove evals |
| Approve | wizard_advance(approve) locks the team and launches the swarm | Confirm the roster |
| Run | Swarm executes in parallel, returns phased narrative | — (automatic) |
| Deliberate | Agent follows deliberation protocol to merge/prioritize findings | Review prioritized findings |
| Act | wizard_advance(submit_findings) formats findings as issues | Choose: issues, actionize, or fix |
| Confirm | wizard_advance(choose_action) shows dry-run preview | Confirm to create |
| Done | wizard_advance(confirm) creates GitHub issues/PRs | — |
Session state persists across tool calls — findings, issues, and evidence are never lost between steps.
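For clients other than Copilot, the same flow can be driven directly over MCP. The sketch below uses the standard MCP TypeScript SDK; the argument shapes passed to wizard_start and wizard_advance (url, action) are assumptions, so check each tool's input schema for the exact fields.
// Drive the wizard from a plain MCP client (sketch; argument shapes assumed).
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "wizard-demo", version: "0.1.0" });
await client.connect(new StdioClientTransport({ command: "node", args: ["dist/mcp/server.js"] }));

// Plan: create a session and assemble the eval team.
const plan = await client.callTool({ name: "wizard_start", arguments: { url: "http://localhost:3000" } });
console.log(plan.content);

// Approve: lock the roster and launch the swarm; the Run phase happens server-side.
await client.callTool({ name: "wizard_advance", arguments: { action: "approve" } });

// Inspect the session before moving on to Deliberate / Act / Confirm.
const status = await client.callTool({ name: "wizard_status", arguments: {} });
console.log(status.content);

await client.close();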
MCP Tools
Wizard Flow (Recommended)
| Tool | Description |
|------|-------------|
| wizard_start | Start a guided evaluation wizard — creates a session, auto-assembles team, walks through each phase |
| wizard_advance | Advance the wizard to the next phase (approve → run → deliberate → act → confirm) |
| wizard_status | Check wizard session status or list all active sessions |
Individual Tools
| Tool | Description |
|------|-------------|
| list_evals | List available evaluation plugins |
| scan_project | Discover project tooling (linters, tests, formatters) and register each tool as an eval plugin |
| run_eval | Run a single evaluation plugin against a URL or project |
| run_suite | Run multiple evaluations as a suite |
| run_swarm | Execute a full multi-agent swarm evaluation |
| list_presets | List swarm presets (quick-scan, full-review, deep-dive, a11y-focus) |
| capture_evidence | Capture screenshots, DOM, and styles at multiple viewports |
| capture_user_flow | Execute a user flow (click, fill, navigate, hover) and capture evidence at each step |
| reevaluate_findings | Re-evaluate high-severity/low-confidence findings with targeted evidence (hover, focus, zoom) |
| get_run_history | View evaluation run history, trend data, regressions, and resolutions |
| get_rubric | Load a rubric with knowledge context |
| list_rubrics | List all available rubrics |
| create_issues | Convert findings into GitHub issues |
| review_page | One-shot page review (evidence + eval + report) |
| register_plugin | Register a custom eval plugin at runtime |
| scaffold_deterministic_eval | Create a custom deterministic eval in .evals/ (plugin + command + config) |
| scaffold_non_deterministic_eval | Create a custom non-deterministic eval in .evals/ (rubric + plugin + config) |
Swarm Presets
| Preset | Agents | Rounds | Use Case |
|--------|--------|--------|----------|
| quick-scan | 3 | 1 | Fast smoke test |
| full-review | 6 | 2 | Standard review |
| deep-dive | 8 | 3 | Comprehensive audit |
| a11y-focus | 4 | 2 | Accessibility-focused |
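Presets map directly onto run_swarm. A minimal sketch, again using the MCP TypeScript SDK and assuming run_swarm accepts url and preset arguments:
// Run the quick-scan preset (3 agents, 1 round) as a fast smoke test.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "swarm-demo", version: "0.1.0" });
await client.connect(new StdioClientTransport({ command: "node", args: ["dist/mcp/server.js"] }));

const result = await client.callTool({
  name: "run_swarm",
  arguments: { url: "http://localhost:3000", preset: "quick-scan" },
});
console.log(result.content);
await client.close();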
Knowledge & Rubrics
Non-deterministic evals are powered by rubrics in knowledge/rubrics/. Override them by placing project-specific rubrics in .evals/rubrics/ in the target project.
Available rubrics: accessibility, content/copy, information architecture, performance perception, responsive design, UX heuristics, visual design.
Supporting knowledge: knowledge/principles/ (Gestalt, Fitts' law, cognitive load, color theory, typography) and knowledge/standards/ (WCAG 2.2, design systems).
See knowledge/README.md for details on the rubric resolution system.
Per-project .evals/ structure
Run scan_project or call initProject() to scaffold a fully customizable .evals/ directory:
.evals/
├── config.json                  # enable/disable evals, adjust settings
├── commands.json                # custom CLI-based evals
├── deliberation-protocol.md     # override deliberation prompts
├── issue-template.md            # override GitHub issue format
├── rubrics/                     # override or extend rubric scoring
│   ├── visual-design.md
│   └── ...
├── knowledge/
│   ├── principles/              # override or add design principles
│   │   ├── gestalt.md
│   │   ├── brand-guidelines.md  # (your own)
│   │   └── ...
│   └── standards/               # override or add design standards
│       ├── wcag-2.2.md
│       ├── your-design-system.md  # (your own)
│       └── ...
└── plugins/                     # override or extend eval plugins
    ├── deterministic/
    ├── non-deterministic/
    └── README.md
Resolution order: project .evals/ overrides → library defaults. Project files always win on name collisions, and you can add new files that don't exist in the library.
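The override behavior amounts to a simple lookup rule. A hypothetical sketch (the resolveKnowledgeFile helper is illustrative, not part of the library):
// Illustrative resolution rule: the project's .evals/ copy wins, otherwise
// fall back to the library default. Paths and the helper name are hypothetical.
import { existsSync } from "node:fs";
import { join } from "node:path";

function resolveKnowledgeFile(projectRoot: string, libraryRoot: string, relativePath: string): string {
  const projectFile = join(projectRoot, ".evals", relativePath);   // e.g. .evals/rubrics/visual-design.md
  if (existsSync(projectFile)) return projectFile;                 // project override wins
  return join(libraryRoot, "knowledge", relativePath);             // library default, e.g. knowledge/rubrics/...
}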
Custom Evals
Command-based (deterministic)
Add .evals/commands.json to your project:
{
"commands": [
{
"name": "storybook-build",
"command": "npm run build-storybook",
"kind": "build",
"failOn": "exit-code"
}
]
}
Plugin-based (non-deterministic)
See examples/example-plugin.ts for a template.
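If you just want a sense of the shape before opening the template, here is a rough sketch of a non-deterministic plugin. The field names and the evaluate signature are assumptions; treat examples/example-plugin.ts as the source of truth and register the result with register_plugin.
// Rough sketch of a custom non-deterministic plugin; field names and the
// evaluate signature are assumed, so follow examples/example-plugin.ts.
interface EvalFinding { severity: "high" | "medium" | "low"; summary: string; evidence?: string; }

export const brandCompliancePlugin = {
  name: "brand-compliance",
  kind: "non-deterministic" as const,
  rubric: ".evals/rubrics/brand-compliance.md",   // rubric the LLM judge scores against
  async evaluate(ctx: { url: string; screenshots: string[] }): Promise<EvalFinding[]> {
    // Hand the rubric and captured evidence to the LLM judge, then map its
    // scored criteria back into findings. A placeholder finding is returned here.
    return [{ severity: "medium", summary: `Primary CTA color on ${ctx.url} deviates from the brand palette` }];
  },
};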
Guided creation (via Copilot)
Use the create-eval prompt — Copilot will interview you, then scaffold everything:
"Create a non-deterministic eval for brand compliance"
Or call the tools directly: scaffold_deterministic_eval / scaffold_non_deterministic_eval.
Development
npm run dev # watch mode
npm run lint
npm test         # vitest
License
MIT
