@rikaidev/mesen

v0.1.1

Published

3 months ago

Behavioral UX testing through simulated user perspectives. Persona-driven, evidence-based, model-agnostic.

0High
0Medium
0Low

gloomcheng

ux accessibility testing audit wcag persona bdd tablet mcp claude-code

mesen (目線)

Behavioral UX testing through simulated user perspectives. Persona-driven, evidence-based, model-agnostic.

"axe-core tells you your button is missing an ARIA label. mesen tells you your grandmother can't find the button."

What is mesen?

mesen is an open-source UX testing tool that simulates real users attempting tasks on your web app, then audits findings against design guidelines (WCAG, Apple HIG, Material Design, Nielsen).

Instead of just checking ARIA labels and contrast ratios, mesen asks: can this specific person actually do what they came to do?

How it works

.ux spec (persona + BDD scenarios)
    → Browser: navigate, authenticate, capture DOM
    → Layer 1: measure everything (contrast, touch targets, fonts, layout)
    → AI agent: read measurements + persona + skills → derive thresholds → judge
    → Output: code-review style findings (accept/dismiss per finding)

What makes it different

| Tool | What it checks | What it misses | |------|---------------|----------------| | axe-core / Lighthouse | Structural WCAG violations | Can the user find the button? | | Playwright / Cypress | Does the feature work? | Can they use it? | | Percy / Chromatic | Did pixels change? | Is the change a problem? | | mesen | Can this persona complete this task? | |

Quick start

# Install
npm install -D mesen

# Put a .ux spec in your project
mkdir -p .mesen/specs
# Write or generate a .ux spec (see below)

# Run
npx mesen run

mesen auto-discovers .mesen/specs/*.ux files.

The .ux spec

A .ux spec combines persona definition with BDD scenarios — written in human language, not technical specs:

---
url: https://your-app.com

auth:
  email: $APP_EMAIL
  password: $APP_PASSWORD

persona:
  name: Nurse Chen
  age: 52
  tech_literacy: minimal
  vision: normal
  context: |
    Station leader at a community care facility.
    Opens tablet at 8am to check who's coming today.
    Each operation must complete in under 30 seconds.
  cognitive_model:
    working_memory: medium
    frustration_threshold: 2
  physical_model:
    click_confidence: low

device:
  type: iPad
  viewport: [1024, 768]

guidelines: [wcag-2.1, apple-hig, nielsen]
---

Feature: Morning station opening

  Scenario: Check today's attendance
    Given I open the tablet after logging in
    When I look at the home screen
    Then I should see how many elders are expected today
    And I should be able to tell who has arrived and who hasn't

  Scenario: Urgent items stand out
    Given I'm on the home screen
    When an elder had abnormal blood pressure yesterday
    Then that alert should appear before routine tasks
    And I should see which elder and what the issue is without tapping in

Key principles:

BDD is in the persona's language (any human language)
No px, contrast ratio, or technical measurements in scenarios
Thresholds are derived by the AI agent from persona traits + guideline skills
Auth is auto-detected (form login, token, cookies)

Architecture

Layer 1: Evidence (mesen)          Layer 2: Analysis (AI agent + skills)
┌──────────────────────┐           ┌──────────────────────────┐
│ DOM measurements     │           │ bdd-to-thresholds skill  │
│ - contrast ratios    │           │ - persona traits → numbers│
│ - touch target sizes │     →     │ tablet-task-audit skill  │
│ - font sizes         │           │ - one-screen, priority    │
│ - accessible names   │           │ domain skills (extensible)│
│ - layout structure   │           │ - care-station, retail... │
│ Lighthouse scores    │           │                          │
└──────────────────────┘           └──────────────────────────┘
        Facts only                    Judgment with domain knowledge

Layer 1 collects measurements (facts, no judgment). A 3.2:1 contrast ratio is just a number — whether it's a problem depends on the persona.

Layer 2 is the AI agent reading the evidence through skills. The bdd-to-thresholds skill translates "52-year-old with normal vision" into "min font 14px, key data 20px." Domain skills like care-station know that ID number search is standard practice in healthcare (not a privacy issue).

Integration

Claude Code plugin

# Install as Claude Code plugin
claude plugin add mesen

# Use in Claude Code
/audit              # Full UX audit with skills
/write-spec          # Generate a .ux spec

MCP server

{
  "mcpServers": {
    "mesen": {
      "command": "npx",
      "args": ["mesen", "serve"]
    }
  }
}

MCP tools:

mesen_dom_check — quick DOM measurements
mesen_audit — full VLM audit with spec
mesen_evidence — evidence bundle for any model
mesen_score — UX score
mesen_fix_suggestions — remediation suggestions

CLI

# Auto-discover specs
mesen run

# Specific spec
mesen run --spec path/to/spec.ux

# Override URL (test against staging)
mesen run --url https://staging.example.com

# Evidence bundle for AI agent consumption
mesen evidence --spec spec.ux --output-format prompt

CI / GitHub Actions

- name: UX Audit
  uses: mesen/action@v1
  with:
    specs: '.mesen/specs/**/*.ux'
    tier: dom-only

Smart auth

mesen auto-detects login mechanisms:

Form login: Detects login form → fills credentials → submits
CAPTCHA: Opens browser → pre-fills credentials → you fill CAPTCHA → mesen continues automatically
Session caching: Saves session after first login, reuses on next run
Token/cookie: Set in spec auth section

auth:
  email: [email protected]
  password: secret123

No need to specify type: — mesen figures it out.

Extensible skills

mesen ships with base skills and supports domain-specific extensions:

Built-in skills

| Skill | Purpose | |-------|---------| | write-ux-spec | Guide for writing effective .ux BDD specs | | bdd-to-thresholds | Derive measurable thresholds from persona traits | | tablet-task-audit | Audit operational tablet apps (one-screen, priority, layout) |

Domain skills

Domain skills extend the base audit with field-specific knowledge:

skills/domains/
├── care-station/    # Elder care facilities
├── retail-pos/      # Point of sale systems
├── warehouse/       # Warehouse management
└── your-domain/     # Create your own

Domain skills provide:

False positive filters — things that look wrong to a generalist but are correct for the domain
Priority rules — what counts as "urgent" varies by domain
Standard workflows — the real-world task sequence the UI must support

Create a domain skill

---
name: domain-your-industry
description: Domain knowledge for [industry] apps
---

# Your Industry Domain Knowledge

## Standard Workflow
[What users actually do, step by step]

## False Positive Filters
[Things that look wrong but are correct in this domain]

## Priority Rules
[What counts as urgent in this domain]

Output format

mesen produces code-review style findings — each finding is a suggestion, not a mandate:

{
  "severity": "critical",
  "category": "one-screen",
  "title": "40 urgent items hidden below fold",
  "evidence": "Emergency block at position 4, needs scroll. Empty blocks occupy 180px above.",
  "reasoning": "Nurse checks tablet at 8am, must see urgent items in <3 seconds",
  "suggestion": "Move urgent block above empty state blocks",
  "dismissible_if": "User research confirms nurses check attendance before urgency"
}

The consuming agent decides what to act on based on project context.

Tech stack

| Layer | Tool | |-------|------| | Language | TypeScript (ESM, Node 20+) | | Monorepo | Turborepo + Bun workspaces | | Browser | Vercel agent-browser (Rust CLI, 5.7x token efficiency) | | VLM | Vercel AI SDK (provider-agnostic) | | Lint | Biome (strict) | | Test | Vitest |

License

MIT