@tmls-ai/qa-harness

v0.2.0

Published

10 days ago

Project-agnostic QA harness: on every PR, an agent generates feature-scoped tests (Playwright e2e and Vitest unit now; api/a11y planned), labeled [Acceptance]/[Characterization], and commits them back so prod accumulates real QA coverage.

Downloads

590


ron-tmls

qa testing playwright ai test-generation ci

Every PR gets the tests a senior QA engineer would have written — automatically.

When a pull request opens, an agent reads the diff, the PR description, and any linked issues, generates tests scoped strictly to that one feature, runs them against a runner-local preview of your app, and commits them back to the PR branch. On merge, the tests travel into your main branch, so your test suite grows with every feature without breaking anything that already exists.

PR opened
   │
   ▼
┌─ qa-harness ───────────────────────────────────────────────┐
│  1. context    diff + PR body + linked issues (intent)     │
│  2. preview    build & start your app inside the runner    │
│  3. generate   isolated agent session per test type        │
│  4. verify     run the new tests — all must pass           │
│  5. commit     push tests back to the PR  [skip ci]        │
│  6. report     JSON + GitHub step summary                  │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
your full suite runs as the gate → merge → tests live in main

Honest test labels

The hard problem of generated tests is the oracle problem: if the agent only looks at the code, it tests what the code does — not what it should do. qa-harness makes this distinction explicit. Every generated test is labeled:

| Label | Meaning | Trust |
|---|---|---|
| [Acceptance] | The behavior is stated in the PR description or a linked issue. The test encodes intent. | High: a failing acceptance test is treated as a real bug and reported, never weakened to pass. |
| [Characterization] | An edge case the PR did not specify. Current behavior is frozen as a regression check. | Review: flagged for a human glance; if the frozen behavior is a bug, the test would cement it. |

If the PR contradicts itself (rules say X, example says Y), the agent follows the stated rules and flags the contradiction in a [Characterization] test.
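The labeling rule in the table above can be read as a small decision function. The sketch below is illustrative only: `Behavior`, `labelFor`, and `testTitle` are hypothetical names, not part of the qa-harness API, and the real agent decides from the PR text rather than from a boolean flag.

```typescript
// Hypothetical model of the labeling rule; none of these names come from qa-harness.
type Label = "Acceptance" | "Characterization";

interface Behavior {
  description: string;
  // True when the PR description or a linked issue states this behavior.
  statedInPrOrIssue: boolean;
}

// Stated intent gets [Acceptance]; anything the PR did not specify gets [Characterization].
function labelFor(behavior: Behavior): Label {
  return behavior.statedInPrOrIssue ? "Acceptance" : "Characterization";
}

// Prefix the test title with its label so a reviewer sees the trust level at a glance.
function testTitle(behavior: Behavior): string {
  return `[${labelFor(behavior)}] ${behavior.description}`;
}
```

With this rule, a behavior that appears only in the code and not in the PR text always lands in the review pile rather than being presented as verified intent.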

Quick start

1. Add the workflow to your repo:

# .github/workflows/qa.yml
name: QA Harness
on:
  pull_request:
    types: [opened, synchronize, reopened]
permissions:
  contents: write        # commit generated tests back to the PR branch
jobs:
  qa:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.head_ref }}
      - uses: actions/setup-node@v4
        with: { node-version: 22, cache: npm }
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      - run: npx -y @tmls-ai/qa-harness@0.2.0 run --pr ${{ github.event.pull_request.number }}
        env:
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

2. Add the OPENROUTER_API_KEY secret to your repo:

gh secret set OPENROUTER_API_KEY

That's it. No config file is required — defaults assume a standard npm web app (npm run build / npm run start on port 3000).

Tip: if your workflow already builds and starts the app (e.g. to reuse it for a full-suite gate), pass --base-url http://localhost:3000 and qa-harness will skip booting its own preview.
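The choice between reusing a running app and booting a preview might look roughly like the sketch below. It is not the harness's actual code: `planPreview` and `waitUntilReady` are hypothetical names, and the readiness probe is injected so the polling loop stays independent of any HTTP client.

```typescript
interface PreviewPlan {
  bootPreview: boolean;
  baseUrl: string;
}

// If --base-url was passed, reuse that app; otherwise plan to boot a local preview.
function planPreview(baseUrlFlag: string | undefined, port = 3000): PreviewPlan {
  if (baseUrlFlag) {
    return { bootPreview: false, baseUrl: baseUrlFlag };
  }
  return { bootPreview: true, baseUrl: `http://localhost:${port}` };
}

// Poll the probe until it reports ready or the timeout elapses.
// A probe that rejects (for example, connection refused) counts as "not ready yet".
async function waitUntilReady(
  probe: () => Promise<boolean>,
  timeoutMs: number,
  intervalMs = 1000,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await probe().catch(() => false)) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```

In a real run the probe would presumably issue an HTTP request against the preview URL; what counts as "ready" (any response versus a 2xx status) is an assumption this sketch leaves to the caller.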

Configuration

Optional qa-harness.config.json at the repo root — every field has a default:

{
  "model": "deepseek/deepseek-v4-pro",   // any tool-calling model on OpenRouter
  "testRoot": "tests/qa-generated",      // where generated tests live
  "preview": {
    "build": "npm run build",
    "start": "npm run start",
    "port": 3000,
    "readyPath": "/",                    // polled until it answers
    "timeoutSec": 180
  },
  "generators": {
    "e2e": true                          // see "Generators" below
  }
}

QA_AGENT_MODEL (env) overrides model per run.
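One plausible way to combine these layers is default values first, then the config file, then the environment. The sketch below assumes the values in the example config are the defaults (the text above confirms this only for the build and start commands and the port); `resolveConfig` and the type names are hypothetical.

```typescript
interface PreviewConfig {
  build: string;
  start: string;
  port: number;
  readyPath: string;
  timeoutSec: number;
}

interface HarnessConfig {
  model: string;
  testRoot: string;
  preview: PreviewConfig;
  generators: Record<string, boolean>;
}

type PartialConfig = Partial<Omit<HarnessConfig, "preview">> & {
  preview?: Partial<PreviewConfig>;
};

// Assumed defaults, copied from the example config.
const DEFAULTS: HarnessConfig = {
  model: "deepseek/deepseek-v4-pro",
  testRoot: "tests/qa-generated",
  preview: {
    build: "npm run build",
    start: "npm run start",
    port: 3000,
    readyPath: "/",
    timeoutSec: 180,
  },
  generators: { e2e: true },
};

// Merge file values over defaults, one level deep for the nested objects.
function resolveConfig(
  file: PartialConfig,
  env: Record<string, string | undefined>,
): HarnessConfig {
  const merged: HarnessConfig = {
    ...DEFAULTS,
    ...file,
    preview: { ...DEFAULTS.preview, ...file.preview },
    generators: { ...DEFAULTS.generators, ...file.generators },
  };
  // QA_AGENT_MODEL wins over both the file and the default.
  return env.QA_AGENT_MODEL ? { ...merged, model: env.QA_AGENT_MODEL } : merged;
}
```

For example, `resolveConfig({ preview: { port: 4000 } }, process.env)` would change only the port and keep every other preview setting at its default.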

CLI

qa-harness run --pr <number> [options]

| Option | Description | Default |
|---|---|---|
| --pr <n> | PR number (required) | none |
| --repo <owner/name> | GitHub repo | $GITHUB_REPOSITORY or git remote |
| --target <path> | project root under test | cwd |
| --slug <slug> | feature slug (test filename) | derived from PR head branch |
| --generator <type> | run only this generator | all enabled in config |
| --base-url <url> | use an already-running app | boot preview from config |
| --port <n> | port for the local preview | config / 3000 |
| --force | regenerate even if the test file exists | off |
| --commit / --no-commit | push generated tests to the PR branch | on in CI, off locally |

Required env: OPENROUTER_API_KEY (generation), GITHUB_TOKEN (PR/issue context).
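Two of the defaults in the table are derived rather than fixed: the slug and the commit behavior. The sketch below shows one way such defaults could be computed. Both helpers are hypothetical, the slug rule is a guess at what "derived from PR head branch" means, and CI detection via the `CI` variable (which GitHub Actions sets to `true`) is an assumption about how the harness tells CI from a local run.

```typescript
// Assumed slug rule: lowercase the branch name and collapse every run of
// non-alphanumeric characters into a single hyphen.
function slugFromBranch(branch: string): string {
  return branch
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// An explicit --commit / --no-commit always wins; otherwise commit only in CI.
function shouldCommit(
  flag: boolean | undefined,
  env: Record<string, string | undefined>,
): boolean {
  if (flag !== undefined) return flag;
  return env.CI === "true";
}
```

Under these assumptions a branch named `feature/Add-Login` yields the slug `feature-add-login`, and a local run without flags leaves the generated file uncommitted for inspection.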

Generators

Each test type is a structurally isolated plugin: its own subdirectory under testRoot, its own agent session, its own tools, its own runner. Test types cannot mix — a generator physically cannot write into another generator's directory, and a generator that gets no app URL cannot drift into writing HTTP tests. A generator may also honestly report "not applicable" for a PR instead of inventing tests.

| Type | Status | Runner | Writes to |
|---|---|---|---|
| e2e | ✅ shipped | Playwright (browser) | tests/qa-generated/e2e/ |
| unit | ✅ shipped | Vitest | tests/qa-generated/unit/ |
| api | 🔜 planned | Playwright request-context (no browser) | tests/qa-generated/api/ |
| a11y | 🔜 planned | Playwright + axe-core | tests/qa-generated/a11y/ |
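The isolation described above could be expressed as a plugin contract along the following lines. This is a sketch, not the package's real interface: every name here is hypothetical, and the toy unit generator uses a regular expression where the real one runs an agent session.

```typescript
type GeneratorType = "e2e" | "unit" | "api" | "a11y";

// A generator either writes one file or honestly declines.
type GeneratorResult =
  | { status: "generated"; file: string }
  | { status: "not-applicable"; reason: string };

interface GeneratorContext {
  slug: string;
  testRoot: string;
  diff: string;
  appUrl?: string; // only present for generators that drive a running app
}

interface Generator {
  type: GeneratorType;
  needsAppUrl: boolean;
  generate(ctx: GeneratorContext): GeneratorResult;
}

// Toy unit generator: declines when the diff adds no exported code.
const unitGenerator: Generator = {
  type: "unit",
  needsAppUrl: false,
  generate(ctx) {
    if (!/^\+.*\bexport\b/m.test(ctx.diff)) {
      return { status: "not-applicable", reason: "no exported logic in the diff" };
    }
    return { status: "generated", file: `${ctx.testRoot}/unit/${ctx.slug}.spec.ts` };
  },
};

// Only generators that declare needsAppUrl ever receive the URL, so a unit
// generator has nothing to point an HTTP test at.
function contextFor(
  gen: Generator,
  base: Omit<GeneratorContext, "appUrl">,
  appUrl: string,
): GeneratorContext {
  return gen.needsAppUrl ? { ...base, appUrl } : { ...base };
}
```

Withholding the URL at the type and call-site level, rather than asking the model not to use it, is the kind of structural guarantee the paragraph above describes.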

Enabling `unit`

The unit generator tests exported pure logic touched by the diff (it gets no app URL and may honestly skip PRs without testable logic). It needs three things in the consumer repo:

// qa-harness.config.json
{ "generators": { "e2e": true, "unit": true } }

npm i -D vitest

// playwright.config.ts — keep the runners apart:
testIgnore: "**/unit/**"

Guardrails

Enforced by the harness — not left to the model's goodwill:

Path ownership — the test file path is {testRoot}/{type}/{slug}.spec.ts, period. Whatever filename the model suggests, it can never overwrite another feature's or another type's tests.
Skip-if-exists — if the test file is already there (e.g. human-reviewed on a previous push), generation is skipped and the file preserved. Regenerate explicitly with --force.
No narration credit — a model that announces "let me write the file" without calling the tool is nudged until the file is actually written and verified green.
Never weaken — an [Acceptance] test that correctly encodes PR intent but fails is kept and reported as a bug, not loosened until it passes.
Scoped diff — the agent sees the feature diff only; generated tests, workflows, and harness files are filtered out of its view.
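Three of these guardrails are simple enough to sketch as pure functions. The code below is an illustration under stated assumptions, not the harness's implementation: the function names are hypothetical, the slug sanitizing rule is assumed, and treating `qa-harness.config.json` as the only "harness file" is a guess.

```typescript
type GeneratorType = "e2e" | "unit" | "api" | "a11y";

// Path ownership: the path is built only from testRoot, type, and slug.
// The model's suggested filename is accepted and deliberately ignored.
function ownedTestPath(
  testRoot: string,
  type: GeneratorType,
  slug: string,
  _modelSuggestion?: string,
): string {
  // Assumed sanitizing rule: anything that could act as a path separator becomes "-".
  const safeSlug = slug.replace(/[^a-zA-Z0-9_-]+/g, "-");
  return `${testRoot}/${type}/${safeSlug}.spec.ts`;
}

// Skip-if-exists: generate only when the file is missing or --force was passed.
function shouldGenerate(fileExists: boolean, force: boolean): boolean {
  return force || !fileExists;
}

// Scoped diff: hide generated tests, workflows, and harness config from the agent.
function scopeDiff(changedFiles: string[], testRoot: string): string[] {
  return changedFiles.filter(
    (file) =>
      !file.startsWith(`${testRoot}/`) &&
      !file.startsWith(".github/workflows/") &&
      file !== "qa-harness.config.json",
  );
}
```

Because `ownedTestPath` never reads the suggestion, a model that proposes `../../unit/other.spec.ts` still ends up writing to its own feature's file.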

Running locally

export OPENROUTER_API_KEY=...
export GITHUB_TOKEN=$(gh auth token)

npx @tmls-ai/qa-harness run --pr 42 --target ~/code/my-app --port 3456
# local default is --no-commit: inspect the generated file, then commit yourself

Roadmap

api generator (request-level tests against the preview — cheap & stable)
a11y generator (axe-core scans of routes the diff touched)
Slack notification with the Acceptance/Characterization breakdown
Mutation-style meaningfulness check for generated tests
