@tmls-ai/qa-harness
v0.2.0
Project-agnostic QA harness: on every PR, an agent generates feature-scoped tests (Playwright e2e and Vitest unit now; api/a11y planned), labeled [Acceptance]/[Characterization], and commits them back so prod accumulates real QA coverage.
Every PR gets the tests a senior QA engineer would have written — automatically.
When a pull request opens, an agent reads the diff, the PR description, and any linked issues, generates tests scoped strictly to that one feature, runs them against a runner-local preview of your app, and commits them back to the PR branch. On merge, the tests travel into your main branch, so your test suite grows with every feature without breaking anything that already exists.
```
PR opened
    │
    ▼
┌─ qa-harness ────────────────────────────────────────────────┐
│ 1. context   diff + PR body + linked issues (intent)        │
│ 2. preview   build & start your app inside the runner       │
│ 3. generate  isolated agent session per test type           │
│ 4. verify    run the new tests; all must pass               │
│ 5. commit    push tests back to the PR [skip ci]            │
│ 6. report    JSON + GitHub step summary                     │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
your full suite runs as the gate → merge → tests live in main
```

Honest test labels
The hard problem of generated tests is the oracle problem: if the agent only looks at the code, it tests what the code does — not what it should do. qa-harness makes this distinction explicit. Every generated test is labeled:
| Label | Meaning | Trust |
|---|---|---|
| [Acceptance] | The behavior is stated in the PR description or a linked issue. The test encodes intent. | High — a failing acceptance test is treated as a real bug and reported, never weakened to pass. |
| [Characterization] | An edge case the PR did not specify. Current behavior is frozen as a regression check. | Review — flagged for a human glance; if the frozen behavior is a bug, the test would cement it. |
If the PR contradicts itself (rules say X, example says Y), the agent follows
the stated rules and flags the contradiction in a [Characterization] test.
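
To make the two labels concrete, here is a minimal sketch of how a reviewer-side script could split generated tests by label. It assumes the label appears as a prefix of each test title (for example `[Acceptance] rejects an empty email`); the README does not say where the label is stored, so `parseLabel` and `breakdown` are hypothetical helpers, not part of qa-harness.

```typescript
// Hypothetical helpers: assume each generated test title starts with its label.
type Label = "Acceptance" | "Characterization";

interface LabelBreakdown {
  acceptance: string[];       // encode stated intent: a failure is a real bug
  characterization: string[]; // freeze current behavior: needs a human glance
  unlabeled: string[];        // titles that carry neither label
}

const LABEL_PATTERN = /^\[(Acceptance|Characterization)\]\s*(.+)$/;

// Split a title such as "[Acceptance] rejects empty email" into label and name.
function parseLabel(title: string): { label: Label; name: string } | null {
  const match = LABEL_PATTERN.exec(title.trim());
  if (!match) return null;
  return { label: match[1] as Label, name: match[2] ?? "" };
}

// Group test titles by label so characterization tests can be routed to review.
function breakdown(titles: string[]): LabelBreakdown {
  const result: LabelBreakdown = { acceptance: [], characterization: [], unlabeled: [] };
  for (const title of titles) {
    const parsed = parseLabel(title);
    if (parsed === null) result.unlabeled.push(title);
    else if (parsed.label === "Acceptance") result.acceptance.push(parsed.name);
    else result.characterization.push(parsed.name);
  }
  return result;
}
```

A review step could then list only the `characterization` entries, since those are the tests that may cement a bug.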
Quick start
1. Add the workflow to your repo:
```yaml
# .github/workflows/qa.yml
name: QA Harness
on:
  pull_request:
    types: [opened, synchronize, reopened]
permissions:
  contents: write   # commit generated tests back to the PR branch
jobs:
  qa:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.head_ref }}
      - uses: actions/setup-node@v4
        with: { node-version: 22, cache: npm }
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      - run: npx -y @tmls-ai/qa-harness@0.2.0 run --pr ${{ github.event.pull_request.number }}
        env:
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

2. Add the `OPENROUTER_API_KEY` secret to your repo:

```sh
gh secret set OPENROUTER_API_KEY
```

That's it. No config file is required; defaults assume a standard npm web app (`npm run build` / `npm run start` on port 3000).

Tip: if your workflow already builds and starts the app (e.g. to reuse it for a full-suite gate), pass `--base-url http://localhost:3000` and qa-harness will skip booting its own preview.
Configuration
Optional qa-harness.config.json at the repo root — every field has a default:
```jsonc
{
  "model": "deepseek/deepseek-v4-pro",   // any tool-calling model on OpenRouter
  "testRoot": "tests/qa-generated",      // where generated tests live
  "preview": {
    "build": "npm run build",
    "start": "npm run start",
    "port": 3000,
    "readyPath": "/",                    // polled until it answers
    "timeoutSec": 180
  },
  "generators": {
    "e2e": true                          // see "Generators" below
  }
}
```

`QA_AGENT_MODEL` (env) overrides `model` per run.
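
The precedence described above (built-in defaults, then the optional config file, then the `QA_AGENT_MODEL` environment variable) can be sketched as follows. This is an illustration under assumptions, not the harness's actual loader: `resolveConfig`, `PartialConfig`, and the type names are hypothetical, and the default values are copied from the example config.

```typescript
interface PreviewConfig {
  build: string;
  start: string;
  port: number;
  readyPath: string;
  timeoutSec: number;
}

interface HarnessConfig {
  model: string;
  testRoot: string;
  preview: PreviewConfig;
  generators: Record<string, boolean>;
}

// Every field is optional in the file, including individual preview fields.
type PartialConfig = Partial<Omit<HarnessConfig, "preview">> & {
  preview?: Partial<PreviewConfig>;
};

// Defaults taken from the example config in this README.
const DEFAULTS: HarnessConfig = {
  model: "deepseek/deepseek-v4-pro",
  testRoot: "tests/qa-generated",
  preview: {
    build: "npm run build",
    start: "npm run start",
    port: 3000,
    readyPath: "/",
    timeoutSec: 180,
  },
  generators: { e2e: true },
};

// Hypothetical resolver: env override beats the file, the file beats defaults.
function resolveConfig(
  file: PartialConfig,
  env: Record<string, string | undefined>,
): HarnessConfig {
  return {
    model: env["QA_AGENT_MODEL"] || file.model || DEFAULTS.model,
    testRoot: file.testRoot ?? DEFAULTS.testRoot,
    preview: { ...DEFAULTS.preview, ...file.preview },
    generators: { ...DEFAULTS.generators, ...file.generators },
  };
}
```

For example, `resolveConfig({ preview: { port: 4000 } }, {})` keeps every default except the preview port.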
CLI
```sh
qa-harness run --pr <number> [options]
```

| Option | Description | Default |
|---|---|---|
| --pr <n> | PR number (required) | — |
| --repo <owner/name> | GitHub repo | $GITHUB_REPOSITORY or git remote |
| --target <path> | project root under test | cwd |
| --slug <slug> | feature slug (test filename) | derived from PR head branch |
| --generator <type> | run only this generator | all enabled in config |
| --base-url <url> | use an already-running app | boot preview from config |
| --port <n> | port for the local preview | config / 3000 |
| --force | regenerate even if the test file exists | off |
| --commit / --no-commit | push generated tests to the PR branch | on in CI, off locally |
Required env: OPENROUTER_API_KEY (generation), GITHUB_TOKEN (PR/issue context).
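
Two of the defaults in the table depend on context: the preview is booted only when no `--base-url` is given, and `--commit` is on in CI but off locally. A rough sketch of that decision logic follows. The function names are hypothetical, and detecting CI through the `CI` environment variable (which GitHub Actions sets to `true`) is an assumption about how the harness decides.

```typescript
interface RunFlags {
  baseUrl?: string;  // --base-url
  port?: number;     // --port
  commit?: boolean;  // --commit / --no-commit; undefined when neither is passed
}

type PreviewPlan =
  | { mode: "reuse"; baseUrl: string } // an already-running app was supplied
  | { mode: "boot"; port: number };    // the harness builds and starts the app

// --base-url skips the preview; otherwise the port falls back flag -> config -> 3000.
function planPreview(flags: RunFlags, configPort?: number): PreviewPlan {
  if (flags.baseUrl) return { mode: "reuse", baseUrl: flags.baseUrl };
  return { mode: "boot", port: flags.port ?? configPort ?? 3000 };
}

// An explicit flag always wins; otherwise commit only when running in CI.
// Assumption: CI is detected via the CI environment variable.
function shouldCommit(flags: RunFlags, env: Record<string, string | undefined>): boolean {
  if (flags.commit !== undefined) return flags.commit;
  return env["CI"] === "true";
}
```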
Generators
Each test type is a structurally isolated plugin: its own subdirectory
under testRoot, its own agent session, its own tools, its own runner. Test
types cannot mix — a generator physically cannot write into another
generator's directory, and a generator that gets no app URL cannot drift into
writing HTTP tests. A generator may also honestly report "not applicable"
for a PR instead of inventing tests.
| Type | Status | Runner | Writes to |
|---|---|---|---|
| e2e | ✅ shipped | Playwright (browser) | tests/qa-generated/e2e/ |
| unit | ✅ shipped | Vitest | tests/qa-generated/unit/ |
| api | 🔜 planned | Playwright request-context (no browser) | tests/qa-generated/api/ |
| a11y | 🔜 planned | Playwright + axe-core | tests/qa-generated/a11y/ |
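
The isolation described above can be pictured as a small plugin contract. The sketch below is an illustration only; the README does not document the plugin API, so `GeneratorPlugin`, `runGenerator`, and the export-detection heuristic are all hypothetical. It shows two of the stated properties: a generator that needs no app URL never receives one, and a generator may answer "not applicable" instead of inventing tests. The output path is chosen by the harness, not by the plugin.

```typescript
type TestType = "e2e" | "unit";

interface GenerationContext {
  diff: string;
  slug: string;
  appUrl?: string; // only present for generators that drive a running app
}

type PluginOutput =
  | { status: "generated"; source: string }
  | { status: "not-applicable"; reason: string };

type HarnessResult =
  | { status: "generated"; file: string; source: string }
  | { status: "not-applicable"; reason: string };

interface GeneratorPlugin {
  type: TestType;
  needsAppUrl: boolean;
  generate(ctx: GenerationContext): PluginOutput;
}

// The harness, not the plugin, strips the app URL and picks the output path.
function runGenerator(plugin: GeneratorPlugin, ctx: GenerationContext, testRoot: string): HarnessResult {
  const scoped: GenerationContext = plugin.needsAppUrl ? ctx : { diff: ctx.diff, slug: ctx.slug };
  const out = plugin.generate(scoped);
  if (out.status === "not-applicable") return out;
  return { status: "generated", file: `${testRoot}/${plugin.type}/${ctx.slug}.spec.ts`, source: out.source };
}

// Hypothetical unit plugin: skips PRs whose added lines export nothing.
const unitPlugin: GeneratorPlugin = {
  type: "unit",
  needsAppUrl: false,
  generate(ctx) {
    if (ctx.appUrl !== undefined) throw new Error("unit generator must not receive an app URL");
    if (!/^\+.*\bexport\b/m.test(ctx.diff)) {
      return { status: "not-applicable", reason: "no exported logic touched" };
    }
    return { status: "generated", source: `// placeholder test body for ${ctx.slug}\n` };
  },
};
```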
Enabling unit
The unit generator tests exported pure logic touched by the diff (it gets no app URL and may honestly skip PRs without testable logic). It needs three things in the consumer repo:
```jsonc
// qa-harness.config.json
{ "generators": { "e2e": true, "unit": true } }
```

```sh
npm i -D vitest
```

```ts
// playwright.config.ts: keep the runners apart
testIgnore: "**/unit/**"
```

Guardrails
Enforced by the harness — not left to the model's goodwill:
- Path ownership: the test file path is `{testRoot}/{type}/{slug}.spec.ts`, period. Whatever filename the model suggests, it can never overwrite another feature's or another type's tests.
- Skip-if-exists: if the test file is already there (e.g. human-reviewed on a previous push), generation is skipped and the file preserved. Regenerate explicitly with `--force`.
- No narration credit: a model that announces "let me write the file" without calling the tool is nudged until the file is actually written and verified green.
- Never weaken: an `[Acceptance]` test that correctly encodes PR intent but fails is kept and reported as a bug, not loosened until it passes.
- Scoped diff: the agent sees the feature diff only; generated tests, workflows, and harness files are filtered out of its view.
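
The first two guardrails reduce to a pair of small pure functions. The sketch below shows one way to build the path from the `{testRoot}/{type}/{slug}.spec.ts` template so that a hostile or odd slug cannot escape its directory, plus the skip-if-exists decision. The slug sanitization rule is an assumption (the README only says the slug is derived from the PR head branch), and both function names are hypothetical.

```typescript
type GeneratorType = "e2e" | "unit" | "api" | "a11y";

// Build the only path a generator may write to: {testRoot}/{type}/{slug}.spec.ts.
// Assumption: the slug is reduced to lowercase letters, digits, and dashes,
// so separators and ".." segments cannot survive into the path.
function ownedTestPath(testRoot: string, type: GeneratorType, slug: string): string {
  const safeSlug = slug
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything else, including "/" and "."
    .replace(/^-+|-+$/g, "");    // trim leading and trailing dashes
  if (safeSlug === "") throw new Error("slug has no usable characters");
  const root = testRoot.replace(/\/+$/, ""); // tolerate a trailing slash
  return `${root}/${type}/${safeSlug}.spec.ts`;
}

// Skip-if-exists: an existing file is preserved unless --force is passed.
function shouldGenerate(fileExists: boolean, force: boolean): boolean {
  return force || !fileExists;
}
```

Because the path is computed from the generator type and the slug alone, whatever filename the model proposes never enters the calculation.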
Running locally
```sh
export OPENROUTER_API_KEY=...
export GITHUB_TOKEN=$(gh auth token)
npx @tmls-ai/qa-harness run --pr 42 --target ~/code/my-app --port 3456
# local default is --no-commit: inspect the generated file, then commit yourself
```

Roadmap
- `api` generator (request-level tests against the preview; cheap & stable)
- `a11y` generator (axe-core scans of routes the diff touched)
- Slack notification with the Acceptance/Characterization breakdown
- Mutation-style meaningfulness check for generated tests
License
MIT © Timeless
