@plune-ai/cairn
v0.4.2
Cairn — an AI that walks your system and leaves a trail of tests. Autonomous QA agent (UI today; API/unit/docs planned) powered by Claude/OpenRouter, with self-improvement via Langfuse.
Cairn
Cairn — an AI that walks your system and leaves a trail of tests: UI, API, unit, docs.
▶ Demo: docs/demo/cairn.cast — play with asciinema play docs/demo/cairn.cast
Autonomous QA agent (Node.js / TypeScript) that logs into a web app with a saved session, explores pages (ARIA snapshot + screenshot), designs methodology-based UI test cases (ISO/IEC/IEEE 29119-4), generates runnable `@playwright/test` code, self-validates and self-repairs, and self-improves via Langfuse. A portable utility (CLI + library) for embedding into other TypeScript projects.
Cairn is the generation layer — it produces tests across surfaces (UI today; API / unit / docs planned), each arriving by demand, one at a time. A separate Plune layer owns record / management / eval.
Renamed from Lex-Bot → Cairn. The old `lex-bot` command and the `@plune-ai/lex-bot` package still work (the CLI prints a one-line deprecation notice), but switch to `cairn` / `@plune-ai/cairn`; the old names will be removed in 1–2 releases. Legacy `LEX_` / `LEXBOT_` env vars still work too; prefer `CAIRN_`.
What it does
Point it at a URL (behind login, with a saved Playwright storageState) and it will:
- Observe: navigate, wait for SPA hydration, capture an ARIA snapshot + screenshot, extract interactive elements.
- Ground: verify every locator (`getByRole().count()`), explore tabs/views (multi-state), probe safe state transitions, so it tests what is actually there, not hallucinations.
- Design: write methodology-based test cases (29119-4: EP / BVA / decision-table / state-transition / error-guessing), steered by an optional checklist and domain knowledge files.
- Generate & validate: emit POM-style `@playwright/test` specs (role-based locators, `test.step`), run them, classify pass/fail/flaky, and self-repair failures (with keep-best: a repair never makes the suite worse).
- Judge & learn: deterministic scorers + an LLM judge + a holistic Pilot verdict + semantic checklist coverage, all traced to Langfuse; accumulate the best cases across runs.
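The pass/fail/flaky classification and the keep-best rule can be sketched as below. This is a minimal illustration, not Cairn's implementation: the type and function names are hypothetical, and it assumes "flaky" means mixed outcomes across repeated runs and that "worse" means fewer passing tests.

```typescript
// Hypothetical names; Cairn's internal types may differ.
type Outcome = "pass" | "fail";
type Verdict = "pass" | "fail" | "flaky";

interface Suite {
  verdicts: Verdict[];
}

// Classify one test from repeated runs: consistent results keep their
// outcome, mixed results count as flaky.
function classify(outcomes: Outcome[]): Verdict {
  if (outcomes.length === 0) throw new Error("no outcomes to classify");
  const passes = outcomes.filter((o) => o === "pass").length;
  if (passes === outcomes.length) return "pass";
  if (passes === 0) return "fail";
  return "flaky";
}

// Keep-best (assumed rule): accept a repaired suite only if it passes at
// least as many tests as the suite it would replace.
function keepBest(current: Suite, repaired: Suite): Suite {
  const green = (s: Suite) => s.verdicts.filter((v) => v === "pass").length;
  return green(repaired) >= green(current) ? repaired : current;
}
```

With this rule a repair that turns a passing test red is discarded, so the retained suite never has fewer green tests than before the repair attempt.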
Two decoupled modes
- `design`: explore + write test cases as Markdown files (`ATC-*` automatable / `MTC-*` manual) with recorded selectors. No code. Review them as a human, automate later.
- `automate`: generate `@playwright/test` code from approved `ATC-*` cases (skips `MTC-*` manual ones).

`explore` runs the full pipeline at once (cases → code → validation → repair → Pilot verdict).
How it works — the full cycle
New to this? The bot writes two kinds of test case:
- ATC (Automatable Test Case) — the bot is confident it can drive reliably (read-only checks, verified locators) → it generates Playwright code for these.
- MTC (Manual Test Case) — needs a human (full form submits, security/XSS, visual/UX, irreversible actions) → left for you to run by hand.
The typical flow:
1. Capture a session (once): log in so the bot can reach pages behind auth.
2. Design: the bot studies the page and writes test cases (`ATC-*` / `MTC-*` `.md` files with recorded selectors). No code yet.
3. Review: you read the cases (in the TUI: Browse past runs → open a run → Cases).
4. Promote (optional): reviewed an `MTC` and decided it's actually automatable? `cairn promote …` (or `a` in the TUI) converts it to an `ATC` in place. It's then picked up by automate.
5. Automate: generate `@playwright/test` code from the `ATC` cases.
6. Validate: run the generated tests, classify pass/fail/flaky, and self-repair failures.
explore runs steps 2–6 in one go; design + automate split them so you can review (and promote) in between. Full walkthrough: docs/getting-started.md.
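The ATC/MTC split that drives the flow above can be illustrated with a few lines of TypeScript. The helper names are hypothetical, and the promotion rule shown (swap the `MTC-` prefix for `ATC-`) is an assumption about how the ID changes; the README only says the case is converted in place.

```typescript
// Hypothetical helpers; only the ATC-/MTC- prefixes come from the README.
type CaseKind = "ATC" | "MTC";

function kindOf(caseId: string): CaseKind {
  if (caseId.startsWith("ATC-")) return "ATC";
  if (caseId.startsWith("MTC-")) return "MTC";
  throw new Error(`unrecognised case id: ${caseId}`);
}

// automate only picks up ATC-* cases and skips MTC-* ones.
function selectAutomatable(caseIds: string[]): string[] {
  return caseIds.filter((id) => kindOf(id) === "ATC");
}

// Assumption: promotion swaps the MTC- prefix for ATC- and keeps the rest.
function promoteId(caseId: string): string {
  if (kindOf(caseId) !== "MTC") throw new Error(`${caseId} is not a manual case`);
  return "ATC-" + caseId.slice("MTC-".length);
}
```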
Install
npm install -g @plune-ai/cairn # global CLI → run `cairn …`
# …or local / library install:
npm install @plune-ai/cairn # → run via `npx cairn …`
# one-time: download the Chromium build Cairn drives (NOT shipped inside the npm package)
cairn install-browsers # uses Cairn's OWN Playwright → always the right Chromium revision
# …or skip the download entirely and drive your installed Google Chrome: pass --channel chrome

Requires Node.js 20+. Copy .env.example → .env and fill in your keys.
Two ways to invoke. A global install (`-g`) puts `cairn` on your PATH, so `cairn design …` works anywhere. A local install does not: run it as `npx cairn design …` from the folder where you installed it. The examples below use the bare `cairn`; prefix them with `npx` if you installed locally.

Browsers are a separate download. `npm install` pulls the Playwright library but not its browser binaries. Run `cairn install-browsers` once; it uses Cairn's own Playwright, so the Chromium revision always matches what Cairn launches. Prefer your existing Chrome? Skip the download and pass `--channel chrome` (this is also how Cairn coexists with a project that already ships its own Playwright). Otherwise `explore` / `automate --validate` stop early with a clear "Playwright browsers are not installed" message that prints both fixes. Run `cairn doctor` any time to see the state.
Quickstart
Installed locally (without `-g`)? Prefix every `cairn …` below with `npx` (e.g. `npx cairn design …`).
# 0. One-time: download the browser Cairn drives (skip if you already ran it during install,
# or skip entirely and add --channel chrome to drive your installed Google Chrome)
cairn install-browsers
# 1. Capture a session (opens a browser to log in)
cairn session capture --url https://app.example.com/login --name myapp
# 2. Design test cases (no code) — review the .md files it writes
cairn design --url https://app.example.com/page --session myapp --checklist plan.md
# 3. (optional) Promote a manual case you decided is automatable: MTC → ATC
cairn promote --run runs/<id> --cases MTC-LOGIN-001
# 4. Automate the approved (ATC) cases → @playwright/test code, and run them
cairn automate --run runs/<id> --validate --session myapp
# …or do everything at once:
cairn explore --url https://app.example.com/page --session myapp --checklist plan.md

The --checklist file steers what the bot tests (and is scored as coverage). Copy examples/plan.md, a ready-to-run checklist for https://plune.ai/cairn, as a starting point.
Run id / Windows tip: `--run` (and `--from-run`) accept just the run id (`--run <id>`) in addition to `runs/<id>` or an absolute path. In Git Bash (MINGW64) quote the path or use forward slashes (`--run 'runs/<id>'`), because an unquoted `\` is eaten by the shell before Cairn sees it (so `runs\<id>` would otherwise arrive as `runs<id>`). The bare-id form sidesteps the issue entirely.
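One way to accept all three forms of `--run` is sketched below. This is an assumption about the behaviour, not Cairn's actual resolver: the function name is hypothetical, and it treats any value without a path separator as a bare run id under `runs/`.

```typescript
import * as path from "node:path";

// Hypothetical resolver: a value without any path separator is treated as a
// bare run id under runs/; anything else is used as a path.
function resolveRunDir(value: string, runsRoot = "runs"): string {
  const isBareId = !value.includes("/") && !value.includes("\\");
  if (isBareId) return path.join(runsRoot, value);
  // Absolute paths pass through unchanged; relative paths are normalised.
  return path.isAbsolute(value) ? value : path.normalize(value);
}
```

Under this sketch `resolveRunDir("abc123")` and `resolveRunDir("runs/abc123")` point at the same directory, which is why the bare-id form is immune to shell backslash handling.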
Add --fresh (on design and explore) to ignore prior runs of the same URL. By default a 2nd+
run on a URL reuses its previously stable cases as context and generates only the delta (new
cases), so re-runs get smaller. --fresh skips that and generates a full set every time — useful
for clean A/B comparisons. In the TUI it's the "Ignore prior-run experience for this URL?" toggle (default no).
New here? Read the Getting started guide — it walks the whole cycle with explanations.
Authenticated targets
Cairn explores your app as a logged-in user. You capture the login once into a Playwright
storageState (cookies + localStorage); every later run reuses it — no credentials in code, no
re-login per run.
# 1. Capture once — a real browser opens; log in by hand, then press Enter.
cairn session capture --url https://your-app.example.com/login --name myapp
# 2. Point Cairn at any page behind that login, reusing the session.
cairn explore --url https://your-app.example.com/dashboard --session myapp

- Pointing Cairn at your OWN gated app? That's the intended flow: capture against your login page, then `explore` / `design` any authenticated page with `--session <name>`.
- OAuth / Google login (blocks automated browsers): add `--channel chrome` to drive your real Google Chrome. `--channel` works on `session capture`, `observe`, `design`, `explore`, and `automate --validate`, and needs no bundled-Chromium download, so it's also the simplest way to run inside a project that already has its own Playwright. (Without a channel, Cairn uses the bundled Chromium from `cairn install-browsers`.)
- Manage sessions: `cairn session ls` lists saved sessions; `cairn session rm <name>` deletes one. (`cairn login` is a shorthand for `cairn session capture`.)
- Already have a `storageState.json`? Skip capture and pass it directly: `--session-file ./path/to/state.json`.
- Expired session? If the first page Cairn sees looks like a login screen, it stops with a clear re-capture message instead of exploring the sign-in page.
- Secrets hygiene: sessions live in `.auth/` (matching `*.storageState.json`), which is gitignored and never committed. Treat the files like passwords.
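For readers unfamiliar with `storageState`, the reuse step amounts to creating a browser context from the saved file. The sketch below uses small structural interfaces so it stands alone; Playwright's real `Browser.newContext({ storageState })` and `BrowserContext.newPage()` fit these shapes. The `sessionFile` naming (`.auth/<name>.storageState.json`) is an assumption inferred from the glob above, and the function names are hypothetical.

```typescript
// Minimal structural types so the sketch stands alone; Playwright's Browser
// and BrowserContext objects provide these methods.
interface ContextLike {
  newPage(): Promise<unknown>;
}
interface BrowserLike {
  newContext(options: { storageState: string }): Promise<ContextLike>;
}

// Assumption: a session named "myapp" is stored as .auth/myapp.storageState.json.
function sessionFile(name: string): string {
  return `.auth/${name}.storageState.json`;
}

// Open a page that starts with the captured cookies and localStorage.
async function openWithSession(browser: BrowserLike, name: string): Promise<unknown> {
  const context = await browser.newContext({ storageState: sessionFile(name) });
  return context.newPage();
}
```

In a real script `browser` would come from something like `chromium.launch()` in the `playwright` package; no credentials appear in code because the login lives entirely in the state file.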
Working inside the repo? `npm run session:save -- --url <u> --name <s>` still works; it's a thin wrapper over the same capture logic that ships as `cairn session capture`.
Interactive TUI (optional)
Run cairn with no arguments in a terminal to open the interactive TUI (built with Ink / React-for-CLI).
Ink and React are optional dependencies — a default install omits them to keep the footprint small.
Install them once to enable the TUI:
npm install ink react ink-select-input ink-spinner ink-text-input

Then:

cairn # launches the terminal UI (requires the Ink packages above)

Pick a command (explore / design / automate), fill parameters (URL, session, checklist, style, fresh) via a
guided form, watch a live dashboard of the pipeline stages as the run progresses, read the result summary
(scores, green %, Pilot verdict, test cases), and browse past runs in ./runs — opening any run to
read its test cases, report and logs.
The commands below stay available for scripting/CI; in a non-interactive (piped/CI) shell, cairn with
no arguments prints help instead of starting the UI. If the Ink packages are not installed, cairn with
no arguments also falls back to printing help.
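The no-argument behaviour described above reduces to a two-condition check. The sketch below is illustrative only: the names are hypothetical, and how Cairn actually detects an interactive shell or the Ink packages is not stated here (a TTY check such as `process.stdout.isTTY` would be one plausible input).

```typescript
// Hypothetical shape of the inputs to the decision.
interface NoArgsContext {
  isInteractive: boolean; // false in piped/CI shells
  inkInstalled: boolean;  // true once the optional Ink packages are present
}

type NoArgsAction = "tui" | "help";

// The TUI starts only when both conditions hold; every other case prints help.
function noArgsAction(ctx: NoArgsContext): NoArgsAction {
  return ctx.isInteractive && ctx.inkInstalled ? "tui" : "help";
}
```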
Commands
| Command | Purpose |
|---|---|
| cairn session capture --url <loginUrl> --name <s> | Capture a login session once → .auth/ (cairn login is a shorthand; session ls / session rm) |
| cairn observe --url <u> [--session <s>] | ARIA snapshot + interactive elements + screenshot |
| cairn design --url <u> --session <s> [--checklist <f>] [--style <s>] [--fresh] | Test cases only (ATC/MTC .md + selectors), no code |
| cairn automate --run <dir> [--validate --session <s>] | @playwright/test from ATC-* cases |
| cairn promote --run <dir> --cases <ids> [--session <s>] | Promote manual MTC case(s) to ATC (.md only; then automate) |
| cairn explore --url <u> --session <s> [--checklist <f>] [--fresh] | Full pipeline (cases → code → validate → repair → Pilot) |
| cairn experiment --dataset <d> --candidate name=file | Compare prompt versions on a dataset |
`lex-bot <command>` still runs every command above (deprecated alias: prints a notice, then runs `cairn`).
Configuration (env)
| Var | Purpose |
|---|---|
| LLM_PROFILE | anthropic / openrouter / mixed (per-tier default models) |
| LLM_ROUTING | per-role preset: fast (Groq worker) / volume (OpenRouter worker); see Role routing |
| ANTHROPIC_API_KEY / OPENROUTER_API_KEY / GROQ_API_KEY | provider keys (per profile / routing) |
| QA_TESTCASE_LANG | test-case language (default English; e.g. Ukrainian, uk) |
| LANGFUSE_BASE_URL / LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY | Langfuse — cloud or self-hosted (optional; see below) |
| BROWSER_BACKEND | lib (in-process Playwright) / cli |
| BROWSER_CHANNEL | chrome/msedge → drive a system browser (helps with OAuth; no bundled-Chromium download, and coexists with a host project's own Playwright). Per-command flag: --channel. |
| MAX_REPAIR | repair attempts (default 2) |
Env var prefix: every variable above is read as-is or with a `CAIRN_` prefix (e.g. `CAIRN_LLM_PROFILE`, `CAIRN_MAX_REPAIR`). Legacy `LEX_` / `LEXBOT_` prefixes still work but print a one-time deprecation warning; prefer `CAIRN_`.

Role routing (`LLM_ROUTING`, optional): layer a cheap worker over any profile while keeping the strong reasoner. One flag picks where the mechanical steps (identify-elements, generate-code/repair) run:
- `fast` → worker on Groq `llama-3.3-70b-versatile`: lowest latency/cost, OpenAI-compatible tool-calling.
- `volume` → worker on OpenRouter `deepseek/deepseek-chat`: model breadth.
- default (unset) → the profile's own per-tier models.

In every preset the reasoner (design test cases + Pilot verdict) stays on Anthropic `claude-opus-4-8` for judgment quality, and the cheap `judge` scorer keeps the profile tier (routing never touches it). Override any role with `CAIRN_ROLE_WORKER` / `CAIRN_ROLE_REASONER=provider:model`; pass `--routing <preset>` on `explore` / `design` / `automate` to set it per run. Per-run per-role cost (tokens + $) is printed in the run summary.

Domain knowledge: put `*.md` files in `./knowledge/` with a `url:` front-matter to inject credentials/validation rules into design.

Prompt overrides: drop `./prompts/<name>.md` to override any built-in prompt without rebuilding.
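To make the routing rules concrete, here is a hedged sketch of how a worker model could be resolved from the environment. It is not Cairn's code: the function names, the provider id strings (`"groq"`, `"openrouter"`), and the precedence (explicit `CAIRN_ROLE_WORKER` override first, then the preset) are assumptions for illustration.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not Cairn's internals.
interface ModelRef {
  provider: string;
  model: string;
}

const WORKER_PRESETS: Record<string, ModelRef> = {
  fast: { provider: "groq", model: "llama-3.3-70b-versatile" },
  volume: { provider: "openrouter", model: "deepseek/deepseek-chat" },
};

// Parse "provider:model", splitting on the first colon only because model
// ids may themselves contain ":" or "/".
function parseModelRef(spec: string): ModelRef {
  const i = spec.indexOf(":");
  if (i <= 0 || i === spec.length - 1) {
    throw new Error(`expected provider:model, got "${spec}"`);
  }
  return { provider: spec.slice(0, i), model: spec.slice(i + 1) };
}

// Returns undefined when nothing is configured, meaning "use the profile's
// own per-tier models".
function resolveWorker(env: Record<string, string | undefined>): ModelRef | undefined {
  const override = env.CAIRN_ROLE_WORKER;
  if (override) return parseModelRef(override);
  const preset = env.CAIRN_LLM_ROUTING ?? env.LLM_ROUTING;
  if (!preset) return undefined;
  const ref = WORKER_PRESETS[preset];
  if (!ref) throw new Error(`unknown routing preset: ${preset}`);
  return ref;
}
```

The reasoner is deliberately absent from the preset table: per the text above it stays on Anthropic in every preset unless `CAIRN_ROLE_REASONER` is set.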
Metrics
Every run scores itself. The numbers appear in the console (=== Metrics ===), in each run's
report.md (with a one-line meaning per metric), and in Langfuse when configured. ↑ = higher is
better, ↓ = lower is better. case_redundancy and flaky_ratio are the only "lower is better"
metrics — every other metric is higher-is-better.
Deterministic (computed from run data, no LLM):
| metric | direction | meaning |
|---|---|---|
| runs_green | ↑ higher is better | Share of generated tests that pass on validation. |
| flaky_ratio | ↓ lower is better | Share of tests classified flaky (inconsistent pass/fail). |
| verified_ratio | ↑ higher is better | Share of identified elements that resolve to exactly one element (unique locator). |
| grounding | ↑ higher is better | Share of cases whose element refs all point to real on-page elements (no hallucinated refs). |
| locator_quality | ↑ higher is better | Share of user-facing locators (getByRole/Label/Text…) vs fragile (.locator/getByTestId). |
| locator_robustness | ↑ higher is better | Weighted selector strength: role 1.0 > label/text 0.8 > test-id 0.5 > css 0. |
| technique_coverage | ↑ higher is better | Distinct test techniques used out of the 6 (ISO/IEC/IEEE 29119-4). |
| case_redundancy | ↓ lower is better | Share of cases that are near-duplicates of another (0 = all distinct). |
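
As an illustration of the two locator metrics, the sketch below scores a list of locator kinds. The weights are the ones listed in the table; taking the plain mean of those weights, returning 0 for an empty list, and the names themselves are assumptions rather than Cairn's documented formula.

```typescript
type LocatorKind = "role" | "label" | "text" | "test-id" | "css";

// Weights as listed in the table: role 1.0 > label/text 0.8 > test-id 0.5 > css 0.
const WEIGHTS: Record<LocatorKind, number> = {
  role: 1.0,
  label: 0.8,
  text: 0.8,
  "test-id": 0.5,
  css: 0,
};

// Assumption: locator_robustness is the plain mean of the per-locator weights.
function locatorRobustness(kinds: LocatorKind[]): number {
  if (kinds.length === 0) return 0;
  const total = kinds.reduce((sum, k) => sum + WEIGHTS[k], 0);
  return total / kinds.length;
}

// locator_quality: share of user-facing locators (role/label/text) among all.
function locatorQuality(kinds: LocatorKind[]): number {
  if (kinds.length === 0) return 0;
  const userFacing = kinds.filter((k) => k === "role" || k === "label" || k === "text").length;
  return userFacing / kinds.length;
}
```

For two role locators, one text locator and one test-id locator, this gives a robustness of (1 + 1 + 0.8 + 0.5) / 4 = 0.825 and a quality of 3 / 4 = 0.75.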
Judge (LLM-scored):
| metric | direction | meaning |
|---|---|---|
| test_case_quality | ↑ higher is better | Holistic quality of the cases (clarity, correctness, usefulness). |
| methodology_adherence | ↑ higher is better | How well the cases follow the testing methodology. |
| checklist_coverage | ↑ higher is better | Semantic coverage of the provided checklist by the cases. |
(The holistic Pilot verdict is separate — a pass / needs-work / fail judgment on the whole run, not a 0–1 score.)
Cost benchmark
What does a run cost on each routing preset? The table below is generated by
npm run bench: it runs cairn explore against a fixed target once per preset and reads the per-run
cost ledger (tokens + $) already written into each run's report.json (L1-01) — it re-prices nothing.
Treat the numbers as an approximate, single-run snapshot (LLM token counts vary run-to-run).
Snapshot: 2026-06-13 · commit a2906cf · profile anthropic · MAX_REPAIR=0 · target: https://demoqa.com/text-box (no session) · approximate, single-run.
| Preset | Worker | Reasoner | Tokens/run | $/run | Wall-clock/run | $/hour† |
|---|---|---|---|---|---|---|
| default | claude-haiku-4-5+claude-sonnet-4-6 | claude-opus-4-8 | 27,191 | $0.2854 | 390.6s | $2.63 |
| volume | deepseek/deepseek-chat | claude-opus-4-8 | 13,923 | $0.1209 | 58.2s | $7.48 |
| fast | llama-3.3-70b-versatile | claude-opus-4-8 | n/a — run failed: 400 Failed to call a function. Please adjust your prompt. See 'fail… | — | — | — |
† $/hour is an extrapolation, not a steady-state rate: $/run × (3600 / seconds-per-run) — the cost if runs fired back-to-back for an hour. Real throughput varies with target complexity, retries, and provider latency.
Reproduce: npm run bench -- --url https://demoqa.com/text-box --session <name>
Token counts vary run-to-run (LLM nondeterminism). OpenRouter/Groq prices are approximate and movable (ADR-0002); Anthropic prices follow the published rates.
$/run is shown as "—" when a model has no configured price (tokens are still counted).
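The † extrapolation from the footnote is simple enough to check by hand. The function name below is hypothetical; the formula and the input figures are taken from the footnote and the table rows.

```typescript
// $/hour = $/run × (3600 / seconds-per-run), as in the table footnote.
function dollarsPerHour(dollarsPerRun: number, secondsPerRun: number): number {
  if (secondsPerRun <= 0) throw new Error("secondsPerRun must be positive");
  return dollarsPerRun * (3600 / secondsPerRun);
}

console.log(dollarsPerHour(0.2854, 390.6).toFixed(2)); // "2.63" (default row)
console.log(dollarsPerHour(0.1209, 58.2).toFixed(2));  // "7.48" (volume row)
```

The cheaper-per-run volume preset shows the higher hourly figure only because its runs finish faster, so more of them fit into an hour.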
Regenerate it yourself — you need the provider keys for the presets you want measured and, ideally, a
captured session so the target is a real page rather than example.com:
npm run bench -- --url https://your-app.example.com/dashboard --session myapp --write

For reproducibility the benchmark pins LLM_PROFILE=anthropic and MAX_REPAIR=0 (both shown in the
snapshot line) instead of inheriting your .env — so default always means the same baseline and the
reasoner stays on Opus in every preset; only the worker changes. Override with
--profile <p> / --max-repair <n>. A preset whose provider key is unset is skipped and shown as
n/a, so a partial run (e.g. only ANTHROPIC_API_KEY available) still produces a useful table.
Optional: Langfuse
Langfuse is entirely optional — leave the LANGFUSE_* variables unset and the bot runs fully offline.
Everything core still works: observe / design / automate / explore, locator grounding, the LLM judge,
deterministic scorers, self-repair, and results-level learning (best cases are read from local
runs/<id>/report.json). Prompts fall back to the built-in defaults — override any of them with ./prompts/<name>.md.
Set the three LANGFUSE_* variables to additionally get: traces in the Langfuse UI, scores/datasets
recorded centrally, and versioned prompts (with production labels & A/B prompt experiments via cairn experiment).
Tracing ships as an optional add-on (0.3.3): the @langfuse/* / @opentelemetry/* packages are no
longer part of the default install — that keeps the footprint small and npm audit clean. Install them
once to enable it:
npm install @langfuse/client @langfuse/langchain @langfuse/otel @langfuse/tracing @opentelemetry/api @opentelemetry/sdk-node

If the LANGFUSE_* variables are set but the packages aren't installed, Cairn prints a one-line hint and
keeps running without tracing — it never crashes a run over telemetry.
Cloud or self-hosted — same setup. Langfuse Cloud and a self-hosted instance are configured identically: you only pass the host URL and the API keys, nothing else changes.
# Pick ONE base URL:
LANGFUSE_BASE_URL=https://cloud.langfuse.com # Langfuse Cloud (EU)
# LANGFUSE_BASE_URL=https://us.cloud.langfuse.com # Langfuse Cloud (US)
# LANGFUSE_BASE_URL=https://langfuse.your-host.tld # self-hosted
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...

Enablement is all-or-nothing: Langfuse turns on only when all three variables are set; otherwise telemetry is a no-op and the bot behaves exactly as offline.
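The all-or-nothing rule can be expressed as a single predicate. This is a sketch with a hypothetical function name; treating a blank value as unset is an assumption.

```typescript
const LANGFUSE_VARS = [
  "LANGFUSE_BASE_URL",
  "LANGFUSE_PUBLIC_KEY",
  "LANGFUSE_SECRET_KEY",
] as const;

// True only when all three variables are present and non-blank
// (assumption: a blank value counts as unset).
function langfuseEnabled(env: Record<string, string | undefined>): boolean {
  return LANGFUSE_VARS.every((name) => (env[name] ?? "").trim() !== "");
}
```

A partial configuration, for example keys without a base URL, therefore behaves the same as no configuration at all.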
Library API
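import { readFileSync } from "node:fs";
import { runDesign, runAutomate, runExploration, loadConfig } from "@plune-ai/cairn";

const config = loadConfig(process.env);
const url = "https://app.example.com/page"; // placeholder: the page to design cases for
const checklistText = readFileSync("plan.md", "utf8"); // placeholder: checklist that steers the design
const result = await runDesign({ url, config, sessionName: "myapp", checklistText });
// result.testCases, result.testCaseFiles, result.scores

Development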
npm run build # tsc
npm test # vitest (unit + integration; LLM/browser are mocked in unit)
npm run test:coverage
npm run lint

Documentation
- Getting started: step-by-step onboarding (session → design → review → promote → automate → validate), written for people new to the tool.
- Architecture overview: how the agent works inside (the plain async pipeline, locator grounding, self-improvement).
- Architecture Decision Records: why it's built this way (0001–0013, incl. the interactive TUI, the `@playwright/test` output format, the Lex-Bot → Cairn rename, the Apache-2.0 relicense, and the drop of LangGraph in 0.4.0).
License
Apache-2.0 (relicensed from GPL-3.0 in 0.3.0 — see docs/adr/0012). Methodology prompts ported from AZANIR/qa-skills (see docs/adr/0008).
