@k71n/agent-probe
v0.2.0
LLM-in-the-loop multi-service debugger - MCP server with leave-no-trace probe instrumentation
agent-probe
Evidence-based, leave-no-trace debugging for AI coding agents — an MCP server that lets your agent place temporary probes in running code, capture runtime evidence across services, name the root cause, and verifiably clean up after itself.
agent-probe is the local-dev counterpart to production AI debugging. Your agent (Claude Code, Cursor, or any MCP host) drives the whole loop: instrument → reproduce → analyze → remove → verify. Evidence lives in a per-session local SQLite file that is destroyed when the session ends. No SDK in your app, no accounts, no dashboards, and no network egress, ever.
Why agent-probe?
The hardest bugs are invisible from the code alone: a write succeeds, the dependent read shows nothing, and no error appears anywhere. Static analysis and grepping can't see runtime state — and console.log debugging by an agent leaves litter in your tree and noise in your terminal.
- Probes are one-liners, not a library — a self-contained, fire-and-forget HTTP POST to localhost, wrapped in marker comments. Your app gains zero dependencies.
- Probing never perturbs the app — the server replies 202 before validating anything, applies no backpressure, and a dead server changes nothing (proven by an explicit behavior-equivalence test).
- Evidence is structured, not scrollback — timelines across services, bounded queries, and a first-class diff between a failing run and a working run that surfaces the discriminating difference directly.
- Cleanup is verified, never assumed — the server scans your workspace itself and refuses to close the session while any probe marker remains. git diff ends empty.
- Everything stays on your machine — localhost-only ingestion (bound at the kernel level), ephemeral per-session storage, zero telemetry.
- Language-agnostic by contract — anything that can POST JSON conforms: JS/TS, Python, shell/curl, Go, SQL comments for markers, and more.
Features
The debugging loop
- Goal-scoped sessions — start_session(goal, workspace_root) opens an isolated investigation with its own evidence store and ingestion port
- Runs as first-class boundaries — tag reproductions ("buggy", "clean"), and events are attributed to the run they arrived in; out-of-run events are kept as "unattributed", never lost
- Cross-service timelines — time-ordered events from every service in one view, with honest seq_tied flags when ordering rests on arrival rather than causality
- Bounded span queries — filter by run, probe, service, or time range; results are capped server-side with truncated + true total, so a context window holds hypotheses, not haystacks
- Run diffing (the wedge) — structured presence/absence, ordering inversions, and payload deltas between two runs
Non-perturbation guarantees
- 202-before-validate ingestion — a probe never waits on parsing, validation, or storage
- Fire-and-forget idiom — no await, no retries, no queues; errors are swallowed at the probe (.catch(() => {}))
- Caps instead of pressure — oversized payloads are truncated (event kept), events past the per-session cap are dropped with a warning; the app never feels any of it
- Warnings surface on every tool response — rejected events, truncations, and drops are reported to the agent; silence about dropped evidence would mislead the analysis
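The 202-before-validate idea can be illustrated with a small in-memory sketch. This is not the server's actual code: `IngestBuffer`, `accept`, and `flush` are hypothetical names, and the only check shown is "required fields present". It assumes an HTTP handler would call `accept` and reply with its return value, while `flush` runs later.

```typescript
// Hypothetical sketch: acknowledge first, validate later.
const REQUIRED = ["session_id", "probe_id", "service", "file", "line", "ts_probe", "payload"] as const;

class IngestBuffer {
  private pending: string[] = [];
  readonly stored: Record<string, unknown>[] = [];
  readonly warnings: string[] = [];

  // Called on the request path: no parsing, no validation, always 202.
  accept(rawBody: string): number {
    this.pending.push(rawBody);
    return 202;
  }

  // Called off the request path: validate the whole batch, keep what conforms,
  // and record a warning for everything that does not.
  flush(): void {
    const batch = this.pending;
    this.pending = [];
    for (const raw of batch) {
      let parsed: unknown;
      try {
        parsed = JSON.parse(raw);
      } catch {
        this.warnings.push("rejected: body is not valid JSON");
        continue;
      }
      if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
        this.warnings.push("rejected: body is not a JSON object");
        continue;
      }
      const event = parsed as Record<string, unknown>;
      const missing = REQUIRED.filter((key) => !(key in event));
      if (missing.length > 0) {
        this.warnings.push("rejected: missing " + missing.join(", "));
        continue;
      }
      this.stored.push(event);
    }
  }
}
```

The point of the split is that a malformed probe costs the application nothing: the rejection shows up as a warning for the agent, never as a slow or failed request.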
Leave no trace
- Marker convention — every probe is wrapped in own-line comment pairs (-begin p1 … -end p1, each prefixed with the agent-probe token) that survive formatters and are mechanically removable
- Server-side cleanup verification — verify_cleanup scans the workspace (git-aware; node_modules/ and .git/ always skipped) and returns exact file+line locations
- The close gate — end_session is refused with MARKERS_REMAIN while markers exist; the override is explicit, user-owned, and always reported in the result
- Destructive close — the session's SQLite file and lock are deleted; orphans from crashes are surfaced and disposed explicitly, never silently resumed
Agent guidance, token-cheap
- The workflow is an Agent Skill — conventions, the wire contract, probe strategy, and the removal ritual ship as plugin/skills/agent-probe/SKILL.md: ~100 idle tokens of frontmatter, the full six-phase workflow loaded only when the user invokes it
- Explicitly invoked, never auto-triggered — the skill runs as a command; it activates on the user's say-so, not on conversational drift
- Host-agnostic by construction — plain SKILL.md (open Agent Skills format), bare tool names, no host-specific features; the same file installs into Claude Code, Cursor, Codex, and any skill-capable host
Quick start
Install
agent-probe is a plugin: one Agent Skill (the debugging workflow, invoked as a command) plus the MCP server (the 9 tools, run via npx). Requires Node >= 22.13 (the server tells you on stderr if yours is older — it uses the built-in node:sqlite).
Claude Code — the repo is a plugin marketplace; skill and server wire up together:
claude plugin marketplace add k71n/agent-probe
claude plugin install agent-probe@agent-probe
Cursor
npx skills add k71n/agent-probe -a cursor # the skill
Then install the MCP server with one click (deeplink) — or add it to .cursor/mcp.json manually (plugin/README.md).
Codex
npx skills add k71n/agent-probe -a codex # the skill
codex mcp add agent-probe -- npx -y @k71n/agent-probe@latest # the server
Any other skill-capable host — npx skills add k71n/agent-probe (the skills CLI auto-detects 70+ agents), then add npx -y @k71n/agent-probe@latest to your host's MCP config. Per-host detail: plugin/README.md.
Bare MCP (no skill support) still works — claude mcp add agent-probe -- npx -y @k71n/agent-probe@latest or the equivalent — but the agent won't know the probe conventions: the skill is where the workflow lives.
Note that npx caches versions: use @latest as above, or pin a version. The npm package is scoped (@k71n/agent-probe); the plugin, tool, bin, and probe markers are plain agent-probe.
Try it: the golden demo
The repo ships a tiny two-layer app staging the classic silent failure — the form saves, but the list never updates. No errors anywhere.
node examples/golden-demo/api/server.mjs
# open http://localhost:4280 — save an entry, watch it never appear
Then tell your agent:
The form saves but the list never updates — find out why.
What happens next:
- The skill activates and the agent starts a goal-scoped session, stating its hypothesis out loud.
- It inserts marker-wrapped probe lines on both sides of the data boundary (write path + read path).
- You reproduce the bug twice — once failing, once working — while the agent captures both runs.
- The timeline names the root cause: the entry was written with categoryId: null while the list filtered on a category — a silent field-name mismatch between frontend and backend.
- The agent removes every probe, the server verifies the workspace is clean, and the session is destroyed with all its evidence. git diff is empty.
Tool reference
The surface is deliberately frozen at 9 tools.
| Tool | Arguments | Description |
|------|-----------|-------------|
| start_session | goal, workspace_root, stale? | Open a Debug Session scoped to a stated goal. Returns session_id, the ingestion port, and the echoed workspace_root. Refused with STALE_SESSION_EXISTS if orphans exist (resolve with stale: "dispose" or "keep"), or INSTANCE_CONFLICT if another live process holds the workspace. |
| start_run | tag? | Arm a Run while the user reproduces the flow. With host elicitation support, waits for the user's confirmation and returns the closed run; otherwise returns status: "open" and end_run closes it. |
| end_run | tag? | Close the open Run (a tag here overwrites one set at start). |
| list_runs | — | Every run with tag, boundaries, and event count. |
| get_timeline | run, limit? | Time-ordered events across all services for a run (or "unattributed"). Same-millisecond events are flagged seq_tied. |
| get_span | run?, probe?, service?, from?, to?, limit? | Bounded slice by any filter combination. |
| diff_runs | a, b | Structured differences between two runs: presence/absence, ordering changes, payload deltas. |
| verify_cleanup | — | Server-side scan of the workspace for residual probe markers; returns exact locations and orphan (unpaired) markers separately. |
| end_session | override? | Destroy the session and all stored evidence. Refused with MARKERS_REMAIN while markers remain; override: true is the user's call and is always reported in the result. |
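To make the diff_runs row concrete, here is a minimal sketch of a presence/absence and payload comparison between two runs. It is an illustration under assumptions, not the tool's implementation: `RunEvent`, `RunDiff`, and `diffRuns` are hypothetical names, only the first event per probe id is compared, and ordering changes are left out entirely.

```typescript
// Hypothetical shapes; the real tool's result format is not specified here.
interface RunEvent { probe_id: string; payload: unknown; }
interface PayloadDelta { probe_id: string; a: unknown; b: unknown; }
interface RunDiff { onlyInA: string[]; onlyInB: string[]; payloadDeltas: PayloadDelta[]; }

// Index the first event seen for each probe id.
function firstByProbe(events: RunEvent[]): { [id: string]: RunEvent } {
  const index: { [id: string]: RunEvent } = Object.create(null);
  for (const e of events) {
    if (!(e.probe_id in index)) index[e.probe_id] = e;
  }
  return index;
}

function diffRuns(a: RunEvent[], b: RunEvent[]): RunDiff {
  const ia = firstByProbe(a);
  const ib = firstByProbe(b);
  const diff: RunDiff = { onlyInA: [], onlyInB: [], payloadDeltas: [] };
  for (const id of Object.keys(ia)) {
    const ea = ia[id];
    const eb = ib[id];
    if (ea === undefined) continue;
    if (eb === undefined) { diff.onlyInA.push(id); continue; }
    // Serialized comparison is key-order sensitive; good enough for a sketch.
    if (JSON.stringify(ea.payload) !== JSON.stringify(eb.payload)) {
      diff.payloadDeltas.push({ probe_id: id, a: ea.payload, b: eb.payload });
    }
  }
  for (const id of Object.keys(ib)) {
    if (ia[id] === undefined) diff.onlyInB.push(id);
  }
  return diff;
}
```

Fed a failing and a working run, a comparison like this would surface a payload delta on the probe where the two runs first disagree.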
Response envelope
Every successful response is the same shape — even single-item ones:
{ "data": { ... }, "truncated": false, "total": 1, "warnings": ["..."] }
truncated/total make result capping honest; warnings (when present) carry ingestion rejections, payload truncations, and drops.
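On the consuming side, the envelope can be modeled with one generic type. The sketch below is an assumption about how a client might read it: `Envelope` and `evidenceCaveats` are hypothetical names, and the shape of `data` is left generic because it varies by tool.

```typescript
// Hypothetical client-side model of the envelope shown above.
interface Envelope<T> {
  data: T;
  truncated: boolean;
  total: number;
  warnings?: string[]; // omitted when there is nothing to report
}

// Collect everything that should temper conclusions drawn from `data`.
function evidenceCaveats<T>(res: Envelope<T>): string[] {
  const notes: string[] = [];
  if (res.truncated) {
    notes.push("result capped; " + res.total + " items matched in total");
  }
  for (const w of res.warnings ?? []) {
    notes.push("warning: " + w);
  }
  return notes;
}
```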
Errors
Every tool error is { code, message, hint } with code drawn from a closed enum:
NO_ACTIVE_SESSION · SESSION_ALREADY_ACTIVE · STALE_SESSION_EXISTS · INSTANCE_CONFLICT · MARKERS_REMAIN · RUN_NOT_FOUND · NO_ACTIVE_RUN · INVALID_STATE
The hint always says what to do next — errors are written for agents.
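Because the code list is closed, a client can narrow errors with a type guard. This is a sketch: the eight codes are copied from the list above, while `ToolError` and `isToolError` are hypothetical names rather than exports of the package.

```typescript
const ERROR_CODES = [
  "NO_ACTIVE_SESSION", "SESSION_ALREADY_ACTIVE", "STALE_SESSION_EXISTS",
  "INSTANCE_CONFLICT", "MARKERS_REMAIN", "RUN_NOT_FOUND",
  "NO_ACTIVE_RUN", "INVALID_STATE",
] as const;

type ErrorCode = (typeof ERROR_CODES)[number];

interface ToolError { code: ErrorCode; message: string; hint: string; }

// True only for objects that carry a known code plus message and hint strings.
function isToolError(value: unknown): value is ToolError {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const code = v["code"];
  return typeof code === "string"
    && (ERROR_CODES as readonly string[]).indexOf(code) !== -1
    && typeof v["message"] === "string"
    && typeof v["hint"] === "string";
}
```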
The probe contract
Probes POST JSON to http://127.0.0.1:<port>/events (the port comes from start_session). snake_case on the wire; optional fields are omitted when absent, never null.
| Field | Type | Required | Meaning |
|-------|------|----------|---------|
| session_id | string | yes | from the start_session response |
| probe_id | string | yes | matches the probe's marker id |
| service | string | yes | which service emitted it ("web", "api", …) |
| file | string | yes | source file the probe lives in |
| line | int | yes | source line |
| ts_probe | int | yes | epoch milliseconds at emission |
| payload | JSON | yes | any JSON value; keep it ≤ 64 KiB (truncated beyond) |
| trace_id | string | no | correlation headroom (W3C-aligned, free-form) |
| parent_id | string | no | correlation headroom (W3C-aligned, free-form) |
The server assigns ts_server and a monotonic seq on arrival. Unknown extra keys are ignored — strictness would break language-agnosticism. No Content-Type required.
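The field table translates directly into a type. The sketch below is illustrative only: `ProbeEvent`, `ProbeInput`, and `buildEvent` are hypothetical names, and the builder exists to show the "omitted when absent, never null" rule for the optional fields.

```typescript
// Hypothetical types mirroring the field table above.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface ProbeEvent {
  session_id: string;
  probe_id: string;
  service: string;
  file: string;
  line: number;
  ts_probe: number; // epoch milliseconds at emission
  payload: Json;
  trace_id?: string;
  parent_id?: string;
}

interface ProbeInput {
  sessionId: string;
  probeId: string;
  service: string;
  file: string;
  line: number;
  payload: Json;
  traceId?: string;
  parentId?: string;
}

// Build the snake_case wire object; optional keys are added only when set.
function buildEvent(input: ProbeInput, now: number = Date.now()): ProbeEvent {
  const event: ProbeEvent = {
    session_id: input.sessionId,
    probe_id: input.probeId,
    service: input.service,
    file: input.file,
    line: input.line,
    ts_probe: now,
    payload: input.payload,
  };
  if (input.traceId !== undefined) event.trace_id = input.traceId;
  if (input.parentId !== undefined) event.parent_id = input.parentId;
  return event;
}
```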
The idiom (JS/TS)
// ⟨token⟩-begin p1
fetch(`http://127.0.0.1:${PORT}/events`, { method: "POST", body: JSON.stringify({ session_id: SID, probe_id: "p1", service: "api", file: "list.ts", line: 42, ts_probe: Date.now(), payload: { categoryId, rowCount } }) }).catch(() => {});
// ⟨token⟩-end p1
(⟨token⟩ is literally agent-probe — spelled indirectly here so this very README never trips the cleanup scan when you debug a workspace containing it.)
Markers are own-line comment pairs (they survive Prettier/ESLint), both carrying the probe id; the comment leader adapts to the language (//, #, --, <!-- -->). Anything that can POST JSON conforms — shell:
curl -s -X POST http://127.0.0.1:$PORT/events -d "{\"session_id\":\"$SID\",\"probe_id\":\"p3\",\"service\":\"worker\",\"file\":\"job.sh\",\"line\":7,\"ts_probe\":$(date +%s%3N),\"payload\":{\"jobId\":\"$JOB\"}}" >/dev/null 2>&1 &
The %3N millisecond format requires GNU date; the date shipped with macOS does not support it, so substitute another millisecond source there.
The full conventions — probe strategy for silent write/read failures, the removal ritual, run protocol — live in the plugin's Agent Skill (plugin/skills/agent-probe/SKILL.md), loaded when the user invokes the workflow.
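Because markers are own-line pairs carrying the probe id, removal can be mechanical. The following sketch shows one way to strip marked ranges from a file's text. It assumes a marker line contains the token, then -begin or -end, then the id. `stripProbes` and `StripResult` are hypothetical names, and the marker pattern is built indirectly for the same reason the README spells the token indirectly.

```typescript
// Token assembled at runtime so this snippet never matches a cleanup scan itself.
const TOKEN = ["agent", "probe"].join("-");
const MARKER = new RegExp(`${TOKEN}-(begin|end)\\s+(\\S+)`);

interface StripResult { cleaned: string; removed: string[]; orphans: string[]; }

function stripProbes(source: string): StripResult {
  const lines = source.split("\n");
  const kept: string[] = [];
  const removed: string[] = [];
  const orphans: string[] = [];
  let i = 0;
  while (i < lines.length) {
    const line = lines[i] ?? "";
    const m = MARKER.exec(line);
    if (m === null) { kept.push(line); i++; continue; }
    const kind = m[1] ?? "";
    const id = m[2] ?? "";
    // For a begin marker, look ahead for the end marker with the same id.
    let end = -1;
    if (kind === "begin") {
      for (let j = i + 1; j < lines.length; j++) {
        const n = MARKER.exec(lines[j] ?? "");
        if (n !== null && n[1] === "end" && n[2] === id) { end = j; break; }
      }
    }
    if (end === -1) {
      // Unpaired marker: leave the line in place and report it instead of guessing.
      orphans.push(kind + " " + id);
      kept.push(line);
      i++;
      continue;
    }
    removed.push(id);
    i = end + 1; // drop begin line, probe body, and end line
  }
  return { cleaned: kept.join("\n"), removed, orphans };
}
```

Unpaired markers are deliberately not deleted: with only one half of a pair, the extent of the probe is unknown, so reporting is safer than removing.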
Limits (enforced, not asserted)
| Cap | Value | Behavior past it |
|-----|-------|------------------|
| Payload size | 64 KiB | truncated, event kept, warning emitted |
| Events per session | 100,000 | dropped with warning |
| Events per query result | 500 | clamped; truncated: true + true total |
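The query-result cap in the last row can be sketched as a clamp plus an honest count. This is an illustration, not the store's code: `capResults` is a hypothetical name, the 500 comes from the table, and treating a missing or non-positive limit as shown is an assumption.

```typescript
const MAX_RESULTS = 500; // "Events per query result" from the table above

// Clamp the requested limit, slice, and report the true match count.
function capResults<T>(
  matches: T[],
  requested?: number,
): { data: T[]; truncated: boolean; total: number } {
  const limit = Math.min(Math.max(requested ?? MAX_RESULTS, 1), MAX_RESULTS);
  const data = matches.slice(0, limit);
  return { data, truncated: data.length < matches.length, total: matches.length };
}
```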
Architecture
MCP host (Claude Code, Cursor, …)
+ the Agent Skill (the workflow)
|
stdio (JSON-RPC) your services
| (web, api, worker)
+---------v----------+ |
| MCP server | marker-wrapped
| 9 tools, nothing | one-line probes
| else | |
+---------+----------+ POST /events (fire-and-forget)
| |
+-------------+--------------+ +-------v--------+
| SessionManager | | Ingest listener |
| state machine, lockfiles, |<--+ 127.0.0.1:ephem |
| orphan recovery, close | | 202-before- |
| gate (MARKERS_REMAIN) | | validate |
+------+--------------+------+ +----------------+
| |
+---------v---+ +------v----------+
| Evidence | | Cleanup verify |
| store + query| | (workspace scan,|
| + run diff | | marker pairing)|
| (SQLite, | +-----------------+
| per-session,|
| destroyed |
| on close) |
+-------------+
How a session works
- start_session validates the workspace, takes a PID lockfile (one active session per workspace), creates a per-session SQLite file, and returns the ingestion port.
- The agent instruments — it inserts whole probe lines into your code, baking the session id and port in as literals. The server never edits your files.
- Ingestion replies 202 immediately, then parses, validates against the contract, applies caps, and flushes to SQLite once per event-loop tick in a single transaction. Events arriving while a run is open are attributed to it.
- Analysis runs entirely over the bounded query tools — one shared ordering comparator (ts_probe, then seq) backs the timeline, spans, and the run diff.
- Cleanup is a trust split: the server reports exact marker locations (fresh scan every time), the agent deletes the marked ranges, the server re-verifies. end_session only succeeds clean — then closes and unlinks everything.
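The shared ordering rule (ts_probe, then seq) is small enough to sketch in full. The version below is an assumption about how such a comparator and the seq_tied flag could fit together: `StoredEvent`, `compareEvents`, and `buildTimeline` are hypothetical names, and an event is flagged when a neighbor in the sorted list shares its ts_probe.

```typescript
interface StoredEvent { probe_id: string; service: string; ts_probe: number; seq: number; }

// Primary key: the probe's own clock. Tie-break: server arrival order.
function compareEvents(a: StoredEvent, b: StoredEvent): number {
  return a.ts_probe - b.ts_probe || a.seq - b.seq;
}

// Sort a copy and mark events whose position rests on seq alone.
function buildTimeline(events: StoredEvent[]): (StoredEvent & { seq_tied: boolean })[] {
  const sorted = events.slice().sort(compareEvents);
  return sorted.map((e, i) => {
    const prev = sorted[i - 1];
    const next = sorted[i + 1];
    const tied =
      (prev !== undefined && prev.ts_probe === e.ts_probe) ||
      (next !== undefined && next.ts_probe === e.ts_probe);
    return { ...e, seq_tied: tied };
  });
}
```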
Crash safety
There is no state outside the session's SQLite file and its lockfile. Sudden death leaves only an orphan .db; the next start_session surfaces it (STALE_SESSION_EXISTS, with the goal of the lost investigation in the hint) and the user decides: dispose or keep. Orphans are never silently resumed.
State location
Per-user, per-platform (override with XDG_STATE_HOME):
| Platform | Path |
|----------|------|
| Linux | ~/.local/state/agent-probe/ |
| macOS | ~/Library/Application Support/agent-probe/ |
| Windows | %LOCALAPPDATA%/agent-probe/ |
Inside: sessions/<session-id>.db (deleted on close) and locks/ (PID lockfiles, reaped when stale).
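The path table can be read as a small resolution function. The sketch below makes two assumptions that the table does not confirm: that XDG_STATE_HOME, when set, takes precedence on every platform, and that a missing LOCALAPPDATA falls back to AppData/Local under the home directory. `resolveStateDir` is a hypothetical name, and platform values follow Node's process.platform strings.

```typescript
// Pure function: pass in the platform string, environment, and home directory.
function resolveStateDir(
  platform: string,
  env: Record<string, string | undefined>,
  home: string,
): string {
  const xdg = env["XDG_STATE_HOME"];
  if (xdg) return xdg + "/agent-probe"; // assumed to win on all platforms
  if (platform === "darwin") return home + "/Library/Application Support/agent-probe";
  if (platform === "win32") {
    const local = env["LOCALAPPDATA"] ?? home + "/AppData/Local"; // fallback is an assumption
    return local + "/agent-probe";
  }
  return home + "/.local/state/agent-probe";
}
```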
Development
Setup
git clone <this repo>
cd agent-probe
npm ci
Scripts
| Command | What it does |
|---------|--------------|
| npm run dev | Run the server from source (tsx, stdio) |
| npm test | Full vitest suite (unit + integration) |
| npm run test:watch | Watch mode |
| npm run typecheck | tsc --noEmit (strict, noUncheckedIndexedAccess) |
| npm run lint | ESLint over the solution |
| npm run build | tsc (the plugin ships verbatim from the repo — no injection step) |
Repository layout
src/
index.ts entry point (Node version gate, then stdio server)
server.ts MCP server wiring (pure capability — no guidance content)
tools.ts the 9 tool registrations (the ONE place tools exist)
constants.ts single source for names, limits, error codes
logger.ts stderr-only logging (stdout belongs to MCP framing)
session/ session state machine, run boundaries, lockfiles, state dirs
ingest/ POST /events listener, wire contract, caps, warnings
evidence/ per-session SQLite store, timeline/span queries, run diff
cleanup/ marker scanning, pairing, removal-range derivation
integration/ cross-module flow tests + the golden-demo fixture
plugin/
.claude-plugin/ Claude Code plugin manifest
.mcp.json MCP server config (npx the published package)
skills/agent-probe/ the Agent Skill — the debugging workflow itself
examples/
golden-demo/ runnable demo app with the staged silent-failure bug
.claude-plugin/ marketplace manifest (the repo installs as a plugin)
.github/workflows/ ci.yml (tests, greps, plugin drift gates), release.yml (npm publish)
Design rules the code enforces
- stdout is sacred — it carries MCP framing only; all diagnostics go to stderr. Lint + CI grep enforce it.
- No egress — the only networking is the localhost ingestion listener; net-client imports are lint-banned, URLs are CI-grepped.
- Frozen runtime deps — @modelcontextprotocol/sdk + zod, nothing else.
- Single-sourced names — the package name and marker token live in src/constants.ts; the plugin ships verbatim literals, and CI greps chain them back to the constants (including the self-scan invariant: no plugin file may match the cleanup marker pattern).
- Tools never touch SQL — strict layering: tools → SessionManager → EvidenceStore.
See CONTRIBUTING.md for the full list and the PR process.
Releases
Pushing a v* tag runs release.yml: typecheck → tests → lint → build → pack-install smoke (tarball contents + the installed bin must boot) → npm publish via OIDC trusted publishing (no tokens, provenance attached automatically).
3-minute demo
Recording coming with launch.
Known limitations
Honest ones, each with its mitigation:
- Timing-sensitive bugs may shift under instrumentation. Behavior equivalence is proven for application-visible outputs (see src/integration/behavior-equivalence.test.ts), not microsecond timing. If your bug is a sub-millisecond race, probes can move it.
- Git hygiene: don't commit probed code mid-session. Markers are plain greppable comments and verify_cleanup is the gate — but nothing stops a git commit while probes are in place. Finish the loop before committing.
- Container/WSL2 clock skew can disorder cross-service timelines. ts_probe is each app's own clock. Keep services on one clock domain, or read seq_tied flags honestly — arrival order is not causality.
- workspace_root is agent-supplied and trusted (v1). The cleanup scan verifies everything within it; it does not verify that it is your project root. Glance at the echoed path when the session starts.
- One active session per workspace (v1). A second session against the same workspace is refused while the first holds the lock; stale sessions from crashes are surfaced and disposed explicitly.
- Non-git workspaces over-scan. Without git's ignore rules the cleanup scan walks everything under workspace_root, so stale markers in build output may surface. They're real markers — delete the build artifacts or rebuild.
Contributing
Contributions are welcome — see CONTRIBUTING.md for setup, the project's invariants, and the PR process. All participants are expected to follow the Code of Conduct.
Found a security issue? Please report it privately — see SECURITY.md.
