@k71n/agent-probe

v0.2.0

Published

12 days ago

LLM-in-the-loop multi-service debugger - MCP server with leave-no-trace probe instrumentation


k71n

agent-probe

Evidence-based, leave-no-trace debugging for AI coding agents — an MCP server that lets your agent place temporary probes in running code, capture runtime evidence across services, name the root cause, and verifiably clean up after itself.

agent-probe is the local-dev counterpart to production AI debugging. Your agent (Claude Code, Cursor, or any MCP host) drives the whole loop: instrument → reproduce → analyze → remove → verify. Evidence lives in a per-session local SQLite file that is destroyed when the session ends. No SDK in your app, no accounts, no dashboards, and no network egress, ever.

Why agent-probe?

The hardest bugs are invisible from the code alone: a write succeeds, the dependent read shows nothing, and no error appears anywhere. Static analysis and grepping can't see runtime state — and console.log debugging by an agent leaves litter in your tree and noise in your terminal.

Probes are one-liners, not a library — a self-contained, fire-and-forget HTTP POST to localhost, wrapped in marker comments. Your app gains zero dependencies.
Probing never perturbs the app — the server replies 202 before validating anything, applies no backpressure, and a dead server changes nothing (proven by an explicit behavior-equivalence test).
Evidence is structured, not scrollback — timelines across services, bounded queries, and a first-class diff between a failing run and a working run that surfaces the discriminating difference directly.
Cleanup is verified, never assumed — the server scans your workspace itself and refuses to close the session while any probe marker remains. git diff ends empty.
Everything stays on your machine — localhost-only ingestion (bound at the kernel level), ephemeral per-session storage, zero telemetry.
Language-agnostic by contract — anything that can POST JSON conforms: JS/TS, Python, shell/curl, Go, SQL comments for markers, and more.

Features

The debugging loop

Goal-scoped sessions — start_session(goal, workspace_root) opens an isolated investigation with its own evidence store and ingestion port
Runs as first-class boundaries — tag reproductions ("buggy", "clean"), and events are attributed to the run they arrived in; out-of-run events are kept as "unattributed", never lost
Cross-service timelines — time-ordered events from every service in one view, with honest seq_tied flags when ordering rests on arrival rather than causality
Bounded span queries — filter by run, probe, service, or time range; results are capped server-side with truncated + true total, so a context window holds hypotheses, not haystacks
Run diffing (the wedge) — structured presence/absence, ordering inversions, and payload deltas between two runs
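
To make the presence/absence part of a run diff concrete, here is a minimal TypeScript sketch. It is not the server's implementation: `RunEvent` and `diffPresence` are hypothetical names used only for illustration, and the sketch covers only which probes fired in one run but not the other (no ordering or payload comparison).

```typescript
// Hypothetical event shape for illustration; only probe_id matters here.
interface RunEvent {
  probe_id: string;
  seq: number;
}

interface PresenceDiff {
  onlyInA: string[]; // probes that fired in run A but never in run B
  onlyInB: string[]; // probes that fired in run B but never in run A
}

// Compare two runs by the set of probe ids that emitted at least one event.
function diffPresence(a: RunEvent[], b: RunEvent[]): PresenceDiff {
  const inA = new Set(a.map((e) => e.probe_id));
  const inB = new Set(b.map((e) => e.probe_id));
  return {
    onlyInA: Array.from(inA).filter((id) => !inB.has(id)).sort(),
    onlyInB: Array.from(inB).filter((id) => !inA.has(id)).sort(),
  };
}

// Made-up data: the read-path probe p3 never fires in the failing run.
const buggy: RunEvent[] = [
  { probe_id: "p1", seq: 1 },
  { probe_id: "p2", seq: 2 },
];
const clean: RunEvent[] = [
  { probe_id: "p1", seq: 1 },
  { probe_id: "p2", seq: 2 },
  { probe_id: "p3", seq: 3 },
];
const presence = diffPresence(buggy, clean);
console.log(presence); // { onlyInA: [], onlyInB: [ 'p3' ] }
```

A probe that appears in only one run is often the quickest pointer to where the failing path diverges.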

Non-perturbation guarantees

202-before-validate ingestion — a probe never waits on parsing, validation, or storage
Fire-and-forget idiom — no await, no retries, no queues; errors are swallowed at the probe (.catch(() => {}))
Caps instead of pressure — oversized payloads are truncated (event kept), events past the per-session cap are dropped with a warning; the app never feels any of it
Warnings surface on every tool response — rejected events, truncations, and drops are reported to the agent; silence about dropped evidence would mislead the analysis

Leave no trace

Marker convention — every probe is wrapped in own-line comment pairs (-begin p1 … -end p1, each prefixed with the agent-probe token) that survive formatters and are mechanically removable
Server-side cleanup verification — verify_cleanup scans the workspace (git-aware; node_modules/ and .git/ always skipped) and returns exact file+line locations
The close gate — end_session is refused with MARKERS_REMAIN while markers exist; the override is explicit, user-owned, and always reported in the result
Destructive close — the session's SQLite file and lock are deleted; orphans from crashes are surfaced and disposed explicitly, never silently resumed
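
The marker convention and the pairing check can be sketched roughly as follows. This is an illustration under stated assumptions, not the server's scanner: the exact marker grammar (comment leader, token, `-begin`/`-end`, probe id, all on one line) is inferred from the description above, and `scanMarkers`, `MarkerHit`, and `ScanResult` are hypothetical names. The token is assembled indirectly so the snippet itself would not match a cleanup scan.

```typescript
// Assemble the marker token indirectly so this snippet never matches a scan.
const TOKEN = ["agent", "probe"].join("-");

interface MarkerHit {
  kind: "begin" | "end";
  id: string;
  line: number; // 1-based line number
}

interface ScanResult {
  paired: { id: string; beginLine: number; endLine: number }[];
  orphans: MarkerHit[]; // begin without end, or end without begin
}

// Assumed grammar: "<comment leader> <token>-begin <id>" alone on a line.
function scanMarkers(source: string): ScanResult {
  const pattern = new RegExp(`^\\s*(?://|#|--)\\s*${TOKEN}-(begin|end)\\s+(\\S+)\\s*$`);
  const open = new Map<string, MarkerHit>();
  const result: ScanResult = { paired: [], orphans: [] };

  source.split(/\r?\n/).forEach((text, index) => {
    const match = pattern.exec(text);
    if (match === null) return;
    const kind = match[1] === "begin" ? "begin" : "end";
    const id = match[2] ?? "";
    const hit: MarkerHit = { kind, id, line: index + 1 };

    if (kind === "begin") {
      const previous = open.get(id);
      if (previous !== undefined) result.orphans.push(previous); // duplicate begin
      open.set(id, hit);
      return;
    }
    const begin = open.get(id);
    if (begin === undefined) {
      result.orphans.push(hit); // end with no matching begin
    } else {
      result.paired.push({ id, beginLine: begin.line, endLine: hit.line });
      open.delete(id);
    }
  });

  open.forEach((hit) => result.orphans.push(hit)); // begins that never closed
  return result;
}

const sample = [
  "const rows = load();",
  `// ${TOKEN}-begin p1`,
  "probe();",
  `// ${TOKEN}-end p1`,
  `# ${TOKEN}-begin p2`,
].join("\n");
const scan = scanMarkers(sample);
```

Here `scan.paired` holds p1 spanning lines 2 to 4, and `scan.orphans` holds the unclosed p2 marker on line 5. A close gate in this style would refuse while either list is non-empty.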

Agent guidance, token-cheap

The workflow is an Agent Skill — conventions, the wire contract, probe strategy, and the removal ritual ship as plugin/skills/agent-probe/SKILL.md: ~100 idle tokens of frontmatter, the full six-phase workflow loaded only when the user invokes it
Explicitly invoked, never auto-triggered — the skill runs as a command; it activates on the user's say-so, not on conversational drift
Host-agnostic by construction — plain SKILL.md (open Agent Skills format), bare tool names, no host-specific features; the same file installs into Claude Code, Cursor, Codex, and any skill-capable host

Quick start

Install

agent-probe is a plugin: one Agent Skill (the debugging workflow, invoked as a command) plus the MCP server (the 9 tools, run via npx). It requires Node >= 22.13 because the server uses the built-in node:sqlite module; on an older Node, the server reports the problem on stderr.

Claude Code — the repo is a plugin marketplace; skill and server wire up together:

claude plugin marketplace add k71n/agent-probe
claude plugin install agent-probe@agent-probe

Cursor

npx skills add k71n/agent-probe -a cursor                 # the skill

then install the MCP server with one click (deeplink) — or add it to .cursor/mcp.json manually (plugin/README.md).

Codex

npx skills add k71n/agent-probe -a codex                  # the skill
codex mcp add agent-probe -- npx -y @k71n/agent-probe@latest   # the server

Any other skill-capable host — npx skills add k71n/agent-probe (the skills CLI auto-detects 70+ agents), then add npx -y @k71n/agent-probe@latest to your host's MCP config. Per-host detail: plugin/README.md.

Bare MCP (no skill support) still works — claude mcp add agent-probe -- npx -y @k71n/agent-probe@latest or the equivalent — but the agent won't know the probe conventions: the skill is where the workflow lives.

Note that npx caches versions: use @latest as above, or pin a version. The npm package is scoped (@k71n/agent-probe); the plugin, tool, bin, and probe markers are plain agent-probe.

Try it: the golden demo

The repo ships a tiny two-layer app staging the classic silent failure — the form saves, but the list never updates. No errors anywhere.

node examples/golden-demo/api/server.mjs
# open http://localhost:4280 — save an entry, watch it never appear

Then tell your agent:

The form saves but the list never updates — find out why.

What happens next:

The skill activates and the agent starts a goal-scoped session, stating its hypothesis out loud.
It inserts marker-wrapped probe lines on both sides of the data boundary (write path + read path).
You reproduce the bug twice — once failing, once working — while the agent captures both runs.
The timeline names the root cause: the entry was written with categoryId: null while the list filtered on a category — a silent field-name mismatch between frontend and backend.
The agent removes every probe, the server verifies the workspace is clean, and the session is destroyed with all its evidence. git diff is empty.

Tool reference

The surface is deliberately frozen at 9 tools.

| Tool | Arguments | Description |
|------|-----------|-------------|
| start_session | goal, workspace_root, stale? | Open a Debug Session scoped to a stated goal. Returns session_id, the ingestion port, and the echoed workspace_root. Refused with STALE_SESSION_EXISTS if orphans exist (resolve with stale: "dispose" or "keep"), or INSTANCE_CONFLICT if another live process holds the workspace. |
| start_run | tag? | Arm a Run while the user reproduces the flow. With host elicitation support, waits for the user's confirmation and returns the closed run; otherwise returns status: "open" and end_run closes it. |
| end_run | tag? | Close the open Run (a tag here overwrites one set at start). |
| list_runs | (none) | Every run with tag, boundaries, and event count. |
| get_timeline | run, limit? | Time-ordered events across all services for a run (or "unattributed"). Same-millisecond events are flagged seq_tied. |
| get_span | run?, probe?, service?, from?, to?, limit? | Bounded slice by any filter combination. |
| diff_runs | a, b | Structured differences between two runs: presence/absence, ordering changes, payload deltas. |
| verify_cleanup | (none) | Server-side scan of the workspace for residual probe markers; returns exact locations and orphan (unpaired) markers separately. |
| end_session | override? | Destroy the session and all stored evidence. Refused with MARKERS_REMAIN while markers remain; override: true is the user's call and is always reported in the result. |

Response envelope

Every successful response is the same shape — even single-item ones:

{ "data": { ... }, "truncated": false, "total": 1, "warnings": ["..."] }

truncated/total make result capping honest; warnings (when present) carry ingestion rejections, payload truncations, and drops.
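
As a reading aid, the envelope can be modelled as a generic TypeScript type. This is a sketch inferred from the example above, not the package's published typings: `Envelope` and `describeEnvelope` are hypothetical names, `warnings` is treated as optional because it is described as present only when there is something to report, and the warning text in the example is made up.

```typescript
// Assumed shape of a successful response, inferred from the example above.
interface Envelope<T> {
  data: T;
  truncated: boolean;
  total: number;
  warnings?: string[];
}

// Collect what a caller should notice before trusting a result as complete.
function describeEnvelope<T>(env: Envelope<T>, returned: number): string[] {
  const notes: string[] = [];
  if (env.truncated) {
    notes.push(`showing ${returned} of ${env.total}; narrow the query`);
  }
  for (const warning of env.warnings ?? []) {
    notes.push(`warning: ${warning}`);
  }
  return notes;
}

// Made-up response: one event returned out of 812, with one ingestion warning.
const response: Envelope<{ probe_id: string }[]> = {
  data: [{ probe_id: "p1" }],
  truncated: true,
  total: 812,
  warnings: ["2 events dropped"],
};
const notes = describeEnvelope(response, response.data.length);
console.log(notes);
// [ 'showing 1 of 812; narrow the query', 'warning: 2 events dropped' ]
```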

Errors

Every tool error is { code, message, hint } with code drawn from a closed enum:

NO_ACTIVE_SESSION · SESSION_ALREADY_ACTIVE · STALE_SESSION_EXISTS · INSTANCE_CONFLICT · MARKERS_REMAIN · RUN_NOT_FOUND · NO_ACTIVE_RUN · INVALID_STATE

The hint always says what to do next — errors are written for agents.
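
Because the code set is closed, a client can represent it as a union type and check unknown values against it. The following is a minimal sketch; `ToolError` and `isToolError` are hypothetical names, and only the eight codes and the `{ code, message, hint }` shape come from the text above.

```typescript
// The closed set of error codes listed above.
const ERROR_CODES = [
  "NO_ACTIVE_SESSION",
  "SESSION_ALREADY_ACTIVE",
  "STALE_SESSION_EXISTS",
  "INSTANCE_CONFLICT",
  "MARKERS_REMAIN",
  "RUN_NOT_FOUND",
  "NO_ACTIVE_RUN",
  "INVALID_STATE",
] as const;

type ErrorCode = (typeof ERROR_CODES)[number];

interface ToolError {
  code: ErrorCode;
  message: string;
  hint: string;
}

// Type guard: true only for objects with a known code and string message/hint.
function isToolError(value: unknown): value is ToolError {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const code = candidate["code"];
  return (
    typeof code === "string" &&
    (ERROR_CODES as readonly string[]).indexOf(code) !== -1 &&
    typeof candidate["message"] === "string" &&
    typeof candidate["hint"] === "string"
  );
}

// Made-up message and hint text, for illustration only.
const refused: unknown = {
  code: "MARKERS_REMAIN",
  message: "2 probe markers remain",
  hint: "remove the marked ranges, then call verify_cleanup",
};
const knownError = isToolError(refused);
const unknownCode = isToolError({ code: "NOPE", message: "x", hint: "y" });
```

With these inputs, `knownError` is true and `unknownCode` is false.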

The probe contract

Probes POST JSON to http://127.0.0.1:<port>/events (the port comes from start_session). snake_case on the wire; optional fields are omitted when absent, never null.

| Field | Type | Required | Meaning |
|-------|------|----------|---------|
| session_id | string | yes | from the start_session response |
| probe_id | string | yes | matches the probe's marker id |
| service | string | yes | which service emitted it ("web", "api", …) |
| file | string | yes | source file the probe lives in |
| line | int | yes | source line |
| ts_probe | int | yes | epoch milliseconds at emission |
| payload | JSON | yes | any JSON value; keep it ≤ 64 KiB (truncated beyond) |
| trace_id | string | no | correlation headroom (W3C-aligned, free-form) |
| parent_id | string | no | correlation headroom (W3C-aligned, free-form) |

The server assigns ts_server and a monotonic seq on arrival. Unknown extra keys are ignored — strictness would break language-agnosticism. No Content-Type required.
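
For readers who prefer types to tables, the contract can be written down as a TypeScript interface, together with a small builder that honors the "omitted when absent, never null" rule. This is a sketch: `ProbeEvent` and `buildEvent` are hypothetical names, the session id in the example is a placeholder, and real probes are meant to stay dependency-free one-liners rather than call a helper like this.

```typescript
// Field names and types follow the contract table above.
interface ProbeEvent {
  session_id: string;
  probe_id: string;
  service: string;
  file: string;
  line: number;
  ts_probe: number; // epoch milliseconds at emission
  payload: unknown; // any JSON value
  trace_id?: string;
  parent_id?: string;
}

type RequiredFields = Omit<ProbeEvent, "trace_id" | "parent_id">;

// Build the wire body; optional fields are left out entirely, never sent as null.
function buildEvent(
  required: RequiredFields,
  optional: { trace_id?: string | null; parent_id?: string | null } = {},
): ProbeEvent {
  const event: ProbeEvent = { ...required };
  const traceId = optional.trace_id;
  const parentId = optional.parent_id;
  if (traceId != null) event.trace_id = traceId;
  if (parentId != null) event.parent_id = parentId;
  return event;
}

// Placeholder values; a real session_id comes from start_session.
const body = JSON.stringify(
  buildEvent(
    {
      session_id: "example-session",
      probe_id: "p1",
      service: "api",
      file: "list.ts",
      line: 42,
      ts_probe: 1700000000000,
      payload: { rowCount: 0 },
    },
    { trace_id: null },
  ),
);
```

The resulting `body` contains the seven required keys and no `trace_id` or `parent_id` key at all.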

The idiom (JS/TS)

// ⟨token⟩-begin p1
fetch(`http://127.0.0.1:${PORT}/events`, { method: "POST", body: JSON.stringify({ session_id: SID, probe_id: "p1", service: "api", file: "list.ts", line: 42, ts_probe: Date.now(), payload: { categoryId, rowCount } }) }).catch(() => {});
// ⟨token⟩-end p1

(⟨token⟩ is literally agent-probe — spelled indirectly here so this very README never trips the cleanup scan when you debug a workspace containing it.)

Markers are own-line comment pairs (they survive Prettier/ESLint), both carrying the probe id; the comment leader adapts to the language (//, #, --). Anything that can POST JSON conforms. For example, from a shell (the %3N millisecond format requires GNU date):

curl -s -X POST http://127.0.0.1:$PORT/events -d "{\"session_id\":\"$SID\",\"probe_id\":\"p3\",\"service\":\"worker\",\"file\":\"job.sh\",\"line\":7,\"ts_probe\":$(date +%s%3N),\"payload\":{\"jobId\":\"$JOB\"}}" >/dev/null 2>&1 &

The full conventions — probe strategy for silent write/read failures, the removal ritual, run protocol — live in the plugin's Agent Skill (plugin/skills/agent-probe/SKILL.md), loaded when the user invokes the workflow.

Limits (enforced, not asserted)

| Cap | Value | Behavior past it |
|-----|-------|------------------|
| Payload size | 64 KiB | truncated, event kept, warning emitted |
| Events per session | 100,000 | dropped with warning |
| Events per query result | 500 | clamped; truncated: true + true total |
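
The per-query cap can be pictured as a clamp that still reports the true total. A minimal sketch, assuming the 500-event limit listed above; `clampResults` and `Capped` are hypothetical names, and the real server applies its cap inside the query layer rather than on an in-memory array.

```typescript
const MAX_RESULTS = 500; // per-query cap from the limits above

interface Capped<T> {
  data: T[];
  truncated: boolean;
  total: number; // the true count before clamping
}

// Honor a smaller requested limit, but never exceed the server-side cap.
function clampResults<T>(rows: T[], requested?: number): Capped<T> {
  const limit = Math.min(Math.max(requested ?? MAX_RESULTS, 0), MAX_RESULTS);
  const data = rows.slice(0, limit);
  return { data, truncated: data.length < rows.length, total: rows.length };
}

const manyRows = Array.from({ length: 1200 }, (_, i) => i);
const big = clampResults(manyRows, 10000); // asks for more than the cap allows
const small = clampResults([1, 2, 3]);
console.log(big.data.length, big.truncated, big.total); // 500 true 1200
console.log(small.data.length, small.truncated, small.total); // 3 false 3
```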

Architecture

        MCP host (Claude Code, Cursor, …)
          + the Agent Skill (the workflow)
                      |
               stdio (JSON-RPC)            your services
                      |                  (web, api, worker)
            +---------v----------+               |
            |     MCP server     |        marker-wrapped
            |  9 tools, nothing  |        one-line probes
            |  else              |               |
            +---------+----------+      POST /events (fire-and-forget)
                      |                          |
        +-------------+--------------+   +-------v---------+
        |       SessionManager       |   | Ingest listener |
        |  state machine, lockfiles, |<--+ 127.0.0.1:ephem |
        |  orphan recovery, close    |   | 202-before-     |
        |  gate (MARKERS_REMAIN)     |   | validate        |
        +------+--------------+------+   +-----------------+
               |              |
     +---------v----+  +------v----------+
     | Evidence     |  | Cleanup verify  |
     | store + query|  | (workspace scan,|
     | + run diff   |  |  marker pairing)|
     | (SQLite,     |  +-----------------+
     |  per-session,|
     |  destroyed   |
     |  on close)   |
     +--------------+

How a session works

start_session validates the workspace, takes a PID lockfile (one active session per workspace), creates a per-session SQLite file, and returns the ingestion port.
The agent instruments — it inserts whole probe lines into your code, baking the session id and port in as literals. The server never edits your files.
Ingestion replies 202 immediately, then parses, validates against the contract, applies caps, and flushes to SQLite once per event-loop tick in a single transaction. Events arriving while a run is open are attributed to it.
Analysis runs entirely over the bounded query tools — one shared ordering comparator (ts_probe, then seq) backs the timeline, spans, and the run diff.
Cleanup is a trust split: the server reports exact marker locations (fresh scan every time), the agent deletes the marked ranges, the server re-verifies. end_session only succeeds clean — then closes and unlinks everything.
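
The shared ordering comparator mentioned in the analysis step can be sketched as follows. This is an illustration, not the package's code: `StoredEvent`, `compareEvents`, and `buildTimeline` are hypothetical names, and the rule used for `seq_tied` (an event shares its `ts_probe` with a neighbor, so its position rests on arrival order) is an assumption based on the description of same-millisecond events.

```typescript
interface StoredEvent {
  probe_id: string;
  ts_probe: number; // the emitting app's own clock, epoch milliseconds
  seq: number; // monotonic arrival sequence assigned by the server
}

// Order by the probe's timestamp first, then by arrival sequence.
function compareEvents(a: StoredEvent, b: StoredEvent): number {
  return a.ts_probe - b.ts_probe || a.seq - b.seq;
}

// Sort a copy and flag events whose relative order depends on seq alone.
function buildTimeline(events: StoredEvent[]): (StoredEvent & { seq_tied: boolean })[] {
  const sorted = events.slice().sort(compareEvents);
  return sorted.map((event, i) => {
    const prev = sorted[i - 1];
    const next = sorted[i + 1];
    const tied =
      (prev !== undefined && prev.ts_probe === event.ts_probe) ||
      (next !== undefined && next.ts_probe === event.ts_probe);
    return { ...event, seq_tied: tied };
  });
}

// Made-up events: p1 and p2 share a millisecond, p3 arrives later.
const timeline = buildTimeline([
  { probe_id: "p2", ts_probe: 1000, seq: 2 },
  { probe_id: "p1", ts_probe: 1000, seq: 1 },
  { probe_id: "p3", ts_probe: 1005, seq: 3 },
]);
console.log(timeline.map((e) => `${e.probe_id}:${e.seq_tied}`).join(" "));
// p1:true p2:true p3:false
```

Using one comparator for the timeline, spans, and the diff keeps all three views in agreement about what "before" means.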

Crash safety

There is no state outside the session's SQLite file and its lockfile. Sudden death leaves only an orphan .db (plus a stale PID lockfile, which is reaped); the next start_session surfaces it (STALE_SESSION_EXISTS, with the goal of the lost investigation in the hint) and the user decides: dispose or keep. Orphans are never silently resumed.

State location

Per-user, per-platform (override with XDG_STATE_HOME):

| Platform | Path |
|----------|------|
| Linux | ~/.local/state/agent-probe/ |
| macOS | ~/Library/Application Support/agent-probe/ |
| Windows | %LOCALAPPDATA%/agent-probe/ |

Inside: sessions/<session-id>.db (deleted on close) and locks/ (PID lockfiles, reaped when stale).
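
The path table can be read as a small resolution function. The sketch below is hypothetical and rests on assumptions the README does not spell out: that XDG_STATE_HOME, when set, takes precedence on every platform and that the directory name is appended to it, and that Windows falls back to the conventional AppData/Local location when LOCALAPPDATA is unset. `stateDir` and `Platform` are illustrative names.

```typescript
type Platform = "linux" | "darwin" | "win32";

interface StateEnv {
  XDG_STATE_HOME?: string;
  LOCALAPPDATA?: string;
}

const DIR_NAME = "agent-probe";

// Resolve the per-user state directory from the platform table above.
// Assumption: XDG_STATE_HOME, when set, wins on every platform.
function stateDir(platform: Platform, home: string, env: StateEnv): string {
  if (env.XDG_STATE_HOME) return `${env.XDG_STATE_HOME}/${DIR_NAME}`;
  switch (platform) {
    case "linux":
      return `${home}/.local/state/${DIR_NAME}`;
    case "darwin":
      return `${home}/Library/Application Support/${DIR_NAME}`;
    case "win32":
      // Assumption: fall back to the conventional location if the variable is unset.
      return `${env.LOCALAPPDATA ?? `${home}/AppData/Local`}/${DIR_NAME}`;
  }
}

const linuxDir = stateDir("linux", "/home/ada", {});
const macDir = stateDir("darwin", "/Users/ada", {});
const overridden = stateDir("linux", "/home/ada", { XDG_STATE_HOME: "/tmp/state" });
console.log(linuxDir); // /home/ada/.local/state/agent-probe
console.log(macDir); // /Users/ada/Library/Application Support/agent-probe
console.log(overridden); // /tmp/state/agent-probe
```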

Development

Setup

git clone <this repo>
cd agent-probe
npm ci

Scripts

| Command | What it does |
|---------|--------------|
| npm run dev | Run the server from source (tsx, stdio) |
| npm test | Full vitest suite (unit + integration) |
| npm run test:watch | Watch mode |
| npm run typecheck | tsc --noEmit (strict, noUncheckedIndexedAccess) |
| npm run lint | ESLint over the solution |
| npm run build | tsc (the plugin ships verbatim from the repo; no injection step) |

Repository layout

src/
  index.ts              entry point (Node version gate, then stdio server)
  server.ts             MCP server wiring (pure capability — no guidance content)
  tools.ts              the 9 tool registrations (the ONE place tools exist)
  constants.ts          single source for names, limits, error codes
  logger.ts             stderr-only logging (stdout belongs to MCP framing)
  session/              session state machine, run boundaries, lockfiles, state dirs
  ingest/               POST /events listener, wire contract, caps, warnings
  evidence/             per-session SQLite store, timeline/span queries, run diff
  cleanup/              marker scanning, pairing, removal-range derivation
  integration/          cross-module flow tests + the golden-demo fixture
plugin/
  .claude-plugin/       Claude Code plugin manifest
  .mcp.json             MCP server config (npx the published package)
  skills/agent-probe/   the Agent Skill — the debugging workflow itself
examples/
  golden-demo/          runnable demo app with the staged silent-failure bug
.claude-plugin/         marketplace manifest (the repo installs as a plugin)
.github/workflows/      ci.yml (tests, greps, plugin drift gates), release.yml (npm publish)

Design rules the code enforces

stdout is sacred — it carries MCP framing only; all diagnostics go to stderr. Lint + CI grep enforce it.
No egress — the only networking is the localhost ingestion listener; net-client imports are lint-banned, URLs are CI-grepped.
Frozen runtime deps — @modelcontextprotocol/sdk + zod, nothing else.
Single-sourced names — the package name and marker token live in src/constants.ts; the plugin ships verbatim literals, and CI greps chain them back to the constants (including the self-scan invariant: no plugin file may match the cleanup marker pattern).
Tools never touch SQL — strict layering: tools → SessionManager → EvidenceStore.

See CONTRIBUTING.md for the full list and the PR process.

Releases

Pushing a v* tag runs release.yml: typecheck → tests → lint → build → pack-install smoke (tarball contents + the installed bin must boot) → npm publish via OIDC trusted publishing (no tokens, provenance attached automatically).

3-minute demo

Recording coming with launch.

Known limitations

Honest ones, each with its mitigation:

Timing-sensitive bugs may shift under instrumentation. Behavior equivalence is proven for application-visible outputs (see src/integration/behavior-equivalence.test.ts), not microsecond timing. If your bug is a sub-millisecond race, probes can move it.
Git hygiene: don't commit probed code mid-session. Markers are plain greppable comments and verify_cleanup is the gate — but nothing stops a git commit while probes are in place. Finish the loop before committing.
Container/WSL2 clock skew can disorder cross-service timelines. ts_probe is each app's own clock. Keep services on one clock domain, or read seq_tied flags honestly — arrival order is not causality.
workspace_root is agent-supplied and trusted (v1). The cleanup scan verifies everything within it; it does not verify that it is your project root. Glance at the echoed path when the session starts.
One active session per workspace (v1). A second session against the same workspace is refused while the first holds the lock; stale sessions from crashes are surfaced and disposed explicitly.
Non-git workspaces over-scan. Without git's ignore rules the cleanup scan walks everything under workspace_root, so stale markers in build output may surface. They're real markers — delete the build artifacts or rebuild.

Contributing

Contributions are welcome — see CONTRIBUTING.md for setup, the project's invariants, and the PR process. All participants are expected to follow the Code of Conduct.

Found a security issue? Please report it privately — see SECURITY.md.

License

MIT