@marcfargas/pi-test-harness
Test harness for pi extensions — in-process session testing with playbook-driven model mocking, package install verification, and subprocess mocking.
Why
Testing pi extensions is hard. Extensions register tools, subscribe to hooks, intercept tool calls, use UI — all deeply integrated with pi's runtime. Mocking everything produces tests that don't reflect reality. Not testing produces extensions that break in production.
pi-test-harness takes a different approach: let pi be pi. Everything runs for real — extension loading, tool registration, hooks, event lifecycle, session state. Only the model is replaced (via streamFn), and optionally tool execution is intercepted for tools you don't want to run for real.
The result: tests that exercise real code paths, in ~10 lines of setup, with zero LLM calls.
Install
npm install --save-dev @marcfargas/pi-test-harness
Peer dependencies
- @mariozechner/pi-coding-agent >= 0.50.0
- @mariozechner/pi-ai
- @mariozechner/pi-agent-core
Quick Start
import { describe, it, expect, afterEach } from "vitest";
import {
createTestSession,
when, calls, says,
type TestSession,
} from "@marcfargas/pi-test-harness";
describe("my extension", () => {
let t: TestSession;
afterEach(() => t?.dispose());
it("calls a tool and responds", async () => {
t = await createTestSession({
extensions: ["./src/index.ts"],
mockTools: {
bash: (params) => `$ ${params.command}\nfile1.txt\nfile2.txt`,
read: "file contents here",
write: "written",
edit: "edited",
},
});
await t.run(
when("List files in the project", [
calls("bash", { command: "ls" }),
says("Found 2 files: file1.txt and file2.txt"),
]),
);
expect(t.events.toolResultsFor("bash")).toHaveLength(1);
expect(t.events.toolResultsFor("bash")[0].text).toContain("file1.txt");
expect(t.events.toolResultsFor("bash")[0].mocked).toBe(true);
});
});
Architecture
┌───────────────────────────────────────────┐
│ Real pi environment │
│ │
│ Extensions ─── loaded for real │
│ Tool registry ─ real hooks + wrapping │
│ Session state ─ in-memory persistence │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ Agent Loop │ │
│ │ │ │
│ │ streamFn ──── REPLACED by playbook │ │
│ │ tool.execute() INTERCEPTED if mock │ │
│ │ ctx.ui.* ──── INTERCEPTED + logged │ │
│ └─────────────────────────────────────┘ │
└───────────────────────────────────────────┘
Three substitution points at the boundary — everything else runs through pi's real code:
| What | Substituted with | Purpose |
|------|-----------------|---------|
| streamFn | Playbook | Scripts what the model "decides" |
| tool.execute() | Mock handler | Controls what tools "return" (hooks still fire) |
| ctx.ui.* | Mock UI | Controls what the user "answers" |
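For example, a single session can exercise all three substitution points at once — a condensed sketch using the APIs described below:
// streamFn is replaced by the playbook passed to run()
const t = await createTestSession({
  extensions: ["./src/index.ts"],
  mockTools: { bash: "ok" }, // tool.execute() intercepted for bash
  mockUI: { confirm: true }, // ctx.ui.* intercepted
});
await t.run(when("Go", [calls("bash", { command: "ls" }), says("Done.")]));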
Playbook DSL
The playbook replaces the LLM. Instead of calling a model, the agent loop consumes scripted actions in order.
when(prompt, actions)
Defines a conversation turn — the prompt you'll send and what the model does in response:
when("Deploy the app", [
calls("bash", { command: "npm run build" }),
calls("bash", { command: "gcloud run deploy" }),
says("Deployed successfully."),
])
calls(tool, params)
The model calls a tool. Pi's hooks fire, the tool executes (real or mocked), result feeds back:
calls("plan_mode", { enable: true })
calls("bash", { command: "ls -la" })says(text)
The model emits text. The agent turn ends:
says("All done. The deployment is complete.")Multi-turn conversations
Pass multiple turns to run():
await t.run(
when("What files are in the project?", [
calls("bash", { command: "ls" }),
says("Found 3 files."),
]),
when("Now read the README", [
calls("read", { path: "README.md" }),
says("Here's what it says..."),
]),
);
Mock Tools
mockTools intercepts tool.execute() for specific tools. Pi's tool registry and event flow remain untouched. Extension hooks (tool_call, tool_result) fire for mocked tools via the extension runner — so hook-based blocking (e.g., plan mode) works correctly even with mocked tools.
const t = await createTestSession({
extensions: ["./src/index.ts"],
mockTools: {
// Static string → becomes { content: [{ type: "text", text: "..." }] }
bash: "command output here",
// Dynamic function → receives params, returns string or ToolResult
read: (params) => `contents of ${params.path}`,
// Full ToolResult for precise control
write: {
content: [{ type: "text", text: "Written successfully" }],
details: { bytesWritten: 42 },
},
},
});
Extension-registered tools execute for real unless they appear in mockTools. This lets you test your extension's actual tool logic while controlling the built-in tools.
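For example — a sketch in which my_tool is a hypothetical tool registered by the extension under test:
const t = await createTestSession({
  extensions: ["./src/index.ts"],
  mockTools: { bash: "mocked output" }, // only bash is intercepted
});
await t.run(
  when("Use my tool", [
    calls("my_tool", { value: 42 }), // hypothetical extension tool — runs its real code
    calls("bash", { command: "ls" }), // intercepted by mockTools
    says("Done."),
  ]),
);
expect(t.events.toolResultsFor("my_tool")[0].mocked).toBe(false);
expect(t.events.toolResultsFor("bash")[0].mocked).toBe(true);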
Late-bound Params & .then()
When one tool call produces a value needed by the next, use .then() to capture it and () => params for late binding:
let planId = "";
await t.run(
when("Create and approve a plan", [
calls("plan_propose", {
title: "Send invoice",
steps: [{ description: "Send email", tool: "go-easy", operation: "send" }],
}).then((result) => {
// Extract the plan ID from the tool result
planId = result.text.match(/PLAN-[a-f0-9]+/)![0];
}),
// Late-bound: params resolved at call time, after .then() has fired
calls("plan_approve", () => ({ id: planId })),
says("Plan approved and executing."),
]),
);
expect(planId).toMatch(/^PLAN-/);
Mock UI
Extensions that call ctx.ui.confirm(), ctx.ui.select(), etc. get mock responses. All calls are recorded for assertions.
const t = await createTestSession({
extensions: ["./src/index.ts"],
mockUI: {
confirm: false, // deny all confirmations
select: 0, // always pick first item
input: "user input text", // return fixed string
editor: "edited content", // return fixed string
},
});
// ... run playbook ...
// Assert the extension asked for confirmation
expect(t.events.uiCallsFor("confirm")).toHaveLength(1);
expect(t.events.uiCallsFor("confirm")[0].returnValue).toBe(false);
Dynamic handlers are also supported:
mockUI: {
confirm: (title, message) => !title.includes("Delete"),
select: (title, items) => items.find(i => i.includes("staging")),
}
Defaults (when no mock config is provided): confirm → true, select → first item, input → "", editor → "".
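Dynamic handler return values are recorded like any other UI call — a minimal sketch, assuming the extension called ctx.ui.select() once during the run:
const selects = t.events.uiCallsFor("select");
expect(selects).toHaveLength(1);
expect(selects[0].returnValue).toContain("staging");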
Event Collection
Every session event, tool call, tool result, message, and UI interaction is collected:
// Tool events
t.events.toolCallsFor("bash") // ToolCallRecord[] for "bash"
t.events.toolResultsFor("bash") // ToolResultRecord[] for "bash"
t.events.blockedCalls() // tools blocked by hooks (e.g., plan mode)
// UI events
t.events.uiCallsFor("notify") // UICallRecord[] for notify()
t.events.uiCallsFor("confirm") // UICallRecord[] for confirm()
// Messages and raw events
t.events.messages // AgentMessage[]
t.events.all // AgentSessionEvent[] (everything)
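A typical post-run assertion sketch built from these accessors:
// Every bash result came from mockTools, nothing was blocked by hooks,
// and the conversation produced at least one message
expect(t.events.toolResultsFor("bash").every((r) => r.mocked)).toBe(true);
expect(t.events.blockedCalls()).toHaveLength(0);
expect(t.events.messages.length).toBeGreaterThan(0);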
ToolResultRecord
interface ToolResultRecord {
step: number; // playbook step index
toolName: string;
toolCallId: string;
text: string; // concatenated text content
content: Array<{ type: string; text?: string }>;
isError: boolean;
mocked: boolean; // true if mockTools handled it
}
Error Propagation
By default (propagateErrors: true), real tool errors abort the test with a diagnostic pointing to the exact playbook step:
Error during tool execution at playbook step 3 (call "bash"):
ENOENT: no such file or directory '/foo/bar'
at Object.readFileSync (node:fs:...)
This error was thrown by the real tool execution, not by the playbook.
To capture errors as isError: true tool results instead of aborting, set propagateErrors: false:
const t = await createTestSession({
propagateErrors: false,
// ...
});
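With propagation disabled, the failure shows up on the result record — a sketch, assuming read is not mocked so the real tool throws:
await t.run(
  when("Read a missing file", [
    calls("read", { path: "/does/not/exist" }), // real read throws ENOENT
    says("Could not read the file."),
  ]),
);
const [result] = t.events.toolResultsFor("read");
expect(result.isError).toBe(true);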
Playbook Diagnostics
The harness auto-asserts that all playbook actions are consumed after run() completes. If the playbook is exhausted early or has remaining unconsumed actions, you get a clear diagnostic:
Playbook exhausted unexpectedly.
Consumed 2 action(s).
Last consumed: calls("bash", {"command":"ls"}) at step 2
The agent loop called streamFn but no more playbook actions were available.
This usually means a tool call produced an unexpected result that caused
additional streamFn calls (retries, error handling).
Playbook not fully consumed after run() completed.
Consumed 1 of 3 action(s).
Remaining:
- calls("write", {"path":"out.txt","content":"hello"})
- says("Done writing.")
The agent loop ended before all playbook actions were used.
This usually means a tool was blocked by a hook or returned early,
causing fewer streamFn calls than expected.
Sandbox Install Verification
Catches broken packages before publish — verifies that npm pack → install → load actually works:
import { verifySandboxInstall } from "@marcfargas/pi-test-harness";
const result = await verifySandboxInstall({
packageDir: "./packages/my-extension",
expect: {
extensions: 1,
tools: ["my_tool", "my_other_tool"],
skills: 0,
},
});
expect(result.loaded.extensionErrors).toEqual([]);
expect(result.loaded.tools).toContain("my_tool");
Optionally run a smoke test inside the sandbox:
const result = await verifySandboxInstall({
packageDir: "./packages/my-extension",
expect: { extensions: 1 },
smoke: {
mockTools: { bash: "ok", read: "contents", write: "written", edit: "edited" },
script: [
when("Test", [
calls("my_tool", { value: "test" }),
says("Works."),
]),
],
},
});
Mock Pi CLI
For extensions that spawn pi --mode json -p as a subprocess (e.g., subagent orchestrators), createMockPi() puts a fake pi binary in PATH that returns controllable responses.
import { createMockPi } from "@marcfargas/pi-test-harness";
const mockPi = createMockPi();
mockPi.install(); // creates temp dir with pi shim, prepends PATH
// Queue responses (consumed in order, last one repeats)
mockPi.onCall({ output: "Hello from agent", exitCode: 0 });
mockPi.onCall({ stderr: "agent crashed", exitCode: 1 });
mockPi.onCall({
jsonl: [
{ type: "tool_execution_start", toolName: "bash" },
{ type: "message_end", message: { role: "assistant", content: [{ type: "text", text: "done" }] } },
],
});
// Write files during execution (e.g., chain_dir output simulation)
mockPi.onCall({
output: "Result written",
writeFiles: { "/tmp/output.md": "# Result\nDone." },
});
// Reset queue between tests
mockPi.reset();
// Check invocation count
expect(mockPi.callCount()).toBe(0);
// Cleanup
mockPi.uninstall(); // restores PATH, deletes temp dir
How it works
- install() creates a temp directory with a platform-specific shim (pi.cmd on Windows, a pi shell script on Linux)
- The shim is prepended to PATH so child_process.spawn("pi", ...) resolves to it
- Each invocation reads the next response from a file-based queue (queue.json + counter)
- When the queue is exhausted, the last response repeats
- If no responses are queued, the mock echoes the task text
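Putting it together — a sketch of a typical test, where runSubagent is a hypothetical helper from the extension under test that spawns pi internally:
import { createMockPi } from "@marcfargas/pi-test-harness";
import { runSubagent } from "../src/subagent"; // hypothetical helper under test

const mockPi = createMockPi();
beforeEach(() => {
  mockPi.install();
  mockPi.reset();
});
afterEach(() => mockPi.uninstall());

it("forwards the subagent result", async () => {
  mockPi.onCall({ output: "subagent finished", exitCode: 0 });
  const result = await runSubagent("summarize the repo");
  expect(result).toContain("subagent finished");
  expect(mockPi.callCount()).toBe(1);
});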
Response options
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| output | string | echo task | Text in the message_end event |
| exitCode | number | 0 | Process exit code |
| stderr | string | — | Written to stderr |
| delay | number | 0 | Delay in ms before responding |
| jsonl | object[] | — | Raw JSONL events (replaces default message_end) |
| writeFiles | Record<string, string> | — | Files to create (path → content) |
Safety features
- Exit handler: PATH is restored on process exit even if uninstall() isn't called (test crash safety)
- Key validation: typos like { ouptut: "..." } throw immediately instead of silently passing
- Timeout: the mock script exits after 30s to prevent hanging tests
Concurrency
Designed for serial subprocess spawns within a single test. If your test spawns multiple pi processes concurrently, responses may be consumed out of order.
Test layer summary
| Layer | What it mocks | Use when |
|-------|--------------|----------|
| createTestSession | LLM (streamFn) | Testing extension logic in-process |
| verifySandboxInstall | Nothing (real install) | Verifying npm package works |
| createMockPi | pi CLI binary | Testing subprocess-spawning extensions |
API Reference
createTestSession(options?)
Creates a test session with a real pi environment.
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| extensions | string[] | [] | Extension file paths to load |
| extensionFactories | Function[] | [] | Inline extension factory functions |
| cwd | string | auto temp dir | Working directory (cleaned on dispose if auto) |
| systemPrompt | string | — | Override the system prompt |
| mockTools | Record<string, MockToolHandler> | — | Tool execution interceptors |
| mockUI | MockUIConfig | defaults | UI mock configuration |
| propagateErrors | boolean | true | Abort test on real tool throw |
Returns Promise<TestSession>.
TestSession
| Property / Method | Type | Description |
|-------------------|------|-------------|
| run(...turns) | Promise<void> | Run the conversation script |
| session | AgentSession | The real pi session underneath |
| cwd | string | Working directory |
| events | TestEvents | All collected events |
| playbook | { consumed, remaining } | Playbook consumption state |
| dispose() | void | Cleanup temp dir and session |
verifySandboxInstall(options)
| Option | Type | Description |
|--------|------|-------------|
| packageDir | string | Package directory (must have package.json) |
| expect.extensions | number | Expected extension count |
| expect.tools | string[] | Expected tool names |
| expect.skills | number | Expected skill count |
| smoke.mockTools | Record<string, MockToolHandler> | Mock tools for smoke test |
| smoke.script | Turn[] | Playbook script for smoke test |
createMockPi()
Creates a mock pi CLI with file-based response queue.
Returns MockPi:
| Property / Method | Type | Description |
|-------------------|------|-------------|
| install() | void | Create shim, prepend to PATH |
| uninstall() | void | Restore PATH, delete temp dir |
| onCall(response) | void | Queue a MockPiCall response |
| reset() | void | Clear queue and counter |
| callCount() | number | Number of times mock pi was invoked |
| dir | string | Temp directory path |
MockToolHandler
type MockToolHandler =
| string // static text
| ToolResult // full result object
| ((params: Record<string, unknown>) => string | ToolResult); // dynamic
MockUIConfig
interface MockUIConfig {
confirm?: boolean | ((title: string, message: string) => boolean);
select?: number | string | ((title: string, items: string[]) => string | undefined);
input?: string | ((title: string, placeholder?: string) => string | undefined);
editor?: string | ((title: string, prefilled?: string) => string | undefined);
}
ToolBlockedError
Thrown (and exported) when an extension hook blocks a mocked tool call. Use with instanceof to assert that a specific tool was blocked rather than crashed:
import { ToolBlockedError } from "@marcfargas/pi-test-harness";
// Verify a tool was blocked (not just errored)
const result = t.events.toolResultsFor("bash")[0];
expect(result.isError).toBe(true);
// Or catch it in error-propagation scenarios
try {
await t.run(when("Try write", [calls("bash", { command: "rm -rf /" }), says("Done.")]));
} catch (err) {
if (err instanceof ToolBlockedError) {
// Expected — extension hook blocked the call
} else {
throw err; // real error
}
}
safeRmSync(filePath)
Removes a file, swallowing EPERM/EBUSY errors only. Intended for afterEach cleanup of extension-owned SQLite files on Windows. See Platform Notes.
Real-World Example: Testing pi-planner
Testing an extension that registers 8 tools, blocks writes in plan mode, and manages plan lifecycle:
import { createTestSession, when, calls, says, type TestSession } from "@marcfargas/pi-test-harness";
import * as path from "node:path";
const EXTENSION = path.resolve(__dirname, "../../src/index.ts");
const MOCKS = {
bash: (p: Record<string, unknown>) => `mock: ${p.command}`,
read: "mock contents", write: "mock written", edit: "mock edited",
};
describe("pi-planner", () => {
let t: TestSession;
afterEach(() => t?.dispose());
it("enters plan mode and proposes a plan", async () => {
t = await createTestSession({
extensions: [EXTENSION],
mockTools: MOCKS,
});
let planId = "";
await t.run(
when("Plan the deployment", [
calls("plan_mode", { enable: true }),
calls("plan_propose", {
title: "Deploy v2",
steps: [
{ description: "Build", tool: "bash", operation: "build" },
{ description: "Deploy", tool: "gcloud", operation: "deploy" },
],
}).then((r) => {
planId = r.text.match(/PLAN-[a-f0-9]+/)![0];
}),
says("Plan proposed."),
]),
);
expect(planId).toMatch(/^PLAN-/);
expect(t.events.toolResultsFor("plan_mode")[0].text).toContain("enabled");
expect(t.events.uiCallsFor("notify")).toHaveLength(1);
});
});
Platform Notes
Windows + SQLite (EPERM in afterEach)
session.dispose() does not fire session_shutdown. That event fires at Node.js process exit. Extensions that open SQLite databases in session_start (e.g., brainiac, memory extensions) keep those files locked for the entire test runner lifetime.
On Windows, this means rmSync(dbPath) in afterEach throws EPERM. Use safeRmSync instead:
import { safeRmSync } from "@marcfargas/pi-test-harness";
afterEach(() => {
// Dispose session first, then attempt file cleanup
t?.dispose();
safeRmSync(dbPath);
safeRmSync(dbPath + "-wal");
safeRmSync(dbPath + "-shm");
});
Files are cleaned by the OS when the process exits. Use unique DB paths per test (e.g., mkdtempSync + test name) for isolation.
safeRmSync only swallows EPERM and EBUSY — all other errors still propagate.
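A sketch of that isolation pattern, assuming the extension reads its DB location from an environment variable (MY_EXT_DB_PATH is hypothetical):
import { mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import * as path from "node:path";

// A fresh, unique DB path per test — files still locked by earlier
// tests can't collide with this one
const dbPath = path.join(mkdtempSync(path.join(tmpdir(), "my-ext-test-")), "test.db");
process.env.MY_EXT_DB_PATH = dbPath; // hypothetical env var read by the extension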
Design Philosophy
Let pi be pi. The less we fake, the more real the test.
The harness minimizes substitution. Extensions load through pi's real loader (jiti). Tools go through pi's real wrapping pipeline (wrapToolsWithExtensions). Hooks fire through pi's real ExtensionRunner. Events flow through pi's real event system.
Only the LLM boundary is replaced — because that's the one thing you can't run in a test.
License
MIT
