agent-do
v0.6.0
Provider-agnostic autonomous agent loop for JavaScript. Built on the [Vercel AI SDK](https://sdk.vercel.ai/), it drives any `LanguageModel` through a tool-use loop until the task is complete.
⚠️ No sandbox
agent-do is not sandboxed within its working directory: when file tools are enabled, the agent can read, write, edit, and delete files under that directory.
By default, file tools operate on files reachable from the working
directory (--cwd, defaulting to the current directory). Path-traversal
guards prevent the agent from escaping that root, but within that scope
a misbehaving prompt or unintended tool call can cause permanent data
loss.
--read-only blocks writes, deletes, and edits — but the agent can
still read, list, and grep every file in the working directory and
send its contents to the model provider. If a directory contains
secrets you don't want exposed, use --no-tools instead.
The CLI prints a one-line warning to stderr on every run that has
file tools enabled, so the blast radius is visible up front. The
warning text adapts to the resolved configuration (read-only vs full
read/write), and a saved agent's noTools / readOnly settings win
over CLI flags so you don't get a misleading "no tools" warning when a
saved config silently re-enables them.
Before using agent-do, especially the CLI:
- Understand what the agent will do before it runs — review the task and system prompt carefully.
- Run with --read-only to prevent writes, deletes, and edits while still letting the agent reason about files.
- Run with --no-tools to disable all file access entirely (including reads).
- Always work in a directory you are comfortable giving the agent full access to.
- Keep important files backed up or under version control.
There is no undo. Proceed with caution.
Features
- Provider-agnostic -- works with any Vercel AI SDK LanguageModel (OpenAI, Anthropic, Google, Mistral, Ollama, etc.)
- Autonomous loop -- calls tools, reads results, and continues until the model responds without tool calls
- Streaming and non-streaming -- stream() yields ProgressEvents as an AsyncIterable; run() returns the final text
- Built-in tool factories -- createMemoryTools (private scratchpad), createWorkspaceTools (project files + deny-list, optionally sandboxed), createShellTool (sandbox-mediated bash)
- Skills system -- install, search, and manage skill definitions that extend the agent's system prompt
- Lifecycle hooks -- intercept tool calls, track steps, modify arguments, or halt execution
- Permission system -- accept-all, deny-all, or ask mode with per-tool overrides
- Usage tracking -- built-in cost estimation for 50+ models with per-run and per-day spending limits
- Testable -- createMockModel() returns a mock LanguageModel with predetermined responses
- Eval framework -- defineEval() + runEvals() to measure agent quality with 13 assertion types, LLM-as-judge, and multi-provider comparison
Install
npm install agent-do

Peer dependency: ai (Vercel AI SDK v6+).
The CLI ships with @ai-sdk/anthropic, @ai-sdk/google, and @ai-sdk/openai
bundled so npx agent-do works out of the box. These are declared as optional
peers for library consumers — if you only use one provider, npm won't complain
about the others being missing, but the CLI covers them all.
Using a different provider
The CLI only knows about anthropic, google, openai, and ollama. For
any other provider (Mistral, Groq, Cohere, OpenRouter, Bedrock, xAI, etc.),
install the SDK and use agent-do as a library:
npm install agent-do @ai-sdk/mistral

import { createAgent } from 'agent-do';
import { createMistral } from '@ai-sdk/mistral';
const agent = createAgent({
model: createMistral()('mistral-large-latest'),
});
await agent.run('your task');

Any Vercel AI SDK LanguageModel works — see
sdk.vercel.ai/providers for the full list.
CLI
Run agents from the command line with zero config:
# One-shot task
npx agent-do "What is TypeScript?"
# Pipe content as context
cat README.md | npx agent-do "Summarize this"
# Pipe + prompt merged
echo "function add(a, b) { return a + b }" | npx agent-do "Review this code"
# Interactive chat
npx agent-do
# Choose provider and model
npx agent-do --provider google --model gemini-2.5-flash "Hello"
# Create a reusable agent
npx agent-do create code-reviewer --provider anthropic --system "Review code for bugs"
# Run a saved agent by name
npx agent-do run code-reviewer "Review this function"
# List saved agents
npx agent-do list
# Run a custom agent script (.js/.mjs/.cjs/.ts files).
# --script is required to import local JavaScript/TypeScript; see
# "Script mode" below for the security reasoning.
npx agent-do run ./my-agent.ts --script "Do something"
# Run eval cases
npx agent-do eval evals/basic.ts
# Compare providers
npx agent-do eval evals/ --compare anthropic,google,openai --output json

CLI options
npx agent-do [options] [prompt] One-shot or interactive
npx agent-do run <name|file> [task] Run a saved agent OR a script file
npx agent-do eval <file|dir> [options] Run evals
Options:
--provider <name> anthropic | google | openai | ollama (default: anthropic)
--model <id> Model ID (default: provider-specific)
--system <prompt> System prompt
--cwd <dir> Working directory for workspace tools (default: cwd)
--memory <dir> Memory directory for --with-memory (default: .agent-do/)
--with-memory Enable memory tools (agent scratchpad)
--read-only Block all writes (workspace + memory)
--exclude <globs> Extra deny-list patterns (comma-separated, gitignore-style)
--include-sensitive Bypass built-in sensitive-file deny list (.env, .ssh, etc.)
--max-iterations <n> Max loop iterations (default: 20)
--no-tools Disable all file tools
--verbose Show per-step thinking + tool summaries (stderr)
--show-content With --verbose: also include each tool's full result
--script Required for `run <path>` to import local JS/TS files
-y, --yes Skip the interactive confirmation for --script
--provider-tool <name> Enable a provider-native tool (repeatable). See below.
--provider-options <json>
Provider-specific options forwarded to every model
call, e.g. '{"google":{"useSearchGrounding":true}}'.
--json JSON output
--output <fmt> console | json | csv (eval only)
--compare <providers> Compare providers (eval only, comma-separated)
--concurrency <n> Parallel eval cases (default: 1)

Provider-native tools and options
Provider packages (@ai-sdk/google, @ai-sdk/anthropic,
@ai-sdk/openai) ship server-side tools — Google
search-with-grounding, Anthropic web search, OpenAI web search,
code execution, URL context, etc. — and provider-specific call
options like Google's useSearchGrounding, Anthropic's thinking,
or OpenAI's reasoningEffort. Both are exposed on the CLI:
# Google search grounding via the provider tool
npx agent-do "what won the F1 race last weekend?" \
--provider google \
--provider-tool googleSearch
# Plus provider options on the model call itself
npx agent-do "summarize https://example.com" \
--provider google \
--provider-tool urlContext \
--provider-options '{"google":{"useSearchGrounding":true}}'
# Anthropic web search (alias `webSearch` resolves to the latest dated tool)
npx agent-do "find recent papers on diffusion models" \
--provider anthropic --provider-tool webSearch
# OpenAI web search
npx agent-do "latest research on graph neural networks" \
--provider openai --provider-tool webSearch

--provider-tool is repeatable and can take comma-separated values
(--provider-tool googleSearch,urlContext). A small alias table
maps short names to the latest dated versions:
webSearch → webSearch_20260209, bash → bash_20250124, etc.
The CLI only accepts tools that work with empty config — currently
googleSearch, urlContext, codeExecution, webSearch,
webFetch, bash, textEditor, computer, memory,
webSearchPreview, codeInterpreter, imageGeneration,
applyPatch. Tools that need per-tool args (fileSearch →
vectorStoreIds, mcp → server URL, customTool → name/description,
…) must be configured from a script export so you can pass real args.
Trying to enable one from the CLI fails fast with a copy-pasteable
script-mode snippet.
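A minimal sketch of how repeated and comma-separated --provider-tool values could be flattened and resolved. The alias table below holds only the two entries quoted above. The function name, the de-duplication step, and applying aliases regardless of provider are assumptions made for illustration, not the CLI's actual code.

```typescript
// Hypothetical alias table: only the two mappings quoted in this README.
const TOOL_ALIASES: Record<string, string> = {
  webSearch: "webSearch_20260209",
  bash: "bash_20250124",
};

// Flatten repeated flags and comma-separated values, then resolve aliases.
function resolveProviderTools(flagValues: string[]): string[] {
  const names = flagValues
    .flatMap((value) => value.split(","))
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  // Assumption: duplicates are dropped, keeping first-seen order.
  return Array.from(new Set(names)).map((name) => TOOL_ALIASES[name] ?? name);
}

// Equivalent to: --provider-tool googleSearch,urlContext --provider-tool googleSearch
console.log(resolveProviderTools(["googleSearch,urlContext", "googleSearch"]));
```

Names without an alias pass through unchanged, so a dated tool name can still be given explicitly.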
The same fields are first-class on saved agents and AgentConfig:
npx agent-do create researcher \
--provider google --model gemini-2.5-flash \
--system 'Research topics' \
--provider-tool googleSearch --provider-tool urlContext \
--provider-options '{"google":{"useSearchGrounding":true}}'
npx agent-do run researcher "summarize the latest Mars helicopter mission"

In a script (Format 2), set them on the exported AgentConfig:
import { google } from '@ai-sdk/google';
export default {
id: 'researcher', name: 'researcher',
model: google('gemini-2.5-flash'),
tools: {
google_search: google.tools.googleSearch({}),
url_context: google.tools.urlContext({}),
},
providerOptions: { google: { useSearchGrounding: true } },
};

Script mode: run <path> --script
agent-do run <arg> resolves saved-agent names by default. To run a
local JavaScript or TypeScript file as an agent, pass --script
explicitly:
npx agent-do run ./my-agent.ts --script "Do something"

Importing an arbitrary JS/TS file runs its top-level code with your
user privileges — the same trust model as running any local script.
The --script flag is a deliberate speed bump so that:
- A missed saved-agent lookup (typo, stale name) fails with a clear error instead of silently import()-ing a stray file that happens to match the name.
- Social-engineering vectors like "download this helper script, then run agent-do run helper.js" require an explicit opt-in.
When --script is passed:
- The path must point inside --cwd (symlinks and .. escapes are rejected after canonicalisation).
- Only .js/.mjs/.cjs/.ts/.mts/.cts extensions are allowed.
- The file must be a regular file (no directories, no special files).
- A banner prints the path, size, and SHA-256 prefix, then asks Continue? [y/N]. Pass -y/--yes to skip the prompt (required in non-TTY contexts — CI, piped input — since there's nowhere to type the answer).
Tools: workspace vs memory
agent-do splits file access into two distinct concepts so the agent knows whether it's touching your project or its own notes:
- Workspace tools (read_file, write_file, list_directory, grep_file, find_files, edit_file, delete_file) are enabled by default and rooted at --cwd (defaults to the current directory). This is what most CLI users want — the agent reads and modifies real project files.
- Memory tools (memory_read, memory_write, memory_list, memory_delete, memory_search) are opt-in via --with-memory. They give the agent a private, per-agent scratchpad under --memory (default .agent-do/). Use memory when you want the agent to remember notes or plans across runs without scribbling on the project.
Both respect --read-only. To disable all file access, use --no-tools.
Sensitive-file deny list
Workspace tools ship with a gitignore-style deny list that blocks access to credential material by default:
- Reads blocked: .env*, *.pem, *.key, id_rsa*, id_ed25519*, .ssh/**, .aws/**, .gcloud/**, .kube/**, .git/objects/**, .git/hooks/**.
- Writes blocked (above plus): .git/**, node_modules/**.
Reads of node_modules/** and .git/HEAD are allowed — the agent can
inspect dependencies and branch state but cannot silently rewrite git
hooks or clobber installed modules.
Layer your own policy on top:
- --exclude 'secrets/**,*.cred' — per-invocation patterns.
- .agent-doignore at the workspace root — project-scoped, gitignore-style file. Merged with the defaults.
- --include-sensitive — opt out of the built-in defaults when you explicitly want the old fully-open behaviour. .agent-doignore and --exclude still apply.
Blocked operations surface in --verbose logs as [blocked] entries
with the matched rule; the model sees only that it was blocked (not
the rule name) to avoid letting it probe the policy.
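To make the deny-list behaviour concrete, here is a simplified sketch of gitignore-style matching over the default read patterns listed above. The matcher is an assumption for illustration: it supports only * and **, treats slash-free patterns as basename matches, and anchors patterns containing a slash at the workspace root. The real implementation may follow fuller gitignore semantics.

```typescript
// Default read deny-list, copied from the list above.
const READ_DENY: string[] = [
  ".env*", "*.pem", "*.key", "id_rsa*", "id_ed25519*",
  ".ssh/**", ".aws/**", ".gcloud/**", ".kube/**",
  ".git/objects/**", ".git/hooks/**",
];

// Translate a small glob subset into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  let source = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*" && glob.charAt(i + 1) === "*") {
      source += ".*"; // ** crosses directory boundaries
      i++;
    } else if (ch === "*") {
      source += "[^/]*"; // * stays within one path segment
    } else {
      source += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
    }
  }
  return new RegExp(`^${source}$`);
}

// Return the first matching rule, or undefined when the path is allowed.
function matchDenyRule(relPath: string, patterns: string[]): string | undefined {
  const base = relPath.slice(relPath.lastIndexOf("/") + 1);
  return patterns.find((pattern) =>
    globToRegExp(pattern).test(pattern.includes("/") ? relPath : base),
  );
}

console.log(matchDenyRule("config/.env.local", READ_DENY)); // ".env*"
console.log(matchDenyRule(".git/HEAD", READ_DENY));         // undefined
```

Returning the matched rule mirrors the operator-facing [blocked] log entry. The model-facing message would omit it.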
Tool result layering: model vs user vs programmatic
Every built-in tool returns a structured ToolResult with three views:
- modelContent — the string the LLM sees. File contents are wrapped in <tool_output tool="…" path="…">…</tool_output> markers, capped at 256 KB (maxReadBytes), and common prompt-injection markers (ignore previous instructions, <system> tags) are replaced with a visible [redacted prompt-injection marker].
- userSummary — a one-liner for operator logs. Includes real paths, byte counts, line counts, match counts, block reasons, errno codes.
- data — structured fields (path, bytes, lines, truncated, redactedMarkerCount, matchCount, hiddenByDenyList, rule, …) for programmatic consumers.
In --verbose CLI mode you see the userSummary + a compact data
line on stderr. Full raw tool output is withheld by default so secrets
and large file contents don't leak into CI logs; pass --show-content
to include it.
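The three views can be illustrated with a small sketch that builds a read_file-style result. Everything here is a hypothetical stand-in: the function name, the two redaction patterns, and measuring the cap in characters rather than bytes are simplifying assumptions, not the library's implementation.

```typescript
// Local stand-in for the ToolResult shape described above.
interface ToolResultSketch {
  modelContent: string;
  userSummary: string;
  data: { path: string; chars: number; truncated: boolean; redactedMarkerCount: number };
}

const MAX_READ_CHARS = 256 * 1024; // assumption: characters, not bytes
const INJECTION_PATTERNS: RegExp[] = [/ignore previous instructions/gi, /<\/?system>/gi];

function buildReadResult(path: string, raw: string, maxChars = MAX_READ_CHARS): ToolResultSketch {
  const truncated = raw.length > maxChars;
  let body = truncated ? raw.slice(0, maxChars) : raw;
  let redactedMarkerCount = 0;
  for (const pattern of INJECTION_PATTERNS) {
    // Replace each marker with a visible placeholder and count it.
    body = body.replace(pattern, () => {
      redactedMarkerCount++;
      return "[redacted prompt-injection marker]";
    });
  }
  return {
    modelContent: `<tool_output tool="read_file" path="${path}">${body}</tool_output>`,
    userSummary: `[read_file] ${path} (${raw.length} chars${truncated ? ", truncated" : ""})`,
    data: { path, chars: raw.length, truncated, redactedMarkerCount },
  };
}

const demo = buildReadResult("notes.txt", "TODO <system>be evil</system>");
console.log(demo.userSummary);
console.log(demo.data.redactedMarkerCount); // 2
```

Only modelContent would be sent to the model. The summary and data stay on the operator side.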
Library consumers who need the full raw payload on tool-result
progress events pass emitFullResult: true in AgentConfig:
const agent = createAgent({
  // ...your usual id, name, model, and tools
  emitFullResult: true,
});
for await (const event of agent.stream('review the code')) {
if (event.type === 'tool-result') {
console.log(event.summary); // always present
console.log(event.data); // structured, always present
console.log(event.toolResult); // only when emitFullResult: true
}
}

Custom tools can return a ToolResult directly for full control:
import { tool } from 'ai';
import type { ToolResult } from 'agent-do';
import { z } from 'zod';
const myTool = tool({
description: 'Do a thing',
inputSchema: z.object({ path: z.string() }),
execute: async ({ path }): Promise<ToolResult> => ({
modelContent: 'Short sanitised view for the model',
userSummary: `[my_tool] ${path} — did a thing`,
data: { path, widgetCount: 42 },
}),
});

String returns still work and are normalised automatically.
Piping
Piped stdin is merged with the command-line prompt:
| stdin | prompt | result |
|-------|--------|--------|
| no | "Hello" | Task: "Hello" |
| "context" | no | Task: "context" |
| "context" | "Summarize" | Task: "Summarize\n\n---\n\ncontext" |
| no | no | Interactive mode |
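The table above can be read as a small pure function. This is a sketch of the merge rule only. The function and type names are hypothetical, and treating an empty string the same as "no input" is an assumption.

```typescript
type MergedInput = { mode: "task"; task: string } | { mode: "interactive" };

// Combine piped stdin and the command-line prompt following the table above.
function mergeInput(stdin?: string, prompt?: string): MergedInput {
  const piped = stdin ?? "";
  const typed = prompt ?? "";
  if (piped && typed) return { mode: "task", task: `${typed}\n\n---\n\n${piped}` };
  if (typed) return { mode: "task", task: typed };
  if (piped) return { mode: "task", task: piped };
  return { mode: "interactive" };
}

console.log(mergeInput("context", "Summarize"));
// { mode: 'task', task: 'Summarize\n\n---\n\ncontext' }
```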
Quick Start
import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';
const model = createMockModel({
responses: [
{ text: 'The capital of France is Paris.' },
],
});
const agent = createAgent({
id: 'geography',
name: 'Geography Agent',
model,
});
const result = await agent.run('What is the capital of France?');
console.log(result); // "The capital of France is Paris."

Streaming
stream() returns an AsyncIterable<ProgressEvent> that yields events as the agent works:
import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';
const model = createMockModel({
responses: [
{ toolCalls: [{ toolName: 'lookup', args: { query: 'Paris' } }] },
{ text: 'Paris is the capital of France.' },
],
});
const agent = createAgent({
id: 'geo',
name: 'Geo',
model,
tools: {
// ... your tools here
},
});
for await (const event of agent.stream('Tell me about Paris')) {
switch (event.type) {
case 'thinking':
process.stdout.write(event.content);
break;
case 'tool-call':
console.log(`Calling ${event.toolName}`, event.toolArgs);
break;
case 'tool-result':
console.log(`Result from ${event.toolName}:`, event.toolResult);
break;
case 'text':
console.log('Agent says:', event.content);
break;
case 'step-complete':
console.log(`Step ${event.step! + 1} complete`);
break;
case 'done':
console.log('Final answer:', event.content);
break;
case 'error':
console.error('Error:', event.content);
break;
}
}

ProgressEvent types
| Type | Description |
|------|-------------|
| thinking | Partial text streaming from the model |
| tool-call | The model is calling a tool (toolName, toolArgs) |
| tool-result | A tool returned a result (toolName, toolResult) |
| text | Final text output for a step |
| step-complete | An iteration of the loop finished |
| done | The agent completed its task |
| error | Something went wrong or limits were exceeded |
Multiple Agents
Create multiple agents with different models, tools, and system prompts:
import { createAgent } from 'agent-do';
import { createAnthropic } from '@ai-sdk/anthropic';
const anthropic = createAnthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const assistant = createAgent({
id: 'assistant',
name: 'Assistant',
model: anthropic('claude-sonnet-4-6'),
systemPrompt: 'You are a helpful assistant.',
});
const researcher = createAgent({
id: 'researcher',
name: 'Researcher',
model: anthropic('claude-haiku-4-5'), // cheaper model for research
systemPrompt: 'You are a research assistant. Be thorough.',
});
// Each agent has its own conversation context
const answer = await assistant.run('Hello!');
const research = await researcher.run('Find info about TypeScript');

Tools
Define tools using the Vercel AI SDK's tool() function:
import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';
import { tool } from 'ai';
import { z } from 'zod';
const agent = createAgent({
id: 'math',
name: 'Math Agent',
model: createMockModel({
responses: [
{ toolCalls: [{ toolName: 'add', args: { a: 2, b: 3 } }] },
{ text: 'The sum of 2 and 3 is 5.' },
],
}),
tools: {
add: tool({
description: 'Add two numbers',
inputSchema: z.object({
a: z.number(),
b: z.number(),
}),
execute: async ({ a, b }) => `${a + b}`,
}),
},
});
const result = await agent.run('What is 2 + 3?');

The agent loops automatically: it calls tools, feeds results back to the model, and continues until the model responds with text only (no tool calls) or hits maxIterations (default: 20).
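The loop described above can be sketched without the library. The types and the runLoop function below are local stand-ins, not the agent-do API. Throwing when the iteration cap is reached is an assumption, since the library may report that case differently.

```typescript
// Local stand-in types for this sketch.
type ToolCall = { toolName: string; args: Record<string, unknown> };
type ModelReply = { text?: string; toolCalls?: ToolCall[] };
type Model = (transcript: string[]) => Promise<ModelReply>;
type Tools = Record<string, (args: Record<string, unknown>) => Promise<string>>;

async function runLoop(model: Model, tools: Tools, task: string, maxIterations = 20): Promise<string> {
  const transcript: string[] = [`user: ${task}`];
  for (let step = 0; step < maxIterations; step++) {
    const reply = await model(transcript);
    const calls = reply.toolCalls ?? [];
    // A reply without tool calls ends the loop.
    if (calls.length === 0) return reply.text ?? "";
    for (const call of calls) {
      const tool = tools[call.toolName];
      const result = tool ? await tool(call.args) : `error: unknown tool ${call.toolName}`;
      // Feed the tool result back so the next model call can see it.
      transcript.push(`tool ${call.toolName}: ${result}`);
    }
  }
  // Assumption: this sketch throws at the cap.
  throw new Error(`stopped after ${maxIterations} iterations`);
}

// Scripted model: first requests a tool call, then answers with text only.
const scripted: ModelReply[] = [
  { toolCalls: [{ toolName: "add", args: { a: 2, b: 3 } }] },
  { text: "The sum of 2 and 3 is 5." },
];
const scriptedModel: Model = async () => scripted.shift() ?? { text: "" };
const mathTools: Tools = { add: async (args) => String(Number(args.a) + Number(args.b)) };

runLoop(scriptedModel, mathTools, "What is 2 + 3?").then((answer) => console.log(answer));
```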
History hygiene
Between iterations, the loop replaces older <tool_output>...</tool_output>
blocks in the conversation history with self-closing redacted="stale"
markers. This keeps the model's view of what happened (which tool ran,
on what path) but drops the body, so injected content from a poisoned
file can't keep influencing the model on every subsequent step. It also
keeps token spend bounded as iterations accumulate.
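A rough sketch of the replacement step, assuming each history entry records the iteration it came from. The entry shape, the function name, and the exact form of the stale marker are hypothetical. The keepWindow parameter stands in for the AgentConfig.historyKeepWindow setting.

```typescript
interface HistoryEntry {
  iteration: number;
  content: string;
}

// Matches a whole tool_output block and captures its attributes.
// Simplification: attribute values containing ">" are not handled.
const TOOL_OUTPUT_BLOCK = /<tool_output([^>]*)>[\s\S]*?<\/tool_output>/g;

// Keep the newest `keepWindow` iterations intact; strip bodies from older ones.
function redactStale(history: HistoryEntry[], keepWindow = 1): HistoryEntry[] {
  const latest = Math.max(...history.map((entry) => entry.iteration));
  return history.map((entry) =>
    entry.iteration > latest - keepWindow
      ? entry
      : { ...entry, content: entry.content.replace(TOOL_OUTPUT_BLOCK, '<tool_output$1 redacted="stale" />') },
  );
}

const cleaned = redactStale([
  { iteration: 0, content: '<tool_output tool="read_file" path="a.txt">old body</tool_output>' },
  { iteration: 1, content: '<tool_output tool="read_file" path="b.txt">fresh body</tool_output>' },
]);
console.log(cleaned[0].content);
// <tool_output tool="read_file" path="a.txt" redacted="stale" />
```

The tool name and path survive in the marker, so the model still knows which call happened.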
Tune the window with AgentConfig.historyKeepWindow (default 1 —
only the most recent iteration's tool outputs flow in full to the next
call). Set to Infinity to restore the historical
"everything-stays-in-context" behaviour.
Tool factories
Three consumer-facing factories cover the common cases. See
docs/sandbox.md for how each interacts with a
sandbox.
Workspace tools
createWorkspaceTools(workingDir, opts?) gives the agent file tools
(read_file, write_file, edit_file, list_directory,
delete_file, grep_file, find_files) rooted at workingDir,
with a deny-list (.env, .ssh/**, ...) applied at the tool layer.
Pass { sandbox } to swap the internal store for a
SandboxBackedMemoryStore and route every file op through the
sandbox.
import { createAgent, createWorkspaceTools } from 'agent-do';
const agent = createAgent({
id: 'coder', name: 'Coder', model,
tools: createWorkspaceTools(process.cwd(), { readOnly: true }),
});

Memory tools
createMemoryTools(store, agentId, opts?) gives the agent a private
scratchpad (memory_read, memory_write, memory_list,
memory_delete, memory_search) backed by any MemoryStore.
import {
createAgent, createMemoryTools, InMemoryMemoryStore,
} from 'agent-do';
const store = new InMemoryMemoryStore();
const agent = createAgent({
id: 'writer', name: 'Writer', model,
tools: createMemoryTools(store, 'writer'),
});

Shell tool
createShellTool(sandbox?, opts?) gives the agent a single shell
tool (default name bash) wired to a SandboxApi. Defaults to
createHostSandbox() when no sandbox is supplied; for real
isolation, pass createJustBashSandbox() or your own connector.
import { createAgent, createShellTool, createJustBashSandbox } from 'agent-do';
const sandbox = await createJustBashSandbox();
const agent = createAgent({
id: 'runner', name: 'Runner', model,
tools: createShellTool(sandbox),
});

MemoryStore
The MemoryStore interface abstracts file storage for agents. Three implementations are included:
- InMemoryMemoryStore — for testing and prototyping (data lost on exit)
- FilesystemMemoryStore — persists to the local filesystem (survives restarts)
- SandboxBackedMemoryStore — adapts a SandboxApi connector into a MemoryStore (see docs/sandbox.md)
import { FilesystemMemoryStore, createMemoryTools, createAgent } from 'agent-do';
const store = new FilesystemMemoryStore('./agent-data');
const agent = createAgent({
id: 'my-agent',
name: 'My Agent',
model: model as any,
tools: createMemoryTools(store, 'my-agent'),
});
// Files persist at ./agent-data/my-agent/

Security: FilesystemMemoryStore
Warning: FilesystemMemoryStore gives the agent read/write access to the specified directory. The agent decides what files to create and modify. Use readOnly: true to restrict to read-only access, or onBeforeWrite to approve each write operation.
// Read-only mode — agent can read but not create/modify/delete
const readOnlyStore = new FilesystemMemoryStore('./data', { readOnly: true });

// Write confirmation — approve each operation (sync or async)
const guardedStore = new FilesystemMemoryStore('./data', {
onBeforeWrite: (agentId, canonicalPath, operation) => {
console.log(`Agent ${agentId} wants to ${operation}: ${canonicalPath}`);
// Return true to allow, false to block
// The path is canonicalized — ../traversal is resolved before this callback
return true;
},
});

For other backends, implement the interface:
interface MemoryStore {
read(agentId: string, path: string): Promise<string>;
write(agentId: string, path: string, content: string): Promise<void>;
append(agentId: string, path: string, content: string): Promise<void>;
delete(agentId: string, path: string): Promise<void>;
list(agentId: string, path?: string): Promise<FileEntry[]>;
mkdir(agentId: string, path: string): Promise<void>;
exists(agentId: string, path: string): Promise<boolean>;
search(agentId: string, pattern: string, path?: string): Promise<Array<{ path: string; line: string }>>;
}

Custom implementations
See examples/08-custom-memory-store.ts for complete patterns for:
- Node.js filesystem (fs)
- AWS S3 (@aws-sdk/client-s3)
- Google Firestore (@google-cloud/firestore)
- SQLite (better-sqlite3)
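As a smaller starting point than those examples, here is a sketch of a Map-backed store that satisfies the interface shown earlier. The interface is re-declared locally so the block stands alone. The FileEntry shape, the substring-based search, and the prefix-based list are assumptions, and the class name is hypothetical.

```typescript
// Local stand-ins; in a real project MemoryStore would come from 'agent-do'.
interface FileEntry { path: string; size: number } // assumed shape

interface MemoryStore {
  read(agentId: string, path: string): Promise<string>;
  write(agentId: string, path: string, content: string): Promise<void>;
  append(agentId: string, path: string, content: string): Promise<void>;
  delete(agentId: string, path: string): Promise<void>;
  list(agentId: string, path?: string): Promise<FileEntry[]>;
  mkdir(agentId: string, path: string): Promise<void>;
  exists(agentId: string, path: string): Promise<boolean>;
  search(agentId: string, pattern: string, path?: string): Promise<Array<{ path: string; line: string }>>;
}

class MapMemoryStore implements MemoryStore {
  // One map per agent keeps agents isolated from each other.
  private agents = new Map<string, Map<string, string>>();

  private files(agentId: string): Map<string, string> {
    let files = this.agents.get(agentId);
    if (!files) {
      files = new Map<string, string>();
      this.agents.set(agentId, files);
    }
    return files;
  }

  async read(agentId: string, path: string): Promise<string> {
    const content = this.files(agentId).get(path);
    if (content === undefined) throw new Error(`not found: ${path}`);
    return content;
  }
  async write(agentId: string, path: string, content: string): Promise<void> {
    this.files(agentId).set(path, content);
  }
  async append(agentId: string, path: string, content: string): Promise<void> {
    const files = this.files(agentId);
    files.set(path, (files.get(path) ?? "") + content);
  }
  async delete(agentId: string, path: string): Promise<void> {
    this.files(agentId).delete(path);
  }
  async list(agentId: string, path = ""): Promise<FileEntry[]> {
    const entries: FileEntry[] = [];
    this.files(agentId).forEach((content, filePath) => {
      if (filePath.startsWith(path)) entries.push({ path: filePath, size: content.length });
    });
    return entries;
  }
  async mkdir(): Promise<void> {
    // Flat keyspace: directories exist implicitly, so nothing to do.
  }
  async exists(agentId: string, path: string): Promise<boolean> {
    return this.files(agentId).has(path);
  }
  async search(agentId: string, pattern: string, path = ""): Promise<Array<{ path: string; line: string }>> {
    const hits: Array<{ path: string; line: string }> = [];
    this.files(agentId).forEach((content, filePath) => {
      if (!filePath.startsWith(path)) return;
      for (const line of content.split("\n")) {
        if (line.includes(pattern)) hits.push({ path: filePath, line }); // assumption: substring match
      }
    });
    return hits;
  }
}

const demoStore = new MapMemoryStore();
demoStore
  .write("writer", "notes/plan.md", "draft outline\nTODO: intro")
  .then(() => demoStore.search("writer", "TODO"))
  .then((hits) => console.log(hits));
```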
Conversation History
Pass previous conversation turns to maintain context:
import { createAgent, type ConversationMessage } from 'agent-do';
const history: ConversationMessage[] = [];
// First turn
const r1 = await agent.run('My name is Alice', undefined, history);
history.push({ role: 'user', content: 'My name is Alice' });
history.push({ role: 'assistant', content: r1 });
// Second turn — agent remembers the name
const r2 = await agent.run('What is my name?', undefined, history);
// r2 = "Your name is Alice."

Skills
Skills extend an agent's system prompt with additional instructions. They can be installed, removed, searched, and managed through a SkillStore.
Defining a skill
import type { Skill } from 'agent-do';
const skill: Skill = {
id: 'code-review',
name: 'Code Review',
description: 'Reviews code for quality and best practices',
content: `When reviewing code:
- Check for error handling
- Look for security issues
- Suggest performance improvements`,
};

Using InMemorySkillStore
import { createAgent, InMemorySkillStore } from 'agent-do';
import { createMockModel } from 'agent-do/testing';
const skills = new InMemorySkillStore();
await skills.install({
id: 'code-review',
name: 'Code Review',
description: 'Reviews code for quality',
content: 'When reviewing code, check for errors and suggest improvements.',
});
const agent = createAgent({
id: 'reviewer',
name: 'Reviewer',
model: createMockModel({ responses: [{ text: 'LGTM' }] }),
skills,
});

When a SkillStore is provided, the agent gets:
- Installed skill content injected into the system prompt, wrapped in <skill>…</skill> markers with a preamble instructing the model to treat the body as reference data rather than overriding instructions.
- Auto-generated tools: search_skills, list_skills, remove_skill. The install_skill tool is not exposed by default — see below.
allowSkillInstall (privileged)
The LLM-facing install_skill tool lets the model write skills into
the backing SkillStore. Because installed skills get injected into
every subsequent run's system prompt, a prompt-injected agent with
install access could plant a persistent jailbreak across sessions.
Set allowSkillInstall: true on the agent config to expose
install_skill to the model. Default is false — library callers
install skills themselves (via skills.install(...)) and the agent
only searches / lists / removes them.
const agent = createAgent({
// ...
skills,
allowSkillInstall: true, // opt-in: model can persist new skills
});

Inputs to install_skill are validated by a strict schema (id matches
/^[a-zA-Z0-9_-]+$/, content ≤ 8 KB, name ≤ 64 chars, description ≤
256 chars) regardless of who calls the tool, and any <skill> or
</skill> sequences inside the skill body are neutralised before the
prompt is rendered so the structural isolation can't be broken from
inside.
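The limits above can be expressed as a plain validation function. This sketch does not reproduce the library's schema. The function names are hypothetical, the 8 KB limit is measured in characters for simplicity, and escaping the angle brackets is just one possible way to neutralise the tags.

```typescript
interface SkillInput {
  id: string;
  name: string;
  description: string;
  content: string;
}

const SKILL_ID_PATTERN = /^[a-zA-Z0-9_-]+$/;
const MAX_CONTENT_CHARS = 8 * 1024; // assumption: characters rather than bytes
const MAX_NAME_CHARS = 64;
const MAX_DESCRIPTION_CHARS = 256;

// Return a list of problems; an empty list means the input passes.
function validateSkillInput(input: SkillInput): string[] {
  const errors: string[] = [];
  if (!SKILL_ID_PATTERN.test(input.id)) errors.push("id must match /^[a-zA-Z0-9_-]+$/");
  if (input.name.length > MAX_NAME_CHARS) errors.push("name exceeds 64 chars");
  if (input.description.length > MAX_DESCRIPTION_CHARS) errors.push("description exceeds 256 chars");
  if (input.content.length > MAX_CONTENT_CHARS) errors.push("content exceeds 8 KB");
  return errors;
}

// Hypothetical neutralisation: escape the brackets of any skill tag in the body.
function neutraliseSkillTags(body: string): string {
  return body.replace(/<(\/?)skill>/gi, "&lt;$1skill&gt;");
}

console.log(validateSkillInput({ id: "bad id!", name: "x", description: "y", content: "z" }));
console.log(neutraliseSkillTags("text </skill> more"));
```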
Parsing SKILL.md files
import { parseSkillMd } from 'agent-do';
const skill = parseSkillMd(`---
name: My Skill
description: Does useful things
author: Alice
version: 1.0.0
---
Instructions for the skill go here.
`);
console.log(skill.name); // "My Skill"
console.log(skill.content); // "Instructions for the skill go here."

SkillStore interface
Implement SkillStore for custom backends (database, filesystem, API):
interface SkillStore {
list(): Promise<Skill[]>;
get(skillId: string): Promise<Skill | undefined>;
install(skill: Skill): Promise<void>;
remove(skillId: string): Promise<void>;
search(query: string): Promise<Array<{ id: string; name: string; description: string }>>;
}

SkillSearchResult deliberately has no url field — an external
registry returning a URL would turn skill search into an SSRF /
auto-fetch footgun (see issue #34). If you wire up a network-backed store,
host allowlisting and explicit installation must happen outside
search(); install() should only receive content the caller has
already verified.
Lifecycle Hooks
Hooks let you observe and control the agent loop. All hooks are optional and async.
import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';
const agent = createAgent({
id: 'hooked',
name: 'Hooked Agent',
model: createMockModel({ responses: [{ text: 'Done.' }] }),
hooks: {
// Called before each tool execution. Return a HookDecision to allow/deny/modify.
onPreToolUse: async ({ toolName, args, step }) => {
console.log(`Step ${step}: about to call ${toolName}`);
// Return { decision: 'deny', reason: 'not allowed' } to block
// Return { decision: 'allow', modifiedArgs: { ... } } to modify input
return { decision: 'allow' };
},
// Called after each tool execution.
onPostToolUse: async ({ toolName, args, result, step, durationMs }) => {
console.log(`${toolName} took ${durationMs}ms`);
},
// Called at the start of each loop iteration. Return 'stop' to halt.
onStepStart: async ({ step, totalSteps, tokensSoFar, costSoFar }) => {
if (costSoFar > 1.0) {
return { decision: 'stop', reason: 'Too expensive' };
}
},
// Called after each loop iteration completes.
onStepComplete: async ({ step, hasToolCalls, text }) => {
console.log(`Step ${step} done, has tools: ${hasToolCalls}`);
},
// Called when the entire run finishes.
onComplete: async ({ result, totalSteps, usage, aborted }) => {
console.log(`Finished in ${totalSteps} steps, cost: $${usage.totalCost.toFixed(4)}`);
},
// Called after each step's usage is recorded.
onUsage: async (record) => {
console.log(`Step ${record.step}: ${record.inputTokens}in/${record.outputTokens}out, $${record.estimatedCost.toFixed(4)}`);
},
},
});

HookDecision
Returned from onPreToolUse and onStepStart:
interface HookDecision {
decision: 'allow' | 'deny' | 'ask' | 'stop' | 'continue';
reason?: string;
modifiedArgs?: unknown; // Only for onPreToolUse: replace the tool's input
}

Permissions
Control which tools the agent can call.
import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';
const agent = createAgent({
id: 'safe',
name: 'Safe Agent',
model: createMockModel({ responses: [{ text: 'Done.' }] }),
permissions: {
// Base mode: 'accept-all' | 'deny-all' | 'ask'
mode: 'ask',
// Per-tool overrides: 'always' | 'ask' | 'never'
tools: {
read_file: 'always', // Always allowed, even in deny-all mode
delete_file: 'never', // Always blocked, even in accept-all mode
write_file: 'ask', // Falls through to onPermissionRequest
},
// Called when mode is 'ask' or a tool's level is 'ask'
onPermissionRequest: async ({ toolName, args }) => {
console.log(`Allow ${toolName}?`, args);
return true; // or false to deny
},
},
});

Permission evaluation order
- If mode is accept-all, allow (but still check per-tool never overrides)
- If mode is deny-all, deny (but still check per-tool always overrides)
- Check per-tool override: always -> allow, never -> deny
- If ask or no override: call onPermissionRequest (defaults to allow if no callback)
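One way to read the order above is as a pure decision function that says which branch applies to a given tool. This is an interpretation, not the library's code. The function name and the three-valued outcome are hypothetical, and checking the per-tool override before the base mode is how this sketch implements the "but still check" clauses.

```typescript
type PermissionMode = "accept-all" | "deny-all" | "ask";
type ToolLevel = "always" | "ask" | "never";
type Outcome = "allow" | "deny" | "ask";

// Decide the branch for one tool call; "ask" means onPermissionRequest is consulted.
function resolvePermission(mode: PermissionMode, override?: ToolLevel): Outcome {
  // Per-tool overrides win over the base mode.
  if (override === "always") return "allow";
  if (override === "never") return "deny";
  if (override === "ask") return "ask";
  // No override: fall back to the base mode.
  if (mode === "accept-all") return "allow";
  if (mode === "deny-all") return "deny";
  return "ask";
}

console.log(resolvePermission("deny-all", "always"));  // "allow"
console.log(resolvePermission("accept-all", "never")); // "deny"
console.log(resolvePermission("ask"));                 // "ask"
```

When the outcome is "ask", the caller would await onPermissionRequest and, per the last rule above, allow the call if no callback is configured.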
Usage Tracking
Track token usage and costs across agent runs with built-in pricing for 50+ models.
import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';
const agent = createAgent({
id: 'tracked',
name: 'Tracked',
model: createMockModel({ responses: [{ text: 'Hi' }] }),
usage: {
enabled: true,
limits: {
perRun: 0.50, // $0.50 max per run
perDay: 5.00, // $5.00 max per day
},
// Called when a limit is exceeded. Return true to continue anyway.
onLimitExceeded: async ({ type, spent, limit }) => {
console.warn(`${type} limit exceeded: $${spent.toFixed(2)} / $${limit.toFixed(2)}`);
return false; // stop the run
},
// Optional: override built-in pricing
pricing: {
'my-custom-model': { input: 1.0, output: 3.0 }, // per 1M tokens
},
},
});

UsageTracker class
For standalone usage tracking outside of createAgent:
import { UsageTracker, estimateCost, DEFAULT_PRICING } from 'agent-do';
const tracker = new UsageTracker({
perRunLimit: 1.0,
});
// Record a step
const record = tracker.record(0, 'claude-sonnet-4-6', 1000, 500);
console.log(record.estimatedCost); // cost based on built-in pricing
// Get summary
const summary = tracker.getSummary();
console.log(summary.totalCost, summary.totalInputTokens, summary.totalOutputTokens);
// Check limits
const ok = await tracker.checkLimits(); // false if limit exceeded
// Standalone cost estimation
const cost = estimateCost('gpt-4o', 10000, 5000);

Debugging & Observability
--verbose / --show-content expose what the agent loop sees —
text deltas, tool call summaries, streaming events. One layer lower
is what's actually crossing the wire to the model provider: the
resolved system prompt, the full message list per step, the cache
metrics coming back, and every raw stream part. That's the debug
surface.
CLI log levels
# Default: final answer + errors
npx agent-do "hello"
# Thinking + tool summaries (what --verbose has always been)
npx agent-do --log-level verbose "hello"
# Legacy: --verbose still works and implies --log-level verbose.
# + system prompt, messages per step, cache metrics, request metadata
npx agent-do --log-level debug "hello"
# + every raw stream part (text-delta, tool-call, finish, …)
npx agent-do --log-level trace "hello"

At debug the CLI emits compact, labelled lines to stderr:
[debug:request] step=0 model=claude-sonnet-4-6 tools=[read_file,write_file,grep_file]
[debug:messages] step=0 count=2 bytes=1247
[debug:cache] step=0 read=0 write=1198 no-cache=49 out=112 hit=0%
…second iteration benefits from the cache write on the first:
[debug:cache] step=1 read=1198 write=0 no-cache=14 out=87 hit=98%

That last pair is what you want when you're checking whether
Anthropic prompt caching is actually firing — read/(read+no-cache)
is the hit rate.
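As a worked check of that formula against the sample log lines, here is a small helper. The function name is hypothetical, and returning 0 when no input tokens were counted is an assumption made to avoid dividing by zero.

```typescript
// Hit rate = read / (read + no-cache), as a fraction between 0 and 1.
function cacheHitRate(readTokens: number, noCacheTokens: number): number {
  const total = readTokens + noCacheTokens;
  return total === 0 ? 0 : readTokens / total;
}

// step=0: read=0, no-cache=49
console.log(Math.floor(cacheHitRate(0, 49) * 100)); // 0
// step=1: read=1198, no-cache=14 -> 1198 / 1212 ≈ 0.988
console.log(Math.floor(cacheHitRate(1198, 14) * 100)); // 98
```

The sample output prints hit=98% for a ratio of about 98.8%, which is consistent with truncating rather than rounding. That reading is an inference from the log, not a documented guarantee.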
API: AgentConfig.debug
Library callers opt in explicitly and control fan-out:
import { createAgent, type DebugEvent } from 'agent-do';
const agent = createAgent({
id: 'debug-me',
name: 'Debug Me',
model,
systemPrompt: 'You are helpful.',
debug: {
systemPrompt: true, // log the resolved prompt once per run
messages: true, // log the message list going into each step
request: true, // model id + tool names + providerOptions
cache: true, // per-step cache read/write/no-cache tokens
response: false, // opt in to raw stream parts
// Optional: sink to a separate destination in addition to the
// progress stream. The sink fires *in addition to* the
// `type: 'debug'` progress events, not instead of them.
sink: (event: DebugEvent) => {
myLogger.info(event.channel, event);
},
// Cap body size in system-prompt / messages events. Default 16 KB.
maxBodyBytes: 8 * 1024,
},
});
for await (const event of agent.stream('task')) {
if (event.type === 'debug' && event.debug?.channel === 'cache') {
const hit = event.debug.cacheReadTokens /
(event.debug.cacheReadTokens + event.debug.noCacheTokens);
console.log(`cache hit rate: ${(hit * 100).toFixed(1)}%`);
}
}
Channels
| Channel | Fires when | Payload highlights |
|---|---|---|
| system-prompt | once per run, before the first model call | content, bytes, truncated |
| messages | before each streamText call | full messages[], bytes, truncated |
| request | before each streamText call | model, toolCount, toolNames, providerOptions |
| response-part | each raw stream part | partType always; full part when traceResponseParts: true |
| cache | after each inner model step | cacheReadTokens, cacheWriteTokens, noCacheTokens, outputTokens, providerMetadata |
Notes
- response-part at trace only: the full stream is noisy (text-delta fires per token). At debug level the channel emits just the partType for aggregate counting; bump to trace when you need the contents.
- String model IDs skip middleware: if you passed model: 'gpt-4o' as a string, the AI SDK resolves the provider lazily and the middleware can't attach. The other channels (system-prompt, cache) still fire; they don't depend on middleware. Pass a structured model (createAnthropic()('claude-sonnet-4-6')) for full coverage.
- Sink errors are swallowed. A broken sink can't wedge the run. Sync and async throws are both caught.
- No secret redaction. If you opt into debug and point it at a log file, it's your job to keep that file private. The maxBodyBytes cap exists only to prevent accidental megabyte dumps.
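Because debug output is not redacted, callers who forward events to a shared log may want to mask obvious secrets themselves. The sketch below is one possible approach: redactSecrets is a hypothetical helper, not part of agent-do, and the two patterns are illustrative assumptions rather than a complete list of secret formats.

```typescript
// Hypothetical helper (not part of agent-do): mask secret-looking substrings
// in any JSON-like value before it is written to a log.
const SECRET_PATTERNS: RegExp[] = [
  /sk-[A-Za-z0-9_-]{16,}/g, // assumed API-key shape, for illustration only
  /Bearer\s+[A-Za-z0-9._-]+/g, // bearer tokens copied from headers
];

function redactSecrets(value: unknown): unknown {
  if (typeof value === 'string') {
    // Apply every pattern in turn to the string.
    return SECRET_PATTERNS.reduce(
      (text, pattern) => text.replace(pattern, '[REDACTED]'),
      value,
    );
  }
  if (Array.isArray(value)) {
    return value.map(redactSecrets);
  }
  if (value !== null && typeof value === 'object') {
    // Rebuild plain objects key by key so nested strings are covered too.
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = redactSecrets(inner);
    }
    return out;
  }
  return value; // numbers, booleans, null, undefined pass through unchanged
}

console.log(redactSecrets({ header: 'Bearer abc.def', count: 2 }));
```

A sink could then log redactSecrets(event) instead of the raw event. Pattern matching of this kind only catches secrets with a recognisable shape, so keeping the log destination private remains the main safeguard.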
Testing
createMockModel() returns a mock LanguageModel compatible with the Vercel AI SDK. It uses predetermined responses so you can test agent behavior without API keys.
import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';
import { tool } from 'ai';
import { z } from 'zod';
// Simulate a multi-step agent run: tool call -> final answer
const model = createMockModel({
responses: [
// Step 1: model calls a tool
{ toolCalls: [{ toolName: 'get_weather', args: { city: 'London' } }] },
// Step 2: model responds with text (ends the loop)
{ text: 'The weather in London is rainy.' },
],
modelId: 'test-model',
inputTokensPerCall: 100,
outputTokensPerCall: 50,
});
const agent = createAgent({
id: 'test',
name: 'Test',
model,
tools: {
get_weather: tool({
description: 'Get weather for a city',
inputSchema: z.object({ city: z.string() }),
execute: async ({ city }) => `${city}: rainy, 12C`,
}),
},
});
const result = await agent.run('Weather in London?');
// result === 'The weather in London is rainy.'
MockModelOptions
| Option | Default | Description |
|--------|---------|-------------|
| responses | (required) | Array of MockResponse objects, used in order |
| modelId | 'mock-model' | Model ID for logging |
| provider | 'mock-provider' | Provider name for logging |
| inputTokensPerCall | 10 | Simulated input tokens per call |
| outputTokensPerCall | 20 | Simulated output tokens per call |
Eval Framework
Define eval cases to measure agent quality, compare providers, and catch regressions.
import { defineEval, runEvals } from 'agent-do/eval';
import { createAnthropic } from '@ai-sdk/anthropic';
const anthropic = createAnthropic();
const suite = defineEval({
name: 'my-assistant-eval',
model: anthropic('claude-haiku-4-5'),
systemPrompt: 'You are a helpful assistant.',
cases: [
{
name: 'knows capitals',
input: 'What is the capital of France?',
assert: [
{ type: 'contains', value: 'Paris' },
{ type: 'not-contains', value: 'London' },
],
},
{
name: 'saves notes correctly',
input: 'Save a note that my name is Alice.',
assert: [
{ type: 'tool-called', tool: 'write_file' },
{ type: 'file-contains', path: 'memories/user.md', value: 'Alice' },
{ type: 'max-steps', max: 5 },
{ type: 'max-cost', maxUsd: 0.05 },
],
},
],
});
const result = await runEvals(suite);
// Console output: ✓ PASS knows capitals ($0.0012, 800ms, 1 steps)
// ✓ PASS saves notes correctly ($0.0035, 2100ms, 2 steps)
Assertion types
| Type | Description |
|------|-------------|
| contains | Response text contains a string |
| not-contains | Response does NOT contain a string |
| regex | Response matches a regex pattern |
| json-schema | Response is valid JSON matching a schema |
| tool-called | A specific tool was called during execution |
| tool-not-called | A specific tool was NOT called |
| tool-args | Tool was called with specific arguments (partial match) |
| file-exists | A file was created in the memory store |
| file-contains | A file in the store contains a string |
| max-steps | Agent completed in N or fewer steps |
| max-cost | Agent completed within a cost budget (USD) |
| llm-rubric | Another LLM scores the response against a rubric |
| custom | Custom function receives the full result |
Multi-provider comparison
const result = await runEvals(suite, {
providers: [
{ name: 'anthropic', model: anthropic('claude-sonnet-4-6') },
{ name: 'google', model: google('gemini-2.5-flash') },
{ name: 'openai', model: openai('gpt-4.1-mini') },
],
});
// Prints a comparison table with pass rate, cost, and latency per provider
LLM-as-judge
{
name: 'explains clearly',
input: 'Explain quantum computing to a 10 year old',
assert: [
{
type: 'llm-rubric',
rubric: 'The explanation should be simple, use analogies, avoid jargon.',
score: 'pass-fail', // or '1-5'
},
],
}
Output formats
// Console output (default)
await runEvals(suite);
// JSON (for CI/dashboards)
await runEvals(suite, { output: 'json' });
// CSV (for spreadsheets)
await runEvals(suite, { output: 'csv' });
// Silent (programmatic use)
const result = await runEvals(suite, { output: 'silent' });
API Reference
Main exports (agent-do)
| Export | Type | Description |
|--------|------|-------------|
| createAgent | (config: AgentConfig) => Agent | Create an agent with run(), stream(), and abort() |
| runAgentLoop | (config, task, context?) => Promise<RunResult> | Run the loop directly (lower-level) |
| streamAgentLoop | (config, task, context?) => AsyncGenerator<ProgressEvent> | Stream the loop directly (lower-level) |
| createShellTool | (sandbox?, opts?) => ToolSet | A single shell tool (default name bash) whose execute calls sandbox.exec. Defaults to host when no sandbox is supplied. (docs/sandbox.md) |
| SandboxBackedMemoryStore | class | Adapt a SandboxApi into a MemoryStore |
| createHostSandbox | (opts?) => SandboxApi | Direct passthrough to the host — not a security boundary |
| createJustBashSandbox | (opts?) => Promise<SandboxApi> | Wrap a vercel-labs/just-bash Sandbox |
| wrapJustBashSandbox | (instance) => SandboxApi | Wrap an externally-constructed just-bash instance |
| createSkillTools | (store: SkillStore) => ToolSet | Create skill management tools |
| buildSkillsPrompt | (skills: Skill[]) => string | Build a system prompt section from skills |
| parseSkillMd | (content, id?) => Skill | Parse a SKILL.md with YAML frontmatter |
| InMemorySkillStore | class | In-memory reference implementation of SkillStore |
| InMemoryMemoryStore | class | In-memory store (testing/prototyping) |
| FilesystemMemoryStore | class | Node.js filesystem store (persistent) |
| createOrchestrator | (config) => Orchestrator | Create a multi-agent orchestrator |
| evaluatePermission | (toolName, args, config) => Promise<boolean> | Evaluate a permission check |
| UsageTracker | class | Track usage and costs within a run |
| estimateCost | (model, input, output, pricing?) => number | Estimate cost in USD |
| DEFAULT_PRICING | PricingTable | Built-in pricing for 50+ models |
Test exports (agent-do/testing)
| Export | Type | Description |
|--------|------|-------------|
| createMockModel | (options: MockModelOptions) => LanguageModel | Create a mock model for testing |
Key types
| Type | Description |
|------|-------------|
| AgentConfig | Full agent configuration (model, tools, hooks, permissions, usage) |
| Agent | Agent instance with id, name, run(), stream(), abort() |
| ProgressEvent | Event emitted during streaming |
| RunResult | Result of run() with text, usage, steps, aborted flag |
| AgentHooks | Lifecycle hook callbacks |
| PermissionConfig | Permission mode, per-tool overrides, callback |
| Skill / SkillStore | Skill definition and storage interface |
| RunUsage / UsageRecord | Usage summary and per-step records |
| HookDecision | Return value from hooks to control execution |
| PricingTable | Model pricing lookup (per 1M tokens) |
| MemoryStore | Storage interface for agent file operations |
| SandboxApi | Pluggable sandbox contract (Flue-shaped) — see docs/sandbox.md |
| FileStat / ExecOptions / ExecResult | Shapes returned by SandboxApi methods |
| FileEntry | File/directory entry from list() |
| ConversationMessage | User/assistant message for conversation history |
| Orchestrator / OrchestratorConfig | Multi-agent orchestration types |
| BuildSystemPromptOptions | Options for the prompt builder |
| SectionFn | Function that returns a prompt section string |
| PromptTemplate | Named template with ordered section list |
Eval exports (agent-do/eval)
| Export | Type | Description |
|--------|------|-------------|
| defineEval | (config: EvalSuiteConfig) => EvalSuiteConfig | Define an eval suite (type-safe helper) |
| runEvals | (suite, options?) => Promise<EvalResult> | Run an eval suite and return results |
| evaluateAssertion | (assertion, result, judgeModel?) => Promise<AssertionResult> | Evaluate a single assertion |
| EvalSuiteConfig | type | Eval suite definition (name, model, cases) |
| EvalCase | type | Single eval test case (input, assertions) |
| Assertion | type | Union of all 13 assertion types |
| EvalResult | type | Full eval result with provider breakdowns |
| CaseResult | type | Result of a single eval case |
Prompt exports (agent-do/prompts)
| Export | Type | Description |
|--------|------|-------------|
| buildSystemPrompt | function | Compose a system prompt from templates, sections, and variables |
| interpolate | function | Simple {{variable}} replacement |
| builtinTemplates | object | Preconfigured templates: assistant, coder, researcher, reviewer, writer, planner |
| builtinSections | object | Reusable sections: identity, memoryManagement, fileTools, efficiency, etc. |
| roleSections | object | Role-specific sections: codingApproach, researchApproach, etc. |
Store exports (agent-do/stores)
| Export | Description |
|--------|-------------|
| MemoryStore | Storage interface (type) |
| FileEntry | File entry type |
Store implementations
| Export | Import path | Description |
|--------|-------------|-------------|
| InMemoryMemoryStore | agent-do | In-memory store for testing/prototyping (data lost on exit) |
| FilesystemMemoryStore | agent-do | Node.js filesystem store (persistent, path-traversal safe) |
Examples
The examples/ directory contains runnable examples:
| # | File | Description |
|---|------|-------------|
| 1 | 01-basic-agent.ts | Simplest possible agent |
| 2 | 02-agent-with-tools.ts | Custom tools (weather, calculator) |
| 3 | 03-agent-with-memory.ts | File tools with InMemoryMemoryStore |
| 4 | 04-lifecycle-hooks.ts | Hooks for monitoring and control |
| 5 | 05-multi-provider.ts | Anthropic, Google, OpenAI, Ollama |
| 6 | 06-conversation-history.ts | Multi-turn conversations |
| 7 | 07-multi-agent-orchestration.ts | Master + worker agents |
| 8 | 08-custom-memory-store.ts | Patterns for S3, Firestore, SQLite, filesystem |
| 9 | 09-skills.ts | Skills system |
| 10 | 10-testing.ts | Testing with createMockModel |
| 11 | 11-filesystem-store.ts | Persistent filesystem storage — explore the created files |
| 12 | 12-prompt-builder.ts | Composable system prompts from templates + sections + variables |
| 13 | 13-eval-framework.ts | Eval framework — define cases, assert quality, compare providers |
| 16 | 16-sandbox-bash.ts | Pluggable sandbox + bash tool (host connector) |
| 17 | 17-sandbox-with-memory.ts | Sandbox alongside InMemoryMemoryStore (different substrates) |
| 18 | 18-sandbox-with-filesystem.ts | Sandbox alongside FilesystemMemoryStore (soft policy + sandboxed bash, plus a strong-isolation pattern) |
Run any example: npx tsx examples/01-basic-agent.ts
Releasing
Version management uses Changesets. Releases are manual by design — publishing requires short-lived npm credentials that the maintainer mints ahead of time, so nothing lives in CI as a long-lived token.
Recording a change
When you make a change that should appear in the next release, record a changeset describing it:
npm run changeset
The CLI asks whether the change is a major / minor / patch
bump and writes a markdown file in .changeset/. Commit that file
alongside your code change. The changeset body becomes an entry in
CHANGELOG.md at release time.
Rules of thumb for pre-1.0:
- patch — bug fixes, internal refactors, doc updates
- minor — new features, non-breaking API additions, security fixes that don't change the public API
- major — save for the 1.0 cut; until then, breaking changes can ride in minor
Not every PR needs a changeset. If a change doesn't affect what gets shipped to npm consumers (CI config, internal tests, comment tweaks), skip it.
Cutting a release
Once one or more changesets are sitting in .changeset/ on the branch
you want to cut from (usually main):
# Mint a short-lived npm token, then either:
npm login # interactive, stays in ~/.npmrc for this session
# or:
export NPM_TOKEN=npm_… # paste the token
npm run release
npm run release executes scripts/release.sh, which walks the whole
manual flow in one go:
- Preconditions: dirty tree aborts, zero pending changesets aborts.
- Quality gate: typecheck + test + build.
- Apply changesets: changeset version bumps package.json and prepends entries to CHANGELOG.md. The script prints the diff so you can eyeball it.
- Commit: "chore: release vX.Y.Z" on the current branch.
- Publish: changeset publish pushes to npm (with provenance) and creates the git tag.
- Push: commit + tag to origin/<current-branch>.
Any failure stops the script. Steps are idempotent enough that you can usually re-run after fixing the cause.
After publish, open https://github.com/PaulKinlan/agent-do/releases and (optionally) create a GitHub release from the fresh tag with the CHANGELOG body pasted in.
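The precondition step of the release flow can be pictured as a small pure check. The sketch below is an assumption about the logic only; scripts/release.sh is a shell script and may differ in detail. checkPreconditions is a hypothetical name, and the code assumes the default Changesets layout in which .changeset/ also holds README.md and config.json, neither of which counts as a pending changeset.

```typescript
// Hypothetical sketch of the release preconditions; the real
// scripts/release.sh may implement these checks differently.
//
// gitStatusPorcelain: output of `git status --porcelain` (empty when clean)
// changesetDirEntries: file names found in .changeset/
function checkPreconditions(
  gitStatusPorcelain: string,
  changesetDirEntries: string[],
): string[] {
  const problems: string[] = [];

  // Any output from `git status --porcelain` means uncommitted changes.
  if (gitStatusPorcelain.trim() !== '') {
    problems.push('working tree is dirty');
  }

  // Pending changesets are markdown files other than the directory's README.
  const pending = changesetDirEntries.filter(
    (name) => name.endsWith('.md') && name !== 'README.md',
  );
  if (pending.length === 0) {
    problems.push('no pending changesets');
  }

  return problems; // an empty array means the release may proceed
}

console.log(checkPreconditions('', ['README.md', 'config.json', 'brave-cats-sing.md']));
```

Returning a list of problems rather than throwing on the first one lets a caller report everything that needs fixing before a re-run.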
Commit message format
Not required. Changesets determines the semver impact from the
.changeset/*.md files, not from commit messages, so commits can be
whatever shape fits the change. Conventional-commit prefixes
(feat:, fix:, chore:) are fine but not enforced.
The one discipline to keep is: any PR that changes user-facing
behaviour needs a changeset. If that ever drifts, the cheapest
enforcement is a CI check that fails PRs which touch src/ without
adding a file in .changeset/ — happy to wire that up if it becomes
a problem.
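For reference, the core of such a CI check is small. The sketch below is a hypothetical illustration, not something that exists in the repository: missingChangeset and the ChangedFile shape are assumed names, and the input is expected to come from something like git diff --name-status against the base branch.

```typescript
// Hypothetical sketch of the CI rule described above: fail a PR that
// touches src/ without adding a changeset file.
interface ChangedFile {
  status: 'A' | 'M' | 'D'; // added, modified, deleted
  path: string; // repository-relative path, forward slashes
}

function missingChangeset(files: ChangedFile[]): boolean {
  const touchesSrc = files.some((file) => file.path.startsWith('src/'));

  // Only a newly added markdown file in .changeset/ counts; the
  // directory's own README.md is not a changeset.
  const addsChangeset = files.some(
    (file) =>
      file.status === 'A' &&
      file.path.startsWith('.changeset/') &&
      file.path.endsWith('.md') &&
      file.path !== '.changeset/README.md',
  );

  return touchesSrc && !addsChangeset;
}

const example: ChangedFile[] = [
  { status: 'M', path: 'src/loop.ts' },
  { status: 'A', path: '.changeset/quiet-owls-dance.md' },
];
console.log(missingChangeset(example) ? 'changeset required' : 'ok');
```

A check this strict would also flag internal refactors under src/ that legitimately skip a changeset, so a real workflow would probably need an opt-out such as a PR label.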
License
Apache 2.0
