npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

agent-do

v0.6.0

Published

Provider-agnostic autonomous agent loop for JavaScript. Built on the [Vercel AI SDK](https://sdk.vercel.ai/), it drives any `LanguageModel` through a tool-use loop until the task is complete.

Readme

agent-do

Provider-agnostic autonomous agent loop for JavaScript. Built on the Vercel AI SDK, it drives any LanguageModel through a tool-use loop until the task is complete.

⚠️ No sandbox

agent-do is not sandboxed within its working directory: when file tools are enabled, the agent can read, write, edit, and delete files under that directory.

By default, file tools operate on files reachable from the working directory (--cwd, defaulting to the current directory). Path-traversal guards prevent the agent from escaping that root, but within that scope a misbehaving prompt or unintended tool call can cause permanent data loss.

--read-only blocks writes, deletes, and edits — but the agent can still read, list, and grep every file in the working directory and send its contents to the model provider. If a directory contains secrets you don't want exposed, use --no-tools instead.

The CLI prints a one-line warning to stderr on every run that has file tools enabled, so the blast radius is visible up front. The warning text adapts to the resolved configuration (read-only vs full read/write), and a saved agent's noTools / readOnly settings win over CLI flags so you don't get a misleading "no tools" warning when a saved config silently re-enables them.

Before using agent-do, especially the CLI:

  • Understand what the agent will do before it runs — review the task and system prompt carefully.
  • Run with --read-only to prevent writes, deletes, and edits while still letting the agent reason about files.
  • Run with --no-tools to disable all file access entirely (including reads).
  • Always work in a directory you are comfortable giving the agent full access to.
  • Keep important files backed up or under version control.

There is no undo. Proceed with caution.

Features

  • Provider-agnostic -- works with any Vercel AI SDK LanguageModel (OpenAI, Anthropic, Google, Mistral, Ollama, etc.)
  • Autonomous loop -- calls tools, reads results, and continues until the model responds without tool calls
  • Streaming and non-streaming -- stream() yields ProgressEvents as an AsyncIterable; run() returns the final text
  • Built-in tool factories -- createMemoryTools (private scratchpad), createWorkspaceTools (project files + deny-list, optionally sandboxed), createShellTool (sandbox-mediated bash)
  • Skills system -- install, search, and manage skill definitions that extend the agent's system prompt
  • Lifecycle hooks -- intercept tool calls, track steps, modify arguments, or halt execution
  • Permission system -- accept-all, deny-all, or ask mode with per-tool overrides
  • Usage tracking -- built-in cost estimation for 50+ models with per-run and per-day spending limits
  • Testable -- createMockModel() returns a mock LanguageModel with predetermined responses
  • Eval framework -- defineEval() + runEvals() to measure agent quality with 13 assertion types, LLM-as-judge, and multi-provider comparison

Install

npm install agent-do

Peer dependency: ai (Vercel AI SDK v6+).

The CLI ships with @ai-sdk/anthropic, @ai-sdk/google, and @ai-sdk/openai bundled so npx agent-do works out of the box. These are declared as optional peers for library consumers — if you only use one provider, npm won't complain about the others being missing, but the CLI covers them all.

Using a different provider

The CLI only knows about anthropic, google, openai, and ollama. For any other provider (Mistral, Groq, Cohere, OpenRouter, Bedrock, xAI, etc.), install the SDK and use agent-do as a library:

npm install agent-do @ai-sdk/mistral
import { createAgent } from 'agent-do';
import { createMistral } from '@ai-sdk/mistral';

const agent = createAgent({
  model: createMistral()('mistral-large-latest'),
});
await agent.run('your task');

Any Vercel AI SDK LanguageModel works — see sdk.vercel.ai/providers for the full list.

CLI

Run agents from the command line with zero config:

# One-shot task
npx agent-do "What is TypeScript?"

# Pipe content as context
cat README.md | npx agent-do "Summarize this"

# Pipe + prompt merged
echo "function add(a, b) { return a + b }" | npx agent-do "Review this code"

# Interactive chat
npx agent-do

# Choose provider and model
npx agent-do --provider google --model gemini-2.5-flash "Hello"

# Create a reusable agent
npx agent-do create code-reviewer --provider anthropic --system "Review code for bugs"

# Run a saved agent by name
npx agent-do run code-reviewer "Review this function"

# List saved agents
npx agent-do list

# Run a custom agent script (.js/.mjs/.cjs/.ts files).
# --script is required to import local JavaScript/TypeScript; see
# "Script mode" below for the security reasoning.
npx agent-do run ./my-agent.ts --script "Do something"

# Run eval cases
npx agent-do eval evals/basic.ts

# Compare providers
npx agent-do eval evals/ --compare anthropic,google,openai --output json

CLI options

npx agent-do [options] [prompt]          One-shot or interactive
npx agent-do run <name|file> [task]      Run a saved agent OR a script file
npx agent-do eval <file|dir> [options]   Run evals

Options:
  --provider <name>      anthropic | google | openai | ollama (default: anthropic)
  --model <id>           Model ID (default: provider-specific)
  --system <prompt>      System prompt
  --cwd <dir>            Working directory for workspace tools (default: cwd)
  --memory <dir>         Memory directory for --with-memory (default: .agent-do/)
  --with-memory          Enable memory tools (agent scratchpad)
  --read-only            Block all writes (workspace + memory)
  --exclude <globs>      Extra deny-list patterns (comma-separated, gitignore-style)
  --include-sensitive    Bypass built-in sensitive-file deny list (.env, .ssh, etc.)
  --max-iterations <n>   Max loop iterations (default: 20)
  --no-tools             Disable all file tools
  --verbose              Show per-step thinking + tool summaries (stderr)
  --show-content         With --verbose: also include each tool's full result
  --script               Required for `run <path>` to import local JS/TS files
  -y, --yes              Skip the interactive confirmation for --script
  --provider-tool <name> Enable a provider-native tool (repeatable). See below.
  --provider-options <json>
                         Provider-specific options forwarded to every model
                         call, e.g. '{"google":{"useSearchGrounding":true}}'.
  --json                 JSON output
  --output <fmt>         console | json | csv (eval only)
  --compare <providers>  Compare providers (eval only, comma-separated)
  --concurrency <n>      Parallel eval cases (default: 1)

Provider-native tools and options

Provider packages (@ai-sdk/google, @ai-sdk/anthropic, @ai-sdk/openai) ship server-side tools — Google search-with-grounding, Anthropic web search, OpenAI web search, code execution, URL context, etc. — and provider-specific call options like Google's useSearchGrounding, Anthropic's thinking, or OpenAI's reasoningEffort. Both are exposed on the CLI:

# Google search grounding via the provider tool
npx agent-do "what won the F1 race last weekend?" \
  --provider google \
  --provider-tool googleSearch

# Plus provider options on the model call itself
npx agent-do "summarize https://example.com" \
  --provider google \
  --provider-tool urlContext \
  --provider-options '{"google":{"useSearchGrounding":true}}'

# Anthropic web search (alias `webSearch` resolves to the latest dated tool)
npx agent-do "find recent papers on diffusion models" \
  --provider anthropic --provider-tool webSearch

# OpenAI web search
npx agent-do "latest research on graph neural networks" \
  --provider openai --provider-tool webSearch

--provider-tool is repeatable and can take comma-separated values (--provider-tool googleSearch,urlContext). A small alias table maps short names to the latest dated versions: webSearch → webSearch_20260209, bash → bash_20250124, etc.

The CLI only accepts tools that work with empty config — currently googleSearch, urlContext, codeExecution, webSearch, webFetch, bash, textEditor, computer, memory, webSearchPreview, codeInterpreter, imageGeneration, applyPatch. Tools that need per-tool args (fileSearchvectorStoreIds, mcp → server URL, customTool → name/description, …) must be configured from a script export so you can pass real args. Trying to enable one from the CLI fails fast with a copy-pasteable script-mode snippet.

The same fields are first-class on saved agents and AgentConfig:

npx agent-do create researcher \
  --provider google --model gemini-2.5-flash \
  --system 'Research topics' \
  --provider-tool googleSearch --provider-tool urlContext \
  --provider-options '{"google":{"useSearchGrounding":true}}'

npx agent-do researcher "summarize the latest Mars helicopter mission"

In a script (Format 2), set them on the exported AgentConfig:

import { google } from '@ai-sdk/google';
export default {
  id: 'researcher', name: 'researcher',
  model: google('gemini-2.5-flash'),
  tools: {
    google_search: google.tools.googleSearch({}),
    url_context: google.tools.urlContext({}),
  },
  providerOptions: { google: { useSearchGrounding: true } },
};

Script mode: run <path> --script

agent-do run <arg> resolves saved-agent names by default. To run a local JavaScript or TypeScript file as an agent, pass --script explicitly:

npx agent-do run ./my-agent.ts --script "Do something"

Importing an arbitrary JS/TS file runs its top-level code with your user privileges — the same trust model as running any local script. The --script flag is a deliberate speed bump so that:

  • A missed saved-agent lookup (typo, stale name) fails with a clear error instead of silently import()-ing a stray file that happens to match the name.
  • Social-engineering vectors like "download this helper script, then run agent-do run helper.js" require an explicit opt-in.

When --script is passed:

  • The path must point inside --cwd (symlinks and .. escapes are rejected after canonicalisation).
  • Only .js/.mjs/.cjs/.ts/.mts/.cts extensions are allowed.
  • The file must be a regular file (no directories, no special files).
  • A banner prints the path, size, and SHA-256 prefix, then asks Continue? [y/N]. Pass -y/--yes to skip the prompt (required in non-TTY contexts — CI, piped input — since there's nowhere to type the answer).

Tools: workspace vs memory

agent-do splits file access into two distinct concepts so the agent knows whether it's touching your project or its own notes:

  • Workspace tools (read_file, write_file, list_directory, grep_file, find_files, edit_file, delete_file) are enabled by default and rooted at --cwd (defaults to the current directory). This is what most CLI users want — the agent reads and modifies real project files.
  • Memory tools (memory_read, memory_write, memory_list, memory_delete, memory_search) are opt-in via --with-memory. They give the agent a private, per-agent scratchpad under --memory (default .agent-do/). Use memory when you want the agent to remember notes or plans across runs without scribbling on the project.

Both respect --read-only. To disable all file access, use --no-tools.

Sensitive-file deny list

Workspace tools ship with a gitignore-style deny list that blocks access to credential material by default:

  • Reads blocked: .env*, *.pem, *.key, id_rsa*, id_ed25519*, .ssh/**, .aws/**, .gcloud/**, .kube/**, .git/objects/**, .git/hooks/**.
  • Writes blocked (above plus): .git/**, node_modules/**.

Reads of node_modules/** and .git/HEAD are allowed — the agent can inspect dependencies and branch state but cannot silently rewrite git hooks or clobber installed modules.

Layer your own policy on top:

  • --exclude 'secrets/**,*.cred' — per-invocation patterns.
  • .agent-doignore at the workspace root — project-scoped, gitignore- style file. Merged with the defaults.
  • --include-sensitive — opt out of the built-in defaults when you explicitly want the old fully-open behaviour. .agent-doignore and --exclude still apply.

Blocked operations surface in --verbose logs as [blocked] entries with the matched rule; the model sees only that it was blocked (not the rule name) to avoid letting it probe the policy.

Tool result layering: model vs user vs programmatic

Every built-in tool returns a structured ToolResult with three views:

  • modelContent — the string the LLM sees. File contents are wrapped in <tool_output tool="…" path="…">…</tool_output> markers, capped at 256 KB (maxReadBytes), and common prompt-injection markers (ignore previous instructions, <system> tags) are replaced with a visible [redacted prompt-injection marker].
  • userSummary — a one-liner for operator logs. Includes real paths, byte counts, line counts, match counts, block reasons, errno codes.
  • data — structured fields (path, bytes, lines, truncated, redactedMarkerCount, matchCount, hiddenByDenyList, rule, …) for programmatic consumers.

In --verbose CLI mode you see the userSummary + a compact data line on stderr. Full raw tool output is withheld by default so secrets and large file contents don't leak into CI logs; pass --show-content to include it.

Library consumers who need the full raw payload on tool-result progress events pass emitFullResult: true in AgentConfig:

const agent = createAgent({ /* ... */, emitFullResult: true });
for await (const event of agent.stream('review the code')) {
  if (event.type === 'tool-result') {
    console.log(event.summary);            // always present
    console.log(event.data);                // structured, always present
    console.log(event.toolResult);          // only when emitFullResult: true
  }
}

Custom tools can return a ToolResult directly for full control:

import { tool } from 'ai';
import type { ToolResult } from 'agent-do';

const myTool = tool({
  description: 'Do a thing',
  inputSchema: z.object({ path: z.string() }),
  execute: async ({ path }): Promise<ToolResult> => ({
    modelContent: 'Short sanitised view for the model',
    userSummary: `[my_tool] ${path} — did a thing`,
    data: { path, widgetCount: 42 },
  }),
});

String returns still work and are normalised automatically.

Piping

Piped stdin is merged with the command-line prompt:

| stdin | prompt | result | |-------|--------|--------| | no | "Hello" | Task: "Hello" | | "context" | no | Task: "context" | | "context" | "Summarize" | Task: "Summarize\n\n---\n\ncontext" | | no | no | Interactive mode |

Quick Start

import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';

const model = createMockModel({
  responses: [
    { text: 'The capital of France is Paris.' },
  ],
});

const agent = createAgent({
  id: 'geography',
  name: 'Geography Agent',
  model,
});

const result = await agent.run('What is the capital of France?');
console.log(result); // "The capital of France is Paris."

Streaming

stream() returns an AsyncIterable<ProgressEvent> that yields events as the agent works:

import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';

const model = createMockModel({
  responses: [
    { toolCalls: [{ toolName: 'lookup', args: { query: 'Paris' } }] },
    { text: 'Paris is the capital of France.' },
  ],
});

const agent = createAgent({
  id: 'geo',
  name: 'Geo',
  model,
  tools: {
    // ... your tools here
  },
});

for await (const event of agent.stream('Tell me about Paris')) {
  switch (event.type) {
    case 'thinking':
      process.stdout.write(event.content);
      break;
    case 'tool-call':
      console.log(`Calling ${event.toolName}`, event.toolArgs);
      break;
    case 'tool-result':
      console.log(`Result from ${event.toolName}:`, event.toolResult);
      break;
    case 'text':
      console.log('Agent says:', event.content);
      break;
    case 'step-complete':
      console.log(`Step ${event.step! + 1} complete`);
      break;
    case 'done':
      console.log('Final answer:', event.content);
      break;
    case 'error':
      console.error('Error:', event.content);
      break;
  }
}

ProgressEvent types

| Type | Description | |------|-------------| | thinking | Partial text streaming from the model | | tool-call | The model is calling a tool (toolName, toolArgs) | | tool-result | A tool returned a result (toolName, toolResult) | | text | Final text output for a step | | step-complete | An iteration of the loop finished | | done | The agent completed its task | | error | Something went wrong or limits were exceeded |

Multiple Agents

Create multiple agents with different models, tools, and system prompts:

import { createAgent } from 'agent-do';
import { createAnthropic } from '@ai-sdk/anthropic';

const anthropic = createAnthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const assistant = createAgent({
  id: 'assistant',
  name: 'Assistant',
  model: anthropic('claude-sonnet-4-6'),
  systemPrompt: 'You are a helpful assistant.',
});

const researcher = createAgent({
  id: 'researcher',
  name: 'Researcher',
  model: anthropic('claude-haiku-4-5'), // cheaper model for research
  systemPrompt: 'You are a research assistant. Be thorough.',
});

// Each agent has its own conversation context
const answer = await assistant.run('Hello!');
const research = await researcher.run('Find info about TypeScript');

Tools

Define tools using the Vercel AI SDK's tool() function:

import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';
import { tool } from 'ai';
import { z } from 'zod';

const agent = createAgent({
  id: 'math',
  name: 'Math Agent',
  model: createMockModel({
    responses: [
      { toolCalls: [{ toolName: 'add', args: { a: 2, b: 3 } }] },
      { text: 'The sum of 2 and 3 is 5.' },
    ],
  }),
  tools: {
    add: tool({
      description: 'Add two numbers',
      inputSchema: z.object({
        a: z.number(),
        b: z.number(),
      }),
      execute: async ({ a, b }) => `${a + b}`,
    }),
  },
});

const result = await agent.run('What is 2 + 3?');

The agent loops automatically: it calls tools, feeds results back to the model, and continues until the model responds with text only (no tool calls) or hits maxIterations (default: 20).

History hygiene

Between iterations, the loop replaces older <tool_output>...</tool_output> blocks in the conversation history with self-closing redacted="stale" markers. This keeps the model's view of what happened (which tool ran, on what path) but drops the body, so injected content from a poisoned file can't keep influencing the model on every subsequent step. It also keeps token spend bounded as iterations accumulate.

Tune the window with AgentConfig.historyKeepWindow (default 1 — only the most recent iteration's tool outputs flow in full to the next call). Set to Infinity to restore the historical "everything-stays-in-context" behaviour.

Tool factories

Three consumer-facing factories cover the common cases. See docs/sandbox.md for how each interacts with a sandbox.

Workspace tools

createWorkspaceTools(workingDir, opts?) gives the agent file tools (read_file, write_file, edit_file, list_directory, delete_file, grep_file, find_files) rooted at workingDir, with a deny-list (.env, .ssh/**, ...) applied at the tool layer. Pass { sandbox } to swap the internal store for a SandboxBackedMemoryStore and route every file op through the sandbox.

import { createAgent, createWorkspaceTools } from 'agent-do';

const agent = createAgent({
  id: 'coder', name: 'Coder', model,
  tools: createWorkspaceTools(process.cwd(), { readOnly: true }),
});

Memory tools

createMemoryTools(store, agentId, opts?) gives the agent a private scratchpad (memory_read, memory_write, memory_list, memory_delete, memory_search) backed by any MemoryStore.

import {
  createAgent, createMemoryTools, InMemoryMemoryStore,
} from 'agent-do';

const store = new InMemoryMemoryStore();
const agent = createAgent({
  id: 'writer', name: 'Writer', model,
  tools: createMemoryTools(store, 'writer'),
});

Shell tool

createShellTool(sandbox?, opts?) gives the agent a single shell tool (default name bash) wired to a SandboxApi. Defaults to createHostSandbox() when no sandbox is supplied; for real isolation, pass createJustBashSandbox() or your own connector.

import { createAgent, createShellTool, createJustBashSandbox } from 'agent-do';

const sandbox = await createJustBashSandbox();
const agent = createAgent({
  id: 'runner', name: 'Runner', model,
  tools: createShellTool(sandbox),
});

MemoryStore

The MemoryStore interface abstracts file storage for agents. Three implementations are included:

  • InMemoryMemoryStore — for testing and prototyping (data lost on exit)
  • FilesystemMemoryStore — persists to the local filesystem (survives restarts)
  • SandboxBackedMemoryStore — adapts a SandboxApi connector into a MemoryStore (see docs/sandbox.md)
import { FilesystemMemoryStore, createMemoryTools, createAgent } from 'agent-do';

const store = new FilesystemMemoryStore('./agent-data');
const agent = createAgent({
  id: 'my-agent',
  name: 'My Agent',
  model: model as any,
  tools: createMemoryTools(store, 'my-agent'),
});
// Files persist at ./agent-data/my-agent/

Security: FilesystemMemoryStore

Warning: FilesystemMemoryStore gives the agent read/write access to the specified directory. The agent decides what files to create and modify. Use readOnly: true to restrict to read-only access, or onBeforeWrite to approve each write operation.

// Read-only mode — agent can read but not create/modify/delete
const readOnlyStore = new FilesystemMemoryStore('./data', { readOnly: true });
// Write confirmation — approve each operation (sync or async)
const guardedStore = new FilesystemMemoryStore('./data', {
  onBeforeWrite: (agentId, canonicalPath, operation) => {
    console.log(`Agent ${agentId} wants to ${operation}: ${canonicalPath}`);
    // Return true to allow, false to block
    // The path is canonicalized — ../traversal is resolved before this callback
    return true;
  },
});

For other backends, implement the interface:

interface MemoryStore {
  read(agentId: string, path: string): Promise<string>;
  write(agentId: string, path: string, content: string): Promise<void>;
  append(agentId: string, path: string, content: string): Promise<void>;
  delete(agentId: string, path: string): Promise<void>;
  list(agentId: string, path?: string): Promise<FileEntry[]>;
  mkdir(agentId: string, path: string): Promise<void>;
  exists(agentId: string, path: string): Promise<boolean>;
  search(agentId: string, pattern: string, path?: string): Promise<Array<{ path: string; line: string }>>;
}

Custom implementations

See examples/08-custom-memory-store.ts for complete patterns for:

  • Node.js filesystem (fs)
  • AWS S3 (@aws-sdk/client-s3)
  • Google Firestore (@google-cloud/firestore)
  • SQLite (better-sqlite3)

Conversation History

Pass previous conversation turns to maintain context:

import { createAgent, type ConversationMessage } from 'agent-do';

const history: ConversationMessage[] = [];

// First turn
const r1 = await agent.run('My name is Alice', undefined, history);
history.push({ role: 'user', content: 'My name is Alice' });
history.push({ role: 'assistant', content: r1 });

// Second turn — agent remembers the name
const r2 = await agent.run('What is my name?', undefined, history);
// r2 = "Your name is Alice."

Skills

Skills extend an agent's system prompt with additional instructions. They can be installed, removed, searched, and managed through a SkillStore.

Defining a skill

import type { Skill } from 'agent-do';

const skill: Skill = {
  id: 'code-review',
  name: 'Code Review',
  description: 'Reviews code for quality and best practices',
  content: `When reviewing code:
- Check for error handling
- Look for security issues
- Suggest performance improvements`,
};

Using InMemorySkillStore

import { createAgent, InMemorySkillStore } from 'agent-do';
import { createMockModel } from 'agent-do/testing';

const skills = new InMemorySkillStore();
await skills.install({
  id: 'code-review',
  name: 'Code Review',
  description: 'Reviews code for quality',
  content: 'When reviewing code, check for errors and suggest improvements.',
});

const agent = createAgent({
  id: 'reviewer',
  name: 'Reviewer',
  model: createMockModel({ responses: [{ text: 'LGTM' }] }),
  skills,
});

When a SkillStore is provided, the agent gets:

  • Installed skill content injected into the system prompt, wrapped in <skill>…</skill> markers with a preamble instructing the model to treat the body as reference data rather than overriding instructions.
  • Auto-generated tools: search_skills, list_skills, remove_skill. The install_skill tool is not exposed by default — see below.

allowSkillInstall (privileged)

The LLM-facing install_skill tool lets the model write skills into the backing SkillStore. Because installed skills get injected into every subsequent run's system prompt, a prompt-injected agent with install access could plant a persistent jailbreak across sessions.

Set allowSkillInstall: true on the agent config to expose install_skill to the model. Default is false — library callers install skills themselves (via skills.install(...)) and the agent only searches / lists / removes them.

const agent = createAgent({
  // ...
  skills,
  allowSkillInstall: true, // opt-in: model can persist new skills
});

Inputs to install_skill are validated by a strict schema (id matches /^[a-zA-Z0-9_-]+$/, content ≤ 8 KB, name ≤ 64 chars, description ≤ 256 chars) regardless of who calls the tool, and any <skill> or </skill> sequences inside the skill body are neutralised before the prompt is rendered so the structural isolation can't be broken from inside.

Parsing SKILL.md files

import { parseSkillMd } from 'agent-do';

const skill = parseSkillMd(`---
name: My Skill
description: Does useful things
author: Alice
version: 1.0.0
---

Instructions for the skill go here.
`);

console.log(skill.name);    // "My Skill"
console.log(skill.content); // "Instructions for the skill go here."

SkillStore interface

Implement SkillStore for custom backends (database, filesystem, API):

interface SkillStore {
  list(): Promise<Skill[]>;
  get(skillId: string): Promise<Skill | undefined>;
  install(skill: Skill): Promise<void>;
  remove(skillId: string): Promise<void>;
  search(query: string): Promise<Array<{ id: string; name: string; description: string }>>;
}

SkillSearchResult deliberately has no url field — an external registry returning a URL would turn skill search into an SSRF / auto- fetch footgun (see issue #34). If you wire up a network-backed store, host allowlisting and explicit installation must happen outside search(); install() should only receive content the caller has already verified.

Lifecycle Hooks

Hooks let you observe and control the agent loop. All hooks are optional and async.

import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';

const agent = createAgent({
  id: 'hooked',
  name: 'Hooked Agent',
  model: createMockModel({ responses: [{ text: 'Done.' }] }),
  hooks: {
    // Called before each tool execution. Return a HookDecision to allow/deny/modify.
    onPreToolUse: async ({ toolName, args, step }) => {
      console.log(`Step ${step}: about to call ${toolName}`);
      // Return { decision: 'deny', reason: 'not allowed' } to block
      // Return { decision: 'allow', modifiedArgs: { ... } } to modify input
      return { decision: 'allow' };
    },

    // Called after each tool execution.
    onPostToolUse: async ({ toolName, args, result, step, durationMs }) => {
      console.log(`${toolName} took ${durationMs}ms`);
    },

    // Called at the start of each loop iteration. Return 'stop' to halt.
    onStepStart: async ({ step, totalSteps, tokensSoFar, costSoFar }) => {
      if (costSoFar > 1.0) {
        return { decision: 'stop', reason: 'Too expensive' };
      }
    },

    // Called after each loop iteration completes.
    onStepComplete: async ({ step, hasToolCalls, text }) => {
      console.log(`Step ${step} done, has tools: ${hasToolCalls}`);
    },

    // Called when the entire run finishes.
    onComplete: async ({ result, totalSteps, usage, aborted }) => {
      console.log(`Finished in ${totalSteps} steps, cost: $${usage.totalCost.toFixed(4)}`);
    },

    // Called after each step's usage is recorded.
    onUsage: async (record) => {
      console.log(`Step ${record.step}: ${record.inputTokens}in/${record.outputTokens}out, $${record.estimatedCost.toFixed(4)}`);
    },
  },
});

HookDecision

Returned from onPreToolUse and onStepStart:

interface HookDecision {
  decision: 'allow' | 'deny' | 'ask' | 'stop' | 'continue';
  reason?: string;
  modifiedArgs?: unknown; // Only for onPreToolUse: replace the tool's input
}

Permissions

Control which tools the agent can call.

import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';

const agent = createAgent({
  id: 'safe',
  name: 'Safe Agent',
  model: createMockModel({ responses: [{ text: 'Done.' }] }),
  permissions: {
    // Base mode: 'accept-all' | 'deny-all' | 'ask'
    mode: 'ask',

    // Per-tool overrides: 'always' | 'ask' | 'never'
    tools: {
      read_file: 'always',   // Always allowed, even in deny-all mode
      delete_file: 'never',  // Always blocked, even in accept-all mode
      write_file: 'ask',     // Falls through to onPermissionRequest
    },

    // Called when mode is 'ask' or a tool's level is 'ask'
    onPermissionRequest: async ({ toolName, args }) => {
      console.log(`Allow ${toolName}?`, args);
      return true; // or false to deny
    },
  },
});

Permission evaluation order

  1. If mode is accept-all, allow (but still check per-tool never overrides)
  2. If mode is deny-all, deny (but still check per-tool always overrides)
  3. Check per-tool override: always -> allow, never -> deny
  4. If ask or no override: call onPermissionRequest (defaults to allow if no callback)

Usage Tracking

Track token usage and costs across agent runs with built-in pricing for 50+ models.

import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';

const agent = createAgent({
  id: 'tracked',
  name: 'Tracked',
  model: createMockModel({ responses: [{ text: 'Hi' }] }),
  usage: {
    enabled: true,
    limits: {
      perRun: 0.50,  // $0.50 max per run
      perDay: 5.00,  // $5.00 max per day
    },
    // Called when a limit is exceeded. Return true to continue anyway.
    onLimitExceeded: async ({ type, spent, limit }) => {
      console.warn(`${type} limit exceeded: $${spent.toFixed(2)} / $${limit.toFixed(2)}`);
      return false; // stop the run
    },
    // Optional: override built-in pricing
    pricing: {
      'my-custom-model': { input: 1.0, output: 3.0 }, // per 1M tokens
    },
  },
});

UsageTracker class

For standalone usage tracking outside of createAgent:

import { UsageTracker, estimateCost, DEFAULT_PRICING } from 'agent-do';

const tracker = new UsageTracker({
  perRunLimit: 1.0,
});

// Record a step
const record = tracker.record(0, 'claude-sonnet-4-6', 1000, 500);
console.log(record.estimatedCost); // cost based on built-in pricing

// Get summary
const summary = tracker.getSummary();
console.log(summary.totalCost, summary.totalInputTokens, summary.totalOutputTokens);

// Check limits
const ok = await tracker.checkLimits(); // false if limit exceeded

// Standalone cost estimation
const cost = estimateCost('gpt-4o', 10000, 5000);

Debugging & Observability

--verbose / --show-content expose what the agent loop sees — text deltas, tool call summaries, streaming events. One layer lower is what's actually crossing the wire to the model provider: the resolved system prompt, the full message list per step, the cache metrics coming back, and every raw stream part. That's the debug surface.

CLI log levels

# Default: final answer + errors
npx agent-do "hello"

# Thinking + tool summaries (what --verbose has always been)
npx agent-do --log-level verbose "hello"
# Legacy: --verbose still works and implies --log-level verbose.

# + system prompt, messages per step, cache metrics, request metadata
npx agent-do --log-level debug "hello"

# + every raw stream part (text-delta, tool-call, finish, …)
npx agent-do --log-level trace "hello"

At debug the CLI emits compact, labelled lines to stderr:

[debug:request] step=0 model=claude-sonnet-4-6 tools=[read_file,write_file,grep_file]
[debug:messages] step=0 count=2 bytes=1247
[debug:cache] step=0 read=0 write=1198 no-cache=49 out=112 hit=0%
…second iteration benefits from the cache write on the first:
[debug:cache] step=1 read=1198 write=0 no-cache=14 out=87 hit=98%

That last pair is what you want when you're checking whether Anthropic prompt caching is actually firing — read/(read+no-cache) is the hit rate.

API: AgentConfig.debug

Library callers opt in explicitly and control fan-out:

import { createAgent, type DebugEvent } from 'agent-do';

const agent = createAgent({
  id: 'debug-me',
  name: 'Debug Me',
  model,
  systemPrompt: 'You are helpful.',
  debug: {
    systemPrompt: true,  // log the resolved prompt once per run
    messages: true,      // log the message list going into each step
    request: true,       // model id + tool names + providerOptions
    cache: true,         // per-step cache read/write/no-cache tokens
    response: false,     // opt in to raw stream parts
    // Optional: sink to a separate destination in addition to the
    // progress stream. The sink fires *in addition to* the
    // `type: 'debug'` progress events, not instead of them.
    sink: (event: DebugEvent) => {
      myLogger.info(event.channel, event);
    },
    // Cap body size in system-prompt / messages events. Default 16 KB.
    maxBodyBytes: 8 * 1024,
  },
});

for await (const event of agent.stream('task')) {
  if (event.type === 'debug' && event.debug?.channel === 'cache') {
    const hit = event.debug.cacheReadTokens /
      (event.debug.cacheReadTokens + event.debug.noCacheTokens);
    console.log(`cache hit rate: ${(hit * 100).toFixed(1)}%`);
  }
}

Channels

| Channel | Fires when | Payload highlights | |---|---|---| | system-prompt | once per run, before the first model call | content, bytes, truncated | | messages | before each streamText call | full messages[], bytes, truncated | | request | before each streamText call | model, toolCount, toolNames, providerOptions | | response-part | each raw stream part | partType always; full part when traceResponseParts: true | | cache | after each inner model step | cacheReadTokens, cacheWriteTokens, noCacheTokens, outputTokens, providerMetadata |

Notes

  • response-part at trace only: the full stream is noisy (text-delta fires per token). At debug level the channel emits just the partType for aggregate counting; bump to trace when you need the contents.
  • String model IDs skip middleware: if you passed model: 'gpt-4o' as a string, the AI SDK resolves the provider lazily and the middleware can't attach. The other channels (system-prompt, cache) still fire — they don't depend on middleware. Pass a structured model (createAnthropic()('claude-sonnet-4-6')) for full coverage.
  • Sink errors are swallowed. A broken sink can't wedge the run. Sync and async throws are both caught.
  • No secret redaction. If you opt into debug and point it at a log file, it's your job to keep that file private. The maxBodyBytes cap exists only to prevent accidental megabyte dumps.

Testing

createMockModel() returns a mock LanguageModel compatible with the Vercel AI SDK. It uses predetermined responses so you can test agent behavior without API keys.

import { createAgent } from 'agent-do';
import { createMockModel } from 'agent-do/testing';
import { tool } from 'ai';
import { z } from 'zod';

// Simulate a multi-step agent run: tool call -> final answer
const model = createMockModel({
  responses: [
    // Step 1: model calls a tool
    { toolCalls: [{ toolName: 'get_weather', args: { city: 'London' } }] },
    // Step 2: model responds with text (ends the loop)
    { text: 'The weather in London is rainy.' },
  ],
  modelId: 'test-model',
  inputTokensPerCall: 100,
  outputTokensPerCall: 50,
});

const agent = createAgent({
  id: 'test',
  name: 'Test',
  model,
  tools: {
    get_weather: tool({
      description: 'Get weather for a city',
      inputSchema: z.object({ city: z.string() }),
      execute: async ({ city }) => `${city}: rainy, 12C`,
    }),
  },
});

const result = await agent.run('Weather in London?');
// result === 'The weather in London is rainy.'

MockModelOptions

| Option | Default | Description | |--------|---------|-------------| | responses | (required) | Array of MockResponse objects, used in order | | modelId | 'mock-model' | Model ID for logging | | provider | 'mock-provider' | Provider name for logging | | inputTokensPerCall | 10 | Simulated input tokens per call | | outputTokensPerCall | 20 | Simulated output tokens per call |

Eval Framework

Define eval cases to measure agent quality, compare providers, and catch regressions.

import { defineEval, runEvals } from 'agent-do/eval';
import { createAnthropic } from '@ai-sdk/anthropic';

const anthropic = createAnthropic();

const suite = defineEval({
  name: 'my-assistant-eval',
  model: anthropic('claude-haiku-4-5'),
  systemPrompt: 'You are a helpful assistant.',
  cases: [
    {
      name: 'knows capitals',
      input: 'What is the capital of France?',
      assert: [
        { type: 'contains', value: 'Paris' },
        { type: 'not-contains', value: 'London' },
      ],
    },
    {
      name: 'saves notes correctly',
      input: 'Save a note that my name is Alice.',
      assert: [
        { type: 'tool-called', tool: 'write_file' },
        { type: 'file-contains', path: 'memories/user.md', value: 'Alice' },
        { type: 'max-steps', max: 5 },
        { type: 'max-cost', maxUsd: 0.05 },
      ],
    },
  ],
});

const result = await runEvals(suite);
// Console output:  ✓ PASS  knows capitals  ($0.0012, 800ms, 1 steps)
//                  ✓ PASS  saves notes correctly  ($0.0035, 2100ms, 2 steps)

Assertion types

| Type | Description | |------|-------------| | contains | Response text contains a string | | not-contains | Response does NOT contain a string | | regex | Response matches a regex pattern | | json-schema | Response is valid JSON matching a schema | | tool-called | A specific tool was called during execution | | tool-not-called | A specific tool was NOT called | | tool-args | Tool was called with specific arguments (partial match) | | file-exists | A file was created in the memory store | | file-contains | A file in the store contains a string | | max-steps | Agent completed in N or fewer steps | | max-cost | Agent completed within a cost budget (USD) | | llm-rubric | Another LLM scores the response against a rubric | | custom | Custom function receives the full result |

Multi-provider comparison

const result = await runEvals(suite, {
  providers: [
    { name: 'anthropic', model: anthropic('claude-sonnet-4-6') },
    { name: 'google', model: google('gemini-2.5-flash') },
    { name: 'openai', model: openai('gpt-4.1-mini') },
  ],
});
// Prints a comparison table with pass rate, cost, and latency per provider

LLM-as-judge

{
  name: 'explains clearly',
  input: 'Explain quantum computing to a 10 year old',
  assert: [
    {
      type: 'llm-rubric',
      rubric: 'The explanation should be simple, use analogies, avoid jargon.',
      score: 'pass-fail', // or '1-5'
    },
  ],
}

Output formats

// Console output (default)
await runEvals(suite);

// JSON (for CI/dashboards)
await runEvals(suite, { output: 'json' });

// CSV (for spreadsheets)
await runEvals(suite, { output: 'csv' });

// Silent (programmatic use)
const result = await runEvals(suite, { output: 'silent' });

API Reference

Main exports (agent-do)

| Export | Type | Description | |--------|------|-------------| | createAgent | (config: AgentConfig) => Agent | Create an agent with run(), stream(), and abort() | | runAgentLoop | (config, task, context?) => Promise<RunResult> | Run the loop directly (lower-level) | | streamAgentLoop | (config, task, context?) => AsyncGenerator<ProgressEvent> | Stream the loop directly (lower-level) | | createShellTool | (sandbox?, opts?) => ToolSet | A single shell tool (default name bash) whose execute calls sandbox.exec. Defaults to host when no sandbox is supplied. (docs/sandbox.md) | | SandboxBackedMemoryStore | class | Adapt a SandboxApi into a MemoryStore | | createHostSandbox | (opts?) => SandboxApi | Direct passthrough to the host — not a security boundary | | createJustBashSandbox | (opts?) => Promise<SandboxApi> | Wrap a vercel-labs/just-bash Sandbox | | wrapJustBashSandbox | (instance) => SandboxApi | Wrap an externally-constructed just-bash instance | | createSkillTools | (store: SkillStore) => ToolSet | Create skill management tools | | buildSkillsPrompt | (skills: Skill[]) => string | Build a system prompt section from skills | | parseSkillMd | (content, id?) => Skill | Parse a SKILL.md with YAML frontmatter | | InMemorySkillStore | class | In-memory reference implementation of SkillStore | | InMemoryMemoryStore | class | In-memory store (testing/prototyping) | | FilesystemMemoryStore | class | Node.js filesystem store (persistent) | | createOrchestrator | (config) => Orchestrator | Create a multi-agent orchestrator | | evaluatePermission | (toolName, args, config) => Promise<boolean> | Evaluate a permission check | | UsageTracker | class | Track usage and costs within a run | | estimateCost | (model, input, output, pricing?) => number | Estimate cost in USD | | DEFAULT_PRICING | PricingTable | Built-in pricing for 50+ models |

Test exports (agent-do/testing)

| Export | Type | Description | |--------|------|-------------| | createMockModel | (options: MockModelOptions) => LanguageModel | Create a mock model for testing |

Key types

| Type | Description | |------|-------------| | AgentConfig | Full agent configuration (model, tools, hooks, permissions, usage) | | Agent | Agent instance with id, name, run(), stream(), abort() | | ProgressEvent | Event emitted during streaming | | RunResult | Result of run() with text, usage, steps, aborted flag | | AgentHooks | Lifecycle hook callbacks | | PermissionConfig | Permission mode, per-tool overrides, callback | | Skill / SkillStore | Skill definition and storage interface | | RunUsage / UsageRecord | Usage summary and per-step records | | HookDecision | Return value from hooks to control execution | | PricingTable | Model pricing lookup (per 1M tokens) | | MemoryStore | Storage interface for agent file operations | | SandboxApi | Pluggable sandbox contract (Flue-shaped) — see docs/sandbox.md | | FileStat / ExecOptions / ExecResult | Shapes returned by SandboxApi methods | | FileEntry | File/directory entry from list() | | ConversationMessage | User/assistant message for conversation history | | Orchestrator / OrchestratorConfig | Multi-agent orchestration types | | BuildSystemPromptOptions | Options for the prompt builder | | SectionFn | Function that returns a prompt section string | | PromptTemplate | Named template with ordered section list |

Eval exports (agent-do/eval)

| Export | Type | Description | |--------|------|-------------| | defineEval | (config: EvalSuiteConfig) => EvalSuiteConfig | Define an eval suite (type-safe helper) | | runEvals | (suite, options?) => Promise<EvalResult> | Run an eval suite and return results | | evaluateAssertion | (assertion, result, judgeModel?) => Promise<AssertionResult> | Evaluate a single assertion | | EvalSuiteConfig | type | Eval suite definition (name, model, cases) | | EvalCase | type | Single eval test case (input, assertions) | | Assertion | type | Union of all 13 assertion types | | EvalResult | type | Full eval result with provider breakdowns | | CaseResult | type | Result of a single eval case |

Prompt exports (agent-do/prompts)

| Export | Type | Description | |--------|------|-------------| | buildSystemPrompt | function | Compose a system prompt from templates, sections, and variables | | interpolate | function | Simple {{variable}} replacement | | builtinTemplates | object | Preconfigured templates: assistant, coder, researcher, reviewer, writer, planner | | builtinSections | object | Reusable sections: identity, memoryManagement, fileTools, efficiency, etc. | | roleSections | object | Role-specific sections: codingApproach, researchApproach, etc. |

Store exports (agent-do/stores)

| Export | Description | |--------|-------------| | MemoryStore | Storage interface (type) | | FileEntry | File entry type |

Store implementations

| Export | Import path | Description | |--------|-------------|-------------| | InMemoryMemoryStore | agent-do | In-memory store for testing/prototyping (data lost on exit) | | FilesystemMemoryStore | agent-do | Node.js filesystem store (persistent, path-traversal safe) |

Examples

The examples/ directory contains runnable examples:

| # | File | Description | |---|------|-------------| | 1 | 01-basic-agent.ts | Simplest possible agent | | 2 | 02-agent-with-tools.ts | Custom tools (weather, calculator) | | 3 | 03-agent-with-memory.ts | File tools with InMemoryMemoryStore | | 4 | 04-lifecycle-hooks.ts | Hooks for monitoring and control | | 5 | 05-multi-provider.ts | Anthropic, Google, OpenAI, Ollama | | 6 | 06-conversation-history.ts | Multi-turn conversations | | 7 | 07-multi-agent-orchestration.ts | Master + worker agents | | 8 | 08-custom-memory-store.ts | Patterns for S3, Firestore, SQLite, filesystem | | 9 | 09-skills.ts | Skills system | | 10 | 10-testing.ts | Testing with createMockModel | | 11 | 11-filesystem-store.ts | Persistent filesystem storage — explore the created files | | 12 | 12-prompt-builder.ts | Composable system prompts from templates + sections + variables | | 13 | 13-eval-framework.ts | Eval framework — define cases, assert quality, compare providers | | 16 | 16-sandbox-bash.ts | Pluggable sandbox + bash tool (host connector) | | 17 | 17-sandbox-with-memory.ts | Sandbox alongside InMemoryMemoryStore (different substrates) | | 18 | 18-sandbox-with-filesystem.ts | Sandbox alongside FilesystemMemoryStore (soft policy + sandboxed bash, plus a strong-isolation pattern) |

Run any example: npx tsx examples/01-basic-agent.ts

Releasing

Version management uses Changesets. Releases are manual by design — publishing requires short-lived npm credentials that the maintainer mints ahead of time, so nothing lives in CI as a long-lived token.

Recording a change

When you make a change that should appear in the next release, record a changeset describing it:

npm run changeset

The CLI asks whether the change is a major / minor / patch bump and writes a markdown file in .changeset/. Commit that file alongside your code change. The changeset body becomes an entry in CHANGELOG.md at release time.

Rules of thumb for pre-1.0:

  • patch — bug fixes, internal refactors, doc updates
  • minor — new features, non-breaking API additions, security fixes that don't change the public API
  • major — save for the 1.0 cut; until then, breaking changes can ride in minor

Not every PR needs a changeset. If a change doesn't affect what gets shipped to npm consumers (CI config, internal tests, comment tweaks), skip it.

Cutting a release

Once one or more changesets are sitting in .changeset/ on the branch you want to cut from (usually main):

# Mint a short-lived npm token, then either:
npm login                            # interactive, stays in ~/.npmrc for this session
# or:
export NPM_TOKEN=npm_…               # paste the token

npm run release

npm run release executes scripts/release.sh, which walks the whole manual flow in one go:

  1. Preconditions — dirty tree aborts, zero pending changesets aborts.
  2. Quality gatetypecheck + test + build.
  3. Apply changesetschangeset version bumps package.json and prepends entries to CHANGELOG.md. The script prints the diff so you can eyeball it.
  4. Commitchore: release vX.Y.Z on the current branch.
  5. Publishchangeset publish pushes to npm (with provenance) and creates the git tag.
  6. Push — commit + tag to origin/<current-branch>.

Any failure stops the script. Steps are idempotent enough that you can usually re-run after fixing the cause.

After publish, open https://github.com/PaulKinlan/agent-do/releases and (optionally) create a GitHub release from the fresh tag with the CHANGELOG body pasted in.

Commit message format

Not required. Changesets determines the semver impact from the .changeset/*.md files, not from commit messages, so commits can be whatever shape fits the change. Conventional-commit prefixes (feat:, fix:, chore:) are fine but not enforced.

The one discipline to keep is: any PR that changes user-facing behaviour needs a changeset. If that ever drifts, the cheapest enforcement is a CI check that fails PRs which touch src/ without adding a file in .changeset/ — happy to wire that up if it becomes a problem.

License

Apache 2.0