@dj_abstract/agent-firewall
v0.1.0
Published
agent-firewall
Runtime defensive middleware for AI agent tool calls. Detects, logs, and blocks suspicious patterns at call time — exfiltration trails, dangerous shell, sensitive path writes, the lethal trifecta in motion.
Static analysis vs. runtime defense.
`mcp-audit` inspects MCP server definitions before deployment. `prompt-eval` scores defense posture against an attack corpus. `agent-firewall` watches every actual tool call in production and blocks the ones that match policy. Three complementary layers; this one is the goalkeeper.
What it catches
| Rule | Severity | Default verdict | What it catches |
|------|----------|-----------------|------------------|
| secret-egress | critical | block | Tool args containing API keys, GitHub PATs, AWS keys, JWTs, PEM private keys |
| exfil-trail | high | block | Cross-tool sequences in a session: sensitive read followed by an outbound send (the classic prompt-injection exfil) |
| lethal-trifecta-runtime | critical | block | A session that has now exercised every capability in a forbidden combo (shell-exec + network-egress, secret-read + network-egress, etc.) |
| dangerous-shell | critical/high | block | rm -rf /, curl \| sh, fork bombs, mkfs, base64-decode-and-exec, netcat reverse shells |
| path-allowlist | critical/high | block | File operations targeting paths outside allowed roots or in explicit blocklist (e.g. /.ssh, /etc) |
| url-allowlist | high/critical | block | Network calls to non-allowlisted hosts or non-allowed schemes |
| recipient-allowlist | critical | block | Email/Telegram tool calls to recipients not on the allowlist |
| rate-limits | high | block | Same tool fired N times in a window — runaway loops or repetitive abuse |
| destructive-no-consent | high | block | Destructive tools (delete_*, drop_*, kill_*) called without an explicit confirm: true arg |
Every rule is a pure function. Add your own in src/rules/.
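To make the secret-egress row concrete, here is a minimal sketch of how a pattern-based rule of this kind might scan tool arguments. The regexes and the `collectStrings` and `findSecrets` helpers are illustrative assumptions, not the package's actual detectors.

```javascript
// Hypothetical patterns for illustration; the package's real detectors may differ.
const SECRET_PATTERNS = [
  { name: 'AWS access key ID', re: /\bAKIA[0-9A-Z]{16}\b/ },
  { name: 'GitHub personal access token', re: /\bghp_[A-Za-z0-9]{36}\b/ },
  { name: 'PEM private key', re: /-----BEGIN [A-Z ]*PRIVATE KEY-----/ },
  { name: 'JWT', re: /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/ },
];

// Recursively collect every string value found in a tool-call args object.
function collectStrings(value, out = []) {
  if (typeof value === 'string') out.push(value);
  else if (Array.isArray(value)) value.forEach(v => collectStrings(v, out));
  else if (value && typeof value === 'object') {
    Object.values(value).forEach(v => collectStrings(v, out));
  }
  return out;
}

// Pure function: the same args always produce the same findings, with no side effects.
function findSecrets(args) {
  const findings = [];
  for (const text of collectStrings(args)) {
    for (const { name, re } of SECRET_PATTERNS) {
      if (re.test(text)) findings.push({ title: `Possible ${name} in tool args` });
    }
  }
  return findings;
}
```

Because the function only reads its input, it can be unit-tested with plain objects and no firewall instance.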
Install
```sh
npm install @dj_abstract/agent-firewall
```

Requires Node.js 20+.
Quickstart — generic
```javascript
import { createFirewall } from '@dj_abstract/agent-firewall';

const firewall = createFirewall({
  mode: 'enforce', // 'observe' to only log
  policy: {
    url: { allowedHosts: ['api.github.com'], allowedSchemes: ['https'] },
    fs: { allowedRoots: ['/workspace', '/tmp'], blockedPaths: ['/.ssh', '/etc'] },
    recipients: { allowedEmails: ['[email protected]'] },
    rateLimits: { perTool: { count: 12, windowMs: 60_000 } },
  },
  onDecision: ({ ctx, decision }) => {
    if (decision.verdict !== 'allow') {
      console.warn('[firewall]', decision.verdict, ctx.toolName, decision.findings.map(f => f.title));
    }
  },
});

// Before executing a tool call:
const decision = await firewall.evaluate({
  toolName: 'send_email',
  args: { to: '[email protected]', body: 'sk-ant-api03-xxxx' },
  sessionId: 'session-42',
  caps: ['email_send', 'network_out'],
});

if (decision.verdict === 'block') {
  // refuse to run the tool, return the firewall message to the model
}
```

Anthropic Agent SDK adapter
If you use @anthropic-ai/claude-agent-sdk, wrap your MCP tools in one line:
```javascript
import { z } from 'zod'; // needed for the z.string() schemas below
import { tool, createSdkMcpServer } from '@anthropic-ai/claude-agent-sdk';
import { createFirewall } from '@dj_abstract/agent-firewall';
import { wrapMcpTool } from '@dj_abstract/agent-firewall/adapters/agent-sdk';

const firewall = createFirewall({ mode: 'enforce', policy: { /* ... */ } });

const sendEmail = tool('send_email', '...', { to: z.string(), body: z.string() }, async args => { /* ... */ });
const guardedSendEmail = wrapMcpTool(sendEmail, firewall);

export const myServer = createSdkMcpServer({
  name: 'my-tools',
  version: '1.0.0',
  tools: [guardedSendEmail, /* ... */],
});
```

The adapter automatically infers capability tags (secret_read, network_out, shell_exec, etc.) from each tool's name using the same token-based classifier that powers mcp-audit. Override per-tool with wrapMcpTool(t, firewall, { caps: ['secret_read'] }).
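The inference step can be pictured roughly as follows. This is a sketch under stated assumptions: the token table, `tokenize`, and `inferCaps` are hypothetical names, and the vocabulary the real classifier uses is not documented here.

```javascript
// Hypothetical token-to-capability table; the real classifier's vocabulary may differ.
const TOKEN_CAPS = new Map([
  ['email', ['email_send', 'network_out']],
  ['fetch', ['network_out']],
  ['http', ['network_out']],
  ['exec', ['shell_exec']],
  ['shell', ['shell_exec']],
  ['secret', ['secret_read']],
  ['env', ['secret_read']],
]);

// Split names such as "send_email", "run-shell", or "fetchUrl" into lowercase tokens.
function tokenize(toolName) {
  return toolName
    .replace(/([a-z0-9])([A-Z])/g, '$1_$2') // break camelCase boundaries
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean);
}

// Explicit caps win over inference, mirroring the per-tool override described above.
function inferCaps(toolName, override) {
  if (override) return [...override];
  const caps = new Set();
  for (const token of tokenize(toolName)) {
    for (const cap of TOKEN_CAPS.get(token) ?? []) caps.add(cap);
  }
  return [...caps];
}
```

A name-based classifier can only guess, which is why an explicit override matters for tools whose names do not reveal what they do.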
Modes
- `observe`: never blocks; the call always proceeds. Findings are still emitted to `onDecision`. Use this to gather a baseline before turning enforcement on.
- `enforce`: calls flagged with `verdict: 'block'` are stopped before reaching the tool handler. The wrapped tool returns a structured error to the model so it sees the refusal as a normal tool result.
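The difference between the two modes can be sketched as a guard around a tool handler. Everything here is an assumption for illustration: `guardHandler` and `stubFirewall` are hypothetical, the error payload follows the MCP tool-result shape rather than anything this README specifies, and the mode is passed as a parameter only to keep the example self-contained (in the package it is set on createFirewall).

```javascript
// `firewall` is anything with an async evaluate(ctx) resolving to { verdict, findings },
// as in the quickstart.
function guardHandler(firewall, mode, toolName, handler) {
  return async function guarded(args, sessionId) {
    const decision = await firewall.evaluate({ toolName, args, sessionId });
    if (mode === 'enforce' && decision.verdict === 'block') {
      // Assumed result shape (MCP-style); the adapter's real payload may differ.
      const reasons = decision.findings.map(f => f.title).join('; ');
      return { isError: true, content: [{ type: 'text', text: `Blocked by firewall: ${reasons}` }] };
    }
    return handler(args); // observe mode, or a verdict that does not block
  };
}

// Stub firewall for demonstration only: blocks any call whose body contains "sk-".
const stubFirewall = {
  async evaluate({ args }) {
    const hit = typeof args.body === 'string' && args.body.includes('sk-');
    return hit
      ? { verdict: 'block', findings: [{ title: 'secret-like token in body' }] }
      : { verdict: 'allow', findings: [] };
  },
};
```

Returning the refusal as an ordinary tool result, rather than throwing, lets the model read the reason and adjust its next step.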
Sessions
Pass sessionId on every call. Cross-tool rules (exfil-trail, lethal-trifecta-runtime, rate-limits) need it to correlate. If omitted, all calls land in a shared 'default' session — fine for tests, dangerous in production multi-tenant systems.
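A minimal sketch of why the session key matters, assuming per-session state is a set of capabilities exercised so far. The store, `recordCall`, and the combo list are hypothetical; the two combos are the ones named in the lethal-trifecta-runtime row, written with the capability tags from the quickstart.

```javascript
// Forbidden capability combinations (illustrative subset).
const FORBIDDEN_COMBOS = [
  ['shell_exec', 'network_out'],
  ['secret_read', 'network_out'],
];

const sessions = new Map(); // sessionId -> Set of capabilities exercised so far

// Record a call's capabilities and return every combo the session has now completed.
function recordCall(sessionId = 'default', caps = []) {
  if (!sessions.has(sessionId)) sessions.set(sessionId, new Set());
  const seen = sessions.get(sessionId);
  caps.forEach(c => seen.add(c));
  return FORBIDDEN_COMBOS.filter(combo => combo.every(c => seen.has(c)));
}

// Two tenants with their own IDs do not affect each other:
recordCall('tenant-a', ['secret_read']);
recordCall('tenant-b', ['network_out']); // returns [], tenant-b never read a secret

// Without IDs, both calls land in 'default' and look like one exfiltration trail:
recordCall(undefined, ['secret_read']);
recordCall(undefined, ['network_out']); // returns the secret_read + network_out combo
```

In a multi-tenant deployment the shared session produces both false positives like this one and cross-tenant interference in rate limits.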
Custom rules
Each rule is an object of the form { id, check(ctx, session, policy) }, where check returns Finding[] | Finding | null (zero or more findings). Add to the registry in src/rules/index.js, or compose your own ruleset and pass via createFirewall({ rules: [...] }).
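As an illustration of that shape, here is a hypothetical rule that flags oversized tool arguments. Only `id`, `check(ctx, session, policy)`, and the finding's `title` come from this README; the `severity` field, the `policy.maxArgBytes` key, and the rule itself are assumptions.

```javascript
// Hypothetical custom rule: flag calls whose serialized args exceed a size limit.
const maxArgSizeRule = {
  id: 'max-arg-size',
  check(ctx, session, policy) {
    const limit = policy.maxArgBytes; // assumed policy key, not part of the documented policy
    if (typeof limit !== 'number') return null; // rule not configured: no finding
    const size = Buffer.byteLength(JSON.stringify(ctx.args ?? {}), 'utf8');
    if (size <= limit) return null;
    return {
      title: `${ctx.toolName} args are ${size} bytes (limit ${limit})`,
      severity: 'high', // assumed field
    };
  },
};

// The rule is a plain object, so it can be exercised directly:
const finding = maxArgSizeRule.check(
  { toolName: 'send_email', args: { body: 'a'.repeat(100) } },
  {},
  { maxArgBytes: 50 },
);
```

A rule like this could then be passed via createFirewall({ rules: [maxArgSizeRule] }); whether that list replaces or extends the built-in rules is not stated here, so check before relying on it.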
Threat model
This is defense-in-depth, not a perimeter. It cannot:
- Prevent prompt injection from reaching the model (use system-prompt hardening + `prompt-eval` for that)
- Catch attacks that the LLM launders through allowed channels (e.g., embedding secrets in content the model paraphrases, though `secret-egress` catches the literal patterns)
- Sandbox the actual tool implementations (use OS/container isolation for that)
It does catch the most common runtime mistakes that prompt-injected agents make: exfil to attacker URLs, dumping secrets into outbound channels, runaway loops, destructive calls without consent. Combined with mcp-audit (catches the design-time risks) and prompt-eval (catches the model-side weaknesses), it closes a real loop.
References
- OWASP Top 10 for LLM Applications
- The Lethal Trifecta — Simon Willison
- Tool Poisoning Attacks — Invariant Labs
- MITRE ATLAS
License
MIT — see LICENSE.
