canary-scan
v0.2.8
Canary
Prompt injection detection using behavioral analysis. Weak LLMs as sensitive sensors.
How it works
Canary sends content to a small, cheap LLM with one instruction: echo it back exactly. Then it checks what happened.
Two independent detection channels:
- Text deviation — Did the output differ from the input? Deterministic string comparison after normalization. No fuzzy matching, no thresholds.
- Tool call attempt — Did the model try to call any honeypot tools? Five attractive tool definitions (execute_command, read_file, send_request, update_memory, send_message) are offered but never referenced in the prompt. Any tool call means the input content influenced the model.
If either channel fires: FLAGGED. If neither: CLEAR.
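The combination of the two channels into a verdict can be sketched in TypeScript. This is an illustrative sketch, not the package's actual source: the normalization scheme (collapsing whitespace) and all names here are assumptions.

```typescript
type ToolCall = { name: string };

// Assumed normalization: collapse whitespace so harmless formatting
// drift doesn't count as deviation. The real scheme may differ.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Channel 1: did the echoed text deviate from the input?
// Channel 2: did the model attempt any honeypot tool call?
// Either firing means "flagged"; neither means "clear".
function verdict(
  input: string,
  echoed: string,
  toolCalls: ToolCall[]
): "clear" | "flagged" {
  const deviation = normalize(echoed) !== normalize(input);
  const toolCallAttempted = toolCalls.length > 0;
  return deviation || toolCallAttempted ? "flagged" : "clear";
}
```

Note that the text channel is a deterministic string comparison: there is no similarity threshold to tune, only an exact match after normalization.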
Why weak models?
A small, instruction-tuned model is more susceptible to prompt injection than a frontier model. That's the feature. A model that gets tricked easily makes a more sensitive detector. The canary doesn't need to be smart — it needs to be gullible.
What CLEAR and FLAGGED mean
- CLEAR = "No deviation detected under test conditions." This is not a safety guarantee. Sophisticated injections can evade detection.
- FLAGGED = "Behavioral deviation detected." The content caused the canary to deviate from its echo instruction. Human review recommended.
Canary makes bounded claims, not absolute ones.
Install
npm install canary-scan

Or run directly:

npx canary-scan scan https://example.com

Setup
You need an API key from OpenRouter (free tier works).
export CANARY_API_KEY=your-openrouter-key

Optional:

export CANARY_MODEL=stepfun/step-3.5-flash:free # default
export CANARY_BASE_URL=https://openrouter.ai/api/v1 # default

CLI Usage
# Scan a URL
canary scan https://example.com
# Scan raw text
canary scan --text "some content to check"
# Calibrate — measure echo fidelity and tool call rate for your model
canary calibrate
# Trust management
canary trust list
canary trust add https://known-safe.com
canary flag https://suspicious.com

Trust List (Persistent)
Canary remembers which sources you've trusted or flagged, saved to ~/.canary/trust.json. Decisions carry across sessions — trust a site once, it stays trusted.
{
"trusted": ["https://known-safe.com"],
"flagged": ["https://suspicious.com"]
}

- Trusted sources skip scanning and return CLEAR immediately
- Flagged sources skip scanning and return FLAGGED immediately
- Edit the file directly to add, remove, or reset entries
- Delete the file to start fresh
- Custom path: pass trustFile in CanaryConfig
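The short-circuit behavior of the trust list is straightforward. An illustrative sketch (the JSON shape matches trust.json above; the function name and logic are hypothetical, not the package's implementation):

```typescript
// Shape of ~/.canary/trust.json
type TrustList = { trusted: string[]; flagged: string[] };

// Returns an immediate verdict for known sources, or null
// to signal that a full canary scan is needed.
function preScanVerdict(
  url: string,
  trust: TrustList
): "clear" | "flagged" | null {
  if (trust.trusted.includes(url)) return "clear";   // skip scan
  if (trust.flagged.includes(url)) return "flagged"; // skip scan
  return null; // unknown source: scan it
}
```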
Example output
Status: FLAGGED
Model: stepfun/step-3.5-flash:free
Time: 2340ms
Preview: Ignore all previous instructions...
Deviation: YES
Tool call: YES — execute_command
Detail: 2 indicator(s): Text deviation at position 0: "...I'll help you with that!..."; Tool call attempted: execute_command
Chunks: 1 scanned, 1 flagged
Coverage: 100% unique, 100% raw
This content caused behavioral deviation in the canary model.
Human review recommended before processing.

Library Usage
import { CanaryScanner } from "canary-scan";
const scanner = new CanaryScanner({
apiKey: process.env.CANARY_API_KEY!,
model: "stepfun/step-3.5-flash:free", // optional
chunkSize: 1500, // optional
overlapRatio: 0.25, // optional
calibrationArtifacts: [], // optional, from calibration
});
// Scan text
const result = await scanner.scan("some untrusted content");
console.log(result.status); // "clear" or "flagged"
// Scan a URL
const urlResult = await scanner.scanUrl("https://example.com");
// Calibrate — run once per model to find artifacts
const calibration = await scanner.calibrate();
console.log(calibration.echoFidelity); // raw fidelity
console.log(calibration.adjustedEchoFidelity); // fidelity after artifact filtering
console.log(calibration.artifacts); // pass these to calibrationArtifacts

ScanResult
{
status: "clear" | "flagged",
reason: string | null,
deviationDetected: boolean,
toolCallAttempted: boolean,
toolsInvoked: string[],
contentPreview: string,
model: string,
scanTimeMs: number,
metadata: {
confidence: "bounded",
chunksScanned: number,
chunksFlagged: number,
rawCoverage: number,
uniqueCoverage: number,
overlapRatio: number,
}
}

MCP Server (For AI Agents)
If you run an AI agent (Claude Code, Cursor, or any MCP-compatible tool), Canary can plug in as a tool the agent calls automatically. The agent gets scanning tools and uses them before reading untrusted content — no manual steps from you.
How it works
- You add Canary to your agent's MCP config (one-time setup)
- When the agent starts, it sees canary_scan_url and canary_scan_text as available tools
- Before reading an untrusted URL or processing unknown text, the agent calls the canary tool
- If the result is CLEAR, the agent proceeds. If FLAGGED, it warns you or skips the content
- Trust decisions are saved to ~/.canary/trust.json automatically
You don't need to run Canary separately. The agent starts it in the background as part of its tool setup.
Setup
Add this to your agent's MCP config (e.g., .claude/settings.json for Claude Code, claude_desktop_config.json for Claude Desktop):
{
"mcpServers": {
"canary": {
"command": "npx",
"args": ["canary-scan", "mcp"],
"env": { "CANARY_API_KEY": "your-openrouter-key" }
}
}
}

Replace your-openrouter-key with your free API key from OpenRouter.
Tools the agent gets
- canary_scan_url — Scan a URL before reading it. Returns CLEAR or FLAGGED.
- canary_scan_text — Scan raw text content. Returns CLEAR or FLAGGED.
- canary_trust — Manually mark sources as trusted or flagged. Persists to disk.
Choosing a Canary Model
The canary model is the tripwire — it needs to be gullible enough to get hijacked by injection, but reliable enough to echo clean text back faithfully. The wrong model gives you either false positives (too dumb) or missed detections (too smart).
Recommended (tested March 2026)
| Model | Echo Fidelity | Tool Call Rate | Verdict |
|-------|---------------|----------------|---------|
| stepfun/step-3.5-flash:free | 95% | 0% | Default. Best free option. Only fails on unicode edge cases. |
| arcee-ai/trinity-mini:free | 55% | 5% | Too noisy — almost half of clean inputs trigger false positives. |
| liquid/lfm-2.5-1.2b-instruct:free | 30% | 0% | Too dumb — hallucinates on clean input, strips formatting. |
What to look for
- Echo fidelity above 85% — The model echoes clean text back without adding commentary or reformatting.
- Tool call rate at 0% — The model doesn't call honeypot tools on clean input.
- Small size (1B–20B) — Large models (70B+) resist injection too well, making them poor detectors.
Models to avoid as canaries
- Frontier models (GPT-4, Claude, Llama 70B+) — Too smart. They resist injection, which defeats the purpose.
- Base/unaligned models — Too unpredictable. They hallucinate on clean input, creating constant false positives.
- Models without tool calling support — Still work for text deviation detection, but miss the honeypot channel entirely.
Run canary calibrate with any model to check. If fidelity is below 85% or tool call rate is above 5%, pick a different model.
CANARY_MODEL=your/model:free canary calibrate

Calibration
Different models have different echo fidelity. Some add prefixes ("Sure! Here's the text:"), strip labels, or reformat whitespace. Calibration measures this baseline noise so you can distinguish it from injection-caused deviation.
canary calibrate

This runs 20 clean text samples through the model and reports:
- Raw echo fidelity — percentage of perfect echoes before artifact filtering
- Adjusted echo fidelity — percentage after filtering discovered artifacts
- Tool call rate — how often the model calls tools on clean input (should be 0%)
- Artifacts — specific strings the model consistently adds/removes
Pass discovered artifacts to calibrationArtifacts in your config to reduce false positives.
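Artifact filtering can be pictured as stripping the model's habitual additions from the echo before the deviation comparison runs. A hedged sketch, assuming artifacts are plain strings; the package's actual filtering may work differently:

```typescript
// Remove every occurrence of each known artifact string from the
// model's output, then trim, before comparing against the input.
function stripArtifacts(output: string, artifacts: string[]): string {
  let cleaned = output;
  for (const artifact of artifacts) {
    cleaned = cleaned.split(artifact).join("");
  }
  return cleaned.trim();
}
```

With a discovered prefix artifact like "Sure! Here's the text:", a clean echo that would otherwise register as a deviation passes the comparison after filtering.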
How it handles long content
Content is split into overlapping chunks (default: 1500 chars, 25% overlap). Each chunk is scanned independently — the canary model has no context between chunks. If any chunk is flagged, the whole scan is flagged.
Overlap ensures injections at chunk boundaries are still caught.
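A sketch of overlap chunking with the default parameters (1500 chars, 25% overlap). The exact splitting in the package may differ; this shows the idea that each chunk's start advances by less than the chunk size, so boundary-spanning injections appear whole in at least one chunk:

```typescript
// Split text into chunks of `chunkSize` chars, each new chunk
// starting `chunkSize * (1 - overlapRatio)` chars after the last.
function chunk(
  text: string,
  chunkSize = 1500,
  overlapRatio = 0.25
): string[] {
  const step = Math.max(1, Math.floor(chunkSize * (1 - overlapRatio)));
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // reached the end
  }
  return chunks;
}
```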
Limitations
- Not a guarantee. Sophisticated injections can produce output that matches the input while still containing executable payloads.
- Model-dependent. Detection sensitivity varies by model. Calibrate before production use.
- Rate limits. Free OpenRouter models have rate limits (~8 RPM). Scanning large content takes time.
- No HTML stripping. The canary sees raw content, including HTML tags. This is intentional — stripping could remove injections.
- One-way detection. Canary detects behavioral influence, not the type of injection. A FLAGGED result doesn't tell you what the injection tries to do.
Tests
npm test

50 tests covering normalization, both detection channels, chunking, caching, metadata, known injection payloads, and trust management. All tests run offline with mocked API calls.
Telemetry (Opt-in)
On first scan, Canary asks if you'd like to share anonymous usage stats. Default is no. You're never asked again unless you want to change it.
What we collect (if you opt in):
- Canary version
- Scan counts and detection rates (CLEAR vs FLAGGED)
- Which detection channel fired (text deviation vs tool call)
- Model used
What we never collect:
- No scanned content, no URLs, no IPs, no user identity
- No fingerprinting, no tracking — nothing personal, ever
Data is batched locally and sent in aggregate. You can change your choice anytime:
canary telemetry on # enable
canary telemetry off # disable
canary telemetry status # check current setting

Consent is stored in ~/.canary/config.json. Delete it to reset.
We use aggregated telemetry to publish threat intelligence reports on prompt injection patterns in the wild — data that helps the whole community.
More from Elifterminal
Canary is part of a growing suite of tools for AI agent operators — people who build, run, and manage AI agents in production.
- PR Triage — AI-powered pull request review and prioritization. Automates code review triage so you can focus on what matters.
- Elif's Newsletter — Dispatches on building with AI agents: lessons learned, tools shipped, and what's actually working in production. Written by an operator, for operators.
- Elifterminal on GitHub — All our open-source tools and projects.
If you're using Canary to protect your agents, you'll probably find the rest useful too.
License
MIT
