# opencode-prompt-guard
OpenCode plugin that scans tool outputs for malicious content using Meta's Llama Prompt Guard 2 (22M parameter model).
## Overview
This plugin acts as a security layer between tool execution and the LLM. Every tool response is analysed for prompt injection, jailbreak attempts, and other malicious content before it reaches the model. Detected threats are blocked and surfaced to the user as warnings.
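As a conceptual sketch, the interception boils down to the decision below. The `classifyOutput` stub and surrounding types are illustrative assumptions, not the plugin's actual source (the real classifier is the locally-run Prompt Guard 2 model):

```typescript
// Conceptual sketch of the security layer; the types and stub
// classifier are illustrative, not the plugin's actual source.
interface ScanResult {
  label: "BENIGN" | "MALICIOUS";
  score: number; // malicious confidence in [0, 1]
}

// Stand-in for the locally-run Prompt Guard 2 classifier.
async function classifyOutput(text: string): Promise<ScanResult> {
  return { label: "BENIGN", score: 0.01 }; // placeholder result
}

// Runs after each tool call (the tool.execute.after hook): benign output
// passes through unchanged; malicious output is replaced with a warning.
async function guardToolOutput(tool: string, output: string): Promise<string> {
  const { label, score } = await classifyOutput(output);
  if (label === "MALICIOUS") {
    return `⚠️ Blocked output from ${tool} (confidence ${score.toFixed(2)})`;
  }
  return output;
}
```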
Key features:
- Local inference - runs entirely on your machine with no external API calls.
- Sliding window analysis - handles outputs of any length using overlapping 512-token chunks.
- Configurable - adjust sensitivity, allow-list safe tools, and control output formatting.
- Fast - 22M parameter DeBERTa model with ~70 MB download and sub-100 ms inference on modern CPUs.
## Installation
Add the plugin to your OpenCode configuration:

```json
{
  "plugin": [
    [
      "opencode-prompt-guard",
      {
        "threshold": 0.5,
        "showConfidence": true
      }
    ]
  ]
}
```

The model will download automatically from Hugging Face on first use.
## Authentication
The model `meta-llama/Llama-Prompt-Guard-2-22M` is gated on Hugging Face. You must authenticate to download it. Three methods are supported:
- Plugin option (recommended): pass `hfToken` directly in `opencode.json`.
- Environment variable: set `HF_TOKEN` in your shell environment.
- CLI login: run `huggingface-cli login` to persist a token in `~/.cache/huggingface/token`.
The plugin checks these sources in order and uses the first token found.
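A minimal sketch of that precedence, assuming a Bun/Node runtime (the function name is ours, not part of the plugin's API):

```typescript
import { existsSync, readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Resolve a Hugging Face token using the documented precedence:
// plugin option, then HF_TOKEN, then the huggingface-cli token file.
function resolveHfToken(pluginToken?: string): string | undefined {
  if (pluginToken) return pluginToken;
  if (process.env.HF_TOKEN) return process.env.HF_TOKEN;
  const tokenFile = join(homedir(), ".cache", "huggingface", "token");
  if (existsSync(tokenFile)) {
    return readFileSync(tokenFile, "utf8").trim();
  }
  return undefined; // no token available; the gated download will fail
}
```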
## Configuration
| Option | Type | Default | Description |
| ---------------- | -------- | ------- | ---------------------------------------------------------- |
| threshold | number | 0.5 | Confidence threshold above which MALICIOUS is triggered. |
| showConfidence | boolean | true | Include confidence score in warning message. |
| allowList | string[] | [] | Tool names to skip scanning (e.g., low-risk tools). |
| hfToken | string | null | Hugging Face access token for gated model download. |
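In TypeScript terms, the options map onto a shape like the following; the interface name is ours, while the fields and defaults come from the table above.

```typescript
interface PromptGuardOptions {
  /** Malicious-confidence threshold above which output is blocked (default 0.5). */
  threshold?: number;
  /** Include the confidence score in the warning message (default true). */
  showConfidence?: boolean;
  /** Tool names whose outputs are never scanned (default []). */
  allowList?: string[];
  /** Hugging Face access token for the gated model download. */
  hfToken?: string;
}
```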
## How it works

```
Tool executes
      |
      v
tool.execute.after hook fires
      |
      v
Lazy-load classifier (first call only)
      |
      v
Tokenise full output (no truncation)
      |
      v
Split into overlapping chunks of 512 tokens (10% overlap)
      |
      v
Classify each chunk in parallel
      |
      v
Take MAXIMUM malicious score across all chunks
      |
      +---> BENIGN -----> Pass through unchanged
      |
      +---> MALICIOUS --> Block output, show warning
```
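The sliding-window step can be sketched as below. `classifyChunk` is a stand-in for the real ONNX inference call; the window arithmetic (512-token chunks, 10% overlap, maximum score wins) follows the diagram above.

```typescript
// Illustrative sketch of the sliding-window scan; classifyChunk is a
// stand-in for the real Prompt Guard inference call.
async function classifyChunk(tokenIds: number[]): Promise<number> {
  return 0.01; // placeholder malicious score
}

const WINDOW = 512;
const STRIDE = Math.floor(WINDOW * 0.9); // ~10% overlap between windows

async function scanTokens(tokenIds: number[]): Promise<number> {
  if (tokenIds.length === 0) return 0;
  // Split the token sequence into overlapping 512-token windows.
  const chunks: number[][] = [];
  for (let start = 0; ; start += STRIDE) {
    chunks.push(tokenIds.slice(start, start + WINDOW));
    if (start + WINDOW >= tokenIds.length) break;
  }
  // Classify all chunks in parallel and keep the worst (maximum) score.
  const scores = await Promise.all(chunks.map(classifyChunk));
  return Math.max(...scores);
}
```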
### Blocked output
When malicious content is detected, the tool output is replaced with:
```
⚠️ SECURITY ALERT: Potentially malicious content detected in tool output.
Tool: <tool_name>
Classification: MALICIOUS
Confidence: <score>
The tool output has been blocked to prevent harmful instructions from
reaching the LLM. If you believe this is a false positive, you can:
- Disable the prompt-guard plugin temporarily
- Adjust the confidence threshold in your opencode.json configuration
```
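For illustration only, the message can be assembled along these lines, with `showConfidence` controlling the confidence line (a sketch, not the plugin's exact source):

```typescript
// Build the replacement text shown in place of a blocked tool output.
function buildSecurityAlert(
  tool: string,
  score: number,
  showConfidence = true,
): string {
  const lines = [
    "⚠️ SECURITY ALERT: Potentially malicious content detected in tool output.",
    `Tool: ${tool}`,
    "Classification: MALICIOUS",
  ];
  if (showConfidence) lines.push(`Confidence: ${score.toFixed(2)}`);
  lines.push(
    "The tool output has been blocked to prevent harmful instructions from",
    "reaching the LLM.",
  );
  return lines.join("\n");
}
```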
## Performance
- First run: ~70 MB model download plus ONNX conversion (~280 MB peak).
- Subsequent runs: cached pipeline loads in under 1 second.
- Inference: typically 20-80 ms per 512-token chunk on a modern CPU.
- Memory: peak footprint stays under 500 MB.
## Development

```bash
# Install dependencies
bun install

# Run unit tests (fast, uses mocks)
bun test

# Run integration tests (slow, downloads real model, requires HF_TOKEN)
HF_TOKEN=hf_xxx bun test:integration

# Run all tests
bun test:all

# Build
bun run build

# Lint
bun run lint
```

## Copyright
Copyright © 2026, John Grimes.
