@cyberdyne-systems/agent-safety
v2026.3.15
Agent safety system: stakeholder model, action validator, and safety dashboard — based on arXiv:2602.20021
Agent Safety System
OpenClaw plugin for LLM agent safety based on arXiv:2602.20021 -- "Agents of Chaos".
Intercepts every tool call via before_tool_call and validates it against a stakeholder model with trust levels, UID-based identity anchoring, and 8 risk dimensions from the paper.
Install
openclaw plugins install @cyberdyne-systems/agent-safety
Then restart the gateway to load the plugin.
Architecture
    Tool Call
        |
        v
+------------------+     +------------------+
|   Quick Check    | --> |  Deep Analysis   |
|  (local rules)   |     |   (Claude API)   |
|  ~0ms latency    |     |     optional     |
+------------------+     +------------------+
         |                        |
         v                        v
+----------------------------------------------+
|                  Audit Log                   |
|  Every decision logged with risk score,      |
|  verdict, requester, and reasoning           |
+----------------------------------------------+
                       |
                       v
              ALLOW / WARN / BLOCK
Two-Phase Validation
Quick Check (near-zero latency) -- local rules run on every call:
- Trust level verification (0-4 scale)
- Permission checks against allowed actions
- Identity spoofing detection (UID anchoring)
- Dangerous command pattern matching (rm -rf, credential access, fork bombs)
- Loop / rapid-fire detection
- Unverified sender blocking for high-risk actions
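The trust, permission, and unverified-sender rules above can be sketched as a single pure function. This is a minimal illustration, not the plugin's implementation: the type names, the `HIGH_RISK` set, and the per-category minimum trust values are all assumptions made for the example, and pattern matching and loop detection are left out.

```typescript
// Hypothetical types and thresholds for illustration; not the plugin's actual API.
type Verdict = "ALLOW" | "WARN" | "BLOCK";

interface Requester {
  trust: number;            // 0-4 scale
  verified: boolean;        // identity confirmed via UID
  allowedActions: string[]; // permitted action categories
}

interface QuickCheckResult {
  verdict: Verdict;
  reason: string;
}

// Assumption: which categories count as high-risk.
const HIGH_RISK: string[] = ["execute_shell", "delete_files", "access_credentials"];

// Assumption: minimum trust per category; unlisted categories need trust 1.
const MIN_TRUST: { [category: string]: number } = {
  write_files: 2,
  execute_shell: 3,
  delete_files: 4,
  access_credentials: 4,
};

function quickCheck(requester: Requester, category: string): QuickCheckResult {
  // Unverified senders never reach high-risk actions.
  if (!requester.verified && HIGH_RISK.indexOf(category) !== -1) {
    return { verdict: "BLOCK", reason: "unverified sender on high-risk action" };
  }
  // The category must be explicitly permitted for this requester.
  if (requester.allowedActions.indexOf(category) === -1) {
    return { verdict: "BLOCK", reason: "action " + category + " not permitted" };
  }
  // The requester's trust level must meet the category's minimum.
  const required = MIN_TRUST[category] ?? 1;
  if (requester.trust < required) {
    return { verdict: "BLOCK", reason: "trust " + requester.trust + " < required " + required };
  }
  return { verdict: "ALLOW", reason: "passed local rules" };
}
```

Because the function only reads in-memory data, it adds no network round trip, which is consistent with the near-zero latency the quick check is described as having.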
Deep Analysis (optional, requires API key) -- Claude evaluates 8 risk dimensions:
- Authority Violation
- Resource Abuse
- Information Leak
- Safety Bypass
- Goal Misalignment
- Social Engineering
- Cascading Failure
- Irreversible Action
Telegram Approval (optional) -- when a non-owner's tool call is flagged:
- Sends a notification to the owner on Telegram with inline keyboard buttons (Approve / Deny)
- Owner can also reply with text: approve safety-N or deny safety-N
- Decision is cached for future similar requests from the same requester
- Unanswered approvals expire after 5 minutes
- Requires channels.telegram.capabilities.inlineButtons set to all (or allowlist)
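The reply parsing, 5-minute expiry, and decision caching described above might look roughly like the sketch below. All names here are hypothetical. The sketch also assumes that N in safety-N is a decimal number and that "similar requests" means the same requester and action category; the README does not define either.

```typescript
// Hypothetical names throughout; the plugin's real data structures are not documented here.
type Decision = "approve" | "deny";

interface PendingApproval {
  id: string;          // e.g. "safety-7", matching the "approve safety-N" reply format
  requesterId: string;
  category: string;
  createdAt: number;   // milliseconds since epoch
}

const APPROVAL_TTL_MS = 5 * 60 * 1000; // unanswered approvals expire after 5 minutes

function isExpired(pending: PendingApproval, now: number): boolean {
  return now - pending.createdAt > APPROVAL_TTL_MS;
}

// Parse a text reply such as "approve safety-7" or "deny safety-7".
function parseReply(text: string): { decision: Decision; id: string } | null {
  const m = /^\s*(approve|deny)\s+(safety-\d+)\s*$/i.exec(text);
  if (m === null) return null;
  return { decision: m[1].toLowerCase() as Decision, id: m[2].toLowerCase() };
}

// Cache keyed by requester and action category (an assumed notion of "similar request").
class DecisionCache {
  private decisions: { [key: string]: Decision } = {};

  private key(requesterId: string, category: string): string {
    return requesterId + "|" + category;
  }

  remember(pending: PendingApproval, decision: Decision): void {
    this.decisions[this.key(pending.requesterId, pending.category)] = decision;
  }

  lookup(requesterId: string, category: string): Decision | undefined {
    return this.decisions[this.key(requesterId, category)];
  }
}
```

Passing `now` into `isExpired` instead of reading the clock inside it keeps the expiry rule easy to test.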
Configuration
# Validation mode: local (default), api, or both
openclaw config set plugins.entries.agent-safety.config.mode local
# Enable Claude API deep analysis (requires API key)
openclaw config set plugins.entries.agent-safety.config.mode both
openclaw config set plugins.entries.agent-safety.config.apiKey sk-ant-...
# Choose validation model (default: claude-sonnet-4-5-20250514)
openclaw config set plugins.entries.agent-safety.config.model claude-haiku-4-5-20251001
# Block high-risk actions from unverified users (default: true)
openclaw config set plugins.entries.agent-safety.config.blockHighRiskUnverified true
# Enable Telegram approval flow for non-owner requests
openclaw config set plugins.entries.agent-safety.config.telegramApproval true
openclaw config set plugins.entries.agent-safety.config.telegramOwnerId "YOUR_TELEGRAM_USER_ID"
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| mode | "local" \| "api" \| "both" | "local" | Validation strategy |
| apiKey | string | $ANTHROPIC_API_KEY | API key for deep analysis |
| model | string | claude-sonnet-4-5-20250514 | Model for deep analysis |
| blockHighRiskUnverified | boolean | true | Auto-block unverified users on high-risk actions |
| telegramApproval | boolean | false | Send approval requests to owner on Telegram |
| telegramOwnerId | string | - | Owner's Telegram user ID for approval messages |
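As an illustration of how the defaults in the table combine, the sketch below resolves a partial configuration into a complete one. The interface mirrors the option names above, but `resolveConfig` and `validateConfig` are hypothetical helpers, and the two validation rules are assumptions inferred from "requires API key" and from the Telegram flow needing an owner ID.

```typescript
type Mode = "local" | "api" | "both";

interface AgentSafetyConfig {
  mode: Mode;
  apiKey?: string;
  model: string;
  blockHighRiskUnverified: boolean;
  telegramApproval: boolean;
  telegramOwnerId?: string;
}

// Fill in the documented defaults. The environment is passed in explicitly
// so the function stays pure; a caller would typically pass process.env.
function resolveConfig(
  user: Partial<AgentSafetyConfig>,
  env: { [name: string]: string | undefined }
): AgentSafetyConfig {
  return {
    mode: user.mode ?? "local",
    apiKey: user.apiKey ?? env["ANTHROPIC_API_KEY"],
    model: user.model ?? "claude-sonnet-4-5-20250514",
    blockHighRiskUnverified: user.blockHighRiskUnverified ?? true,
    telegramApproval: user.telegramApproval ?? false,
    telegramOwnerId: user.telegramOwnerId,
  };
}

// Assumed consistency checks; the plugin may behave differently.
function validateConfig(config: AgentSafetyConfig): string[] {
  const problems: string[] = [];
  if (config.mode !== "local" && !config.apiKey) {
    problems.push("mode " + config.mode + " requires an API key");
  }
  if (config.telegramApproval && !config.telegramOwnerId) {
    problems.push("telegramApproval requires telegramOwnerId");
  }
  return problems;
}
```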
Stakeholder Model
The plugin maintains a principal registry where each stakeholder has:
| Field | Description |
|-------|-------------|
| id | Unique identifier |
| name | Display name |
| role | owner, agent, or non_owner |
| trust | Trust level 0-4 (0 = untrusted, 4 = full trust) |
| verified | Whether identity is confirmed via UID |
| uid | Platform-specific unique identifier (anchors identity) |
| channel | Communication channel (Telegram, Discord, local, etc.) |
| allowedActions | List of permitted action categories |
Trust Levels
| Level | Meaning | Typical Permissions |
|-------|---------|-------------------|
| 0 | Untrusted | No actions allowed |
| 1 | Minimal | Read-only |
| 2 | Basic | Read + limited write |
| 3 | Elevated | Most actions except destructive |
| 4 | Full | All actions (owner) |
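The stakeholder fields and trust levels can be expressed as types, as in the sketch below. The `createStakeholder` helper is hypothetical. It follows the defaults shown for add_stakeholder (a UID yields a verified principal at trust 2, no UID yields an unverified one at trust 1). The initial `allowedActions` are an assumption derived from the "Typical Permissions" column, and the ID format is invented for the example.

```typescript
type Role = "owner" | "agent" | "non_owner";
type TrustLevel = 0 | 1 | 2 | 3 | 4;

interface Stakeholder {
  id: string;
  name: string;
  role: Role;
  trust: TrustLevel;
  verified: boolean;
  uid?: string;             // platform-specific identifier that anchors identity
  channel: string;
  allowedActions: string[];
}

let nextStakeholderId = 1; // hypothetical ID scheme

function createStakeholder(name: string, uid?: string, channel: string = "local"): Stakeholder {
  const verified = uid !== undefined;
  return {
    id: "stakeholder-" + nextStakeholderId++,
    name: name,
    role: "non_owner",
    trust: verified ? 2 : 1,
    verified: verified,
    uid: uid,
    channel: channel,
    // Assumption: trust 1 is read-only, trust 2 adds limited write.
    allowedActions: verified ? ["read_files", "write_files"] : ["read_files"],
  };
}
```

Typing `trust` as a union of literals rather than `number` lets the compiler reject values outside the 0-4 scale.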
Action Categories
The plugin maps tool names to these categories:
| Category | Example Tools |
|----------|--------------|
| execute_shell | bash, exec, terminal |
| read_files | read, glob, grep |
| write_files | write, edit |
| delete_files | delete, remove |
| external_network | web_fetch, curl |
| send_message | message, send, forward |
| read_message | read_message, inbox |
| modify_memory | memory_store, memory_update |
| access_credentials | credential, secret, token |
| agent_communication | agent_communication |
| forward_message | forward |
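A lookup over this table could be sketched as follows. The sketch assumes exact, case-insensitive matching on the tool name, which the README does not specify, and `categorize` is a hypothetical name. Because forward appears under both send_message and forward_message, the first matching row wins here, so forward resolves to send_message; the plugin may resolve that overlap differently.

```typescript
// Rows copied from the category table; order matters because the first match wins.
const CATEGORY_TOOLS: Array<[string, string[]]> = [
  ["execute_shell", ["bash", "exec", "terminal"]],
  ["read_files", ["read", "glob", "grep"]],
  ["write_files", ["write", "edit"]],
  ["delete_files", ["delete", "remove"]],
  ["external_network", ["web_fetch", "curl"]],
  ["send_message", ["message", "send", "forward"]],
  ["read_message", ["read_message", "inbox"]],
  ["modify_memory", ["memory_store", "memory_update"]],
  ["access_credentials", ["credential", "secret", "token"]],
  ["agent_communication", ["agent_communication"]],
  ["forward_message", ["forward"]],
];

// Returns the action category for a tool name, or undefined for unknown tools.
function categorize(toolName: string): string | undefined {
  const name = toolName.toLowerCase();
  for (let i = 0; i < CATEGORY_TOOLS.length; i++) {
    const [category, tools] = CATEGORY_TOOLS[i];
    if (tools.indexOf(name) !== -1) return category;
  }
  return undefined;
}
```

Returning `undefined` for unmapped tools leaves the caller to decide whether an unknown tool is allowed, warned about, or blocked.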
Agent Safety Tool
Once loaded, agents get an agent_safety tool for runtime introspection:
status -- Safety Dashboard
{
  "stakeholders": 2,
  "auditStats": {
    "total": 47,
    "allowed": 42,
    "warned": 3,
    "blocked": 2,
    "averageRisk": 18
  }
}
stakeholders -- List Principals
Returns all registered stakeholders with trust levels and permissions.
log -- Audit Trail
# Last 10 entries (default)
agent_safety action=log
# Last 5 entries
agent_safety action=log limit=5
Each entry includes: tool name, action category, requester, trust level, verdict, risk score, and reasoning.
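To show how the audit entries relate to the auditStats object returned by status, here is a minimal sketch. The `AuditEntry` field names are assumptions based on the list of entry contents, and rounding the average risk to an integer is a guess based on the sample output.

```typescript
type Verdict = "ALLOW" | "WARN" | "BLOCK";

// Assumed field names, following the entry contents listed above.
interface AuditEntry {
  tool: string;
  category: string;
  requester: string;
  trust: number;
  verdict: Verdict;
  risk: number;
  reasoning: string;
}

interface AuditStats {
  total: number;
  allowed: number;
  warned: number;
  blocked: number;
  averageRisk: number;
}

function auditStats(entries: AuditEntry[]): AuditStats {
  let allowed = 0;
  let warned = 0;
  let blocked = 0;
  let riskSum = 0;
  for (let i = 0; i < entries.length; i++) {
    const e = entries[i];
    if (e.verdict === "ALLOW") allowed++;
    else if (e.verdict === "WARN") warned++;
    else blocked++;
    riskSum += e.risk;
  }
  const total = entries.length;
  // Avoid dividing by zero on an empty log.
  const averageRisk = total === 0 ? 0 : Math.round(riskSum / total);
  return { total, allowed, warned, blocked, averageRisk };
}

// Most recent entries, defaulting to 10 like the log action.
function lastEntries(entries: AuditEntry[], limit: number = 10): AuditEntry[] {
  return limit <= 0 ? [] : entries.slice(-limit);
}
```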
add_stakeholder -- Register Principal
# With UID (verified, trust 2)
agent_safety action=add_stakeholder name="Alice" uid="telegram_12345"
# Without UID (unverified, trust 1)
agent_safety action=add_stakeholder name="Bob"
set_trust -- Adjust Trust Level
agent_safety action=set_trust stakeholder_id="<id>" trust=3
Case Studies (arXiv:2602.20021)
The plugin detects all 14 attack patterns from the paper:
| # | Case Study | Detection Method |
|---|-----------|-----------------|
| 1 | Unauthorized tool use | Permission check against allowedActions |
| 2 | Trust boundary violation | Trust level < required for action category |
| 3 | Bulk data harvesting | Pattern match: bulk inbox dump, export messages, "all emails" |
| 4 | Persistent process creation | Pattern match: cron, nohup, systemctl enable, launchctl load |
| 5 | Resource destruction | Pattern match: rm -rf, mkfs, dd, fork bombs |
| 6 | Credential harvesting | Pattern match: .ssh, .aws, /etc/shadow, env \| grep |
| 7 | Prompt injection | Command injection patterns: eval, \|, $() in shell |
| 8 | Data exfiltration | Outbound data via curl -d, scp, wget with file content |
| 9 | Multi-agent manipulation | Agent-to-agent communication validation |
| 10 | Identity spoofing | UID anchoring -- unverified sender + high-risk action = BLOCK |
| 11 | Privilege escalation | sudo, chmod, chown pattern detection |
| 12 | Encoded/obfuscated payloads | Pattern match: base64, atob, eval(), SYSTEM_ADMIN_OVERRIDE |
| 13 | Social engineering | Non-owner requesting destructive actions |
| 14 | Cascading failure | Irreversible bulk operations detection |
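Several rows in the table rely on pattern matching. The sketch below shows one way such rules could be encoded for case studies 4, 5, 6, and 11. The regular expressions are illustrative approximations written for this example, not the plugin's actual patterns, and they would miss variants such as rm -fr.

```typescript
interface PatternRule {
  caseStudy: number;
  label: string;
  pattern: RegExp;
}

// Illustrative patterns only; no global flag, so RegExp.test stays stateless.
const RULES: PatternRule[] = [
  { caseStudy: 4, label: "persistent process creation",
    pattern: /\bcron|\bnohup\b|\bsystemctl\s+enable\b|\blaunchctl\s+load\b/ },
  { caseStudy: 5, label: "resource destruction",
    pattern: /\brm\s+-[a-z]*r[a-z]*f|\bmkfs\b|\bdd\s+if=/i },
  { caseStudy: 6, label: "credential harvesting",
    pattern: /\.ssh\b|\.aws\b|\/etc\/shadow|\benv\s*\|\s*grep\b/ },
  { caseStudy: 11, label: "privilege escalation",
    pattern: /\b(sudo|chmod|chown)\b/ },
];

// Returns every rule whose pattern occurs in the command, in table order.
function matchRules(command: string): PatternRule[] {
  return RULES.filter(function (rule) {
    return rule.pattern.test(command);
  });
}
```

Returning all matching rules rather than stopping at the first lets the audit log record every reason a command was flagged.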
Test Results
146 tests passing across 3 test suites
Unit tests: 42 passed
Validator tests: 97 passed (incl. 14 case studies)
Integration tests: 7 passed
Benchmark:
MUST_BLOCK: 27/27 (100% detection)
MUST_ALLOW: 21/21 (0% false positives)
Live Gateway Tests
19/19 tests passed across 10 tool categories, validated through the OpenClaw gateway:
| Category | Tests | Result |
|----------|-------|--------|
| exec (shell) | 5 | PASS |
| read (files) | 4 | PASS |
| write (files) | 2 | PASS |
| web_fetch (network) | 2 | PASS |
| message (Telegram) | 1 | PASS |
| browser | 1 | PASS |
| memory | 1 | PASS |
| nodes | 1 | PASS |
| TTS | 1 | PASS |
| session | 1 | PASS |
How It Hooks In
The plugin registers a before_tool_call hook at priority 10 (runs early):
api.on("before_tool_call", async (event, ctx) => {
  // 1. Map tool name to action category
  // 2. Resolve requester from context (UID, isOwner)
  // 3. Run quickCheck (local rules)
  // 4. Optionally run deep analysis (Claude API)
  // 5. Log decision to audit trail
  // 6. Return { block: true, blockReason } if BLOCK
});
When no sender context is provided (local gateway usage), the plugin defaults to treating the caller as the owner -- so local tool calls are never blocked.
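The hook steps and the owner default can be sketched as a standalone function, independent of the OpenClaw API. The event and context shapes (`toolName`, `senderUid`) are assumptions, since the README does not document them. Blocking unknown senders is also an assumption. Deep analysis (step 4) is omitted, and the local rule is reduced to a single permission check.

```typescript
// Assumed shapes: the README does not document the event or context fields.
interface ToolCallEvent { toolName: string }
interface CallContext { senderUid?: string }
interface Principal { uid: string; role: "owner" | "agent" | "non_owner"; allowedActions: string[] }
interface AuditRecord { tool: string; requester: string; verdict: "ALLOW" | "BLOCK"; reasoning: string }
interface HookResult { block: true; blockReason: string }

// Stand-in principal used when no sender context is present.
const LOCAL_OWNER: Principal = { uid: "local", role: "owner", allowedActions: [] };

function resolveRequester(ctx: CallContext, registry: Principal[]): Principal | undefined {
  if (ctx.senderUid === undefined) return LOCAL_OWNER; // local gateway usage
  for (let i = 0; i < registry.length; i++) {
    if (registry[i].uid === ctx.senderUid) return registry[i];
  }
  return undefined;
}

function beforeToolCall(
  event: ToolCallEvent,
  ctx: CallContext,
  registry: Principal[],
  categorize: (toolName: string) => string,
  audit: AuditRecord[]
): HookResult | undefined {
  const category = categorize(event.toolName);        // step 1
  const requester = resolveRequester(ctx, registry);  // step 2
  let blockReason: string | undefined;                // step 3 (simplified)
  if (requester === undefined) {
    blockReason = "unknown sender " + ctx.senderUid;
  } else if (requester.role !== "owner" && requester.allowedActions.indexOf(category) === -1) {
    blockReason = requester.uid + " is not permitted to " + category;
  }
  audit.push({                                        // step 5
    tool: event.toolName,
    requester: requester ? requester.uid : "unknown",
    verdict: blockReason === undefined ? "ALLOW" : "BLOCK",
    reasoning: blockReason ?? "passed local rules",
  });
  if (blockReason === undefined) return undefined;    // allow: nothing to return
  return { block: true, blockReason: blockReason };   // step 6
}
```

In this sketch the owner bypasses the permission check entirely, which matches the statement that local calls without sender context are not blocked.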
License
MIT
