@cyberdyne-systems/agent-safety

v2026.3.15

Published

8 days ago

Agent safety system: stakeholder model, action validator, and safety dashboard — based on arXiv:2602.20021

cluster2600

Agent Safety System

OpenClaw plugin for LLM agent safety based on arXiv:2602.20021 -- "Agents of Chaos".

Intercepts every tool call via before_tool_call and validates it against a stakeholder model with trust levels, UID-based identity anchoring, and 8 risk dimensions from the paper.

Install

openclaw plugins install @cyberdyne-systems/agent-safety

Then restart the gateway to load the plugin.

Architecture

Tool Call
   |
   v
+------------------+     +------------------+
|   Quick Check    | --> |   Deep Analysis  |
|   (local rules)  |     |   (Claude API)   |
|   ~0ms latency   |     |   optional       |
+------------------+     +------------------+
   |                           |
   v                           v
+----------------------------------------------+
|              Audit Log                       |
|   Every decision logged with risk score,     |
|   verdict, requester, and reasoning          |
+----------------------------------------------+
   |
   v
  ALLOW / WARN / BLOCK

Two-Phase Validation

Quick Check (no network round trip, ~0ms added latency) -- local rules run on every call:
- Trust level verification (0-4 scale)
- Permission checks against allowed actions
- Identity spoofing detection (UID anchoring)
- Dangerous command pattern matching (rm -rf, credential access, fork bombs)
- Loop / rapid-fire detection
- Unverified sender blocking for high-risk actions
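
The local rules above can be sketched roughly as follows. This is a minimal illustration, not the plugin's source: the `Requester` shape, the `HIGH_RISK_CATEGORIES` list, the two regular expressions, and the rule ordering are all assumptions made for the example.

```typescript
// Hypothetical sketch: type and field names follow this README's tables,
// not the plugin's actual source.
type Verdict = "ALLOW" | "WARN" | "BLOCK";

interface Requester {
  trust: number;            // 0-4 scale
  verified: boolean;        // true when identity is anchored to a UID
  allowedActions: string[]; // permitted action categories
}

interface CheckResult {
  verdict: Verdict;
  reason: string;
}

// Assumption: the README does not say which categories count as high-risk.
const HIGH_RISK_CATEGORIES: string[] = ["execute_shell", "delete_files", "access_credentials"];

// Assumption: a two-pattern subset (rm -rf and a classic fork bomb) for illustration.
const DANGEROUS_PATTERNS: RegExp[] = [/\brm\s+-rf\b/, /:\(\)\s*\{.*\};\s*:/];

function quickCheck(who: Requester, category: string, commandText: string): CheckResult {
  // Trust level 0 means no actions are allowed at all.
  if (who.trust <= 0) {
    return { verdict: "BLOCK", reason: "trust level 0 allows no actions" };
  }
  // Permission check against the requester's allowed action categories.
  if (who.allowedActions.indexOf(category) === -1) {
    return { verdict: "BLOCK", reason: "category not in allowedActions: " + category };
  }
  // Unverified senders may not trigger high-risk categories.
  if (!who.verified && HIGH_RISK_CATEGORIES.indexOf(category) !== -1) {
    return { verdict: "BLOCK", reason: "unverified sender requested high-risk action" };
  }
  // Dangerous command pattern matching on the raw command text.
  const dangerous = DANGEROUS_PATTERNS.some(function (re) {
    return re.test(commandText);
  });
  if (dangerous) {
    return { verdict: "BLOCK", reason: "dangerous command pattern matched" };
  }
  return { verdict: "ALLOW", reason: "all local rules passed" };
}
```

The sketch never returns WARN and omits loop detection; it only shows how cheap, synchronous checks can short-circuit before any API call is made.
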

Deep Analysis (optional, requires API key) -- Claude evaluates 8 risk dimensions:
- Authority Violation
- Resource Abuse
- Information Leak
- Safety Bypass
- Goal Misalignment
- Social Engineering
- Cascading Failure
- Irreversible Action

Telegram Approval (optional) -- when a non-owner's tool call is flagged:
- Sends a notification to the owner on Telegram with inline keyboard buttons (Approve / Deny)
- Owner can also reply with text: approve safety-N or deny safety-N
- Decision is cached for future similar requests from the same requester
- Unanswered approvals expire after 5 minutes
- Requires channels.telegram.capabilities.inlineButtons set to all (or allowlist)
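
As a rough sketch of the text-reply and expiry rules above, the helpers below parse an owner reply and test whether a pending approval has timed out. The names `parseOwnerReply`, `isExpired`, and `PendingApproval` are hypothetical, and the assumptions that `N` is a decimal integer and that replies are case-insensitive are mine, not documented behaviour.

```typescript
// Hypothetical helpers; the plugin's real approval store is not shown in this README.
const APPROVAL_TTL_MS = 5 * 60 * 1000; // unanswered approvals expire after 5 minutes

interface PendingApproval {
  id: string;        // e.g. "safety-7"
  createdAt: number; // epoch milliseconds
}

interface OwnerReply {
  decision: "approve" | "deny";
  id: string;
}

// Parse a text reply such as "approve safety-7" or "deny safety-7".
// Assumption: N is a decimal integer and matching ignores case.
function parseOwnerReply(text: string): OwnerReply | null {
  const m = /^\s*(approve|deny)\s+(safety-\d+)\s*$/i.exec(text);
  if (m === null) {
    return null;
  }
  const decision = m[1].toLowerCase() === "approve" ? "approve" : "deny";
  return { decision: decision, id: m[2] };
}

// True once the approval has been pending for 5 minutes or longer.
function isExpired(pending: PendingApproval, now: number): boolean {
  return now - pending.createdAt >= APPROVAL_TTL_MS;
}
```

Passing `now` explicitly (for example `Date.now()`) keeps the expiry check easy to test without mocking the clock.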

Configuration

# Validation mode: local (default), api, or both
openclaw config set plugins.entries.agent-safety.config.mode local

# Enable Claude API deep analysis (requires API key)
openclaw config set plugins.entries.agent-safety.config.mode both
openclaw config set plugins.entries.agent-safety.config.apiKey sk-ant-...

# Choose validation model (default: claude-sonnet-4-5-20250514)
openclaw config set plugins.entries.agent-safety.config.model claude-haiku-4-5-20251001

# Block high-risk actions from unverified users (default: true)
openclaw config set plugins.entries.agent-safety.config.blockHighRiskUnverified true

# Enable Telegram approval flow for non-owner requests
openclaw config set plugins.entries.agent-safety.config.telegramApproval true
openclaw config set plugins.entries.agent-safety.config.telegramOwnerId "YOUR_TELEGRAM_USER_ID"

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| mode | "local" \| "api" \| "both" | "local" | Validation strategy |
| apiKey | string | $ANTHROPIC_API_KEY | API key for deep analysis |
| model | string | claude-sonnet-4-5-20250514 | Model for deep analysis |
| blockHighRiskUnverified | boolean | true | Auto-block unverified users on high-risk actions |
| telegramApproval | boolean | false | Send approval requests to owner on Telegram |
| telegramOwnerId | string | - | Owner's Telegram user ID for approval messages |

Stakeholder Model

The plugin maintains a principal registry where each stakeholder has:

| Field | Description |
|-------|-------------|
| id | Unique identifier |
| name | Display name |
| role | owner, agent, or non_owner |
| trust | Trust level 0-4 (0 = untrusted, 4 = full trust) |
| verified | Whether identity is confirmed via UID |
| uid | Platform-specific unique identifier (anchors identity) |
| channel | Communication channel (Telegram, Discord, local, etc.) |
| allowedActions | List of permitted action categories |
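
For readers who prefer types to tables, the fields above might be modelled as shown below. The interface is an assumption derived from the table alone; in particular, treating `uid` as optional and `channel` as a free-form string are guesses, and `isTrustLevel` is a hypothetical helper.

```typescript
// Hypothetical type sketch derived from the field table; not the plugin's real types.
type Role = "owner" | "agent" | "non_owner";
type TrustLevel = 0 | 1 | 2 | 3 | 4;

interface Stakeholder {
  id: string;               // unique identifier
  name: string;             // display name
  role: Role;
  trust: TrustLevel;        // 0 = untrusted, 4 = full trust
  verified: boolean;        // identity confirmed via UID
  uid?: string;             // assumption: absent for unverified principals
  channel: string;          // e.g. "telegram", "discord", "local"
  allowedActions: string[]; // permitted action categories
}

// Runtime guard: true only for the integers 0 through 4.
function isTrustLevel(value: number): value is TrustLevel {
  return Math.floor(value) === value && value >= 0 && value <= 4;
}

// Example registry entry with made-up values.
const exampleOwner: Stakeholder = {
  id: "owner-1",
  name: "Owner",
  role: "owner",
  trust: 4,
  verified: true,
  uid: "telegram_12345",
  channel: "telegram",
  allowedActions: ["read_files", "write_files", "execute_shell"],
};
```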

Trust Levels

| Level | Meaning | Typical Permissions |
|-------|---------|-------------------|
| 0 | Untrusted | No actions allowed |
| 1 | Minimal | Read-only |
| 2 | Basic | Read + limited write |
| 3 | Elevated | Most actions except destructive |
| 4 | Full | All actions (owner) |

Action Categories

The plugin maps tool names to these categories:

| Category | Example Tools |
|----------|--------------|
| execute_shell | bash, exec, terminal |
| read_files | read, glob, grep |
| write_files | write, edit |
| delete_files | delete, remove |
| external_network | web_fetch, curl |
| send_message | message, send, forward |
| read_message | read_message, inbox |
| modify_memory | memory_store, memory_update |
| access_credentials | credential, secret, token |
| agent_communication | agent_communication |
| forward_message | forward |
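
One plausible way to implement this mapping is an ordered lookup with exact, case-insensitive matching, sketched below. Whether the plugin matches exactly or by substring is not stated here, so treat `resolveCategory` and its matching strategy as assumptions. Note that `forward` is listed under both send_message and forward_message; with first-match-wins ordering this sketch resolves it to send_message.

```typescript
// Hypothetical lookup table copied from the category table above.
interface CategoryRule {
  category: string;
  tools: string[];
}

const CATEGORY_RULES: CategoryRule[] = [
  { category: "execute_shell", tools: ["bash", "exec", "terminal"] },
  { category: "read_files", tools: ["read", "glob", "grep"] },
  { category: "write_files", tools: ["write", "edit"] },
  { category: "delete_files", tools: ["delete", "remove"] },
  { category: "external_network", tools: ["web_fetch", "curl"] },
  { category: "send_message", tools: ["message", "send", "forward"] },
  { category: "read_message", tools: ["read_message", "inbox"] },
  { category: "modify_memory", tools: ["memory_store", "memory_update"] },
  { category: "access_credentials", tools: ["credential", "secret", "token"] },
  { category: "agent_communication", tools: ["agent_communication"] },
  { category: "forward_message", tools: ["forward"] },
];

// Returns the first category whose tool list contains the name, or null if none does.
// Assumption: exact, case-insensitive match; the first matching rule wins.
function resolveCategory(toolName: string): string | null {
  const key = toolName.toLowerCase();
  for (let i = 0; i < CATEGORY_RULES.length; i++) {
    const rule = CATEGORY_RULES[i];
    if (rule.tools.indexOf(key) !== -1) {
      return rule.category;
    }
  }
  return null;
}
```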

Agent Safety Tool

Once loaded, agents get an agent_safety tool for runtime introspection:

`status` -- Safety Dashboard

{
  "stakeholders": 2,
  "auditStats": {
    "total": 47,
    "allowed": 42,
    "warned": 3,
    "blocked": 2,
    "averageRisk": 18
  }
}
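
The `auditStats` block above could be derived from the audit log along these lines. This is a sketch under assumptions: the `AuditEntry` shape and the `summarizeAudit` name are hypothetical, and rounding `averageRisk` to the nearest integer is inferred only from the integer value in the sample output.

```typescript
// Hypothetical audit entry; the real log also records tool, requester, and reasoning.
interface AuditEntry {
  verdict: "ALLOW" | "WARN" | "BLOCK";
  riskScore: number;
}

interface AuditStats {
  total: number;
  allowed: number;
  warned: number;
  blocked: number;
  averageRisk: number;
}

// Count verdicts and average the risk scores (0 when the log is empty).
function summarizeAudit(entries: AuditEntry[]): AuditStats {
  const stats: AuditStats = { total: entries.length, allowed: 0, warned: 0, blocked: 0, averageRisk: 0 };
  let riskSum = 0;
  entries.forEach(function (entry) {
    if (entry.verdict === "ALLOW") {
      stats.allowed += 1;
    } else if (entry.verdict === "WARN") {
      stats.warned += 1;
    } else {
      stats.blocked += 1;
    }
    riskSum += entry.riskScore;
  });
  if (entries.length > 0) {
    // Assumption: the dashboard rounds to the nearest integer.
    stats.averageRisk = Math.round(riskSum / entries.length);
  }
  return stats;
}
```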

`stakeholders` -- List Principals

Returns all registered stakeholders with trust levels and permissions.

`log` -- Audit Trail

# Last 10 entries (default)
agent_safety action=log

# Last 5 entries
agent_safety action=log limit=5

Each entry includes: tool name, action category, requester, trust level, verdict, risk score, and reasoning.

`add_stakeholder` -- Register Principal

# With UID (verified, trust 2)
agent_safety action=add_stakeholder name="Alice" uid="telegram_12345"

# Without UID (unverified, trust 1)
agent_safety action=add_stakeholder name="Bob"
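
The defaults implied by the two commands above (verified with trust 2 when a UID is supplied, unverified with trust 1 otherwise) can be expressed as a small function. `registerPrincipal` and `RegisteredPrincipal` are hypothetical names used only for this illustration.

```typescript
// Hypothetical result shape; only the verified/trust defaults come from this README.
interface RegisteredPrincipal {
  displayName: string;
  uid: string | null;
  verified: boolean;
  trust: number;
}

// A UID anchors identity: with one, the principal starts verified at trust 2;
// without one, it starts unverified at trust 1.
function registerPrincipal(displayName: string, uid?: string): RegisteredPrincipal {
  if (uid !== undefined && uid.length > 0) {
    return { displayName: displayName, uid: uid, verified: true, trust: 2 };
  }
  return { displayName: displayName, uid: null, verified: false, trust: 1 };
}
```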

`set_trust` -- Adjust Trust Level

agent_safety action=set_trust stakeholder_id="<id>" trust=3

Case Studies (arXiv:2602.20021)

The plugin detects all 14 attack patterns from the paper:

| # | Case Study | Detection Method |
|---|-----------|-----------------|
| 1 | Unauthorized tool use | Permission check against allowedActions |
| 2 | Trust boundary violation | Trust level < required for action category |
| 3 | Bulk data harvesting | Pattern match: bulk inbox dump, export messages, "all emails" |
| 4 | Persistent process creation | Pattern match: cron, nohup, systemctl enable, launchctl load |
| 5 | Resource destruction | Pattern match: rm -rf, mkfs, dd, fork bombs |
| 6 | Credential harvesting | Pattern match: .ssh, .aws, /etc/shadow, env \| grep |
| 7 | Prompt injection | Command injection patterns: eval, \|, $() in shell |
| 8 | Data exfiltration | Outbound data via curl -d, scp, wget with file content |
| 9 | Multi-agent manipulation | Agent-to-agent communication validation |
| 10 | Identity spoofing | UID anchoring -- unverified sender + high-risk action = BLOCK |
| 11 | Privilege escalation | sudo, chmod, chown pattern detection |
| 12 | Encoded/obfuscated payloads | Pattern match: base64, atob, eval(), SYSTEM_ADMIN_OVERRIDE |
| 13 | Social engineering | Non-owner requesting destructive actions |
| 14 | Cascading failure | Irreversible bulk operations detection |
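
Several of the pattern-match rows above lend themselves to regular expressions. The sketch below covers case studies 4, 5, 6, and 11 only; the expressions are my own approximations of the listed keywords, not the plugin's actual patterns, and `matchCaseStudies` is a hypothetical name.

```typescript
// Hypothetical rules approximating four rows of the table; real patterns may differ.
interface PatternRule {
  caseStudy: number;
  label: string;
  pattern: RegExp;
}

const PATTERN_RULES: PatternRule[] = [
  {
    caseStudy: 4,
    label: "Persistent process creation",
    pattern: /\b(cron\w*|nohup|systemctl\s+enable|launchctl\s+load)\b/,
  },
  { caseStudy: 5, label: "Resource destruction", pattern: /\b(rm\s+-rf|mkfs)\b/ },
  {
    caseStudy: 6,
    label: "Credential harvesting",
    pattern: /(\.ssh\b|\.aws\b|\/etc\/shadow|\benv\s*\|\s*grep\b)/,
  },
  { caseStudy: 11, label: "Privilege escalation", pattern: /\b(sudo|chmod|chown)\b/ },
];

// Returns the case-study numbers whose pattern matches, in table order.
function matchCaseStudies(command: string): number[] {
  return PATTERN_RULES.filter(function (rule) {
    return rule.pattern.test(command);
  }).map(function (rule) {
    return rule.caseStudy;
  });
}
```

A single command can trip several rules at once: `sudo rm -rf /` matches both the resource-destruction and privilege-escalation patterns in this sketch.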

Test Results

146 tests passing across 3 test suites

Unit tests:       42 passed
Validator tests:  97 passed (incl. 14 case studies)
Integration tests: 7 passed

Benchmark:
  MUST_BLOCK: 27/27 (100% detection)
  MUST_ALLOW: 21/21 (0% false positives)

Live Gateway Tests

19/19 tests passed across 10 tool categories, validated through the OpenClaw gateway:

| Category | Tests | Result |
|----------|-------|--------|
| exec (shell) | 5 | PASS |
| read (files) | 4 | PASS |
| write (files) | 2 | PASS |
| web_fetch (network) | 2 | PASS |
| message (Telegram) | 1 | PASS |
| browser | 1 | PASS |
| memory | 1 | PASS |
| nodes | 1 | PASS |
| TTS | 1 | PASS |
| session | 1 | PASS |

How It Hooks In

The plugin registers a before_tool_call hook at priority 10 (runs early):

api.on("before_tool_call", async (event, ctx) => {
  // 1. Map tool name to action category
  // 2. Resolve requester from context (UID, isOwner)
  // 3. Run quickCheck (local rules)
  // 4. Optionally run deep analysis (Claude API)
  // 5. Log decision to audit trail
  // 6. Return { block: true, blockReason } if BLOCK
});

When no sender context is provided (local gateway usage), the plugin defaults to treating the caller as the owner -- so local tool calls are never blocked.
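
The owner-default rule and the `{ block: true, blockReason }` return shape can be illustrated in isolation, without the OpenClaw API. In the sketch below, `SenderContext`, `resolveRole`, and `toHookResult` are hypothetical names; the `uid` and `isOwner` fields follow the hook comments above, and returning `undefined` for non-blocking verdicts is an assumption.

```typescript
// Hypothetical context shape based on the "UID, isOwner" comment in the hook outline.
interface SenderContext {
  uid?: string;
  isOwner?: boolean;
}

interface BlockResult {
  block: true;
  blockReason: string;
}

// With no sender information at all (local gateway usage), treat the caller as owner.
function resolveRole(ctx?: SenderContext): "owner" | "non_owner" {
  if (ctx === undefined || (ctx.uid === undefined && ctx.isOwner === undefined)) {
    return "owner";
  }
  return ctx.isOwner === true ? "owner" : "non_owner";
}

// Only a BLOCK verdict produces a result object.
// Assumption: ALLOW and WARN return undefined so the tool call proceeds.
function toHookResult(verdict: "ALLOW" | "WARN" | "BLOCK", reason: string): BlockResult | undefined {
  if (verdict === "BLOCK") {
    return { block: true, blockReason: reason };
  }
  return undefined;
}
```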

License

MIT
