@cyberdyne-systems/agent-safety

v2026.3.15

Agent safety system: stakeholder model, action validator, and safety dashboard — based on arXiv:2602.20021

Downloads

957

Agent Safety System

OpenClaw plugin for LLM agent safety based on arXiv:2602.20021 -- "Agents of Chaos".

Intercepts every tool call via before_tool_call and validates it against a stakeholder model with trust levels, UID-based identity anchoring, and 8 risk dimensions from the paper.

Install

openclaw plugins install @cyberdyne-systems/agent-safety

Then restart the gateway to load the plugin.

Architecture

Tool Call
   |
   v
+------------------+     +------------------+
|   Quick Check    | --> |   Deep Analysis  |
|   (local rules)  |     |   (Claude API)   |
|   ~0ms latency   |     |   optional       |
+------------------+     +------------------+
   |                           |
   v                           v
+----------------------------------------------+
|              Audit Log                       |
|   Every decision logged with risk score,     |
|   verdict, requester, and reasoning          |
+----------------------------------------------+
   |
   v
  ALLOW / WARN / BLOCK

Two-Phase Validation

  1. Quick Check (zero latency) -- local rules run on every call:

    • Trust level verification (0-4 scale)
    • Permission checks against allowed actions
    • Identity spoofing detection (UID anchoring)
    • Dangerous command pattern matching (rm -rf, credential access, fork bombs)
    • Loop / rapid-fire detection
    • Unverified sender blocking for high-risk actions
  2. Deep Analysis (optional, requires API key) -- Claude evaluates 8 risk dimensions:

    • Authority Violation
    • Resource Abuse
    • Information Leak
    • Safety Bypass
    • Goal Misalignment
    • Social Engineering
    • Cascading Failure
    • Irreversible Action
  3. Telegram Approval (optional) -- when a non-owner's tool call is flagged:

    • Sends a notification to the owner on Telegram with inline keyboard buttons (Approve / Deny)
    • Owner can also reply with text: approve safety-N or deny safety-N
    • Decision is cached for future similar requests from the same requester
    • Unanswered approvals expire after 5 minutes
    • Requires channels.telegram.capabilities.inlineButtons set to all (or allowlist)
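The Quick Check rules listed above can be pictured as a short chain of local tests that returns on the first failure. The sketch below is only an illustration under stated assumptions: the type names, the `HIGH_RISK` list, the two regular expressions, and the order of the checks are hypothetical and are not taken from the plugin's source.

```typescript
// Hypothetical types; the plugin's real internal names are not shown in this README.
type Verdict = "ALLOW" | "WARN" | "BLOCK";

interface Requester {
  trust: number;            // 0-4 scale
  verified: boolean;        // identity confirmed via UID
  allowedActions: string[]; // permitted action categories
}

interface CheckResult {
  verdict: Verdict;
  reason: string;
}

// Assumption: which categories count as high-risk is illustrative only.
const HIGH_RISK = ["execute_shell", "delete_files", "access_credentials"];

// Assumption: a tiny subset of dangerous-command patterns (rm -rf, fork bomb).
const DANGEROUS = [/\brm\s+-rf\b/, /:\(\)\s*\{.*\};\s*:/];

function quickCheck(req: Requester, category: string, command: string): CheckResult {
  if (req.trust <= 0) {
    return { verdict: "BLOCK", reason: "untrusted requester" };
  }
  if (req.allowedActions.indexOf(category) === -1) {
    return { verdict: "BLOCK", reason: "action not permitted" };
  }
  if (!req.verified && HIGH_RISK.indexOf(category) !== -1) {
    return { verdict: "BLOCK", reason: "unverified sender on high-risk action" };
  }
  if (DANGEROUS.some((pattern) => pattern.test(command))) {
    return { verdict: "BLOCK", reason: "dangerous command pattern" };
  }
  return { verdict: "ALLOW", reason: "passed local rules" };
}
```

Because every rule is a synchronous in-memory test, a check of this shape adds essentially no latency, which is consistent with the "zero latency" description of phase 1.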

Configuration

# Validation mode: local (default), api, or both
openclaw config set plugins.entries.agent-safety.config.mode local

# Enable Claude API deep analysis (requires API key)
openclaw config set plugins.entries.agent-safety.config.mode both
openclaw config set plugins.entries.agent-safety.config.apiKey sk-ant-...

# Choose validation model (default: claude-sonnet-4-5-20250514)
openclaw config set plugins.entries.agent-safety.config.model claude-haiku-4-5-20251001

# Block high-risk actions from unverified users (default: true)
openclaw config set plugins.entries.agent-safety.config.blockHighRiskUnverified true

# Enable Telegram approval flow for non-owner requests
openclaw config set plugins.entries.agent-safety.config.telegramApproval true
openclaw config set plugins.entries.agent-safety.config.telegramOwnerId "YOUR_TELEGRAM_USER_ID"

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| mode | "local" \| "api" \| "both" | "local" | Validation strategy |
| apiKey | string | $ANTHROPIC_API_KEY | API key for deep analysis |
| model | string | claude-sonnet-4-5-20250514 | Model for deep analysis |
| blockHighRiskUnverified | boolean | true | Auto-block unverified users on high-risk actions |
| telegramApproval | boolean | false | Send approval requests to owner on Telegram |
| telegramOwnerId | string | - | Owner's Telegram user ID for approval messages |
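To make the documented defaults concrete, here is a minimal sketch of how the options might be merged with user settings. `resolveConfig` and `usesDeepAnalysis` are hypothetical helpers written for this illustration, not part of the plugin's API, and the environment is passed in as a plain object so the example stays self-contained.

```typescript
type Mode = "local" | "api" | "both";

interface AgentSafetyConfig {
  mode: Mode;
  apiKey: string | undefined;
  model: string;
  blockHighRiskUnverified: boolean;
  telegramApproval: boolean;
  telegramOwnerId: string | undefined;
}

// Hypothetical helper: applies the defaults documented in the options table.
function resolveConfig(
  user: Partial<AgentSafetyConfig>,
  env: Record<string, string | undefined>
): AgentSafetyConfig {
  return {
    mode: user.mode ?? "local",
    apiKey: user.apiKey ?? env["ANTHROPIC_API_KEY"],
    model: user.model ?? "claude-sonnet-4-5-20250514",
    blockHighRiskUnverified: user.blockHighRiskUnverified ?? true,
    telegramApproval: user.telegramApproval ?? false,
    telegramOwnerId: user.telegramOwnerId,
  };
}

// Assumption: deep analysis only runs when the mode asks for it and a key exists.
function usesDeepAnalysis(cfg: AgentSafetyConfig): boolean {
  return cfg.mode !== "local" && cfg.apiKey !== undefined;
}
```

With an empty user config the result stays in `local` mode, so deep analysis is off even if an API key happens to be present in the environment.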

Stakeholder Model

The plugin maintains a principal registry where each stakeholder has:

| Field | Description |
|-------|-------------|
| id | Unique identifier |
| name | Display name |
| role | owner, agent, or non_owner |
| trust | Trust level 0-4 (0 = untrusted, 4 = full trust) |
| verified | Whether identity is confirmed via UID |
| uid | Platform-specific unique identifier (anchors identity) |
| channel | Communication channel (Telegram, Discord, local, etc.) |
| allowedActions | List of permitted action categories |

Trust Levels

| Level | Meaning | Typical Permissions |
|-------|---------|---------------------|
| 0 | Untrusted | No actions allowed |
| 1 | Minimal | Read-only |
| 2 | Basic | Read + limited write |
| 3 | Elevated | Most actions except destructive |
| 4 | Full | All actions (owner) |

Action Categories

The plugin maps tool names to these categories:

| Category | Example Tools |
|----------|---------------|
| execute_shell | bash, exec, terminal |
| read_files | read, glob, grep |
| write_files | write, edit |
| delete_files | delete, remove |
| external_network | web_fetch, curl |
| send_message | message, send, forward |
| read_message | read_message, inbox |
| modify_memory | memory_store, memory_update |
| access_credentials | credential, secret, token |
| agent_communication | agent_communication |
| forward_message | forward |
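One way to picture the tool-to-category mapping is a simple ordered lookup. The sketch below copies the example tools from the table; the `categorize` function, its case-insensitive matching, and the choice to return `undefined` for unknown tools are assumptions made for illustration. Note that `forward` is listed under both `send_message` and `forward_message`, so a first-match lookup like this one resolves it to `send_message`.

```typescript
// Category -> example tools, as listed in the Action Categories table.
const CATEGORY_TOOLS: Array<[string, string[]]> = [
  ["execute_shell", ["bash", "exec", "terminal"]],
  ["read_files", ["read", "glob", "grep"]],
  ["write_files", ["write", "edit"]],
  ["delete_files", ["delete", "remove"]],
  ["external_network", ["web_fetch", "curl"]],
  ["send_message", ["message", "send", "forward"]],
  ["read_message", ["read_message", "inbox"]],
  ["modify_memory", ["memory_store", "memory_update"]],
  ["access_credentials", ["credential", "secret", "token"]],
  ["agent_communication", ["agent_communication"]],
  ["forward_message", ["forward"]],
];

// Hypothetical helper: returns the first category whose tool list contains the name.
function categorize(tool: string): string | undefined {
  const name = tool.toLowerCase();
  for (const [category, tools] of CATEGORY_TOOLS) {
    if (tools.indexOf(name) !== -1) {
      return category;
    }
  }
  return undefined; // Assumption: unknown tools map to no category.
}
```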

Agent Safety Tool

Once loaded, agents get an agent_safety tool for runtime introspection:

status -- Safety Dashboard

{
  "stakeholders": 2,
  "auditStats": {
    "total": 47,
    "allowed": 42,
    "warned": 3,
    "blocked": 2,
    "averageRisk": 18
  }
}
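The `auditStats` fields above could plausibly be derived from the audit log with a single pass over its entries. This is a sketch only: the `AuditEntry` shape, the `summarize` name, and the rounding of `averageRisk` to an integer are assumptions, not documented behavior.

```typescript
type AuditVerdict = "ALLOW" | "WARN" | "BLOCK";

// Hypothetical entry shape; the real log also records requester and reasoning.
interface AuditEntry {
  tool: string;
  verdict: AuditVerdict;
  riskScore: number;
}

function summarize(entries: AuditEntry[]) {
  const count = (v: AuditVerdict) => entries.filter((e) => e.verdict === v).length;
  const totalRisk = entries.reduce((sum, e) => sum + e.riskScore, 0);
  return {
    total: entries.length,
    allowed: count("ALLOW"),
    warned: count("WARN"),
    blocked: count("BLOCK"),
    // Assumption: the average is rounded and an empty log reports 0.
    averageRisk: entries.length === 0 ? 0 : Math.round(totalRisk / entries.length),
  };
}
```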

stakeholders -- List Principals

Returns all registered stakeholders with trust levels and permissions.

log -- Audit Trail

# Last 10 entries (default)
agent_safety action=log

# Last 5 entries
agent_safety action=log limit=5

Each entry includes: tool name, action category, requester, trust level, verdict, risk score, and reasoning.

add_stakeholder -- Register Principal

# With UID (verified, trust 2)
agent_safety action=add_stakeholder name="Alice" uid="telegram_12345"

# Without UID (unverified, trust 1)
agent_safety action=add_stakeholder name="Bob"
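The two registration cases above (UID present: verified, trust 2; UID absent: unverified, trust 1) can be sketched against the stakeholder fields as follows. Everything beyond those two documented defaults is an assumption for illustration: the id scheme, the default role and channel, and the initial `allowedActions` (chosen here to mirror the "read-only" and "read + limited write" trust descriptions).

```typescript
type Role = "owner" | "agent" | "non_owner";

interface Stakeholder {
  id: string;
  name: string;
  role: Role;
  trust: number;            // 0-4
  verified: boolean;        // true when identity is anchored by a UID
  uid: string | undefined;  // platform-specific identifier
  channel: string;
  allowedActions: string[];
}

let nextId = 1; // Hypothetical id counter.

function addStakeholder(name: string, uid?: string): Stakeholder {
  const verified = uid !== undefined;
  return {
    id: `stakeholder-${nextId++}`,
    name,
    role: "non_owner",          // Assumption: new principals are never owners.
    trust: verified ? 2 : 1,    // Documented defaults for the two cases.
    verified,
    uid,
    channel: "local",           // Assumption: placeholder channel.
    allowedActions: verified ? ["read_files", "write_files"] : ["read_files"],
  };
}
```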

set_trust -- Adjust Trust Level

agent_safety action=set_trust stakeholder_id="<id>" trust=3

Case Studies (arXiv:2602.20021)

The plugin detects all 14 attack patterns from the paper:

| # | Case Study | Detection Method |
|---|------------|------------------|
| 1 | Unauthorized tool use | Permission check against allowedActions |
| 2 | Trust boundary violation | Trust level < required for action category |
| 3 | Bulk data harvesting | Pattern match: bulk inbox dump, export messages, "all emails" |
| 4 | Persistent process creation | Pattern match: cron, nohup, systemctl enable, launchctl load |
| 5 | Resource destruction | Pattern match: rm -rf, mkfs, dd, fork bombs |
| 6 | Credential harvesting | Pattern match: .ssh, .aws, /etc/shadow, env \| grep |
| 7 | Prompt injection | Command injection patterns: eval, \|, $() in shell |
| 8 | Data exfiltration | Outbound data via curl -d, scp, wget with file content |
| 9 | Multi-agent manipulation | Agent-to-agent communication validation |
| 10 | Identity spoofing | UID anchoring -- unverified sender + high-risk action = BLOCK |
| 11 | Privilege escalation | sudo, chmod, chown pattern detection |
| 12 | Encoded/obfuscated payloads | Pattern match: base64, atob, eval(), SYSTEM_ADMIN_OVERRIDE |
| 13 | Social engineering | Non-owner requesting destructive actions |
| 14 | Cascading failure | Irreversible bulk operations detection |
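Several of the detection methods above are described as pattern matches on the command text. A minimal sketch of that idea, covering only case studies 4, 5, 6, and 11, might look like this. The regular expressions are illustrative guesses based on the keywords in the table and are not the plugin's actual patterns; `RULES` and `matchCaseStudies` are hypothetical names.

```typescript
interface PatternRule {
  caseStudy: number;
  label: string;
  pattern: RegExp;
}

// Assumption: simplified patterns built from the keywords in the case-study table.
const RULES: PatternRule[] = [
  { caseStudy: 4, label: "Persistent process creation", pattern: /\b(cron\w*|nohup|systemctl\s+enable|launchctl\s+load)\b/ },
  { caseStudy: 5, label: "Resource destruction", pattern: /\b(rm\s+-rf|mkfs)\b/ },
  { caseStudy: 6, label: "Credential harvesting", pattern: /(\.ssh|\.aws|\/etc\/shadow)/ },
  { caseStudy: 11, label: "Privilege escalation", pattern: /\b(sudo|chmod|chown)\b/ },
];

// Returns the case-study numbers whose pattern matches the command, in rule order.
function matchCaseStudies(command: string): number[] {
  return RULES.filter((rule) => rule.pattern.test(command)).map((rule) => rule.caseStudy);
}
```

A single command can trigger more than one rule; for example, `sudo rm -rf /` matches both the resource-destruction and privilege-escalation patterns in this sketch.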

Test Results

146 tests passing across 3 test suites

Unit tests:       42 passed
Validator tests:  97 passed (incl. 14 case studies)
Integration tests: 7 passed

Benchmark:
  MUST_BLOCK: 27/27 (100% detection)
  MUST_ALLOW: 21/21 (0% false positives)

Live Gateway Tests

19/19 live tests passed across 10 tool categories through the OpenClaw gateway:

| Category | Tests | Result |
|----------|-------|--------|
| exec (shell) | 5 | PASS |
| read (files) | 4 | PASS |
| write (files) | 2 | PASS |
| web_fetch (network) | 2 | PASS |
| message (Telegram) | 1 | PASS |
| browser | 1 | PASS |
| memory | 1 | PASS |
| nodes | 1 | PASS |
| TTS | 1 | PASS |
| session | 1 | PASS |

How It Hooks In

The plugin registers a before_tool_call hook at priority 10 (runs early):

api.on("before_tool_call", async (event, ctx) => {
  // 1. Map tool name to action category
  // 2. Resolve requester from context (UID, isOwner)
  // 3. Run quickCheck (local rules)
  // 4. Optionally run deep analysis (Claude API)
  // 5. Log decision to audit trail
  // 6. Return { block: true, blockReason } if BLOCK
});

When no sender context is provided (local gateway usage), the plugin defaults to treating the caller as the owner -- so local tool calls are never blocked.
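The owner-by-default behavior described above can be sketched without the gateway. In the example below, `ToolCallEvent`, `HookContext`, and the field names `senderUid` and `isOwner` are hypothetical stand-ins; the real OpenClaw event and context types are not shown in this README. Treating "has a UID" as "verified" and treating shell tools as the only high-risk tools are likewise simplifications for illustration.

```typescript
// Hypothetical shapes; not the actual OpenClaw types.
interface ToolCallEvent { toolName: string; }
interface HookContext { senderUid?: string; isOwner?: boolean; }
interface HookResult { block: true; blockReason: string; }

interface ResolvedRequester { role: "owner" | "non_owner"; verified: boolean; }

function resolveRequester(ctx: HookContext): ResolvedRequester {
  // No sender context at all (local gateway usage): treat the caller as the owner.
  if (ctx.senderUid === undefined && ctx.isOwner === undefined) {
    return { role: "owner", verified: true };
  }
  if (ctx.isOwner === true) {
    return { role: "owner", verified: true };
  }
  // Assumption: a sender with a UID counts as verified.
  return { role: "non_owner", verified: ctx.senderUid !== undefined };
}

// Returns a block result, or undefined to let the tool call proceed.
function beforeToolCall(event: ToolCallEvent, ctx: HookContext): HookResult | undefined {
  const requester = resolveRequester(ctx);
  if (requester.role === "owner") {
    return undefined;
  }
  // Assumption: only shell tools are treated as high-risk here.
  const highRisk = ["bash", "exec", "terminal"].indexOf(event.toolName) !== -1;
  if (highRisk && !requester.verified) {
    return { block: true, blockReason: `unverified sender may not call ${event.toolName}` };
  }
  return undefined;
}
```

In this sketch an empty context always resolves to the owner, so a local call such as `beforeToolCall({ toolName: "bash" }, {})` is never blocked.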

License

MIT