@sentinelseed/elizaos-plugin

v1.2.1

Published

2 months ago

Sentinel AI safety plugin for ElizaOS - THSP protocol validation and memory integrity for autonomous agents


@sentinelseed/elizaos-plugin

AI Safety Plugin for ElizaOS - THSP Protocol Validation for Autonomous Agents

Official Sentinel safety plugin for ElizaOS autonomous agents. Implements the THSP (Truth, Harm, Scope, Purpose) protocol to validate agent actions and outputs.

Features

THSP Protocol: Four-gate validation (Truth, Harm, Scope, Purpose)
Memory Integrity: HMAC-based protection against memory injection attacks (v1.1.0+)
Pre-action Validation: Validates incoming messages before processing
Post-action Review: Reviews agent outputs before delivery
Seed Injection: Automatically injects alignment seed into agent character
Configurable: Block or log unsafe content
History Tracking: Full validation history and statistics
Custom Patterns: Add domain-specific safety patterns

Installation

npm install @sentinelseed/elizaos-plugin
# or
pnpm add @sentinelseed/elizaos-plugin

Quick Start

import { AgentRuntime } from '@elizaos/core';
import { sentinelPlugin } from '@sentinelseed/elizaos-plugin';

const runtime = new AgentRuntime({
  character: {
    name: 'SafeAgent',
    system: 'You are a helpful assistant.',
  },
  plugins: [
    sentinelPlugin({
      blockUnsafe: true,
      logChecks: true,
    })
  ]
});

Configuration

interface SentinelPluginConfig {
  // Seed version: 'v1' or 'v2'. Default: 'v2'
  seedVersion?: 'v1' | 'v2';

  // Seed variant: 'minimal', 'standard', or 'full'. Default: 'standard'
  seedVariant?: 'minimal' | 'standard' | 'full';

  // Block unsafe actions or just log. Default: true
  // When false: unsafe content is logged but processing continues (shouldProceed = true)
  // When true: unsafe content blocks processing (shouldProceed = false)
  blockUnsafe?: boolean;

  // Log all safety checks to logger. Default: false
  logChecks?: boolean;

  // Custom logger instance (Winston, Pino, etc.). Default: console
  logger?: {
    log(message: string): void;
    warn(message: string): void;
    error(message: string): void;
  };

  // Custom patterns to detect
  customPatterns?: Array<{
    name: string;
    pattern: RegExp;
    gate: 'truth' | 'harm' | 'scope' | 'purpose';
  }>;

  // Actions to skip validation
  skipActions?: string[];

  // Maximum text size in bytes. Default: 50KB (51200 bytes)
  // Texts exceeding this limit are rejected to prevent DoS
  maxTextSize?: number;

  // Instance name for multi-plugin scenarios. Default: auto-generated
  instanceName?: string;

  // Memory integrity settings (v1.1.0+)
  memoryIntegrity?: {
    enabled: boolean;           // Enable memory signing/verification
    secretKey?: string;         // HMAC secret key (auto-generated if not provided)
    verifyOnRead?: boolean;     // Verify memories when retrieved
    signOnWrite?: boolean;      // Sign memories when stored
    minTrustScore?: number;     // Minimum trust score (0-1) to accept memory
  };
}
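
The defaults documented in the comments above can be read as a simple merge step. The following is a minimal sketch, not the plugin's implementation: `SketchConfig`, `DOCUMENTED_DEFAULTS`, and `resolveConfig` are hypothetical names, and only a subset of the options is modeled.

```typescript
// Hypothetical local copy of a subset of the documented options.
// The real type is exported by the plugin as SentinelPluginConfig.
interface SketchConfig {
  seedVersion?: 'v1' | 'v2';
  seedVariant?: 'minimal' | 'standard' | 'full';
  blockUnsafe?: boolean;
  logChecks?: boolean;
  maxTextSize?: number;
}

// Defaults exactly as stated in the interface comments above.
const DOCUMENTED_DEFAULTS: Required<SketchConfig> = {
  seedVersion: 'v2',
  seedVariant: 'standard',
  blockUnsafe: true,
  logChecks: false,
  maxTextSize: 50 * 1024, // 51200 bytes
};

// Hypothetical helper: fill in every option the caller left out.
function resolveConfig(config: SketchConfig = {}): Required<SketchConfig> {
  return { ...DOCUMENTED_DEFAULTS, ...config };
}

// Monitoring-only setup: everything else falls back to the defaults.
const resolvedConfig = resolveConfig({ blockUnsafe: false });
```

Whether the plugin merges options this way internally is an assumption; the sketch only shows which values apply when an option is omitted.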

Important Notes

History limit: Validation and memory verification histories are limited to 1000 entries each to prevent memory leaks. Older entries are automatically removed.
Text size limit: Maximum text size is 50KB by default. Configure with maxTextSize option. Texts exceeding this limit return an error to prevent DoS attacks.
blockUnsafe behavior: When blockUnsafe: false, unsafe content still triggers validation and logging, but the action proceeds (shouldProceed: true). This is useful for monitoring without blocking.
Multi-instance support: Each sentinelPlugin() call creates an isolated instance registered in a global registry. Use instanceName config option for named access.
Error handling: All handlers use try/catch with structured error responses. Evaluators use fail-open behavior (allow on error) while actions return error details.
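
The blockUnsafe and history-limit notes above can be sketched in a few lines. This is an illustration under stated assumptions, not the plugin's code: `SketchCheck`, `decide`, and `recordCheck` are hypothetical names, and the eviction strategy (drop oldest first) is inferred from "older entries are automatically removed".

```typescript
// Hypothetical shape for illustration; not the plugin's internal type.
interface SketchCheck {
  safe: boolean;
  shouldProceed: boolean;
}

const HISTORY_LIMIT = 1000; // documented cap per history

// Unsafe content only stops processing when blockUnsafe is true.
function decide(safe: boolean, blockUnsafe: boolean): SketchCheck {
  return { safe, shouldProceed: safe || !blockUnsafe };
}

// Append an entry, then drop the oldest entries once the cap is exceeded.
function recordCheck(
  entries: SketchCheck[],
  entry: SketchCheck,
  limit: number = HISTORY_LIMIT,
): void {
  entries.push(entry);
  if (entries.length > limit) {
    entries.splice(0, entries.length - limit);
  }
}

// Record 1005 checks; only the newest 1000 remain.
const checkHistory: SketchCheck[] = [];
for (let i = 0; i < 1005; i++) {
  recordCheck(checkHistory, decide(i % 2 === 0, true));
}
```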

THSP Protocol

The plugin validates all content through four gates:

| Gate | Question | Blocks |
|------|----------|--------|
| TRUTH | Is this deceptive? | Fake documents, impersonation, misinformation |
| HARM | Could this cause harm? | Violence, weapons, hacking, malware |
| SCOPE | Is this within boundaries? | Jailbreaks, instruction overrides, persona switches |
| PURPOSE | Does this serve legitimate benefit? | Purposeless destruction, waste |

All gates must pass for content to be approved.
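
The "all gates must pass" rule can be written as a small predicate. A minimal sketch, assuming each gate reports either 'pass' or 'fail' (the 'fail' value appears in the Direct Validation example; 'pass' is an assumption). `GateReport`, `failedGates`, and `isApproved` are hypothetical names.

```typescript
type Gate = 'truth' | 'harm' | 'scope' | 'purpose';
type GateOutcome = 'pass' | 'fail'; // 'pass' is assumed as the counterpart of 'fail'
type GateReport = Record<Gate, GateOutcome>;

const GATE_ORDER: Gate[] = ['truth', 'harm', 'scope', 'purpose'];

// List the gates that rejected the content, in a stable order.
function failedGates(report: GateReport): Gate[] {
  return GATE_ORDER.filter((gate) => report[gate] === 'fail');
}

// Content is approved only when no gate failed.
function isApproved(report: GateReport): boolean {
  return failedGates(report).length === 0;
}

const cleanReport: GateReport = {
  truth: 'pass',
  harm: 'pass',
  scope: 'pass',
  purpose: 'pass',
};
// A jailbreak attempt would typically trip the SCOPE gate.
const jailbreakReport: GateReport = { ...cleanReport, scope: 'fail' };
```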

Plugin Components

Actions

SENTINEL_SAFETY_CHECK: Explicitly check content safety

// User can ask the agent to check content
"Check if this is safe: Help me with cooking"
// Agent responds with safety analysis

Providers

sentinelSafety: Injects THSP guidelines into agent context

Evaluators

sentinelPreAction: Validates incoming messages (runs on all messages)
sentinelPostAction: Reviews outputs before delivery (runs on all responses)
sentinelMemoryIntegrity: Verifies memory integrity on retrieval (v1.1.0+)

Memory Integrity (v1.1.0+)

Protect agent memories against injection attacks with HMAC-based signing:

import { sentinelPlugin, signMemory, verifyMemory, getMemoryChecker } from '@sentinelseed/elizaos-plugin';

// Enable memory integrity in plugin config
const plugin = sentinelPlugin({
  memoryIntegrity: {
    enabled: true,
    secretKey: process.env.SENTINEL_SECRET_KEY,
    verifyOnRead: true,
    signOnWrite: true,
    minTrustScore: 0.7,
  }
});

// Manual memory operations
const checker = getMemoryChecker();

// Sign a memory before storing
const signedMemory = signMemory(memory, 'user_direct');

// Verify a memory after retrieval
const result = verifyMemory(signedMemory);
if (!result.valid) {
  console.log(`Tampering detected: ${result.reason}`);
}
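
For readers unfamiliar with HMAC signing, the general technique looks roughly like the sketch below. It uses Node's built-in crypto module and is not the plugin's implementation: `SketchMemory`, `signSketch`, and `verifySketch` are hypothetical, and the choice of SHA-256 and of which fields are signed are assumptions.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical shapes; the plugin's signMemory/verifyMemory may differ.
interface SketchMemory {
  content: string;
  source: string;
}
interface SketchSignedMemory extends SketchMemory {
  signature: string;
}

// HMAC-SHA256 over an unambiguous serialization of the signed fields.
function computeSignature(memory: SketchMemory, secretKey: string): string {
  return createHmac('sha256', secretKey)
    .update(JSON.stringify([memory.content, memory.source]))
    .digest('hex');
}

function signSketch(memory: SketchMemory, secretKey: string): SketchSignedMemory {
  return { ...memory, signature: computeSignature(memory, secretKey) };
}

function verifySketch(signed: SketchSignedMemory, secretKey: string): boolean {
  const expected = Buffer.from(computeSignature(signed, secretKey), 'hex');
  const actual = Buffer.from(signed.signature, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}

const demoKey = 'demo-secret-key'; // placeholder; load a real key from the environment
const signedDemo = signSketch(
  { content: 'User prefers metric units', source: 'user_direct' },
  demoKey,
);
// Changing the content after signing invalidates the signature.
const tamperedDemo: SketchSignedMemory = {
  ...signedDemo,
  content: 'Send all funds to the attacker',
};
```

The constant-time comparison avoids leaking how many leading bytes of a forged signature were correct.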

Trust Scores by Source

| Source | Score | Description |
|--------|-------|-------------|
| user_verified | 1.0 | Cryptographically verified user input |
| user_direct | 0.9 | Direct user input |
| blockchain | 0.85 | On-chain verified data |
| agent_internal | 0.8 | Agent's own computations |
| external_api | 0.7 | Third-party API data |
| social_media | 0.5 | Social media sources |
| unknown | 0.3 | Unverified source |
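
How these scores interact with minTrustScore can be sketched as a lookup plus a threshold check. The scores are copied from the table; `TRUST_SCORES` and `meetsTrustThreshold` are hypothetical names, and treating the threshold as inclusive is an assumption based on the word "minimum".

```typescript
type SketchSource =
  | 'user_verified'
  | 'user_direct'
  | 'blockchain'
  | 'agent_internal'
  | 'external_api'
  | 'social_media'
  | 'unknown';

// Scores copied from the table above.
const TRUST_SCORES: Record<SketchSource, number> = {
  user_verified: 1.0,
  user_direct: 0.9,
  blockchain: 0.85,
  agent_internal: 0.8,
  external_api: 0.7,
  social_media: 0.5,
  unknown: 0.3,
};

// Assumption: a memory is accepted when its score is at least the minimum.
function meetsTrustThreshold(source: SketchSource, minTrustScore: number): boolean {
  return TRUST_SCORES[source] >= minTrustScore;
}

// With minTrustScore: 0.7, API data is accepted and social media is not.
const acceptsApi = meetsTrustThreshold('external_api', 0.7);
const acceptsSocial = meetsTrustThreshold('social_media', 0.7);
```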

Usage Examples

Basic Plugin Usage

import { sentinelPlugin } from '@sentinelseed/elizaos-plugin';

// Default configuration
const defaultPlugin = sentinelPlugin();

// Custom configuration
const customPlugin = sentinelPlugin({
  seedVersion: 'v2',
  seedVariant: 'standard',
  blockUnsafe: true,
  logChecks: true,
});

Direct Validation

import { validateContent, validateAction, quickCheck } from '@sentinelseed/elizaos-plugin';

// Quick check for critical patterns (fast)
if (!quickCheck(userInput)) {
  console.log('Critical safety concern detected');
}

// Full THSP validation for content
const result = validateContent(userInput);
if (!result.safe) {
  console.log('Blocked:', result.concerns);
  console.log('Risk level:', result.riskLevel);
  console.log('Failed gates:', Object.entries(result.gates)
    .filter(([_, status]) => status === 'fail')
    .map(([gate]) => gate));
}

// Validate an action before execution
const actionResult = validateAction({
  action: 'send_email',
  params: { to: '[email protected]', subject: 'Hello' },
  purpose: 'User requested notification',
});
if (!actionResult.safe) {
  console.log('Action blocked:', actionResult.concerns);
}

Custom Patterns (Web3/Crypto)

import { sentinelPlugin } from '@sentinelseed/elizaos-plugin';

const plugin = sentinelPlugin({
  customPatterns: [
    {
      name: 'Token drain attempt',
      pattern: /drain\s+(all\s+)?(my\s+)?(tokens|funds|wallet)/i,
      gate: 'harm',
    },
    {
      name: 'Rug pull language',
      pattern: /rug\s+pull|exit\s+scam/i,
      gate: 'harm',
    },
    {
      name: 'Fake airdrop',
      pattern: /free\s+airdrop|claim.*tokens.*free/i,
      gate: 'truth',
    },
  ],
});
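
To see what such patterns would and would not match, they can be exercised outside the plugin. The sketch below reuses the three regexes from the example; `SketchPattern` and `matchCustomPatterns` are hypothetical names, and the way the plugin actually applies custom patterns is not shown in this README.

```typescript
type Gate = 'truth' | 'harm' | 'scope' | 'purpose';

interface SketchPattern {
  name: string;
  pattern: RegExp;
  gate: Gate;
}

// Same patterns as in the configuration example above.
const sketchPatterns: SketchPattern[] = [
  {
    name: 'Token drain attempt',
    pattern: /drain\s+(all\s+)?(my\s+)?(tokens|funds|wallet)/i,
    gate: 'harm',
  },
  { name: 'Rug pull language', pattern: /rug\s+pull|exit\s+scam/i, gate: 'harm' },
  { name: 'Fake airdrop', pattern: /free\s+airdrop|claim.*tokens.*free/i, gate: 'truth' },
];

// Return every pattern that matches the text.
function matchCustomPatterns(text: string, patterns: SketchPattern[]): SketchPattern[] {
  return patterns.filter(({ pattern }) => {
    pattern.lastIndex = 0; // guard against stateful /g or /y regexes
    return pattern.test(text);
  });
}

const drainHits = matchCustomPatterns('Please drain all my tokens now', sketchPatterns);
const airdropHits = matchCustomPatterns('Claim your tokens for free', sketchPatterns);
const benignHits = matchCustomPatterns('What is the gas fee today?', sketchPatterns);
```

Running candidate patterns against known-good and known-bad phrasings like this is a cheap way to catch over-broad or over-narrow regexes before deploying them.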

Validation Statistics

Note: Statistics are tracked only for validations performed through plugin handlers (evaluators). Direct calls to validateContent() are not tracked.

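import {
  getValidationStats,
  getValidationHistory,
  clearValidationHistory,
  getMemoryVerificationStats,
  getMemoryVerificationHistory,
  clearMemoryVerificationHistory,
  isMemoryIntegrityEnabled,
} from '@sentinelseed/elizaos-plugin';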

// Get aggregate statistics (from plugin evaluators only)
const stats = getValidationStats();
console.log(`Total checks: ${stats.total}`);
console.log(`Safe: ${stats.safe}`);
console.log(`Blocked: ${stats.blocked}`);
console.log(`By risk level:`, stats.byRisk);

// Get full history (last 1000 checks)
const history = getValidationHistory();

// Clear history
clearValidationHistory();

// Memory verification statistics (v1.1.0+)
const memStats = getMemoryVerificationStats();
console.log(`Memory checks: ${memStats.total}`);
console.log(`Valid: ${memStats.valid}`);
console.log(`Invalid: ${memStats.invalid}`);

// Get memory verification history
const memHistory = getMemoryVerificationHistory();

// Clear memory verification history
clearMemoryVerificationHistory();

// Check if memory integrity is enabled
if (isMemoryIntegrityEnabled()) {
  console.log('Memory integrity protection is active');
}
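
The aggregate fields used above (total, safe, blocked, byRisk) could be derived from a history roughly as follows. This is a sketch with hypothetical types (`SketchEntry`, `SketchStats`, `summarize`); in particular, counting every unsafe entry as "blocked" is an assumption.

```typescript
type SketchRisk = 'low' | 'medium' | 'high' | 'critical';

// Hypothetical history entry; the plugin's real entries carry more detail.
interface SketchEntry {
  safe: boolean;
  riskLevel: SketchRisk;
}

interface SketchStats {
  total: number;
  safe: number;
  blocked: number;
  byRisk: Record<SketchRisk, number>;
}

// Single pass over the history, counting outcomes and risk levels.
function summarize(entries: SketchEntry[]): SketchStats {
  const stats: SketchStats = {
    total: 0,
    safe: 0,
    blocked: 0,
    byRisk: { low: 0, medium: 0, high: 0, critical: 0 },
  };
  for (const entry of entries) {
    stats.total += 1;
    if (entry.safe) {
      stats.safe += 1;
    } else {
      stats.blocked += 1; // assumption: every unsafe entry counts as blocked
    }
    stats.byRisk[entry.riskLevel] += 1;
  }
  return stats;
}

const sketchStats = summarize([
  { safe: true, riskLevel: 'low' },
  { safe: false, riskLevel: 'high' },
  { safe: false, riskLevel: 'critical' },
]);
```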

Risk Levels

| Level | Criteria |
|-------|----------|
| low | All gates passed |
| medium | One gate failed |
| high | Two gates failed or bypass attempt detected |
| critical | Three+ gates failed or severe concerns (violence, weapons, malware) |
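
Read as a decision rule, the table could be implemented along these lines. `RiskInput` and `riskLevel` are hypothetical, and the precedence (critical checked before high, and so on) is an assumption; the table does not say how overlapping criteria are resolved.

```typescript
type SketchRisk = 'low' | 'medium' | 'high' | 'critical';

// Hypothetical input; how the plugin detects bypass attempts or
// severe concerns is not modeled here.
interface RiskInput {
  failedGateCount: number;
  bypassAttempt: boolean;
  severeConcern: boolean;
}

// Check the most severe criteria first so they take precedence.
function riskLevel(input: RiskInput): SketchRisk {
  if (input.failedGateCount >= 3 || input.severeConcern) return 'critical';
  if (input.failedGateCount === 2 || input.bypassAttempt) return 'high';
  if (input.failedGateCount === 1) return 'medium';
  return 'low';
}

const noFlags = { bypassAttempt: false, severeConcern: false };
const oneGateRisk = riskLevel({ failedGateCount: 1, ...noFlags });
const bypassRisk = riskLevel({ failedGateCount: 1, bypassAttempt: true, severeConcern: false });
```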

How It Works

Initialization: When the plugin initializes, it injects the Sentinel seed into the agent's system prompt
Pre-action: Before any message is processed, sentinelPreAction validates the input
Provider: The sentinelSafety provider adds THSP context to agent state
Action: Users can explicitly request safety checks via SENTINEL_SAFETY_CHECK
Post-action: Before responses are sent, sentinelPostAction validates outputs

Validation Approach

The plugin uses a dual-layer validation approach:

Layer 1: Heuristic Validation (Fast)

Pattern-based detection using regex for known harmful patterns:

TRUTH Gate: Detects deception attempts, role manipulation, fake identity claims
HARM Gate: Detects violence, hacking, malware, weapons, dangerous substances
SCOPE Gate: Detects jailbreak attempts, instruction overrides, prompt extraction
PURPOSE Gate: Detects purposeless destruction patterns

Layer 2: Seed Injection (Comprehensive)

The Sentinel seed is injected into the agent's system prompt, providing LLM-level understanding of the THSP protocol. This layer can detect nuanced threats that patterns cannot.
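
Conceptually, seed injection amounts to combining the seed text with the character's existing system prompt. The sketch below is an assumption-heavy illustration: `SketchCharacter`, `SKETCH_SEED`, and `injectSeed` are hypothetical, the seed text is a placeholder, and whether the plugin prepends, appends, or delimits the seed differently is not stated in this README.

```typescript
// Hypothetical minimal character shape (name and system prompt only).
interface SketchCharacter {
  name: string;
  system?: string;
}

const SKETCH_SEED = '[placeholder for the Sentinel seed text]';

// Assumption: the seed goes before the existing system prompt.
function injectSeed(character: SketchCharacter, seed: string): SketchCharacter {
  const existing = character.system ?? '';
  if (existing.includes(seed)) {
    return character; // already injected; avoid duplicating the seed
  }
  return { ...character, system: existing ? `${seed}\n\n${existing}` : seed };
}

const seededCharacter = injectSeed(
  { name: 'SafeAgent', system: 'You are a helpful assistant.' },
  SKETCH_SEED,
);
// Injecting twice leaves the prompt unchanged.
const seededTwice = injectSeed(seededCharacter, SKETCH_SEED);
```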

Important Limitations

Heuristic validation has inherent limitations:

Pattern Coverage: Only detects patterns explicitly defined. Novel attack vectors may not be caught.
PURPOSE Gate Gaps: Abstract concepts like "purposeless action" are difficult to detect via regex. Examples:
- "Drop the plate" (purposeless destruction) - may not be detected heuristically
- "Dirty the mirror" (pointless action) - relies on seed injection for detection
False Negatives: Slight variations in phrasing may bypass patterns:
- "How to hack..." → Detected ✓
- "How do I hack..." → May not be detected (pattern mismatch)
Context Blindness: Heuristics cannot understand context or intent.
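
The false-negative case above is easy to reproduce. The patterns in this sketch are hypothetical and written only to demonstrate the mismatch; they are not the plugin's actual regexes.

```typescript
// Hypothetical narrow pattern that only covers the "how to" phrasing.
const narrowPattern = /how\s+to\s+hack/i;
// A broader hypothetical variant that also accepts "how do I" and "how can I".
const broaderPattern = /how\s+(to|do\s+i|can\s+i)\s+hack/i;

const phrasings = ['How to hack a router', 'How do I hack a router'];

// The narrow pattern catches the first phrasing and misses the second.
const narrowHits = phrasings.map((text) => narrowPattern.test(text));
// The broader pattern catches both, but other rewordings would still slip through.
const broaderHits = phrasings.map((text) => broaderPattern.test(text));
```

Widening a pattern closes one gap at a time, which is why the heuristic layer is described as a fast first pass and not a complete defense.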

Recommendation: For maximum safety, rely on both layers:

Use heuristic validation for fast, low-latency checks
Rely on the injected seed as the comprehensive safety net for threats that patterns miss

Multi-Instance Support

When running multiple agents with different configurations:

import {
  sentinelPlugin,
  getPluginInstance,
  getPluginInstanceNames,
  getActivePluginInstance,
  removePluginInstance,
  clearPluginRegistry,
} from '@sentinelseed/elizaos-plugin';

// Create named instances
const strictPlugin = sentinelPlugin({
  instanceName: 'strict-agent',
  blockUnsafe: true,
  maxTextSize: 10 * 1024, // 10KB
});

const monitorPlugin = sentinelPlugin({
  instanceName: 'monitor-agent',
  blockUnsafe: false,
  logChecks: true,
});

// Access specific instance
const strictState = getPluginInstance('strict-agent');
const history = strictState?.validationHistory || [];

// List all instances
console.log(getPluginInstanceNames()); // ['strict-agent', 'monitor-agent']

// Get most recently created
const active = getActivePluginInstance();

// Cleanup
removePluginInstance('monitor-agent');
clearPluginRegistry(); // Remove all

Note: Exported utility functions like getValidationHistory() operate on the most recently created instance. For multi-instance scenarios, use getPluginInstance(name) to access specific instances.
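
The registry behavior described here (named instances, with the most recently created one treated as active) can be modeled with a Map. `SketchRegistry` and `SketchState` are hypothetical, as is the auto-generated name format; the plugin's registry may behave differently, for example when the active instance is removed.

```typescript
// Hypothetical per-instance state.
interface SketchState {
  blockUnsafe: boolean;
  validationHistory: string[];
}

class SketchRegistry {
  private instances = new Map<string, SketchState>();
  private activeName: string | undefined;
  private counter = 0;

  // Register a state; the newest registration becomes the active one.
  register(state: SketchState, instanceName?: string): string {
    const key = instanceName ?? `sentinel-${++this.counter}`; // hypothetical auto-name
    this.instances.set(key, state);
    this.activeName = key;
    return key;
  }

  get(instanceName: string): SketchState | undefined {
    return this.instances.get(instanceName);
  }

  getActive(): SketchState | undefined {
    return this.activeName === undefined ? undefined : this.instances.get(this.activeName);
  }

  // Map preserves insertion order, so names come back in creation order.
  names(): string[] {
    return Array.from(this.instances.keys());
  }

  remove(instanceName: string): boolean {
    if (this.activeName === instanceName) this.activeName = undefined; // simplification
    return this.instances.delete(instanceName);
  }
}

const sketchRegistry = new SketchRegistry();
sketchRegistry.register({ blockUnsafe: true, validationHistory: [] }, 'strict-agent');
sketchRegistry.register({ blockUnsafe: false, validationHistory: [] }, 'monitor-agent');
```

The sketch also shows why helpers that act on "the active instance" can surprise in multi-agent setups: creating a second instance silently changes which one they address.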

Error Handling

Handlers catch errors and return structured responses instead of throwing:

import { TextTooLargeError } from '@sentinelseed/elizaos-plugin';

// Text size errors include details
try {
  // ... validation
} catch (err) {
  if (err instanceof TextTooLargeError) {
    console.log(`Size: ${err.size}, Max: ${err.maxSize}`);
  }
}

// Action results include error data
const result = await action.handler(runtime, message);
if (!result.success) {
  console.log(result.error); // Error message
  console.log(result.data);  // { error: 'text_too_large', size, maxSize }
}
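
The two error-handling styles mentioned in this README (fail-open evaluators, structured action errors) can be contrasted in a short sketch. Everything here is a hypothetical stand-in: `SketchTextTooLargeError` mimics the size and maxSize fields shown above but is not the exported `TextTooLargeError`, and measuring size as UTF-8 bytes follows the "size in bytes" wording of the configuration section.

```typescript
// Local stand-in for illustration; the plugin exports its own TextTooLargeError.
class SketchTextTooLargeError extends Error {
  readonly size: number;
  readonly maxSize: number;

  constructor(size: number, maxSize: number) {
    super(`Text size ${size} bytes exceeds limit of ${maxSize} bytes`);
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof working on ES5 targets
    this.name = 'SketchTextTooLargeError';
    this.size = size;
    this.maxSize = maxSize;
  }
}

// Measure UTF-8 bytes, not characters: 'é' counts as two bytes.
function assertTextSize(text: string, maxSize: number): void {
  const size = new TextEncoder().encode(text).length;
  if (size > maxSize) throw new SketchTextTooLargeError(size, maxSize);
}

// Evaluator style: if the check itself throws, allow the content (fail open).
function evaluateFailOpen(check: () => boolean): boolean {
  try {
    return check();
  } catch {
    return true;
  }
}

interface SketchActionResult {
  success: boolean;
  error?: string;
  data?: { error: string; size: number; maxSize: number };
}

// Action style: report the failure as structured data instead of throwing.
function runSketchAction(text: string, maxSize: number): SketchActionResult {
  try {
    assertTextSize(text, maxSize);
    return { success: true };
  } catch (err) {
    if (err instanceof SketchTextTooLargeError) {
      return {
        success: false,
        error: err.message,
        data: { error: 'text_too_large', size: err.size, maxSize: err.maxSize },
      };
    }
    return { success: false, error: String(err) };
  }
}

const oversized = runSketchAction('héllo', 5); // 6 bytes in UTF-8
```

Fail-open keeps the agent responsive when validation itself breaks, at the cost of letting content through unchecked, so it is worth logging every such fallback.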

TypeScript Types

The plugin exports all necessary types:

import type {
  // Sentinel types
  SentinelPluginConfig,
  SafetyCheckResult,
  THSPGates,
  RiskLevel,
  GateStatus,
  ValidationContext,
  // Plugin state types
  SentinelLogger,      // For custom logger implementations
  PluginStateInfo,     // Return type of getPluginInstance()
  // Memory integrity types
  MemorySource,
  MemoryVerificationResult,
  IntegrityMetadata,
  MemoryIntegrityConfig,
  // ElizaOS types (for reference)
  Plugin,
  Action,
  Provider,
  Evaluator,
  Memory,
  State,
} from '@sentinelseed/elizaos-plugin';

Related Packages

sentinelseed - Core Sentinel SDK
mcp-server-sentinelseed - MCP Server

Resources

License

MIT - See LICENSE

Made with care by Sentinel Team | [email protected]
