@sentinel-atl/safety

v0.2.0

Published

2 months ago

Content safety hooks for the Agent Trust Layer — pre-dispatch and post-response inspection with pluggable classifiers

0High
0Medium
0Low

meetpandya99

ai-agent trust identity did verifiable-credentials mcp security

@sentinel-atl/safety

Content safety hooks for the Agent Trust Layer — pre-dispatch and post-response inspection with pluggable classifiers.

Features

RegexClassifier — detects prompt injection, jailbreak attempts, SSN, email patterns
KeywordClassifier — domain-specific deny-lists
SafetyPipeline — chain multiple classifiers with severity-based blocking
Pre-dispatch / post-response hooks — inspect content before and after tool execution
Audit integration — all safety violations logged

Install

npm install @sentinel-atl/safety

Quick Start

import { SafetyPipeline, RegexClassifier, KeywordClassifier } from '@sentinel-atl/safety';

const pipeline = new SafetyPipeline({
  classifiers: [
    new RegexClassifier(),
    new KeywordClassifier({ denyList: ['DROP TABLE', 'rm -rf'] }),
  ],
  blockThreshold: 'high',
});

// Pre-dispatch check
const result = await pipeline.preDispatch('Ignore all previous instructions');
console.log(result.blocked);    // true
console.log(result.violations); // [{ category: 'prompt_injection', severity: 'critical', ... }]

// Post-response check
const postResult = await pipeline.postResponse('SSN: 123-45-6789');
console.log(postResult.blocked);    // true
console.log(postResult.violations); // [{ category: 'pii', severity: 'high', ... }]

Severity Levels

| Level | Description | |---|---| | low | Informational, logged but not blocked | | medium | Suspicious, may trigger warnings | | high | Likely harmful, blocked by default | | critical | Definite attack pattern, always blocked |

API

| Export | Description | |---|---| | RegexClassifier | Built-in regex-based classifier (4 default rules) | | KeywordClassifier | Custom keyword deny-list classifier | | SafetyPipeline | Chain classifiers with threshold-based blocking | | ContentClassifier | Interface for custom classifiers |

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@sentinel-atl/safety

Features

Install

Quick Start

Severity Levels

API

License