@framers/agentos-ext-ml-classifiers
v0.3.1
ML-based content classifiers for AgentOS — toxicity, prompt injection, and NSFW detection via ONNX models or LLM fallback
ML-based content classifiers for @framers/agentos: toxicity, prompt-injection, and NSFW detection via local ONNX models with optional LLM fallback for low-confidence cases.
What it does
Runs incoming user messages and agent outputs through a chain of small classifiers. Each classifier returns a score; configurable thresholds gate downstream behavior. Local ONNX inference is fast and offline-friendly; the LLM fallback handles ambiguous cases when confidence is low.
Built-in classifiers:
- Toxicity (insults, threats, harassment)
- Prompt injection (jailbreak attempts, instruction-override patterns)
- NSFW content
- Keyword-based prefilter (zero-cost coarse triage)
- LLM-as-judge fallback (configurable model)
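The score-and-threshold gating described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the `Verdict`, `Thresholds`, and `ClassifierResult` types, the `gate` and `needsLlmFallback` helpers, and the idea of a separate `confidence` field are all assumptions made for the example. The real option names are defined in src/types.ts.

```typescript
// Hypothetical types for illustration only; see src/types.ts for the real options.
type Verdict = 'allow' | 'flag' | 'block';

interface Thresholds {
  flag: number;  // scores at or above this value are flagged
  block: number; // scores at or above this value are blocked
}

interface ClassifierResult {
  label: string;      // e.g. 'toxicity'
  score: number;      // 0..1, how strongly the classifier fired
  confidence: number; // 0..1, assumed measure of how sure the model is
}

// Map a raw score onto a verdict using the configured thresholds.
function gate(score: number, t: Thresholds): Verdict {
  if (score >= t.block) return 'block';
  if (score >= t.flag) return 'flag';
  return 'allow';
}

// Assumption: the fallback threshold is a minimum confidence; anything
// below it is considered ambiguous and handed to the LLM judge.
function needsLlmFallback(result: ClassifierResult, minConfidence: number): boolean {
  return result.confidence < minConfidence;
}
```

Under these assumptions, a result with `confidence: 0.4` and a fallback threshold of `0.6` would be routed to the LLM judge, while a confident local result would be gated directly.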
Install
npm install @framers/agentos-ext-ml-classifiers
Peer dependency: @framers/agentos.
Quickstart
import { AgentOS } from '@framers/agentos';
import { createMLClassifierGuardrail } from '@framers/agentos-ext-ml-classifiers';

const agentos = new AgentOS();
await agentos.initialize({
  extensionManifest: {
    packs: [
      {
        factory: () =>
          createMLClassifierGuardrail({
            classifiers: ['toxicity', 'prompt-injection', 'nsfw'],
            llmFallback: { enabled: true, threshold: 0.6 },
          }),
        enabled: true,
      },
    ],
  },
});
Public API
- createMLClassifierGuardrail(options?): factory returning an ExtensionPack
- createExtensionPack(context): auto-discoverable factory used by AgentOS extension auto-pickup
- createMLClassifierPack: alias for createMLClassifierGuardrail
See src/types.ts for MLClassifierOptions.
Examples
- test/: fixtures and threshold-tuning tests
Lazy loading and optional install
This package is an optional dependency of @framers/agentos-extensions-registry. The registry ships only catalog metadata; createCuratedManifest() calls import.meta.resolve() for each entry and silently skips any entry whose package is not installed. Running npm install @framers/agentos-ext-ml-classifiers is what makes this pack available to the registry.
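The resolve-and-skip behavior can be illustrated with a short sketch. It is not the registry's code: `CatalogEntry`, `Resolver`, and `selectInstalled` are hypothetical names, and the resolver is injected so the example does not depend on a particular runtime.

```typescript
// Hypothetical catalog shape; the real registry metadata may differ.
interface CatalogEntry {
  name: string;
  packageName: string;
}

// Any function that returns a resolved location or throws when the
// package cannot be found.
type Resolver = (specifier: string) => string;

// Keep only the entries whose package resolves; skip the rest silently.
function selectInstalled(catalog: CatalogEntry[], resolve: Resolver): CatalogEntry[] {
  return catalog.filter((entry) => {
    try {
      resolve(entry.packageName);
      return true;
    } catch {
      return false; // not installed: drop the entry without logging or failing
    }
  });
}
```

In an ES module on a recent Node.js release, the resolver would typically be `(s) => import.meta.resolve(s)`. How a missing package surfaces can vary by runtime, so treat the try/catch as an assumption about that behavior.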
The ONNX BERT classifiers (toxicity, prompt-injection, NSFW) do not load at activation. The pack registers a factory under the ml:classifier-orchestrator key in SharedServiceRegistry, and each model file enters the module graph only on the first classification that needs it. The keyword prefilter runs first at zero cost; the LLM fallback uses a separate factory gated by an optional requiredSecrets entry, so the descriptor is skipped if no provider key is configured.
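The register-a-factory, load-on-first-use pattern looks roughly like this. `LazyRegistry` is a hypothetical stand-in written for this example; it is not the SharedServiceRegistry API, whose actual method names are not shown in this README.

```typescript
// Hypothetical stand-in for a service registry that defers construction.
class LazyRegistry {
  private factories = new Map<string, () => unknown>();
  private instances = new Map<string, unknown>();

  // Registration stores the factory only; nothing heavy runs here.
  register<T>(key: string, factory: () => T): void {
    this.factories.set(key, factory);
  }

  // The factory runs on the first lookup, and its result is cached.
  get<T>(key: string): T {
    if (!this.instances.has(key)) {
      const factory = this.factories.get(key);
      if (!factory) throw new Error(`No factory registered for "${key}"`);
      this.instances.set(key, factory());
    }
    return this.instances.get(key) as T;
  }
}
```

In a real setup the factory would usually return a promise (for example from a dynamic `import()` of the model loader), so the cached value is the promise and concurrent first calls share one load.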
The guardrail registers with config.evaluateStreamingChunks = true and runs in Phase 2 of the two-phase dispatcher (parallel classifiers). Worst-action aggregation (BLOCK > FLAG > ALLOW) resolves conflicts when multiple classifiers fire on the same chunk.
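Worst-action aggregation amounts to taking the most severe action across all classifier results for a chunk. A minimal sketch follows; the `Action` type, the `SEVERITY` table, and the `worstAction` helper are names chosen for this example, and defaulting to ALLOW when no classifier fires is an assumption.

```typescript
type Action = 'ALLOW' | 'FLAG' | 'BLOCK';

// Higher number means more severe: BLOCK > FLAG > ALLOW.
const SEVERITY: Record<Action, number> = { ALLOW: 0, FLAG: 1, BLOCK: 2 };

// Reduce the per-classifier actions to the single most severe one.
// Assumption: an empty list means nothing fired, so the result is ALLOW.
function worstAction(actions: Action[]): Action {
  return actions.reduce<Action>(
    (worst, current) => (SEVERITY[current] > SEVERITY[worst] ? current : worst),
    'ALLOW',
  );
}
```

Because the reduction is order-independent, it does not matter which of the parallel classifiers finishes first.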
For the full DI model and end-to-end walkthrough, see How extensions stay optional and lazy and the auto-loading guide.
License
Apache 2.0 — see the repo root LICENSE.
