@futurespeak-ai/anti-sycophancy
v1.0.0
A runtime circuit breaker that prevents AI assistants from becoming yes-men. Tracks 6 adaptive style dimensions, detects sycophancy drift, and triggers hard resets.
Anti-Sycophancy Engine
A runtime circuit breaker that prevents AI assistants from becoming yes-men.
The Problem
RLHF trains language models to produce outputs that humans rate highly. Humans rate agreement highly. The result is a slow, invisible drift: the model learns to agree, validate, and please -- even when the honest answer is "no," "you're wrong," or "I don't know."
Users reinforce this. Positive feedback for agreement. Silence or negative feedback for pushback. Over time, trust erodes -- not because the model is wrong, but because the user can no longer tell when it's right. The assistant becomes a mirror that only shows the user what they want to see.
This is the sycophancy trap. It is the single most corrosive failure mode in human-AI interaction.
The Anti-Sycophancy Engine is an architectural response: continuous monitoring, adaptive calibration, and a hard circuit breaker that fires when drift crosses the line.
How It Works
Six Adaptive Dimensions
The engine tracks six independent style dimensions, each on a 0.0 to 1.0 scale:
| Dimension | Default | What it controls |
|-----------|---------|------------------|
| Formality | 0.50 | Professional vs. casual tone |
| Verbosity | 0.50 | Detailed explanations vs. terse answers |
| Humor | 0.50 | Playful wit vs. straight earnestness |
| Technical Depth | 0.50 | Implementation details vs. plain language |
| Emotional Warmth | 0.60 | Warm and expressive vs. composed and factual |
| Proactivity | 0.60 | Volunteers suggestions vs. waits to be asked |
Each dimension adapts based on two signal types:
- Explicit signals -- direct user corrections like "be more concise" or "give me the code." These carry 4x the weight of implicit signals.
- Implicit signals -- patterns inferred from message characteristics: word count, technical jargon, casual markers, response timing.
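The 4x weighting can be sketched as follows, using the default weights from the configuration (explicit 0.08, implicit 0.02) and the hard floor/ceiling (0.05/0.95). `applySignal` is an illustrative helper, not part of the package API:

```javascript
// Illustrative helper showing explicit vs. implicit weighting.
// Default weights: explicit 0.08 = 4x the implicit 0.02.
const EXPLICIT_WEIGHT = 0.08;
const IMPLICIT_WEIGHT = 0.02;

function applySignal(value, direction, isExplicit) {
  const weight = isExplicit ? EXPLICIT_WEIGHT : IMPLICIT_WEIGHT;
  // Clamp to the hard floor/ceiling so no signal can pin a dimension.
  return Math.min(0.95, Math.max(0.05, value + direction * weight));
}

// "be more concise" -> explicit signal pushing verbosity down
let verbosity = applySignal(0.5, -1, true);    // ≈ 0.42
// a terse 3-word message -> implicit signal, a quarter of the effect
verbosity = applySignal(verbosity, -1, false); // ≈ 0.40
```

A direct correction moves a dimension four times as far as an inferred pattern, so the user's stated preferences always dominate the engine's guesses.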
The Circuit Breaker
The engine continuously monitors two metrics:
- Agreement streak -- consecutive signals classified as positive sentiment without any correction, disagreement, or explicit adjustment.
- Positivity bias -- an exponential moving average that tracks the ratio of agreeable-to-challenging interactions.
When both thresholds are crossed simultaneously (default: streak >= 8, bias >= 85%), the engine fires:
- Agreement streak resets to zero
- Positivity bias resets to 0.5 (neutral)
- Violation counter increments
- Emotional warmth clamps back to 0.6 if it drifted above 0.7
- Humor clamps back to 0.6 if it drifted above 0.7
This is not a suggestion. It is a hard reset. The agent cannot charm its way past the boundary.
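The breaker logic above can be sketched as follows. The EMA smoothing factor and the state shape are assumptions for illustration, not the package's internals; the thresholds and reset values match the documented defaults:

```javascript
// Sketch of the circuit breaker: streak + EMA bias, hard reset on trip.
const ALPHA = 0.2;        // assumed EMA smoothing factor
const STREAK_LIMIT = 8;   // default agreement-streak threshold
const BIAS_LIMIT = 0.85;  // default positivity-bias threshold

function update(state, positive) {
  state.streak = positive ? state.streak + 1 : 0;
  // Exponential moving average of agreeable vs. challenging turns.
  state.bias = ALPHA * (positive ? 1 : 0) + (1 - ALPHA) * state.bias;

  if (state.streak >= STREAK_LIMIT && state.bias >= BIAS_LIMIT) {
    // Hard reset: both metrics clear and drifted dimensions clamp back.
    state.streak = 0;
    state.bias = 0.5;
    state.violations += 1;
    if (state.warmth > 0.7) state.warmth = 0.6;
    if (state.humor > 0.7) state.humor = 0.6;
  }
  return state;
}

let s = { streak: 0, bias: 0.5, violations: 0, warmth: 0.8, humor: 0.5 };
for (let i = 0; i < 20; i++) s = update(s, true); // 20 agreeable turns
// Under these assumptions the breaker fires twice: once at turn 8
// and again at turn 16, clamping the drifted warmth back to 0.6.
```

Note that both conditions must hold at once: a long streak with balanced history, or a high bias without a current streak, does not trip the breaker.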
Proactivity Safety Floor
Even if the user repeatedly dismisses check-ins, the engine enforces a safety floor (default: 0.3) for critical items. The agent will always speak up about things that matter, regardless of how much the user has trained it to be quiet.
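The floor rule reduces to one comparison; a minimal sketch, assuming the default floor of 0.3 (illustrative helper, mirroring the documented `getEffectiveProactivity(isCritical)` behavior):

```javascript
// Sketch of the safety floor: critical items can never be silenced
// below the floor; routine check-ins honor the learned preference.
const PROACTIVITY_FLOOR = 0.3;

function effectiveProactivity(proactivity, isCritical) {
  return isCritical ? Math.max(proactivity, PROACTIVITY_FLOOR) : proactivity;
}

// A user who has trained proactivity down to 0.1:
effectiveProactivity(0.1, false); // 0.1 -- routine check-ins stay quiet
effectiveProactivity(0.1, true);  // 0.3 -- critical items still surface
```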
Install
npm install @asimov-federation/anti-sycophancy
Quick Start
import { CalibrationEngine } from '@asimov-federation/anti-sycophancy';
const engine = new CalibrationEngine();
// Process user messages -- signals are detected automatically
engine.processUserMessage('Be more concise please');
engine.processUserMessage('How do I write an async function in Node?');
// Check current dimensions
console.log(engine.getDimensions());
// { formality: 0.5, verbosity: 0.436, humor: 0.5, technicalDepth: 0.507, ... }
// Get style hints for your system prompt
const hints = engine.getPromptContext();
// "## Learned Style Preferences\n- Be extremely concise..."
// Human-readable explanation of adaptation
console.log(engine.getCalibrationExplanation());
// Check if sycophancy boundary was hit
const syc = engine.getSycophancyState();
console.log(`Violations: ${syc.violations}`);
API Reference
Exports
import {
CalibrationEngine,
detectExplicitSignal,
detectImplicitSignals,
buildCalibrationHints,
} from '@asimov-federation/anti-sycophancy';
new CalibrationEngine(config?)
Creates a new calibration engine instance.
const engine = new CalibrationEngine({
// Pluggable storage interface. Default: in-memory Map.
storage: {
async get(key) { /* return stored value or null */ },
async set(key, value) { /* persist value */ },
},
// Override detection thresholds
thresholds: {
agreementStreak: 8, // Consecutive agreements before alarm
positivityBias: 0.85, // Positivity ratio before alarm
proactivityFloor: 0.3, // Minimum proactivity for critical items
explicitWeight: 0.08, // How much explicit signals move dimensions
implicitWeight: 0.02, // How much implicit signals move dimensions
decayHalfLifeDays: 14, // Days for calibration to decay toward defaults
dimensionFloor: 0.05, // Minimum dimension value
dimensionCeiling: 0.95, // Maximum dimension value
},
});
Instance Methods
| Method | Returns | Description |
|--------|---------|-------------|
| initialize() | Promise<void> | Load persisted state from storage |
| processUserMessage(text, responseTimeMs?) | void | Auto-detect and apply signals from user text |
| recordSignal(signal) | void | Manually record a calibration signal |
| recordDismissal() | void | Record that user dismissed a check-in |
| recordEngagement() | void | Record that user engaged with a check-in |
| incrementSession() | void | Increment session counter, apply decay |
| getDimensions() | object | Get current dimension values |
| getState() | object | Get full calibration state (deep copy) |
| getSycophancyState() | object | Get sycophancy tracking metrics |
| getDismissalRate() | number | Get current check-in dismissal rate |
| getEffectiveProactivity(isCritical) | number | Get proactivity with safety floor logic |
| getHistory() | array | Get calibration change history |
| getPromptContext() | string | Get markdown style hints for system prompt |
| getCalibrationExplanation() | string | Get human-readable adaptation summary |
| resetDimension(name) | void | Reset one dimension to default |
| resetAll() | void | Reset all dimensions and clear history |
| flush() | Promise<void> | Force immediate save to storage |
detectExplicitSignal(text)
Scans user text for explicit style corrections. Returns a signal type string or null.
Recognized patterns:
- Formality: "be more formal," "be casual," "chill," "relax"
- Verbosity: "be more concise," "elaborate," "too long," "go deeper"
- Humor: "be funny," "no jokes," "lighten up," "be serious"
- Technical: "give me the code," "eli5," "plain english"
- Warmth: "be warmer," "just the facts," "more empathy"
- Proactivity: "be more proactive," "leave me alone," "stop reminding"
detectImplicitSignals(text, responseTimeMs?)
Infers signals from message characteristics. Returns an array of signal type strings.
Detected patterns:
- short_response -- 5 words or fewer
- long_response -- 50+ words
- technical_question -- contains code keywords (function, async, API, docker, etc.)
- casual_chat -- contains casual markers (lol, dude, btw, etc.) without tech markers
- fast_followup -- response under 5 seconds
- slow_followup -- response over 60 seconds
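The heuristics above might look roughly like this illustrative re-implementation; the keyword lists are abbreviated assumptions, not the package's full sets:

```javascript
// Illustrative sketch of the implicit-signal heuristics.
function detectImplicitSignals(text, responseTimeMs) {
  const signals = [];
  const words = text.trim().split(/\s+/).filter(Boolean);
  if (words.length <= 5) signals.push('short_response');
  if (words.length >= 50) signals.push('long_response');

  // Abbreviated keyword lists -- the package recognizes more markers.
  const technical = /\b(function|async|api|docker)\b/i.test(text);
  const casual = /\b(lol|dude|btw)\b/i.test(text);
  if (technical) signals.push('technical_question');
  if (casual && !technical) signals.push('casual_chat');

  if (responseTimeMs !== undefined) {
    if (responseTimeMs < 5000) signals.push('fast_followup');
    if (responseTimeMs > 60000) signals.push('slow_followup');
  }
  return signals;
}

detectImplicitSignals('lol ok', 2000);
// -> ['short_response', 'casual_chat', 'fast_followup']
```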
buildCalibrationHints(dimensions)
Converts dimension values into natural-language style instructions for system prompt injection.
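A hypothetical sketch of that mapping is below. The band boundaries and most of the phrasing are assumptions (only the header and the "Be extremely concise" line are documented in the Quick Start output), and the helper is named `buildHintsSketch` to distinguish it from the real export:

```javascript
// Hypothetical dimension -> instruction mapping; bands and wording
// are assumptions, not the package's actual rules.
function buildHintsSketch(dimensions) {
  const lines = [];
  if (dimensions.verbosity < 0.35) lines.push('- Be extremely concise.');
  if (dimensions.verbosity > 0.65) lines.push('- Explain in full detail.');
  if (dimensions.formality < 0.35) lines.push('- Keep the tone casual.');
  if (dimensions.technicalDepth > 0.65) lines.push('- Prefer code and precise terminology.');
  // Near-default dimensions contribute nothing, so a fresh engine
  // yields an empty string and the system prompt stays untouched.
  return lines.length ? '## Learned Style Preferences\n' + lines.join('\n') : '';
}
```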
The Six Dimensions Explained
Formality (0.0 = casual, 1.0 = professional)
Controls tone register. Low formality means contractions, informal phrasing, conversational rhythm. High formality means polished prose, no slang, structured communication.
Verbosity (0.0 = terse, 1.0 = thorough)
Controls response length and detail depth. Low verbosity means every word earns its place. High verbosity means full elaboration, examples, and context.
Humor (0.0 = earnest, 1.0 = playful)
Controls wit and levity. Low humor means straight, focused delivery. High humor means jokes, wordplay, and tonal lightness where appropriate.
Technical Depth (0.0 = plain language, 1.0 = implementation detail)
Controls abstraction level. Low depth means analogies, plain English, conceptual explanations. High depth means code snippets, precise terminology, and architectural detail.
Emotional Warmth (0.0 = composed, 1.0 = expressive)
Controls affective presence. Low warmth means competent and factual. High warmth means empathetic, caring, and emotionally attuned.
Proactivity (0.0 = reactive, 1.0 = anticipatory)
Controls initiative. Low proactivity means the agent waits to be asked. High proactivity means volunteering suggestions, surfacing context, and checking in. The safety floor ensures the agent always speaks up about critical items regardless of this setting.
Integration Guide
With an LLM System Prompt
import { CalibrationEngine } from '@asimov-federation/anti-sycophancy';
const engine = new CalibrationEngine();
function buildSystemPrompt(basePrompt) {
const calibrationHints = engine.getPromptContext();
if (!calibrationHints) return basePrompt;
return `${basePrompt}\n\n${calibrationHints}`;
}
// On each user message:
function handleUserMessage(text) {
engine.processUserMessage(text);
const prompt = buildSystemPrompt(MY_BASE_PROMPT);
// Pass prompt to your LLM...
}
With Persistent Storage (Redis example)
import { CalibrationEngine } from '@asimov-federation/anti-sycophancy';
const engine = new CalibrationEngine({
storage: {
async get(key) {
const raw = await redis.get(`calibration:${key}`);
return raw ? JSON.parse(raw) : null;
},
async set(key, value) {
await redis.set(`calibration:${key}`, JSON.stringify(value));
},
},
});
await engine.initialize(); // Load persisted state
With the Epistemic Score
import { CalibrationEngine } from '@asimov-federation/anti-sycophancy';
import { EpistemicScorer } from '@asimov-federation/epistemic-score';
const calibration = new CalibrationEngine();
const epistemic = new EpistemicScorer();
// Both engines process the same conversation,
// measuring different aspects of interaction health.
Theoretical Basis
The Anti-Sycophancy Engine is grounded in three observations:
RLHF reward hacking: Models trained on human preference data learn that agreement is rewarded. This creates a systematic bias toward validation over accuracy. The circuit breaker treats this as a measurable, correctable failure mode rather than an inherent limitation.
Adaptive calibration with bounded drift: Style preferences are legitimate and should be honored. But adaptation must be bounded. Dimensions have hard floors (0.05) and ceilings (0.95). Decay pulls everything back toward defaults over time. The system cannot permanently drift in any direction.
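The decay can be sketched as an exponential half-life pull toward the default, using the config's `decayHalfLifeDays` of 14; the exact formula is an assumption, and the package may implement decay differently:

```javascript
// Half-life decay toward the default: after 14 days half the drift
// remains; after 28 days, a quarter.
function decayTowardDefault(value, defaultValue, daysElapsed, halfLifeDays = 14) {
  const retained = Math.pow(0.5, daysElapsed / halfLifeDays);
  return defaultValue + (value - defaultValue) * retained;
}

decayTowardDefault(0.9, 0.5, 14); // ≈ 0.7 (half the drift remains)
decayTowardDefault(0.9, 0.5, 28); // ≈ 0.6
```

Combined with the 0.05/0.95 clamps, this guarantees every dimension lives in a bounded band and drifts back to neutral unless the user keeps reinforcing the preference.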
The proactivity paradox: Users who dismiss check-ins the most are often the ones who need them most. The safety floor ensures that an agent trained to "leave me alone" will still speak up when something critical requires attention.
This engine implements Section 7 (Behavioral Integrity) of the cLaw Specification -- the constitutional framework for AI agent governance.
Credits
Built by FutureSpeak.AI as part of the Asimov Federation.
Extracted from Agent Friday's personality calibration subsystem within Asimov's Mind -- a modular AI operating system built on the cLaw Specification.
Co-authored with Claude Opus 4.6 (Anthropic).
License
MIT
