TypeScript Sampling Library
Pure TypeScript implementations of sampling algorithms for LLM token generation with exact llama.cpp parity.
npm install @lloyal-labs/tsampler
Performance: ~3-5ms per token (vs ~0.1-0.3ms native)
Trade-off: ~6-11% overhead for full flexibility, OTA updates, and transparency
Table of Contents
- Quick Start
- Why TypeScript Sampling?
- API Reference
- Usage Patterns
- Test-Time Alignment (TTA)
- Architecture
- Dynamic Workspace Capacity
- Capacity Calculation Helper
- Example 1: Adaptive Sampling Based on Entropy
- Example 2: Constraint-Based Logit Steering
- Example 3: Medical Report Validation
- Example 4: Exploratory Bursts
- Example 5: Perplexity Monitoring & Quality Tracking
- Performance Characteristics
- When to Use TTA
- Invariants and Guarantees
- Performance Characteristics
- Testing
Quick Start
import {
TokenHistoryTracker,
sampleWithStrategy,
Xoroshiro128Plus,
} from '@lloyal-labs/tsampler';
// Create instances (once per completion)
const seed = samplingParams?.seed ?? Date.now();
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);
// Get logits from native layer
const logitsBuffer = native.getLogits();
const logits = new Float32Array(logitsBuffer);
// Sample with combined strategy
const tokenId = sampleWithStrategy(logits, {
tokenHistory,
params: {
temperature: 0.8,
topK: 40,
topP: 0.95,
minP: 0.05,
penaltyRepeat: 1.1,
},
prng, // Instance PRNG (no global state)
});
// Accept token for history tracking
tokenHistory.accept(tokenId);
Why TypeScript Sampling?
Advantages
- ✅ Test-Time Alignment (TTA): Fuse app-state with sampling strategy at every token step
- ✅ OTA Updates: Sampling logic can evolve without C++ changes
- ✅ Custom Strategies: Easy to implement domain-specific sampling
- ✅ Logit Steering: Apply domain constraints, validation rules, and business logic
- ✅ Transparency: Full visibility into token probabilities and decisions
- ✅ Debugging: Inspect logits, penalties, and sampling in real-time
- ✅ Exact Control: Penalties match llama.cpp exactly (no black box)
- ✅ Grammar Support: Integrates seamlessly with GBNF constraints
When to Use Native Sampling
Use native C++ sampling (samplingPath: 'native') when you need:
- Maximum performance (performance-critical applications)
- Simple greedy/top-k sampling without advanced features
- Battery-constrained devices where every millisecond matters
API Reference
Core Functions
sampleWithStrategy()
Main entry point for sampling with combined strategy.
function sampleWithStrategy(logits: Float32Array, opts: SampleOptions): number;
interface SampleOptions {
tokenHistory: TokenHistoryTracker;
params?: SamplingParams;
workspace?: SamplerWorkspace;
mode?: SamplerMode; // 'fast' (default) or 'parity'
prng?: Xoroshiro128Plus; // Instance PRNG (recommended)
}
Parameters:
- logits: Logits array (zero-copy, NOT modified)
- opts.tokenHistory: Token history tracker for penalties
- opts.params: Sampling parameters (see Sampling Parameters)
- opts.workspace: Preallocated buffers (optional, reuse across tokens for zero-alloc)
- opts.mode: 'fast' (default, O(V log K)) or 'parity' (O(V), exact llama.cpp verification)
- opts.prng: Instance PRNG for deterministic sampling (recommended, avoids multi-stepper collisions)
Returns: Sampled token ID
Sampler Chain Order (matches llama.cpp):
- Penalties (virtual, via accessor)
- Top-K
- Typical-P (if typicalP < 1.0)
- Top-P (if topP < 1.0)
- Min-P (if minP > 0.0)
- Top-N-Sigma (if topNSigma > 0.0)
- Temperature (applied AFTER all structural filters)
- Sample (with renormalization)
Fast-Paths:
- Greedy: temperature < 1e-3 or topK === 1 → argmax selection (skips all filters)
- With penalties: Applies penalties even in greedy mode (correctness critical)
greedy()
Argmax selection (deterministic).
function greedy(logits: Float32Array): number;
Returns token with highest logit value. O(V) single-pass.
getTopCandidates()
Get top-N candidates with probabilities (useful for visualization).
function getTopCandidates(
logits: Float32Array,
n: number = 10
): Array<{ tokenId: number; probability: number }>;
Performance: O(V log N) using heap-based selection (not full sort).
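Usage sketch (logits comes from the native layer as in Quick Start; detokenize is a hypothetical display helper, not part of this package):
import { getTopCandidates } from '@lloyal-labs/tsampler';

const top5 = getTopCandidates(logits, 5);
for (const { tokenId, probability } of top5) {
  console.log(`${tokenId} → ${detokenize(tokenId)}: ${(probability * 100).toFixed(1)}%`);
}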
Sampling Parameters
interface SamplingParams {
// Structural filters
topK?: number; // Default: 40
topP?: number; // Default: 0.95
minP?: number; // Default: 0.05
typicalP?: number; // Default: 1.0 (disabled)
topNSigma?: number; // Default: -1.0 (disabled)
// Temperature (applied after all filters)
temperature?: number; // Default: 0.8
// Penalties
penaltyRepeat?: number; // Default: 1.1
penaltyFreq?: number; // Default: 0.0
penaltyPresent?: number; // Default: 0.0
// Determinism
seed?: number; // Default: Date.now()
}
Parameter Details
topK - Top-K sampling
- Keep top K tokens by logit value
- Common values: 40-80
- topK = 0 (fast mode): Pre-truncate to 256 for performance
- topK >= vocab_size: No truncation (preserve full vocabulary in token ID order)
- Greedy: topK = 1
topP - Nucleus (top-p) sampling
- Keep smallest set where cumulative probability ≥ p
- Common values: 0.9-0.95
- Adapts to distribution shape (dynamic K)
- topP >= 1.0: Disabled
- topP <= 0.0: Greedy by probability (argmax)
minP - Minimum probability threshold (relative)
- Filter tokens where prob < max_prob * minP
- Common values: 0.05-0.1
- Adapts to distribution confidence
- minP <= 0.0: Disabled
- Always keeps argmax (highest prob token)
typicalP - Locally typical sampling
- Keep tokens with "locally typical" information content
- Filters tokens whose entropy diverges from expected
- Common values: 0.95 (disabled by default: 1.0)
- Requires larger candidate pool (512 vs 256) for stable entropy
- typicalP >= 1.0: Disabled
topNSigma - Statistical filtering
- Keep tokens within N standard deviations of max logit
- Statistical approach to filtering unlikely tokens
- Common values: 2.0 (disabled by default: ≤ 0)
- topNSigma <= 0.0: Disabled (NOT greedy, per llama.cpp PR#13264)
temperature - Temperature scaling
- Controls randomness of distribution
- temp > 1.0: Flatter distribution (more random)
- temp = 1.0: No change
- temp < 1.0: Sharper distribution (more deterministic)
- temp → 0: Approaches greedy
- temp < 1e-3: Triggers greedy fast-path
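A standalone sketch of the standard scaling formula the list above describes (illustration only, not the library's internal code):
// Divide logits by temperature before softmax: >1 flattens, <1 sharpens.
function scaleByTemperature(logits: Float32Array, temp: number): Float32Array {
  const out = new Float32Array(logits.length);
  for (let i = 0; i < logits.length; i++) out[i] = logits[i] / temp;
  return out;
}
// e.g. logits [2, 1, 0]: temp 0.5 → [4, 2, 0] (sharper); temp 2.0 → [1, 0.5, 0] (flatter)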
penaltyRepeat - Repetition penalty
- Multiplicative penalty for tokens in history
- Formula: logit *= penalty (if logit ≤ 0), logit /= penalty (if logit > 0)
- Common values: 1.0-1.5
- Default: 1.1
penaltyFreq - Frequency penalty
- Subtractive penalty scaled by token count
- Formula: logit -= count * penalty
- Common values: 0.0-2.0
- Default: 0.0
penaltyPresent - Presence penalty
- Flat penalty if token appears at least once
- Formula: logit -= penalty (if token in history)
- Common values: 0.0-1.0
- Default: 0.0
seed - RNG seed for deterministic sampling
- Fixed seed → identical token sequence
- Omit for non-deterministic (uses Date.now())
- See Deterministic Sampling
Penalties
TokenHistoryTracker
Manages token history with sliding window and frequency tracking.
class TokenHistoryTracker {
constructor(penaltyLastN: number);
// Accept token into history
accept(token: number): void;
// Get occurrence count
getCount(token: number): number;
// Check if token exists
hasToken(token: number): boolean;
// Reset history
reset(): void;
// Get window size
size(): number;
// Get unique tokens (for sparse iteration)
getUniqueTokens(): number[];
// Compute penalty adjustment for single token
computeAdjustment(
tokenId: number,
baseLogit: number,
params: {
repeat?: number;
frequency?: number;
presence?: number;
}
): number;
// Check if penalties would modify logits
static hasPenalties(params: {
repeat?: number;
frequency?: number;
presence?: number;
}): boolean;
}
Sliding Window: Maintains last N tokens with O(1) operations
Frequency Map: Tracks token counts for efficient penalty application
Sparse Iteration: Only processes tokens in history (H ≈ 10-50, not V = 65,536)
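A minimal usage sketch of the tracker API above (token IDs, logits, and penalty values are made up for illustration):
import { TokenHistoryTracker } from '@lloyal-labs/tsampler';

const history = new TokenHistoryTracker(64); // penalize over the last 64 tokens
history.accept(1042);
history.accept(1042);
history.accept(7);

history.getCount(1042); // 2 occurrences in the window
history.hasToken(7);    // true

// Penalty-adjusted logit for a single token, given its raw logit
const adjusted = history.computeAdjustment(1042, 3.2, {
  repeat: 1.1,
  frequency: 0.2,
  presence: 0.1,
});

// Skip penalty work entirely when all penalties are neutral
if (!TokenHistoryTracker.hasPenalties({ repeat: 1.0, frequency: 0, presence: 0 })) {
  // no-op: penalties would not change any logits
}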
Penalty Formulas (llama.cpp exact)
Implementation Reference: penalties.ts lines 135-157 (computeAdjustment method)
Repetition Penalty (multiplicative, sign-dependent):
if (logit <= 0) {
logit *= penalty_repeat; // Multiply for negative/zero
} else {
logit /= penalty_repeat; // Divide for positive
}
Source: llama.cpp llama_sampler_penalties_apply (src/llama-sampling.cpp:1768-1772)
Rationale: The sign-dependent formula fixes a bug in the original academic formulation, where dividing a negative logit by the penalty would incorrectly increase its probability. llama.cpp's corrected implementation multiplies negative logits by the penalty and divides positive ones, which preserves relative ordering and always decreases the penalized token's probability.
Frequency Penalty (additive, count-scaled):
logit -= count * penalty_freq;
Source: llama.cpp line 1774 (first term)
Linear penalty proportional to occurrence count. Common in OpenAI/HuggingFace APIs.
Presence Penalty (additive, binary):
logit -= (count > 0 ? 1 : 0) * penalty_present;
Source: llama.cpp line 1774 (second term)
Binary penalty - same reduction whether token appeared once or many times.
Application Order: repeat → frequency → presence (matches llama.cpp line 1764)
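Hand-worked example of that order (illustrative numbers): token 1042 appeared 3 times in the window, raw logit 2.0, penaltyRepeat = 1.1, penaltyFreq = 0.2, penaltyPresent = 0.4.
let logit = 2.0;
logit /= 1.1;     // 1. repeat: positive logit → divide        ≈ 1.818
logit -= 3 * 0.2; // 2. frequency: scaled by count (3)         ≈ 1.218
logit -= 0.4;     // 3. presence: flat, token appeared ≥ once  ≈ 0.818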
Formula Semantics Comparison:
| Library               | Repetition Penalty              | Frequency/Presence |
| --------------------- | ------------------------------- | ------------------ |
| llama.cpp (this impl) | Multiplicative, sign-dependent  | Additive           |
| OpenAI API            | Not supported                   | Additive (same)    |
| HuggingFace           | Purely divisive (no sign check) | Additive           |
Key Difference: HuggingFace's repetition_penalty always divides, which incorrectly increases probability for negative logits. llama.cpp (and this implementation) fixes this with sign-dependent logic.
References:
- llama.cpp implementation: src/llama-sampling.cpp:1747-1778
- Academic paper bug fix: See llama.cpp comment at line 1766
- OpenAI comparison: API docs - frequency_penalty
Low-Level Penalty Functions
// Apply all penalties in correct order
function applyPenalties(
logits: Float32Array,
tokenHistory: TokenHistoryTracker,
params: {
repeat?: number;
frequency?: number;
presence?: number;
}
): void;
// Individual penalty functions
function applyRepetitionPenalty(logits, tokenHistory, penalty): void;
function applyFrequencyPenalty(logits, tokenHistory, penalty): void;
function applyPresencePenalty(logits, tokenHistory, penalty): void;
Note: sampleWithStrategy() applies penalties virtually (zero-copy) via accessor function. Low-level functions modify logits in-place.
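A sketch of in-place use (assumes logits and tokenHistory set up as in Quick Start); copy the buffer first, since unlike sampleWithStrategy() these helpers mutate their input:
import { applyPenalties } from '@lloyal-labs/tsampler';

const penalized = new Float32Array(logits); // defensive copy: applyPenalties mutates in-place
applyPenalties(penalized, tokenHistory, {
  repeat: 1.1,
  frequency: 0.2,
  presence: 0.1,
});
// `penalized` now carries repeat → frequency → presence adjustments; `logits` is untouched.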
Deterministic Sampling
Same seed = identical token sequence (critical for reproducibility).
Instance PRNG (Recommended)
import {
Xoroshiro128Plus,
sampleWithStrategy,
TokenHistoryTracker,
} from '@lloyal-labs/tsampler';
// Create PRNG instance (once per completion)
const seed = samplingParams?.seed ?? Date.now();
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);
// Use in sampling
const tokenId = sampleWithStrategy(logits, {
tokenHistory,
params: { temperature: 0.8 },
prng, // Instance PRNG (no global state)
});
// Generate random numbers directly (advanced use)
const rand = prng.next(); // Returns [0, 1)
Algorithm: Xoroshiro128+ (fast, high-quality, passes BigCrush tests)
Benefits:
- No global state: Multiple samplers can run concurrently without collision
- Explicit lifecycle: PRNG lifetime tied to sampler instance
- Reproducible: Same seed produces identical sequences
Seed Defaults:
- Provided: Use exact seed value
- Omitted: Date.now() (millisecond precision)
Deterministic Replay: Snapshots preserve samplingParams (including seed) for exact replay.
Legacy Global PRNG (Deprecated)
import { initializePRNG, random, resetPRNG } from '@lloyal-labs/tsampler';
// ⚠️ DEPRECATED: Use instance PRNG instead
initializePRNG(seed);
const rand = random();
Note: Global PRNG still works for backward compatibility but is deprecated. Use instance PRNG for new code.
Advanced Filters
For custom sampling strategies, use filters directly:
import {
applyTopK,
applyTypicalP,
applyTopP,
applyMinP,
applyTopNSigma,
applyTemperature,
sampleFromSet,
CandidateSet,
SamplerWorkspace,
} from './sampling';
Filter Functions
All filters operate on CandidateSet (zero-copy wrapper around workspace buffers).
// CandidateSet: Zero-copy wrapper
class CandidateSet {
indices: Uint32Array; // Token IDs
logits: Float32Array; // Logit values
probs: Float32Array; // Probabilities (softmax)
count: number; // Active candidates [0, count)
}
// Top-K filter
function applyTopK(
logits: Float32Array,
k: number,
ws: SamplerWorkspace,
penaltyAccessor?: (tokenId: number, baseLogit: number) => number
): CandidateSet;
// Typical-P filter
function applyTypicalP(
set: CandidateSet,
p: number,
minKeep: number = 1
): CandidateSet;
// Top-P (nucleus) filter
function applyTopP(
set: CandidateSet,
p: number,
workspace?: SamplerWorkspace
): CandidateSet;
// Min-P filter
function applyMinP(set: CandidateSet, threshold: number): CandidateSet;
// Top-N-Sigma filter
function applyTopNSigma(set: CandidateSet, n: number): CandidateSet;
// Temperature scaling
function applyTemperature(set: CandidateSet, temp: number): CandidateSet;
// Sample from final set
function sampleFromSet(set: CandidateSet): number;
Custom Filter Chain Example
const workspace = new SamplerWorkspace(256);
// Build custom chain
let candidates = applyTopK(logits, 40, workspace);
candidates = applyTopP(candidates, 0.95);
candidates = applyMinP(candidates, 0.05);
candidates = applyTemperature(candidates, 0.8); // ALWAYS LAST
// Sample from final distribution
const tokenId = sampleFromSet(candidates);
CRITICAL: Temperature MUST be applied AFTER all structural filters.
Usage Patterns
Deterministic Generation
Fixed seed produces identical token sequences (perfect reproducibility).
import {
sampleWithStrategy,
TokenHistoryTracker,
Xoroshiro128Plus,
} from '@lloyal-labs/tsampler';
// Fixed seed for deterministic generation
const seed = 42;
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);
// Generate tokens - same seed = same sequence
for (let i = 0; i < maxTokens; i++) {
const logits = native.getLogits();
const tokenId = sampleWithStrategy(new Float32Array(logits), {
tokenHistory,
params: {
temperature: 0.8,
topK: 40,
topP: 0.95,
},
prng,
});
tokenHistory.accept(tokenId);
native.decode([tokenId]);
}
Use cases:
- Testing and debugging
- Reproducible benchmarks
- Deterministic replay of conversations
- A/B testing with controlled randomness
Creative Generation
High temperature + nucleus sampling for diverse, creative outputs.
const tokenId = sampleWithStrategy(logits, {
tokenHistory,
params: {
temperature: 1.2, // Higher randomness
topP: 0.95, // Nucleus sampling (adaptive)
topK: 0, // No hard limit (let topP decide)
minP: 0.02, // Filter very unlikely tokens
penaltyRepeat: 1.15, // Discourage repetition
},
prng,
});
Parameters:
- High temperature (1.0-1.5): More randomness
- topP (0.9-0.95): Adapts to distribution shape
- Low topK or 0: Let nucleus sampling control diversity
- Moderate penaltyRepeat: Avoid repetitive patterns
Use cases:
- Creative writing
- Brainstorming
- Dialogue generation
- Story continuation
Factual Generation
Low temperature + greedy for deterministic, factual outputs.
const tokenId = sampleWithStrategy(logits, {
tokenHistory,
params: {
temperature: 0.1, // Near-deterministic
topK: 1, // Greedy (triggers fast-path)
penaltyRepeat: 1.0, // No repetition penalty
penaltyFreq: 0.0,
penaltyPresent: 0.0,
},
prng,
});
Or explicitly use greedy:
import { greedy } from '@lloyal-labs/tsampler';
const tokenId = greedy(new Float32Array(native.getLogits()));
Parameters:
- Very low temperature (< 0.2): Deterministic
- topK = 1: Greedy fast-path
- No penalties: Pure argmax selection
Use cases:
- Question answering
- Summarization
- Code generation
- Translation
Grammar-Constrained Generation
Integrate with GBNF grammar constraints (applied before sampling).
import {
sampleWithStrategy,
TokenHistoryTracker,
Xoroshiro128Plus,
} from '@lloyal-labs/tsampler';
// Initialize grammar
native.initGrammar('root ::= "hello" | "goodbye"');
native.resetGrammar();
const seed = samplingParams?.seed ?? Date.now();
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);
// Sampling loop
for (let i = 0; i < maxTokens; i++) {
// Get logits
const logitsBuffer = native.getLogits();
const logits = new Float32Array(logitsBuffer);
// Apply grammar constraints (modifies logits in-place)
native.applyGrammar(logitsBuffer);
// Sample from grammar-constrained distribution
const tokenId = sampleWithStrategy(logits, {
tokenHistory,
params: {
temperature: 0.8,
topK: 40,
topP: 0.95,
},
prng,
});
// Accept token for grammar state + history
native.acceptToken(tokenId);
tokenHistory.accept(tokenId);
// Decode
native.decode([tokenId]);
}
// Cleanup
native.freeGrammar();
Integration points:
- applyGrammar(): Masks invalid tokens (sets logits to -∞)
- sampleWithStrategy(): Samples from grammar-valid tokens only
- acceptToken(): Updates grammar state for next token
Use cases:
- JSON generation
- Code generation (language syntax)
- Structured data extraction
- Constrained creative writing
Custom Sampling Strategies
Build domain-specific sampling with fine-grained control.
import {
applyTopK,
applyTopP,
applyMinP,
applyTemperature,
sampleFromSet,
SamplerWorkspace,
TokenHistoryTracker,
} from './sampling';
// Custom strategy: Aggressive filtering for concise responses
function sampleConcise(
logits: Float32Array,
tokenHistory: TokenHistoryTracker,
temperature: number
): number {
const workspace = new SamplerWorkspace(128); // Smaller workspace
// Build custom chain
let candidates = applyTopK(logits, 20, workspace); // Fewer candidates
candidates = applyMinP(candidates, 0.1); // Aggressive min-p
candidates = applyTopP(candidates, 0.8); // Tighter nucleus
candidates = applyTemperature(candidates, temperature);
return sampleFromSet(candidates);
}
// Usage
const tokenId = sampleConcise(
new Float32Array(native.getLogits()),
tokenHistory,
0.7
);
Advanced patterns:
- Domain-specific filter combinations
- Custom probability transformations
- Multi-stage filtering
- Adaptive sampling based on context
Test-Time Alignment (TTA)
Test-Time Alignment is the fusion of app-state with sampling strategy to steer model outputs at every token step. Unlike traditional fixed-parameter sampling, TTA allows you to:
- Modify logits based on application state (constraints, domain knowledge)
- Adapt sampling parameters based on distribution health or uncertainty
- Switch strategies mid-generation without reinitialization
Architecture: App-State × Sampler Fusion
import {
sampleWithStrategy,
computeModelEntropy,
computeModelSurprisal,
requiredKcap,
RollingPerplexity,
} from '@lloyal-labs/tsampler';
const ppl = new RollingPerplexity();
while (generating) {
const logits = new Float32Array(native.getLogits());
// 1. App-State → Logit Steering
applyDomainConstraints(logits, appState);
// 2. Distribution Analysis → Strategy Selection
const entropy = computeModelEntropy(logits);
const params = selectStrategy(entropy, appState);
// 3. Dynamic Capacity Management
workspace.ensureCapacity(
requiredKcap(params.topK, params.typicalP, logits.length)
);
// 4. Sample with adapted strategy
const token = sampleWithStrategy(logits, {
tokenHistory,
params,
workspace,
prng,
});
// 5. Update app state + quality tracking
updateAppState(token, appState);
const surprisal = computeModelSurprisal(logits, token);
ppl.addSurprisal(surprisal);
// Optional: KV eviction gate
if (ppl.ppl() > 50) {
console.warn('High perplexity - consider cache pruning or retrieval');
}
}
Dynamic Workspace Capacity
The workspace automatically grows to accommodate changing parameters:
import { SamplerWorkspace, requiredKcap } from '@lloyal-labs/tsampler';
const workspace = new SamplerWorkspace(256); // Initial capacity
// Token 1: Focused sampling (topK=40)
const token1 = sampleWithStrategy(logits1, {
tokenHistory,
params: { topK: 40, temperature: 0.8 },
workspace, // kcap=256 (sufficient)
prng,
});
// Token 2: Uncertainty spike → widen search (topK=320)
const token2 = sampleWithStrategy(logits2, {
tokenHistory,
params: { topK: 320, temperature: 1.2 },
workspace, // Auto-grows to kcap=512 (power-of-two)
prng,
});
// Token 3: Enable typical-P (needs larger pool)
const token3 = sampleWithStrategy(logits3, {
tokenHistory,
params: { topK: 40, typicalP: 0.9, temperature: 0.6 },
workspace, // Stays at kcap=512 (typical-P requires ≥512)
prng,
});
Growth Strategy:
- Power-of-two sizing: Grows 40→64→128→256→512 (~5 allocations max per session)
- Monotonic growth: Never downsizes (avoids churn in bursty workloads)
- Zero reallocations: After initial growth, zero allocations per token
- Version tracking: Guards against stale CandidateSet references after growth
Capacity Calculation Helper
import {
requiredKcap,
DEFAULT_KCAP,
TYPICAL_P_KCAP,
} from '@lloyal-labs/tsampler';
// Calculate required capacity for given params
const needed = requiredKcap(
params.topK, // undefined or 0 = use default
params.typicalP, // < 1.0 triggers TYPICAL_P_KCAP (512)
vocabSize // Upper bound (no point allocating beyond V)
);
workspace.ensureCapacity(needed);
Strategy:
- typicalP < 1.0 → Need ≥512 for stable entropy calculation
- Else → max(topK, DEFAULT_KCAP) for nucleus/min-p sampling
- Clamps to vocabSize (no over-allocation)
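The rules above amount to roughly the following (a behavioural sketch of requiredKcap, not the library source; the constants stand in for the exported DEFAULT_KCAP / TYPICAL_P_KCAP):
// Sketch only: illustrates the documented capacity rules.
function requiredKcapSketch(
  topK: number | undefined,
  typicalP: number | undefined,
  vocabSize: number
): number {
  const DEFAULT_KCAP = 256;
  const TYPICAL_P_KCAP = 512;
  const base = Math.max(topK || 0, DEFAULT_KCAP);   // undefined or 0 → default
  const needed =
    typicalP !== undefined && typicalP < 1.0
      ? Math.max(base, TYPICAL_P_KCAP)              // typical-P needs ≥512
      : base;
  return Math.min(needed, vocabSize);               // never allocate beyond V
}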
Example 1: Adaptive Sampling Based on Entropy
import {
sampleWithStrategy,
computeModelEntropy,
TokenHistoryTracker,
Xoroshiro128Plus,
SamplerWorkspace,
} from '@lloyal-labs/tsampler';
function selectStrategy(entropy: number) {
if (entropy < 2.0) {
// Collapsed distribution → widen search
return {
topK: 256,
temperature: 1.5,
topP: 0.98,
};
} else if (entropy > 5.0) {
// Too flat → focus sampling
return {
topK: 20,
temperature: 0.5,
topP: 0.9,
};
} else {
// Healthy distribution → standard sampling
return {
topK: 40,
temperature: 0.8,
topP: 0.95,
};
}
}
// Per-token adaptive sampling
const prng = new Xoroshiro128Plus(42);
const tokenHistory = new TokenHistoryTracker(64);
const workspace = new SamplerWorkspace(256);
while (generating) {
const logits = new Float32Array(native.getLogits());
// Compute model-level entropy (before filters)
const entropy = computeModelEntropy(logits);
const params = selectStrategy(entropy);
const token = sampleWithStrategy(logits, {
tokenHistory,
params,
workspace, // Grows as needed
prng,
});
tokenHistory.accept(token);
native.decode([token]);
}
Example 2: Constraint-Based Logit Steering
Domain-specific constraints via logit manipulation:
// Hard constraint: JPY doesn't use decimal subdivisions
if (parsedSoFar.currency === 'JPY' && currentField === 'amount') {
logits[DECIMAL_TOKEN_ID] = -Infinity; // Veto
DIGIT_TOKENS.forEach((id) => (logits[id] += 2.0)); // Boost integers
}
const token = sampleWithStrategy(logits, {
tokenHistory,
params: { temperature: 0.8, topK: 40 },
workspace,
prng,
});
Example 3: Medical Report Validation
Logit steering based on application state:
const currentMetric = detectCurrentMetric(accumulated);
if (currentMetric === 'glucose') {
const { value, range } = reportData.glucose;
const [min, max] = range;
if (value > max) {
// Elevated glucose → bias toward correct terminology
ELEVATED_TOKENS.forEach((id) => (logits[id] += 10.0));
NORMAL_TOKENS.forEach((id) => (logits[id] = -Infinity)); // Veto incorrect
}
}
const token = sampleWithStrategy(logits, {
tokenHistory,
params: { temperature: 0.7, topK: 40 },
workspace,
prng,
});
Example 4: Exploratory Bursts
// Normal generation
let baseParams = { topK: 40, temperature: 0.8 };
// Detect uncertainty (e.g., all top-5 probs < 0.15)
if (isUncertain(logits)) {
// Temporary exploration burst
const exploratoryParams = {
topK: 256, // Widen candidate pool
temperature: 1.3, // Increase randomness
typicalP: 0.9, // Filter atypical tokens
};
const token = sampleWithStrategy(logits, {
tokenHistory,
params: exploratoryParams,
workspace, // Auto-grows to 512 for typical-P
prng,
});
// Next token: back to focused sampling
// workspace stays at 512 (no downsize churn)
} else {
const token = sampleWithStrategy(logits, {
tokenHistory,
params: baseParams,
workspace,
prng,
});
}
Example 5: Perplexity Monitoring & Quality Tracking
import {
sampleWithStrategy,
computeModelSurprisal,
computeSamplingSurprisal,
RollingPerplexity,
TokenHistoryTracker,
Xoroshiro128Plus,
SamplerWorkspace,
} from '@lloyal-labs/tsampler';
const ppl = new RollingPerplexity();
const prng = new Xoroshiro128Plus(42);
const tokenHistory = new TokenHistoryTracker(64);
const workspace = new SamplerWorkspace(256);
// Per-token quality metrics
const qualityLog: Array<{
token: number;
modelSurprisal: number;
samplingSurprisal: number;
runningPpl: number;
}> = [];
while (generating) {
const logits = new Float32Array(native.getLogits());
// Sample token
const token = sampleWithStrategy(logits, {
tokenHistory,
params: { topK: 40, temperature: 0.8 },
workspace,
prng,
});
// Track model-level surprisal (before filters)
const modelSurprisal = computeModelSurprisal(logits, token);
ppl.addSurprisal(modelSurprisal);
// Track sampling-level surprisal (post-filter)
// Note: For full tracking, you'd capture candidates from getTopCandidates
// This example shows the API usage pattern
qualityLog.push({
token,
modelSurprisal,
samplingSurprisal: modelSurprisal, // Approximation when candidates not available
runningPpl: ppl.ppl(),
});
// Quality gates
if (modelSurprisal > 8.0) {
console.warn(
`High uncertainty token: surprisal=${modelSurprisal.toFixed(2)}`
);
}
if (ppl.ppl() > 50) {
console.warn(
`Sequence perplexity high: ${ppl.ppl().toFixed(2)} - consider retrieval`
);
// Trigger: cache pruning, RAG retrieval, or context compression
}
tokenHistory.accept(token);
native.decode([token]);
}
// Sequence-level metrics
console.log(`Final perplexity: ${ppl.ppl().toFixed(2)}`);
console.log(
`Avg surprisal: ${(qualityLog.reduce((sum, m) => sum + m.modelSurprisal, 0) / qualityLog.length).toFixed(2)} nats`
);
// Identify high-uncertainty spans
const uncertainSpans = qualityLog
  .map((m, i) => ({ position: i, surprisal: m.modelSurprisal })) // keep original positions
  .filter((s) => s.surprisal > 6.0);
console.log('High-uncertainty tokens:', uncertainSpans);
Use cases:
- KV cache eviction gates: High perplexity → prune old context or fetch from retrieval
- Quality monitoring: Track surprisal/perplexity for confidence estimates
- Debugging: Identify uncertain spans for manual review
- A/B testing: Compare perplexity across different prompts or models
- Dashboard signals: Real-time uncertainty visualization
Performance Characteristics
Growth amortization:
- First token with K=40: Allocates 64 capacity (~0.5KB)
- Later token with K=320: Grows to 512 capacity (~4KB, one-time)
- Subsequent tokens: Zero allocations (reuses buffers)
Typical session progression:
- 40 → 64 → 128 → 256 → 512 (at most ~5 reallocations)
- Total overhead: ~10-20ms across entire session
- Per-token overhead after growth: <0.01ms (negligible)
Memory footprint:
- K buffers: 512 × 12 bytes = 6KB (max)
- Working logits: 65536 × 4 bytes = 262KB (lazy allocated, optional)
- Total: <270KB worst-case
When to Use TTA
Use TTA when you need:
- ✅ Domain-specific constraints (JSON schemas, business rules)
- ✅ Runtime validation (medical thresholds, financial rules)
- ✅ Adaptive sampling (entropy-based, uncertainty-triggered)
- ✅ Multi-modal generation (switch strategies by content type)
- ✅ Exploratory search (temporary parameter bursts)
Don't use TTA for:
- ❌ Simple fixed-parameter generation (use native sampling)
- ❌ Maximum performance critical paths (~6-11% overhead)
- ❌ Battery-constrained devices (every millisecond matters)
Metrics & Telemetry
Distribution metrics are orthogonal to sampling - they observe and measure without affecting token selection. Metrics enable runtime analytics and decision-making without interfering with the sampling process.
Key Principle: Observation Without Interference
Metrics compute surprisal/entropy from logits at two measurement levels:
- Model metrics: Raw logits (before filters) - model's inherent belief
- Sampling metrics: Post-filter logits (after top-k/p/temp) - actual sampled distribution
Use Cases:
- KV cache eviction gates: High perplexity triggers retrieval or pruning
- Quality monitoring: Track confidence estimates for generated sequences
- Dashboard signals: Real-time uncertainty visualization
- Analytics: Post-hoc analysis of generation quality
Example: Per-Step Metrics Collection
import {
sampleWithStrategy,
computeModelEntropy,
computeModelSurprisal,
RollingPerplexity,
getTopCandidates,
} from '@lloyal-labs/tsampler';
const ppl = new RollingPerplexity();
while (generating) {
const logits = new Float32Array(native.getLogits());
// 1. Compute metrics BEFORE sampling (no side effects on sampling)
const entropy = computeModelEntropy(logits);
const topCandidates = getTopCandidates(logits, 5); // For UI visualization
// 2. Sample token (metrics don't affect this)
const token = sampleWithStrategy(logits, {
tokenHistory,
params: { temperature: 0.8, topK: 40 },
workspace,
prng,
});
// 3. Track quality metrics AFTER sampling
const surprisal = computeModelSurprisal(logits, token);
ppl.addSurprisal(surprisal);
// 4. Analytics/eviction gates (metrics inform actions)
if (ppl.ppl() > 50) {
console.warn(
'High perplexity detected - consider KV eviction or retrieval'
);
}
if (entropy > 5.0) {
console.log('High uncertainty - model is exploring multiple paths');
}
// Accept token and continue
tokenHistory.accept(token);
}
Available Metrics:
- computeModelSurprisal(logits, tokenId) - Surprisal of chosen token from model distribution
- computeSamplingSurprisal(candidateLogits, candidateIds, tokenId) - Surprisal from filtered candidates
- computeModelEntropy(logits) - Entropy of full model distribution
- computeSamplingEntropy(candidateLogits) - Entropy of filtered candidates
- RollingPerplexity - Track sequence-level quality via perplexity
All metrics use numerically stable log-sum-exp and support both nats (default) and bits output.
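For reference, surprisal in nats can be computed from raw logits with the standard stable log-sum-exp formulation (a standalone sketch; the library's internals may differ in detail):
// surprisal(t) = logSumExp(logits) - logits[t], in nats
function surprisalNats(logits: Float32Array, tokenId: number): number {
  let max = -Infinity;
  for (const l of logits) if (l > max) max = l;
  let sumExp = 0;
  for (const l of logits) sumExp += Math.exp(l - max); // subtract max for stability
  return max + Math.log(sumExp) - logits[tokenId];
}

const nats = surprisalNats(logits, tokenId);
const bits = nats / Math.LN2; // convert nats → bits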
See also:
- Example 5: Perplexity Monitoring - Full TTA workflow with metrics
Invariants and Guarantees
Sampler Chain Order (FIXED)
INVARIANT: Chain order NEVER changes:
- Penalties (virtual, via accessor)
- Top-K
- Typical-P (if enabled)
- Top-P (if enabled)
- Min-P (if enabled)
- Top-N-Sigma (if enabled)
- Temperature (ALWAYS LAST structural operation)
- Sample
Rationale: Matches llama.cpp exactly (discussion #7590)
Temperature Timing (CRITICAL)
INVARIANT: Temperature applied AFTER all structural filters (top-k, nucleus, etc.)
Why this matters:
- Structural filters operate on raw probabilities
- Temperature BEFORE filters reduces their effectiveness
- Example: a flattening temperature applied before topP inflates the nucleus and keeps tokens that top-P on the raw distribution would drop
Guarantee: sampleWithStrategy() enforces correct ordering.
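A self-contained toy demonstration of why the ordering matters (not library code): with a flattening temperature applied before the nucleus cutoff, the cumulative-mass threshold admits extra tokens.
// How many tokens survive a p=0.9 nucleus cutoff, with temperature applied first?
function nucleusSize(toyLogits: number[], temp: number, p: number): number {
  const scaled = toyLogits.map((l) => l / temp);
  const maxL = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - maxL));
  const z = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / z).sort((a, b) => b - a);
  let cumulative = 0;
  let kept = 0;
  for (const pr of probs) {
    cumulative += pr;
    kept++;
    if (cumulative >= p) break;
  }
  return kept;
}

nucleusSize([5, 3, 2, 1, 0], 1.0, 0.9); // 2 tokens: nucleus computed on the raw distribution
nucleusSize([5, 3, 2, 1, 0], 2.0, 0.9); // 4 tokens: flattened first, nucleus balloons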
Zero-Copy Logits (PERFORMANCE)
INVARIANT: Input logits array NEVER modified by sampleWithStrategy()
Implementation:
- Penalties applied virtually via accessor function
- Filters operate on workspace buffers
- Original logits remain unchanged
Benefit: Allows logits reuse, inspection, debugging without defensive copies.
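A quick sketch (using the Quick Start setup) of what this enables:
const before = logits[123];
const tokenId = sampleWithStrategy(logits, {
  tokenHistory,
  params: { temperature: 0.8, topK: 40 },
  prng,
});
console.assert(logits[123] === before); // input buffer untouched
const top = getTopCandidates(logits, 5); // safe to reuse for debugging/UI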
Greedy Fast-Path (OPTIMIZATION)
INVARIANT: temperature < 1e-3 or topK === 1 triggers greedy selection
Skips: All filters (top-k, nucleus, temperature, etc.)
Exception: Penalties still applied (correctness critical)
Implementation:
if (temp < 1e-3 || topK === 1) {
if (hasPenalties) {
// Greedy with penalties
const set = applyTopK(logits, 1, workspace, penaltyAccessor);
return set.indices[0];
}
// Plain greedy
return greedy(logits);
}
Candidate Count Monotonicity
INVARIANT: Each filter narrows candidate set (count decreases or stays same)
Guarantee: count_after_filter ≤ count_before_filter
Exception: None. All filters strictly non-increasing.
Renormalization (CORRECTNESS)
INVARIANT: sampleFromSet() renormalizes probabilities before sampling
Why: Filtered sets may not sum to 1.0 (filters remove mass)
Formula:
const totalMass = sum(probs[0..count]); // mass over active candidates [0, count)
const rand = random() * totalMass;
Guarantee: Sampling is unbiased with respect to filtered distribution.
Top-K Special Cases
INVARIANT: topK >= vocab_size means NO truncation (preserves token ID order)
Rationale: Matches llama.cpp behavior (user explicitly requested all tokens)
Implementation:
if (k >= V) {
return candidateSetFromFullLogits(logits, ws, penaltyAccessor);
}
Use case: Verification mode, custom post-processing.
Top-P Special Cases
INVARIANT 1: topP >= 1.0 disables filter (no-op)
INVARIANT 2: topP <= 0.0 collapses to greedy by probability
Critical: Greedy by probability (NOT by index)
- Must find argmax explicitly
- Typical-P may have reordered candidates
- Returning indices[0] would be WRONG
Typical-P Entropy Stability
INVARIANT: Typical-P requires larger candidate pool for stable entropy
Implementation:
- Fast mode uses TYPICAL_P_KCAP = 512 (not DEFAULT_KCAP = 256)
- Ensures entropy calculation over sufficient candidates
Rationale: Meister et al. 2022 - entropy over 512 tokens provides better estimate.
Min-P Relative Threshold
INVARIANT: minProb = maxProb * threshold
Adapts to distribution:
- High max prob → aggressive filtering
- Low max prob → less filtering
Always keeps argmax: Token with max prob always passes filter.
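Worked numbers for the relative threshold above (illustrative):
const minP = 0.05;
const confidentThreshold = 0.80 * minP; // argmax prob 0.80 → cutoff 0.04 (aggressive)
const uncertainThreshold = 0.10 * minP; // argmax prob 0.10 → cutoff 0.005 (lenient)
// The argmax always passes, since maxProb >= maxProb * minP for any minP <= 1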
Top-N-Sigma Semantics
INVARIANT: topNSigma <= 0 is NO-OP (NOT greedy)
Rationale: llama.cpp PR#13264 - negative values disable filter.
Edge case: All logits identical → std = 0 → collapse to single max token.
Penalty Application Order
INVARIANT: Penalties applied in order: repeat → frequency → presence
Matches: llama.cpp line 1774
Virtual application: sampleWithStrategy() applies penalties via accessor (zero-copy).
Workspace Reuse (ZERO-ALLOC)
INVARIANT: Preallocated workspace buffers reused per-token
Performance: Reduces allocations from ~100/token to ~0
Buffers:
- idxK, valK, probK: For heap-based Top-K (size K)
- tmpIdx, tmpLogits, tmpProbs: For re-sorting (size K)
- workingLogits: For penalty application (size V, optional)
Guarantee: Zero allocations per token after initialization.
Performance Characteristics
TypeScript Sampling
| Operation     | Time (ms) | Notes                                          |
| ------------- | --------- | ---------------------------------------------- |
| Logits access | 0.05-0.07 | Zero-copy ArrayBuffer (measured in production) |
| Top-K (k=40)  | 1-2       | Heap-based selection, O(V log K)               |
| Nucleus       | 0.5-1     | O(K log K) sorting                             |
| Softmax       | 0.2-0.5   | V8 JIT optimized                               |
| Total         | 3-5       | Per token (average across engines)             |
TypeScript Performance by Engine
| Engine  | Time per token | Notes                      |
| ------- | -------------- | -------------------------- |
| Node/V8 | ~2-4ms         | Server-side, JIT optimized |
| Hermes  | ~4-8ms         | Android mid-tier devices   |
| JSC     | ~3-6ms         | iOS Safari/WebKit          |
Native Sampling (C++)
| Operation     | Time (ms) | Notes          |
| ------------- | --------- | -------------- |
| Logits access | 0.01      | Pointer access |
| Top-K (k=40)  | 0.1-0.3   | SIMD optimized |
| Total         | 0.1-0.3   | Per token      |
Context
- Decode time: 40-500ms per token (model-dependent: small models ~40ms, large models ~500ms)
- TS overhead: 3-5ms average (engine-specific: 2-8ms range)
- Overhead %: ~6-11% of decode time for typical models
- User perception: Imperceptible (<10ms latency addition)
Optimization Strategies
Fast Mode (default):
- Pre-truncate to KCAP (256 or 512 for typical-p)
- O(V log K) complexity
- ~3-5ms per token
Parity Mode (verification only):
- Start from full V candidates
- O(V) complexity
- ~10-15ms per token
- Use only for llama.cpp verification tests
Workspace Reuse:
- Preallocate buffers once
- Reuse across tokens
- Zero allocations per token
Sparse Penalties:
- Iterate over history (H ≈ 10-50)
- Not full vocabulary (V = 65,536)
- O(H) instead of O(V)
Typical-P K-Cap Trade-Off
Problem: Typical-P requires larger candidate pool for stable entropy.
Solution:
- DEFAULT_KCAP = 256 (for most filters)
- TYPICAL_P_KCAP = 512 (when typical-p enabled)
Performance impact:
- 512 vs 256: ~1-2ms additional (negligible vs decode time)
- Benefit: Better entropy estimate → more effective filtering
Testing
Test Coverage
Comprehensive test coverage for all components:
npm test
Test suites:
- golden.test.ts - llama.cpp parity tests
- sampling.test.ts - Sampling strategies
- penalties.test.ts - Penalty formulas and token history
- prng.test.ts - Deterministic PRNG behavior
- correctness.test.ts - P0 bug fixes (greedy with penalties, top-p p<=0)
- metrics.test.ts - Entropy, surprisal, perplexity
- workspace.test.ts - Dynamic capacity management
- integration.test.ts - End-to-end workflows
Total: 181 tests across 8 test suites (vitest, ~400ms)
Golden Tests
Status: ✅ 20/20 passing
Coverage: All llama.cpp test cases from test-sampling.cpp
Validates:
- Exact penalty formulas (repeat, frequency, presence)
- Sampler chain order
- Temperature timing
- Token ID order preservation (when k >= V)
Correctness Tests
Status: ✅ 12/12 passing
Critical bugs fixed:
- Top-P p<=0 greedy: Must find argmax by probability (not return index 0)
- Greedy with penalties: Must apply penalties even in fast-path
- Heap sort bug: Fixed reverse loop that flipped descending to ascending
PRNG Tests
Validates:
- Same seed → same sequence
- resetPRNG() restores initial state
- Xoroshiro128+ algorithm correctness
Native Parity
Coverage: Partial
Key findings:
- Same sampler chain order
- Same default parameters
- Same greedy fast-path logic
- Minor differences: top-n-sigma (TS only), seed precision (ms vs s)
Interchangeability: Users can switch between 'typescript' and 'native' paths and get equivalent results (within numerical precision).
Directory Structure
sampling/
├── index.ts # Public API exports
│
├── Core Sampling # Individual sampling strategies
│ ├── greedy.ts # Argmax selection (deterministic)
│ ├── softmax.ts # Logits → probabilities
│ └── filters.ts # All filters (topK, topP, minP, typicalP, topNSigma)
│
├── strategies.ts # Combined strategy + penalty integration
│
├── Penalties # Repetition control (exact llama.cpp formulas)
│ └── penalties.ts # Repeat, frequency, presence penalties
│
├── Determinism
│ └── prng.ts # Xoroshiro128+ PRNG for reproducibility
│
├── Native Fast Path
│ └── sample-native.ts # Optional C++ passthrough
│
├── Utilities
│ └── utils.ts # Shared helpers (temperature scaling, heap sort)
│
└── __tests__/ # Comprehensive test coverage
├── penalties.test.ts
├── sampling.test.ts
├── prng.test.ts
    └── correctness.test.ts
Implementation Coverage
Implemented (grounded in llama.cpp):
- ✅ Greedy (argmax)
- ✅ Top-K sampling
- ✅ Top-P (nucleus) sampling
- ✅ Min-P sampling
- ✅ Typical-P (locally typical) sampling
- ✅ Top-N-Sigma filtering
- ✅ Temperature scaling
- ✅ Repetition penalties (repeat, frequency, presence)
- ✅ Grammar constraints (GBNF)
- ✅ Deterministic PRNG (seed-based)
Not Yet Implemented (advanced/experimental):
- ❌ Mirostat sampling (stateful, complex)
- ❌ DRY (Don't Repeat Yourself) sampling
- ❌ XTC sampler
Coverage: 10/13 sampling methods (77%) - all essential methods implemented
Migration from v1.x
Breaking Change: Standalone samplers (topK, topP, minP, etc.) removed.
Why?
The old standalone samplers had a critical bug: temperature was applied BEFORE structural filters.
New functional filters apply temperature AFTER all structural filters (correct llama.cpp parity).
Migration Guide
Before (v1.x):
import { topP } from './sampling';
const tokenId = topP(logits, 0.95, 0.8);
After (v2.x):
import {
sampleWithStrategy,
TokenHistoryTracker,
initializePRNG,
} from './sampling';
// One-time setup per completion
const seed = samplingParams?.seed ?? Date.now();
initializePRNG(seed);
const tokenHistory = new TokenHistoryTracker(64);
// Per-token sampling
const tokenId = sampleWithStrategy(logits, {
  tokenHistory,
  params: {
    topP: 0.95,
    temperature: 0.8,
  },
});
tokenHistory.accept(tokenId);
Benefits:
- ✅ Correct temperature timing (llama.cpp parity)
- ✅ Penalty support (repeat, frequency, presence)
- ✅ Filter composition (combine multiple filters)
- ✅ Deterministic sampling (seed-based PRNG)
- ✅ Grammar constraints (via native layer)
Advanced: Direct Filter Usage
If you need fine-grained control:
import {
applyTopK,
applyTopP,
applyTemperature,
sampleFromSet,
SamplerWorkspace,
} from './sampling';
const workspace = new SamplerWorkspace(256);
let candidates = applyTopK(logits, 256, workspace);
candidates = applyTopP(candidates, 0.95);
candidates = applyTemperature(candidates, 0.8); // ALWAYS LAST
const tokenId = sampleFromSet(candidates);
Future Enhancements
- Logit Lens visualization (probability distribution inspection)
- Custom sampling strategies (domain-specific filters)
- Mirostat sampling (if requested by users)
- DRY sampling (if requested by users)
- Beam search support
- Constrained decoding (beyond GBNF)
- Token probability logging for debugging
- Sampling analytics and telemetry
References
- llama.cpp sampler chain discussion
- Nucleus Sampling - Holtzman et al. 2020
- Locally Typical Sampling - Meister et al. 2022
- Top-N-Sigma PR - llama.cpp#13264
- Xoroshiro128+ Algorithm
License
Apache 2.0
