TypeScript Sampling Library
Pure TypeScript implementations of sampling algorithms for LLM token generation with exact llama.cpp parity.
npm install @lloyal-labs/tsampler
Performance: ~3-5ms per token (vs ~0.1-0.3ms native)
Trade-off: ~6-11% overhead for full flexibility, OTA updates, and transparency
Table of Contents
- Quick Start
- Why TypeScript Sampling?
- API Reference
- Usage Patterns
- Test-Time Alignment (TTA)
- Architecture
- Dynamic Workspace Capacity
- Capacity Calculation Helper
- Example 1: Adaptive Sampling Based on Entropy
- Example 2: Constraint-Based Logit Steering
- Example 3: Medical Report Validation
- Example 4: Exploratory Bursts
- Example 5: Perplexity Monitoring & Quality Tracking
- Performance Characteristics
- When to Use TTA
- Invariants and Guarantees
- Performance Characteristics
- Testing
Quick Start
import {
TokenHistoryTracker,
sampleWithStrategy,
Xoroshiro128Plus,
} from '@lloyal-labs/tsampler';
// Create instances (once per completion)
const seed = samplingParams?.seed ?? Date.now();
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);
// Get logits from native layer
const logitsBuffer = native.getLogits();
const logits = new Float32Array(logitsBuffer);
// Sample with combined strategy
const tokenId = sampleWithStrategy(logits, {
tokenHistory,
params: {
temperature: 0.8,
topK: 40,
topP: 0.95,
minP: 0.05,
penaltyRepeat: 1.1,
},
prng, // Instance PRNG (no global state)
});
// Accept token for history tracking
tokenHistory.accept(tokenId);
Why TypeScript Sampling?
Advantages
- ✅ Test-Time Alignment (TTA): Fuse app-state with sampling strategy at every token step
- ✅ OTA Updates: Sampling logic can evolve without C++ changes
- ✅ Custom Strategies: Easy to implement domain-specific sampling
- ✅ Logit Steering: Apply domain constraints, validation rules, and business logic
- ✅ Transparency: Full visibility into token probabilities and decisions
- ✅ Debugging: Inspect logits, penalties, and sampling in real-time
- ✅ Exact Control: Penalties match llama.cpp exactly (no black box)
- ✅ Grammar Support: Integrates seamlessly with GBNF constraints
When to Use Native Sampling
Use native C++ sampling (samplingPath: 'native') when you need:
- Maximum performance (performance-critical applications)
- Simple greedy/top-k sampling without advanced features
- Battery-constrained devices where every millisecond matters
API Reference
Core Functions
sampleWithStrategy()
Main entry point for sampling with combined strategy.
function sampleWithStrategy(logits: Float32Array, opts: SampleOptions): number;
interface SampleOptions {
tokenHistory: TokenHistoryTracker;
params?: SamplingParams;
workspace?: SamplerWorkspace;
mode?: SamplerMode; // 'fast' (default) or 'parity'
prng?: Xoroshiro128Plus; // Instance PRNG (recommended)
}
Parameters:
- logits: Logits array (zero-copy, NOT modified)
- opts.tokenHistory: Token history tracker for penalties
- opts.params: Sampling parameters (see Sampling Parameters)
- opts.workspace: Preallocated buffers (optional, reuse across tokens for zero-alloc)
- opts.mode: 'fast' (default, O(V log K)) or 'parity' (O(V), exact llama.cpp verification)
- opts.prng: Instance PRNG for deterministic sampling (recommended, avoids multi-stepper collisions)
Returns: Sampled token ID
Sampler Chain Order (matches llama.cpp):
- Penalties (virtual, via accessor)
- Top-K
- Typical-P (if typicalP < 1.0)
- Top-P (if topP < 1.0)
- Min-P (if minP > 0.0)
- Top-N-Sigma (if topNSigma > 0.0)
- Temperature (applied AFTER all structural filters)
- Sample (with renormalization)
Fast-Paths:
- Greedy: temperature < 1e-3 or topK === 1 → argmax selection (skips all filters)
- With penalties: Applies penalties even in greedy mode (correctness critical)
greedy()
Argmax selection (deterministic).
function greedy(logits: Float32Array): number;
Returns token with highest logit value. O(V) single-pass.
getTopCandidates()
Get top-N candidates with probabilities (useful for visualization).
function getTopCandidates(
logits: Float32Array,
n: number = 10
): Array<{ tokenId: number; probability: number }>;
Performance: O(V log N) using heap-based selection (not full sort).
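Usage sketch (logits comes from the native layer as in Quick Start; detokenize is a hypothetical display helper, not part of this package):
import { getTopCandidates } from '@lloyal-labs/tsampler';

const top5 = getTopCandidates(logits, 5);
for (const { tokenId, probability } of top5) {
  console.log(`${tokenId} → ${detokenize(tokenId)}: ${(probability * 100).toFixed(1)}%`);
}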
Sampling Parameters
interface SamplingParams {
// Structural filters
topK?: number; // Default: 40
topP?: number; // Default: 0.95
minP?: number; // Default: 0.05
typicalP?: number; // Default: 1.0 (disabled)
topNSigma?: number; // Default: -1.0 (disabled)
// Temperature (applied after all filters)
temperature?: number; // Default: 0.8
// Penalties
penaltyRepeat?: number; // Default: 1.1
penaltyFreq?: number; // Default: 0.0
penaltyPresent?: number; // Default: 0.0
// Determinism
seed?: number; // Default: Date.now()
}
Parameter Details
topK - Top-K sampling
- Keep top K tokens by logit value
- Common values: 40-80
- topK = 0 (fast mode): Pre-truncate to 256 for performance
- topK >= vocab_size: No truncation (preserve full vocabulary in token ID order)
- Greedy: topK = 1
topP - Nucleus (top-p) sampling
- Keep smallest set where cumulative probability ≥ p
- Common values: 0.9-0.95
- Adapts to distribution shape (dynamic K)
- topP >= 1.0: Disabled
- topP <= 0.0: Greedy by probability (argmax)
minP - Minimum probability threshold (relative)
- Filter tokens where prob < max_prob * minP
- Common values: 0.05-0.1
- Adapts to distribution confidence
- minP <= 0.0: Disabled
- Always keeps argmax (highest prob token)
typicalP - Locally typical sampling
- Keep tokens with "locally typical" information content
- Filters tokens whose entropy diverges from expected
- Common values: 0.95 (disabled by default: 1.0)
- Requires larger candidate pool (512 vs 256) for stable entropy
- typicalP >= 1.0: Disabled
topNSigma - Statistical filtering
- Keep tokens within N standard deviations of max logit
- Statistical approach to filtering unlikely tokens
- Common values: 2.0 (disabled by default: ≤ 0)
- topNSigma <= 0.0: Disabled (NOT greedy, per llama.cpp PR#13264)
temperature - Temperature scaling
- Controls randomness of distribution
- temp > 1.0: Flatter distribution (more random)
- temp = 1.0: No change
- temp < 1.0: Sharper distribution (more deterministic)
- temp → 0: Approaches greedy
- temp < 1e-3: Triggers greedy fast-path
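A standalone sketch of the standard scaling formula the list above describes (illustration only, not the library's internal code):
// Divide logits by temperature before softmax: >1 flattens, <1 sharpens.
function scaleByTemperature(logits: Float32Array, temp: number): Float32Array {
  const out = new Float32Array(logits.length);
  for (let i = 0; i < logits.length; i++) out[i] = logits[i] / temp;
  return out;
}
// e.g. logits [2, 1, 0]: temp 0.5 → [4, 2, 0] (sharper); temp 2.0 → [1, 0.5, 0] (flatter)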
penaltyRepeat - Repetition penalty
- Multiplicative penalty for tokens in history
- Formula: logit *= penalty (if logit ≤ 0), logit /= penalty (if logit > 0)
- Common values: 1.0-1.5
- Default: 1.1
penaltyFreq - Frequency penalty
- Subtractive penalty scaled by token count
- Formula: logit -= count * penalty
- Common values: 0.0-2.0
- Default: 0.0
penaltyPresent - Presence penalty
- Flat penalty if token appears at least once
- Formula: logit -= penalty (if token in history)
- Common values: 0.0-1.0
- Default: 0.0
seed - RNG seed for deterministic sampling
- Fixed seed → identical token sequence
- Omit for non-deterministic (uses Date.now())
- See Deterministic Sampling
Penalties
TokenHistoryTracker
Manages token history with sliding window and frequency tracking.
class TokenHistoryTracker {
constructor(penaltyLastN: number);
// Accept token into history
accept(token: number): void;
// Get occurrence count
getCount(token: number): number;
// Check if token exists
hasToken(token: number): boolean;
// Reset history
reset(): void;
// Get window size
size(): number;
// Get unique tokens (for sparse iteration)
getUniqueTokens(): number[];
// Compute penalty adjustment for single token
computeAdjustment(
tokenId: number,
baseLogit: number,
params: {
repeat?: number;
frequency?: number;
presence?: number;
}
): number;
// Check if penalties would modify logits
static hasPenalties(params: {
repeat?: number;
frequency?: number;
presence?: number;
}): boolean;
}
Sliding Window: Maintains last N tokens with O(1) operations
Frequency Map: Tracks token counts for efficient penalty application
Sparse Iteration: Only processes tokens in history (H ≈ 10-50, not V = 65,536)
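A minimal usage sketch of the tracker API above (token IDs, logits, and penalty values are made up for illustration):
import { TokenHistoryTracker } from '@lloyal-labs/tsampler';

const history = new TokenHistoryTracker(64); // penalize over the last 64 tokens
history.accept(1042);
history.accept(1042);
history.accept(7);

history.getCount(1042); // 2 occurrences in the window
history.hasToken(7);    // true

// Penalty-adjusted logit for a single token, given its raw logit
const adjusted = history.computeAdjustment(1042, 3.2, {
  repeat: 1.1,
  frequency: 0.2,
  presence: 0.1,
});

// Skip penalty work entirely when all penalties are neutral
if (!TokenHistoryTracker.hasPenalties({ repeat: 1.0, frequency: 0, presence: 0 })) {
  // no-op: penalties would not change any logits
}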
Penalty Formulas (llama.cpp exact)
Implementation Reference: penalties.ts lines 135-157 (computeAdjustment method)
Repetition Penalty (multiplicative, sign-dependent):
if (logit <= 0) {
logit *= penalty_repeat; // Multiply for negative/zero
} else {
logit /= penalty_repeat; // Divide for positive
}
Source: llama.cpp llama_sampler_penalties_apply (src/llama-sampling.cpp:1768-1772)
Rationale: The sign-dependent formula fixes a bug in the original academic formulation, where dividing a negative logit by the penalty would incorrectly increase its probability. llama.cpp's corrected implementation multiplies negative logits by the penalty and divides positive ones, which preserves relative ordering and always decreases the penalized token's probability.
Frequency Penalty (additive, count-scaled):
logit -= count * penalty_freq;
Source: llama.cpp line 1774 (first term)
Linear penalty proportional to occurrence count. Common in OpenAI/HuggingFace APIs.
Presence Penalty (additive, binary):
logit -= (count > 0 ? 1 : 0) * penalty_present;
Source: llama.cpp line 1774 (second term)
Binary penalty - same reduction whether token appeared once or many times.
Application Order: repeat → frequency → presence (matches llama.cpp line 1764)
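Hand-worked example of that order (illustrative numbers): token 1042 appeared 3 times in the window, raw logit 2.0, penaltyRepeat = 1.1, penaltyFreq = 0.2, penaltyPresent = 0.4.
let logit = 2.0;
logit /= 1.1;     // 1. repeat: positive logit → divide        ≈ 1.818
logit -= 3 * 0.2; // 2. frequency: scaled by count (3)         ≈ 1.218
logit -= 0.4;     // 3. presence: flat, token appeared ≥ once  ≈ 0.818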
Formula Semantics Comparison:
| Library               | Repetition Penalty              | Frequency/Presence |
| --------------------- | ------------------------------- | ------------------ |
| llama.cpp (this impl) | Multiplicative, sign-dependent  | Additive           |
| OpenAI API            | Not supported                   | Additive (same)    |
| HuggingFace           | Purely divisive (no sign check) | Additive           |
Key Difference: HuggingFace's repetition_penalty always divides, which incorrectly increases probability for negative logits. llama.cpp (and this implementation) fixes this with sign-dependent logic.
References:
- llama.cpp implementation: src/llama-sampling.cpp:1747-1778
- Academic paper bug fix: See llama.cpp comment at line 1766
- OpenAI comparison: API docs - frequency_penalty
Low-Level Penalty Functions
// Apply all penalties in correct order
function applyPenalties(
logits: Float32Array,
tokenHistory: TokenHistoryTracker,
params: {
repeat?: number;
frequency?: number;
presence?: number;
}
): void;
// Individual penalty functions
function applyRepetitionPenalty(logits, tokenHistory, penalty): void;
function applyFrequencyPenalty(logits, tokenHistory, penalty): void;
function applyPresencePenalty(logits, tokenHistory, penalty): void;
Note: sampleWithStrategy() applies penalties virtually (zero-copy) via accessor function. Low-level functions modify logits in-place.
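A sketch of in-place use (assumes logits and tokenHistory set up as in Quick Start); copy the buffer first, since unlike sampleWithStrategy() these helpers mutate their input:
import { applyPenalties } from '@lloyal-labs/tsampler';

const penalized = new Float32Array(logits); // defensive copy: applyPenalties mutates in-place
applyPenalties(penalized, tokenHistory, {
  repeat: 1.1,
  frequency: 0.2,
  presence: 0.1,
});
// `penalized` now carries repeat → frequency → presence adjustments; `logits` is untouched.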
Deterministic Sampling
Same seed = identical token sequence (critical for reproducibility).
Instance PRNG (Recommended)
import {
Xoroshiro128Plus,
sampleWithStrategy,
TokenHistoryTracker,
} from '@lloyal-labs/tsampler';
// Create PRNG instance (once per completion)
const seed = samplingParams?.seed ?? Date.now();
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);
// Use in sampling
const tokenId = sampleWithStrategy(logits, {
tokenHistory,
params: { temperature: 0.8 },
prng, // Instance PRNG (no global state)
});
// Generate random numbers directly (advanced use)
const rand = prng.next(); // Returns [0, 1)
Algorithm: Xoroshiro128+ (fast, high-quality, passes BigCrush tests)
Benefits:
- No global state: Multiple samplers can run concurrently without collision
- Explicit lifecycle: PRNG lifetime tied to sampler instance
- Reproducible: Same seed produces identical sequences
Seed Defaults:
- Provided: Use exact seed value
- Omitted: Date.now() (millisecond precision)
Deterministic Replay: Snapshots preserve samplingParams (including seed) for exact replay.
Legacy Global PRNG (Deprecated)
import { initializePRNG, random, resetPRNG } from '@lloyal-labs/tsampler';
// ⚠️ DEPRECATED: Use instance PRNG instead
initializePRNG(seed);
const rand = random();
Note: Global PRNG still works for backward compatibility but is deprecated. Use instance PRNG for new code.
Advanced Filters
For custom sampling strategies, use filters directly:
import {
applyTopK,
applyTypicalP,
applyTopP,
applyMinP,
applyTopNSigma,
applyTemperature,
sampleFromSet,
CandidateSet,
SamplerWorkspace,
} from './sampling';
Filter Functions
All filters operate on CandidateSet (zero-copy wrapper around workspace buffers).
// CandidateSet: Zero-copy wrapper
class CandidateSet {
indices: Uint32Array; // Token IDs
logits: Float32Array; // Logit values
probs: Float32Array; // Probabilities (softmax)
count: number; // Active candidates [0, count)
}
// Top-K filter
function applyTopK(
logits: Float32Array,
k: number,
ws: SamplerWorkspace,
penaltyAccessor?: (tokenId: number, baseLogit: number) => number
): CandidateSet;
// Typical-P filter
function applyTypicalP(
set: CandidateSet,
p: number,
minKeep: number = 1
): CandidateSet;
// Top-P (nucleus) filter
function applyTopP(
set: CandidateSet,
p: number,
workspace?: SamplerWorkspace
): CandidateSet;
// Min-P filter
function applyMinP(set: CandidateSet, threshold: number): CandidateSet;
// Top-N-Sigma filter
function applyTopNSigma(set: CandidateSet, n: number): CandidateSet;
// Temperature scaling
function applyTemperature(set: CandidateSet, temp: number): CandidateSet;
// Sample from final set
function sampleFromSet(set: CandidateSet): number;
Custom Filter Chain Example
const workspace = new SamplerWorkspace(256);
// Build custom chain
let candidates = applyTopK(logits, 40, workspace);
candidates = applyTopP(candidates, 0.95);
candidates = applyMinP(candidates, 0.05);
candidates = applyTemperature(candidates, 0.8); // ALWAYS LAST
// Sample from final distribution
const tokenId = sampleFromSet(candidates);
CRITICAL: Temperature MUST be applied AFTER all structural filters.
Usage Patterns
Deterministic Generation
Fixed seed produces identical token sequences (perfect reproducibility).
import {
sampleWithStrategy,
TokenHistoryTracker,
Xoroshiro128Plus,
} from '@lloyal-labs/tsampler';
// Fixed seed for deterministic generation
const seed = 42;
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);
// Generate tokens - same seed = same sequence
for (let i = 0; i < maxTokens; i++) {
const logits = native.getLogits();
const tokenId = sampleWithStrategy(new Float32Array(logits), {
tokenHistory,
params: {
temperature: 0.8,
topK: 40,
topP: 0.95,
},
prng,
});
tokenHistory.accept(tokenId);
native.decode([tokenId]);
}
Use cases:
- Testing and debugging
- Reproducible benchmarks
- Deterministic replay of conversations
- A/B testing with controlled randomness
Creative Generation
High temperature + nucleus sampling for diverse, creative outputs.
const tokenId = sampleWithStrategy(logits, {
tokenHistory,
params: {
temperature: 1.2, // Higher randomness
topP: 0.95, // Nucleus sampling (adaptive)
topK: 0, // No hard limit (let topP decide)
minP: 0.02, // Filter very unlikely tokens
penaltyRepeat: 1.15, // Discourage repetition
},
prng,
});
Parameters:
- High temperature (1.0-1.5): More randomness
- topP (0.9-0.95): Adapts to distribution shape
- Low topK or 0: Let nucleus sampling control diversity
- Moderate penaltyRepeat: Avoid repetitive patterns
Use cases:
- Creative writing
- Brainstorming
- Dialogue generation
- Story continuation
Factual Generation
Low temperature + greedy for deterministic, factual outputs.
const tokenId = sampleWithStrategy(logits, {
tokenHistory,
params: {
temperature: 0.1, // Near-deterministic
topK: 1, // Greedy (triggers fast-path)
penaltyRepeat: 1.0, // No repetition penalty
penaltyFreq: 0.0,
penaltyPresent: 0.0,
},
prng,
});
Or explicitly use greedy:
import { greedy } from '@lloyal-labs/tsampler';
const tokenId = greedy(new Float32Array(native.getLogits()));
Parameters:
- Very low temperature (< 0.2): Deterministic
- topK = 1: Greedy fast-path
- No penalties: Pure argmax selection
Use cases:
- Question answering
- Summarization
- Code generation
- Translation
Grammar-Constrained Generation
Integrate with GBNF grammar constraints (applied before sampling).
import {
sampleWithStrategy,
TokenHistoryTracker,
Xoroshiro128Plus,
} from '@lloyal-labs/tsampler';
// Initialize grammar
native.initGrammar('root ::= "hello" | "goodbye"');
native.resetGrammar();
const seed = samplingParams?.seed ?? Date.now();
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);
// Sampling loop
for (let i = 0; i < maxTokens; i++) {
// Get logits
const logitsBuffer = native.getLogits();
const logits = new Float32Array(logitsBuffer);
// Apply grammar constraints (modifies logits in-place)
native.applyGrammar(logitsBuffer);
// Sample from grammar-constrained distribution
const tokenId = sampleWithStrategy(logits, {
tokenHistory,
params: {
temperature: 0.8,
topK: 40,
topP: 0.95,
},
prng,
});
// Accept token for grammar state + history
native.acceptToken(tokenId);
tokenHistory.accept(tokenId);
// Decode
native.decode([tokenId]);
}
// Cleanup
native.freeGrammar();
Integration points:
- applyGrammar(): Masks invalid tokens (sets logits to -∞)
- sampleWithStrategy(): Samples from grammar-valid tokens only
- acceptToken(): Updates grammar state for next token
Use cases:
- JSON generation
- Code generation (language syntax)
- Structured data extraction
- Constrained creative writing
Custom Sampling Strategies
Build domain-specific sampling with fine-grained control.
import {
applyTopK,
applyTopP,
applyMinP,
applyTemperature,
sampleFromSet,
SamplerWorkspace,
TokenHistoryTracker,
} from './sampling';
// Custom strategy: Aggressive filtering for concise responses
function sampleConcise(
logits: Float32Array,
tokenHistory: TokenHistoryTracker,
temperature: number
): number {
const workspace = new SamplerWorkspace(128); // Smaller workspace
// Build custom chain
let candidates = applyTopK(logits, 20, workspace); // Fewer candidates
candidates = applyMinP(candidates, 0.1); // Aggressive min-p
candidates = applyTopP(candidates, 0.8); // Tighter nucleus
candidates = applyTemperature(candidates, temperature);
return sampleFromSet(candidates);
}
// Usage
const tokenId = sampleConcise(
new Float32Array(native.getLogits()),
tokenHistory,
0.7
);
Advanced patterns:
- Domain-specific filter combinations
- Custom probability transformations
- Multi-stage filtering
- Adaptive sampling based on context
Test-Time Alignment (TTA)
Test-Time Alignment is the fusion of app-state with sampling strategy to steer model outputs at every token step. Unlike traditional fixed-parameter sampling, TTA allows you to:
- Modify logits based on application state (constraints, domain knowledge)
- Adapt sampling parameters based on distribution health or uncertainty
- Switch strategies mid-generation without reinitialization
Architecture: App-State × Sampler Fusion
import {
sampleWithStrategy,
computeModelEntropy,
computeModelSurprisal,
requiredKcap,
RollingPerplexity,
} from '@lloyal-labs/tsampler';
const ppl = new RollingPerplexity();
while (generating) {
const logits = new Float32Array(native.getLogits());
// 1. App-State → Logit Steering
applyDomainConstraints(logits, appState);
// 2. Distribution Analysis → Strategy Selection
const entropy = computeModelEntropy(logits);
const params = selectStrategy(entropy, appState);
// 3. Dynamic Capacity Management
workspace.ensureCapacity(
requiredKcap(params.topK, params.typicalP, logits.length)
);
// 4. Sample with adapted strategy
const token = sampleWithStrategy(logits, {
tokenHistory,
params,
workspace,
prng,
});
// 5. Update app state + quality tracking
updateAppState(token, appState);
const surprisal = computeModelSurprisal(logits, token);
ppl.addSurprisal(surprisal);
// Optional: KV eviction gate
if (ppl.ppl() > 50) {
console.warn('High perplexity - consider cache pruning or retrieval');
}
}
Dynamic Workspace Capacity
The workspace automatically grows to accommodate changing parameters:
import { SamplerWorkspace, requiredKcap } from '@lloyal-labs/tsampler';
const workspace = new SamplerWorkspace(256); // Initial capacity
// Token 1: Focused sampling (topK=40)
const token1 = sampleWithStrategy(logits1, {
tokenHistory,
params: { topK: 40, temperature: 0.8 },
workspace, // kcap=256 (sufficient)
prng,
});
// Token 2: Uncertainty spike → widen search (topK=320)
const token2 = sampleWithStrategy(logits2, {
tokenHistory,
params: { topK: 320, temperature: 1.2 },
workspace, // Auto-grows to kcap=512 (power-of-two)
prng,
});
// Token 3: Enable typical-P (needs larger pool)
const token3 = sampleWithStrategy(logits3, {
tokenHistory,
params: { topK: 40, typicalP: 0.9, temperature: 0.6 },
workspace, // Stays at kcap=512 (typical-P requires ≥512)
prng,
});
Growth Strategy:
- Power-of-two sizing: Grows 40→64→128→256→512 (~5 allocations max per session)
- Monotonic growth: Never downsizes (avoids churn in bursty workloads)
- Zero reallocations: After initial growth, zero allocations per token
- Version tracking: Guards against stale CandidateSet references after growth
Capacity Calculation Helper
import {
requiredKcap,
DEFAULT_KCAP,
TYPICAL_P_KCAP,
} from '@lloyal-labs/tsampler';
// Calculate required capacity for given params
const needed = requiredKcap(
params.topK, // undefined or 0 = use default
params.typicalP, // < 1.0 triggers TYPICAL_P_KCAP (512)
vocabSize // Upper bound (no point allocating beyond V)
);
workspace.ensureCapacity(needed);
Strategy:
- typicalP < 1.0 → Need ≥512 for stable entropy calculation
- Else → max(topK, DEFAULT_KCAP) for nucleus/min-p sampling
- Clamps to vocabSize (no over-allocation)
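The rules above amount to roughly the following (a behavioural sketch of requiredKcap, not the library source; the constants stand in for the exported DEFAULT_KCAP / TYPICAL_P_KCAP):
// Sketch only: illustrates the documented capacity rules.
function requiredKcapSketch(
  topK: number | undefined,
  typicalP: number | undefined,
  vocabSize: number
): number {
  const DEFAULT_KCAP = 256;
  const TYPICAL_P_KCAP = 512;
  const base = Math.max(topK || 0, DEFAULT_KCAP);   // undefined or 0 → default
  const needed =
    typicalP !== undefined && typicalP < 1.0
      ? Math.max(base, TYPICAL_P_KCAP)              // typical-P needs ≥512
      : base;
  return Math.min(needed, vocabSize);               // never allocate beyond V
}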
Example 1: Adaptive Sampling Based on Entropy
import {
sampleWithStrategy,
computeModelEntropy,
TokenHistoryTracker,
Xoroshiro128Plus,
SamplerWorkspace,
} from '@lloyal-labs/tsampler';
function selectStrategy(entropy: number) {
if (entropy < 2.0) {
// Collapsed distribution → widen search
return {
topK: 256,
temperature: 1.5,
topP: 0.98,
};
} else if (entropy > 5.0) {
// Too flat → focus sampling
return {
topK: 20,
temperature: 0.5,
topP: 0.9,
};
} else {
// Healthy distribution → standard sampling
return {
topK: 40,
temperature: 0.8,
topP: 0.95,
};
}
}
// Per-token adaptive sampling
const prng = new Xoroshiro128Plus(42);
const tokenHistory = new TokenHistoryTracker(64);
const workspace = new SamplerWorkspace(256);
while (generating) {
const logits = new Float32Array(native.getLogits());
// Compute model-level entropy (before filters)
const entropy = computeModelEntropy(logits);
const params = selectStrategy(entropy);
const token = sampleWithStrategy(logits, {
tokenHistory,
params,
workspace, // Grows as needed
prng,
});
tokenHistory.accept(token);
native.decode([token]);
}
Example 2: Constraint-Based Logit Steering
Domain-specific constraints via logit manipulation:
// Hard constraint: JPY doesn't use decimal subdivisions
if (parsedSoFar.currency === 'JPY' && currentField === 'amount') {
logits[DECIMAL_TOKEN_ID] = -Infinity; // Veto
DIGIT_TOKENS.forEach((id) => (logits[id] += 2.0)); // Boost integers
}
const token = sampleWithStrategy(logits, {
tokenHistory,
params: { temperature: 0.8, topK: 40 },
workspace,
prng,
});
Example 3: Medical Report Validation
Logit steering based on application state:
const currentMetric = detectCurrentMetric(accumulated);
if (currentMetric === 'glucose') {
const { value, range } = reportData.glucose;
const [min, max] = range;
if (value > max) {
// Elevated glucose → bias toward correct terminology
ELEVATED_TOKENS.forEach((id) => (logits[id] += 10.0));
NORMAL_TOKENS.forEach((id) => (logits[id] = -Infinity)); // Veto incorrect
}
}
const token = sampleWithStrategy(logits, {
tokenHistory,
params: { temperature: 0.7, topK: 40 },
workspace,
prng,
});
Example 4: Exploratory Bursts
// Normal generation
let baseParams = { topK: 40, temperature: 0.8 };
// Detect uncertainty (e.g., all top-5 probs < 0.15)
if (isUncertain(logits)) {
// Temporary exploration burst
const exploratoryParams = {
topK: 256, // Widen candidate pool
temperature: 1.3, // Increase randomness
typicalP: 0.9, // Filter atypical tokens
};
const token = sampleWithStrategy(logits, {
tokenHistory,
params: exploratoryParams,
workspace, // Auto-grows to 512 for typical-P
prng,
});
// Next token: back to focused sampling
// workspace stays at 512 (no downsize churn)
} else {
const token = sampleWithStrategy(logits, {
tokenHistory,
params: baseParams,
workspace,
prng,
});
}
Example 5: Perplexity Monitoring & Quality Tracking
import {
sampleWithStrategy,
computeModelSurprisal,
computeSamplingSurprisal,
RollingPerplexity,
TokenHistoryTracker,
Xoroshiro128Plus,
SamplerWorkspace,
} from '@lloyal-labs/tsampler';
const ppl = new RollingPerplexity();
const prng = new Xoroshiro128Plus(42);
const tokenHistory = new TokenHistoryTracker(64);
const workspace = new SamplerWorkspace(256);
// Per-token quality metrics
const qualityLog: Array<{
token: number;
modelSurprisal: number;
samplingSurprisal: number;
runningPpl: number;
}> = [];
while (generating) {
const logits = new Float32Array(native.getLogits());
// Sample token
const token = sampleWithStrategy(logits, {
tokenHistory,
params: { topK: 40, temperature: 0.8 },
workspace,
prng,
});
// Track model-level surprisal (before filters)
const modelSurprisal = computeModelSurprisal(logits, token);
ppl.addSurprisal(modelSurprisal);
// Track sampling-level surprisal (post-filter)
// Note: For full tracking, you'd capture candidates from getTopCandidates
// This example shows the API usage pattern
qualityLog.push({
token,
modelSurprisal,
samplingSurprisal: modelSurprisal, // Approximation when candidates not available
runningPpl: ppl.ppl(),
});
// Quality gates
if (modelSurprisal > 8.0) {
console.warn(
`High uncertainty token: surprisal=${modelSurprisal.toFixed(2)}`
);
}
if (ppl.ppl() > 50) {
console.warn(
`Sequence perplexity high: ${ppl.ppl().toFixed(2)} - consider retrieval`
);
// Trigger: cache pruning, RAG retrieval, or context compression
}
tokenHistory.accept(token);
native.decode([token]);
}
// Sequence-level metrics
console.log(`Final perplexity: ${ppl.ppl().toFixed(2)}`);
console.log(
`Avg surprisal: ${(qualityLog.reduce((sum, m) => sum + m.modelSurprisal, 0) / qualityLog.length).toFixed(2)} nats`
);
// Identify high-uncertainty spans
const uncertainSpans = qualityLog
  .map((m, i) => ({ position: i, surprisal: m.modelSurprisal })) // keep original positions
  .filter((s) => s.surprisal > 6.0);
console.log('High-uncertainty tokens:', uncertainSpans);
Use cases:
- KV cache eviction gates: High perplexity → prune old context or fetch from retrieval
- Quality monitoring: Track surprisal/perplexity for confidence estimates
- Debugging: Identify uncertain spans for manual review
- A/B testing: Compare perplexity across different prompts or models
- Dashboard signals: Real-time uncertainty visualization
Performance Characteristics
Growth amortization:
- First token with K=40: Allocates 64 capacity (~0.5KB)
- Later token with K=320: Grows to 512 capacity (~4KB, one-time)
- Subsequent tokens: Zero allocations (reuses buffers)
Typical session progression:
- 40 → 64 → 128 → 256 → 512 (at most ~5 reallocations)
- Total overhead: ~10-20ms across entire session
- Per-token overhead after growth: <0.01ms (negligible)
Memory footprint:
- K buffers: 512 × 12 bytes = 6KB (max)
- Working logits: 65536 × 4 bytes = 262KB (lazy allocated, optional)
- Total: <270KB worst-case
When to Use TTA
Use TTA when you need:
- ✅ Domain-specific constraints (JSON schemas, business rules)
- ✅ Runtime validation (medical thresholds, financial rules)
- ✅ Adaptive sampling (entropy-based, uncertainty-triggered)
- ✅ Multi-modal generation (switch strategies by content type)
- ✅ Exploratory search (temporary parameter bursts)
Don't use TTA for:
- ❌ Simple fixed-parameter generation (use native sampling)
- ❌ Maximum performance critical paths (~6-11% overhead)
- ❌ Battery-constrained devices (every millisecond matters)
Metrics & Telemetry
Distribution metrics are orthogonal to sampling - they observe and measure without affecting token selection. Metrics enable runtime analytics and decision-making without interfering with the sampling process.
Key Principle: Observation Without Interference
Metrics compute surprisal/entropy from logits at two measurement levels:
- Model metrics: Raw logits (before filters) - model's inherent belief
- Sampling metrics: Post-filter logits (after top-k/p/temp) - actual sampled distribution
Use Cases:
- KV cache eviction gates: High perplexity triggers retrieval or pruning
- Quality monitoring: Track confidence estimates for generated sequences
- Dashboard signals: Real-time uncertainty visualization
- Analytics: Post-hoc analysis of generation quality
Example: Per-Step Metrics Collection
import {
sampleWithStrategy,
computeModelEntropy,
computeModelSurprisal,
RollingPerplexity,
getTopCandidates,
} from '@lloyal-labs/tsampler';
const ppl = new RollingPerplexity();
while (generating) {
const logits = new Float32Array(native.getLogits());
// 1. Compute metrics BEFORE sampling (no side effects on sampling)
const entropy = computeModelEntropy(logits);
const topCandidates = getTopCandidates(logits, 5); // For UI visualization
// 2. Sample token (metrics don't affect this)
const token = sampleWithStrategy(logits, {
tokenHistory,
params: { temperature: 0.8, topK: 40 },
workspace,
prng,
});
// 3. Track quality metrics AFTER sampling
const surprisal = computeModelSurprisal(logits, token);
ppl.addSurprisal(surprisal);
// 4. Analytics/eviction gates (metrics inform actions)
if (ppl.ppl() > 50) {
console.warn(
'High perplexity detected - consider KV eviction or retrieval'
);
}
if (entropy > 5.0) {
console.log('High uncertainty - model is exploring multiple paths');
}
// Accept token and continue
tokenHistory.accept(token);
}
Available Metrics:
- computeModelSurprisal(logits, tokenId) - Surprisal of chosen token from model distribution
- computeSamplingSurprisal(candidateLogits, candidateIds, tokenId) - Surprisal from filtered candidates
- computeModelEntropy(logits) - Entropy of full model distribution
- computeSamplingEntropy(candidateLogits) - Entropy of filtered candidates
- RollingPerplexity - Track sequence-level quality via perplexity
All metrics use numerically stable log-sum-exp and support both nats (default) and bits output.
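For reference, surprisal in nats can be computed from raw logits with the standard stable log-sum-exp formulation (a standalone sketch; the library's internals may differ in detail):
// surprisal(t) = logSumExp(logits) - logits[t], in nats
function surprisalNats(logits: Float32Array, tokenId: number): number {
  let max = -Infinity;
  for (const l of logits) if (l > max) max = l;
  let sumExp = 0;
  for (const l of logits) sumExp += Math.exp(l - max); // subtract max for stability
  return max + Math.log(sumExp) - logits[tokenId];
}

const nats = surprisalNats(logits, tokenId);
const bits = nats / Math.LN2; // convert nats → bits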
See also:
- Example 5: Perplexity Monitoring - Full TTA workflow with metrics
Invariants and Guarantees
Sampler Chain Order (FIXED)
INVARIANT: Chain order NEVER changes:
- Penalties (virtual, via accessor)
- Top-K
- Typical-P (if enabled)
- Top-P (if enabled)
- Min-P (if enabled)
- Top-N-Sigma (if enabled)
- Temperature (ALWAYS LAST structural operation)
- Sample
Rationale: Matches llama.cpp exactly (discussion #7590)
Temperature Timing (CRITICAL)
INVARIANT: Temperature applied AFTER all structural filters (top-k, nucleus, etc.)
Why this matters:
- Structural filters operate on raw probabilities
- Temperature BEFORE filters reduces their effectiveness
- Example: a flattening temperature applied before topP inflates the nucleus and keeps tokens that top-P on the raw distribution would drop
Guarantee: sampleWithStrategy() enforces correct ordering.
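A self-contained toy demonstration of why the ordering matters (not library code): with a flattening temperature applied before the nucleus cutoff, the cumulative-mass threshold admits extra tokens.
// How many tokens survive a p=0.9 nucleus cutoff, with temperature applied first?
function nucleusSize(toyLogits: number[], temp: number, p: number): number {
  const scaled = toyLogits.map((l) => l / temp);
  const maxL = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - maxL));
  const z = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / z).sort((a, b) => b - a);
  let cumulative = 0;
  let kept = 0;
  for (const pr of probs) {
    cumulative += pr;
    kept++;
    if (cumulative >= p) break;
  }
  return kept;
}

nucleusSize([5, 3, 2, 1, 0], 1.0, 0.9); // 2 tokens: nucleus computed on the raw distribution
nucleusSize([5, 3, 2, 1, 0], 2.0, 0.9); // 4 tokens: flattened first, nucleus balloons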
Zero-Copy Logits (PERFORMANCE)
INVARIANT: Input logits array NEVER modified by sampleWithStrategy()
Implementation:
- Penalties applied virtually via accessor function
- Filters operate on workspace buffers
- Original logits remain unchanged
Benefit: Allows logits reuse, inspection, debugging without defensive copies.
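A quick sketch (using the Quick Start setup) of what this enables:
const before = logits[123];
const tokenId = sampleWithStrategy(logits, {
  tokenHistory,
  params: { temperature: 0.8, topK: 40 },
  prng,
});
console.assert(logits[123] === before); // input buffer untouched
const top = getTopCandidates(logits, 5); // safe to reuse for debugging/UI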
Greedy Fast-Path (OPTIMIZATION)
INVARIANT: temperature < 1e-3 or topK === 1 triggers greedy selection
Skips: All filters (top-k, nucleus, temperature, etc.)
Exception: Penalties still applied (correctness critical)
Implementation:
if (temp < 1e-3 || topK === 1) {
if (hasPenalties) {
// Greedy with penalties
const set = applyTopK(logits, 1, workspace, penaltyAccessor);
return set.indices[0];
}
// Plain greedy
return greedy(logits);
}
Candidate Count Monotonicity
INVARIANT: Each filter narrows candidate set (count decreases or stays same)
Guarantee: count_after_filter ≤ count_before_filter
Exception: None. All filters strictly non-increasing.
Renormalization (CORRECTNESS)
INVARIANT: sampleFromSet() renormalizes probabilities before sampling
Why: Filtered sets may not sum to 1.0 (filters remove mass)
Formula:
const totalMass = sum(probs[0..count]); // mass over active candidates [0, count)
const rand = random() * totalMass;
Guarantee: Sampling is unbiased with respect to filtered distribution.
Top-K Special Cases
INVARIANT: topK >= vocab_size means NO truncation (preserves token ID order)
Rationale: Matches llama.cpp behavior (user explicitly requested all tokens)
Implementation:
if (k >= V) {
return candidateSetFromFullLogits(logits, ws, penaltyAccessor);
}
Use case: Verification mode, custom post-processing.
Top-P Special Cases
INVARIANT 1: topP >= 1.0 disables filter (no-op)
INVARIANT 2: topP <= 0.0 collapses to greedy by probability
Critical: Greedy by probability (NOT by index)
- Must find argmax explicitly
- Typical-P may have reordered candidates
- Returning indices[0] would be WRONG
Typical-P Entropy Stability
INVARIANT: Typical-P requires larger candidate pool for stable entropy
Implementation:
- Fast mode uses TYPICAL_P_KCAP = 512 (not DEFAULT_KCAP = 256)
- Ensures entropy calculation over sufficient candidates
Rationale: Meister et al. 2022 - entropy over 512 tokens provides better estimate.
Min-P Relative Threshold
INVARIANT: minProb = maxProb * threshold
Adapts to distribution:
- High max prob → aggressive filtering
- Low max prob → less filtering
Always keeps argmax: Token with max prob always passes filter.
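Worked numbers for the relative threshold above (illustrative):
const minP = 0.05;
const confidentThreshold = 0.80 * minP; // argmax prob 0.80 → cutoff 0.04 (aggressive)
const uncertainThreshold = 0.10 * minP; // argmax prob 0.10 → cutoff 0.005 (lenient)
// The argmax always passes, since maxProb >= maxProb * minP for any minP <= 1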
Top-N-Sigma Semantics
INVARIANT: topNSigma <= 0 is NO-OP (NOT greedy)
Rationale: llama.cpp PR#13264 - negative values disable filter.
Edge case: All logits identical → std = 0 → collapse to single max token.
Penalty Application Order
INVARIANT: Penalties applied in order: repeat → frequency → presence
Matches: llama.cpp line 1774
Virtual application: sampleWithStrategy() applies penalties via accessor (zero-copy).
Workspace Reuse (ZERO-ALLOC)
INVARIANT: Preallocated workspace buffers reused per-token
Performance: Reduces allocations from ~100/token to ~0
Buffers:
- idxK, valK, probK: For heap-based Top-K (size K)
- tmpIdx, tmpLogits, tmpProbs: For re-sorting (size K)
- workingLogits: For penalty application (size V, optional)
Guarantee: Zero allocations per token after initialization.
Performance Characteristics
TypeScript Sampling
| Operation     | Time (ms) | Notes                                          |
| ------------- | --------- | ---------------------------------------------- |
| Logits access | 0.05-0.07 | Zero-copy ArrayBuffer (measured in production) |
| Top-K (k=40)  | 1-2       | Heap-based selection, O(V log K)               |
| Nucleus       | 0.5-1     | O(K log K) sorting                             |
| Softmax       | 0.2-0.5   | V8 JIT optimized                               |
| Total         | 3-5       | Per token (average across engines)             |
TypeScript Performance by Engine
| Engine  | Time per token | Notes                      |
| ------- | -------------- | -------------------------- |
| Node/V8 | ~2-4ms         | Server-side, JIT optimized |
| Hermes  | ~4-8ms         | Android mid-tier devices   |
| JSC     | ~3-6ms         | iOS Safari/WebKit          |
Native Sampling (C++)
| Operation     | Time (ms) | Notes          |
| ------------- | --------- | -------------- |
| Logits access | 0.01      | Pointer access |
| Top-K (k=40)  | 0.1-0.3   | SIMD optimized |
| Total         | 0.1-0.3   | Per token      |
Context
- Decode time: 40-500ms per token (model-dependent: small models ~40ms, large models ~500ms)
- TS overhead: 3-5ms average (engine-specific: 2-8ms range)
- Overhead %: ~6-11% of decode time for typical models
- User perception: Imperceptible (<10ms latency addition)
Optimization Strategies
Fast Mode (default):
- Pre-truncate to KCAP (256 or 512 for typical-p)
- O(V log K) complexity
- ~3-5ms per token
Parity Mode (verification only):
- Start from full V candidates
- O(V) complexity
- ~10-15ms per token
- Use only for llama.cpp verification tests
Workspace Reuse:
- Preallocate buffers once
- Reuse across tokens
- Zero allocations per token
Sparse Penalties:
- Iterate over history (H ≈ 10-50)
- Not full vocabulary (V = 65,536)
- O(H) instead of O(V)
Typical-P K-Cap Trade-Off
Problem: Typical-P requires larger candidate pool for stable entropy.
Solution:
- DEFAULT_KCAP = 256 (for most filters)
- TYPICAL_P_KCAP = 512 (when typical-p enabled)
Performance impact:
- 512 vs 256: ~1-2ms additional (negligible vs decode time)
- Benefit: Better entropy estimate → more effective filtering
Testing
Test Coverage
Comprehensive test coverage for all components:
npm test
Test suites:
- golden.test.ts - llama.cpp parity tests
- sampling.test.ts - Sampling strategies
- penalties.test.ts - Penalty formulas and token history
- prng.test.ts - Deterministic PRNG behavior
- correctness.test.ts - P0 bug fixes (greedy with penalties, top-p p<=0)
- metrics.test.ts - Entropy, surprisal, perplexity
- workspace.test.ts - Dynamic capacity management
- integration.test.ts - End-to-end workflows
Total: 181 tests across 8 test suites (vitest, ~400ms)
Golden Tests
Status: ✅ 20/20 passing
Coverage: All llama.cpp test cases from test-sampling.cpp
Validates:
- Exact penalty formulas (repeat, frequency, presence)
- Sampler chain order
- Temperature timing
- Token ID order preservation (when k >= V)
Correctness Tests
Status: ✅ 12/12 passing
Critical bugs fixed:
- Top-P p<=0 greedy: Must find argmax by probability (not return index 0)
- Greedy with penalties: Must apply penalties even in fast-path
- Heap sort bug: Fixed reverse loop that flipped descending to ascending
PRNG Tests
Validates:
- Same seed → same sequence
- resetPRNG() restores initial state
- Xoroshiro128+ algorithm correctness
Native Parity
Coverage: Partial
Key findings:
- Same sampler chain order
- Same default parameters
- Same greedy fast-path logic
- Minor differences: top-n-sigma (TS only), seed precision (ms vs s)
Interchangeability: Users can switch between 'typescript' and 'native' paths and get equivalent results (within numerical precision).
Directory Structure
sampling/
├── index.ts # Public API exports
│
├── Core Sampling # Individual sampling strategies
│ ├── greedy.ts # Argmax selection (deterministic)
│ ├── softmax.ts # Logits → probabilities
│ └── filters.ts # All filters (topK, topP, minP, typicalP, topNSigma)
│
├── strategies.ts # Combined strategy + penalty integration
│
├── Penalties # Repetition control (exact llama.cpp formulas)
│ └── penalties.ts # Repeat, frequency, presence penalties
│
├── Determinism
│ └── prng.ts # Xoroshiro128+ PRNG for reproducibility
│
├── Native Fast Path
│ └── sample-native.ts # Optional C++ passthrough
│
├── Utilities
│ └── utils.ts # Shared helpers (temperature scaling, heap sort)
│
└── __tests__/ # Comprehensive test coverage
├── penalties.test.ts
├── sampling.test.ts
├── prng.test.ts
    └── correctness.test.ts
Implementation Coverage
Implemented (grounded in llama.cpp):
- ✅ Greedy (argmax)
- ✅ Top-K sampling
- ✅ Top-P (nucleus) sampling
- ✅ Min-P sampling
- ✅ Typical-P (locally typical) sampling
- ✅ Top-N-Sigma filtering
- ✅ Temperature scaling
- ✅ Repetition penalties (repeat, frequency, presence)
- ✅ Grammar constraints (GBNF)
- ✅ Deterministic PRNG (seed-based)
Not Yet Implemented (advanced/experimental):
- ❌ Mirostat sampling (stateful, complex)
- ❌ DRY (Don't Repeat Yourself) sampling
- ❌ XTC sampler
Coverage: 10/13 sampling methods (77%) - all essential methods implemented
Migration from v1.x
Breaking Change: Standalone samplers (topK, topP, minP, etc.) removed.
Why?
The old standalone samplers had a critical bug: temperature was applied BEFORE structural filters.
New functional filters apply temperature AFTER all structural filters (correct llama.cpp parity).
Migration Guide
Before (v1.x):
import { topP } from './sampling';
const tokenId = topP(logits, 0.95, 0.8);
After (v2.x):
import {
sampleWithStrategy,
TokenHistoryTracker,
initializePRNG,
} from './sampling';
// One-time setup per completion
const seed = samplingParams?.seed ?? Date.now();
initializePRNG(seed);
const tokenHistory = new TokenHistoryTracker(64);
// Per-token sampling
const tokenId = sampleWithStrategy(logits, {
  tokenHistory,
  params: {
    topP: 0.95,
    temperature: 0.8,
  },
});
tokenHistory.accept(tokenId);
Benefits:
- ✅ Correct temperature timing (llama.cpp parity)
- ✅ Penalty support (repeat, frequency, presence)
- ✅ Filter composition (combine multiple filters)
- ✅ Deterministic sampling (seed-based PRNG)
- ✅ Grammar constraints (via native layer)
Advanced: Direct Filter Usage
If you need fine-grained control:
import {
applyTopK,
applyTopP,
applyTemperature,
sampleFromSet,
SamplerWorkspace,
} from './sampling';
const workspace = new SamplerWorkspace(256);
let candidates = applyTopK(logits, 256, workspace);
candidates = applyTopP(candidates, 0.95);
candidates = applyTemperature(candidates, 0.8); // ALWAYS LAST
const tokenId = sampleFromSet(candidates);
Future Enhancements
- Logit Lens visualization (probability distribution inspection)
- Custom sampling strategies (domain-specific filters)
- Mirostat sampling (if requested by users)
- DRY sampling (if requested by users)
- Beam search support
- Constrained decoding (beyond GBNF)
- Token probability logging for debugging
- Sampling analytics and telemetry
References
- llama.cpp sampler chain discussion
- Nucleus Sampling - Holtzman et al. 2020
- Locally Typical Sampling - Meister et al. 2022
- Top-N-Sigma PR - llama.cpp#13264
- Xoroshiro128+ Algorithm
License
Apache 2.0
