
@lloyal-labs/tsampler v0.2.0

TypeScript Sampling Library

Pure TypeScript implementations of sampling algorithms for LLM token generation with exact llama.cpp parity.

npm install @lloyal-labs/tsampler

Performance: ~3-5ms per token (vs ~0.1-0.3ms native)
Trade-off: ~6-11% overhead for full flexibility, OTA updates, and transparency




Quick Start

import {
  TokenHistoryTracker,
  sampleWithStrategy,
  Xoroshiro128Plus,
} from '@lloyal-labs/tsampler';

// Create instances (once per completion)
const seed = samplingParams?.seed ?? Date.now();
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);

// Get logits from native layer
const logitsBuffer = native.getLogits();
const logits = new Float32Array(logitsBuffer);

// Sample with combined strategy
const tokenId = sampleWithStrategy(logits, {
  tokenHistory,
  params: {
    temperature: 0.8,
    topK: 40,
    topP: 0.95,
    minP: 0.05,
    penaltyRepeat: 1.1,
  },
  prng, // Instance PRNG (no global state)
});

// Accept token for history tracking
tokenHistory.accept(tokenId);

Why TypeScript Sampling?

Advantages

  • Test-Time Alignment (TTA): Fuse app-state with sampling strategy at every token step
  • OTA Updates: Sampling logic can evolve without C++ changes
  • Custom Strategies: Easy to implement domain-specific sampling
  • Logit Steering: Apply domain constraints, validation rules, and business logic
  • Transparency: Full visibility into token probabilities and decisions
  • Debugging: Inspect logits, penalties, and sampling in real-time
  • Exact Control: Penalties match llama.cpp exactly (no black box)
  • Grammar Support: Integrates seamlessly with GBNF constraints

When to Use Native Sampling

Use native C++ sampling (samplingPath: 'native') when you need:

  • Maximum performance (performance-critical applications)
  • Simple greedy/top-k sampling without advanced features
  • Battery-constrained devices where every millisecond matters

API Reference

Core Functions

sampleWithStrategy()

Main entry point for sampling with combined strategy.

function sampleWithStrategy(logits: Float32Array, opts: SampleOptions): number;

interface SampleOptions {
  tokenHistory: TokenHistoryTracker;
  params?: SamplingParams;
  workspace?: SamplerWorkspace;
  mode?: SamplerMode; // 'fast' (default) or 'parity'
  prng?: Xoroshiro128Plus; // Instance PRNG (recommended)
}

Parameters:

  • logits: Logits array (zero-copy, NOT modified)
  • opts.tokenHistory: Token history tracker for penalties
  • opts.params: Sampling parameters (see Sampling Parameters)
  • opts.workspace: Preallocated buffers (optional, reuse across tokens for zero-alloc)
  • opts.mode: 'fast' (default, O(V log K)) or 'parity' (O(V), exact llama.cpp verification)
  • opts.prng: Instance PRNG for deterministic sampling (recommended, avoids multi-stepper collisions)

Returns: Sampled token ID

Sampler Chain Order (matches llama.cpp):

  1. Penalties (virtual, via accessor)
  2. Top-K
  3. Typical-P (if typicalP < 1.0)
  4. Top-P (if topP < 1.0)
  5. Min-P (if minP > 0.0)
  6. Top-N-Sigma (if topNSigma > 0.0)
  7. Temperature (applied AFTER all structural filters)
  8. Sample (with renormalization)

Fast-Paths:

  • Greedy: temperature < 1e-3 or topK === 1 → argmax selection (skips all filters)
  • With penalties: Applies penalties even in greedy mode (correctness critical)

greedy()

Argmax selection (deterministic).

function greedy(logits: Float32Array): number;

Returns token with highest logit value. O(V) single-pass.
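The selection rule can be sketched as a single-pass argmax. The standalone `argmax` below is a hypothetical illustration, not the library export:

```typescript
// Minimal sketch of greedy selection: one pass over the logits,
// returning the index of the highest value. O(V), no allocation.
function argmax(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

console.log(argmax(new Float32Array([0.1, 2.5, -1.0, 0.7]))); // → 1
```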


getTopCandidates()

Get top-N candidates with probabilities (useful for visualization).

function getTopCandidates(
  logits: Float32Array,
  n: number = 10
): Array<{ tokenId: number; probability: number }>;

Performance: O(V log N) using heap-based selection (not full sort).
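The returned shape can be sketched as softmax followed by top-N selection. A full sort is used below for clarity, where the library uses heap-based selection; this is an illustrative stand-in, not the library implementation:

```typescript
// Sketch of what getTopCandidates() returns: softmax over the logits
// (max-subtracted for numerical stability), then the n most probable tokens.
function topCandidatesSketch(
  logits: Float32Array,
  n = 10
): Array<{ tokenId: number; probability: number }> {
  const max = logits.reduce((a, b) => Math.max(a, b), -Infinity);
  const exps = Array.from(logits, (l) => Math.exp(l - max));
  const z = exps.reduce((a, b) => a + b, 0);
  return exps
    .map((e, tokenId) => ({ tokenId, probability: e / z }))
    .sort((a, b) => b.probability - a.probability)
    .slice(0, n);
}

const top = topCandidatesSketch(new Float32Array([1.0, 3.0, 2.0]), 2);
console.log(top[0].tokenId); // → 1
```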


Sampling Parameters

interface SamplingParams {
  // Structural filters
  topK?: number; // Default: 40
  topP?: number; // Default: 0.95
  minP?: number; // Default: 0.05
  typicalP?: number; // Default: 1.0 (disabled)
  topNSigma?: number; // Default: -1.0 (disabled)

  // Temperature (applied after all filters)
  temperature?: number; // Default: 0.8

  // Penalties
  penaltyRepeat?: number; // Default: 1.1
  penaltyFreq?: number; // Default: 0.0
  penaltyPresent?: number; // Default: 0.0

  // Determinism
  seed?: number; // Default: Date.now()
}

Parameter Details

topK - Top-K sampling

  • Keep top K tokens by logit value
  • Common values: 40-80
  • topK = 0 (fast mode): Pre-truncate to 256 for performance
  • topK >= vocab_size: No truncation (preserve full vocabulary in token ID order)
  • Greedy: topK = 1

topP - Nucleus (top-p) sampling

  • Keep smallest set where cumulative probability ≥ p
  • Common values: 0.9-0.95
  • Adapts to distribution shape (dynamic K)
  • topP >= 1.0: Disabled
  • topP <= 0.0: Greedy by probability (argmax)
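The cumulative rule can be sketched as follows. For brevity this assumes an already-normalized probability array rather than raw logits:

```typescript
// Sketch of the nucleus rule: sort by probability descending, keep the
// smallest prefix whose cumulative mass reaches p.
function topPKeep(probs: number[], p: number): number[] {
  const order = probs
    .map((prob, tokenId) => ({ tokenId, prob }))
    .sort((a, b) => b.prob - a.prob);
  const kept: number[] = [];
  let cum = 0;
  for (const { tokenId, prob } of order) {
    kept.push(tokenId);
    cum += prob;
    if (cum >= p) break;
  }
  return kept;
}

console.log(topPKeep([0.5, 0.3, 0.15, 0.05], 0.9)); // → [0, 1, 2]
```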

minP - Minimum probability threshold (relative)

  • Filter tokens where prob < max_prob * minP
  • Common values: 0.05-0.1
  • Adapts to distribution confidence
  • minP <= 0.0: Disabled
  • Always keeps argmax (highest prob token)
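The relative threshold can be sketched as follows (again assuming normalized probabilities, for brevity):

```typescript
// Sketch of min-p: drop tokens whose probability falls below
// maxProb * minP. The argmax always survives (it equals maxProb).
function minPKeep(probs: number[], minP: number): number[] {
  const maxProb = Math.max(...probs);
  return probs
    .map((p, tokenId) => ({ tokenId, p }))
    .filter(({ p }) => p >= maxProb * minP)
    .map(({ tokenId }) => tokenId);
}

console.log(minPKeep([0.6, 0.3, 0.02, 0.08], 0.1)); // → [0, 1, 3]  (threshold 0.06)
```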

typicalP - Locally typical sampling

  • Keep tokens with "locally typical" information content
  • Filters tokens whose entropy diverges from expected
  • Common values: 0.95 (disabled by default: 1.0)
  • Requires larger candidate pool (512 vs 256) for stable entropy
  • typicalP >= 1.0: Disabled

topNSigma - Statistical filtering

  • Keep tokens within N standard deviations of max logit
  • Statistical approach to filtering unlikely tokens
  • Common values: 2.0 (disabled by default: ≤ 0)
  • topNSigma <= 0.0: Disabled (NOT greedy, per llama.cpp PR#13264)

temperature - Temperature scaling

  • Controls randomness of distribution
  • temp > 1.0: Flatter distribution (more random)
  • temp = 1.0: No change
  • temp < 1.0: Sharper distribution (more deterministic)
  • temp → 0: Approaches greedy
  • temp < 1e-3: Triggers greedy fast-path
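Temperature scaling divides logits by T before softmax, as this sketch shows: lower T concentrates probability on the top token, higher T spreads it out.

```typescript
// Sketch of temperature scaling: softmax(logits / T).
// T < 1 sharpens the distribution; T > 1 flattens it.
function softmaxWithTemp(logits: number[], temp: number): number[] {
  const scaled = logits.map((l) => l / temp);
  const max = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - max));
  const z = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / z);
}

const demo = [2.0, 1.0, 0.0];
console.log(softmaxWithTemp(demo, 0.5)[0] > softmaxWithTemp(demo, 1.0)[0]); // → true
```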

penaltyRepeat - Repetition penalty

  • Multiplicative penalty for tokens in history
  • Formula:
    • logit *= penalty (if logit ≤ 0)
    • logit /= penalty (if logit > 0)
  • Common values: 1.0-1.5
  • Default: 1.1

penaltyFreq - Frequency penalty

  • Subtractive penalty scaled by token count
  • Formula: logit -= count * penalty
  • Common values: 0.0-2.0
  • Default: 0.0

penaltyPresent - Presence penalty

  • Flat penalty if token appears at least once
  • Formula: logit -= penalty (if token in history)
  • Common values: 0.0-1.0
  • Default: 0.0

seed - RNG seed for deterministic sampling


Penalties

TokenHistoryTracker

Manages token history with sliding window and frequency tracking.

class TokenHistoryTracker {
  constructor(penaltyLastN: number);

  // Accept token into history
  accept(token: number): void;

  // Get occurrence count
  getCount(token: number): number;

  // Check if token exists
  hasToken(token: number): boolean;

  // Reset history
  reset(): void;

  // Get window size
  size(): number;

  // Get unique tokens (for sparse iteration)
  getUniqueTokens(): number[];

  // Compute penalty adjustment for single token
  computeAdjustment(
    tokenId: number,
    baseLogit: number,
    params: {
      repeat?: number;
      frequency?: number;
      presence?: number;
    }
  ): number;

  // Check if penalties would modify logits
  static hasPenalties(params: {
    repeat?: number;
    frequency?: number;
    presence?: number;
  }): boolean;
}

Sliding Window: Maintains last N tokens with O(1) operations
Frequency Map: Tracks token counts for efficient penalty application
Sparse Iteration: Only processes tokens in history (H ≈ 10-50, not V = 65,536)
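The sliding-window-plus-frequency-map behavior can be sketched with a minimal stand-in (illustrative only; TokenHistoryTracker has a larger API surface):

```typescript
// Minimal sketch of a sliding-window frequency tracker: the last N tokens
// are kept, and a count map is updated as tokens enter and leave the window.
class MiniHistory {
  private window: number[] = [];
  private counts = new Map<number, number>();

  constructor(private lastN: number) {}

  accept(token: number): void {
    this.window.push(token);
    this.counts.set(token, (this.counts.get(token) ?? 0) + 1);
    if (this.window.length > this.lastN) {
      const evicted = this.window.shift()!;
      const c = this.counts.get(evicted)! - 1;
      if (c === 0) this.counts.delete(evicted);
      else this.counts.set(evicted, c);
    }
  }

  getCount(token: number): number {
    return this.counts.get(token) ?? 0;
  }
}

const h = new MiniHistory(2);
h.accept(5); h.accept(5); h.accept(7); // first 5 evicted by the window
console.log(h.getCount(5), h.getCount(7)); // → 1 1
```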

Penalty Formulas (llama.cpp exact)

Implementation Reference: penalties.ts lines 135-157 (computeAdjustment method)

Repetition Penalty (multiplicative, sign-dependent):

if (logit <= 0) {
  logit *= penalty_repeat; // Multiply for negative/zero
} else {
  logit /= penalty_repeat; // Divide for positive
}

Source: llama.cpp llama_sampler_penalties_apply (src/llama-sampling.cpp:1768-1772)

Rationale: The sign-dependent formula fixes a bug in the original paper, where dividing negative logits would incorrectly increase their probability. llama.cpp's corrected implementation multiplies negative logits by the penalty and divides positive ones, which preserves relative ordering and always decreases the probability of repeated tokens.

Frequency Penalty (additive, count-scaled):

logit -= count * penalty_freq;

Source: llama.cpp line 1774 (first term)

Linear penalty proportional to occurrence count. Common in OpenAI/HuggingFace APIs.

Presence Penalty (additive, binary):

logit -= (count > 0 ? 1 : 0) * penalty_present;

Source: llama.cpp line 1774 (second term)

Binary penalty - same reduction whether token appeared once or many times.

Application Order: repeat → frequency → presence (matches llama.cpp line 1764)
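Putting the three formulas and their order together, a sketch of the per-token adjustment (mirroring the formulas above, not reproducing the library's computeAdjustment):

```typescript
// Sketch of the combined penalty adjustment for one token, given its
// occurrence count in the history window. Order: repeat → frequency → presence.
function penalize(
  logit: number,
  count: number,
  repeat = 1.1,
  freq = 0.0,
  present = 0.0
): number {
  if (count > 0) {
    logit = logit <= 0 ? logit * repeat : logit / repeat; // repetition (sign-dependent)
    logit -= count * freq; // frequency (count-scaled)
    logit -= present; // presence (binary)
  }
  return logit;
}

console.log(penalize(2.0, 1, 2.0)); // → 1   (positive logit divided)
console.log(penalize(-1.0, 1, 2.0)); // → -2 (negative logit multiplied)
```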

Formula Semantics Comparison:

| Library               | Repetition Penalty              | Frequency/Presence |
| --------------------- | ------------------------------- | ------------------ |
| llama.cpp (this impl) | Multiplicative, sign-dependent  | Additive           |
| OpenAI API            | Not supported                   | Additive (same)    |
| HuggingFace           | Purely divisive (no sign check) | Additive           |

Key Difference: HuggingFace's repetition_penalty always divides, which incorrectly increases probability for negative logits. llama.cpp (and this implementation) fixes this with sign-dependent logic.

Low-Level Penalty Functions

// Apply all penalties in correct order
function applyPenalties(
  logits: Float32Array,
  tokenHistory: TokenHistoryTracker,
  params: {
    repeat?: number;
    frequency?: number;
    presence?: number;
  }
): void;

// Individual penalty functions
function applyRepetitionPenalty(logits, tokenHistory, penalty): void;
function applyFrequencyPenalty(logits, tokenHistory, penalty): void;
function applyPresencePenalty(logits, tokenHistory, penalty): void;

Note: sampleWithStrategy() applies penalties virtually (zero-copy) via accessor function. Low-level functions modify logits in-place.


Deterministic Sampling

Same seed = identical token sequence (critical for reproducibility).

Instance PRNG (Recommended)

import {
  Xoroshiro128Plus,
  sampleWithStrategy,
  TokenHistoryTracker,
} from '@lloyal-labs/tsampler';

// Create PRNG instance (once per completion)
const seed = samplingParams?.seed ?? Date.now();
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);

// Use in sampling
const tokenId = sampleWithStrategy(logits, {
  tokenHistory,
  params: { temperature: 0.8 },
  prng, // Instance PRNG (no global state)
});

// Generate random numbers directly (advanced use)
const rand = prng.next(); // Returns [0, 1)

Algorithm: Xoroshiro128+ (fast, high-quality, passes BigCrush tests)

Benefits:

  • No global state: Multiple samplers can run concurrently without collision
  • Explicit lifecycle: PRNG lifetime tied to sampler instance
  • Reproducible: Same seed produces identical sequences

Seed Defaults:

  • Provided: Use exact seed value
  • Omitted: Date.now() (millisecond precision)

Deterministic Replay: Snapshots preserve samplingParams (including seed) for exact replay.

Legacy Global PRNG (Deprecated)

import { initializePRNG, random, resetPRNG } from '@lloyal-labs/tsampler';

// ⚠️ DEPRECATED: Use instance PRNG instead
initializePRNG(seed);
const rand = random();

Note: Global PRNG still works for backward compatibility but is deprecated. Use instance PRNG for new code.


Advanced Filters

For custom sampling strategies, use filters directly:

import {
  applyTopK,
  applyTypicalP,
  applyTopP,
  applyMinP,
  applyTopNSigma,
  applyTemperature,
  sampleFromSet,
  CandidateSet,
  SamplerWorkspace,
} from './sampling';

Filter Functions

All filters operate on CandidateSet (zero-copy wrapper around workspace buffers).

// CandidateSet: Zero-copy wrapper
class CandidateSet {
  indices: Uint32Array; // Token IDs
  logits: Float32Array; // Logit values
  probs: Float32Array; // Probabilities (softmax)
  count: number; // Active candidates [0, count)
}

// Top-K filter
function applyTopK(
  logits: Float32Array,
  k: number,
  ws: SamplerWorkspace,
  penaltyAccessor?: (tokenId: number, baseLogit: number) => number
): CandidateSet;

// Typical-P filter
function applyTypicalP(
  set: CandidateSet,
  p: number,
  minKeep: number = 1
): CandidateSet;

// Top-P (nucleus) filter
function applyTopP(
  set: CandidateSet,
  p: number,
  workspace?: SamplerWorkspace
): CandidateSet;

// Min-P filter
function applyMinP(set: CandidateSet, threshold: number): CandidateSet;

// Top-N-Sigma filter
function applyTopNSigma(set: CandidateSet, n: number): CandidateSet;

// Temperature scaling
function applyTemperature(set: CandidateSet, temp: number): CandidateSet;

// Sample from final set
function sampleFromSet(set: CandidateSet): number;

Custom Filter Chain Example

const workspace = new SamplerWorkspace(256);

// Build custom chain
let candidates = applyTopK(logits, 40, workspace);
candidates = applyTopP(candidates, 0.95);
candidates = applyMinP(candidates, 0.05);
candidates = applyTemperature(candidates, 0.8); // ALWAYS LAST

// Sample from final distribution
const tokenId = sampleFromSet(candidates);

CRITICAL: Temperature MUST be applied AFTER all structural filters.


Usage Patterns

Deterministic Generation

Fixed seed produces identical token sequences (perfect reproducibility).

import {
  sampleWithStrategy,
  TokenHistoryTracker,
  Xoroshiro128Plus,
} from '@lloyal-labs/tsampler';

// Fixed seed for deterministic generation
const seed = 42;
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);

// Generate tokens - same seed = same sequence
for (let i = 0; i < maxTokens; i++) {
  const logits = native.getLogits();

  const tokenId = sampleWithStrategy(new Float32Array(logits), {
    tokenHistory,
    params: {
      temperature: 0.8,
      topK: 40,
      topP: 0.95,
    },
    prng,
  });

  tokenHistory.accept(tokenId);
  native.decode([tokenId]);
}

Use cases:

  • Testing and debugging
  • Reproducible benchmarks
  • Deterministic replay of conversations
  • A/B testing with controlled randomness

Creative Generation

High temperature + nucleus sampling for diverse, creative outputs.

const tokenId = sampleWithStrategy(logits, {
  tokenHistory,
  params: {
    temperature: 1.2, // Higher randomness
    topP: 0.95, // Nucleus sampling (adaptive)
    topK: 0, // No hard limit (let topP decide)
    minP: 0.02, // Filter very unlikely tokens
    penaltyRepeat: 1.15, // Discourage repetition
  },
  prng,
});

Parameters:

  • High temperature (1.0-1.5): More randomness
  • topP (0.9-0.95): Adapts to distribution shape
  • Low topK or 0: Let nucleus sampling control diversity
  • Moderate penaltyRepeat: Avoid repetitive patterns

Use cases:

  • Creative writing
  • Brainstorming
  • Dialogue generation
  • Story continuation

Factual Generation

Low temperature + greedy for deterministic, factual outputs.

const tokenId = sampleWithStrategy(logits, {
  tokenHistory,
  params: {
    temperature: 0.1, // Near-deterministic
    topK: 1, // Greedy (triggers fast-path)
    penaltyRepeat: 1.0, // No repetition penalty
    penaltyFreq: 0.0,
    penaltyPresent: 0.0,
  },
  prng,
});

Or explicitly use greedy:

import { greedy } from '@lloyal-labs/tsampler';

const tokenId = greedy(new Float32Array(native.getLogits()));

Parameters:

  • Very low temperature (< 0.2): Deterministic
  • topK = 1: Greedy fast-path
  • No penalties: Pure argmax selection

Use cases:

  • Question answering
  • Summarization
  • Code generation
  • Translation

Grammar-Constrained Generation

Integrate with GBNF grammar constraints (applied before sampling).

import {
  sampleWithStrategy,
  TokenHistoryTracker,
  Xoroshiro128Plus,
} from '@lloyal-labs/tsampler';

// Initialize grammar
native.initGrammar('root ::= "hello" | "goodbye"');
native.resetGrammar();

const seed = samplingParams?.seed ?? Date.now();
const prng = new Xoroshiro128Plus(seed);
const tokenHistory = new TokenHistoryTracker(64);

// Sampling loop
for (let i = 0; i < maxTokens; i++) {
  // Get logits
  const logitsBuffer = native.getLogits();
  const logits = new Float32Array(logitsBuffer);

  // Apply grammar constraints (modifies logits in-place)
  native.applyGrammar(logitsBuffer);

  // Sample from grammar-constrained distribution
  const tokenId = sampleWithStrategy(logits, {
    tokenHistory,
    params: {
      temperature: 0.8,
      topK: 40,
      topP: 0.95,
    },
    prng,
  });

  // Accept token for grammar state + history
  native.acceptToken(tokenId);
  tokenHistory.accept(tokenId);

  // Decode
  native.decode([tokenId]);
}

// Cleanup
native.freeGrammar();

Integration points:

  1. applyGrammar(): Masks invalid tokens (sets logits to -∞)
  2. sampleWithStrategy(): Samples from grammar-valid tokens only
  3. acceptToken(): Updates grammar state for next token

Use cases:

  • JSON generation
  • Code generation (language syntax)
  • Structured data extraction
  • Constrained creative writing

Custom Sampling Strategies

Build domain-specific sampling with fine-grained control.

import {
  applyTopK,
  applyTopP,
  applyMinP,
  applyTemperature,
  sampleFromSet,
  SamplerWorkspace,
  TokenHistoryTracker,
} from './sampling';

// Custom strategy: Aggressive filtering for concise responses
function sampleConcise(
  logits: Float32Array,
  tokenHistory: TokenHistoryTracker,
  temperature: number
): number {
  const workspace = new SamplerWorkspace(128); // Smaller workspace

  // Build custom chain
  let candidates = applyTopK(logits, 20, workspace); // Fewer candidates
  candidates = applyMinP(candidates, 0.1); // Aggressive min-p
  candidates = applyTopP(candidates, 0.8); // Tighter nucleus
  candidates = applyTemperature(candidates, temperature);

  return sampleFromSet(candidates);
}

// Usage
const tokenId = sampleConcise(
  new Float32Array(native.getLogits()),
  tokenHistory,
  0.7
);

Advanced patterns:

  • Domain-specific filter combinations
  • Custom probability transformations
  • Multi-stage filtering
  • Adaptive sampling based on context

Test-Time Alignment (TTA)

Test-Time Alignment is the fusion of app-state with sampling strategy to steer model outputs at every token step. Unlike traditional fixed-parameter sampling, TTA allows you to:

  1. Modify logits based on application state (constraints, domain knowledge)
  2. Adapt sampling parameters based on distribution health or uncertainty
  3. Switch strategies mid-generation without reinitialization

Architecture: App-State × Sampler Fusion

import {
  sampleWithStrategy,
  computeModelEntropy,
  computeModelSurprisal,
  requiredKcap,
  RollingPerplexity,
} from '@lloyal-labs/tsampler';

const ppl = new RollingPerplexity();

while (generating) {
  const logits = new Float32Array(native.getLogits());

  // 1. App-State → Logit Steering
  applyDomainConstraints(logits, appState);

  // 2. Distribution Analysis → Strategy Selection
  const entropy = computeModelEntropy(logits);
  const params = selectStrategy(entropy, appState);

  // 3. Dynamic Capacity Management
  workspace.ensureCapacity(
    requiredKcap(params.topK, params.typicalP, logits.length)
  );

  // 4. Sample with adapted strategy
  const token = sampleWithStrategy(logits, {
    tokenHistory,
    params,
    workspace,
    prng,
  });

  // 5. Update app state + quality tracking
  updateAppState(token, appState);

  const surprisal = computeModelSurprisal(logits, token);
  ppl.addSurprisal(surprisal);

  // Optional: KV eviction gate
  if (ppl.ppl() > 50) {
    console.warn('High perplexity - consider cache pruning or retrieval');
  }
}

Dynamic Workspace Capacity

The workspace automatically grows to accommodate changing parameters:

import { SamplerWorkspace, requiredKcap } from '@lloyal-labs/tsampler';

const workspace = new SamplerWorkspace(256); // Initial capacity

// Token 1: Focused sampling (topK=40)
const token1 = sampleWithStrategy(logits1, {
  tokenHistory,
  params: { topK: 40, temperature: 0.8 },
  workspace, // kcap=256 (sufficient)
  prng,
});

// Token 2: Uncertainty spike → widen search (topK=320)
const token2 = sampleWithStrategy(logits2, {
  tokenHistory,
  params: { topK: 320, temperature: 1.2 },
  workspace, // Auto-grows to kcap=512 (power-of-two)
  prng,
});

// Token 3: Enable typical-P (needs larger pool)
const token3 = sampleWithStrategy(logits3, {
  tokenHistory,
  params: { topK: 40, typicalP: 0.9, temperature: 0.6 },
  workspace, // Stays at kcap=512 (typical-P requires ≥512)
  prng,
});

Growth Strategy:

  • Power-of-two sizing: Grows 40→64→128→256→512 (~5 allocations max per session)
  • Monotonic growth: Never downsizes (avoids churn in bursty workloads)
  • Zero reallocations: After initial growth, zero allocations per token
  • Version tracking: Guards against stale CandidateSet references after growth

Capacity Calculation Helper

import {
  requiredKcap,
  DEFAULT_KCAP,
  TYPICAL_P_KCAP,
} from '@lloyal-labs/tsampler';

// Calculate required capacity for given params
const needed = requiredKcap(
  params.topK, // undefined or 0 = use default
  params.typicalP, // < 1.0 triggers TYPICAL_P_KCAP (512)
  vocabSize // Upper bound (no point allocating beyond V)
);

workspace.ensureCapacity(needed);

Strategy:

  • typicalP < 1.0 → Need ≥512 for stable entropy calculation
  • Else → max(topK, DEFAULT_KCAP) for nucleus/min-p sampling
  • Clamps to vocabSize (no over-allocation)

Example 1: Adaptive Sampling Based on Entropy

import {
  sampleWithStrategy,
  computeModelEntropy,
  TokenHistoryTracker,
  Xoroshiro128Plus,
  SamplerWorkspace,
} from '@lloyal-labs/tsampler';

function selectStrategy(entropy: number) {
  if (entropy < 2.0) {
    // Collapsed distribution → widen search
    return {
      topK: 256,
      temperature: 1.5,
      topP: 0.98,
    };
  } else if (entropy > 5.0) {
    // Too flat → focus sampling
    return {
      topK: 20,
      temperature: 0.5,
      topP: 0.9,
    };
  } else {
    // Healthy distribution → standard sampling
    return {
      topK: 40,
      temperature: 0.8,
      topP: 0.95,
    };
  }
}

// Per-token adaptive sampling
const prng = new Xoroshiro128Plus(42);
const tokenHistory = new TokenHistoryTracker(64);
const workspace = new SamplerWorkspace(256);

while (generating) {
  const logits = new Float32Array(native.getLogits());

  // Compute model-level entropy (before filters)
  const entropy = computeModelEntropy(logits);
  const params = selectStrategy(entropy);

  const token = sampleWithStrategy(logits, {
    tokenHistory,
    params,
    workspace, // Grows as needed
    prng,
  });

  tokenHistory.accept(token);
  native.decode([token]);
}

Example 2: Constraint-Based Logit Steering

Domain-specific constraints via logit manipulation:

// Hard constraint: JPY doesn't use decimal subdivisions
if (parsedSoFar.currency === 'JPY' && currentField === 'amount') {
  logits[DECIMAL_TOKEN_ID] = -Infinity; // Veto
  DIGIT_TOKENS.forEach((id) => (logits[id] += 2.0)); // Boost integers
}

const token = sampleWithStrategy(logits, {
  tokenHistory,
  params: { temperature: 0.8, topK: 40 },
  workspace,
  prng,
});

Example 3: Domain Validation

Logit steering based on application state:

const currentMetric = detectCurrentMetric(accumulated);

if (currentMetric === 'glucose') {
  const { value, range } = reportData.glucose;
  const [min, max] = range;

  if (value > max) {
    // Elevated glucose → bias toward correct terminology
    ELEVATED_TOKENS.forEach((id) => (logits[id] += 10.0));
    NORMAL_TOKENS.forEach((id) => (logits[id] = -Infinity)); // Veto incorrect
  }
}

const token = sampleWithStrategy(logits, {
  tokenHistory,
  params: { temperature: 0.7, topK: 40 },
  workspace,
  prng,
});

Example 4: Exploratory Bursts

// Normal generation
let baseParams = { topK: 40, temperature: 0.8 };

// Detect uncertainty (e.g., all top-5 probs < 0.15)
if (isUncertain(logits)) {
  // Temporary exploration burst
  const exploratoryParams = {
    topK: 256, // Widen candidate pool
    temperature: 1.3, // Increase randomness
    typicalP: 0.9, // Filter atypical tokens
  };

  const token = sampleWithStrategy(logits, {
    tokenHistory,
    params: exploratoryParams,
    workspace, // Auto-grows to 512 for typical-P
    prng,
  });

  // Next token: back to focused sampling
  // workspace stays at 512 (no downsize churn)
} else {
  const token = sampleWithStrategy(logits, {
    tokenHistory,
    params: baseParams,
    workspace,
    prng,
  });
}

Example 5: Perplexity Monitoring & Quality Tracking

import {
  sampleWithStrategy,
  computeModelSurprisal,
  computeSamplingSurprisal,
  RollingPerplexity,
  TokenHistoryTracker,
  Xoroshiro128Plus,
  SamplerWorkspace,
} from '@lloyal-labs/tsampler';

const ppl = new RollingPerplexity();
const prng = new Xoroshiro128Plus(42);
const tokenHistory = new TokenHistoryTracker(64);
const workspace = new SamplerWorkspace(256);

// Per-token quality metrics
const qualityLog: Array<{
  token: number;
  modelSurprisal: number;
  samplingSurprisal: number;
  runningPpl: number;
}> = [];

while (generating) {
  const logits = new Float32Array(native.getLogits());

  // Sample token
  const token = sampleWithStrategy(logits, {
    tokenHistory,
    params: { topK: 40, temperature: 0.8 },
    workspace,
    prng,
  });

  // Track model-level surprisal (before filters)
  const modelSurprisal = computeModelSurprisal(logits, token);
  ppl.addSurprisal(modelSurprisal);

  // Track sampling-level surprisal (post-filter)
  // Note: For full tracking, you'd capture candidates from getTopCandidates
  // This example shows the API usage pattern

  qualityLog.push({
    token,
    modelSurprisal,
    samplingSurprisal: modelSurprisal, // Approximation when candidates not available
    runningPpl: ppl.ppl(),
  });

  // Quality gates
  if (modelSurprisal > 8.0) {
    console.warn(
      `High uncertainty token: surprisal=${modelSurprisal.toFixed(2)}`
    );
  }

  if (ppl.ppl() > 50) {
    console.warn(
      `Sequence perplexity high: ${ppl.ppl().toFixed(2)} - consider retrieval`
    );
    // Trigger: cache pruning, RAG retrieval, or context compression
  }

  tokenHistory.accept(token);
  native.decode([token]);
}

// Sequence-level metrics
console.log(`Final perplexity: ${ppl.ppl().toFixed(2)}`);
console.log(
  `Avg surprisal: ${(qualityLog.reduce((sum, m) => sum + m.modelSurprisal, 0) / qualityLog.length).toFixed(2)} nats`
);

// Identify high-uncertainty spans
const uncertainSpans = qualityLog
  .filter((m) => m.modelSurprisal > 6.0)
  .map((m, i) => ({ position: i, surprisal: m.modelSurprisal }));

console.log('High-uncertainty tokens:', uncertainSpans);

Use cases:

  • KV cache eviction gates: High perplexity → prune old context or fetch from retrieval
  • Quality monitoring: Track surprisal/perplexity for confidence estimates
  • Debugging: Identify uncertain spans for manual review
  • A/B testing: Compare perplexity across different prompts or models
  • Dashboard signals: Real-time uncertainty visualization

Performance Characteristics

Growth amortization:

  • First token with K=40: Allocates 64 capacity (~0.5KB)
  • Later token with K=320: Grows to 512 capacity (~4KB, one-time)
  • Subsequent tokens: Zero allocations (reuses buffers)

Typical session progression:

  • 40 → 64 → 128 → 256 → 512 (at most ~5 reallocations)
  • Total overhead: ~10-20ms across entire session
  • Per-token overhead after growth: <0.01ms (negligible)

Memory footprint:

  • K buffers: 512 × 12 bytes = 6KB (max)
  • Working logits: 65536 × 4 bytes = 262KB (lazy allocated, optional)
  • Total: <270KB worst-case

When to Use TTA

Use TTA when you need:

  • ✅ Domain-specific constraints (JSON schemas, business rules)
  • ✅ Runtime validation (medical thresholds, financial rules)
  • ✅ Adaptive sampling (entropy-based, uncertainty-triggered)
  • ✅ Multi-modal generation (switch strategies by content type)
  • ✅ Exploratory search (temporary parameter bursts)

Don't use TTA for:

  • ❌ Simple fixed-parameter generation (use native sampling)
  • ❌ Maximum performance critical paths (6-10% overhead)
  • ❌ Battery-constrained devices (every millisecond matters)

Metrics & Telemetry

Distribution metrics are orthogonal to sampling - they observe and measure without affecting token selection. Metrics enable runtime analytics and decision-making without interfering with the sampling process.

Key Principle: Observation Without Interference

Metrics compute surprisal/entropy from logits at two measurement levels:

  • Model metrics: Raw logits (before filters) - model's inherent belief
  • Sampling metrics: Post-filter logits (after top-k/p/temp) - actual sampled distribution

Use Cases:

  • KV cache eviction gates: High perplexity triggers retrieval or pruning
  • Quality monitoring: Track confidence estimates for generated sequences
  • Dashboard signals: Real-time uncertainty visualization
  • Analytics: Post-hoc analysis of generation quality

Example: Per-Step Metrics Collection

import {
  sampleWithStrategy,
  computeModelEntropy,
  computeModelSurprisal,
  RollingPerplexity,
  getTopCandidates,
} from '@lloyal-labs/tsampler';

const ppl = new RollingPerplexity();

while (generating) {
  const logits = new Float32Array(native.getLogits());

  // 1. Compute metrics BEFORE sampling (no side effects on sampling)
  const entropy = computeModelEntropy(logits);
  const topCandidates = getTopCandidates(logits, 5); // For UI visualization

  // 2. Sample token (metrics don't affect this)
  const token = sampleWithStrategy(logits, {
    tokenHistory,
    params: { temperature: 0.8, topK: 40 },
    workspace,
    prng,
  });

  // 3. Track quality metrics AFTER sampling
  const surprisal = computeModelSurprisal(logits, token);
  ppl.addSurprisal(surprisal);

  // 4. Analytics/eviction gates (metrics inform actions)
  if (ppl.ppl() > 50) {
    console.warn(
      'High perplexity detected - consider KV eviction or retrieval'
    );
  }

  if (entropy > 5.0) {
    console.log('High uncertainty - model is exploring multiple paths');
  }

  // Accept token and continue
  tokenHistory.accept(token);
}

Available Metrics:

  • computeModelSurprisal(logits, tokenId) - Surprisal of chosen token from model distribution
  • computeSamplingSurprisal(candidateLogits, candidateIds, tokenId) - Surprisal from filtered candidates
  • computeModelEntropy(logits) - Entropy of full model distribution
  • computeSamplingEntropy(candidateLogits) - Entropy of filtered candidates
  • RollingPerplexity - Track sequence-level quality via perplexity

All metrics use numerically stable log-sum-exp and support both nats (default) and bits output.
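
To make the stable computation concrete, here is a minimal sketch of the log-sum-exp approach (illustrative only, not the library's internal code):

```typescript
// Minimal sketch of numerically stable surprisal via log-sum-exp.
// Illustrative only -- computeModelSurprisal may differ internally.
function stableSurprisal(logits: Float32Array, tokenId: number): number {
  let max = -Infinity;
  for (let i = 0; i < logits.length; i++) {
    if (logits[i] > max) max = logits[i];
  }
  let sumExp = 0;
  for (let i = 0; i < logits.length; i++) {
    sumExp += Math.exp(logits[i] - max); // shift by max to avoid overflow
  }
  const logZ = max + Math.log(sumExp); // log of the softmax normalizer
  return logZ - logits[tokenId]; // -log p(tokenId), in nats
}

const nats = stableSurprisal(new Float32Array([2, 1, 0]), 0);
const bits = nats / Math.LN2; // nats -> bits conversion
```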


Invariants and Guarantees

Sampler Chain Order (FIXED)

INVARIANT: Chain order NEVER changes:

  1. Penalties (virtual, via accessor)
  2. Top-K
  3. Typical-P (if enabled)
  4. Top-P (if enabled)
  5. Min-P (if enabled)
  6. Top-N-Sigma (if enabled)
  7. Temperature (ALWAYS LAST structural operation)
  8. Sample

Rationale: Matches llama.cpp exactly (discussion #7590)


Temperature Timing (CRITICAL)

INVARIANT: Temperature applied AFTER all structural filters (top-k, nucleus, etc.)

Why this matters:

  • Structural filters operate on raw probabilities
  • Temperature BEFORE filters reduces their effectiveness
  • Example: topK=40 with early temperature might keep wrong 40 tokens

Guarantee: sampleWithStrategy() enforces correct ordering.
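
A small self-contained demonstration of why the ordering matters for mass-based filters: the same top-p cut-off keeps a different number of tokens when temperature is applied first (helper functions here are illustrative, not the library API).

```typescript
// Illustrative sketch (not library code): the same top-p cut-off keeps a
// different number of tokens when temperature is applied early.
function softmax(logits: number[]): number[] {
  const m = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((x) => x / sum);
}

function nucleusCount(logits: number[], p: number): number {
  const probs = softmax(logits).sort((a, b) => b - a); // descending
  let mass = 0;
  let kept = 0;
  for (const prob of probs) {
    mass += prob;
    kept++;
    if (mass >= p) break;
  }
  return kept;
}

const logits = [4, 3, 2, 1, 0];
const scaled = logits.map((x) => x / 1.5); // temperature 1.5 applied early

const keptCorrect = nucleusCount(logits, 0.9); // temperature after: 3 tokens
const keptWrong = nucleusCount(scaled, 0.9); // temperature before: 4 tokens
```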


Zero-Copy Logits (PERFORMANCE)

INVARIANT: Input logits array NEVER modified by sampleWithStrategy()

Implementation:

  • Penalties applied virtually via accessor function
  • Filters operate on workspace buffers
  • Original logits remain unchanged

Benefit: Allows logits reuse, inspection, debugging without defensive copies.


Greedy Fast-Path (OPTIMIZATION)

INVARIANT: temperature < 1e-3 or topK === 1 triggers greedy selection

Skips: All filters (top-k, nucleus, temperature, etc.)

Exception: Penalties still applied (correctness critical)

Implementation:

if (temp < 1e-3 || topK === 1) {
  if (hasPenalties) {
    // Greedy with penalties
    const set = applyTopK(logits, 1, workspace, penaltyAccessor);
    return set.indices[0];
  }
  // Plain greedy
  return greedy(logits);
}

Candidate Count Monotonicity

INVARIANT: Each filter narrows candidate set (count decreases or stays same)

Guarantee: count_after_filter ≤ count_before_filter

Exception: None. All filters strictly non-increasing.


Renormalization (CORRECTNESS)

INVARIANT: sampleFromSet() renormalizes probabilities before sampling

Why: Filtered sets may not sum to 1.0 (filters remove mass)

Formula:

const totalMass = sum(probs.subarray(0, count));
const rand = random() * totalMass;

Guarantee: Sampling is unbiased with respect to filtered distribution.
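
A sketch of the renormalized draw, assuming hypothetical parallel-array shapes (the library's sampleFromSet operates on its own candidate-set structure):

```typescript
// Sketch of renormalized sampling from a filtered candidate set.
// The parallel-array shapes here are hypothetical, for illustration.
function sampleRenormalized(
  probs: Float32Array,
  indices: Int32Array,
  count: number,
  random: () => number // uniform in [0, 1)
): number {
  let totalMass = 0;
  for (let i = 0; i < count; i++) totalMass += probs[i];
  // Scale the random draw instead of dividing every probability
  const r = random() * totalMass;
  let acc = 0;
  for (let i = 0; i < count; i++) {
    acc += probs[i];
    if (r < acc) return indices[i];
  }
  return indices[count - 1]; // guard against floating-point shortfall
}

const token = sampleRenormalized(
  new Float32Array([0.2, 0.2, 0.2]), // filtered mass sums to 0.6, not 1.0
  new Int32Array([10, 20, 30]),
  3,
  () => 0.5 // deterministic draw for the example
);
```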


Top-K Special Cases

INVARIANT: topK >= vocab_size means NO truncation (preserves token ID order)

Rationale: Matches llama.cpp behavior (user explicitly requested all tokens)

Implementation:

if (k >= V) {
  return candidateSetFromFullLogits(logits, ws, penaltyAccessor);
}

Use case: Verification mode, custom post-processing.


Top-P Special Cases

INVARIANT 1: topP >= 1.0 disables filter (no-op)

INVARIANT 2: topP <= 0.0 collapses to greedy by probability

Critical: Greedy by probability (NOT by index)

  • Must find argmax explicitly
  • Typical-P may have reordered candidates
  • Returning indices[0] would be WRONG
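
The required collapse can be sketched like this (parallel-array shapes are hypothetical):

```typescript
// Sketch: collapse to greedy by probability, not by position. After a
// reordering filter like typical-p, position 0 may not hold the argmax.
// Array shapes are hypothetical, for illustration.
function greedyByProbability(
  probs: Float32Array,
  ids: Int32Array,
  count: number
): number {
  let best = 0;
  for (let i = 1; i < count; i++) {
    if (probs[i] > probs[best]) best = i;
  }
  return ids[best]; // ids[0] would be wrong here
}

const picked = greedyByProbability(
  new Float32Array([0.1, 0.7, 0.2]), // argmax is at position 1, not 0
  new Int32Array([5, 9, 3]),
  3
);
```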

Typical-P Entropy Stability

INVARIANT: Typical-P requires larger candidate pool for stable entropy

Implementation:

  • Fast mode uses TYPICAL_P_KCAP = 512 (not DEFAULT_KCAP = 256)
  • Ensures entropy calculation over sufficient candidates

Rationale: Meister et al. 2022 - computing entropy over 512 candidates yields a more stable estimate.


Min-P Relative Threshold

INVARIANT: minProb = maxProb * threshold

Adapts to distribution:

  • High max prob → aggressive filtering
  • Low max prob → less filtering

Always keeps argmax: Token with max prob always passes filter.
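
The rule is small enough to sketch directly (illustrative, not the library's applyMinP):

```typescript
// Sketch of the relative min-p rule: keep tokens with prob >= maxProb * threshold.
function minPKeepMask(probs: number[], threshold: number): boolean[] {
  const maxProb = Math.max(...probs);
  const cutoff = maxProb * threshold; // cutoff scales with the peak
  return probs.map((p) => p >= cutoff); // the argmax always passes
}

// Peaked distribution: cutoff 0.08 removes the long tail
const peaked = minPKeepMask([0.8, 0.1, 0.06, 0.04], 0.1);
// Flat distribution: cutoff 0.03 keeps everything
const flat = minPKeepMask([0.3, 0.25, 0.25, 0.2], 0.1);
```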


Top-N-Sigma Semantics

INVARIANT: topNSigma <= 0 is NO-OP (NOT greedy)

Rationale: llama.cpp PR#13264 - negative values disable filter.

Edge case: All logits identical → std = 0 → collapse to single max token.
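
Sketched as a mask (illustrative, not the library's filter implementation):

```typescript
// Sketch of the top-n-sigma rule: keep tokens whose logit lies within
// n standard deviations of the max logit; n <= 0 disables the filter.
function topNSigmaMask(logits: number[], n: number): boolean[] {
  if (n <= 0) return logits.map(() => true); // disabled, NOT greedy
  const max = Math.max(...logits);
  const mean = logits.reduce((a, b) => a + b, 0) / logits.length;
  const variance =
    logits.reduce((a, b) => a + (b - mean) ** 2, 0) / logits.length;
  const std = Math.sqrt(variance);
  // If all logits are identical, std = 0 and only max-valued tokens pass
  return logits.map((l) => l >= max - n * std);
}

const mask = topNSigmaMask([4, 2, 0], 1); // std ~1.63, cutoff ~2.37
const disabled = topNSigmaMask([4, 2, 0], 0); // no-op: everything kept
```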


Penalty Application Order

INVARIANT: Penalties applied in order: repeat → frequency → presence

Matches: llama.cpp line 1774

Virtual application: sampleWithStrategy() applies penalties via accessor (zero-copy).
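
For a single token's logit, the three formulas compose like this (a sketch following the llama.cpp convention; the library applies the same math virtually through the accessor):

```typescript
// Sketch of the llama.cpp penalty formulas, applied in the fixed order
// repeat -> frequency -> presence. `count` is how many times the token
// appears in the recent history window; count === 0 leaves the logit alone.
function penalizeLogit(
  logit: number,
  count: number,
  repeatPenalty: number, // e.g. 1.1
  frequencyPenalty: number, // e.g. 0.2
  presencePenalty: number // e.g. 0.5
): number {
  if (count === 0) return logit;
  // 1. Repeat penalty: divide positive logits, multiply negative ones
  logit = logit > 0 ? logit / repeatPenalty : logit * repeatPenalty;
  // 2. Frequency penalty scales with occurrence count;
  // 3. Presence penalty is a flat cost for having appeared at all
  logit -= count * frequencyPenalty + presencePenalty;
  return logit;
}

const penalized = penalizeLogit(2.0, 3, 1.1, 0.2, 0.5);
```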


Workspace Reuse (ZERO-ALLOC)

INVARIANT: Preallocated workspace buffers reused per-token

Performance: Reduces allocations from ~100/token to ~0

Buffers:

  • idxK, valK, probK: For heap-based Top-K (size K)
  • tmpIdx, tmpLogits, tmpProbs: For re-sorting (size K)
  • workingLogits: For penalty application (size V, optional)

Guarantee: Zero allocations per token after initialization.


Performance Characteristics

TypeScript Sampling

| Operation     | Time (ms) | Notes                                          |
| ------------- | --------- | ---------------------------------------------- |
| Logits access | 0.05-0.07 | Zero-copy ArrayBuffer (measured in production) |
| Top-K (k=40)  | 1-2       | Heap-based selection, O(V log K)               |
| Nucleus       | 0.5-1     | O(K log K) sorting                             |
| Softmax       | 0.2-0.5   | V8 JIT optimized                               |
| Total         | 3-5       | Per token (average across engines)             |

TypeScript Performance by Engine

| Engine  | Time per token | Notes                      |
| ------- | -------------- | -------------------------- |
| Node/V8 | ~2-4ms         | Server-side, JIT optimized |
| Hermes  | ~4-8ms         | Android mid-tier devices   |
| JSC     | ~3-6ms         | iOS Safari/WebKit          |

Native Sampling (C++)

| Operation     | Time (ms) | Notes          |
| ------------- | --------- | -------------- |
| Logits access | 0.01      | Pointer access |
| Top-K (k=40)  | 0.1-0.3   | SIMD optimized |
| Total         | 0.1-0.3   | Per token      |

Context

  • Decode time: 40-500ms per token (model-dependent: small models ~40ms, large models ~500ms)
  • TS overhead: 3-5ms average (engine-specific: 2-8ms range)
  • Overhead %: ~6-11% of decode time for typical models
  • User perception: Imperceptible (<10ms latency addition)

Optimization Strategies

Fast Mode (default):

  • Pre-truncate to KCAP (256 or 512 for typical-p)
  • O(V log K) complexity
  • ~3-5ms per token

Parity Mode (verification only):

  • Start from full V candidates
  • O(V) complexity
  • ~10-15ms per token
  • Use only for llama.cpp verification tests

Workspace Reuse:

  • Preallocate buffers once
  • Reuse across tokens
  • Zero allocations per token

Sparse Penalties:

  • Iterate over history (H ≈ 10-50)
  • Not the full vocabulary (e.g., V = 65,536)
  • O(H) instead of O(V)
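
The bookkeeping behind this can be sketched with a plain Map (TokenHistoryTracker's real internals may differ):

```typescript
// Sketch of sparse penalty bookkeeping: count occurrences only for tokens
// actually in the history window, so penalties touch O(H) entries, not O(V).
const history = [5, 7, 5, 9, 5];
const counts = new Map<number, number>();
for (const tok of history) {
  counts.set(tok, (counts.get(tok) ?? 0) + 1);
}
// A penalty accessor only adjusts logits for tokens present in the map;
// the other V - counts.size logits pass through untouched.
const penalizedTokens = counts.size; // 3 distinct tokens, not the whole vocab
```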

Typical-P K-Cap Trade-Off

Problem: Typical-P requires larger candidate pool for stable entropy.

Solution:

  • DEFAULT_KCAP = 256 (for most filters)
  • TYPICAL_P_KCAP = 512 (when typical-p enabled)

Performance impact:

  • 512 vs 256: ~1-2ms additional (negligible vs decode time)
  • Benefit: Better entropy estimate → more effective filtering

Testing

Test Coverage

Comprehensive test coverage for all components:

npm test

Test suites:

  • golden.test.ts - llama.cpp parity tests
  • sampling.test.ts - Sampling strategies
  • penalties.test.ts - Penalty formulas and token history
  • prng.test.ts - Deterministic PRNG behavior
  • correctness.test.ts - P0 bug fixes (greedy with penalties, top-p p<=0)
  • metrics.test.ts - Entropy, surprisal, perplexity
  • workspace.test.ts - Dynamic capacity management
  • integration.test.ts - End-to-end workflows

Total: 181 tests across 8 test suites (vitest, ~400ms)

Golden Tests

Status: ✅ 20/20 passing

Coverage: All llama.cpp test cases from test-sampling.cpp

Validates:

  • Exact penalty formulas (repeat, frequency, presence)
  • Sampler chain order
  • Temperature timing
  • Token ID order preservation (when k >= V)

Correctness Tests

Status: ✅ 12/12 passing

Critical bugs fixed:

  1. Top-P p<=0 greedy: Must find argmax by probability (not return index 0)
  2. Greedy with penalties: Must apply penalties even in fast-path
  3. Heap sort bug: Fixed reverse loop that flipped descending to ascending

PRNG Tests

Validates:

  • Same seed → same sequence
  • resetPRNG() restores initial state
  • Xoroshiro128+ algorithm correctness
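
For reference, the Xoroshiro128+ step itself is small. Below is a BigInt sketch of the public-domain algorithm; the library's prng.ts wrapper and state initialization may differ:

```typescript
// Sketch of the Xoroshiro128+ next() step using BigInt 64-bit arithmetic.
// Public-domain algorithm (Blackman & Vigna); illustrative, not prng.ts itself.
const MASK64 = (1n << 64n) - 1n;

function rotl(x: bigint, k: bigint): bigint {
  return ((x << k) | (x >> (64n - k))) & MASK64;
}

function nextXoroshiro128plus(state: [bigint, bigint]): bigint {
  const s0 = state[0];
  let s1 = state[1];
  const result = (s0 + s1) & MASK64; // the "+" output function
  s1 ^= s0;
  state[0] = rotl(s0, 24n) ^ s1 ^ ((s1 << 16n) & MASK64);
  state[1] = rotl(s1, 37n);
  return result;
}

// Same seed -> same sequence, which is what the tests above verify
const a: [bigint, bigint] = [12n, 34n];
const b: [bigint, bigint] = [12n, 34n];
const sameFirstDraw = nextXoroshiro128plus(a) === nextXoroshiro128plus(b);
```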

Native Parity

Coverage: Partial

Key findings:

  • Same sampler chain order
  • Same default parameters
  • Same greedy fast-path logic
  • Minor differences: top-n-sigma (TS only), seed precision (ms vs s)

Interchangeability: Users can switch between 'typescript' and 'native' paths and get equivalent results (within numerical precision).


Directory Structure

sampling/
├── index.ts              # Public API exports
│
├── Core Sampling         # Individual sampling strategies
│   ├── greedy.ts         # Argmax selection (deterministic)
│   ├── softmax.ts        # Logits → probabilities
│   └── filters.ts        # All filters (topK, topP, minP, typicalP, topNSigma)
│
├── strategies.ts         # Combined strategy + penalty integration
│
├── Penalties             # Repetition control (exact llama.cpp formulas)
│   └── penalties.ts      # Repeat, frequency, presence penalties
│
├── Determinism
│   └── prng.ts           # Xoroshiro128+ PRNG for reproducibility
│
├── Native Fast Path
│   └── sample-native.ts  # Optional C++ passthrough
│
├── Utilities
│   └── utils.ts          # Shared helpers (temperature scaling, heap sort)
│
└── __tests__/            # Comprehensive test coverage
    ├── penalties.test.ts
    ├── sampling.test.ts
    ├── prng.test.ts
    └── correctness.test.ts

Implementation Coverage

Implemented (grounded in llama.cpp):

  • ✅ Greedy (argmax)
  • ✅ Top-K sampling
  • ✅ Top-P (nucleus) sampling
  • ✅ Min-P sampling
  • ✅ Typical-P (locally typical) sampling
  • ✅ Top-N-Sigma filtering
  • ✅ Temperature scaling
  • ✅ Repetition penalties (repeat, frequency, presence)
  • ✅ Grammar constraints (GBNF)
  • ✅ Deterministic PRNG (seed-based)

Not Yet Implemented (advanced/experimental):

  • ❌ Mirostat sampling (stateful, complex)
  • ❌ DRY (Don't Repeat Yourself) sampling
  • ❌ XTC sampler

Coverage: 10/13 sampling methods (77%) - all essential methods implemented


Migration from v1.x

Breaking Change: Standalone samplers (topK, topP, minP, etc.) removed.

Why?

Old standalone samplers had critical bug: temperature applied BEFORE structural filters.

New functional filters apply temperature AFTER all structural filters (correct llama.cpp parity).

Migration Guide

Before (v1.x):

import { topP } from './sampling';

const tokenId = topP(logits, 0.95, 0.8);

After (v2.x):

import {
  sampleWithStrategy,
  TokenHistoryTracker,
  initializePRNG,
} from './sampling';

// One-time setup per completion
const seed = samplingParams?.seed ?? Date.now();
initializePRNG(seed);
const tokenHistory = new TokenHistoryTracker(64);

// Per-token sampling
const tokenId = sampleWithStrategy(logits, tokenHistory, {
  topP: 0.95,
  temperature: 0.8,
});

tokenHistory.accept(tokenId);

Benefits:

  • ✅ Correct temperature timing (llama.cpp parity)
  • ✅ Penalty support (repeat, frequency, presence)
  • ✅ Filter composition (combine multiple filters)
  • ✅ Deterministic sampling (seed-based PRNG)
  • ✅ Grammar constraints (via native layer)

Advanced: Direct Filter Usage

If you need fine-grained control:

import {
  applyTopK,
  applyTopP,
  applyTemperature,
  sampleFromSet,
  SamplerWorkspace,
} from './sampling';

const workspace = new SamplerWorkspace(256);

let candidates = applyTopK(logits, 256, workspace);
candidates = applyTopP(candidates, 0.95);
candidates = applyTemperature(candidates, 0.8); // ALWAYS LAST
const tokenId = sampleFromSet(candidates);

Future Enhancements

  • Logit Lens visualization (probability distribution inspection)
  • Custom sampling strategies (domain-specific filters)
  • Mirostat sampling (if requested by users)
  • DRY sampling (if requested by users)
  • Beam search support
  • Constrained decoding (beyond GBNF)
  • Token probability logging for debugging
  • Sampling analytics and telemetry


License

Apache 2.0