@llm-dev-ops/latency-lens
v0.1.1
High-precision LLM latency profiler powered by WebAssembly. Measure token throughput, Time to First Token (TTFT), inter-token latency, and cost metrics for OpenAI, Anthropic, and other LLM providers.
Features
- 🚀 High-precision timing - Sub-millisecond accuracy using WASM
- 📊 Comprehensive metrics - TTFT, inter-token latency, throughput, percentiles (p50, p90, p95, p99, p99.9)
- 💰 Cost tracking - Monitor spending across requests
- 🔧 Multi-provider - OpenAI, Anthropic, Google, and more
- 📈 Statistical analysis - HDR histograms for accurate percentile calculations
- 🔌 Easy integration - Simple API for Node.js and browsers
- 🛠️ CLI included - Test and explore metrics from the command line
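The timing model behind these metrics can be sketched in plain JavaScript: TTFT is the gap between request start and the first streamed token, and inter-token latency is the gap between consecutive tokens. `fakeStream` below is a hypothetical stand-in for a provider's token stream, not part of this package.

```javascript
// Sketch of what the profiler measures, using plain performance.now().
// `fakeStream` is a hypothetical stand-in for a provider's token stream.
async function* fakeStream(tokens) {
  for (const t of tokens) {
    await new Promise((r) => setTimeout(r, 5)); // simulated network delay
    yield t;
  }
}

async function measure(stream) {
  const start = performance.now();
  let firstTokenAt = null;
  let lastTokenAt = null;
  const gaps = [];
  for await (const _token of stream) {
    const now = performance.now();
    if (firstTokenAt === null) {
      firstTokenAt = now;           // Time to First Token
    } else {
      gaps.push(now - lastTokenAt); // inter-token latency
    }
    lastTokenAt = now;
  }
  return { ttftMs: firstTokenAt - start, interTokenMs: gaps };
}

const m = await measure(fakeStream(['a', 'b', 'c']));
```

The package does the same bookkeeping inside WASM, which is where the sub-millisecond precision claim comes from.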
Installation
As a library (recommended)
npm install @llm-dev-ops/latency-lens
As a global CLI tool
npm install -g @llm-dev-ops/latency-lens
CLI Usage
After installing globally, you can use the CLI:
# Show help
latency-lens help
# Show version
latency-lens version
# Run a test to see metrics in action
latency-lens test
CLI Commands
- latency-lens version - Display version information
- latency-lens test - Run a simulated metrics collection test
- latency-lens help - Show usage information
Programmatic Usage
Basic Example
import { LatencyCollector } from '@llm-dev-ops/latency-lens';
// Create collector with 60-second window
const collector = new LatencyCollector(60000);
// Start tracking a request
const requestId = collector.start_request('openai', 'gpt-4-turbo');
// Record first token received
collector.record_first_token(requestId);
// Record each subsequent token
collector.record_token(requestId);
collector.record_token(requestId);
// ... more tokens
// Complete the request
collector.complete_request(
requestId,
150, // input tokens
800, // output tokens
null, // thinking tokens (optional)
0.05 // cost in USD
);
// Get aggregated metrics
const metrics = collector.get_metrics();
console.log('TTFT P95:', metrics.ttft_distribution.p95_ms, 'ms');
console.log('Throughput:', metrics.throughput.tokens_per_second, 'tokens/sec');
Advanced Example with Multiple Providers
import { LatencyCollector } from '@llm-dev-ops/latency-lens';
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const collector = new LatencyCollector(30000);
async function trackOpenAIRequest(prompt) {
const reqId = collector.start_request('openai', 'gpt-4-turbo');
const stream = await openai.chat.completions.create({
model: 'gpt-4-turbo',
messages: [{ role: 'user', content: prompt }],
stream: true
});
let firstToken = true;
for await (const chunk of stream) {
if (firstToken) {
collector.record_first_token(reqId);
firstToken = false;
} else {
collector.record_token(reqId);
}
}
collector.complete_request(reqId, 100, 500, null, 0.025);
}
// Track multiple requests
await Promise.all([
trackOpenAIRequest('What is AI?'),
trackOpenAIRequest('Explain quantum computing'),
trackOpenAIRequest('Write a poem')
]);
// Analyze performance
const metrics = collector.get_metrics();
console.log('Performance Report:');
console.log('===================');
console.log(`Total requests: ${metrics.total_requests}`);
console.log(`Success rate: ${(metrics.success_rate * 100).toFixed(2)}%`);
console.log(`TTFT P50: ${metrics.ttft_distribution.p50_ms.toFixed(2)}ms`);
console.log(`TTFT P95: ${metrics.ttft_distribution.p95_ms.toFixed(2)}ms`);
console.log(`Total cost: $${metrics.total_cost_usd.toFixed(4)}`);
API Reference
LatencyCollector
Main class for collecting metrics.
Constructor
new LatencyCollector(window_ms: number)
window_ms - Time window in milliseconds for metrics aggregation
Methods
start_request(provider: string, model: string): string
Start tracking a new request. Returns a unique request ID.
record_first_token(request_id: string): void
Record when the first token is received (measures TTFT).
record_token(request_id: string): void
Record each subsequent token received.
complete_request(request_id: string, input_tokens: number, output_tokens: number, thinking_tokens: number | null, cost_usd: number): void
Mark the request as complete and record final metrics.
record_failure(request_id: string, error: string): void
Mark the request as failed.
get_metrics(): Metrics
Get aggregated metrics for all requests.
reset(): void
Clear all collected metrics.
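Combining the methods above, a typical streaming loop wraps token recording in try/catch and routes errors to record_failure. The collector below is a minimal stub with the same method names, used only so the sketch is self-contained; in real code it would be a LatencyCollector instance.

```javascript
// Stub with the same method shape as LatencyCollector, so this sketch
// runs without the package; only the failure path stores anything.
const collector = {
  failed: [],
  _n: 0,
  start_request(provider, model) { return `${provider}:${model}:${this._n++}`; },
  record_first_token(id) {},
  record_token(id) {},
  complete_request(id, inTok, outTok, thinkTok, costUsd) {},
  record_failure(id, error) { this.failed.push({ id, error }); },
};

async function trackRequest(runStream) {
  const reqId = collector.start_request('openai', 'gpt-4-turbo');
  try {
    await runStream(reqId); // would call record_first_token / record_token
    collector.complete_request(reqId, 100, 500, null, 0.025);
  } catch (err) {
    collector.record_failure(reqId, String(err)); // counted as a failed request
  }
}

await trackRequest(async () => { throw new Error('rate limited'); });
```

Failed requests lower success_rate in the metrics object but do not contribute latency samples.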
Metrics Object
{
session_id: string,
start_time: string,
end_time: string,
total_requests: number,
successful_requests: number,
failed_requests: number,
success_rate: number,
ttft_distribution: {
min_ms: number,
max_ms: number,
mean_ms: number,
p50_ms: number,
p90_ms: number,
p95_ms: number,
p99_ms: number,
p99_9_ms: number,
stddev_ms: number
},
inter_token_distribution: { /* same as ttft_distribution */ },
total_latency_distribution: { /* same as ttft_distribution */ },
throughput: {
tokens_per_second: number,
requests_per_second: number
},
total_input_tokens: number,
total_output_tokens: number,
total_thinking_tokens: number | null,
total_cost_usd: number | null,
avg_cost_per_request: number | null,
provider_breakdown: [string, number][],
model_breakdown: [string, number][]
}
Performance
Built with Rust and WebAssembly for maximum performance:
- Sub-millisecond timing precision using high-resolution timers
- HDR Histogram for accurate percentile calculations
- Zero-copy serialization for efficient data transfer
- Minimal overhead - Less than 5μs per measurement
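For intuition, the percentile fields in the metrics object correspond to nearest-rank percentiles over the recorded samples. The sketch below computes them naively over a raw array; the library instead buckets values in an HDR histogram to keep memory bounded at high request volumes.

```javascript
// Naive nearest-rank percentile; illustrative only (the package uses
// HDR histograms rather than sorting raw samples).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

const ttftSamples = [120, 95, 210, 130, 88, 300, 150, 110, 99, 175];
const p50 = percentile(ttftSamples, 50); // → 120
const p95 = percentile(ttftSamples, 95); // → 300
```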
Browser Support
Requires a modern browser or runtime with WebAssembly support:
- Chrome/Edge 57+
- Firefox 52+
- Safari 11+
- Node.js 16+
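Before loading the WASM bundle in a browser, a small feature check avoids a hard failure on very old engines (a generic pattern, not an API of this package):

```javascript
// Generic WebAssembly capability check; true on all runtimes listed above.
const hasWasm = typeof WebAssembly === 'object'
  && typeof WebAssembly.instantiate === 'function';
```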
License
Apache-2.0
