@llm-dev-ops/latency-lens
v0.1.1
High-precision LLM latency profiler powered by WebAssembly. Measure token throughput, Time to First Token (TTFT), inter-token latency, and cost metrics for OpenAI, Anthropic, and other LLM providers.
Features
- 🚀 High-precision timing - Sub-millisecond accuracy using WASM
- 📊 Comprehensive metrics - TTFT, inter-token latency, throughput, percentiles (p50, p90, p95, p99, p99.9)
- 💰 Cost tracking - Monitor spending across requests
- 🔧 Multi-provider - OpenAI, Anthropic, Google, and more
- 📈 Statistical analysis - HDR histograms for accurate percentile calculations
- 🔌 Easy integration - Simple API for Node.js and browsers
- 🛠️ CLI included - Test and explore metrics from the command line
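The timing model behind these metrics can be sketched in plain JavaScript: TTFT is the gap between request start and the first streamed token, and inter-token latency is the gap between consecutive tokens. `fakeStream` below is a hypothetical stand-in for a provider's token stream, not part of this package.

```javascript
// Sketch of what the profiler measures, using plain performance.now().
// `fakeStream` is a hypothetical stand-in for a provider's token stream.
async function* fakeStream(tokens) {
  for (const t of tokens) {
    await new Promise((r) => setTimeout(r, 5)); // simulated network delay
    yield t;
  }
}

async function measure(stream) {
  const start = performance.now();
  let firstTokenAt = null;
  let lastTokenAt = null;
  const gaps = [];
  for await (const _token of stream) {
    const now = performance.now();
    if (firstTokenAt === null) {
      firstTokenAt = now;           // Time to First Token
    } else {
      gaps.push(now - lastTokenAt); // inter-token latency
    }
    lastTokenAt = now;
  }
  return { ttftMs: firstTokenAt - start, interTokenMs: gaps };
}

const m = await measure(fakeStream(['a', 'b', 'c']));
```

The package does the same bookkeeping inside WASM, which is where the sub-millisecond precision claim comes from.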
Installation
As a library (recommended)
npm install @llm-dev-ops/latency-lens
As a global CLI tool
npm install -g @llm-dev-ops/latency-lens
CLI Usage
After installing globally, you can use the CLI:
# Show help
latency-lens help
# Show version
latency-lens version
# Run a test to see metrics in action
latency-lens test
CLI Commands
- latency-lens version - Display version information
- latency-lens test - Run a simulated metrics collection test
- latency-lens help - Show usage information
Programmatic Usage
Basic Example
import { LatencyCollector } from '@llm-dev-ops/latency-lens';
// Create collector with 60-second window
const collector = new LatencyCollector(60000);
// Start tracking a request
const requestId = collector.start_request('openai', 'gpt-4-turbo');
// Record first token received
collector.record_first_token(requestId);
// Record each subsequent token
collector.record_token(requestId);
collector.record_token(requestId);
// ... more tokens
// Complete the request
collector.complete_request(
requestId,
150, // input tokens
800, // output tokens
null, // thinking tokens (optional)
0.05 // cost in USD
);
// Get aggregated metrics
const metrics = collector.get_metrics();
console.log('TTFT P95:', metrics.ttft_distribution.p95_ms, 'ms');
console.log('Throughput:', metrics.throughput.tokens_per_second, 'tokens/sec');
Advanced Example with Multiple Providers
import { LatencyCollector } from '@llm-dev-ops/latency-lens';
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const collector = new LatencyCollector(30000);
async function trackOpenAIRequest(prompt) {
const reqId = collector.start_request('openai', 'gpt-4-turbo');
const stream = await openai.chat.completions.create({
model: 'gpt-4-turbo',
messages: [{ role: 'user', content: prompt }],
stream: true
});
let firstToken = true;
for await (const chunk of stream) {
if (firstToken) {
collector.record_first_token(reqId);
firstToken = false;
} else {
collector.record_token(reqId);
}
}
collector.complete_request(reqId, 100, 500, null, 0.025);
}
// Track multiple requests
await Promise.all([
trackOpenAIRequest('What is AI?'),
trackOpenAIRequest('Explain quantum computing'),
trackOpenAIRequest('Write a poem')
]);
// Analyze performance
const metrics = collector.get_metrics();
console.log('Performance Report:');
console.log('===================');
console.log(`Total requests: ${metrics.total_requests}`);
console.log(`Success rate: ${(metrics.success_rate * 100).toFixed(2)}%`);
console.log(`TTFT P50: ${metrics.ttft_distribution.p50_ms.toFixed(2)}ms`);
console.log(`TTFT P95: ${metrics.ttft_distribution.p95_ms.toFixed(2)}ms`);
console.log(`Total cost: $${metrics.total_cost_usd.toFixed(4)}`);
API Reference
LatencyCollector
Main class for collecting metrics.
Constructor
new LatencyCollector(window_ms: number)
window_ms - Time window in milliseconds for metrics aggregation
Methods
start_request(provider: string, model: string): string
Start tracking a new request. Returns a unique request ID.
record_first_token(request_id: string): void
Record when the first token is received (measures TTFT).
record_token(request_id: string): void
Record each subsequent token received.
complete_request(request_id: string, input_tokens: number, output_tokens: number, thinking_tokens: number | null, cost_usd: number): void
Mark the request as complete and record final metrics.
record_failure(request_id: string, error: string): void
Mark the request as failed.
get_metrics(): Metrics
Get aggregated metrics for all requests.
reset(): void
Clear all collected metrics.
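Combining the methods above, a typical streaming loop wraps token recording in try/catch and routes errors to record_failure. The collector below is a minimal stub with the same method names, used only so the sketch is self-contained; in real code it would be a LatencyCollector instance.

```javascript
// Stub with the same method shape as LatencyCollector, so this sketch
// runs without the package; only the failure path stores anything.
const collector = {
  failed: [],
  _n: 0,
  start_request(provider, model) { return `${provider}:${model}:${this._n++}`; },
  record_first_token(id) {},
  record_token(id) {},
  complete_request(id, inTok, outTok, thinkTok, costUsd) {},
  record_failure(id, error) { this.failed.push({ id, error }); },
};

async function trackRequest(runStream) {
  const reqId = collector.start_request('openai', 'gpt-4-turbo');
  try {
    await runStream(reqId); // would call record_first_token / record_token
    collector.complete_request(reqId, 100, 500, null, 0.025);
  } catch (err) {
    collector.record_failure(reqId, String(err)); // counted as a failed request
  }
}

await trackRequest(async () => { throw new Error('rate limited'); });
```

Failed requests lower success_rate in the metrics object but do not contribute latency samples.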
Metrics Object
{
session_id: string,
start_time: string,
end_time: string,
total_requests: number,
successful_requests: number,
failed_requests: number,
success_rate: number,
ttft_distribution: {
min_ms: number,
max_ms: number,
mean_ms: number,
p50_ms: number,
p90_ms: number,
p95_ms: number,
p99_ms: number,
p99_9_ms: number,
stddev_ms: number
},
inter_token_distribution: { /* same as ttft_distribution */ },
total_latency_distribution: { /* same as ttft_distribution */ },
throughput: {
tokens_per_second: number,
requests_per_second: number
},
total_input_tokens: number,
total_output_tokens: number,
total_thinking_tokens: number | null,
total_cost_usd: number | null,
avg_cost_per_request: number | null,
provider_breakdown: [string, number][],
model_breakdown: [string, number][]
}
Performance
Built with Rust and WebAssembly for maximum performance:
- Sub-millisecond timing precision using high-resolution timers
- HDR Histogram for accurate percentile calculations
- Zero-copy serialization for efficient data transfer
- Minimal overhead - Less than 5μs per measurement
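For intuition, the percentile fields in the metrics object correspond to nearest-rank percentiles over the recorded samples. The sketch below computes them naively over a raw array; the library instead buckets values in an HDR histogram to keep memory bounded at high request volumes.

```javascript
// Naive nearest-rank percentile; illustrative only (the package uses
// HDR histograms rather than sorting raw samples).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

const ttftSamples = [120, 95, 210, 130, 88, 300, 150, 110, 99, 175];
const p50 = percentile(ttftSamples, 50); // → 120
const p95 = percentile(ttftSamples, 95); // → 300
```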
Browser Support
Requires a modern browser or runtime with WebAssembly support:
- Chrome/Edge 57+
- Firefox 52+
- Safari 11+
- Node.js 16+
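Before loading the WASM bundle in a browser, a small feature check avoids a hard failure on very old engines (a generic pattern, not an API of this package):

```javascript
// Generic WebAssembly capability check; true on all runtimes listed above.
const hasWasm = typeof WebAssembly === 'object'
  && typeof WebAssembly.instantiate === 'function';
```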
License
Apache-2.0
