Inference Lab
High-performance LLM inference simulator for analyzing serving systems. Simulates GPU clusters serving LLM inference workloads with realistic performance modeling.
Features
- Accurate Performance Modeling: Models compute (FLOPS) and memory bandwidth constraints (see the roofline sketch after this list)
- Multiple Scheduling Policies: FCFS (first-come, first-served), Priority, SJF (shortest-job-first), and more
- Chunked Prefill: Simulates realistic request interleaving
- KV Cache Management: Models GPU memory and KV cache utilization
- Workload Generation: Supports Poisson, Gamma, and closed-loop patterns
- WebAssembly Support: Run simulations in the browser via WASM
- CLI Tool: Standalone binary for command-line usage
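The performance modeling in the first bullet can be read in roofline terms: a batch step costs roughly the larger of its compute time and its memory-transfer time. The sketch below is a minimal illustration of that idea, not the crate's internal API; the Hardware struct, step_time function, and the decode-step numbers are assumptions for this example.
// Roofline-style estimate: a step is bounded by whichever of compute
// or memory traffic takes longer. Illustrative only, not the crate's API.
struct Hardware {
    compute_flops: f64,    // peak FLOP/s, e.g. 2e15 as in the example config
    memory_bandwidth: f64, // bytes/s, e.g. 3.35e12
}

fn step_time(hw: &Hardware, flops_needed: f64, bytes_moved: f64) -> f64 {
    let compute_time = flops_needed / hw.compute_flops;
    let memory_time = bytes_moved / hw.memory_bandwidth;
    compute_time.max(memory_time)
}

fn main() {
    let h100 = Hardware { compute_flops: 2e15, memory_bandwidth: 3.35e12 };
    // A decode step that streams ~140 GB of weights/KV but needs relatively
    // few FLOPs is bandwidth-bound: 1.4e11 / 3.35e12 ≈ 0.042 s per step.
    println!("{:.4} s", step_time(&h100, 1e12, 1.4e11));
}
Under this view, prefill steps are typically compute-bound while decode steps are typically memory-bandwidth-bound, which is why TTFT and TPOT respond differently to batching.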
Installation
As a Rust Library
cargo add inference-lab
As an npm Package (WASM)
npm install @doublewordai/inference-lab
CLI Tool
cargo install inference-lab
Usage
CLI
# Run with default configuration
inference-lab --config configs/config.toml
# Example output shows TTFT, E2E latency, throughput, and utilization metrics
Rust Library
use inference_lab::simulation::Simulator;
use inference_lab::config::SimulationConfig;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = SimulationConfig::from_file("config.toml")?;
    let mut simulator = Simulator::new(config);
    let results = simulator.run();

    println!("Mean TTFT: {:.2}ms", results.ttft_mean * 1000.0);
    println!("P99 E2E: {:.2}ms", results.e2e_p99 * 1000.0);
    println!("Throughput: {:.1} tok/s", results.throughput);
    Ok(())
}
WebAssembly
import init, { run_simulation } from '@doublewordai/inference-lab';
await init();
const config = `
[hardware]
name = "H100"
compute_flops = 2e15
memory_bandwidth = 3.35e12
# ... rest of config
`;
const results = run_simulation(config);
console.log('TTFT P50:', results.ttft_p50);
console.log('Throughput:', results.throughput);
Configuration
Configuration files use TOML format and specify:
- Hardware: GPU specs (FLOPS, bandwidth, VRAM)
- Model: LLM architecture (parameters, layers, heads)
- Scheduler: Policies, max tokens, chunked prefill settings
- Workload: Request arrival patterns and distributions
Example configurations are in the configs/ directory:
- config.toml - Default H100 + Llama-3-70B setup
- test_blog.toml - Closed-loop benchmark (64 users)
- qwen3_30b_a3b.toml - Qwen model configuration
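For orientation, a skeleton of such a file is sketched below. Only the [hardware] keys repeat what the WebAssembly example above shows; the remaining table names and comments are assumptions about the schema, so treat the files under configs/ as the authoritative reference.
# Illustrative skeleton - only the [hardware] keys are taken from this README;
# see configs/config.toml for the actual key names.
[hardware]
name = "H100"
compute_flops = 2e15
memory_bandwidth = 3.35e12
# VRAM capacity is configured here as well

[model]
# LLM architecture: parameter count, layers, attention heads

[scheduler]
# policy (FCFS, Priority, SJF), max tokens per step, chunked prefill settings

[workload]
# arrival pattern (Poisson, Gamma, or closed-loop) and request length distributions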
Building
Native Binary
cargo build --release
./target/release/inference-lab --config configs/config.toml
WASM Package
npm run build
# Outputs to pkg/ directory
Publishing
# Publish to npm (requires authentication)
npm run build
npm publish --access public
# Publish Rust crate
cargo publish
Project Structure
inference-lab/
├── src/
│ ├── simulation/ # Core simulator logic
│ ├── scheduler/ # Scheduling policies (FCFS, Priority, SJF)
│ ├── compute/ # Performance calculations
│ ├── kv_cache/ # KV cache management
│ ├── request/ # Request generation and tracking
│ ├── metrics/ # Performance metrics collection
│ ├── config/ # Configuration structures
│ ├── lib.rs # Library root
│ ├── main.rs # CLI entry point
│ └── wasm.rs # WebAssembly bindings
├── configs/ # Example configurations
├── Cargo.toml # Rust package manifest
└── package.json # npm package manifest
Metrics
The simulator tracks:
- TTFT (Time to First Token): Prefill latency
- E2E (End-to-End): Total request latency
- TPOT (Time Per Output Token): Decode latency per token
- Throughput: Tokens generated per second
- Utilization: Compute and memory bandwidth usage
- KV Cache: Memory utilization over time
Results include percentiles (p50, p90, p95, p99) and means.
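Concretely, TTFT covers the prefill phase and TPOT is the average gap between consecutive output tokens, so for a single request E2E is roughly TTFT + TPOT * (output_tokens - 1). The sketch below works through that relationship and a simple nearest-rank percentile; it is illustrative only and not the simulator's metrics implementation.
// Illustrative only; not the simulator's metrics code.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    // Nearest-rank percentile over an ascending-sorted slice.
    let idx = ((p / 100.0) * (sorted.len() as f64 - 1.0)).round() as usize;
    sorted[idx]
}

fn main() {
    // A request with TTFT = 120 ms, 100 output tokens, and TPOT = 15 ms takes
    // roughly 0.120 + 99 * 0.015 = 1.605 s end to end.
    let (ttft, tpot, output_tokens) = (0.120_f64, 0.015_f64, 100u32);
    let e2e = ttft + tpot * (output_tokens as f64 - 1.0);
    println!("Estimated E2E: {:.3} s", e2e);

    // Percentiles are computed over per-request samples (made-up values here).
    let mut e2e_samples = vec![0.82, 0.95, 1.10, 1.34, 2.05];
    e2e_samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    println!("P99 E2E: {:.0} ms", percentile(&e2e_samples, 99.0) * 1000.0);
}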
License
MIT
Repository
https://github.com/doublewordai/inference-lab
