webinfer v0.0.5
# WebInfer
High-performance LLM inference kernels for WebGPU. A browser-native implementation of FlashInfer APIs.
## Why WebInfer?
Running LLMs in the browser requires efficient GPU kernels. WebInfer brings FlashInfer's battle-tested attention mechanisms to WebGPU:
- Flash Attention: Online softmax with O(1) memory for long sequences
- Subgroup Optimizations: 2-4x faster reductions on supported hardware
- JIT Compilation: Runtime kernel generation with pipeline caching
- Dynamic Tile Sizes: Auto-tuned tile dimensions based on workload and hardware
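The online-softmax recurrence behind the Flash Attention bullet can be sketched in a few lines of plain TypeScript: instead of materializing all scores before normalizing, a running max and running denominator are updated per element, so memory stays constant in sequence length. This standalone function is illustrative, not part of the WebInfer API:

```typescript
// Online softmax: one pass with a running max `m` and running denominator
// `d`; rescaling by exp(m - mNew) keeps earlier terms correct whenever a
// new maximum appears. Numerically equal to ordinary softmax.
function onlineSoftmax(scores: number[]): number[] {
  let m = -Infinity; // running max seen so far
  let d = 0;         // running sum of exp(s - m)
  for (const s of scores) {
    const mNew = Math.max(m, s);
    d = d * Math.exp(m - mNew) + Math.exp(s - mNew);
    m = mNew;
  }
  // After the single pass, m and d equal the global max and denominator.
  return scores.map((s) => Math.exp(s - m) / d);
}
```

In the GPU kernel the same rescaling is applied per tile of the KV cache rather than per scalar, which is what allows long sequences without storing the full score matrix.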
## Installation

```bash
npm install webinfer
```

## Quick Start
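The Quick Start below calls `ctx.norm.rmsnorm`; as a plain-TypeScript reference for what that kernel computes (the default epsilon here is an assumption, not WebInfer's):

```typescript
// CPU reference for RMSNorm: y[i] = x[i] * w[i] / sqrt(mean(x^2) + eps).
// A stand-in to document the math; the real kernel runs the same
// computation on the GPU.
function rmsnormRef(
  input: Float32Array,
  weight: Float32Array,
  epsilon = 1e-5, // assumed default, for illustration only
): Float32Array {
  let sumSq = 0;
  for (const x of input) sumSq += x * x;
  const scale = 1 / Math.sqrt(sumSq / input.length + epsilon);
  return input.map((x, i) => x * scale * weight[i]);
}
```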
```ts
import * as webinfer from 'webinfer';

// Initialize context
const ctx = await webinfer.WebInferContext.create();

// Run RMSNorm
const output = ctx.norm.rmsnorm(input, weight, epsilon);

// Flash attention decode
const attnOutput = ctx.decode.single_decode_with_kv_cache(
  query,    // [num_heads, head_dim]
  kv_cache, // [seq_len, 2, num_kv_heads, head_dim]
);
```

## Browser Compatibility
WebInfer requires WebGPU support:
| Browser | Status |
|---------|--------|
| Chrome 113+ | Supported |
| Edge 113+ | Supported |
| Firefox Nightly | Behind flag |
| Safari 18+ | Supported |
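Even on a listed browser, the machine may lack a usable GPU, so runtime detection matters. Beyond testing that `navigator.gpu` exists, requesting an adapter also catches environments where the API is exposed but no adapter is available. A sketch, with the `navigator` parameter typed structurally so the function stands alone outside DOM typings:

```typescript
// Probe WebGPU availability. Checking `navigator.gpu` alone misses setups
// where the API object exists but no adapter can be created. `nav` is the
// browser's `navigator`; the structural type below is a stand-in for the
// DOM lib's typings.
async function hasWebGPU(nav: {
  gpu?: { requestAdapter(): Promise<object | null> };
}): Promise<boolean> {
  if (!nav.gpu) return false;  // API not exposed at all
  const adapter = await nav.gpu.requestAdapter();
  return adapter !== null;     // null means no usable GPU adapter
}
```

In a page you would call `await hasWebGPU(navigator)` before creating a WebInfer context.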
Check for WebGPU support:
```ts
if (!navigator.gpu) {
  console.error('WebGPU not supported');
}
```

## Development
```bash
# Run tests
bun test

# Build
bun run build
```

## Acknowledgments
- FlashInfer - The original CUDA implementation
- WebLLM - Browser LLM runtime
## Citation

If you use WebInfer in your research, please cite:

```bibtex
@software{webinfer2025,
  author = {Chiu, Guan-Ming},
  title = {WebInfer: High-Performance LLM Inference Kernels for WebGPU},
  year = {2025},
  url = {https://github.com/guan404ming/webinfer}
}
```

## License

Apache-2.0
