
@webgpu-fusion/core

v0.1.0

Published

Fused WebGPU compute kernels for transformer inference — 4,081x avg on Apple Silicon, 826x on phones (487 devices tested)

Downloads

66

Readme

@webgpu-fusion/core

Fused WebGPU compute kernels for transformer inference. 4,081x average on Apple Silicon, 826x on phones. 487 real-world devices tested.

One import. One dispatch. All tokens, all layers, all operations fused into a single GPU kernel.

Install

npm install @webgpu-fusion/core

Quick Start

import { FusedTransformer } from '@webgpu-fusion/core'

// Create a fused transformer (parallel mode, 64 GPU threads)
const model = await FusedTransformer.create({
  dModel: 128,
  nHeads: 2,
  nLayers: 4,
  maxSeqLen: 64,
})

// Benchmark with random weights
const stats = await model.benchmark({ runs: 10 })
console.log(`${stats.tok_per_sec.toFixed(0)} tok/s | ${stats.mean_ms.toFixed(1)} ms | 1 dispatch`)

// Or load real weights and run inference
const weights = new Float32Array(/* your model weights */)
model.loadWeightsWithDefaults(weights)

const embeddings = new Float32Array(64 * 128) // [seqLen * dModel]
const result = await model.forward(embeddings)
console.log(result.output) // Float32Array of hidden states
console.log(`${result.tok_per_sec.toFixed(0)} tok/s`)

// Clean up
model.destroy()

Why This Is Fast

Current browser inference engines dispatch separate GPU kernels for each operation:

Token 1: dispatch LN → dispatch Attn → dispatch LN → dispatch FFN  (4 round-trips)
Token 2: dispatch LN → dispatch Attn → dispatch LN → dispatch FFN  (4 round-trips)
... × 64 tokens × 4 layers = 1,024 GPU round-trips

This library fuses everything into 1 dispatch:

Single dispatch → all tokens × all layers × all ops in one kernel
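The round-trip arithmetic above can be sketched directly. This is an illustrative helper, not part of the library's API; `opsPerLayer` reflects the four operations shown in the diagram (LN, Attn, LN, FFN).

```typescript
// Dispatch counts for the unfused vs. fused strategies described above.
// Illustrative only -- not library code.
function unfusedDispatches(
  tokens: number,
  layers: number,
  opsPerLayer: number
): number {
  // One GPU round-trip per op, per layer, per token.
  return tokens * layers * opsPerLayer;
}

function fusedDispatches(): number {
  // Everything runs inside one kernel, so a single dispatch suffices.
  return 1;
}

console.log(unfusedDispatches(64, 4, 4)); // 1024, matching the example above
console.log(fusedDispatches()); // 1
```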

The parallel variant uses 64 GPU threads with shared memory to also parallelize the matrix multiplications within the fused kernel.

Benchmark Results

Paper (Apple M2 Pro)

| Config | Unfused | Parallel Fused | Speedup | vs PyTorch MPS |
|--------|---------|----------------|---------|----------------|
| D=32, L=1 | 265 ms | 4.0 ms | 66x | 161x |
| D=64, L=4 | 3,841 ms | 25.4 ms | 151x | 42x |
| D=128, L=4 | 14,568 ms | 59.9 ms | 243x | 18x |
| D=256, L=1 | 14,246 ms | 31.1 ms | 458x | 44x |
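As a sanity check, the Speedup column is simply the unfused-to-fused runtime ratio, rounded. A quick sketch (not library code):

```typescript
// Speedup = unfused runtime / fused runtime, rounded as in the table above.
const speedup = (unfusedMs: number, fusedMs: number): number =>
  Math.round(unfusedMs / fusedMs);

console.log(speedup(265, 4.0)); // 66  (D=32, L=1)
console.log(speedup(14246, 31.1)); // 458 (D=256, L=1)
```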

Real-World (487 devices, 8 GPU vendors)

| GPU Vendor | Avg Speedup | Peak Speedup |
|------------|-------------|--------------|
| Apple Silicon (M1/M2/M3) | 4,081x | 79,021x |
| Qualcomm Adreno (Android phones) | 826x | 13,541x |
| NVIDIA (Blackwell, Lovelace, Ampere) | 70x | 159x |
| ARM Mali | 55x | — |

Mobile: 15,000 tok/s average, 213,000 tok/s peak. Speedups are higher on mobile because per-dispatch overhead is worse there, so kernel fusion benefits those devices the most.

Run it on your device: gpubench.dev/transformer

API

FusedTransformer.create(options)

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| dModel | number | required | Hidden dimension |
| nHeads | number | required | Attention heads |
| nLayers | number | required | Transformer layers |
| ffnMultiplier | number | 4 | FFN expansion (dModel * ffnMultiplier) |
| maxSeqLen | number | 256 | Maximum sequence length |
| mode | 'parallel' \| 'single-thread' | 'parallel' | Kernel mode |
| precision | 'f32' \| 'f16' | 'f32' | Compute precision |
| workgroupSize | number | 64 | Threads per workgroup (parallel mode) |

model.forward(embeddings, seqLen?)

Returns InferenceResult with output (Float32Array), elapsed_ms, tokens, tok_per_sec.
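The result shape might look like the sketch below. The field names come from this README; the interface declaration itself is an assumption about how they fit together.

```typescript
// Sketch of the InferenceResult shape described above. The README names
// these fields; the exact type declaration here is an assumption.
interface InferenceResult {
  output: Float32Array; // hidden states, [seqLen * dModel]
  elapsed_ms: number; // wall-clock time for the dispatch
  tokens: number; // number of tokens processed
  tok_per_sec: number; // throughput derived from the two fields above
}

// The throughput field is derivable from tokens and elapsed time:
const tokPerSec = (tokens: number, elapsedMs: number): number =>
  tokens / (elapsedMs / 1000);

console.log(tokPerSec(64, 8)); // 8000
```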

model.benchmark({ warmup?, runs?, seqLen? })

Returns BenchmarkStats with mean_ms, std_ms, median_ms, tok_per_sec, etc.
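Those statistics can be derived from raw per-run timings roughly as follows (a sketch under assumed definitions: population standard deviation, even-length medians averaged; the field names match this README, the math is standard):

```typescript
// Sketch of how mean_ms / std_ms / median_ms / tok_per_sec could be
// computed from per-run timings. Not the library's implementation.
function benchmarkStats(runMs: number[], tokens: number) {
  const mean_ms = runMs.reduce((a, b) => a + b, 0) / runMs.length;
  const variance =
    runMs.reduce((a, b) => a + (b - mean_ms) ** 2, 0) / runMs.length;
  const std_ms = Math.sqrt(variance);
  const sorted = [...runMs].sort((a, b) => a - b);
  const mid = sorted.length >> 1;
  const median_ms =
    sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return { mean_ms, std_ms, median_ms, tok_per_sec: tokens / (mean_ms / 1000) };
}

const s = benchmarkStats([10, 12, 14], 64);
console.log(s.mean_ms, s.median_ms); // 12 12
```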

model.loadWeights(weights) / model.loadWeightsWithDefaults(weights)

Load a Float32Array of model weights. The WithDefaults variant additionally initializes the LayerNorm parameters (gamma = 1, beta = 0) automatically.
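The LayerNorm defaults mentioned above amount to an identity scale and zero shift per layer. A sketch of what filling those defaults could look like; the library's actual weight layout is not documented here, so the buffer shapes below are hypothetical:

```typescript
// Illustrative only: the library's real weight layout is undocumented in
// this README. This shows the LayerNorm defaults it describes
// (gamma = 1, beta = 0) applied to hypothetical per-layer buffers.
function defaultLayerNormParams(dModel: number): {
  gamma: Float32Array;
  beta: Float32Array;
} {
  const gamma = new Float32Array(dModel).fill(1); // identity scale
  const beta = new Float32Array(dModel); // zero shift
  return { gamma, beta };
}

const { gamma, beta } = defaultLayerNormParams(128);
console.log(gamma[0], beta[0]); // 1 0
```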

getGPU()

Returns { device: GPUDevice, info: GPUInfo } with GPU capabilities.

Requirements

  • Chrome 123+, Firefox 139+, Safari 18+ (WebGPU support)
  • For f16: browser must support shader-f16 feature
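A feature check along these lines can run before creating a model. Everything used here (`navigator.gpu`, `requestAdapter()`, `GPUAdapter.features`) is standard WebGPU API, not part of this library:

```typescript
// Standard WebGPU feature detection; nothing here is specific to
// @webgpu-fusion/core.
const hasShaderF16 = (features: { has(name: string): boolean }): boolean =>
  features.has('shader-f16');

async function detectWebGPU(): Promise<{ webgpu: boolean; f16: boolean }> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return { webgpu: false, f16: false };
  const adapter = await gpu.requestAdapter();
  if (!adapter) return { webgpu: false, f16: false };
  return { webgpu: true, f16: hasShaderF16(adapter.features) };
}
```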

Research

Based on two preprints.

License

MIT