@zlaabsi/turboquant-wasm
v0.1.1
TurboQuant vector quantization for the browser — compress embeddings 8x, search client-side
turboquant-wasm is a Rust/WebAssembly implementation of the TurboQuant MSE variant (Algorithm 1 from the paper). It is built for applications that already have embeddings and want local retrieval without shipping a vector database or a graph index.
Why this repo does not ship the QJL variant
The short version is that QJL works against the main design goal of turboquant-wasm: keep browser-side retrieval small and memory-efficient.
- QJL adds an extra projection matrix, which materially increases runtime memory pressure.
- In browser and WASM settings, that extra matrix becomes expensive quickly, especially once embedding dimensions get large.
- The MSE variant already gives strong recall at the bit-rates this repo actually targets in practice, especially at 3+ bits.
- For this project, the tradeoff was not worth it: more complexity and more memory, without fitting the core promise of a tiny browser-first package.
So the repo deliberately optimizes for the TurboQuant MSE path: smaller package, lower memory footprint, simpler runtime story.
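To put a rough number on the memory concern: a dense float32 projection matrix grows quadratically with embedding dimension. A back-of-envelope sketch, assuming a square d x d matrix as an illustrative worst case rather than QJL's exact projection shape:

```js
// Rough memory cost of carrying an extra dense projection matrix.
// Assumes a square d x d float32 matrix; QJL's actual projection shape
// may differ, so treat this as an order-of-magnitude illustration.
function projectionMatrixBytes(dim) {
  return dim * dim * 4; // 4 bytes per float32 entry
}

console.log(projectionMatrixBytes(384)); // 589,824 B (~0.56 MiB)
console.log(projectionMatrixBytes(768)); // 2,359,296 B (~2.25 MiB)
```

By the README's own ~196 B-per-vector figure, that single 768d matrix costs more memory than a compressed index of several thousand 384d vectors.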
At a glance
- Small web package. The current measured browser npm build is about 30.3 KiB gzip.
- Aggressive compression. With 4-bit quantization, a 384d vector takes about 196 B and a 768d vector about 388 B.
- Direct search on compressed vectors. No full decode step on every query.
- Portable packaging. Runs in browsers, Node.js, and WASM-friendly edge runtimes.
- Persistence built in. Save indexes with `save()` and restore them with `Index.load()`.
- Example-first repo. Includes browser, WebGPU, and Cloudflare demos.
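The per-vector size figures above follow from simple bit math. A sketch, assuming packed codes plus a hypothetical 4-byte per-vector header (the real storage layout may differ):

```js
// Back-of-envelope check of the "Aggressive compression" figures above.
// Assumes a hypothetical 4-byte per-vector header on top of the packed
// codes; the actual turboquant-wasm storage layout may differ.
function packedVectorBytes(dim, bits, headerBytes = 4) {
  return Math.ceil((dim * bits) / 8) + headerBytes;
}

console.log(packedVectorBytes(384, 4)); // 196 B for a 384d vector
console.log(packedVectorBytes(768, 4)); // 388 B for a 768d vector
```

Against 4-byte floats (1,536 B for 384d), that is the roughly 8x compression the package tagline claims.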
Bundle Size Analysis
Current turboquant-wasm bundle numbers below come from the latest measured snapshot in benchmarks/results/2026-04-09-m1-max-node22.json. That snapshot keeps the 2026-04-08 search measurements and refreshes the browser npm package size to the current pkg-bundler/ output. Alternative-library rows are maintained comparison estimates from benchmarks/wasm_analysis.md, not a fresh side-by-side rerun in this repo.
Current measured package
The npm browser entrypoint now ships the wasm-pack --target bundler output rather than the raw web loader. That keeps the published package free of a runtime fetch()-based Wasm bootstrap, which avoids the Socket alert on pkg/turboquant_wasm.js while still keeping the repo-local demos on the plain web target.
Comparison with alternative browser-side vector search libraries
turboquant-wasm is materially smaller than graph-based WASM alternatives. That matters most for edge deployments, mobile web, and embedded search widgets where bundle budget is tight.
Why it stays small
- No HNSW graph or graph-tuning machinery in the binary.
- No external native dependency stack, BLAS, or LAPACK.
- A small core: PRNG, orthogonalization, centroid tables, scalar quantization, packed storage, and compressed brute-force scan.
- Size-oriented WASM build settings, plus a design that matches the algorithm instead of wrapping a larger ANN engine.
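A minimal sketch of the scalar-quantization-plus-packed-storage idea from the list above, in plain JavaScript. This is illustrative only; the actual Rust implementation is more involved and also uses centroid tables:

```js
// Map each float in [lo, hi] to a 4-bit code and pack two codes per byte.
function quantize4bit(values, lo, hi) {
  const codes = new Uint8Array(Math.ceil(values.length / 2));
  const scale = 15 / (hi - lo);
  for (let i = 0; i < values.length; i++) {
    const clamped = Math.min(hi, Math.max(lo, values[i]));
    const code = Math.round((clamped - lo) * scale) & 0x0f; // 0..15
    if (i % 2 === 0) codes[i >> 1] = code;       // low nibble
    else codes[i >> 1] |= code << 4;             // high nibble
  }
  return codes;
}

// Reconstruct approximate floats from the packed 4-bit codes.
function dequantize4bit(codes, n, lo, hi) {
  const out = new Float32Array(n);
  const step = (hi - lo) / 15;
  for (let i = 0; i < n; i++) {
    const byte = codes[i >> 1];
    const code = i % 2 === 0 ? byte & 0x0f : byte >> 4;
    out[i] = lo + code * step;
  }
  return out;
}
```

With 16 levels the worst-case roundtrip error per component is half a quantization step, i.e. (hi - lo) / 30.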
Feature Comparison
This table keeps the product-level comparison from benchmarks/wasm_analysis.md, but refreshes the turboquant-wasm numbers to the current implementation.
Key Advantages Summary
Good fit
- Static-site search for docs, blogs, and catalogs
- Local-first semantic search in PWAs or desktop apps
- Client-side RAG where documents never leave the machine
- Browser extensions indexing tabs or notes locally
- Edge APIs with a prebuilt compressed index
Probably not the right tool
- Very large corpora where you want graph-based ANN over 100k+ vectors
- Workloads that need sub-millisecond latency at large N
- Benchmarks where you need a mature head-to-head comparison suite today
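The large-N caveat follows directly from the design: a compressed brute-force scan touches every stored vector on every query, so latency grows linearly with N. A minimal sketch of such a scan over 4-bit codes, assuming one global quantization range, which likely differs from the real index layout:

```js
// Brute-force top-k search directly over packed 4-bit codes: distances are
// computed from a small reconstruction table, never from full float vectors.
// Assumes one global [lo, hi] range shared by all vectors (a simplification).
function searchPacked(query, codes, nVectors, dim, lo, hi, k) {
  const step = (hi - lo) / 15;
  const recon = new Float32Array(16); // reconstructed value per code
  for (let c = 0; c < 16; c++) recon[c] = lo + c * step;

  const bytesPerVec = Math.ceil(dim / 2);
  const results = [];
  for (let v = 0; v < nVectors; v++) {
    let dist = 0;
    const base = v * bytesPerVec;
    for (let i = 0; i < dim; i++) {
      const byte = codes[base + (i >> 1)];
      const code = i % 2 === 0 ? byte & 0x0f : byte >> 4;
      const d = query[i] - recon[code];
      dist += d * d; // squared L2 distance
    }
    results.push({ id: v, dist });
  }
  results.sort((a, b) => a.dist - b.dist);
  return results.slice(0, k).map((r) => r.id);
}
```

Real implementations amortize the nibble decode and vectorize the inner loop, but the O(N * dim) shape is the same, which is why graph-based ANN wins at very large N.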
Install
```sh
npm install @zlaabsi/turboquant-wasm
```

For npm consumers, the browser entrypoint is packaged with the wasm-pack bundler target. The repo-local examples/ continue to use the raw web target in pkg/.
Quick start
Minimal usage
```js
import { createQuantizer } from "@zlaabsi/turboquant-wasm";

const dim = 384;
const bits = 4;

const quantizer = await createQuantizer({ dim, bits });
const index = quantizer.buildIndex(embeddings, nVectors);
const resultIds = index.search(queryEmbedding, 10);
```

Persist and reload
```js
import { createQuantizer, Index } from "@zlaabsi/turboquant-wasm";

const quantizer = await createQuantizer({ dim: 384, bits: 4 });
const index = quantizer.buildIndex(embeddings, nVectors);

const bytes = index.save();
const restored = Index.load(bytes, quantizer);
const resultIds = restored.search(queryEmbedding, 10);
```

Build from source
```sh
rustup target add wasm32-unknown-unknown
cargo install wasm-pack
git clone https://github.com/zlaabsi/turboquant-wasm.git
cd turboquant-wasm
npm run build
```

Use npm run build:node when you also want the Node.js target in pkg-node/.
Try the examples
```sh
npm run build
python3 -m http.server 8080
```

Then open:

- http://localhost:8080/examples/browser/
- http://localhost:8080/examples/transformers-js/
- http://localhost:8080/examples/onnx-webgpu/
Example matrix:
More detail: examples/README.md
Cookbook
Use these guides when you want an integration pattern instead of a toy demo:
Performance snapshot
Honest version: the implementation looks useful for moderate corpus sizes, but this repo still does not have a full benchmark suite across devices, browsers, public datasets, and competing libraries.
The table below is the current source of truth for measured TurboQuant behavior in this repo. The old March analysis mixed theory, estimates, and older implementation assumptions; benchmarks/wasm_analysis.md now explains explicitly why current measured search latency is higher than those early estimates.
Current evidence is a local snapshot on:
- Apple M1 Max
- Node v22.11.0
- npm 10.9.0
- Darwin 25.3.0 arm64
- synthetic clustered embeddings
That means the numbers below are directional evidence, not a universal SLA.
Current snapshot
Charts
Raw benchmark data
- Packaging note: the 2026-04-09 snapshot refreshes bundle-size fields for the current npm browser package, while the raw search log remains the 2026-04-08 run.
- Snapshot JSON: benchmarks/results/2026-04-09-m1-max-node22.json
- Raw console log: benchmarks/results/2026-04-08-m1-max-node22-realworld.txt
- Chart generator: benchmarks/render_charts.js
Comparative context
The charts above are about turboquant-wasm alone. The charts below add comparative context using the positioning tables in benchmarks/wasm_analysis.md.
Important caveat: these comparative plots are not a fresh controlled benchmark suite run side-by-side in this repo. The TurboQuant bars use the current measured package size and current packed storage model; the alternative-library bars come from the maintained comparison estimates in benchmarks/wasm_analysis.md. They are here for positioning and tradeoff discussion, not to pretend we already have airtight head-to-head numbers.
Reading guide: purple is the current measured turboquant-wasm result, gray bars are the comparison points documented in benchmarks/wasm_analysis.md, and the small labels under the gray bars show the relative overhead versus TurboQuant.
What is still missing
- repeated runs with variance reporting
- lower-variance harnesses for build and search sweeps
- browser benchmarks on low-end and mid-range hardware
- public real-world embedding corpora
- head-to-head comparisons against exact float32 search and graph-based ANN libraries
API and package notes
- Install from npm: @zlaabsi/turboquant-wasm
- Repository: github.com/zlaabsi/turboquant-wasm
- Primary workflow: create quantizer -> build or stream index -> save/load -> search
- Generated artifacts live in pkg/ and pkg-node/
Development
For local workflow, release process, and commit conventions, see CONTRIBUTING.md.
Common commands:
```sh
npm run build
npm run build:node
npm run test
npm run verify
npm run bench:realworld
npm run bench:charts
```

References
License
Apache-2.0
