defuss-embeddings

v0.1.1

Isomorphic text embeddings with Harrier ONNX int8 plus exact and TurboQuant-style vector search.

Isomorphic text embeddings for Node.js and Chrome/Web using:

  • tss-deposium/harrier-oss-v1-270m-onnx-int8
  • @huggingface/transformers
  • stock int8 ONNX weights for inference
  • exact SIMD-friendly vector search utilities
  • optional multicore worker-backed brute-force exact search
  • a pure TypeScript TurboQuant-style vector search path for normalized embeddings

What this package does

  • Encoder: uses the existing Harrier ONNX int8 export through the default model.onnx runtime path
  • Client runtime: browser-first entry point with remote model loading + browser cache
  • Server runtime: Node-first entry point with optional filesystem cache/local model loading
  • Vector search: single-thread exact top-k, multicore exact top-k, and a separate TurboQuant-style compressed index

Package layout

This mirrors the split style used by defuss-rpc:

  • defuss-embeddings/client.js
  • defuss-embeddings/server.js
  • defuss-embeddings/vector-search.js
  • defuss-embeddings/turboquant.js

Install

bun add defuss-embeddings

The package does not download model files during install. The first loadModel(...), embed(...), embedQuery(...), or embedDocuments(...) call fetches the required files on demand and warms a persistent cache before the model pipeline is created.

  • Browsers store model files in defuss-db on top of IndexedDB and mirror them into the Cache API for Transformers.js lookups.
  • Node.js stores model files in the filesystem cache directory from cacheDir, or by default in the OS temp directory under defuss-embeddings/.
  • If you want to warm the cache before the first real request, call prefetchModel(...) at application startup.
  • After the first successful download, repeated runs reuse the same cache and should avoid downloading the model files again.

For runtime loading, prefer loadModel(urlOrRepoId).

  • Pass a Hugging Face repo id like tss-deposium/harrier-oss-v1-270m-onnx-int8 to use the default Hub layout.
  • Pass a base URL like https://cdn.example.com/models/harrier when the folder already contains config.json, tokenizer.json, tokenizer_config.json, and the onnx/ files.
  • For mirrored custom URLs whose file layout differs from the built-in manifests, pass requiredFiles so cache warming knows exactly which assets to fetch.
  • A raw single Buffer or Blob is not enough on its own because Harrier is a multi-file model bundle, not just one .onnx file.
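
To make the two source styles above concrete, here is a minimal sketch of how a repo id or a base URL could be expanded into per-file download URLs. The helper name resolveModelFileUrls and the DEFAULT_FILES list are hypothetical and not part of the package API; the file names come from this README, and the Hub pattern https://huggingface.co/<repo>/resolve/main/<file> is the usual Hugging Face download layout. The package's real manifests may differ.

```typescript
// Hypothetical default file list, based on the files named in this README.
const DEFAULT_FILES: readonly string[] = [
  "config.json",
  "tokenizer.json",
  "tokenizer_config.json",
  "onnx/model.onnx",
];

// Hypothetical helper (not a defuss-embeddings export): expands a Hub repo id
// or a plain base URL into one URL per required file.
function resolveModelFileUrls(
  urlOrRepoId: string,
  requiredFiles: readonly string[] = DEFAULT_FILES,
): string[] {
  const isUrl = /^https?:\/\//i.test(urlOrRepoId);
  const base = isUrl
    ? urlOrRepoId.replace(/\/+$/, "") // drop trailing slashes from a base URL
    : `https://huggingface.co/${urlOrRepoId}/resolve/main`; // default Hub layout
  return requiredFiles.map((file) => `${base}/${file}`);
}
```

Passing a custom requiredFiles array covers the mirrored-layout case, where the folder does not match the default file list.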

Browser usage

import { createEmbeddingClient } from "defuss-embeddings/client.js";
import { buildTurboQuantIndex, searchTurboQuantIndexRerank } from "defuss-embeddings/turboquant.js";

const embedder = createEmbeddingClient({
  model: "tss-deposium/harrier-oss-v1-270m-onnx-int8",
  device: "webgpu", // or "wasm"
});

await embedder.loadModel("https://cdn.example.com/models/harrier");

const corpusTexts = [
  "machine learning systems",
  "cats on sofas",
  "vector search in browsers",
];

const corpusEmbeddings = await embedder.embed(corpusTexts);
const index = buildTurboQuantIndex(corpusEmbeddings, { seed: 1234 });

const queryEmbedding = await embedder.embedQuery("browser vector retrieval");
const { rerankedTopK } = searchTurboQuantIndexRerank(
  index,
  corpusEmbeddings,
  queryEmbedding,
  100, // candidate pool size taken from the compressed index
  10, // final top-k kept after the exact rerank
);
console.log(rerankedTopK);

Server usage

import { createEmbeddingServer } from "defuss-embeddings/server.js";
import {
  attachRecords,
  searchTopK,
  searchTopKMulticore,
} from "defuss-embeddings/vector-search.js";
import { buildTurboQuantIndex, searchTurboQuantIndexRerank } from "defuss-embeddings/turboquant.js";

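const embedder = createEmbeddingServer({
  model: "tss-deposium/harrier-oss-v1-270m-onnx-int8",
  // Assumption: "fp32" selects the unsuffixed onnx/model.onnx file,
  // which for this repo holds the int8 export (see Notes).
  dtype: "fp32",
  cacheDir: ".cache/defuss-embeddings",
});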

await embedder.loadModel("tss-deposium/harrier-oss-v1-270m-onnx-int8");

const docs = ["alpha", "beta", "gamma"];
const docVectors = await embedder.embedDocuments(docs);
const query = await embedder.embedQuery("beta");
const exactHits = attachRecords(searchTopK(docVectors, query, 2), docs);
const exactHitsMulticore = attachRecords(
  await searchTopKMulticore(docVectors, query, 2, { threshold: 4096 }),
  docs,
);

const turboIndex = buildTurboQuantIndex(docVectors, { seed: 99 });
const { rerankedTopK } = searchTurboQuantIndexRerank(turboIndex, docVectors, query, 100, 10);

console.log(exactHits);
console.log(exactHitsMulticore);
console.log(attachRecords(rerankedTopK, docs));

Exact search strategies

  • searchTopK(...) is the synchronous exact-search baseline.
  • searchTopKMulticore(...) is the async exact-search variant that uses defuss-multicore workers in both Node.js and browsers.
  • For small corpora, searchTopKMulticore(...) falls back to the single-thread path automatically.
  • Even for large corpora, worker overhead is measurable, so benchmark on your target hardware instead of assuming multicore is always faster.
  • On the current benchmark snapshot below, single-query brute-force search is still faster on one thread than via workers in both Node.js and Chromium because chunk cloning and worker dispatch dominate.
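
The following sketch shows what brute-force exact top-k means in practice, plus a chunked variant that mimics how a worker-backed search can split the corpus and merge partial results. It is an illustration only: bruteForceTopK, chunkedTopK, and the Float32Array[] corpus shape are assumptions, not the package's searchTopK or searchTopKMulticore implementations, and the chunks run sequentially here instead of in workers.

```typescript
// Hypothetical hit shape for illustration; not the package's type.
interface SearchHit {
  index: number;
  score: number;
}

// Dot product; equals cosine similarity when both vectors are L2-normalized.
function dot(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every vector, sort by descending score, keep the best k.
function bruteForceTopK(
  corpus: Float32Array[],
  query: Float32Array,
  k: number,
): SearchHit[] {
  return corpus
    .map((vec, index) => ({ index, score: dot(vec, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Chunked variant: each chunk stands in for one worker's share of the corpus.
// Every chunk returns its local top-k, and the merge keeps the global top-k.
function chunkedTopK(
  corpus: Float32Array[],
  query: Float32Array,
  k: number,
  chunkSize: number,
): SearchHit[] {
  const partial: SearchHit[] = [];
  for (let start = 0; start < corpus.length; start += chunkSize) {
    const chunk = corpus.slice(start, start + chunkSize);
    for (const hit of bruteForceTopK(chunk, query, k)) {
      // Shift chunk-local indices back to corpus-wide indices.
      partial.push({ index: hit.index + start, score: hit.score });
    }
  }
  return partial.sort((x, y) => y.score - x.score).slice(0, k);
}
```

Because each chunk keeps its own top-k before the merge, the chunked path returns the same ranking as the single pass. The extra copying and merging is the kind of overhead that can outweigh the parallelism for single queries.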

OpenAI-compatible endpoint usage

import { createEmbeddingServer } from "defuss-embeddings/server.js";

const embedder = createEmbeddingServer({
  model: "text-embedding-3-small",
  openAICompatible: {
    baseUrl: "https://api.openai.com/v1",
    apiKey: process.env.OPENAI_API_KEY,
  },
});

const docs = [
  "Create a Python virtual environment with python -m venv .venv.",
  "Use justify-content and align-items to center a div with flexbox.",
];

const docVectors = await embedder.embedDocuments(docs);

// For non-Harrier endpoint models, suppress the Harrier retrieval instruction.
const query = await embedder.embedQuery(
  "How do I create a Python virtual environment?",
  { instruction: "" },
);

console.log(docVectors, query);
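
For orientation, this is roughly what a request to an OpenAI-compatible /embeddings endpoint looks like on the wire. The sketch assumes the standard request body ({ model, input }) and response shape ({ data: [{ index, embedding }] }). buildEmbeddingsRequest and parseEmbeddingsResponse are hypothetical helpers and do not describe how the package issues its requests internally.

```typescript
// Hypothetical response shape, reduced to the fields used below.
interface EmbeddingsResponse {
  data: { index: number; embedding: number[] }[];
}

// Hypothetical helper: builds the URL and fetch options for POST {baseUrl}/embeddings.
function buildEmbeddingsRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  input: string[],
) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/embeddings`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, input }),
    },
  };
}

// Hypothetical helper: returns the vectors in input order, using each item's index.
function parseEmbeddingsResponse(json: EmbeddingsResponse): number[][] {
  return [...json.data]
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
}

// Usage sketch (performs network I/O, so it is left commented out):
// const req = buildEmbeddingsRequest(baseUrl, apiKey, "text-embedding-3-small", docs);
// const vectors = parseEmbeddingsResponse(await (await fetch(req.url, req.init)).json());
```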

Benchmark snapshot

These numbers come from local runs on 11 April 2026. They are only a snapshot and will vary with hardware, cache state, runtime version, and network conditions.

Quality takeaway

  • The fair comparison is exact candidate top-100 + exact rerank top-10 versus TurboQuant candidate top-100 + exact rerank top-10.
  • Exact direct top-10 and exact candidate top-100 + exact rerank top-10 matched on the synthetic and live runs, as expected when both stages score with the same exact similarity.
  • On the current benchmark snapshot, TurboQuant candidate top-100 + exact rerank top-10 matched exact candidate top-100 + exact rerank top-10 on all synthetic cases and on the current live Harrier cases.
  • That does not mean exact search stops being the quality upper bound. Exact still sees the whole corpus, and reranking cannot recover a relevant document that TurboQuant never surfaced in its candidate set.
  • In the current live Harrier run, the quality bottleneck was the embedding model's own ranking rather than TurboQuant candidate loss. One query landed at exact candidate rank 66 and TurboQuant candidate rank 54, so both exact rerank and TurboQuant rerank correctly kept it out of the top-10. Another query was absent from both the exact and TurboQuant top-100.
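
The point about candidate loss can be shown with a small sketch of the rerank stage. It assumes plain number[] vectors and a hypothetical rerankExact helper; it is not the package's rerankSearchHits. The candidate list is hard-coded to stand in for an approximate first stage that missed the best document.

```typescript
// Hypothetical hit shape for illustration.
interface Hit {
  index: number;
  score: number;
}

function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Re-score only the candidate indices with exact similarity and keep the best k.
function rerankExact(
  candidates: number[],
  corpus: number[][],
  query: number[],
  k: number,
): Hit[] {
  return candidates
    .map((index) => ({ index, score: dotProduct(corpus[index], query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Document 0 is the true best match for the query [1, 0].
const demoCorpus = [
  [1, 0],
  [0.9, 0.1],
  [0, 1],
  [0.5, 0.5],
];
// Stand-in for an approximate candidate stage that failed to surface document 0.
const demoCandidates = [3, 1];
const demoReranked = rerankExact(demoCandidates, demoCorpus, [1, 0], 2);
console.log(demoReranked.map((hit) => hit.index)); // [1, 3]
```

The rerank fixes the order within the candidate set (document 1 moves ahead of document 3), but document 0 stays missing because it never entered the candidate list.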

Synthetic large-haystack benchmark

Command: bun run test:needle

  • Corpus: 25,000 vectors
  • Dimensions: 640
  • Pipeline: exact direct top-10, exact candidate top-100 plus exact rerank top-10, multicore exact baselines, and TurboQuant candidate top-100 plus exact rerank top-10
  • Quality: 4/4 planted needles recovered by exact direct top-10, exact+rerank, multicore exact direct, multicore exact+rerank, TurboQuant candidate top-100, and TurboQuant+rerank
  • TurboQuant index build: 343.399 ms
  • Exact direct top-10 latency: avg 13.458 ms, median 14.040 ms
  • Exact candidate top-100 latency: avg 13.323 ms, median 13.410 ms
  • Exact rerank latency: avg 0.137 ms, median 0.115 ms
  • Exact candidate plus rerank total: avg 13.460 ms, median 13.516 ms
  • Multicore exact direct top-10 latency: avg 33.762 ms, median 34.518 ms
  • Multicore exact candidate top-100 latency: avg 36.976 ms, median 40.323 ms
  • Multicore exact candidate plus rerank total: avg 37.122 ms, median 40.468 ms
  • TurboQuant candidate top-100 latency: avg 17.968 ms, median 17.334 ms
  • TurboQuant rerank latency: avg 0.150 ms, median 0.157 ms
  • TurboQuant candidate plus rerank total: avg 18.118 ms, median 17.497 ms
  • Exact candidate throughput: avg 1,886,088 docs/s
  • Multicore exact candidate throughput: avg 695,886 docs/s
  • TurboQuant candidate throughput: avg 1,397,003 docs/s
  • Multicore exact candidate speed relative to single-thread exact candidate: avg 0.369x
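
As a quick consistency check on the figures above, the relative speed can be reproduced from the reported average throughputs, and the candidate plus rerank total from the two stage averages. This is only a sketch of the arithmetic: the benchmark script may compute its averages per run, and relativeSpeed is a hypothetical helper.

```typescript
// Average throughputs copied from the synthetic Node.js snapshot above.
const exactDocsPerSec = 1886088;
const multicoreDocsPerSec = 695886;

// Ratio below 1 means the variant is slower than the baseline.
function relativeSpeed(variantDocsPerSec: number, baselineDocsPerSec: number): number {
  return variantDocsPerSec / baselineDocsPerSec;
}

const multicoreRatio = relativeSpeed(multicoreDocsPerSec, exactDocsPerSec);
console.log(`${multicoreRatio.toFixed(3)}x`); // "0.369x"

// Average candidate latency plus average rerank latency, in milliseconds.
const exactTotalMs = 13.323 + 0.137;
console.log(exactTotalMs.toFixed(3)); // "13.460"
```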

Browser synthetic benchmark

Command: bun run test:needle:browser

  • Corpus: 8,000 vectors in Chromium
  • Dimensions: 384
  • Pipeline: exact direct top-10, exact candidate top-100 plus exact rerank top-10, multicore exact baselines, and TurboQuant candidate top-100 plus exact rerank top-10
  • Quality: 3/3 planted needles recovered by exact direct top-10, exact+rerank, multicore exact direct, multicore exact+rerank, TurboQuant candidate top-100, and TurboQuant+rerank
  • TurboQuant index build: 55.600 ms
  • Exact direct top-10 latency: avg 2.467 ms, median 2.500 ms
  • Exact candidate top-100 latency: avg 2.367 ms, median 2.400 ms
  • Exact candidate plus rerank total: avg 2.400 ms, median 2.400 ms
  • Multicore exact direct top-10 latency: avg 5.600 ms, median 5.500 ms
  • Multicore exact candidate top-100 latency: avg 5.400 ms, median 5.400 ms
  • Multicore exact candidate plus rerank total: avg 5.467 ms, median 5.400 ms
  • TurboQuant candidate top-100 latency: avg 2.433 ms, median 2.400 ms
  • TurboQuant candidate plus rerank total: avg 2.467 ms, median 2.500 ms
  • Exact candidate throughput: avg 3,381,643 docs/s
  • Multicore exact candidate throughput: avg 1,481,820 docs/s
  • TurboQuant candidate throughput: avg 3,296,172 docs/s
  • Multicore exact candidate speed relative to single-thread exact candidate: avg 0.438x
  • Use this benchmark to validate whether worker overhead pays off on your browser target, because the crossover point varies noticeably by machine, runtime, and corpus layout

Live Harrier benchmark

Command: bun run test:needle:live

  • Query format: Harrier instructed queries using Instruct: ... followed by Query: ...
  • Corpus: 512 natural-language passages per case
  • Cases: 4 natural search queries against 4 planted target passages
  • Exact direct top-10 hit count: 2/4
  • Exact candidate top-100 hit count: 3/4
  • Exact candidate plus rerank top-10 hit count: 2/4
  • TurboQuant candidate top-100 hit count: 3/4
  • TurboQuant candidate plus rerank top-10 hit count: 2/4
  • Case snapshot: protein-guideline exact rank 1 and TurboQuant rank 1; summit-definition exact candidate rank 66 and TurboQuant candidate rank 54, but both reranked top-10 results miss; css-center-div exact rank 1 and TurboQuant rank 1; python-virtualenv is absent from both exact and TurboQuant top-100 on this run
  • Query embedding latency: avg 30.365 ms, median 30.590 ms
  • Exact direct top-10 latency: avg 1.441 ms, median 1.122 ms
  • Exact candidate top-100 latency: avg 1.136 ms, median 1.493 ms
  • Exact rerank latency: avg 0.290 ms, median 0.387 ms
  • Exact candidate plus rerank total: avg 1.426 ms, median 1.881 ms
  • TurboQuant candidate top-100 latency: avg 1.957 ms, median 1.336 ms
  • TurboQuant rerank latency: avg 0.187 ms, median 0.152 ms
  • TurboQuant candidate plus rerank total: avg 2.144 ms, median 1.487 ms
  • End-to-end exact retrieval: avg 31.791 ms, median 31.143 ms
  • End-to-end TurboQuant retrieval: avg 32.509 ms, median 31.736 ms
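
The hit counts above follow directly from the per-case ranks in the case snapshot. The sketch below recomputes them; the Rank type and hitCount helper are hypothetical, and null marks a target that was absent from the top-100.

```typescript
// Rank of the planted passage per case; null means absent from the top-100.
type Rank = number | null;

// Ranks copied from the case snapshot above.
const exactRanks: Record<string, Rank> = {
  "protein-guideline": 1,
  "summit-definition": 66,
  "css-center-div": 1,
  "python-virtualenv": null,
};
const turboQuantRanks: Record<string, Rank> = {
  "protein-guideline": 1,
  "summit-definition": 54,
  "css-center-div": 1,
  "python-virtualenv": null,
};

// Counts the cases whose target passage ranks within the first k results.
function hitCount(ranks: Record<string, Rank>, k: number): number {
  return Object.values(ranks).filter((rank) => rank !== null && rank <= k).length;
}

console.log(hitCount(exactRanks, 10), hitCount(exactRanks, 100)); // 2 3
console.log(hitCount(turboQuantRanks, 10), hitCount(turboQuantRanks, 100)); // 2 3
```

Both pipelines land on 2/4 at top-10 and 3/4 at top-100, which matches the reported counts.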

Notes

  • The model runtime uses the model file layout resolved by Transformers.js / ONNX Runtime for the selected source.
  • The default model is tss-deposium/harrier-oss-v1-270m-onnx-int8, which ships a single onnx/model.onnx file containing the int8 export. Keep the default dtype unless you are targeting a different model layout.
  • Retrieval queries should be embedded with an instruction prefix. Use embedQuery() for that flow; documents should stay unprefixed.
  • Harrier uses last_token pooling and L2 normalization, so the runtime defaults follow the model card rather than falling back to mean pooling.
  • loadModel(urlOrRepoId) works with either a Hub repo id or a plain base URL. The shared core uses separate browser and Node cache adapters behind the existing client.js and server.js exports.
  • prefetchModel(...) is optional and only exists to warm the same cache earlier at runtime; loadModel(...) and the first embedding call will also warm the cache automatically.
  • Use inspectModelCache(...) and clearModelCache(...) if you want to inspect or purge cached assets explicitly.
  • The turboquant module here is only for vector search, not model-weight quantization.
  • For fair quality comparisons, compare searchTopK(..., 100) + rerankSearchHits(..., 10) against searchTurboQuantIndex(..., 100) + rerankSearchHits(..., 10). Comparing direct exact top-10 against TurboQuant+rerank mixes different pipeline stages.
  • Quick unit tests use injected mock extractors, while bun run test:live runs both a Node live suite and a browser live suite against the real Harrier model.
  • searchTopKMulticore(...) uses worker-backed brute-force search and returns the same ranked hits as searchTopK(...), but asynchronously.
  • openAICompatible switches the runtime from local ONNX inference to an arbitrary OpenAI-compatible /embeddings HTTP endpoint.
  • When openAICompatible is active, prefetchModel(...), inspectModelCache(...), and clearModelCache(...) are not applicable and will throw.
  • embedQuery() still uses Harrier-style Instruct: ...\nQuery: ... formatting by default. For non-Harrier endpoint models, pass { instruction: "" } or call embedOne() directly.
  • bun run test:needle runs a synthetic large-haystack benchmark that verifies planted needles stay retrievable while reporting exact direct, exact+rerank, multicore exact, and TurboQuant+rerank latency.
  • bun run test:needle:browser runs the same fair-comparison benchmark in Chromium.
  • bun run test:needle:live runs the real Harrier model against a generated natural-language text corpus and reports exact+rerank versus TurboQuant+rerank quality plus end-to-end latency.
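
As a rough illustration of the query formatting, pooling, and normalization notes above, the sketch below shows each step on plain arrays. All three function names are hypothetical, the instruction text in the usage line is a placeholder, and treating an empty instruction as "no prefix" is an assumption based on the { instruction: "" } note. The package's actual defaults may differ.

```typescript
// Harrier-style instructed query: "Instruct: ...\nQuery: ...".
// Assumption: an empty instruction means the query is sent without a prefix.
function formatInstructedQuery(instruction: string, query: string): string {
  return instruction ? `Instruct: ${instruction}\nQuery: ${query}` : query;
}

// Last-token pooling: take the hidden state of the last non-padding token.
// hiddenStates has one row per token; attentionMask holds 1 for real tokens.
function lastTokenPool(hiddenStates: number[][], attentionMask: number[]): number[] {
  const last = attentionMask.lastIndexOf(1);
  if (last < 0) throw new Error("attention mask contains no real tokens");
  return hiddenStates[last];
}

// L2 normalization: scale the vector to unit length so dot product equals cosine.
function l2Normalize(vec: number[]): number[] {
  const norm = Math.hypot(...vec);
  return norm === 0 ? vec.slice() : vec.map((value) => value / norm);
}

// Usage sketch with a placeholder instruction and toy hidden states.
const prompt = formatInstructedQuery("Retrieve relevant passages", "center a div");
const pooled = lastTokenPool([[1, 1], [3, 4], [9, 9]], [1, 1, 0]); // third token is padding
const unit = l2Normalize(pooled);
console.log(prompt, unit);
```

Documents skip the formatting step entirely and go straight to pooling and normalization, which is why only queries carry the instruction prefix.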

Scripts

bun run build
bun run test:node
bun run test:browser
bun run test:needle
bun run test:needle:browser
bun run test:needle:live
bun run test:live:node
bun run test:live:browser
bun run test:live
bun run test