# @ai4data/search
Framework-agnostic semantic search client for the AI for Data – Data for AI program. Combines HNSW approximate nearest-neighbour search, BM25 lexical search, and hybrid ranking — all running in a Web Worker. Published under the @ai4data npm organization.
> **Browser only:** requires a browser environment (Web Workers, `fetch`, Cache API).
## Installation

```bash
npm install @ai4data/search
# or
yarn add @ai4data/search
# or
pnpm add @ai4data/search
```

## Usage
```js
import { SearchClient } from '@ai4data/search'

const client = new SearchClient('https://example.com/data/your-collection/manifest.json')

client.on('index_ready', () => {
  client.search('climate finance', { topK: 10, mode: 'hybrid' })
})

client.on('results', ({ data, stats }) => {
  console.log(data)  // SearchResult[]
  console.log(stats) // SearchStats | null
})

// Clean up when done
client.destroy()
```

## Curated highlights (optional)
If `manifest.json` includes `index.highlights` (e.g. `"index/highlights.json"`), host that file next to your index. Use the same field shape as `titles.json` entries (`id`, `title`, `idno`, `type`, …). For a fixed display order, use a JSON array of objects; an object keyed by `id` (like `titles.json`) is also supported (enumeration order applies).
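For illustration, an array-form `highlights.json` could look like this (every value below is invented; use the ids and fields from your own collection):

```json
[
  { "id": "doc-001", "title": "Climate Finance Overview", "idno": "RPT-2024-001", "type": "document" },
  { "id": "doc-002", "title": "Data for AI Primer", "idno": "RPT-2024-002", "type": "document" }
]
```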
```js
client.on('highlights', ({ data }) => setSpotlightDocs(data))
client.on('index_ready', () => client.getHighlights(8))
```

## Loading from a CDN (any origin)
When you load the package from a CDN (e.g. via an import map) and your page is on another origin, the browser blocks creating a worker from the CDN URL. Use `SearchClient.fromCDN()` so the worker is fetched and run from a blob URL (same-origin):
```js
import { SearchClient } from '@ai4data/search'

const client = await SearchClient.fromCDN(manifestUrl)
client.on('results', ({ data }) => console.log(data))
```

`workerUrl` defaults to unpkg for the current package version (injected at build time), so you usually don't need to pass it. To pin a different version, pass `workerUrl: 'https://unpkg.com/@ai4data/[email protected]/dist/worker.mjs'`. If you pass an esm.sh worker URL, the client fetches the raw bundle from unpkg (esm.sh returns a wrapper that fails from a blob). This works from any static host with no build step.
## CSP and Hugging Face (embedding proxy)
The search worker loads the embedding model with `@xenova/transformers`, which fetches ONNX and tokenizer files from the Hugging Face Hub by default (`https://huggingface.co/...`). If your page's Content Security Policy blocks `connect-src` to huggingface.co or to LFS/CDN hosts, those requests will fail.

Recommended approach: serve a same-origin reverse proxy that forwards to the Hub, and point Transformers.js at it via `env.remoteHost`. This package exposes that as `transformersRemoteHost` (and optionally `transformersRemotePathTemplate` if your proxy does not mirror the Hub path layout).
- `modelId` stays a full Hub repo id (e.g. `avsolatorio/GIST-small-Embedding-v0`). It is not a raw directory URL; the library builds paths like `…/{model}/resolve/{revision}/tokenizer.json`.
- Pass an absolute base URL (or a path relative to the page, which `SearchClient` resolves against `location.href` so blob workers still target your app origin):
```js
const origin = typeof location !== 'undefined' ? location.origin : 'https://example.com'

const client = new SearchClient(manifestUrl, {
  transformersRemoteHost: `${origin}/api/hf-proxy/`,
})

// or, with fromCDN:
await SearchClient.fromCDN(manifestUrl, {
  workerUrl: `${origin}/dist/worker.mjs`,
  transformersRemoteHost: `${origin}/api/hf-proxy/`,
})
```

Your proxy should forward to the Hub with the same path suffix after the prefix, for example:
```text
GET https://your.app/api/hf-proxy/avsolatorio/GIST-small-Embedding-v0/resolve/main/config.json
→ GET https://huggingface.co/avsolatorio/GIST-small-Embedding-v0/resolve/main/config.json
```
**Important:** Hub responses often 302 to `cdn-lfs.huggingface.co`. The proxy must follow redirects on the server and return the final 200 response to the browser; if you pass the 302 through, the browser will leave your origin and hit CSP again. The demo server below does this with `fetch(..., { redirect: 'follow' })`.
A second path shape (`/api/resolve-cache/models/<org>/<model>/<revision>/<file…>`) is supported by the same demo server for tools that emit that URL pattern; it maps to `https://huggingface.co/<org>/<model>/resolve/<revision>/<file…>`.
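Put together, the URL rewriting for both path shapes can be sketched as a pure function (the name `toHubUrl` and the fallback behavior are invented for illustration; this is not part of the package or demo server API):

```javascript
// Map a same-origin proxy path to the Hugging Face Hub URL it should fetch.
// Illustrative sketch only; toHubUrl is not exported by this package.
function toHubUrl(proxyPath) {
  // Shape 1: /api/hf-proxy/<org>/<model>/resolve/<revision>/<file...>
  const direct = proxyPath.match(/^\/api\/hf-proxy\/(.+)$/)
  if (direct) return `https://huggingface.co/${direct[1]}`

  // Shape 2: /api/resolve-cache/models/<org>/<model>/<revision>/<file...>
  const cached = proxyPath.match(
    /^\/api\/resolve-cache\/models\/([^/]+)\/([^/]+)\/([^/]+)\/(.+)$/
  )
  if (cached) {
    const [, org, model, revision, file] = cached
    return `https://huggingface.co/${org}/${model}/resolve/${revision}/${file}`
  }
  return null // unknown path: let the server respond 404
}
```

Remember that the server must still fetch the resulting URL with `redirect: 'follow'`, as noted above.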
## Disable BM25 (semantic-only)

To skip loading and building the BM25 index even when the manifest includes `bm25_corpus` (e.g. to save memory or avoid lexical search):
```js
new SearchClient(manifestUrl, { skipBm25: true })
// or with fromCDN:
await SearchClient.fromCDN(manifestUrl, { skipBm25: true })
```

Search will run in semantic-only mode; `mode: 'lexical'` and `mode: 'hybrid'` behave as semantic when BM25 is disabled.
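For intuition about what hybrid ranking means in general, one common approach is min-max normalization of each score list followed by a weighted sum. The sketch below is illustrative only; it is not this package's internal ranking formula, and the names and default weight are invented:

```javascript
// Illustration of one common hybrid-ranking scheme (NOT this package's formula).
// semantic/lexical: Map of doc id -> raw score from each retriever.
function hybridRank(semantic, lexical, alpha = 0.5) {
  // Min-max normalize a score map to [0, 1] so the two scales are comparable
  const normalize = (m) => {
    const vals = [...m.values()]
    const min = Math.min(...vals)
    const range = (Math.max(...vals) - min) || 1
    return new Map([...m].map(([id, s]) => [id, (s - min) / range]))
  }
  const s = normalize(semantic)
  const l = normalize(lexical)
  const ids = new Set([...s.keys(), ...l.keys()])
  // Blend: docs missing from one list contribute 0 for that component
  return [...ids]
    .map((id) => ({ id, score: alpha * (s.get(id) ?? 0) + (1 - alpha) * (l.get(id) ?? 0) }))
    .sort((a, b) => b.score - a.score)
}
```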
## Custom worker path (bundler)

If your bundler does not resolve the default worker URL, pass a factory:
```js
new SearchClient(manifestUrl, {
  workerFactory: () => new Worker(
    new URL('@ai4data/search/worker', import.meta.url),
    { type: 'module' }
  )
})
```

## Rerank worker (optional)
For cross-encoder reranking of results, use the separate rank worker:
```js
const rankWorker = new Worker(
  new URL('@ai4data/search/rank-worker', import.meta.url),
  { type: 'module' }
)
// Send { query, documents, top_k } and receive scored results
```

Vue and React adapters are planned (future: `@ai4data/search/vue`, `@ai4data/search/react`).
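Assuming the rank worker returns one score per input document (the shapes here are hypothetical, inferred only from the `{ query, documents, top_k }` comment above), applying the scores client-side is a simple reorder:

```javascript
// Hypothetical helper: reorder documents by cross-encoder scores and keep top_k.
// Shapes are assumptions for illustration, not this package's documented API.
function applyRerank(documents, scores, topK) {
  // scores[i] is the rerank score for documents[i]
  return documents
    .map((doc, i) => ({ doc, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
}
```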
## Building an index (Python pipeline)
`SearchClient` expects a hosted directory whose entry point is `manifest.json`, plus the index files it references (HNSW shards, `titles.json`, optional `bm25_corpus.json`, optional manually curated `index/highlights.json` via `index.highlights` in the manifest, or a flat `embeddings.int8.json` for small collections). This package does not build those artifacts; they come from the search pipeline in the main ai4data repository (except `highlights.json`, which you add yourself).
| Step | Script | Role |
|------|--------|------|
| 1 | `01_fetch_and_prepare.py` | Ingest documents → `metadata.json` |
| 2 | `02_generate_embeddings.py` | Encode with sentence-transformers → `raw_embeddings.npy`, flat index pieces |
| 3 | `03_build_index.py` | Build HNSW (or flat) layout → `manifest.json`, `index/`, etc. |
End-to-end: run `pipeline.py` from the repo (with uv) after installing the Python extra `search` (see root `pyproject.toml`):

```bash
# From the ai4data repo root
uv pip install -e ".[search]"
# or: uv sync --extra search
uv run python scripts/search/pipeline/pipeline.py --help
```

Full options, outputs (including gzip vs GitHub Pages), and examples: Search pipeline README (in-repo copy: `scripts/search/pipeline/README.md`).
Host the resulting output directory on any static host (or object storage with CORS) and pass the absolute URL to `manifest.json` into `new SearchClient(...)`.
## Demo

Demos are included under `demo/`:

- **Static server (local `dist/`):** run `npm run demo`, then open http://localhost:5173/demo/ (uses `serve`).
- **Standalone HTML** (loads the package from npm via esm.sh): open `demo/standalone.html` in a browser. Serve the file over HTTP (e.g. `npx serve .` from this package). No build step; the file uses `SearchClient.fromCDN` with a worker URL (see the file).
- **HF proxy + CSP-friendly demo:** run `npm run demo:proxy` from this package (builds, then starts `demo/proxy-server.mjs`). Open http://localhost:5173/, which serves `demo/hf-proxy-demo.html`; it sets `transformersRemoteHost` to the same-origin `/api/hf-proxy/` prefix. The proxy forwards to `huggingface.co` and follows redirects server-side, so the browser only talks to localhost.
All demos need a manifest URL pointing at a hosted manifest.json (from the Python pipeline above or an existing collection).
## Development

From the repo root (with workspaces):

```bash
npm install
npm run build --workspace=@ai4data/search
```

Or from this directory:

```bash
cd packages/ai4data/search
npm install
npm run build
```

## Publishing (maintainers)
The package is published under the @ai4data npm organization. Only maintainers with publish access to the org can release.
### Prerequisites

- An npm account that is a member of the `ai4data` org with permission to publish.
- Logged in locally: `npm login` (use your npm credentials or a machine account token).
### Steps

1. Bump the version in `package.json` (or use `npm version patch|minor|major` from this directory).
2. From the package directory, publish with public access (required for scoped packages):

   ```bash
   cd packages/ai4data/search
   npm publish --access public
   ```

   The `prepublishOnly` script runs `npm run build` automatically before packing, so the published tarball always includes an up-to-date `dist/`.
3. Optionally tag the release in git and push:

   ```bash
   git tag @ai4data/[email protected]
   git push origin @ai4data/[email protected]
   ```
### What gets published

- Only the `dist/` directory (built ESM and types) and `README.md` are included (see `files` in `package.json`).
- Consumers install with `npm install @ai4data/search`.
## License

MIT License with World Bank IGO Rider. See `LICENSE` in the repo root.
