# @ai4data/search
Framework-agnostic semantic search client for the AI for Data – Data for AI program. Combines HNSW approximate nearest-neighbour search, BM25 lexical search, and hybrid ranking — all running in a Web Worker. Published under the @ai4data npm organization.
> **Browser only:** requires a browser environment (Web Workers, `fetch`, Cache API).
## Installation

```bash
npm install @ai4data/search
# or
yarn add @ai4data/search
# or
pnpm add @ai4data/search
```

## Usage
```js
import { SearchClient } from '@ai4data/search'

const client = new SearchClient('https://example.com/data/your-collection/manifest.json')

client.on('index_ready', () => {
  client.search('climate finance', { topK: 10, mode: 'hybrid' })
})

client.on('results', ({ data, stats }) => {
  console.log(data)  // SearchResult[]
  console.log(stats) // SearchStats | null
})

// Clean up when done
client.destroy()
```

## Curated highlights (optional)
If `manifest.json` includes `index.highlights` (e.g. `"index/highlights.json"`), host that file next to your index. Use the same field shape as `titles.json` entries (`id`, `title`, `idno`, `type`, …). For a fixed display order, use a JSON array of objects; an object keyed by `id` (like `titles.json`) is also supported (enumeration order applies).
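For illustration, an array-form `highlights.json` could look like this (every value below is invented; use the ids and fields from your own collection):

```json
[
  { "id": "doc-001", "title": "Climate Finance Overview", "idno": "RPT-2024-001", "type": "document" },
  { "id": "doc-002", "title": "Data for AI Primer", "idno": "RPT-2024-002", "type": "document" }
]
```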
```js
client.on('highlights', ({ data }) => setSpotlightDocs(data))
client.on('index_ready', () => client.getHighlights(8))
```

## Loading from a CDN (any origin)
When you load the package from a CDN (e.g. via an import map) and your page is on another origin, the browser blocks creating a worker from the CDN URL. Use `SearchClient.fromCDN()` so the worker is fetched and run from a blob URL (same-origin):
```js
import { SearchClient } from '@ai4data/search'

const client = await SearchClient.fromCDN(manifestUrl)
client.on('results', ({ data }) => console.log(data))
```

`workerUrl` defaults to unpkg for the current package version (injected at build time), so you usually don't need to pass it. To pin a different version, pass `workerUrl: 'https://unpkg.com/@ai4data/[email protected]/dist/worker.mjs'`. If you pass an esm.sh worker URL, the client fetches the raw bundle from unpkg (esm.sh returns a wrapper that fails from a blob). This works from any static host with no build step.
## CSP and Hugging Face (embedding proxy)
The search worker loads the embedding model with `@xenova/transformers`, which fetches ONNX and tokenizer files from the Hugging Face Hub by default (`https://huggingface.co/...`). If your page's Content Security Policy blocks `connect-src` to huggingface.co or to LFS/CDN hosts, those requests will fail.

Recommended approach: serve a same-origin reverse proxy that forwards to the Hub, and point Transformers.js at it via `env.remoteHost`. This package exposes that as `transformersRemoteHost` (and optionally `transformersRemotePathTemplate` if your proxy does not mirror the Hub path layout).
- `modelId` stays a full Hub repo id (e.g. `avsolatorio/GIST-small-Embedding-v0`). It is not a raw directory URL; the library builds paths like `…/{model}/resolve/{revision}/tokenizer.json`.
- Pass an absolute base URL (or a path relative to the page, which `SearchClient` resolves against `location.href` so blob workers still target your app origin):
```js
const origin = typeof location !== 'undefined' ? location.origin : 'https://example.com'

const client = new SearchClient(manifestUrl, {
  transformersRemoteHost: `${origin}/api/hf-proxy/`,
})

// or, with fromCDN:
await SearchClient.fromCDN(manifestUrl, {
  workerUrl: `${origin}/dist/worker.mjs`,
  transformersRemoteHost: `${origin}/api/hf-proxy/`,
})
```

Your proxy should forward to the Hub with the same path suffix after the prefix, for example:
```text
GET https://your.app/api/hf-proxy/avsolatorio/GIST-small-Embedding-v0/resolve/main/config.json
→ GET https://huggingface.co/avsolatorio/GIST-small-Embedding-v0/resolve/main/config.json
```
**Important:** Hub responses often 302 to `cdn-lfs.huggingface.co`. The proxy must follow redirects on the server and return the final 200 response to the browser; if you pass the 302 through, the browser will leave your origin and hit CSP again. The demo server below does this with `fetch(..., { redirect: 'follow' })`.
A second path shape (`/api/resolve-cache/models/<org>/<model>/<revision>/<file…>`) is supported by the same demo server for tools that emit that URL pattern; it maps to `https://huggingface.co/<org>/<model>/resolve/<revision>/<file…>`.
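Put together, the URL rewriting for both path shapes can be sketched as a pure function (the name `toHubUrl` and the fallback behavior are invented for illustration; this is not part of the package or demo server API):

```javascript
// Map a same-origin proxy path to the Hugging Face Hub URL it should fetch.
// Illustrative sketch only; toHubUrl is not exported by this package.
function toHubUrl(proxyPath) {
  // Shape 1: /api/hf-proxy/<org>/<model>/resolve/<revision>/<file...>
  const direct = proxyPath.match(/^\/api\/hf-proxy\/(.+)$/)
  if (direct) return `https://huggingface.co/${direct[1]}`

  // Shape 2: /api/resolve-cache/models/<org>/<model>/<revision>/<file...>
  const cached = proxyPath.match(
    /^\/api\/resolve-cache\/models\/([^/]+)\/([^/]+)\/([^/]+)\/(.+)$/
  )
  if (cached) {
    const [, org, model, revision, file] = cached
    return `https://huggingface.co/${org}/${model}/resolve/${revision}/${file}`
  }
  return null // unknown path: let the server respond 404
}
```

Remember that the server must still fetch the resulting URL with `redirect: 'follow'`, as noted above.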
## Disable BM25 (semantic-only)

To skip loading and building the BM25 index even when the manifest includes `bm25_corpus` (e.g. to save memory or avoid lexical search):
```js
new SearchClient(manifestUrl, { skipBm25: true })
// or with fromCDN:
await SearchClient.fromCDN(manifestUrl, { skipBm25: true })
```

Search will run in semantic-only mode; `mode: 'lexical'` and `mode: 'hybrid'` behave as semantic when BM25 is disabled.
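For intuition about what hybrid ranking means in general, one common approach is min-max normalization of each score list followed by a weighted sum. The sketch below is illustrative only; it is not this package's internal ranking formula, and the names and default weight are invented:

```javascript
// Illustration of one common hybrid-ranking scheme (NOT this package's formula).
// semantic/lexical: Map of doc id -> raw score from each retriever.
function hybridRank(semantic, lexical, alpha = 0.5) {
  // Min-max normalize a score map to [0, 1] so the two scales are comparable
  const normalize = (m) => {
    const vals = [...m.values()]
    const min = Math.min(...vals)
    const range = (Math.max(...vals) - min) || 1
    return new Map([...m].map(([id, s]) => [id, (s - min) / range]))
  }
  const s = normalize(semantic)
  const l = normalize(lexical)
  const ids = new Set([...s.keys(), ...l.keys()])
  // Blend: docs missing from one list contribute 0 for that component
  return [...ids]
    .map((id) => ({ id, score: alpha * (s.get(id) ?? 0) + (1 - alpha) * (l.get(id) ?? 0) }))
    .sort((a, b) => b.score - a.score)
}
```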
## Custom worker path (bundler)

If your bundler does not resolve the default worker URL, pass a factory:
```js
new SearchClient(manifestUrl, {
  workerFactory: () => new Worker(
    new URL('@ai4data/search/worker', import.meta.url),
    { type: 'module' }
  )
})
```

## Rerank worker (optional)
For cross-encoder reranking of results, use the separate rank worker:
```js
const rankWorker = new Worker(
  new URL('@ai4data/search/rank-worker', import.meta.url),
  { type: 'module' }
)
// Send { query, documents, top_k } and receive scored results
```

Vue and React adapters are planned (future: `@ai4data/search/vue`, `@ai4data/search/react`).
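Assuming the rank worker returns one score per input document (the shapes here are hypothetical, inferred only from the `{ query, documents, top_k }` comment above), applying the scores client-side is a simple reorder:

```javascript
// Hypothetical helper: reorder documents by cross-encoder scores and keep top_k.
// Shapes are assumptions for illustration, not this package's documented API.
function applyRerank(documents, scores, topK) {
  // scores[i] is the rerank score for documents[i]
  return documents
    .map((doc, i) => ({ doc, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
}
```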
## Building an index (Python pipeline)
`SearchClient` expects a hosted directory whose entry point is `manifest.json`, plus the index files it references (HNSW shards, `titles.json`, optional `bm25_corpus.json`, optional manually curated `index/highlights.json` via `index.highlights` in the manifest, or a flat `embeddings.int8.json` for small collections). This package does not build those artifacts; they come from the search pipeline in the main ai4data repository (except `highlights.json`, which you add yourself).
| Step | Script | Role |
|------|--------|------|
| 1 | `01_fetch_and_prepare.py` | Ingest documents → `metadata.json` |
| 2 | `02_generate_embeddings.py` | Encode with sentence-transformers → `raw_embeddings.npy`, flat index pieces |
| 3 | `03_build_index.py` | Build HNSW (or flat) layout → `manifest.json`, `index/`, etc. |
End-to-end: run `pipeline.py` from the repo (with uv) after installing the Python extra `search` (see root `pyproject.toml`):

```bash
# From the ai4data repo root
uv pip install -e ".[search]"
# or: uv sync --extra search
uv run python scripts/search/pipeline/pipeline.py --help
```

Full options, outputs (including gzip vs GitHub Pages), and examples: Search pipeline README (in-repo copy: `scripts/search/pipeline/README.md`).
Host the resulting output directory on any static host (or object storage with CORS) and pass the absolute URL to `manifest.json` into `new SearchClient(...)`.
## Demo

Demos are included under `demo/`:

- **Static server (local `dist/`):** run `npm run demo`, then open http://localhost:5173/demo/ (uses `serve`).
- **Standalone HTML** (loads the package from npm via esm.sh): open `demo/standalone.html` in a browser. Serve the file over HTTP (e.g. `npx serve .` from this package). No build step; the file uses `SearchClient.fromCDN` with a worker URL (see the file).
- **HF proxy + CSP-friendly demo:** run `npm run demo:proxy` from this package (builds, then starts `demo/proxy-server.mjs`). Open http://localhost:5173/, which serves `demo/hf-proxy-demo.html`; it sets `transformersRemoteHost` to the same-origin `/api/hf-proxy/` prefix. The proxy forwards to `huggingface.co` and follows redirects server-side, so the browser only talks to localhost.
All demos need a manifest URL pointing at a hosted manifest.json (from the Python pipeline above or an existing collection).
## Development

From the repo root (with workspaces):

```bash
npm install
npm run build --workspace=@ai4data/search
```

Or from this directory:

```bash
cd packages/ai4data/search
npm install
npm run build
```

## Publishing (maintainers)
The package is published under the @ai4data npm organization. Only maintainers with publish access to the org can release.
### Prerequisites

- An npm account that is a member of the `ai4data` org with permission to publish.
- Logged in locally: `npm login` (use your npm credentials or a machine account token).
### Steps

1. Bump the version in `package.json` (or use `npm version patch|minor|major` from this directory).
2. From the package directory, publish with public access (required for scoped packages):

   ```bash
   cd packages/ai4data/search
   npm publish --access public
   ```

   The `prepublishOnly` script runs `npm run build` automatically before packing, so the published tarball always includes an up-to-date `dist/`.
3. Optionally tag the release in git and push:

   ```bash
   git tag @ai4data/[email protected]
   git push origin @ai4data/[email protected]
   ```
### What gets published

- Only the `dist/` directory (built ESM and types) and `README.md` are included (see `files` in `package.json`).
- Consumers install with `npm install @ai4data/search`.
## License

MIT License with World Bank IGO Rider. See `LICENSE` in the repo root.
