albex

v0.1.0

Published

a day ago

Local full-text search engine — documents never leave the browser


bdovenbird

search full-text wasm browser offline docx pdf xlsx fuzzy privacy

Albex

Local full-text search for documents. Runs entirely in the browser — no server, no upload, no network request after the initial load.

Drop a DOCX, PDF, XLSX, TXT or XML file, start typing, get results in milliseconds.

Features

Zero server — all text stays on the user's machine.
Fuzzy matching — finds "contrato" even if you type "conttrato" (adaptive edit distance).
Accent-insensitive — "accion" matches "acción", "espana" matches "España".
Multi-format — DOCX, XLSX, PDF (text-based), TXT, XML.
Phrase search — "contrato marco" requires the words to appear together.
OR search — contrato | acuerdo unions two independent searches.
No dependencies — one TypeScript file, two WASM binaries, nothing else.
Tiny footprint — main WASM is ~14 KB on disk; PDF module (~1 MB) loads on demand.
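
To make the fuzzy and accent-insensitive behaviour concrete, here is a minimal sketch of the two ideas in plain TypeScript. It is an illustration only and not Albex's implementation, which runs inside the WASM module. The helper names `foldAccents` and `editDistance` are hypothetical and do not exist in the package.

```typescript
// Illustration only: these helpers are hypothetical and not part of Albex.

// Strip diacritics: decompose to NFD, then drop combining marks (U+0300 to U+036F).
function foldAccents(text: string): string {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
}

// Classic Levenshtein distance, computed with a single rolling row.
function editDistance(a: string, b: string): number {
  const row: number[] = [];
  for (let j = 0; j <= b.length; j++) row.push(j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]; // value from the previous row, previous column
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value from the previous row, same column
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}
```

With these definitions, `foldAccents('España')` and `foldAccents('espana')` produce the same string, and `editDistance('conttrato', 'contrato')` is 1, which is the kind of typo a small error budget is meant to tolerate.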

Installation

npm install albex

Or copy dist/albex.js and wasm/pkg/albex_wasm_bg.wasm (plus wasm/pkg/albex_pdf.wasm if you need PDF support) into your project.

Quick start

import { AlbexEngine } from 'albex';

const engine = new AlbexEngine({
  wasmUrl:    '/assets/albex_wasm_bg.wasm',
  pdfWasmUrl: '/assets/albex_pdf.wasm',   // only needed for PDFs
});

await engine.init();

// Index a file from a <input type="file"> or drag-and-drop
const file = inputElement.files[0];
const doc  = await engine.indexFile(file);
console.log(`Indexed ${doc.chunks} chunks in ${doc.indexTimeMs.toFixed(0)} ms`);

// Search
const results = engine.search('contrato marco');
for (const r of results) {
  console.log(`[${r.score}] ${r.documentName} — ${r.snippet}`);
}

Supported formats

| Extension | How text is extracted |
|-----------|-----------------------|
| .docx | Native Rust/WASM XML parser; reads word/document.xml directly |
| .xlsx | Native Rust/WASM XML parser; reads shared strings + inline strings |
| .pdf | Separate albex_pdf.wasm (pure Rust, loaded on demand) |
| .txt | Plain text split on double newlines |
| .xml | Tag-stripped, entity-decoded |

Query syntax

| Input | Behaviour |
|-------|-----------|
| contrato | Fuzzy match, accent-insensitive |
| contrato marco | Both words must appear in the same chunk |
| "contrato marco" | Both words must appear and be adjacent (phrase) |
| contrato \| acuerdo | OR: returns results matching either term |

A simple or phrase query accepts up to 4 space-separated tokens. The number of OR branches is not limited.
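
The syntax above is plain string concatenation, so queries can be assembled and sanity-checked before they reach `engine.search`. The sketch below is one way to do that. The helpers `phrase`, `anyOf` and `checkQuery` are hypothetical and not part of Albex. The limits come from this README, and the sketch assumes the 64-character limit applies to the whole query string and is counted in JavaScript string units.

```typescript
// Hypothetical helpers; limits are taken from this README (4 tokens, 64 characters).
const MAX_TOKENS = 4;
const MAX_QUERY_CHARS = 64; // assumption: counted over the whole query string

// Wrap a multi-word term in double quotes so it is treated as a phrase.
function phrase(words: string): string {
  return `"${words.trim().replace(/"/g, '')}"`;
}

// Join alternatives with the OR operator, skipping empty branches.
function anyOf(...branches: string[]): string {
  return branches.map(b => b.trim()).filter(b => b.length > 0).join(' | ');
}

// List the documented limits that a query would exceed (empty array = none).
function checkQuery(query: string): string[] {
  const problems: string[] = [];
  if (query.length > MAX_QUERY_CHARS) {
    problems.push(`query is ${query.length} characters (max ${MAX_QUERY_CHARS})`);
  }
  for (const branch of query.split('|')) {
    const tokens = branch.replace(/"/g, ' ').trim().split(/\s+/).filter(Boolean);
    if (tokens.length > MAX_TOKENS) {
      problems.push(`branch "${branch.trim()}" has ${tokens.length} tokens (max ${MAX_TOKENS})`);
    }
  }
  return problems;
}
```

For example, `anyOf(phrase('contrato marco'), 'acuerdo')` builds a query with one phrase branch and one single-word branch.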

API reference

`new AlbexEngine(opts)`

interface AlbexOptions {
  wasmUrl:     string;   // required
  pdfWasmUrl?: string;   // required only for PDF indexing
}

`engine.init(): Promise<void>`

Fetches and initialises the main WASM module. Must be called, and awaited, before any other method.

`engine.indexFile(file: File): Promise<IndexedDocument>`

Detects the file format by extension, extracts text, and adds it to the search index. Throws for unsupported extensions or parse errors.

interface IndexedDocument {
  name:        string;
  ext:         string;
  chunks:      number;   // number of indexed text chunks
  indexTimeMs: number;
  textBytes:   number;   // raw UTF-8 text indexed
}
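
Because `indexFile` throws for unsupported extensions and parse errors, a batch drop of mixed files needs per-file error handling. The following is a minimal sketch of that pattern. The `indexAll` helper and the `...Like` interfaces are hypothetical, declared locally so the block compiles without the package. Files are indexed one at a time as a conservative choice, since this README does not say whether concurrent indexing is supported.

```typescript
// Structural stand-ins for the types described in this README (hypothetical names).
interface IndexedDocumentLike { name: string; chunks: number; }
interface IndexerLike<F> { indexFile(file: F): Promise<IndexedDocumentLike>; }

interface BatchResult {
  indexed: IndexedDocumentLike[];
  failed: { name: string; reason: string }[];
}

// Index files sequentially; collect failures instead of aborting the batch.
async function indexAll<F extends { name: string }>(
  engine: IndexerLike<F>,
  files: Iterable<F>,
): Promise<BatchResult> {
  const result: BatchResult = { indexed: [], failed: [] };
  for (const file of files) {
    try {
      result.indexed.push(await engine.indexFile(file));
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      result.failed.push({ name: file.name, reason });
    }
  }
  return result;
}
```

In a browser this could be called as `indexAll(engine, Array.from(inputElement.files ?? []))`, with the `failed` list shown to the user.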

`engine.search(query: string): SearchResult[]`

Returns results sorted by score (0–1000, descending).

interface SearchResult {
  documentName: string;
  location:     number;   // paragraph (DOCX/TXT) or page (PDF, 1-based)
  score:        number;   // 0–1000
  snippet:      string;   // full chunk text (original, with accents)
  matchStart:   number;   // byte offset of match in snippet
  matchEnd:     number;   // exclusive
}
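
`matchStart` and `matchEnd` are byte offsets, while JavaScript strings are indexed in UTF-16 code units, so `snippet.slice(matchStart, matchEnd)` can be off whenever the snippet contains accented characters. The sketch below splits a snippet for highlighting by working on the encoded bytes. It assumes the offsets refer to the UTF-8 encoding of `snippet`. The `splitMatch` helper and `MatchSpan` interface are hypothetical.

```typescript
// Hypothetical helper; assumes matchStart/matchEnd are UTF-8 byte offsets.
interface MatchSpan { snippet: string; matchStart: number; matchEnd: number; }

function splitMatch(r: MatchSpan): { before: string; match: string; after: string } {
  const bytes = new TextEncoder().encode(r.snippet); // UTF-8 bytes of the snippet
  const decoder = new TextDecoder();                 // decodes UTF-8 by default
  return {
    before: decoder.decode(bytes.subarray(0, r.matchStart)),
    match:  decoder.decode(bytes.subarray(r.matchStart, r.matchEnd)),
    after:  decoder.decode(bytes.subarray(r.matchEnd)),
  };
}
```

The three parts can then be rendered as text nodes with the middle one wrapped in a `<mark>` element, which avoids building HTML from document content by string concatenation.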

`engine.getStats(): EngineStats`

interface EngineStats {
  documents:       number;
  chunks:          number;
  textUsed:        number;   // bytes
  textCapacity:    number;   // 16 MB hard cap
  wasmMemoryBytes: number;
}

`engine.getLastSearchStats(): SearchStats | null`

Bloom/Bitap pipeline counters from the most recent search — useful for debugging and UI dashboards.

interface SearchStats {
  query:        string;
  timeMs:       number;
  results:      number;
  bloomTested:  number;   // chunks tested
  bloomPassed:  number;   // passed bloom pre-filter
  bitapMatched: number;   // confirmed by Bitap
}
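
For a dashboard, the raw counters are often easier to read as ratios. This is a small sketch under the assumption that the counters form a funnel (tested, then passed the Bloom pre-filter, then confirmed by Bitap). The `pipelineRatios` helper and `SearchStatsLike` interface are hypothetical.

```typescript
// Hypothetical helper; mirrors three fields of SearchStats from this README.
interface SearchStatsLike { bloomTested: number; bloomPassed: number; bitapMatched: number; }

function pipelineRatios(s: SearchStatsLike): { bloomPassRate: number; bitapConfirmRate: number } {
  // Guard against division by zero when nothing was tested or nothing passed.
  const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);
  return {
    bloomPassRate:    ratio(s.bloomPassed, s.bloomTested),   // share of chunks surviving the pre-filter
    bitapConfirmRate: ratio(s.bitapMatched, s.bloomPassed),  // share of survivors confirmed as matches
  };
}
```

A low `bitapConfirmRate` would suggest that the pre-filter is letting through many chunks that the exact matcher then rejects.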

Tuning

engine.setMaxErrors(n);     // 0–3  (default 2, auto-scaled by query length)
engine.setThreshold(n);     // 0–1000 minimum score (default 250)
engine.setMaxResults(n);    // 1–200 (default 50)

`engine.reset()`

Clears all indexed documents. The engine is ready to index new files immediately after.

Capacity

| Resource | Limit |
|----------|-------|
| Documents | 128 |
| Chunks | 100 000 |
| Total text | 16 MB |
| Query length | 64 characters (longer queries are truncated) |
| Results | 200 (configurable, default 50) |

These limits are hard-coded in the WASM module as statically allocated (BSS) buffers. Exceeding one fails silently: the engine stops indexing further content and raises no error, so check `engine.getStats()` after indexing large batches.
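
Since hitting a limit produces no error, the caller has to detect it from `engine.getStats()`. The sketch below turns a stats snapshot into warnings. The `capacityWarnings` helper, the `EngineStatsLike` interface and the 95 % headroom default are assumptions for illustration; the document and chunk limits are copied from the table above.

```typescript
// Hypothetical helper; field names mirror EngineStats from this README.
interface EngineStatsLike { documents: number; chunks: number; textUsed: number; textCapacity: number; }

// Limits copied from the Capacity table.
const LIMITS = { documents: 128, chunks: 100_000 };

// Return human-readable warnings once usage reaches `headroom` of a limit.
function capacityWarnings(s: EngineStatsLike, headroom = 0.95): string[] {
  const warnings: string[] = [];
  if (s.documents >= LIMITS.documents) {
    warnings.push(`document limit reached (${s.documents}/${LIMITS.documents})`);
  }
  if (s.chunks >= LIMITS.chunks * headroom) {
    warnings.push(`chunk usage high (${s.chunks}/${LIMITS.chunks})`);
  }
  if (s.textUsed >= s.textCapacity * headroom) {
    warnings.push(`text buffer nearly full (${s.textUsed}/${s.textCapacity} bytes)`);
  }
  return warnings;
}
```

Calling `capacityWarnings(engine.getStats())` after each `indexFile` gives the UI a chance to tell the user that later files may not have been indexed in full.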

Browser requirements

WebAssembly (all modern browsers since 2017)
DecompressionStream for DOCX/XLSX (Chrome 80+, Firefox 113+, Safari 16.4+)
String.prototype.normalize for phrase search (all modern browsers)

PDF support additionally requires the albex_pdf.wasm module to be served with the correct MIME type (application/wasm).
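
These requirements can be checked up front so the UI can disable formats the browser cannot handle. The following is a minimal feature-detection sketch. The `detectCapabilities` helper and `Capabilities` interface are hypothetical; the globals it probes (`WebAssembly`, `DecompressionStream`, `String.prototype.normalize`) are the standard ones named in the list above.

```typescript
// Hypothetical helper: reports which of the listed browser features exist.
interface Capabilities {
  wasm: boolean;              // WebAssembly available (needed for everything)
  compressedFormats: boolean; // DecompressionStream available (needed for DOCX/XLSX)
  normalize: boolean;         // String.prototype.normalize available
}

// `scope` defaults to the real global object; tests can pass a stub instead.
function detectCapabilities(
  scope: Record<string, unknown> = globalThis as unknown as Record<string, unknown>,
): Capabilities {
  const wasm = scope['WebAssembly'];
  return {
    wasm: typeof wasm === 'object' && wasm !== null,
    compressedFormats: typeof scope['DecompressionStream'] === 'function',
    normalize: typeof String.prototype.normalize === 'function',
  };
}
```

If `compressedFormats` is false, a page could still accept TXT and XML files while rejecting DOCX and XLSX with a clear message.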

Building from source

# Prerequisites: Rust (via rustup) and wasm-pack. Then add the WebAssembly target:
rustup target add wasm32-unknown-unknown

# Build main WASM
cd wasm && cargo build --target wasm32-unknown-unknown --release
cp ../target/wasm32-unknown-unknown/release/albex_wasm.wasm pkg/albex_wasm_bg.wasm

# Build PDF WASM
cd ../pdf-wasm && cargo build --target wasm32-unknown-unknown --release
cp ../target/wasm32-unknown-unknown/release/albex_pdf.wasm ../wasm/pkg/albex_pdf.wasm

# Build TypeScript
cd .. && npm install && npm run build

Privacy

Albex does not transmit any document content. Text extraction, indexing, and search all happen inside the browser's WASM sandbox. The only network requests are the initial fetch of the .wasm binary files.

License

MIT
