albex
v0.1.0
Local full-text search engine — documents never leave the browser
Albex
Local full-text search for documents. Runs entirely in the browser — no server, no upload, no network request after the initial load.
Drop a DOCX, PDF, XLSX, TXT or XML file, start typing, get results in milliseconds.
Features
- Zero server — all text stays on the user's machine.
- Fuzzy matching — finds "contrato" even if you type "conttrato" (adaptive edit distance).
- Accent-insensitive — "accion" matches "acción", "espana" matches "España".
- Multi-format — DOCX, XLSX, PDF (text-based), TXT, XML.
- Phrase search — `"contrato marco"` requires the words to appear together.
- OR search — `contrato | acuerdo` unions two independent searches.
- No dependencies — one TypeScript file, two WASM binaries, nothing else.
- Tiny footprint — main WASM is ~14 KB on disk; PDF module (~1 MB) loads on demand.
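As an illustration of the accent-insensitive behaviour listed above, the following is a minimal sketch of accent folding in plain TypeScript. It is not Albex's actual implementation (which lives in the WASM module); `foldAccents` is a hypothetical helper name, and the approach shown (Unicode NFD decomposition followed by removal of combining marks) is only one common way to get this effect.

```typescript
// Hypothetical helper, not part of the Albex API: fold accents and case
// so that "acción" and "accion" compare equal.
function foldAccents(text: string): string {
  return text
    .normalize('NFD')                 // split "ó" into "o" + combining acute accent
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining diacritical marks
    .toLowerCase();
}

console.log(foldAccents('acción'));  // "accion"
console.log(foldAccents('España'));  // "espana"
```

Note that this folding also maps "ñ" to "n", which is consistent with the "espana" matches "España" example above.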
Installation
npm install albex
Or copy dist/albex.js, wasm/pkg/albex_wasm_bg.wasm (and optionally albex_pdf.wasm) to your project.
Quick start
import { AlbexEngine } from 'albex';
const engine = new AlbexEngine({
  wasmUrl: '/assets/albex_wasm_bg.wasm',
  pdfWasmUrl: '/assets/albex_pdf.wasm', // only needed for PDFs
});
await engine.init();
// Index a file from a <input type="file"> or drag-and-drop
const file = inputElement.files[0];
const doc = await engine.indexFile(file);
console.log(`Indexed ${doc.chunks} chunks in ${doc.indexTimeMs.toFixed(0)} ms`);
// Search
const results = engine.search('contrato marco');
for (const r of results) {
  console.log(`[${r.score}] ${r.documentName} — ${r.snippet}`);
}
Supported formats
| Extension | How text is extracted |
|-----------|----------------------|
| .docx | Native Rust/WASM XML parser — reads word/document.xml directly |
| .xlsx | Native Rust/WASM XML parser — reads shared strings + inline strings |
| .pdf | Separate albex_pdf.wasm (pure Rust, loaded on demand) |
| .txt | Plain text split on double newlines |
| .xml | Tag-stripped, entity-decoded |
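To make the "split on double newlines" rule for .txt files concrete, here is a rough sketch of paragraph chunking. The function name `splitParagraphs` is hypothetical, and the CRLF normalisation, trimming, and dropping of empty chunks are assumptions; the actual extractor may treat these cases differently.

```typescript
// Hypothetical sketch of TXT chunking: one chunk per paragraph,
// where paragraphs are separated by one or more blank lines.
function splitParagraphs(text: string): string[] {
  return text
    .replace(/\r\n/g, '\n')             // assumption: treat Windows line endings like "\n"
    .split(/\n{2,}/)                    // two or more newlines end a paragraph
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0); // assumption: empty chunks are not indexed
}
```

Under these assumptions, the chunk index would correspond to the paragraph number reported for TXT results.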
Query syntax
| Input | Behaviour |
|-------|-----------|
| contrato | Fuzzy match, accent-insensitive |
| contrato marco | Both words must appear in the same chunk |
| "contrato marco" | Both words AND they must be adjacent (phrase) |
| contrato \| acuerdo | OR: returns results matching either term |
Up to 4 space-separated tokens per simple/phrase query. OR branches are unlimited.
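The syntax rules above can be sketched as a small parser. This is an illustration only, not the engine's parser: `QueryBranch`, `MAX_TOKENS`, and `parseQuery` are hypothetical names, and truncating a branch to its first four tokens is an assumption, since the text above states the limit but not what happens beyond it.

```typescript
// Hypothetical types and parser, for illustration only.
interface QueryBranch {
  tokens: string[];  // at most MAX_TOKENS entries
  phrase: boolean;   // true when the branch was wrapped in double quotes
}

const MAX_TOKENS = 4; // "Up to 4 space-separated tokens" per branch

function parseQuery(input: string): QueryBranch[] {
  return input
    .split('|')                          // each "|" starts a new OR branch
    .map((part) => part.trim())
    .filter((part) => part.length > 0)   // ignore empty branches such as "a | | b"
    .map((part) => {
      const phrase = part.length >= 2 && part.startsWith('"') && part.endsWith('"');
      const body = phrase ? part.slice(1, -1) : part;
      // Assumption: tokens beyond the limit are dropped rather than rejected.
      const tokens = body.split(/\s+/).filter(Boolean).slice(0, MAX_TOKENS);
      return { tokens, phrase };
    });
}
```

For example, `parseQuery('"contrato marco" | acuerdo')` yields one phrase branch with two tokens and one plain branch with a single token.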
API reference
new AlbexEngine(opts)
interface AlbexOptions {
  wasmUrl: string; // required
  pdfWasmUrl?: string; // required only for PDF indexing
}
engine.init(): Promise<void>
Fetches and initialises the main WASM module. Must be called before anything else.
engine.indexFile(file: File): Promise<IndexedDocument>
Detects the file format by extension, extracts text, and adds it to the search index. Throws for unsupported extensions or parse errors.
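Because indexFile throws on unsupported extensions and parse errors, indexing a batch of files usually needs per-file error handling. The sketch below shows one way to do that. `indexAll`, `Indexer`, `IndexedDoc`, and `IndexReport` are hypothetical names introduced here; the structural `Indexer` interface stands in for the engine so the example type-checks without importing the package.

```typescript
// Structural stand-ins for the parts of the engine used in this sketch.
interface IndexedDoc { name: string; chunks: number; }
interface Indexer<F> { indexFile(file: F): Promise<IndexedDoc>; }

interface IndexReport {
  indexed: IndexedDoc[];
  failed: { name: string; reason: string }[];
}

// Hypothetical helper: index files one by one, collecting failures
// instead of aborting on the first unsupported or unparsable file.
async function indexAll<F extends { name: string }>(
  engine: Indexer<F>,
  files: F[],
): Promise<IndexReport> {
  const report: IndexReport = { indexed: [], failed: [] };
  for (const file of files) {
    try {
      report.indexed.push(await engine.indexFile(file));
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      report.failed.push({ name: file.name, reason });
    }
  }
  return report;
}
```

With a real engine this could be called as `indexAll(engine, Array.from(inputElement.files))`, with the `failed` list shown to the user.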
interface IndexedDocument {
  name: string;
  ext: string;
  chunks: number; // number of indexed text chunks
  indexTimeMs: number;
  textBytes: number; // raw UTF-8 text indexed
}
engine.search(query: string): SearchResult[]
Returns results sorted by score (0–1000, descending).
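Each result carries matchStart and matchEnd as byte offsets into the snippet, not as JavaScript string indices, so slicing the string directly can go wrong once accented characters are involved. The following sketch shows one way to split a snippet for highlighting. `splitAtMatch` is a hypothetical helper, and it assumes the offsets refer to the UTF-8 encoding of the snippet.

```typescript
// Hypothetical helper: split a snippet into before / match / after parts
// using byte offsets (assumed to be UTF-8 offsets).
function splitAtMatch(snippet: string, matchStart: number, matchEnd: number) {
  const bytes = new TextEncoder().encode(snippet); // UTF-8 bytes of the snippet
  const decoder = new TextDecoder();               // decodes UTF-8 by default
  return {
    before: decoder.decode(bytes.subarray(0, matchStart)),
    match: decoder.decode(bytes.subarray(matchStart, matchEnd)), // matchEnd is exclusive
    after: decoder.decode(bytes.subarray(matchEnd)),
  };
}
```

For the snippet "La acción legal" with offsets 3 and 10, this returns "acción" as the match, whereas `snippet.slice(3, 10)` would also include the trailing space because "ó" occupies two bytes.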
interface SearchResult {
  documentName: string;
  location: number; // paragraph (DOCX/TXT) or page (PDF, 1-based)
  score: number; // 0–1000
  snippet: string; // full chunk text (original, with accents)
  matchStart: number; // byte offset of match in snippet
  matchEnd: number; // exclusive
}
engine.getStats(): EngineStats
interface EngineStats {
  documents: number;
  chunks: number;
  textUsed: number; // bytes
  textCapacity: number; // 16 MB hard cap
  wasmMemoryBytes: number;
}
engine.getLastSearchStats(): SearchStats | null
Bloom/Bitap pipeline counters from the most recent search — useful for debugging and UI dashboards.
interface SearchStats {
  query: string;
  timeMs: number;
  results: number;
  bloomTested: number; // chunks tested
  bloomPassed: number; // passed bloom pre-filter
  bitapMatched: number; // confirmed by Bitap
}
Tuning
engine.setMaxErrors(n); // 0–3 (default 2, auto-scaled by query length)
engine.setThreshold(n); // 0–1000 minimum score (default 250)
engine.setMaxResults(n); // 1–200 (default 50)
engine.reset()
Clears all indexed documents. The engine is ready to index new files immediately after.
Capacity
| Resource | Limit |
|----------|-------|
| Documents | 128 |
| Chunks | 100 000 |
| Total text | 16 MB |
| Query length | 64 characters (longer queries are truncated) |
| Results | 200 (configurable, default 50) |
These are hard-coded BSS limits in the WASM module. Exceeding them is silent — the engine stops indexing additional content without error.
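Since hitting a limit produces no error, an application may want to check the engine statistics itself after indexing. The sketch below compares the values returned by getStats() against the limits in the table above. `reachedLimits`, `Stats`, and the two constants are hypothetical names, and the check only reports limits that are already reached; it cannot tell which content was dropped.

```typescript
// Subset of the EngineStats fields needed for this check.
interface Stats {
  documents: number;
  chunks: number;
  textUsed: number;
  textCapacity: number;
}

const MAX_DOCUMENTS = 128;  // from the capacity table
const MAX_CHUNKS = 100000;  // from the capacity table

// Hypothetical check: returns the names of the limits that have been reached.
function reachedLimits(stats: Stats): string[] {
  const reached: string[] = [];
  if (stats.documents >= MAX_DOCUMENTS) reached.push('documents');
  if (stats.chunks >= MAX_CHUNKS) reached.push('chunks');
  if (stats.textUsed >= stats.textCapacity) reached.push('text');
  return reached;
}
```

Calling `reachedLimits(engine.getStats())` after each indexFile call and warning the user when the list is non-empty is one way to surface the otherwise silent cut-off.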
Browser requirements
- WebAssembly (all modern browsers since 2017)
- DecompressionStream for DOCX/XLSX (Chrome 80+, Firefox 113+, Safari 16.4+)
- String.prototype.normalize for phrase search (all modern browsers)
PDF support additionally requires the albex_pdf.wasm module to be served with the correct MIME type (application/wasm).
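The requirements above can be checked at runtime before creating the engine. This is a minimal feature-detection sketch; `detectSupport` and the `BrowserSupport` shape are hypothetical and not part of the Albex API.

```typescript
// Hypothetical pre-flight check; the property names are illustrative.
interface BrowserSupport {
  wasm: boolean;                 // needed for everything
  decompressionStream: boolean;  // needed for DOCX/XLSX
  normalize: boolean;            // needed for phrase search
}

function detectSupport(): BrowserSupport {
  // Look the globals up by name so the check does not throw where they are missing.
  const g = globalThis as unknown as Record<string, unknown>;
  return {
    wasm: typeof g['WebAssembly'] === 'object',
    decompressionStream: typeof g['DecompressionStream'] === 'function',
    normalize: typeof String.prototype.normalize === 'function',
  };
}
```

An application could, for instance, disable the DOCX and XLSX file types in its file picker when `decompressionStream` is false while still offering TXT and XML.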
Building from source
# Install Rust + wasm-pack
rustup target add wasm32-unknown-unknown
# Build main WASM
cd wasm && cargo build --target wasm32-unknown-unknown --release
cp ../target/wasm32-unknown-unknown/release/albex_wasm.wasm pkg/albex_wasm_bg.wasm
# Build PDF WASM
cd ../pdf-wasm && cargo build --target wasm32-unknown-unknown --release
cp ../target/wasm32-unknown-unknown/release/albex_pdf.wasm ../wasm/pkg/albex_pdf.wasm
# Build TypeScript
cd .. && npm install && npm run build
Privacy
Albex does not transmit any document content. Text extraction, indexing, and search all happen inside the browser's WASM sandbox. The only network requests are the initial fetch of the .wasm binary files.
License
MIT
