@latinfo/engine
v0.5.7
Binary lookup, inverted-index search, and MPHF engine for Latin American business registries
Binary lookup, inverted-index search, and minimal perfect hash (MPHF) engine for content-addressed datasets. Storage-agnostic — bring your own backend (R2, S3, filesystem, in-memory).
Pure TypeScript, zero runtime dependencies. Works in Node.js, Cloudflare Workers, Bun, Deno, and modern browsers.
Install

    npm install @latinfo/engine

Usage

    import { lookupId, searchByName } from '@latinfo/engine';
    import type { Storage, SourceConfig } from '@latinfo/engine';

    // Implement the Storage interface for your backend
    const storage: Storage = {
      async get(key, options) {
        // Fetch `key` from your backend, honoring options?.range when present.
        // Return an object exposing arrayBuffer(), or null.
        return null; // placeholder so the stub type-checks
      },
    };

    // Provide your own SourceConfig (the engine is dataset-agnostic)
    const source: SourceConfig = {
      country: 'xx', institution: 'agency', dataset: 'records',
      source: 'agency-records', baseName: 'agency-records',
      routePath: '/xx/agency/records',
      primaryId: { name: 'id', length: 11, regex: /^\d{11}$/, prefixLength: 5 },
      alternateIds: [],
      fieldNames: ['name', 'status', 'address'],
      searchFieldIndex: 0,
    };

    // Lookup by ID: O(log n) binary search over a content-addressed shard
    const record = await lookupId(storage, source, '12345678901');

    // Search by name: inverted index with TF-IDF scoring
    const results = await searchByName(storage, source, 'acme corporation');

API
Lookup
- lookupId(storage, source, id): exact lookup by primary ID
- lookupIds(storage, source, id): for multiRecord: true sources (one ID → many records)
- loadIndex(storage, source): preload the prefix index
- clearIndexCache(baseName?): clear in-memory cache
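
To illustrate the O(log n) lookup mentioned in the usage example, here is a minimal sketch of binary search over sorted, fixed-width records. The record layout (`RECORD_LEN`, an ASCII ID followed by a payload) and the helper names `buildShard` and `findRecord` are assumptions made for this example; they are not the engine's actual binary format, and the prefix index and sharding are left out.

```typescript
const ID_LEN = 11;     // matches primaryId.length in the usage example
const RECORD_LEN = 16; // hypothetical: 11-byte ASCII ID + 5 bytes of payload

// Compare the ID stored in record `index` against `id`, byte by byte.
function compareAt(buf: Uint8Array, index: number, id: string): number {
  const base = index * RECORD_LEN;
  for (let i = 0; i < ID_LEN; i++) {
    const diff = buf[base + i] - id.charCodeAt(i);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Classic binary search: returns the record index, or -1 when the ID is absent.
function findRecord(buf: Uint8Array, id: string): number {
  let lo = 0;
  let hi = buf.length / RECORD_LEN - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    const cmp = compareAt(buf, mid, id);
    if (cmp === 0) return mid;
    if (cmp < 0) lo = mid + 1; // stored ID sorts before the target
    else hi = mid - 1;
  }
  return -1;
}

// Build a tiny sorted shard for demonstration (payload bytes stay zero).
function buildShard(ids: string[]): Uint8Array {
  const sorted = ids.slice().sort();
  const buf = new Uint8Array(sorted.length * RECORD_LEN);
  sorted.forEach((id, n) => {
    for (let i = 0; i < ID_LEN; i++) buf[n * RECORD_LEN + i] = id.charCodeAt(i);
  });
  return buf;
}
```

Because every record has the same width, the search can jump straight to record `mid` without scanning, which is what makes range reads against object storage practical.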
Search
- searchByName(storage, source, query): V2/V3 inverted index with inline fields
- resolveQuery(storage, source, query): ODIS-style; returns instructions for client-side resolution
- getPostingData(storage, source, terms): fetch posting lists by term
- clearSearchCache(baseName?): clear in-memory cache
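
As a rough illustration of how an inverted index with TF-IDF scoring ranks results, the sketch below builds posting lists in memory and scores a query against them. The names `buildIndex` and `search`, the whitespace tokenization, and the exact IDF formula are assumptions for this example; the engine's on-disk V2/V3 layout and scoring details may differ.

```typescript
// term -> (document id -> term frequency in that document)
type Postings = Map<string, Map<number, number>>;

function buildIndex(docs: string[]): Postings {
  const index: Postings = new Map();
  docs.forEach((doc, docId) => {
    for (const term of doc.toLowerCase().split(/\s+/).filter(Boolean)) {
      const posting = index.get(term) ?? new Map<number, number>();
      posting.set(docId, (posting.get(docId) ?? 0) + 1);
      index.set(term, posting);
    }
  });
  return index;
}

function search(
  index: Postings,
  totalDocs: number,
  query: string,
): { docId: number; score: number }[] {
  const scores = new Map<number, number>();
  for (const term of query.toLowerCase().split(/\s+/).filter(Boolean)) {
    const posting = index.get(term);
    if (!posting) continue;
    // Rare terms (short posting lists) get a higher weight.
    const idf = Math.log(1 + totalDocs / posting.size);
    posting.forEach((tf, docId) => {
      scores.set(docId, (scores.get(docId) ?? 0) + tf * idf);
    });
  }
  return Array.from(scores, ([docId, score]) => ({ docId, score }))
    .sort((a, b) => b.score - a.score);
}
```

Only the posting lists for the query terms are touched, so the cost depends on how common those terms are rather than on the total number of records.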
Tenders / Procurement
- searchLicitaciones(storage, config, query): procurement search with filters (category, amount, buyer, method, status)
- licitacionesInfo(storage, config): total record count
- clearLicitacionesCache(baseName?)
LicitacionesConfig is provided by the caller, so you can run multiple named instances (e.g. current vs historical archives) against the same engine.
MPHF (Minimal Perfect Hash)
- BBHash: class for building MPHFs
- buildMphfFile(terms, totalDocs, entrySize): build from a term list
- serializeMphf(mphf) / deserializeMphf(buf): wire format
- mphfExactLookup(mphf, term): O(1) term lookup
- mphfResolveToken(mphf, token, isLast): prefix-aware resolution
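
For readers unfamiliar with BBHash-style construction, the following is a simplified sketch of the idea: hash the keys into a bit array, keep the positions hit by exactly one key, and push colliding keys to the next level. Everything here (`hash32`, `buildSketchMphf`, `sketchLookup`, the `gamma` and `maxLevels` parameters, the fallback map) is hypothetical and is not the package's `BBHash` class. It uses a seeded FNV-1a variant rather than MurmurHash3, stores one byte per bit, and computes ranks by linear scan for clarity.

```typescript
// Seeded 32-bit FNV-1a with a final mixing step (a stand-in hash for this sketch).
function hash32(key: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  h ^= h >>> 16; h = Math.imul(h, 0x85ebca6b) >>> 0;
  h ^= h >>> 13; h = Math.imul(h, 0xc2b2ae35) >>> 0;
  h ^= h >>> 16;
  return h >>> 0;
}

interface SketchMphf {
  levels: Uint8Array[];          // one 0/1 array per level
  fallback: Map<string, number>; // keys still colliding after maxLevels
  size: number;                  // number of distinct keys
}

// Count the set entries in bits[0..end).
function popcount(bits: Uint8Array, end: number): number {
  let n = 0;
  for (let i = 0; i < end; i++) n += bits[i];
  return n;
}

function buildSketchMphf(keys: string[], gamma = 2, maxLevels = 16): SketchMphf {
  const levels: Uint8Array[] = [];
  let remaining = Array.from(new Set(keys));
  for (let level = 0; level < maxLevels && remaining.length > 0; level++) {
    const m = Math.ceil(gamma * remaining.length);
    const counts = new Uint8Array(m); // saturating counter: 0, 1, or 2+
    for (const k of remaining) {
      const p = hash32(k, level) % m;
      if (counts[p] < 2) counts[p]++;
    }
    const bits = new Uint8Array(m);
    const next: string[] = [];
    for (const k of remaining) {
      const p = hash32(k, level) % m;
      if (counts[p] === 1) bits[p] = 1; // this key owns position p
      else next.push(k);                // collided: retry at the next level
    }
    levels.push(bits);
    remaining = next;
  }
  let assigned = 0;
  for (const bits of levels) assigned += popcount(bits, bits.length);
  const fallback = new Map<string, number>();
  for (const k of remaining) fallback.set(k, assigned++);
  return { levels, fallback, size: assigned };
}

// Returns a unique index in [0, size) for every key used at build time.
// Keys outside the build set may return an arbitrary index or -1.
function sketchLookup(mphf: SketchMphf, key: string): number {
  let offset = 0;
  for (let level = 0; level < mphf.levels.length; level++) {
    const bits = mphf.levels[level];
    const p = hash32(key, level) % bits.length;
    if (bits[p] === 1) return offset + popcount(bits, p);
    offset += popcount(bits, bits.length);
  }
  const f = mphf.fallback.get(key);
  return f === undefined ? -1 : f;
}
```

A production implementation would pack the levels into real bit vectors and precompute rank tables so that each lookup takes constant time instead of scanning.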
Tokenizer
- tokenize(text) / tokenizeKeepStopWords(text) / isAllStopWords(text)
- STOP_WORDS_BY_LANG: stop word sets for ES, PT, EN
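
The sketch below shows one plausible shape for a stop-word-aware tokenizer. The normalization rules (diacritic stripping, lowercasing, splitting on non-alphanumerics), the word lists in `SKETCH_STOP_WORDS`, and the function names are all assumptions for illustration; the engine's actual `tokenize` behavior and `STOP_WORDS_BY_LANG` contents are not documented here.

```typescript
// Hypothetical stop-word table, keyed by language code.
const SKETCH_STOP_WORDS: Record<string, Set<string>> = {
  es: new Set(['de', 'la', 'el', 'y', 'los', 'las', 'del']),
  pt: new Set(['de', 'da', 'do', 'e', 'os', 'as']),
  en: new Set(['the', 'of', 'and', 'a', 'an']),
};

// Normalize and split: strip diacritics, lowercase, break on non-alphanumerics.
function splitWords(text: string): string[] {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
}

// A word counts as a stop word if any language's set contains it.
function isStopWord(word: string): boolean {
  return Object.keys(SKETCH_STOP_WORDS).some((lang) =>
    SKETCH_STOP_WORDS[lang].has(word),
  );
}

// Tokens with stop words removed.
function sketchTokenize(text: string): string[] {
  return splitWords(text).filter((w) => !isStopWord(w));
}

// True when the text has at least one word and every word is a stop word.
function sketchIsAllStopWords(text: string): boolean {
  const words = splitWords(text);
  return words.length > 0 && words.every(isStopWord);
}
```

A check like `sketchIsAllStopWords` is useful because a query made only of stop words would otherwise tokenize to nothing; a keep-stop-words variant can then be used as a fallback.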
Helpers
- dniToRuc(dni): Peru DNI to RUC conversion (publicly documented SUNAT formula)
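
For reference, the commonly documented conversion prefixes the 8-digit DNI with `10` (natural person) and appends a modulo-11 check digit. The sketch below follows that description; `rucCheckDigit` and `sketchDniToRuc` are hypothetical names, and the input validation and return type are assumptions that may not match the package's `dniToRuc`.

```typescript
// Weights applied to the first 10 digits of a RUC.
const RUC_WEIGHTS = [5, 4, 3, 2, 7, 6, 5, 4, 3, 2];

// Modulo-11 check digit for the first 10 digits of a RUC.
function rucCheckDigit(first10: string): number {
  if (!/^\d{10}$/.test(first10)) throw new Error('expected 10 digits');
  let sum = 0;
  for (let i = 0; i < 10; i++) sum += Number(first10[i]) * RUC_WEIGHTS[i];
  const check = 11 - (sum % 11);
  if (check === 10) return 0;
  if (check === 11) return 1;
  return check;
}

// Prefix "10" + DNI + check digit.
function sketchDniToRuc(dni: string): string {
  if (!/^\d{8}$/.test(dni)) throw new Error('expected an 8-digit DNI');
  const base = '10' + dni;
  return base + String(rucCheckDigit(base));
}
```

With this sketch, `sketchDniToRuc('12345678')` yields `'10123456781'`: the weighted sum is 143, which is divisible by 11, so the check digit maps from 11 to 1.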
Storage interface

    interface Storage {
      get(key: string, options?: {
        range?: { offset: number; length: number };
      }): Promise<StorageObject | null>;
    }

    interface StorageObject {
      arrayBuffer(): Promise<ArrayBuffer>;
    }

Implement this against R2, S3, the filesystem, an in-memory map, or anything else.
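
As a concrete starting point, here is a minimal in-memory backend. The interface is redeclared as `EngineStorage` so the snippet stands alone and does not clash with the DOM's global `Storage` type; `createMemoryStorage` is a hypothetical helper, and returning `null` for a missing key is an assumption based on the `| null` in the signature.

```typescript
// Mirrors the Storage / StorageObject interfaces above (renamed for a standalone script).
interface EngineStorageObject {
  arrayBuffer(): Promise<ArrayBuffer>;
}
interface EngineStorage {
  get(key: string, options?: {
    range?: { offset: number; length: number };
  }): Promise<EngineStorageObject | null>;
}

// Hypothetical helper: serves objects from a Map of key -> bytes.
function createMemoryStorage(files: Map<string, Uint8Array>): EngineStorage {
  return {
    async get(key, options) {
      const bytes = files.get(key);
      if (bytes === undefined) return null; // assumption: missing keys resolve to null
      const start = options?.range ? options.range.offset : 0;
      const end = options?.range ? start + options.range.length : bytes.length;
      // slice() copies the requested window into its own buffer.
      const window = bytes.slice(start, end);
      return {
        async arrayBuffer() {
          return window.buffer as ArrayBuffer;
        },
      };
    },
  };
}
```

The same shape applies to remote backends: translate `range.offset` and `range.length` into the backend's ranged-read option and wrap the response in an object that exposes `arrayBuffer()`.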
Build tools (Node.js only)
The build entry uses fs / readline / path and is not Cloudflare-Workers-safe. Import it from a separate path:

    import { buildBinaryFiles, buildSearchIndex } from '@latinfo/engine/build';

- buildBinaryFiles(tsvPath, outDir, baseName, opts): builds .idx + sharded .bin files
- buildSearchIndex(tsvPath, outDir, baseName, opts, totalRecords): builds V2/V3 search index + posting shards
Binary format
The engine reads its own binary format, designed for content-addressed object storage:
- .idx: prefix index (hundreds of KB, cached in memory)
- .bin: sorted records, sharded at 200 MB each
- -search.idx + -search-N.dat: V2/V3 inverted index with inline name + status
Shards are content-addressed by sha256 in your storage layer. A small manifest pointer (~1 KB) commits the version atomically — readers always see consistent state.
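
The pattern described above can be sketched as follows: shards are written under a key derived from their SHA-256 digest and never overwritten, and a single small manifest write switches readers to the new version. The key layout (`shards/<digest>.bin`), the JSON manifest shape, and the function names are assumptions for this example, not the engine's actual format; the in-memory `Map` stands in for an object store, and the snippet uses Node's `crypto` module.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical in-memory object store: key -> bytes.
const objectStore = new Map<string, Uint8Array>();

// Store a shard under a key derived from its content; returns the key.
function putShard(bytes: Uint8Array): string {
  const digest = createHash('sha256').update(bytes).digest('hex');
  const key = `shards/${digest}.bin`; // hypothetical key layout
  // Identical content maps to the same key, so existing shards are never rewritten.
  if (!objectStore.has(key)) objectStore.set(key, bytes);
  return key;
}

// Publish a version: one small write that lists the shard keys.
function publishManifest(manifestKey: string, shardKeys: string[]): void {
  const body = JSON.stringify({ shards: shardKeys });
  objectStore.set(manifestKey, new TextEncoder().encode(body));
}

// Readers resolve shard keys through the manifest only.
function readManifest(manifestKey: string): string[] {
  const raw = objectStore.get(manifestKey);
  if (raw === undefined) return [];
  return JSON.parse(new TextDecoder().decode(raw)).shards as string[];
}
```

Because old shards stay in place under their own digests, a reader that fetched the previous manifest can still resolve every key it lists while a new version is being published.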
Credits
The MPHF implementation is based on Limasset et al., "Fast and Scalable Minimal Perfect Hashing for Massive Key Sets" (2017). MurmurHash3 by Austin Appleby (public domain).
License
MIT
