@llamaindex/liteparse-wasm
v2.0.4
Fast, lightweight PDF parsing with spatial text extraction — WebAssembly build for browsers
Browser/WebAssembly build of LiteParse — a fast, lightweight PDF parser with spatial text extraction.
This package runs entirely in the browser. No server, no cloud calls.
Install
npm install @llamaindex/liteparse-wasm
Quick start
import init, { LiteParse } from "@llamaindex/liteparse-wasm";
// Load the wasm module. With no argument, wasm-pack's web target resolves the .wasm
// file shipped next to the JS glue; pass a URL to init() if you host it elsewhere.
await init();
const parser = new LiteParse({
  ocrEnabled: false, // OCR requires a JS-side engine (see below)
  outputFormat: "json",
});
// `bytes` must be a Uint8Array (e.g. from fetch / File / drag-drop); `file` here is a File or Blob.
const bytes = new Uint8Array(await file.arrayBuffer());
const result = await parser.parse(bytes);
console.log(result.text); // full document text
console.log(result.pages[0]); // per-page items with bboxes
Config options
All options are optional and use camelCase names:
| Option | Type | Default | Description |
|---|---|---|---|
| ocrLanguage | string | "eng" | Language code passed to the OCR engine |
| ocrEnabled | boolean | true | Run OCR on text-sparse pages |
| maxPages | number | 1000 | Stop after this many pages |
| targetPages | string | — | e.g. "1-5,10,15-20" |
| dpi | number | 150 | Render DPI for OCR / screenshots |
| outputFormat | "json" \| "text" | "json" | Format used by parser.format(...) |
| preserveVerySmallText | boolean | false | Keep tiny text that's normally filtered |
| password | string | — | Password for protected PDFs |
| quiet | boolean | false | Suppress progress logging |
| ocrEngine | object | — | JS-side OCR engine (see below) |
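The targetPages string mixes single pages and ranges, so it can help to validate user input before handing it to the parser. The following is a minimal sketch, not the library's own parsing logic: expandTargetPages is a hypothetical helper, and it assumes page numbers are 1-based and ranges are inclusive, which the table above does not state.

```javascript
// Hypothetical helper, not part of @llamaindex/liteparse-wasm.
// Assumption: pages are 1-based and ranges such as "15-20" are inclusive.
function expandTargetPages(spec) {
  const pages = new Set();
  for (const part of spec.split(",")) {
    const token = part.trim();
    // Accept either a single page ("10") or a range ("1-5").
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) {
      throw new Error(`Invalid targetPages segment: "${token}"`);
    }
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  // Sorted, de-duplicated list of page numbers.
  return [...pages].sort((a, b) => a - b);
}

// Validate the string first, then pass it through unchanged in the config.
const targetPages = "1-3,10,15-16";
const selected = expandTargetPages(targetPages);
console.log(selected); // [1, 2, 3, 10, 15, 16]
const config = { targetPages, maxPages: 1000, ocrEnabled: false };
```

The config object still carries the original string; the expanded list is only used here to reject malformed input early and to show the user which pages will be read.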
OCR in the browser
The native HTTP-OCR and Tesseract backends are not available in the browser. To use OCR, pass an object with a recognize method as the ocrEngine option:
const parser = new LiteParse({
  ocrEnabled: true,
  ocrLanguage: "eng",
  ocrEngine: {
    /**
     * @param imageData PNG-encoded image bytes
     * @param width rendered page width in pixels
     * @param height rendered page height in pixels
     * @param language e.g. "eng"
     * @returns array of { text, bbox: [x1,y1,x2,y2], confidence }
     */
    async recognize(imageData, width, height, language) {
      // e.g. call a worker that wraps tesseract.js, or a remote OCR service
      return [
        { text: "Hello", bbox: [10, 20, 80, 40], confidence: 0.98 },
      ];
    },
  },
});
Building from source
Requires Rust + wasm-pack:
# from packages/wasm
npm run build # web target (default)
npm run build:bundler # for webpack/rollup/vite
npm run build:nodejs # for node.js
Output goes to pkg/.
Note: A real build also needs a static libpdfium.a compiled for wasm32-unknown-emscripten / wasm32-unknown-unknown, exposed via PDFIUM_LIB_PATH. See the project root crates/WASM_PLAN.md for details.
License
Apache-2.0
