@tinyweb_dev/doc-indexer-pdf
v0.0.2
PDF adapter for @tinyweb_dev/doc-indexer (text extraction via pdfjs-dist).
PDF source adapter for @tinyweb_dev/doc-indexer.
MVP:
- Extracts text page-by-page via `pdfjs-dist` (legacy build, no DOM required).
- Emits one `Chunk` per page + a 2-level `TreeNode` skeleton (document → pages[]).
- Detects `application/pdf` by extension or sniff.
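
The detection and output shape described in the MVP list can be sketched roughly as follows. This is an illustration only, not the adapter's actual code: the `Chunk` and `TreeNode` field names, `looksLikePdf`, and `buildSkeleton` are all hypothetical, and the sniff simply checks for the `%PDF-` header at byte offset 0. Text extraction itself (via `pdfjs-dist`) is left out; the sketch assumes the per-page text has already been extracted.

```typescript
// Hypothetical shapes: the real Chunk and TreeNode types in
// @tinyweb_dev/doc-indexer may look different.
interface Chunk {
  id: string;
  page: number; // 1-based page number
  text: string;
}

interface TreeNode {
  id: string;
  title: string;
  chunkId?: string; // set on page nodes, absent on the document root
  children: TreeNode[];
}

// ASCII bytes for "%PDF-", the header a PDF file starts with.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Detect a PDF by file extension first, then by sniffing the leading bytes.
// Assumption: the header sits at offset 0; files with leading junk are missed.
function looksLikePdf(path: string, head: Uint8Array): boolean {
  if (path.toLowerCase().endsWith(".pdf")) return true;
  return (
    head.length >= PDF_MAGIC.length &&
    PDF_MAGIC.every((byte, i) => head[i] === byte)
  );
}

// Build one chunk per page plus a two-level tree: document -> pages[].
function buildSkeleton(
  title: string,
  pageTexts: string[],
): { chunks: Chunk[]; root: TreeNode } {
  const chunks: Chunk[] = pageTexts.map((text, i) => ({
    id: `page-${i + 1}`,
    page: i + 1,
    text,
  }));
  const root: TreeNode = {
    id: "document",
    title,
    children: chunks.map((c) => ({
      id: `node-${c.id}`,
      title: `Page ${c.page}`,
      chunkId: c.id,
      children: [],
    })),
  };
  return { chunks, root };
}
```

Keeping the page nodes as leaves that point at a chunk by id means a later multi-level outline could regroup pages under headings without touching the chunks themselves.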
Roadmap (deferred):
- Heading detection from font-size statistics → multi-level outline tree.
- Page-image render to PNG via `pdfjs-dist` + `canvas` (asset emission).
- OCR fallback (`tesseract.js`) for scanned PDFs with no text layer.
- Vision-LLM triage for image-only PDFs.
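
As one possible reading of the first roadmap item, heading detection from font-size statistics could work along these lines. This is a sketch of a common heuristic, not the planned implementation: `TextRun`, `Heading`, and `detectHeadings` are hypothetical names, and the input is assumed to be a flat list of text runs whose font sizes have already been derived from the PDF. The font size covering the most characters is treated as body text, and each strictly larger size becomes a heading level.

```typescript
// Hypothetical input: one entry per text run, with its rendered font size.
interface TextRun {
  text: string;
  fontSize: number;
}

interface Heading {
  text: string;
  level: number; // 1 = largest font size in the document
}

// Treat the font size covering the most characters as body text, then rank
// every strictly larger size: largest -> level 1, next largest -> level 2, ...
function detectHeadings(runs: TextRun[]): Heading[] {
  // Count characters per font size.
  const charsBySize = new Map<number, number>();
  for (const run of runs) {
    const seen = charsBySize.get(run.fontSize) ?? 0;
    charsBySize.set(run.fontSize, seen + run.text.length);
  }

  // The size with the highest character count is assumed to be body text.
  let bodySize = 0;
  let bodyChars = -1;
  charsBySize.forEach((chars, size) => {
    if (chars > bodyChars) {
      bodySize = size;
      bodyChars = chars;
    }
  });

  // Sizes larger than body text, sorted descending, map to levels 1..n.
  const headingSizes = Array.from(charsBySize.keys())
    .filter((size) => size > bodySize)
    .sort((a, b) => b - a);
  const levelOf = new Map<number, number>();
  headingSizes.forEach((size, i) => levelOf.set(size, i + 1));

  // Keep document order; runs at or below body size are not headings.
  const headings: Heading[] = [];
  for (const run of runs) {
    const level = levelOf.get(run.fontSize);
    if (level !== undefined) headings.push({ text: run.text, level });
  }
  return headings;
}
```

A heuristic this simple misclassifies large pull quotes or title-page text as headings, so a real implementation would likely add further signals such as boldness, line length, or position on the page.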
Usage
import { Indexer, FsStorage } from '@tinyweb_dev/doc-indexer-core';
import { PdfAdapter } from '@tinyweb_dev/doc-indexer-pdf';

const indexer = new Indexer({
  adapters: [new PdfAdapter()],
  storage: new FsStorage('./out'),
});

const doc = await indexer.index({ path: './sample.pdf' });
console.log(doc.stats);