@productivities/document-extractor
v0.1.0
Universal URL → text extractor for LLM ingestion. Routes via @productivities/document-sources and parses PDFs (pdfjs-dist), DOCX, EPUB, Markdown, LaTeX, reStructuredText, Jupyter notebooks, Google Docs, and plain text.

Browser/service-worker friendly, with no Node-only dependencies. Originally extracted from the Depth Chrome extension.
Install
npm install @productivities/document-extractor

Usage
import { extractFromUrl, DocumentExtractionError } from '@productivities/document-extractor';

try {
  const extracted = await extractFromUrl({
    url: 'https://arxiv.org/pdf/1706.03762',
    title: 'Attention Is All You Need', // optional fallback title
  });
  console.log(extracted.title); // e.g. 'arXiv:1706.03762'
  console.log(extracted.wordCount); // e.g. 4521
  console.log(extracted.text); // full plain-text body
  console.log(extracted.classification); // e.g. { kind: 'article', sourceType: 'pdf' }
} catch (err) {
  if (err instanceof DocumentExtractionError && err.code === 'SCANNED_PDF_UNSUPPORTED') {
    // …handle scanned-PDF case
  } else {
    throw err; // rethrow anything not handled above
  }
}

For known formats with bytes in hand (e.g. file upload, custom fetch), use the per-format helpers:
import {
  extractDocxTextFromBytes,
  extractEpubTextFromBytes,
} from '@productivities/document-extractor';

const text = await extractDocxTextFromBytes(arrayBuffer);
const epub = await extractEpubTextFromBytes(arrayBuffer); // { title, text, sectionCount }

API
- extractFromUrl({ url, title?, signal? }): orchestrator. Routes by URL, fetches, parses. Returns { title, byline, siteName, text, wordCount, truncated, sourceUrl, sourceLabel, classification }.
- DocumentExtractionError: typed error with .code (e.g. SCANNED_PDF_UNSUPPORTED, PDF_TEXT_TOO_SHORT).
- extractDocxTextFromBytes(bytes), docxXmlToText(xml): low-level DOCX.
- extractEpubTextFromBytes(bytes), xhtmlToText(html): low-level EPUB.
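
When the bytes come from a file upload, the caller has to decide which low-level helper to use. DOCX and EPUB are both ZIP containers, so the leading bytes alone do not tell them apart. The sketch below is one possible approach, assuming the file name carries a trustworthy extension; formatFromFilename and ByteFormat are hypothetical names for illustration and are not part of the package.

```typescript
// Formats that have a bytes-based helper in the API list above.
type ByteFormat = 'docx' | 'epub';

// Hypothetical helper (not exported by the package): infer the format
// from the file name's extension, case-insensitively.
function formatFromFilename(filename: string): ByteFormat | undefined {
  const dot = filename.lastIndexOf('.');
  if (dot < 0) return undefined; // no extension at all
  const ext = filename.slice(dot + 1).toLowerCase();
  if (ext === 'docx') return 'docx';
  if (ext === 'epub') return 'epub';
  return undefined; // no bytes-based helper documented for this format
}
```

A caller could then pass the bytes to extractDocxTextFromBytes for 'docx' or extractEpubTextFromBytes for 'epub', and reject the upload when the result is undefined.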
Limits (defaults)
| Format | Max size | Other limits |
|---|---|---|
| PDF | 20 MB | 50 pages |
| DOCX | 20 MB | none |
| EPUB | 50 MB | none |
| Plain text / Markdown / LaTeX / RST | 5 MB | none |
| Output text | n/a | 60,000 chars (truncated beyond) |
Configurable limits are not yet exposed via the API. File an issue if you need them.
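
Because the limits are fixed, it can help to check a file's size before handing it to the extractor. The following is a rough sketch under stated assumptions: the byte values are copied from the table above, 1 MB is assumed to mean 1024 * 1024 bytes, the limit is assumed to be inclusive, and exceedsByteLimit, LimitedFormat, and MAX_BYTES are hypothetical names that the package does not export.

```typescript
// Assumption: the README's "MB" means 1024 * 1024 bytes.
const MB = 1024 * 1024;

type LimitedFormat = 'pdf' | 'docx' | 'epub' | 'text';

// Default size limits as documented in the table above.
const MAX_BYTES: Record<LimitedFormat, number> = {
  pdf: 20 * MB,
  docx: 20 * MB,
  epub: 50 * MB,
  text: 5 * MB, // plain text, Markdown, LaTeX, RST
};

// Hypothetical pre-flight check: true when the input is larger than the
// documented default, so the caller can reject it before parsing.
function exceedsByteLimit(format: LimitedFormat, byteLength: number): boolean {
  return byteLength > MAX_BYTES[format];
}
```

For example, exceedsByteLimit('pdf', file.size) could gate an upload form. Output-side truncation is reported separately through the truncated field on the result of extractFromUrl.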
License
MIT
