@productivities/document-extractor

v0.1.0

Published a month ago

Universal URL → text extractor. Routes via @productivities/document-sources and parses PDF, DOCX, EPUB, Markdown, LaTeX, reStructuredText, Jupyter, Google Docs, and plain text. Browser/service-worker friendly — no Node-only dependencies.

Vulnerabilities: 0 High, 0 Medium, 0 Low

Maintainer: zachzwy

Keywords: url, extract, text, pdf, docx, epub, readability, arxiv, google-docs, markdown, service-worker, browser

Universal URL → text extractor for LLM ingestion. Routes via @productivities/document-sources and parses PDFs (pdfjs-dist), DOCX, EPUB, Markdown, LaTeX, reStructuredText, Jupyter notebooks, Google Docs, and plain text.

Browser/service-worker friendly — no Node-only dependencies. Originally extracted from the Depth Chrome extension.

Install

npm install @productivities/document-extractor

Usage

import { extractFromUrl, DocumentExtractionError } from '@productivities/document-extractor';

try {
  const extracted = await extractFromUrl({
    url: 'https://arxiv.org/pdf/1706.03762',
    title: 'Attention Is All You Need', // optional fallback title
  });

  console.log(extracted.title);          // e.g. 'arXiv:1706.03762'
  console.log(extracted.wordCount);      // e.g. 4521
  console.log(extracted.text);           // full plain-text body
  console.log(extracted.classification); // { kind: 'article', sourceType: 'pdf' }
} catch (err) {
  if (err instanceof DocumentExtractionError && err.code === 'SCANNED_PDF_UNSUPPORTED') {
    // …handle scanned-PDF case
  } else {
    // Rethrow anything this block does not handle.
    throw err;
  }
}
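
The `signal` option listed under API can be used to bound how long an extraction may run. The following is a minimal sketch, not part of the package: `extractWithTimeout` and the local `ExtractOptions` type are hypothetical names, and the extractor is passed in as a parameter so the snippet stays self-contained. How the library surfaces an abort (which error it throws) is not documented here, so callers should treat that as an open question.

```typescript
// Hypothetical local type mirroring the documented options { url, title?, signal? }.
type ExtractOptions = { url: string; title?: string; signal?: AbortSignal };

// Calls `extract` with an AbortSignal that fires after `timeoutMs` milliseconds.
// `extract` is expected to be extractFromUrl or any function with the same options shape.
async function extractWithTimeout<T>(
  extract: (options: ExtractOptions) => Promise<T>,
  options: Omit<ExtractOptions, 'signal'>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await extract({ ...options, signal: controller.signal });
  } finally {
    // Always clear the timer so it cannot fire after the call has settled.
    clearTimeout(timer);
  }
}
```

With the real package this would be called as `await extractWithTimeout(extractFromUrl, { url }, 15000)`. `AbortController` and `setTimeout` are available in browsers and service workers, so the wrapper adds no Node-only dependency.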

For known formats with bytes in hand (e.g. file upload, custom fetch), use the per-format helpers:

import {
  extractDocxTextFromBytes,
  extractEpubTextFromBytes,
} from '@productivities/document-extractor';

const text = await extractDocxTextFromBytes(arrayBuffer);
const epub = await extractEpubTextFromBytes(arrayBuffer); // { title, text, sectionCount }
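
When the format of uploaded bytes is not known in advance, a small magic-byte check can decide which helper to call. This is a sketch under stated assumptions and not part of the package: `sniffFormat` and `asciiAt` are hypothetical names. It relies on two well-known conventions: PDF files start with `%PDF-`, and a conforming EPUB is a ZIP archive whose first entry is an uncompressed `mimetype` file containing `application/epub+zip`. A DOCX file is also a ZIP archive, so a result of `'zip'` only means "possibly DOCX"; non-conforming EPUBs would land there too.

```typescript
type SniffedFormat = 'pdf' | 'epub' | 'zip' | 'unknown';

// True if `bytes` contains the ASCII string `expected` starting at `offset`.
function asciiAt(bytes: Uint8Array, offset: number, expected: string): boolean {
  if (offset + expected.length > bytes.length) return false;
  for (let i = 0; i < expected.length; i++) {
    if (bytes[offset + i] !== expected.charCodeAt(i)) return false;
  }
  return true;
}

// Hypothetical helper: guesses the container format from leading bytes only.
function sniffFormat(input: ArrayBufferLike | Uint8Array): SniffedFormat {
  const bytes = input instanceof Uint8Array ? input : new Uint8Array(input);
  if (asciiAt(bytes, 0, '%PDF-')) return 'pdf';
  if (asciiAt(bytes, 0, 'PK\x03\x04')) {
    // A ZIP local file header is 30 bytes long, followed by the entry name.
    // For a conforming EPUB the name is "mimetype" and its content follows at offset 38.
    const isEpub =
      asciiAt(bytes, 30, 'mimetype') && asciiAt(bytes, 38, 'application/epub+zip');
    return isEpub ? 'epub' : 'zip';
  }
  return 'unknown';
}
```

A caller might route `'epub'` to `extractEpubTextFromBytes` and try `extractDocxTextFromBytes` for `'zip'`, handling the error if the archive turns out not to be a DOCX file.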

API

extractFromUrl({ url, title?, signal? }) — orchestrator. Routes by URL, fetches, parses. Returns { title, byline, siteName, text, wordCount, truncated, sourceUrl, sourceLabel, classification }.
DocumentExtractionError — typed error with .code (e.g. SCANNED_PDF_UNSUPPORTED, PDF_TEXT_TOO_SHORT).
extractDocxTextFromBytes(bytes), docxXmlToText(xml) — low-level DOCX.
extractEpubTextFromBytes(bytes), xhtmlToText(html) — low-level EPUB.
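
Because `DocumentExtractionError` carries a `.code`, callers can translate failures into user-facing messages in one place. The sketch below is illustrative only: `describeFailure` and `hasErrorCode` are hypothetical names, the message wording is an assumption about what each code means, and only the two codes named above are handled explicitly. A structural check is used so the snippet runs without the package; in real code, `err instanceof DocumentExtractionError` as shown under Usage is the more precise test.

```typescript
// Structural stand-in for `err instanceof DocumentExtractionError`.
function hasErrorCode(err: unknown): err is { code: string } {
  return (
    typeof err === 'object' &&
    err !== null &&
    typeof (err as { code?: unknown }).code === 'string'
  );
}

// Hypothetical helper: maps an extraction failure to a short message.
function describeFailure(err: unknown): string {
  if (!hasErrorCode(err)) {
    return 'Unexpected error while extracting the document.';
  }
  switch (err.code) {
    case 'SCANNED_PDF_UNSUPPORTED':
      // Assumed meaning: the PDF is scanned and cannot be parsed as text.
      return 'This PDF appears to be scanned and cannot be read as text.';
    case 'PDF_TEXT_TOO_SHORT':
      // Assumed meaning: too little text came out of the PDF to be useful.
      return 'Too little text could be extracted from this PDF.';
    default:
      // Codes not listed in the README fall through to a generic message.
      return `Extraction failed (${err.code}).`;
  }
}
```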

Limits (defaults)

| Format | Max bytes | Other |
|---|---|---|
| PDF | 20 MB | 50 pages |
| DOCX | 20 MB | — |
| EPUB | 50 MB | — |
| Plain text / Markdown / LaTeX / RST | 5 MB | — |
| Output text | — | 60,000 chars (truncated beyond) |

Configurable limits are not yet exposed via the API. File an issue if you need them.
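
Since the limits are fixed, an application may want to check sizes itself before handing bytes to the library, for example to reject an oversized upload early. The following sketch mirrors the defaults in the table; it is not the package's own code. `MAX_BYTES`, `exceedsByteLimit`, and `truncateOutput` are hypothetical names, 1 MB is assumed to mean 2^20 bytes (the README does not say), and the character limit is applied to JavaScript string length, which may differ from how the library counts.

```typescript
const MB = 1024 * 1024; // assumption: binary megabytes

// Default byte limits copied from the table above.
const MAX_BYTES = {
  pdf: 20 * MB,
  docx: 20 * MB,
  epub: 50 * MB,
  text: 5 * MB, // plain text, Markdown, LaTeX, RST
} as const;

type LimitedFormat = keyof typeof MAX_BYTES;

const MAX_OUTPUT_CHARS = 60000;

// True if a payload of `byteLength` bytes is over the default limit for `format`.
function exceedsByteLimit(format: LimitedFormat, byteLength: number): boolean {
  return byteLength > MAX_BYTES[format];
}

// Applies the same output cap to text produced outside extractFromUrl.
function truncateOutput(text: string): { text: string; truncated: boolean } {
  if (text.length <= MAX_OUTPUT_CHARS) {
    return { text, truncated: false };
  }
  return { text: text.slice(0, MAX_OUTPUT_CHARS), truncated: true };
}
```

For a file upload, `exceedsByteLimit('docx', arrayBuffer.byteLength)` can be checked before calling `extractDocxTextFromBytes`. The 50-page PDF limit is not covered here because it cannot be known from the byte count alone.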

License

MIT
