@productivities/document-sources
v0.1.0
URL → document-source router. Detects PDF, DOCX, EPUB, Google Docs, arXiv, YouTube, Markdown, LaTeX, reStructuredText, Jupyter, audio, and video from a URL. Pure functions, zero runtime dependencies, runs anywhere (browser, service worker, Node, edge).
@productivities/document-sources
URL → document-source router. Given an arbitrary URL, tells you what kind of document it points at (PDF, EPUB, DOCX, Google Doc, YouTube video, Markdown, LaTeX, reStructuredText, Jupyter notebook, audio, video) and, where useful, rewrites it to a fetchable canonical URL (e.g. arXiv abstract → ar5iv HTML, GitHub blob → raw, Google Doc → text export).
Pure functions, zero runtime dependencies, runs anywhere — browser, MV3 service worker, Node, edge runtimes.
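The canonical-URL rewrites mentioned above can be pictured with a small, self-contained sketch. The helper names (`githubBlobToRaw`, `arxivToAr5iv`, `googleDocToText`) and the regular expressions are hypothetical illustrations, not the package's implementation; the real matching rules are likely broader (old-style arXiv IDs, for example, are not handled here).

```typescript
// Hypothetical helpers for illustration only; not part of the package's API.

// GitHub "blob" page -> raw file on raw.githubusercontent.com.
// Expected shape: https://github.com/<owner>/<repo>/blob/<ref>/<path>
const GITHUB_BLOB = /^https?:\/\/github\.com\/([^\/]+)\/([^\/]+)\/blob\/([^?#]+)/;

function githubBlobToRaw(url: string): string | null {
  const m = GITHUB_BLOB.exec(url);
  if (m === null) return null;
  const [, owner, repo, refAndPath] = m;
  return `https://raw.githubusercontent.com/${owner}/${repo}/${refAndPath}`;
}

// arXiv abstract or PDF URL -> ar5iv HTML rendering.
// Assumption: only new-style IDs (YYMM.NNNNN) are matched, and any
// version suffix such as "v5" is dropped.
const ARXIV = /^https?:\/\/arxiv\.org\/(?:abs|pdf)\/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$/;

function arxivToAr5iv(url: string): string | null {
  const m = ARXIV.exec(url);
  return m === null ? null : `https://ar5iv.labs.arxiv.org/html/${m[1]}`;
}

// Google Doc edit/view URL -> plain-text export endpoint.
const GOOGLE_DOC = /^https?:\/\/docs\.google\.com\/document\/d\/([A-Za-z0-9_-]+)/;

function googleDocToText(url: string): string | null {
  const m = GOOGLE_DOC.exec(url);
  return m === null
    ? null
    : `https://docs.google.com/document/d/${m[1]}/export?format=txt`;
}

const rawUrl = githubBlobToRaw('https://github.com/owner/repo/blob/main/docs/README.md');
const ar5ivUrl = arxivToAr5iv('https://arxiv.org/abs/1706.03762v5');
const textUrl = googleDocToText('https://docs.google.com/document/d/abc123/edit');
```

Each helper returns `null` for URLs it does not recognise, so a caller can chain them and fall back to the original URL.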
Install
npm install @productivities/document-sources
Usage
import {
  documentSourceFromUrl,
  isPdfUrl,
  isYouTubeUrl,
  parseArxivId,
  arxivHtmlCandidates,
  googleDocTextCandidates,
} from '@productivities/document-sources';
documentSourceFromUrl('https://arxiv.org/pdf/1706.03762');
// → { kind: 'pdf', sourceType: 'pdf', label: 'PDF' }
documentSourceFromUrl('https://docs.google.com/document/d/abc123/edit');
// → { kind: 'document', sourceType: 'google-doc', label: 'Google Doc' }
documentSourceFromUrl('https://www.gutenberg.org/ebooks/1342.epub3.images');
// → { kind: 'document', sourceType: 'epub', label: 'EPUB' }
arxivHtmlCandidates('https://arxiv.org/pdf/1706.03762');
// → [{ url: 'https://ar5iv.labs.arxiv.org/html/1706.03762', label: 'ar5iv HTML' }]
API
- documentSourceFromUrl(url): primary entry point. Returns { kind, sourceType, label, url? } or null.
- isPdfUrl(url), isYouTubeUrl(url): predicates.
- parseArxivId(url), parseGoogleDoc(url), parseYouTubeUrl(url): site-specific extractors.
- Candidate generators (URL → fetchable canonical URLs): arxivHtmlCandidates, googleDocTextCandidates, wordDocxCandidates, epubCandidates, markdownCandidates, textCandidates, notebookCandidates, latexCandidates, rstCandidates, mediaCandidates.
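Because the primary entry point may return null and the `url` field is optional, callers need to handle both cases. The following is a minimal sketch of one way to do that. The `DocumentSourceLike` interface and the `planFetch` helper are hypothetical; the sketch also assumes that `url`, when present, is the rewritten fetchable URL, which the summary above does not state explicitly.

```typescript
// Assumed result shape, copied from the API summary; the package may
// export its own, narrower type.
interface DocumentSourceLike {
  kind: string;
  sourceType: string;
  label: string;
  url?: string;
}

// Hypothetical helper: decide which URL to fetch and build a display string.
// Assumption: when `url` is present it is the rewritten, fetchable URL.
function planFetch(
  originalUrl: string,
  source: DocumentSourceLike | null,
): { fetchUrl: string; description: string } | null {
  if (source === null) {
    return null; // not a recognised document source
  }
  return {
    fetchUrl: source.url ?? originalUrl,
    description: `${source.label} (${source.kind}/${source.sourceType})`,
  };
}

// Result object taken from the usage example for the arXiv PDF URL.
const plan = planFetch('https://arxiv.org/pdf/1706.03762', {
  kind: 'pdf',
  sourceType: 'pdf',
  label: 'PDF',
});

// An unrecognised URL is represented by a null source.
const noPlan = planFetch('https://example.com/', null);
```

Keeping the null check in one place means the rest of the calling code can work with a single, non-nullable plan object.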
License
MIT
