parseflow-core
v1.7.0
Document parsing library for ParseFlow - Extract text and data from PDF, Word (docx), and Excel (xlsx) files
parseflow-core
Core PDF parsing library for ParseFlow - Extract text, metadata, images, and TOC from PDF files.
✨ Features
- 📄 Text Extraction - Extract text from PDF with multiple strategies (raw, formatted, clean)
- 📊 Metadata Extraction - Get title, author, page count, creation date, etc.
- 🔍 Keyword Search - Search for specific content in PDFs with context
- 🖼️ Image Extraction - Extract images from PDFs (requires poppler-utils)
- 📑 Table of Contents - Extract PDF bookmarks and outline structure (requires pdftk/pdfinfo)
📦 Installation
npm install parseflow-core
Or using pnpm:
pnpm add parseflow-core
Or using yarn:
yarn add parseflow-core
🚀 Quick Start
Text Extraction
import { PDFParser } from 'parseflow-core';
const parser = new PDFParser();
// Extract all text
const result = await parser.extractText('path/to/document.pdf');
console.log(result.text);
// Extract specific page
const page2 = await parser.extractText('path/to/document.pdf', { page: 2 });
// Extract page range
const pages = await parser.extractText('path/to/document.pdf', { range: '1-5' });
Metadata Extraction
const metadata = await parser.getMetadata('path/to/document.pdf');
console.log(metadata);
// {
//   title: 'Document Title',
//   author: 'Author Name',
//   pageCount: 10,
//   creationDate: '2025-01-01',
//   ...
// }
Keyword Search
const results = await parser.searchPDF('path/to/document.pdf', 'keyword', {
  caseSensitive: false,
  maxResults: 10
});
results.forEach(result => {
  console.log(`Found on page ${result.page}: ${result.context}`);
});
Image Extraction (requires poppler-utils)
import { ImageExtractorExternal } from 'parseflow-core';
const extractor = new ImageExtractorExternal();
const images = await extractor.extract('path/to/document.pdf', './output', {
  format: 'png'
});
Table of Contents (requires pdftk or pdfinfo)
import { TOCExtractorExternal } from 'parseflow-core';
const tocExtractor = new TOCExtractorExternal();
const toc = await tocExtractor.extract('path/to/document.pdf');
console.log(toc);
📚 API Reference
PDFParser
Main parser class for PDF operations.
Methods
- extractText(path, options?) - Extract text from PDF
- getMetadata(path) - Get PDF metadata
- searchPDF(path, query, options?) - Search for keywords
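As a sketch of how searchPDF results might be post-processed: the Quick Start example reads `page` and `context` from each result, so the snippet below groups matches by page. The `SearchHit` type and the `groupHitsByPage` helper are hypothetical names introduced for illustration, and the library's actual result type may carry more fields than assumed here.

```typescript
// Assumed shape of one search result, based only on the two fields the
// Quick Start example reads. `SearchHit` is a hypothetical type name.
interface SearchHit {
  page: number;
  context: string;
}

// Hypothetical helper: collect the context snippets found on each page.
function groupHitsByPage(hits: SearchHit[]): Record<number, string[]> {
  const byPage: Record<number, string[]> = {};
  for (const hit of hits) {
    if (byPage[hit.page] === undefined) {
      byPage[hit.page] = [];
    }
    byPage[hit.page].push(hit.context);
  }
  return byPage;
}

// Sample data standing in for the array returned by parser.searchPDF(...).
const sampleHits: SearchHit[] = [
  { page: 2, context: "first keyword match" },
  { page: 5, context: "another keyword match" },
  { page: 2, context: "second match on page 2" },
];

const grouped = groupHitsByPage(sampleHits);
for (const page of Object.keys(grouped)) {
  console.log(`Page ${page}: ${grouped[Number(page)].length} match(es)`);
}
```

Grouping on the caller's side keeps the example independent of any options searchPDF may or may not support beyond `caseSensitive` and `maxResults`.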
ImageExtractorExternal
Extract images from PDF using external tools.
Methods
- isAvailable() - Check if pdfimages is available
- extract(pdfPath, outputDir, options?) - Extract images
TOCExtractorExternal
Extract table of contents from PDF.
Methods
- isAvailable() - Check if pdftk/pdfinfo is available
- extract(pdfPath, options?) - Extract TOC
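Both external extractors expose isAvailable(), so one plausible pattern is to check it before calling extract and skip the step when the tool is missing. The sketch below is not part of the library: `ToolBacked` and `runIfAvailable` are hypothetical names, and since the README does not say whether isAvailable() returns a boolean or a promise, the result is awaited to cover both cases.

```typescript
// Minimal structural view of an extractor. The return type of isAvailable()
// is an assumption, so both sync and async variants are accepted.
interface ToolBacked {
  isAvailable(): boolean | Promise<boolean>;
}

// Hypothetical helper: run the callback only when the external tool is
// present, and resolve to null instead of failing when it is missing.
async function runIfAvailable<E extends ToolBacked, T>(
  extractor: E,
  run: (extractor: E) => Promise<T>
): Promise<T | null> {
  const available = await extractor.isAvailable();
  if (!available) {
    console.warn("External tool not found; skipping extraction");
    return null;
  }
  return run(extractor);
}
```

With the classes from this README, usage might look like `runIfAvailable(new TOCExtractorExternal(), (e) => e.extract('path/to/document.pdf'))`. The callback form lets the same helper wrap ImageExtractorExternal, whose extract method takes an output directory as well.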
🔧 External Tools
Some features require external tools:
Image Extraction
Windows:
- Download Poppler
- Add to system PATH
Linux:
sudo apt-get install poppler-utils
macOS:
brew install poppler
TOC Extraction
Windows:
- Download Poppler (includes pdfinfo)
Linux:
sudo apt-get install poppler-utils pdftk
macOS:
brew install poppler pdftk-java
💻 Requirements
- Node.js: >= 18.0.0
- TypeScript: >= 5.0.0 (for development)
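Beyond Node.js itself, the image and TOC features depend on the external tools listed above being on the system PATH. A minimal sketch of a startup check follows, using Node's `child_process.spawnSync`. The `isOnPath` helper is hypothetical and separate from the library's own isAvailable() methods, whose implementation this README does not describe. It only tests whether the binary can be launched, not whether its version is suitable.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: returns true when the command could be launched.
// A missing binary makes spawnSync report an error (typically ENOENT)
// on the result object instead of throwing. A timeout also counts as
// "not found" here, which is a simplification.
function isOnPath(command: string, args: string[] = ["-v"]): boolean {
  const result = spawnSync(command, args, { stdio: "ignore", timeout: 5000 });
  return result.error === undefined;
}

// Tools named in this README. The Poppler utilities accept -v and
// pdftk accepts --version to print version information.
const tools = ["pdfimages", "pdfinfo", "pdftk"];
for (const tool of tools) {
  const args = tool === "pdftk" ? ["--version"] : ["-v"];
  console.log(`${tool}: ${isOnPath(tool, args) ? "found" : "not found"}`);
}
```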
📖 Documentation
For complete documentation, visit:
🤝 Contributing
Contributions are welcome! Please see CONTRIBUTING.md for details.
📄 License
MIT © Libres-coder
Made with ❤️ by ParseFlow Team
