parseflow-core

v1.7.0, published 7 months ago

Document parsing library for ParseFlow - Extract text and data from PDF, Word (docx), and Excel (xlsx) files

pdf docx xlsx word excel parser document-parsing text-extraction metadata mcp model-context-protocol

parseflow-core

Core PDF parsing library for ParseFlow - Extract text, metadata, images, and TOC from PDF files.

✨ Features

📄 Text Extraction - Extract text from PDFs using one of several strategies (raw, formatted, clean)
📊 Metadata Extraction - Get title, author, page count, creation date, etc.
🔍 Keyword Search - Search PDFs for specific content and return each match with its surrounding context
🖼️ Image Extraction - Extract images from PDFs (requires poppler-utils)
📑 Table of Contents - Extract PDF bookmarks and outline structure (requires pdftk or pdfinfo)

📦 Installation

npm install parseflow-core

Or using pnpm:

pnpm add parseflow-core

Or using yarn:

yarn add parseflow-core

🚀 Quick Start

Text Extraction

import { PDFParser } from 'parseflow-core';

const parser = new PDFParser();

// Extract all text
const result = await parser.extractText('path/to/document.pdf');
console.log(result.text);

// Extract specific page
const page2 = await parser.extractText('path/to/document.pdf', { page: 2 });

// Extract page range
const pages = await parser.extractText('path/to/document.pdf', { range: '1-5' });
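
For long documents it can help to extract text in batches rather than in one call. The sketch below builds range strings in the same 'start-end' form as the '1-5' example above. `pageRanges` is a hypothetical helper, not part of parseflow-core, and it assumes that ranges are 1-based and inclusive, which this README does not state explicitly.

```typescript
// Hypothetical helper (not part of parseflow-core).
// Splits pages 1..pageCount into 'start-end' strings covering at most
// chunkSize pages each. Assumes 1-based, inclusive page ranges.
function pageRanges(pageCount: number, chunkSize: number): string[] {
  if (!Number.isInteger(pageCount) || pageCount < 1) {
    throw new RangeError('pageCount must be a positive integer');
  }
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const ranges: string[] = [];
  for (let start = 1; start <= pageCount; start += chunkSize) {
    // The last chunk may be shorter than chunkSize.
    const end = Math.min(start + chunkSize - 1, pageCount);
    ranges.push(`${start}-${end}`);
  }
  return ranges;
}

console.log(pageRanges(12, 5)); // [ '1-5', '6-10', '11-12' ]
```

Each string could then be passed as `{ range }` to `extractText`, with `pageCount` taken from `getMetadata`. Whether a single-page range such as '11-11' is accepted is not documented here, so treat that case as untested.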

Metadata Extraction

const metadata = await parser.getMetadata('path/to/document.pdf');
console.log(metadata);
// {
//   title: 'Document Title',
//   author: 'Author Name',
//   pageCount: 10,
//   creationDate: '2025-01-01',
//   ...
// }
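
Many PDFs carry no title or author, so code that consumes the metadata object should probably tolerate missing fields. The following is a minimal sketch of that idea. The `PdfMetadataLike` interface is an assumption based only on the fields shown in the example above, and `describeMetadata` is a hypothetical helper, not a library function.

```typescript
// Assumed shape: only the fields visible in the README example, all optional.
interface PdfMetadataLike {
  title?: string;
  author?: string;
  pageCount?: number;
  creationDate?: string;
}

// Hypothetical helper: builds a one-line summary, falling back to
// placeholders when a field is missing or blank.
function describeMetadata(meta: PdfMetadataLike): string {
  const title = meta.title?.trim() || '(untitled)';
  const author = meta.author?.trim() || 'unknown author';
  const pages =
    meta.pageCount === undefined
      ? 'unknown length'
      : `${meta.pageCount} page${meta.pageCount === 1 ? '' : 's'}`;
  return `${title} by ${author}, ${pages}`;
}

console.log(describeMetadata({ title: 'Document Title', author: 'Author Name', pageCount: 10 }));
// Document Title by Author Name, 10 pages
```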

Keyword Search

const results = await parser.searchPDF('path/to/document.pdf', 'keyword', {
  caseSensitive: false,
  maxResults: 10
});

results.forEach(result => {
  console.log(`Found on page ${result.page}: ${result.context}`);
});
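
When a keyword appears many times, grouping matches by page can make the output easier to scan. The sketch below relies only on the `page` and `context` fields used in the example above. The `SearchHit` interface and `groupByPage` are illustrative names, and the real result objects may contain more fields.

```typescript
// Assumed shape: the two fields the README example reads from each result.
interface SearchHit {
  page: number;
  context: string;
}

// Hypothetical helper: collects the context strings found on each page,
// preserving the order in which hits were returned.
function groupByPage(hits: SearchHit[]): Map<number, string[]> {
  const byPage = new Map<number, string[]>();
  for (const hit of hits) {
    const contexts = byPage.get(hit.page) ?? [];
    contexts.push(hit.context);
    byPage.set(hit.page, contexts);
  }
  return byPage;
}

// Example with made-up hits:
const grouped = groupByPage([
  { page: 2, context: 'first match' },
  { page: 5, context: 'second match' },
  { page: 2, context: 'third match' },
]);
for (const [page, contexts] of grouped) {
  console.log(`Page ${page}: ${contexts.length} match(es)`);
}
```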

Image Extraction (requires poppler-utils)

import { ImageExtractorExternal } from 'parseflow-core';

const extractor = new ImageExtractorExternal();
const images = await extractor.extract('path/to/document.pdf', './output', {
  format: 'png'
});

Table of Contents (requires pdftk or pdfinfo)

import { TOCExtractorExternal } from 'parseflow-core';

const tocExtractor = new TOCExtractorExternal();
const toc = await tocExtractor.extract('path/to/document.pdf');
console.log(toc);

📚 API Reference

PDFParser

Main parser class for PDF operations.

Methods

extractText(path, options?) - Extract text from PDF
getMetadata(path) - Get PDF metadata
searchPDF(path, query, options?) - Search for keywords

ImageExtractorExternal

Extract images from PDF using external tools.

Methods

isAvailable() - Check if pdfimages is available
extract(pdfPath, outputDir, options?) - Extract images

TOCExtractorExternal

Extract table of contents from PDF.

Methods

isAvailable() - Check whether pdftk or pdfinfo is available
extract(pdfPath, options?) - Extract TOC
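
Both external extractors list an `isAvailable()` method, which suggests checking for the tool before calling `extract`. The sketch below shows one way to wrap that check. `runIfAvailable` is a hypothetical helper, and because the return type of `isAvailable()` is not documented here, the code accepts either a boolean or a promise of one.

```typescript
// Assumed minimal shape shared by ImageExtractorExternal and TOCExtractorExternal.
interface AvailabilityCheck {
  isAvailable(): boolean | Promise<boolean>;
}

// Hypothetical helper: runs the callback only when the extractor reports
// that its external tool is available; otherwise resolves to null.
async function runIfAvailable<E extends AvailabilityCheck, T>(
  extractor: E,
  run: (extractor: E) => Promise<T>,
): Promise<T | null> {
  // `await` handles both a plain boolean and a Promise<boolean>.
  if (!(await extractor.isAvailable())) {
    return null;
  }
  return run(extractor);
}

// Intended usage with the library (not executed in this sketch):
//   const toc = await runIfAvailable(new TOCExtractorExternal(), (e) =>
//     e.extract('path/to/document.pdf'));
//   if (toc === null) console.warn('pdftk/pdfinfo not found, skipping TOC');
```

Returning `null` rather than throwing keeps optional features optional: text and metadata extraction can proceed even when the external tools are not installed.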

🔧 External Tools

Some features require external tools:

Image Extraction

Windows:

Download Poppler
Add to system PATH

Linux:

sudo apt-get install poppler-utils

macOS:

brew install poppler

TOC Extraction

Windows:

Download Poppler (includes pdfinfo)
Add to system PATH

Linux:

sudo apt-get install poppler-utils pdftk

macOS:

brew install poppler pdftk-java
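
After installing, it can be useful to confirm that the tools are actually reachable from Node.js. The sketch below looks up the binaries on `PATH` using only Node's standard library. `findOnPath` is a hypothetical helper that is independent of parseflow-core, and it checks only that a file with the expected name exists, not that it is executable or the right version.

```typescript
import { existsSync } from 'node:fs';
import { delimiter, join } from 'node:path';

// Hypothetical helper: returns the first matching path for `tool` on PATH,
// or null if none is found. On Windows it also tries the '.exe' suffix.
function findOnPath(tool: string): string | null {
  const dirs = (process.env.PATH ?? '').split(delimiter).filter(Boolean);
  const names = process.platform === 'win32' ? [`${tool}.exe`, tool] : [tool];
  for (const dir of dirs) {
    for (const name of names) {
      const candidate = join(dir, name);
      if (existsSync(candidate)) {
        return candidate;
      }
    }
  }
  return null;
}

// pdfimages and pdfinfo ship with Poppler; pdftk is installed separately.
for (const tool of ['pdfimages', 'pdfinfo', 'pdftk']) {
  console.log(`${tool}: ${findOnPath(tool) ?? 'not found on PATH'}`);
}
```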

💻 Requirements

Node.js: >= 18.0.0
TypeScript: >= 5.0.0 (for development)

📖 Documentation

For complete documentation, visit:

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details.

📄 License

🙏 Acknowledgments

pdf-parse - PDF text extraction
pdf-lib - PDF manipulation
Poppler - PDF rendering library

🔗 Links

Made with ❤️ by ParseFlow Team
