pdf-decomposer
v1.2.0
Extract text and images (as buffers) from PDF files using native parsing and OCR with Tesseract.
Features
- Extract raw text from PDFs (including OCR for scanned documents)
- Convert PDF pages to image buffers
- Use Tesseract OCR for image-to-text extraction
- Buffer-based API: easy to integrate in Node.js pipelines
- Built with TypeScript for better type safety
Prerequisites
Installing Poppler
This package (pdf-decomposer) relies on Poppler command-line tools (pdftotext, pdfimages, pdftoppm) to extract content from PDF files. These tools must be installed on your system and accessible from the terminal.
Ubuntu / Debian
sudo apt update
sudo apt install poppler-utils
macOS
brew install poppler
Installing Tesseract (only if using OCR)
For OCR, this package (pdf-decomposer) relies on Tesseract to extract text from PNG page images. Tesseract must be installed on your system and accessible from the terminal if you want to run OCR on PDFs.
Ubuntu / Debian
sudo apt update
sudo apt install tesseract-ocr
macOS
brew install tesseract
Installation
npm install pdf-decomposer
Usage
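Because the package shells out to external command-line tools, an optional pre-flight check can report a missing prerequisite more clearly than a failed extraction. The following is a minimal sketch, not part of pdf-decomposer: `findMissingTools` is a hypothetical helper, and it assumes the tools are invoked by name from your PATH. The tool names come from the Prerequisites section above.

```typescript
import { spawnSync } from 'node:child_process';

// Tool names listed in the Prerequisites section.
const POPPLER_TOOLS = ['pdftotext', 'pdfimages', 'pdftoppm'];
const OCR_TOOLS = ['tesseract'];

// Hypothetical helper (not part of pdf-decomposer): returns the tools that
// could not be found when trying to start them.
function findMissingTools(tools: string[]): string[] {
  return tools.filter((tool) => {
    // The exit status is ignored on purpose: a tool counts as missing only
    // when spawning it fails with ENOENT (executable not found).
    const result = spawnSync(tool, ['-v'], { stdio: 'ignore', timeout: 5000 });
    const error = result.error as NodeJS.ErrnoException | undefined;
    return error?.code === 'ENOENT';
  });
}

// Only require Tesseract when OCR will actually be used.
const useOcr = true;
const missing = findMissingTools(useOcr ? [...POPPLER_TOOLS, ...OCR_TOOLS] : POPPLER_TOOLS);
if (missing.length > 0) {
  console.warn(`Missing command-line tools: ${missing.join(', ')}`);
}
```

With the prerequisites in place, the basic call looks like this: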
import { decompose } from 'pdf-decomposer';
// Optional settings
const options = {
  ocr: true, // use OCR (Tesseract) to extract text
};
// bufferOrPath: the PDF to process; as the name suggests, a Buffer or a file path
const result = await decompose(bufferOrPath, options);
Contributing
Contributions are welcome! Please read our Contributing Guidelines before submitting a pull request.
We are committed to fostering a welcoming and respectful environment for everyone. By participating in this project, you agree to abide by our Code of Conduct.
