@leolionart/n8n-nodes-pdf-extractor
v1.2.0
n8n community node to extract text from password-protected PDFs - no external dependencies required
n8n-nodes-pdf-extractor
This is an n8n community node that reliably extracts text from password-protected PDFs using the qpdf and pdftotext command-line tools.
This node was created to solve the known crashing issue with the built-in "Extract from File" PDF node.
n8n is a fair-code licensed workflow automation platform.
Features
- ✅ Extract text from password-protected PDFs
- ✅ Decrypt PDFs and return as binary for further processing
- ✅ No crashes - uses battle-tested command-line tools instead of buggy JavaScript libraries
- ✅ Layout preservation - maintains original text positioning
- ✅ Page range selection - extract specific pages only
- ✅ Multiple encodings - UTF-8, Latin1, ASCII7
Prerequisites
Before using this node, you must install the required tools in your n8n container:
docker exec -u root n8n apk add --no-cache qpdf poppler-utils
For persistent installation, add this to your Docker Compose file:
services:
  n8n:
    image: n8nio/n8n:latest
    # ... other config
    entrypoint: /bin/sh
    command:
      - -c
      - |
        apk add --no-cache qpdf poppler-utils
        exec tini -- /docker-entrypoint.sh
Installation
Via n8n UI (Recommended)
- Go to Settings → Community Nodes
- Click Install
- Enter: n8n-nodes-pdf-extractor
- Click Install
Via npm
cd ~/.n8n/nodes
npm install n8n-nodes-pdf-extractor
Operations
Extract Text
Extracts text content from a PDF file.
Parameters:
- Binary Property: Name of the binary property containing the PDF (default: data)
- Password: Password to decrypt the PDF (leave empty if not encrypted)
Options:
- Layout Mode: Maintain original text layout (default: true)
- Page Range: Extract specific pages (e.g., "1-5" or "1,3,5")
- Output Property: JSON property name for extracted text (default: text)
- Encoding: Text encoding (UTF-8, Latin1, ASCII7)
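The options above correspond closely to pdftotext flags. The following is a minimal sketch of an equivalent command line, not the node's actual implementation. The helper name `pdf_to_text` and its positional arguments are hypothetical, the input is assumed to be already decrypted, and only a contiguous page range is covered (a list such as "1,3,5" would need separate handling).

```shell
# Hypothetical helper: mirrors the Extract Text options with pdftotext flags.
# Usage: pdf_to_text FILE [FIRST_PAGE] [LAST_PAGE] [ENCODING] [LAYOUT]
# The body runs in a subshell so its variables do not leak into the caller.
pdf_to_text() (
    file=$1
    first=${2:-}
    last=${3:-}
    enc=${4:-UTF-8}
    layout=${5:-true}

    # Rebuild the positional parameters as the pdftotext option list.
    set --
    if [ "$layout" = "true" ]; then set -- "$@" -layout; fi
    if [ -n "$first" ]; then set -- "$@" -f "$first"; fi
    if [ -n "$last" ]; then set -- "$@" -l "$last"; fi
    set -- "$@" -enc "$enc"

    # A trailing "-" sends the extracted text to stdout instead of a .txt file.
    pdftotext "$@" "$file" -
)
```

For example, `pdf_to_text statement.pdf 1 5` would run pdftotext with `-layout -f 1 -l 5 -enc UTF-8` on pages 1 to 5.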
Decrypt Only
Decrypts a password-protected PDF and returns it as a binary file for further processing.
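Outside n8n, the same kind of decryption can be reproduced with qpdf directly, which is useful when debugging a file that fails in a workflow. This is a sketch under assumptions: the helper name `decrypt_pdf` and the file names are hypothetical, and the node may invoke qpdf differently.

```shell
# Hypothetical helper: writes a decrypted copy of INPUT to OUTPUT.
# Usage: decrypt_pdf INPUT OUTPUT [PASSWORD]
decrypt_pdf() (
    input=$1
    output=$2
    password=${3:-}

    # --decrypt removes the encryption from the output file;
    # an empty password is fine for PDFs that are not encrypted.
    qpdf --password="$password" --decrypt "$input" "$output"
)
```

The decrypted copy can then be passed to any other PDF tool, for example `pdftotext -layout decrypted.pdf -`.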
Example Usage
Extract text from bank statement
[Gmail Trigger] → [PDF Extractor] → [AI/LLM] → [Google Sheets]
- Gmail Trigger receives email with PDF attachment
- PDF Extractor extracts text with password
- AI extracts structured data
- Save to Google Sheets
Why This Node?
The built-in n8n "Extract from File" node uses the pdf-parse JavaScript library, which:
- ❌ Crashes n8n container with certain PDF encryption types
- ❌ Causes "SIGILL" errors on Alpine Linux
- ❌ Has memory issues with large PDFs
This node uses:
- ✅ qpdf - Industry-standard PDF manipulation tool
- ✅ pdftotext (poppler-utils) - Robust text extraction from PDFs
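Both tools have to be on the PATH of the n8n process for the node to work. A quick way to check from a shell inside the container is sketched below; the helper name `missing_pdf_tools` is hypothetical.

```shell
# Hypothetical helper: prints the name of each required tool that is not on PATH.
missing_pdf_tools() (
    for tool in qpdf pdftotext; do
        # command -v succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
)
```

Empty output means both tools were found; otherwise each missing tool is listed on its own line.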
Troubleshooting
"Required tools not found"
Install the required tools:
docker exec -u root n8n apk add --no-cache qpdf poppler-utils
"Invalid password for PDF file"
Check that the password is correct. A PDF can have separate user and owner passwords, so confirm which one you have been given.
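To test a password outside n8n, qpdf can report a file's password status through its exit code. The sketch below assumes a qpdf version that supports `--requires-password`; the helper name `pdf_password_status` and its output labels are hypothetical.

```shell
# Hypothetical helper: classifies FILE given an optional PASSWORD.
# Usage: pdf_password_status FILE [PASSWORD]
pdf_password_status() (
    file=$1
    password=${2:-}

    # qpdf --requires-password exit codes:
    #   0 = a password other than the one supplied is required
    #   2 = the file is not encrypted
    #   3 = encrypted, but no password is needed or the supplied one is correct
    status=0
    qpdf --requires-password --password="$password" "$file" || status=$?

    case $status in
        0) echo "password-required" ;;
        2) echo "not-encrypted" ;;
        3) echo "password-ok" ;;
        *) echo "error" ;;
    esac
)
```

A result of `password-required` for a password you believe is correct suggests it was mistyped or belongs to a different file.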
Empty text output
The PDF might be scanned/image-based. This node extracts text layers only. For scanned PDFs, use OCR tools.
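One way to confirm that a PDF has no text layer is to run pdftotext on it and check whether anything other than whitespace comes back. This is a rough sketch: the helper name `has_text_layer` is hypothetical, and the input is assumed to be an unencrypted (or already decrypted) file.

```shell
# Hypothetical helper: succeeds if pdftotext finds any non-whitespace text in FILE.
has_text_layer() (
    file=$1

    # Strip all whitespace, including the form feeds pdftotext emits
    # between pages, and see whether anything is left.
    text=$(pdftotext "$file" - 2>/dev/null | tr -d '[:space:]')
    [ -n "$text" ]
)
```

For example, `has_text_layer scan.pdf || echo "no text layer, OCR needed"`. Note that the check also fails if pdftotext itself cannot read the file.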
