n8n-nodes-power-document-extractor
v0.13.7
Power Document Extractor – universal local document parser for n8n
Power Document Extractor for n8n
📖 Overview
Power Document Extractor is a comprehensive n8n community node for extracting structured content from 17+ document formats, entirely locally on your server. No external APIs, no cloud services—just reliable, privacy-focused document parsing.
Perfect for document processing workflows, content analysis, data migration, AI-powered document understanding, and automated information extraction pipelines.
☕ Support Development
If you find this node useful, consider supporting its development:
Donation Links:
- ☕ Ko-fi - Support with coffee
- 💰 Coinbase Commerce - Cryptocurrency donations
✨ Features
- 🔒 100% Local Processing - All document parsing happens on your server
- 📄 17+ Supported Formats - PDF, DOCX, DOC, XLSX, XLS, CSV, TXT, RTF, EPUB, FB2, Markdown, HTML, XML, PPT, PPTX, ODS, ODG
- 🎯 Auto-Detection - Universal Extractor automatically identifies document format
- 🧩 Structured Output - Extracts content as structured blocks (paragraphs, headings, tables, lists)
- 🎚️ Flexible Detail Levels - Choose between Raw, Basic, or Full structured output
- 📊 Rich Metadata - Extracts document metadata (author, title, page count, dates, etc.)
- ⚡ LibreOffice Integration - Optional LibreOffice server for legacy formats (DOC, RTF, PPT)
📦 Installation
Via n8n Community Nodes
- Go to Settings > Community Nodes in your n8n instance
- Click Install and enter: n8n-nodes-power-document-extractor
- Click Install
Via npm
npm install n8n-nodes-power-document-extractor
Manual Installation
cd ~/.n8n/nodes
git clone https://github.com/ZBlaZe/n8n-nodes-power-document-extractor.git
cd n8n-nodes-power-document-extractor
npm install
npm run build
🚀 Supported Formats
| Format | Extension | Native Support | LibreOffice Required | Status |
|--------|-----------|----------------|---------------------|---------|
| PDF | .pdf | ✅ Yes | ❌ No | ⚠️ Beta |
| Plain Text | .txt | ✅ Yes | ❌ No | ✅ Stable |
| CSV | .csv | ✅ Yes | ❌ No | ✅ Stable |
| Markdown | .md | ✅ Yes | ❌ No | ✅ Stable |
| HTML | .html, .htm | ✅ Yes | ❌ No | ✅ Stable |
| XML | .xml | ✅ Yes | ❌ No | ✅ Stable |
| Excel | .xlsx, .xls | ✅ Yes | ❌ No | ✅ Stable |
| Word (Modern) | .docx | ⚠️ Partial | ✅ Yes (recommended) | ⚠️ Beta |
| Word (Legacy) | .doc | ❌ No | ✅ Yes | ⚠️ Beta |
| RTF | .rtf | ❌ No | ✅ Yes | ⚠️ Beta |
| PowerPoint | .ppt, .pptx | ❌ No | ✅ Yes | ⚠️ Beta |
| OpenDocument | .ods, .odg | ⚠️ Partial | ✅ Yes (recommended) | ⚠️ Beta |
| EPUB | .epub | ✅ Yes | ❌ No | ⚠️ Beta |
| FictionBook | .fb2 | ✅ Yes | ❌ No | ⚠️ Beta |
Legend:
- ✅ Stable - Fully tested and production-ready
- ⚠️ Beta - Functional but may have edge cases
- 🚧 In Development - Work in progress
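When a workflow routes files before extraction, it can help to know up front whether a file needs the LibreOffice server. The following is a minimal sketch that transcribes the "LibreOffice Required" column of the table above into a lookup. The names `LIBREOFFICE_NEED` and `libreoffice_need` are hypothetical and not part of the node.

```python
from pathlib import Path

# Transcribed from the "LibreOffice Required" column of the Supported Formats table.
# "required": no native parser, "recommended": partial native support,
# "no": parsed natively without LibreOffice.
LIBREOFFICE_NEED = {
    ".pdf": "no", ".txt": "no", ".csv": "no", ".md": "no",
    ".html": "no", ".htm": "no", ".xml": "no",
    ".xlsx": "no", ".xls": "no", ".epub": "no", ".fb2": "no",
    ".docx": "recommended", ".ods": "recommended", ".odg": "recommended",
    ".doc": "required", ".rtf": "required",
    ".ppt": "required", ".pptx": "required",
}


def libreoffice_need(file_name: str) -> str:
    """Return 'no', 'recommended', 'required', or 'unsupported' for a file name."""
    suffix = Path(file_name).suffix.lower()  # case-insensitive extension match
    return LIBREOFFICE_NEED.get(suffix, "unsupported")
```

A workflow could use the result to skip or flag files marked "required" when no LibreOffice Server URL is configured.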
🎮 Usage
Basic Example
- Add Power Document Extractor node to your workflow
- Connect a node that provides binary file data (e.g., HTTP Request, Read Binary File)
- Configure the node:
  - Operation: Universal Extractor (auto-detects format)
  - Binary Property: data (or your binary property name)
  - Structured Level: Full (recommended)
Operations
Universal Extractor (Recommended)
Automatically detects document format and extracts content using the optimal parser.
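This README does not describe how the Universal Extractor detects formats. One common technique is to inspect the leading "magic bytes" of a file, and the sketch below illustrates only that general idea. It is not the node's implementation, and `sniff_container` is a hypothetical name.

```python
def sniff_container(data: bytes) -> str:
    """Guess a coarse file family from the leading magic bytes."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # ZIP container: DOCX, XLSX, PPTX, EPUB, ODS and ODG all start this way,
        # so the archive contents must be inspected to tell them apart.
        return "zip"
    if data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        # OLE2 compound file: legacy DOC, XLS and PPT.
        return "ole2"
    if data.startswith(b"{\\rtf"):
        return "rtf"
    # Plain-text formats (TXT, CSV, Markdown, HTML, XML, FB2) have no fixed
    # signature; a caller would fall back to the extension or MIME type.
    return "unknown"
```

Magic bytes only narrow a file down to a family, which is why a detector of this kind usually combines them with the file extension or MIME type.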
Format-Specific Extractors
Available for all supported formats if you want explicit control:
- Extract PDF
- Extract DOCX
- Extract XLSX
- Extract TXT
- Extract CSV
- Extract Markdown
- Extract HTML
- ... and more
Structured Levels
Choose the level of detail in extracted content:
- Raw - Single text string with minimal formatting
- Basic - Paragraphs and basic structure
- Full - Complete structure with headings, tables, lists, metadata (recommended)
📤 Output Format
Example Output
{
  "blocks": [
    {
      "type": "heading",
      "level": 1,
      "text": "Annual Report 2024",
      "id": "h-1",
      "page": 1
    },
    {
      "type": "paragraph",
      "text": "This report provides an overview of our company's performance...",
      "id": "p-1",
      "page": 1
    },
    {
      "type": "table",
      "headers": ["Quarter", "Revenue", "Growth"],
      "rows": [
        ["Q1", "$1.2M", "15%"],
        ["Q2", "$1.5M", "25%"],
        ["Q3", "$1.8M", "20%"],
        ["Q4", "$2.1M", "17%"]
      ],
      "id": "table-1",
      "page": 2
    }
  ],
  "metadata": {
    "fileName": "annual_report_2024.pdf",
    "fileSize": 245680,
    "fileType": "pdf",
    "mimeType": "application/pdf",
    "pageCount": 12,
    "author": "John Smith",
    "title": "Annual Report 2024",
    "creationDate": "2024-01-15",
    "modificationDate": "2024-11-20"
  }
}
Block Types
- paragraph - Text paragraphs
- heading - Document headings (with level 1-6)
- table - Tables with headers and rows
- list - Bulleted or numbered lists
- image - Image references (planned for future versions)
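To show how the block structure can be consumed downstream, here is a minimal sketch that renders heading, paragraph, and table blocks as Markdown text. It assumes only the fields visible in the example output above ("type", "level", "text", "headers", "rows"). The function name `blocks_to_markdown` is hypothetical, and list blocks are skipped because their fields are not shown in this README.

```python
def blocks_to_markdown(document: dict) -> str:
    """Render heading, paragraph and table blocks as Markdown text."""
    parts = []
    for block in document.get("blocks", []):
        kind = block.get("type")
        if kind == "heading":
            # Clamp the level to the 1-6 range documented for heading blocks.
            level = min(max(int(block.get("level", 1)), 1), 6)
            parts.append("#" * level + " " + block.get("text", ""))
        elif kind == "paragraph":
            parts.append(block.get("text", ""))
        elif kind == "table":
            headers = [str(h) for h in block.get("headers", [])]
            lines = [
                "| " + " | ".join(headers) + " |",
                "| " + " | ".join("---" for _ in headers) + " |",
            ]
            for row in block.get("rows", []):
                lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
            parts.append("\n".join(lines))
        # Other block types (for example "list") are skipped here because
        # their field names are an unknown in this sketch.
    return "\n\n".join(parts)
```

In a pipeline, `document` would be the JSON object the node emits for one file, parsed into a Python dict.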
🐳 LibreOffice Server Setup (Optional)
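DOC, RTF, PPT, and PPTX files have no native parser and require a LibreOffice server; DOCX, ODS, and ODG are only partially supported without one (see Supported Formats). The simplest way to run the server is with Docker.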
Quick Start with Docker
docker run -d \
  --name libreoffice-server \
  -p 33101:2004 \
  ghcr.io/unoconv/unoserver-docker:latest
Configuration in n8n
In the Power Document Extractor node:
- LibreOffice Server URL: http://localhost:33101 (or your server IP)
Important Notes
- ⚠️ Do not expose LibreOffice port to the internet - it's not secured by default
- 🔒 Use firewall rules or Docker networks to restrict access
- 🚀 LibreOffice container should run on the same network as n8n for best performance
For detailed setup instructions, see unoserver documentation.
⚙️ Configuration
Node Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| Operation | Select | Universal Extractor | Extraction method (auto or format-specific) |
| Binary Property | String | data | Name of the binary property containing the file |
| Structured Level | Select | Full | Level of detail in output (Raw/Basic/Full) |
| LibreOffice Server URL | String | (empty) | Optional LibreOffice server URL for legacy formats |
🎯 Use Cases
- 📊 Data Migration - Extract content from legacy documents for database import
- 🤖 AI/LLM Integration - Prepare document content for AI analysis and processing
- 🔍 Document Indexing - Build searchable document databases
- 📝 Content Management - Automated document processing workflows
- 📧 Email Attachment Processing - Extract and analyze attachments automatically
- 🗄️ Archive Digitization - Convert old documents to structured data
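For the indexing and AI/LLM use cases, extracted blocks are often grouped into sections before being stored or sent to a model. The sketch below groups blocks under the most recent heading, assuming the "type", "text", and "page" fields from the example output. `split_into_sections` is a hypothetical helper, and blocks without a "text" field (such as tables) are left out for brevity.

```python
def split_into_sections(blocks: list) -> list:
    """Group consecutive blocks under the most recent heading."""
    sections = []
    current = {"title": None, "page": None, "texts": []}
    for block in blocks:
        if block.get("type") == "heading":
            # Close the previous section if it has a title or any text.
            if current["title"] is not None or current["texts"]:
                sections.append(current)
            current = {
                "title": block.get("text"),
                "page": block.get("page"),
                "texts": [],
            }
        elif "text" in block:
            current["texts"].append(block["text"])
    if current["title"] is not None or current["texts"]:
        sections.append(current)
    return sections
```

Each returned section carries its heading text, the page where the heading appears, and the paragraph texts that follow it, which maps naturally onto one index entry or one prompt chunk.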
⚠️ Known Limitations
- 🖼️ Image Extraction - Not yet supported (planned for a future version)
- 🔤 Text Encoding - Some legacy documents may have encoding issues
- 📄 Complex Layouts - Advanced page layouts may not be fully preserved
- ⏱️ Large Files - Very large files (>100MB) may take longer to process
Development Status: This node is actively maintained and under continuous improvement. Bug reports and feature requests are welcome!
🗺️ Roadmap
Version 0.11.0-0.12.0 (Planned)
- 🖼️ Base64 image extraction from documents
- 🔤 Improved text encoding detection and handling
- 🎨 Better formatting preservation for complex documents
Future Versions
- 📊 Advanced table structure detection
- 🔗 Hyperlink extraction
- 📝 Document annotations and comments
- 🌍 Multi-language OCR support
- ⚡ Performance optimizations for large files
🐛 Troubleshooting
Common Issues
"LibreOffice conversion failed"
Solution:
- Ensure LibreOffice server is running: docker ps | grep libreoffice
- Check URL is correct: http://localhost:33101 (not https://)
- Verify server is accessible from n8n container
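One way to verify accessibility is a plain TCP connection test, run from the same machine or container as n8n. The sketch below only checks that the port accepts connections; it does not confirm that LibreOffice behind it is healthy. `is_reachable` is a hypothetical helper name.

```python
import socket
from urllib.parse import urlsplit


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    if not parts.hostname:
        return False  # e.g. the URL was given without a scheme
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


if __name__ == "__main__":
    print(is_reachable("http://localhost:33101"))
```

If this prints False inside the n8n container while the LibreOffice container is running, the likely cause is that `localhost` refers to the n8n container itself; use the LibreOffice container's name or the server IP instead.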
Text appears garbled or with wrong characters
Possible causes:
- Legacy encoding in old documents
- Font substitution issues
Temporary workaround: try converting the file through the LibreOffice server.
Node execution timeout
For large files:
- Increase n8n execution timeout in settings
- Consider splitting large documents
- Use LibreOffice server for faster processing
Empty output for supported format
Check:
- File is not password-protected
- File is not corrupted
- File contains actual text (not just images)
🤝 Contributing
Contributions are welcome! If you find a bug or have a feature request:
- Check existing issues
- Create a new issue with detailed description
- Submit a pull request with your improvements
📜 License
Proprietary / All Rights Reserved
This project is closed-source.
The source code is not available for public viewing or modification.
🙏 Acknowledgments
Built with:
- n8n - Workflow automation platform
- pdf.js-extract - PDF parsing
- xlsx - Excel file processing
- cheerio - HTML parsing
- marked - Markdown parsing
- LibreOffice - Legacy format conversion
📞 Support
- 💬 GitHub Issues
- 📧 Contact via GitHub profile
- ☕ Support on Ko-fi
Made with ❤️ for the n8n community
