n8n-nodes-power-document-extractor

v0.13.7

Published

3 months ago

Power Document Extractor – universal local document parser for n8n


zblaze

n8n-community-node-package document extract pdf docx csv markdown parser text-extraction document-processing

Power Document Extractor for n8n

📖 Overview

Power Document Extractor is an n8n community node that extracts structured content from 17+ document formats entirely on your own server. It uses no external APIs or cloud services, which keeps document parsing private.

Perfect for document processing workflows, content analysis, data migration, AI-powered document understanding, and automated information extraction pipelines.

☕ Support Development

If you find this node useful, consider supporting its development:

Donation Links:

☕ Ko-fi - Support with coffee
💰 Coinbase Commerce - Cryptocurrency donations

✨ Features

🔒 100% Local Processing - All document parsing happens on your server
📄 17+ Supported Formats - PDF, DOCX, DOC, XLSX, XLS, CSV, TXT, RTF, EPUB, FB2, Markdown, HTML, XML, PPT, PPTX, ODS, ODG
🎯 Auto-Detection - Universal Extractor automatically identifies document format
🧩 Structured Output - Extracts content as structured blocks (paragraphs, headings, tables, lists)
🎚️ Flexible Detail Levels - Choose between Raw, Basic, or Full structured output
📊 Rich Metadata - Extracts document metadata (author, title, page count, dates, etc.)
⚡ LibreOffice Integration - Optional LibreOffice server for formats without native support (DOC, RTF, PPT, PPTX)

📦 Installation

Via n8n Community Nodes

Go to Settings > Community Nodes in your n8n instance
Click Install and enter: n8n-nodes-power-document-extractor
Click Install

Via npm

npm install n8n-nodes-power-document-extractor

Manual Installation

cd ~/.n8n/nodes
git clone https://github.com/ZBlaZe/n8n-nodes-power-document-extractor.git
cd n8n-nodes-power-document-extractor
npm install
npm run build

🚀 Supported Formats

| Format | Extension | Native Support | LibreOffice Required | Status |
|--------|-----------|----------------|----------------------|--------|
| PDF | .pdf | ✅ Yes | ❌ No | ⚠️ Beta |
| Plain Text | .txt | ✅ Yes | ❌ No | ✅ Stable |
| CSV | .csv | ✅ Yes | ❌ No | ✅ Stable |
| Markdown | .md | ✅ Yes | ❌ No | ✅ Stable |
| HTML | .html, .htm | ✅ Yes | ❌ No | ✅ Stable |
| XML | .xml | ✅ Yes | ❌ No | ✅ Stable |
| Excel | .xlsx, .xls | ✅ Yes | ❌ No | ✅ Stable |
| Word (Modern) | .docx | ⚠️ Partial | ✅ Yes (recommended) | ⚠️ Beta |
| Word (Legacy) | .doc | ❌ No | ✅ Yes | ⚠️ Beta |
| RTF | .rtf | ❌ No | ✅ Yes | ⚠️ Beta |
| PowerPoint | .ppt, .pptx | ❌ No | ✅ Yes | ⚠️ Beta |
| OpenDocument | .ods, .odg | ⚠️ Partial | ✅ Yes (recommended) | ⚠️ Beta |
| EPUB | .epub | ✅ Yes | ❌ No | ⚠️ Beta |
| FictionBook | .fb2 | ✅ Yes | ❌ No | ⚠️ Beta |

Legend:

✅ Stable - Fully tested and production-ready
⚠️ Beta - Functional but may have edge cases
🚧 In Development - Work in progress
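
If a workflow handles mixed uploads, it can help to know ahead of time which files depend on a LibreOffice server. The sketch below transcribes the "LibreOffice Required" column of the table into a lookup by file extension. The type names, `FORMAT_ROWS`, and `libreOfficeNeed` are illustrative only and are not part of the node's API.

```typescript
// Rows transcribed from the Supported Formats table. All names here are
// hypothetical helpers, not something the node exports.
type LibreOfficeNeed = "not-needed" | "recommended" | "required";

interface FormatRow {
  format: string;
  extensions: string[];
  libreOffice: LibreOfficeNeed;
}

const FORMAT_ROWS: FormatRow[] = [
  { format: "PDF", extensions: ["pdf"], libreOffice: "not-needed" },
  { format: "Plain Text", extensions: ["txt"], libreOffice: "not-needed" },
  { format: "CSV", extensions: ["csv"], libreOffice: "not-needed" },
  { format: "Markdown", extensions: ["md"], libreOffice: "not-needed" },
  { format: "HTML", extensions: ["html", "htm"], libreOffice: "not-needed" },
  { format: "XML", extensions: ["xml"], libreOffice: "not-needed" },
  { format: "Excel", extensions: ["xlsx", "xls"], libreOffice: "not-needed" },
  { format: "Word (Modern)", extensions: ["docx"], libreOffice: "recommended" },
  { format: "Word (Legacy)", extensions: ["doc"], libreOffice: "required" },
  { format: "RTF", extensions: ["rtf"], libreOffice: "required" },
  { format: "PowerPoint", extensions: ["ppt", "pptx"], libreOffice: "required" },
  { format: "OpenDocument", extensions: ["ods", "odg"], libreOffice: "recommended" },
  { format: "EPUB", extensions: ["epub"], libreOffice: "not-needed" },
  { format: "FictionBook", extensions: ["fb2"], libreOffice: "not-needed" },
];

// Returns undefined for files with no extension or an extension not in the table.
function libreOfficeNeed(fileName: string): LibreOfficeNeed | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined;
  const ext = fileName.slice(dot + 1).toLowerCase();
  return FORMAT_ROWS.find((row) => row.extensions.includes(ext))?.libreOffice;
}
```

A workflow could use a check like this to route `.doc` or `.ppt` files to a branch where the LibreOffice Server URL is configured.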

🎮 Usage

Basic Example

Add Power Document Extractor node to your workflow
Connect a node that provides binary file data (e.g., HTTP Request, Read Binary File)
Configure the node:
- Operation: Universal Extractor (auto-detects format)
- Binary Property: data (or your binary property name)
- Structured Level: Full (recommended)

Operations

Universal Extractor (Recommended)

Automatically detects document format and extracts content using the optimal parser.

Format-Specific Extractors

Available for all supported formats if you want explicit control:

Extract PDF
Extract DOCX
Extract XLSX
Extract TXT
Extract CSV
Extract Markdown
Extract HTML
... and more

Structured Levels

Choose the level of detail in extracted content:

Raw - Single text string with minimal formatting
Basic - Paragraphs and basic structure
Full - Complete structure with headings, tables, lists, metadata (recommended)

📤 Output Format

Example Output

{
  "blocks": [
    {
      "type": "heading",
      "level": 1,
      "text": "Annual Report 2024",
      "id": "h-1",
      "page": 1
    },
    {
      "type": "paragraph",
      "text": "This report provides an overview of our company's performance...",
      "id": "p-1",
      "page": 1
    },
    {
      "type": "table",
      "headers": ["Quarter", "Revenue", "Growth"],
      "rows": [
        ["Q1", "$1.2M", "15%"],
        ["Q2", "$1.5M", "25%"],
        ["Q3", "$1.8M", "20%"],
        ["Q4", "$2.1M", "17%"]
      ],
      "id": "table-1",
      "page": 2
    }
  ],
  "metadata": {
    "fileName": "annual_report_2024.pdf",
    "fileSize": 245680,
    "fileType": "pdf",
    "mimeType": "application/pdf",
    "pageCount": 12,
    "author": "John Smith",
    "title": "Annual Report 2024",
    "creationDate": "2024-01-15",
    "modificationDate": "2024-11-20"
  }
}
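
For database import or spreadsheet export, a `table` block is often easier to work with as one object per row. The following is a minimal sketch based only on the fields shown in the example above. The `TableBlock` interface and the `tableToRecords` helper are assumptions for illustration, not types shipped with the node.

```typescript
// Shape of a table block as it appears in the example output.
// This interface is an assumption derived from that example.
interface TableBlock {
  type: "table";
  headers: string[];
  rows: string[][];
  id: string;
  page?: number;
}

// Turns each row into an object keyed by the header names.
// Cells missing from a short row become empty strings.
function tableToRecords(table: TableBlock): Array<Record<string, string>> {
  return table.rows.map((row) => {
    const record: Record<string, string> = {};
    table.headers.forEach((header, i) => {
      record[header] = row[i] ?? "";
    });
    return record;
  });
}

const exampleTable: TableBlock = {
  type: "table",
  headers: ["Quarter", "Revenue", "Growth"],
  rows: [
    ["Q1", "$1.2M", "15%"],
    ["Q2", "$1.5M", "25%"],
    ["Q3", "$1.8M", "20%"],
    ["Q4", "$2.1M", "17%"],
  ],
  id: "table-1",
  page: 2,
};

const records = tableToRecords(exampleTable);
// records[0] is { Quarter: "Q1", Revenue: "$1.2M", Growth: "15%" }
```

All values stay strings, as in the extracted output. Converting "$1.2M" or "15%" to numbers is left to the consuming workflow.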

Block Types

paragraph - Text paragraphs
heading - Document headings (with level 1-6)
table - Tables with headers and rows
list - Bulleted or numbered lists
image - Image references (planned for future versions)
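
When the extracted content is passed on to an LLM or a search index, it is often convenient to flatten the blocks back into Markdown. The sketch below handles the four block types that are currently produced. The heading, paragraph, and table shapes follow the example output. The source shows no `list` example, so the `items` and `ordered` fields are assumptions and should be checked against real node output.

```typescript
// Block shapes inferred from the example output. `items` and `ordered`
// on list blocks are hypothetical field names.
type Block = { id?: string; page?: number } & (
  | { type: "heading"; level: number; text: string }
  | { type: "paragraph"; text: string }
  | { type: "table"; headers: string[]; rows: string[][] }
  | { type: "list"; items: string[]; ordered?: boolean }
);

function blockToMarkdown(block: Block): string {
  switch (block.type) {
    case "heading": {
      // Clamp to the 1-6 range that Markdown headings support.
      const level = Math.min(6, Math.max(1, block.level));
      return `${"#".repeat(level)} ${block.text}`;
    }
    case "paragraph":
      return block.text;
    case "table": {
      const line = (cells: string[]) => `| ${cells.join(" | ")} |`;
      const separator = line(block.headers.map(() => "---"));
      return [line(block.headers), separator, ...block.rows.map((r) => line(r))].join("\n");
    }
    case "list": {
      const { items, ordered } = block;
      return items
        .map((item, i) => (ordered ? `${i + 1}. ${item}` : `- ${item}`))
        .join("\n");
    }
  }
}

// Joins the blocks with blank lines between them.
function blocksToMarkdown(blocks: Block[]): string {
  return blocks.map(blockToMarkdown).join("\n\n");
}

const markdown = blocksToMarkdown([
  { type: "heading", level: 1, text: "Annual Report 2024" },
  { type: "paragraph", text: "This report provides an overview." },
]);
// markdown === "# Annual Report 2024\n\nThis report provides an overview."
```

Cell text containing `|` is not escaped here, so tables with pipes in their cells would need extra handling.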

🐳 LibreOffice Server Setup (Optional)

DOC, RTF, PPT, and PPTX files require a LibreOffice server, and DOCX, ODS, and ODG files give better results with one. The simplest way to run it is with Docker.

Quick Start with Docker

docker run -d \
  --name libreoffice-server \
  -p 33101:2004 \
  ghcr.io/unoconv/unoserver-docker:latest

Configuration in n8n

In the Power Document Extractor node:

LibreOffice Server URL: http://localhost:33101 (or your server IP)

Important Notes

⚠️ Do not expose LibreOffice port to the internet - it's not secured by default
🔒 Use firewall rules or Docker networks to restrict access
🚀 LibreOffice container should run on the same network as n8n for best performance

For detailed setup instructions, see unoserver documentation.

⚙️ Configuration

Node Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| Operation | Select | Universal Extractor | Extraction method (auto or format-specific) |
| Binary Property | String | data | Name of the binary property containing the file |
| Structured Level | Select | Full | Level of detail in output (Raw/Basic/Full) |
| LibreOffice Server URL | String | (empty) | Optional LibreOffice server URL for legacy formats |
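
For anyone generating or validating workflow settings in code, the table can be mirrored as a typed object with the same defaults. This is only a sketch: the camelCase keys below are hypothetical and are not the node's internal parameter names, which the table does not state.

```typescript
type StructuredLevel = "Raw" | "Basic" | "Full";

// Hypothetical key names; only the display names and defaults come from the table.
interface ExtractorParams {
  operation: string;            // "Operation"
  binaryProperty: string;       // "Binary Property"
  structuredLevel: StructuredLevel; // "Structured Level"
  libreOfficeServerUrl: string; // "LibreOffice Server URL"
}

const DEFAULT_PARAMS: ExtractorParams = {
  operation: "Universal Extractor",
  binaryProperty: "data",
  structuredLevel: "Full",
  libreOfficeServerUrl: "",
};

// Fills in the documented defaults for any setting that is not overridden.
function withDefaults(overrides: Partial<ExtractorParams> = {}): ExtractorParams {
  return { ...DEFAULT_PARAMS, ...overrides };
}

const legacyDocParams = withDefaults({
  libreOfficeServerUrl: "http://localhost:33101",
});
```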

🎯 Use Cases

📊 Data Migration - Extract content from legacy documents for database import
🤖 AI/LLM Integration - Prepare document content for AI analysis and processing
🔍 Document Indexing - Build searchable document databases
📝 Content Management - Automated document processing workflows
📧 Email Attachment Processing - Extract and analyze attachments automatically
🗄️ Archive Digitization - Convert old documents to structured data

⚠️ Known Limitations

🖼️ Image Extraction - Not yet supported (planned for a future release; see the Roadmap)
🔤 Text Encoding - Some legacy documents may have encoding issues
📄 Complex Layouts - Advanced page layouts may not be fully preserved
⏱️ Large Files - Very large files (>100MB) may take longer to process

Development Status: This node is actively maintained and under continuous improvement. Bug reports and feature requests are welcome!

🗺️ Roadmap

Version 0.11.0-0.12.0 (Planned)

🖼️ Base64 image extraction from documents
🔤 Improved text encoding detection and handling
🎨 Better formatting preservation for complex documents

Future Versions

📊 Advanced table structure detection
🔗 Hyperlink extraction
📝 Document annotations and comments
🌍 Multi-language OCR support
⚡ Performance optimizations for large files

🐛 Troubleshooting

Common Issues

"LibreOffice conversion failed"

Solution:

Ensure LibreOffice server is running: docker ps | grep libreoffice
Check URL is correct: http://localhost:33101 (not https://)
Verify server is accessible from n8n container

Text appears garbled or with wrong characters

Possible causes:

Legacy encoding in old documents
Font substitution issues
Temporary workaround: Try using LibreOffice server for conversion

Node execution timeout

For large files:

Increase n8n execution timeout in settings
Consider splitting large documents
Use LibreOffice server for faster processing

Empty output for supported format

Check:

File is not password-protected
File is not corrupted
File contains actual text (not just images)
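
A workflow can detect this case itself and branch to an error path instead of passing empty content downstream. The sketch below treats a result as empty when no block carries any non-whitespace text. The `ExtractionResult` shape is inferred from the example output, and the `items` field for list blocks is an assumption.

```typescript
// Loose shape covering the text-bearing fields of each block type.
// `items` for list blocks is a hypothetical field name.
interface ExtractionResult {
  blocks: Array<{
    type: string;
    text?: string;
    headers?: string[];
    rows?: string[][];
    items?: string[];
  }>;
}

// True if at least one block contains visible text.
function hasExtractedText(result: ExtractionResult): boolean {
  const nonBlank = (s: string) => s.trim().length > 0;
  return result.blocks.some(
    (block) =>
      nonBlank(block.text ?? "") ||
      (block.headers ?? []).some((h) => nonBlank(h)) ||
      (block.rows ?? []).some((row) => row.some((cell) => nonBlank(cell))) ||
      (block.items ?? []).some((item) => nonBlank(item)),
  );
}
```

For a file that contains only scanned images, a check like this returns `false`, which is a signal to look at the three causes listed above.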

🤝 Contributing

Contributions are welcome! If you find a bug or have a feature request:

Check existing issues
Create a new issue with detailed description
Submit a pull request with your improvements

📜 License

🙏 Acknowledgments

Built with:

n8n - Workflow automation platform
pdf.js-extract - PDF parsing
xlsx - Excel file processing
cheerio - HTML parsing
marked - Markdown parsing
LibreOffice - Legacy format conversion

📞 Support

💬 GitHub Issues
📧 Contact via GitHub profile
☕ Support on Ko-fi

Made with ❤️ for the n8n community