n8n-nodes-mopdf
v0.3.0
n8n community node for PDF processing - convert PDF to images, extract text and run OCR
n8n-nodes-mopdf
Extract text from PDFs, convert pages to images, and generate AI-ready output without calling any external API. Everything runs locally inside n8n.
Built on MuPDF and Tesseract.js.
Installation
In n8n: Settings → Community Nodes → Install, then enter the package name `n8n-nodes-mopdf`.
Operations
PDF to Images
Renders each PDF page as a PNG or JPEG image.
| Parameter | Default | Notes |
|---|---|---|
| DPI | 200 | 72 – 600 |
| Format | PNG | PNG or JPEG |
| Pages | All | Specific pages supported |
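
To give a feel for what the DPI setting means for output size, here is a minimal sketch of the usual points-to-pixels arithmetic. PDF page sizes are measured in points (1/72 inch), so the scale factor is DPI / 72. The helper `renderedSize` is hypothetical and not part of this node; the node's actual rounding may differ by a pixel.

```typescript
// PDF page sizes are expressed in points; one point is 1/72 inch.
const POINTS_PER_INCH = 72;

// Hypothetical helper for illustration only, not part of n8n-nodes-mopdf.
function renderedSize(
  widthPt: number,
  heightPt: number,
  dpi: number,
): { width: number; height: number } {
  // Mirror the documented DPI range of 72 to 600.
  if (dpi < 72 || dpi > 600) {
    throw new RangeError(`DPI must be between 72 and 600, got ${dpi}`);
  }
  const scale = dpi / POINTS_PER_INCH;
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// An A4 page is roughly 595 x 842 pt.
const a4AtDefaultDpi = renderedSize(595, 842, 200);
console.log(a4AtDefaultDpi); // { width: 1653, height: 2339 }
```

Pixel count grows with the square of the DPI, so going from 200 to 600 DPI yields images about nine times larger. The default of 200 is usually a reasonable balance for OCR input.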
OCR
Runs optical character recognition on an image binary.
- 11 supported languages (German, English, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Simplified Chinese, Japanese)
- Page segmentation modes (PSM): Auto, Single Block, Single Line, Single Word, Sparse Text
- Output format: Plain text or Markdown
- Optional: word and line coordinates in output
- Carries over input JSON fields (`Keep input data`)
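
For readers who know Tesseract directly, the sketch below shows how the language names and segmentation modes listed above correspond to Tesseract's standard language codes and PSM numbers. The codes and numbers are Tesseract's own; the constant names, the `tesseractLangs` helper, and the idea that the node maps its options exactly this way are assumptions made for illustration.

```typescript
// Tesseract page segmentation mode numbers for the modes listed above.
// Assumption: the node's option labels map onto these PSM values.
const PSM_BY_MODE = {
  "Auto": 3,
  "Single Block": 6,
  "Single Line": 7,
  "Single Word": 8,
  "Sparse Text": 11,
} as const;

// Standard Tesseract traineddata codes for the 11 supported languages.
const LANG_CODE = {
  "German": "deu",
  "English": "eng",
  "French": "fra",
  "Spanish": "spa",
  "Italian": "ita",
  "Portuguese": "por",
  "Dutch": "nld",
  "Polish": "pol",
  "Russian": "rus",
  "Simplified Chinese": "chi_sim",
  "Japanese": "jpn",
} as const;

type Language = keyof typeof LANG_CODE;

// Tesseract accepts several languages joined with "+".
// Whether the node lets you combine languages is not stated in this README.
function tesseractLangs(languages: Language[]): string {
  return languages.map((name) => LANG_CODE[name]).join("+");
}

console.log(tesseractLangs(["German", "English"])); // prints deu+eng
console.log(PSM_BY_MODE["Sparse Text"]); // prints 11
```

As a rule of thumb, Auto suits full pages, Single Line or Single Word suit tightly cropped snippets, and Sparse Text suits images with scattered labels such as forms or diagrams.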
Text Extraction
Extracts selectable text directly from a PDF — no OCR, fast.
- Output formats: Plain text, Markdown, JSON, HTML
- Per-page or combined output
- Optional: PDF metadata (title, author, etc.)
- Optional `Text only` mode: output contains only the `text` field, no statistics
Text + OCR Fallback
Tries direct text extraction first. For any page that yields no text (e.g. a scanned page), OCR runs automatically as a fallback.
- Selectively replaces only empty pages — pages with extractable text are kept as-is
- Output formats: Plain text, Markdown, JSON, HTML
- For Markdown format, hOCR is used automatically on OCR pages for better structure
- Optional `Text only` mode
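
The per-page fallback described above can be pictured roughly as follows. This is a simplified sketch, not the node's implementation: `pagesNeedingOcr`, `mergeOcr`, and the `PageResult` type are hypothetical names, and treating whitespace-only text as "no text" is an assumption. The real extraction and OCR steps are replaced with plain strings so the selection logic stays visible.

```typescript
type PageResult = { page: number; text: string; source: "text" | "ocr" };

// Assumption: a page counts as empty when its extracted text is blank after trimming.
function pagesNeedingOcr(extracted: string[]): number[] {
  return extracted
    .map((text, index) => ({ text, page: index + 1 }))
    .filter(({ text }) => text.trim() === "")
    .map(({ page }) => page);
}

// Replace only the empty pages with OCR output; keep all other pages as extracted.
function mergeOcr(
  extracted: string[],
  ocrByPage: Record<number, string>,
): PageResult[] {
  return extracted.map((text, index) => {
    const page = index + 1;
    const ocrText = ocrByPage[page];
    return text.trim() === "" && ocrText !== undefined
      ? { page, text: ocrText, source: "ocr" as const }
      : { page, text, source: "text" as const };
  });
}

// Page 2 simulates a scanned page without a text layer.
const extractedPages = ["Invoice 2024", "   ", "Total: 42"];
console.log(pagesNeedingOcr(extractedPages)); // [ 2 ]

const merged = mergeOcr(extractedPages, { 2: "Scanned delivery note" });
console.log(merged.map((p) => `${p.page}:${p.source}`).join(" ")); // prints 1:text 2:ocr 3:text
```

Because OCR is far slower than reading an embedded text layer, running it only on the empty pages keeps mixed documents (digital pages plus a few scans) fast.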
Output formats
| Format | Description |
|---|---|
| Plain text | Clean extracted text |
| Markdown | Structure-aware output, suitable for LLMs |
| JSON | Raw structured text with bounding boxes |
| HTML | Full HTML with layout information |
Requirements
- n8n self-hosted (Community Nodes are not available on n8n Cloud)
- Node.js 18+
Licenses
This package depends on the following third-party libraries:
- MuPDF — AGPL v3
- Tesseract.js — Apache 2.0
This package itself is licensed under MIT.
