n8n-nodes-pdf-ocr

v1.3.0, published 7 months ago

n8n node for PDF OCR text extraction using Tesseract.js

Vulnerabilities: 0 high, 0 medium, 0 low

Maintainer: dnachavez

Keywords: n8n-community-node-package, pdf, ocr, tesseract, text-extraction

n8n-nodes-pdf-ocr

This is an n8n community node that lets you extract text from PDF files using OCR (Optical Character Recognition) in your n8n workflows.

PDF OCR uses Tesseract.js for text recognition and PDF.js for PDF processing, providing a completely free solution without requiring external APIs, accounts, or system dependencies.

n8n is a fair-code licensed workflow automation platform.

Installation
Operations
Compatibility
Usage
Troubleshooting
Publishing
Resources

Installation

Follow the installation guide in the n8n community nodes documentation.

Note: This node uses pure JavaScript libraries and doesn't require any external system dependencies.

Operations

PDF OCR

Extract text from PDF files - Converts PDF pages to images and uses OCR to extract text
Multi-page support - Processes all pages in a PDF document
Multi-language support - Supports over 100 languages including English, Spanish, French, German, Italian, Portuguese, Russian, Chinese Simplified, Japanese, and Korean
Multiple output formats - Choose between combined text, per-page text, or detailed output with metadata
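
The output examples in this README report the language as a short code such as "eng". As a rough illustration, the ten languages named above correspond to the following Tesseract language codes. `TESSERACT_CODES` and `toTesseractCode` are hypothetical names used only in this sketch and are not part of the node; the node's own Language option may label its choices differently.

```typescript
// Tesseract language codes for the languages listed above.
const TESSERACT_CODES: Record<string, string> = {
  English: "eng",
  Spanish: "spa",
  French: "fra",
  German: "deu",
  Italian: "ita",
  Portuguese: "por",
  Russian: "rus",
  "Chinese Simplified": "chi_sim",
  Japanese: "jpn",
  Korean: "kor",
};

// Hypothetical helper: look up a code, falling back to English (the node's default language).
function toTesseractCode(languageName: string): string {
  return Object.prototype.hasOwnProperty.call(TESSERACT_CODES, languageName)
    ? TESSERACT_CODES[languageName]
    : "eng";
}
```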

Parameters

Input Binary Property - Name of the binary property containing the PDF file (default: "data")
Language - Language for OCR text recognition (default: English)
Output Format - How to format the extracted text:
- Combined Text: All text combined into a single string
- Per Page: Array of text for each page
- Detailed: Combined text + per-page text + metadata
DPI Scale - Scale factor for rendering PDF pages before OCR (default: 2). Higher values improve recognition quality but make processing slower and use more memory.
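
To get a feel for what a higher DPI Scale costs, the sketch below estimates the size of one rendered page. It rests on two assumptions this README does not confirm: that the value is used as the PDF.js viewport scale (where scale 1 maps one PDF point, 1/72 inch, to one pixel), and that each rendered page is held as an uncompressed RGBA bitmap at 4 bytes per pixel. `estimateRenderedPage` is a hypothetical helper, not part of the node.

```typescript
interface RenderEstimate {
  widthPx: number;
  heightPx: number;
  effectiveDpi: number;
  rgbaBytes: number;
}

// Estimate the bitmap produced for a page of the given size (in PDF points).
function estimateRenderedPage(widthPt: number, heightPt: number, dpiScale: number): RenderEstimate {
  const widthPx = Math.ceil(widthPt * dpiScale);
  const heightPx = Math.ceil(heightPt * dpiScale);
  return {
    widthPx,
    heightPx,
    effectiveDpi: 72 * dpiScale,          // assumption: scale 1 equals 72 DPI
    rgbaBytes: widthPx * heightPx * 4,    // assumption: 4 bytes per pixel (RGBA)
  };
}

// US Letter is 612 x 792 points.
const letterAtDefaultScale = estimateRenderedPage(612, 792, 2);
// -> 1224 x 1584 px at 144 DPI, 7,755,264 bytes (about 7.4 MiB) per page
```

Under these assumptions, doubling the scale quadruples the pixel count and the memory per page, which is why lowering DPI Scale is the first thing to try for slow or memory-hungry runs.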

Compatibility

Minimum n8n version: 0.187.0
Node.js version: >=20.15
System requirements: None (pure JavaScript)

Usage

Basic Usage

Add the PDF OCR node to your workflow
Connect it to a node that provides PDF binary data (e.g., HTTP Request, Read Binary File)
Configure the input binary property name
Select the desired language for OCR
Choose the output format
Run the workflow
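
A common stumbling block in the steps above is pointing the node at a binary property that does not exist or does not hold a PDF. The sketch below shows one way to check an item before it reaches the PDF OCR node, for example from a Code node. It models only the parts of an n8n item it needs; `PdfCandidateItem` and `hasPdfBinary` are hypothetical names, and the MIME type check is an assumption about what the upstream node reports.

```typescript
// Minimal shape of an n8n item as used by this sketch; real items carry more fields.
interface PdfCandidateItem {
  json: Record<string, unknown>;
  binary?: Record<string, { mimeType: string; fileName?: string; data: string }>;
}

// Hypothetical pre-flight check: does the item hold a PDF under the given binary property?
function hasPdfBinary(item: PdfCandidateItem, propertyName: string = "data"): boolean {
  const entry = item.binary ? item.binary[propertyName] : undefined;
  return entry !== undefined && entry.mimeType === "application/pdf";
}
```

Some sources report PDFs with a generic MIME type, so a stricter or looser check (for instance on the file name) may suit a given workflow better.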

Output Examples

Combined Text Output

{
  "text": "This is the combined text from all pages...",
  "totalPages": 3,
  "language": "eng"
}

Per Page Output

{
  "pages": [
    {
      "pageNumber": 1,
      "text": "Text from page 1..."
    },
    {
      "pageNumber": 2,
      "text": "Text from page 2..."
    }
  ],
  "totalPages": 2,
  "language": "eng"
}

Detailed Output

{
  "text": "Combined text from all pages...",
  "pages": [
    {
      "pageNumber": 1,
      "text": "Text from page 1..."
    }
  ],
  "metadata": {
    "totalPages": 1,
    "language": "eng",
    "dpiScale": 2
  }
}
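
Because the three formats place the text and the page count in different spots, downstream code may want to read them uniformly. The following sketch types the three shapes exactly as shown in the examples above and adds two hypothetical helpers, `fullText` and `pageCount`. Joining per-page text with a blank line is an assumption made for this sketch; the README does not say how the node combines pages.

```typescript
interface CombinedShape { text: string; totalPages: number; language: string; }
interface PageEntry { pageNumber: number; text: string; }
interface PerPageShape { pages: PageEntry[]; totalPages: number; language: string; }
interface DetailedShape {
  text: string;
  pages: PageEntry[];
  metadata: { totalPages: number; language: string; dpiScale: number };
}
type AnyOcrShape = CombinedShape | PerPageShape | DetailedShape;

// Return the whole document text regardless of the chosen output format.
function fullText(output: AnyOcrShape): string {
  if ("text" in output) {
    return output.text; // Combined Text and Detailed both carry a top-level "text"
  }
  // Per Page: join page texts (separator is an assumption of this sketch)
  return output.pages.map((page) => page.text).join("\n\n");
}

// Return the page count, which Detailed nests under "metadata".
function pageCount(output: AnyOcrShape): number {
  return "metadata" in output ? output.metadata.totalPages : output.totalPages;
}
```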

Workflow Examples

Extract Text from a PDF Downloaded from a URL

Manual Trigger - Start workflow manually
HTTP Request - Download PDF from URL
PDF OCR - Extract text from PDF
Set - Process extracted text
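
The final "process extracted text" step is left open above. As one possible illustration, OCR output often contains uneven spacing and words split across line breaks, and a small cleanup pass (for example in a Code node) can tidy it. `tidyOcrText` is a hypothetical helper, and its rules are assumptions of this sketch rather than anything the node does itself.

```typescript
// Hypothetical cleanup for OCR text: normalizes whitespace and rejoins hyphenated words.
function tidyOcrText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n")                       // normalize line endings
    .replace(/([A-Za-z])-\n([a-z])/g, "$1$2")      // rejoin words hyphenated across a line break
    .split("\n")
    .map((line) => line.replace(/[ \t]+/g, " ").trim()) // collapse runs of spaces and tabs
    .join("\n")
    .replace(/\n{3,}/g, "\n\n")                    // keep at most one blank line in a row
    .trim();
}
```

The hyphen rule only handles ASCII letters and will also join genuinely hyphenated compounds that happen to break at a line end, so treat it as a starting point.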

Process PDF Files from Google Drive

Google Drive Trigger - Monitor for new PDF files
Google Drive - Download PDF file
PDF OCR - Extract text with multi-language support
Gmail - Email extracted text

Features

✅ Free and Open Source - No API keys, subscriptions, or system dependencies required
✅ Multi-language OCR - Supports 100+ languages
✅ Multi-page Processing - Handles PDFs with multiple pages
✅ Flexible Output - Choose from different output formats
✅ Error Handling - Robust error handling with continue-on-fail support
✅ Memory Efficient - Processes pages individually and cleans up temporary files
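
This README does not document what a failed item looks like when continue-on-fail is enabled. The sketch below assumes the common n8n convention of emitting an item whose JSON carries an `error` field, and separates such items from successful ones so they can be routed differently. `OcrRunItem` and `splitByOutcome` are hypothetical names.

```typescript
// Minimal item shape for this sketch; "error" is assumed to be set only on failed items.
interface OcrRunItem {
  json: { error?: unknown; [key: string]: unknown };
}

// Separate items that carry an error from those that do not.
function splitByOutcome(items: OcrRunItem[]): { succeeded: OcrRunItem[]; failed: OcrRunItem[] } {
  const succeeded: OcrRunItem[] = [];
  const failed: OcrRunItem[] = [];
  for (const item of items) {
    if (item.json.error === undefined) {
      succeeded.push(item);
    } else {
      failed.push(item);
    }
  }
  return { succeeded, failed };
}
```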

Troubleshooting

Common Issues

"totalPages: 0" with No Text Extracted

This usually indicates that PDF processing failed. Common causes:

Corrupted PDF file
PDF is password protected or has restrictions
PDF contains only images without a text layer (OCR should still work on these, so rule out the other causes first)
Memory issues with very large PDFs
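
To catch this condition inside a workflow rather than by eye, a small check on the node's output can distinguish "no pages were processed" from "pages were processed but no text was found". This is a minimal sketch; `OcrHealth` and `checkOcrHealth` are hypothetical names, and the caller is expected to pass in the page count and text from whichever output format is in use.

```typescript
type OcrHealth = "ok" | "no-pages" | "no-text";

// Classify an OCR result from its page count and extracted text.
function checkOcrHealth(totalPages: number, text: string): OcrHealth {
  if (totalPages === 0) {
    return "no-pages"; // matches the "totalPages: 0" symptom described above
  }
  if (text.trim().length === 0) {
    return "no-text";  // pages were processed, but OCR returned nothing (e.g. blank pages)
  }
  return "ok";
}
```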

Performance Issues

Reduce DPI scale for faster processing
Monitor memory usage for large PDFs
Consider processing PDFs in smaller batches
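
Inside n8n, the built-in Loop Over Items (Split in Batches) node is the usual way to feed items through in smaller groups. Where batching happens in code instead, the idea can be sketched as below; `chunkItems` is a hypothetical helper, and a batch size of 1 corresponds to processing one PDF at a time.

```typescript
// Split a list into consecutive batches of at most batchSize items.
function chunkItems<T>(items: T[], batchSize: number): T[][] {
  if (!(batchSize >= 1) || Math.floor(batchSize) !== batchSize) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}
```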

Memory Issues

For large PDFs, try reducing the DPI scale
Process PDFs one at a time instead of in batches
Restart n8n if memory usage becomes too high

Publishing

Want to publish this node to npm? See our comprehensive Publishing Guide for step-by-step instructions on how to:

Prepare your package for publication
Publish to npm registry
Install in n8n instances
Maintain and update your package
Follow security best practices
