n8n-nodes-disruptech-ocr
v0.1.23
Native n8n node for OCR using Tesseract.js — by Disrup-tech.com
Extract text from images and PDFs — including scanned (image-based) PDFs — in your n8n workflows. This community node provides OCR (Optical Character Recognition) for images and PDFs using Tesseract.js, with automatic fallback for scanned documents where text cannot be selected or copied.
Built by Disrup-tech.com
Features
- OCR from Images — Extract text from PNG, JPG, TIFF, BMP, and other image formats using Tesseract.js
- Extract Text from PDFs — Pull text content from native text-based PDF documents
- Scanned PDF OCR Fallback — Automatically detect and OCR image-based PDFs (scanned documents) where normal text extraction returns nothing
- Multi-language Support — OCR supports 100+ languages via Tesseract language packs
- No External APIs — All processing happens locally, no data leaves your server
Installation
Via n8n Community Nodes (Recommended)
- Open your n8n instance
- Go to Settings → Community Nodes
- Click Install a community node
- Enter: n8n-nodes-disruptech-ocr
- Click Install
- Restart n8n when prompted
Via npm (Self-hosted)
cd ~/.n8n/nodes
npm install n8n-nodes-disruptech-ocr
# Restart n8n
Docker
Mount the node into your n8n container:
docker run -it --rm \
--name n8n \
-p 5678:5678 \
-e N8N_CUSTOM_EXTENSIONS="/home/node/.n8n/custom/n8n-nodes-disruptech-ocr" \
-v n8n_data:/home/node/.n8n \
docker.n8n.io/n8nio/n8n
Usage
OCR from Image
Extract text from images using Tesseract OCR.
- Add the Baddak OCR node to your workflow
- Set Operation to OCR from Image
- Configure:
  - Input Binary Field: Name of the binary property containing the image (default: data)
  - Language: Tesseract language code (default: eng)
Example workflow:
[Read Binary File] → [Baddak OCR] → [Set Node]
Supported image formats: PNG, JPG/JPEG, TIFF, BMP, GIF, WebP
Language codes:
- eng: English
- ara: Arabic
- deu: German
- fra: French
- spa: Spanish
- chi_sim: Chinese (Simplified)
- jpn: Japanese
- Multiple languages: join codes with +, for example ara+eng or eng+deu+fra
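A multi-language value is just the individual Tesseract codes joined with +. The sketch below shows one way to build and sanity-check that string before pasting it into the Language field. The helper name buildLanguageCode and the format check are illustrative assumptions, not part of this node's API.

```typescript
// Hypothetical helper: joins Tesseract language codes with "+".
// The format check (lowercase letters, optional "_suffix" such as chi_sim)
// is an assumption based on the codes listed above, not a full validator.
function buildLanguageCode(codes: string[]): string {
  const cleaned = codes
    .map((code) => code.trim())
    .filter((code) => code.length > 0);

  if (cleaned.length === 0) {
    throw new Error("At least one language code is required");
  }

  for (const code of cleaned) {
    if (!/^[a-z]{3,}(_[a-z]+)*$/.test(code)) {
      throw new Error("Unexpected language code: " + code);
    }
  }

  // Drop duplicates while keeping the original order.
  const unique = cleaned.filter((code, index) => cleaned.indexOf(code) === index);
  return unique.join("+");
}

console.log(buildLanguageCode(["ara", "eng"])); // ara+eng
```

Listing only the languages that actually appear in the document is usually a reasonable starting point, since each extra language adds recognition work.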
Extract Text from PDF
Extract text content from PDF documents.
- Add the Baddak OCR node to your workflow
- Set Operation to Extract Text from PDF
- Configure:
  - Input Binary Field: Name of the binary property containing the PDF (default: data)
  - OCR Fallback for Scanned Pages: Enable this if the PDF is a scanned (image-based) document where text cannot be selected
  - OCR Language: Language for OCR fallback (default: eng)
Example workflow:
[HTTP Request (PDF URL)] → [Baddak OCR] → [Code Node]
Output:
{
"text": "Extracted text content...",
"pages": 5,
"ocr": false
}
When OCR fallback is triggered on a scanned PDF:
{
"text": "OCR'd text from scanned pages...",
"pages": 3,
"ocr": true
}
Examples
Basic Image OCR
- Use Read Binary File to load an image
- Connect to Baddak OCR with operation OCR from Image
- Output contains text, confidence, and words count
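A downstream step can use the confidence value to decide whether a result should be trusted. The sketch below is a minimal example under stated assumptions: the field names text, confidence, and words are taken from the list above, the 0 to 100 confidence scale is assumed from Tesseract's usual convention, and the threshold of 70 is an arbitrary placeholder.

```typescript
// Assumed output shape of the "OCR from Image" operation. Field names follow
// the list above; the 0-100 confidence scale is an assumption.
interface ImageOcrResult {
  text: string;
  confidence: number;
  words: number;
}

// Hypothetical threshold; tune it against your own documents.
const MIN_CONFIDENCE = 70;

// Returns true when the result is empty or below the confidence threshold,
// so it can be routed to manual review instead of being used directly.
function needsReview(
  result: ImageOcrResult,
  minConfidence: number = MIN_CONFIDENCE
): boolean {
  const isEmpty = result.text.trim().length === 0;
  return isEmpty || result.confidence < minConfidence;
}

const sample: ImageOcrResult = { text: "Invoice 42", confidence: 91, words: 2 };
console.log(needsReview(sample)); // false
```

The same check could be applied to each item's JSON inside a Code node or expressed as a condition in an IF node.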
Scanned PDF (Image-based PDF)
- Use HTTP Request or Read Binary File to load the scanned PDF
- Connect to Baddak OCR with operation Extract Text from PDF
- Enable OCR Fallback for Scanned Pages
- Set the correct OCR Language if not English
- Output contains the OCR'd text and ocr: true
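The ocr flag tells a later step which path produced the text. As a rough sketch, assuming the output shape shown in the Output examples (text, pages, ocr), a small function can summarize each result for logging or routing. The function name describeExtraction is hypothetical.

```typescript
// Output shape of "Extract Text from PDF", as shown in the Output examples.
interface PdfTextResult {
  text: string;
  pages: number;
  ocr: boolean;
}

// Hypothetical helper: builds a one-line summary that records whether the
// text came from the PDF's own text layer or from the OCR fallback.
function describeExtraction(result: PdfTextResult): string {
  const source = result.ocr ? "OCR fallback" : "embedded text layer";
  const characters = result.text.trim().length;
  return result.pages + " page(s), " + characters + " characters via " + source;
}

console.log(describeExtraction({ text: "abc", pages: 3, ocr: true }));
// 3 page(s), 3 characters via OCR fallback
```

Logging this summary per document makes it easier to spot files that unexpectedly took the OCR path, where recognition errors are more likely.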
Process PDF and Send via Email
- HTTP Request: Download PDF from URL
- Baddak OCR: Extract text (operation: Extract Text from PDF)
- Send Email: Include extracted text in email body
Troubleshooting
Node not appearing after installation
- Restart your n8n instance
- Check the n8n logs for any errors
Low OCR accuracy
- Use higher resolution images (300 DPI recommended)
- Ensure good contrast between text and background
- Specify the correct language code
- Pre-process images to remove noise if needed
PDF extraction returns empty text
- The PDF may contain scanned images instead of selectable text
- Enable OCR Fallback for Scanned Pages in the node settings
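To catch this case inside a workflow rather than by inspection, a check along these lines can flag results that look like scanned PDFs processed without the fallback. This is a sketch that assumes the text and ocr fields from the node's PDF output; the name looksScanned is hypothetical.

```typescript
// Hypothetical check: an empty result with ocr === false suggests the PDF has
// no text layer and the OCR fallback was not enabled for this run.
function looksScanned(result: { text: string; ocr: boolean }): boolean {
  return !result.ocr && result.text.trim().length === 0;
}

console.log(looksScanned({ text: "", ocr: false })); // true
```

A workflow could branch on this value and send the file through a second Baddak OCR node that has OCR Fallback for Scanned Pages enabled.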
License
MIT
