n8n-nodes-disruptech-ocr

v0.1.23

Published

15 days ago

Native n8n node for OCR using Tesseract.js — by Disrup-tech.com


baddak

n8n-community-node-package

n8n-nodes-disruptech-ocr

Extract text from images and PDFs, including scanned (image-based) PDFs, in your n8n workflows. This community node runs OCR (Optical Character Recognition) with Tesseract.js and offers an OCR fallback for scanned PDFs whose text cannot be selected or copied.


Built by Disrup-tech.com

Features

OCR from Images — Extract text from PNG, JPG, TIFF, BMP, and other image formats using Tesseract.js
Extract Text from PDFs — Pull text content from native text-based PDF documents
Scanned PDF OCR Fallback — Automatically detect and OCR image-based PDFs (scanned documents) where normal text extraction returns nothing
Multi-language Support — OCR supports 100+ languages via Tesseract language packs
No External APIs — All processing happens locally, no data leaves your server

Installation

Via n8n Community Nodes (Recommended)

Open your n8n instance
Go to Settings → Community Nodes
Click Install a community node
Enter: n8n-nodes-disruptech-ocr
Click Install
Restart n8n when prompted

Via npm (Self-hosted)

cd ~/.n8n/nodes
npm install n8n-nodes-disruptech-ocr
# Restart n8n

Docker

Point N8N_CUSTOM_EXTENSIONS at the package directory inside the mounted n8n data volume. The package must already be present at that path:

docker run -it --rm \
  --name n8n \
  -p 5678:5678 \
  -e N8N_CUSTOM_EXTENSIONS="/home/node/.n8n/custom/n8n-nodes-disruptech-ocr" \
  -v n8n_data:/home/node/.n8n \
  docker.n8n.io/n8nio/n8n

Usage

OCR from Image

Extract text from images using Tesseract OCR.

Add the Baddak OCR node to your workflow
Set Operation to OCR from Image
Configure:
- Input Binary Field: Name of the binary property containing the image (default: data)
- Language: Tesseract language code (default: eng)

Example workflow:

[Read Binary File] → [Baddak OCR] → [Set Node]

Supported image formats: PNG, JPG/JPEG, TIFF, BMP, GIF, WebP

Language codes:

eng - English
ara - Arabic
deu - German
fra - French
spa - Spanish
chi_sim - Chinese (Simplified)
jpn - Japanese
Multiple languages: ara+eng, eng+deu+fra
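
If the language list is assembled in code (for example in a Code node placed before the OCR node), the `+`-joined form shown above can be built with a small helper. This is a minimal sketch; `buildLanguageParam` is a hypothetical name and is not part of this package.

```typescript
// Hypothetical helper (not part of the node): builds a value for the
// Language / OCR Language field from a list of Tesseract language codes.
function buildLanguageParam(codes: string[]): string {
  // Trim whitespace and drop empty entries.
  const cleaned = codes
    .map((code) => code.trim())
    .filter((code) => code.length > 0);

  // Fall back to the documented default when nothing usable was given.
  if (cleaned.length === 0) {
    return "eng";
  }

  // Remove duplicates while preserving order, then join with "+".
  const unique = cleaned.filter((code, index) => cleaned.indexOf(code) === index);
  return unique.join("+");
}

console.log(buildLanguageParam(["ara", "eng"])); // ara+eng
console.log(buildLanguageParam(["eng", "deu", "eng", "fra"])); // eng+deu+fra
```

The helper only formats the string. It does not check that a code is a valid Tesseract language.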

Extract Text from PDF

Extract text content from PDF documents.

Add the Baddak OCR node to your workflow
Set Operation to Extract Text from PDF
Configure:
- Input Binary Field: Name of the binary property containing the PDF (default: data)
- OCR Fallback for Scanned Pages: Enable this if the PDF is a scanned document (image-based) where text cannot be selected
- OCR Language: Language for OCR fallback (default: eng)

Example workflow:

[HTTP Request (PDF URL)] → [Baddak OCR] → [Code Node]

Output:

{
  "text": "Extracted text content...",
  "pages": 5,
  "ocr": false
}

When OCR fallback is triggered on a scanned PDF:

{
  "text": "OCR'd text from scanned pages...",
  "pages": 3,
  "ocr": true
}
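
A downstream step may want to confirm that an item has this shape before using it. The sketch below assumes only the three fields shown in the samples above (`text`, `pages`, `ocr`); the interface and function names are illustrative and not exported by this package.

```typescript
// Shape taken from the sample output above.
interface PdfTextResult {
  text: string;
  pages: number;
  ocr: boolean;
}

// Type guard: true when an unknown value carries the three documented fields.
function isPdfTextResult(value: unknown): value is PdfTextResult {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate["text"] === "string" &&
    typeof candidate["pages"] === "number" &&
    typeof candidate["ocr"] === "boolean"
  );
}

// Hypothetical helper: one-line summary suitable for logging.
function describeResult(result: PdfTextResult): string {
  const source = result.ocr ? "OCR fallback" : "embedded text";
  return `${result.pages} page(s), ${result.text.length} characters, source: ${source}`;
}

const parsed: unknown = JSON.parse(
  '{"text": "Extracted text content...", "pages": 5, "ocr": false}'
);
if (isPdfTextResult(parsed)) {
  console.log(describeResult(parsed));
}
```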

Examples

Basic Image OCR

Use Read Binary File to load an image
Connect to Baddak OCR with operation OCR from Image
Output contains the extracted text, a confidence score, and a word count
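
One way to use the confidence value is to sort results into accept, review, and reject buckets. The sketch below is illustrative only: the field names `text`, `confidence`, and `words` are assumed from the description above, the 0 to 100 scale is assumed to match what Tesseract.js reports, and the thresholds are arbitrary starting points.

```typescript
// Assumed output shape for "OCR from Image"; field names are assumptions.
interface ImageOcrResult {
  text: string;
  confidence: number; // assumed to be on a 0-100 scale
  words: number; // assumed to be the word count
}

type Triage = "accept" | "review" | "reject";

// Hypothetical helper: classifies a result by its confidence score.
function triageOcrResult(
  result: ImageOcrResult,
  acceptAt: number = 80,
  reviewAt: number = 50
): Triage {
  // Nothing recognized at all: reject regardless of the reported score.
  if (result.words === 0 || result.text.trim().length === 0) {
    return "reject";
  }
  if (result.confidence >= acceptAt) {
    return "accept";
  }
  if (result.confidence >= reviewAt) {
    return "review";
  }
  return "reject";
}

console.log(triageOcrResult({ text: "Invoice 2024", confidence: 91, words: 2 })); // accept
console.log(triageOcrResult({ text: "lnv0ice", confidence: 42, words: 1 })); // reject
```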

Scanned PDF (Image-based PDF)

Use HTTP Request or Read Binary File to load the scanned PDF
Connect to Baddak OCR with operation Extract Text from PDF
Enable OCR Fallback for Scanned Pages
Set the correct OCR Language if not English
Output contains OCR'd text and ocr: true

Process PDF and Send via Email

HTTP Request — Download PDF from URL
Baddak OCR — Extract text (operation: Extract Text from PDF)
Send Email — Include extracted text in email body
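
Between the OCR step and the email step, the extracted text usually needs light shaping. A minimal sketch follows, assuming the `text`, `pages`, and `ocr` output fields documented under Usage; `buildEmailBody` and the 2000-character limit are hypothetical choices, not part of this package.

```typescript
// Output shape as documented for "Extract Text from PDF".
interface PdfTextResult {
  text: string;
  pages: number;
  ocr: boolean;
}

// Hypothetical helper: turns an extraction result into a plain-text email body.
function buildEmailBody(result: PdfTextResult, maxChars: number = 2000): string {
  // Header line notes the page count and whether OCR fallback was used.
  const header =
    `Extracted from ${result.pages} page(s)` +
    (result.ocr ? " using OCR fallback" : "");

  // Trim surrounding whitespace and cap very long documents.
  const trimmed = result.text.trim();
  const body =
    trimmed.length > maxChars
      ? trimmed.slice(0, maxChars) + "\n[truncated]"
      : trimmed;

  return `${header}\n\n${body}`;
}

console.log(buildEmailBody({ text: "  Quarterly report  ", pages: 3, ocr: true }));
```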

Troubleshooting

Node not appearing after installation

Restart your n8n instance
Check the n8n logs for any errors

Low OCR accuracy

Use higher resolution images (300 DPI recommended)
Ensure good contrast between text and background
Specify the correct language code
Pre-process images to remove noise if needed
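
To check whether a scan meets the 300 DPI recommendation, divide its pixel width by the physical width of the page in inches. The helpers below are a sketch of that arithmetic; the function names are hypothetical, and the physical page size has to be known or assumed (US Letter, 8.5 x 11 in, is used here).

```typescript
// Effective resolution of a scan along one dimension.
function effectiveDpi(pixels: number, physicalInches: number): number {
  if (physicalInches <= 0) {
    throw new RangeError("physical size must be positive");
  }
  return pixels / physicalInches;
}

// Pixels needed along one dimension to reach a target resolution.
function pixelsNeeded(physicalInches: number, targetDpi: number = 300): number {
  return Math.ceil(physicalInches * targetDpi);
}

// A 1275 px wide scan of an 8.5 in wide page is only 150 DPI.
console.log(effectiveDpi(1275, 8.5)); // 150

// Target size for a US Letter page at 300 DPI.
console.log(pixelsNeeded(8.5), pixelsNeeded(11)); // 2550 3300
```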

PDF extraction returns empty text

The PDF may contain scanned images instead of selectable text
Enable OCR Fallback for Scanned Pages in the node settings
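
This condition can also be detected in a downstream step, for example to route such files to a second pass with the fallback enabled. A minimal sketch, assuming the `text`, `pages`, and `ocr` output fields documented under Usage; `looksLikeScannedPdf` is a hypothetical name.

```typescript
// Output shape as documented for "Extract Text from PDF".
interface PdfTextResult {
  text: string;
  pages: number;
  ocr: boolean;
}

// Heuristic: the PDF has pages, OCR was not used, and no text came back.
function looksLikeScannedPdf(result: PdfTextResult): boolean {
  return !result.ocr && result.pages > 0 && result.text.trim().length === 0;
}

console.log(looksLikeScannedPdf({ text: "", pages: 3, ocr: false })); // true
console.log(looksLikeScannedPdf({ text: "Hello", pages: 3, ocr: false })); // false
```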

License

MIT
