n8n-nodes-pdf-ocr

v1.3.0

n8n node for PDF OCR text extraction using Tesseract.js

This is an n8n community node that lets you extract text from PDF files using OCR (Optical Character Recognition) in your n8n workflows.

PDF OCR uses Tesseract.js for text recognition and PDF.js for PDF processing, providing a completely free solution without requiring external APIs, accounts, or system dependencies.

n8n is a fair-code licensed workflow automation platform.

  • Installation
  • Operations
  • Compatibility
  • Usage
  • Troubleshooting
  • Publishing
  • Resources

Installation

Follow the installation guide in the n8n community nodes documentation.

Note: This node uses pure JavaScript libraries and doesn't require any external system dependencies.

Operations

PDF OCR

  • Extract text from PDF files - Converts PDF pages to images and uses OCR to extract text
  • Multi-page support - Processes all pages in a PDF document
  • Multi-language support - Supports over 100 languages including English, Spanish, French, German, Italian, Portuguese, Russian, Chinese Simplified, Japanese, and Korean
  • Multiple output formats - Choose between combined text, per-page text, or detailed output with metadata
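
The page-by-page flow described above can be sketched roughly as follows. This is a minimal illustration, not the node's actual source: `RenderPage`, `RecognizeImage`, and `ocrAllPages` are hypothetical names, and the two callbacks stand in for whatever PDF.js rendering and Tesseract.js recognition code the node really uses.

```typescript
// Hypothetical function types: stand-ins for a PDF.js-based renderer and a
// Tesseract.js-based recognizer. They are not the node's actual internals.
type RenderPage = (pageNumber: number, dpiScale: number) => Promise<Uint8Array>;
type RecognizeImage = (image: Uint8Array, language: string) => Promise<string>;

// Render each page to an image, run OCR on it, and collect the text per page.
async function ocrAllPages(
  totalPages: number,
  renderPage: RenderPage,
  recognize: RecognizeImage,
  language: string = "eng",
  dpiScale: number = 2
): Promise<string[]> {
  const pageTexts: string[] = [];
  // Pages are handled one after another, so only one rendered image
  // needs to be held in memory at a time.
  for (let pageNumber = 1; pageNumber <= totalPages; pageNumber++) {
    const image = await renderPage(pageNumber, dpiScale);
    const text = await recognize(image, language);
    pageTexts.push(text.trim());
  }
  return pageTexts;
}
```

Passing the renderer and recognizer in as parameters keeps the sketch free of library-specific calls and makes the loop easy to exercise with fakes.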

Parameters

  • Input Binary Property - Name of the binary property containing the PDF file (default: "data")
  • Language - Language for OCR text recognition (default: English)
  • Output Format - How to format the extracted text:
    • Combined Text: All text combined into a single string
    • Per Page: Array of text for each page
    • Detailed: Combined text + per-page text + metadata
  • DPI Scale - Scale factor for rendering PDF pages (default: 2; higher values improve OCR quality but slow processing and use more memory)
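
For readers who think in types, the parameters above could be modeled along these lines. The property names, the `OutputFormat` values, and the `resolveOptions` helper are illustrative assumptions; the node's internal parameter keys are not documented here, and the default output format is a guess.

```typescript
// Hypothetical names: the node's real parameter keys are not shown in this README.
type OutputFormat = "combined" | "perPage" | "detailed";

interface PdfOcrOptions {
  binaryPropertyName: string; // "Input Binary Property"
  language: string;           // Tesseract language code, e.g. "eng" for English
  outputFormat: OutputFormat; // "Output Format"
  dpiScale: number;           // "DPI Scale"
}

const DEFAULT_OPTIONS: PdfOcrOptions = {
  binaryPropertyName: "data",
  language: "eng",
  outputFormat: "combined", // assumed default; the README does not state one
  dpiScale: 2,
};

// Merge user overrides onto the documented defaults and reject unusable values.
function resolveOptions(overrides: Partial<PdfOcrOptions> = {}): PdfOcrOptions {
  const options: PdfOcrOptions = { ...DEFAULT_OPTIONS, ...overrides };
  if (!Number.isFinite(options.dpiScale) || options.dpiScale <= 0) {
    throw new RangeError(`dpiScale must be a positive number, got ${options.dpiScale}`);
  }
  if (options.binaryPropertyName.trim() === "") {
    throw new Error("binaryPropertyName must not be empty");
  }
  return options;
}
```

Calling `resolveOptions({ dpiScale: 1 })` would keep the other defaults and only lower the render scale.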

Compatibility

  • Minimum n8n version: 0.187.0
  • Node.js version: >=20.15
  • System requirements: None (pure JavaScript)

Usage

Basic Usage

  1. Add the PDF OCR node to your workflow
  2. Connect it to a node that provides PDF binary data (e.g., HTTP Request, Read Binary File)
  3. Configure the input binary property name
  4. Select the desired language for OCR
  5. Choose the output format
  6. Run the workflow

Output Examples

Combined Text Output

{
  "text": "This is the combined text from all pages...",
  "totalPages": 3,
  "language": "eng"
}

Per Page Output

{
  "pages": [
    {
      "pageNumber": 1,
      "text": "Text from page 1..."
    },
    {
      "pageNumber": 2,
      "text": "Text from page 2..."
    }
  ],
  "totalPages": 2,
  "language": "eng"
}

Detailed Output

{
  "text": "Combined text from all pages...",
  "pages": [
    {
      "pageNumber": 1,
      "text": "Text from page 1..."
    }
  ],
  "metadata": {
    "totalPages": 1,
    "language": "eng",
    "dpiScale": 2
  }
}
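
The three shapes above are related: each can be derived from the same list of per-page strings. The sketch below shows one way to build them. The type and function names are hypothetical, and the blank-line separator used for the combined text is an assumption, since the README does not say how pages are joined.

```typescript
interface PageText {
  pageNumber: number;
  text: string;
}

interface CombinedOutput { text: string; totalPages: number; language: string; }
interface PerPageOutput { pages: PageText[]; totalPages: number; language: string; }
interface DetailedOutput {
  text: string;
  pages: PageText[];
  metadata: { totalPages: number; language: string; dpiScale: number };
}

// Assumed separator between pages in the combined text.
const PAGE_SEPARATOR = "\n\n";

// Page numbers are 1-based, matching the examples above.
function toPages(pageTexts: string[]): PageText[] {
  return pageTexts.map((text, index) => ({ pageNumber: index + 1, text }));
}

function toCombined(pageTexts: string[], language: string): CombinedOutput {
  return { text: pageTexts.join(PAGE_SEPARATOR), totalPages: pageTexts.length, language };
}

function toPerPage(pageTexts: string[], language: string): PerPageOutput {
  return { pages: toPages(pageTexts), totalPages: pageTexts.length, language };
}

function toDetailed(pageTexts: string[], language: string, dpiScale: number): DetailedOutput {
  return {
    text: pageTexts.join(PAGE_SEPARATOR),
    pages: toPages(pageTexts),
    metadata: { totalPages: pageTexts.length, language, dpiScale },
  };
}
```

Note that the Detailed format nests `totalPages` and `language` under `metadata`, while the other two formats keep them at the top level, so downstream steps need to read them from the right place.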

Workflow Examples

Extract Text from Uploaded PDF

  1. Manual Trigger - Start workflow manually
  2. HTTP Request - Download PDF from URL
  3. PDF OCR - Extract text from PDF
  4. Set - Process extracted text

Process PDF Files from Google Drive

  1. Google Drive Trigger - Monitor for new PDF files
  2. Google Drive - Download PDF file
  3. PDF OCR - Extract text with multi-language support
  4. Gmail - Email extracted text

Features

  • Free and Open Source - No API keys, subscriptions, or system dependencies required
  • Multi-language OCR - Supports 100+ languages
  • Multi-page Processing - Handles PDFs with multiple pages
  • Flexible Output - Choose from different output formats
  • Error Handling - Robust error handling with continue-on-fail support
  • Memory Efficient - Processes pages individually and cleans up temporary files

Troubleshooting

Common Issues

"totalPages: 0" with No Text Extracted

This usually indicates that PDF processing failed. Common causes:

  • Corrupted PDF file
  • PDF is password protected or has restrictions
  • PDF contains only scanned images with no text layer (OCR should still handle these, so rule out the other causes first)
  • Memory issues with very large PDFs
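
A downstream step can detect this condition instead of passing empty text along. The following is a minimal sketch under the assumption that the result matches one of the output shapes documented in this README; `OcrResultLike`, `OcrStatus`, and `checkOcrResult` are hypothetical names, not part of the node.

```typescript
// Loose shape covering the Combined, Per Page, and Detailed outputs.
interface OcrResultLike {
  text?: string;
  pages?: { text: string }[];
  totalPages?: number;
  metadata?: { totalPages?: number };
}

type OcrStatus = "ok" | "no-pages" | "no-text";

function checkOcrResult(result: OcrResultLike): OcrStatus {
  // Detailed output nests totalPages under metadata; the other formats do not.
  const totalPages = result.totalPages ?? result.metadata?.totalPages ?? 0;
  if (totalPages === 0) {
    return "no-pages"; // PDF processing likely failed before OCR ran
  }
  // Per Page output has no top-level text, so fall back to the page texts.
  const text = result.text ?? (result.pages ?? []).map((p) => p.text).join("");
  return text.trim() === "" ? "no-text" : "ok";
}
```

A "no-pages" status points at the causes listed above, while "no-text" with a non-zero page count suggests the pages rendered but OCR found nothing readable.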

Performance Issues

  • Reduce DPI scale for faster processing
  • Monitor memory usage for large PDFs
  • Consider processing PDFs in smaller batches

Memory Issues

  • For large PDFs, try reducing the DPI scale
  • Process PDFs one at a time instead of in batches
  • Restart n8n if memory usage becomes too high
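
To see why lowering the DPI scale helps, it can be useful to estimate the size of one rendered page. The sketch below rests on two assumptions that may not match the node's renderer exactly: each pixel takes 4 bytes (RGBA), and a scale of 1 maps one PDF point to one pixel. `estimateBitmapBytes` is a hypothetical helper.

```typescript
// Assumption: uncompressed RGBA bitmap, 4 bytes per pixel.
const BYTES_PER_PIXEL = 4;

// Estimate the raw bitmap size of one page rendered at the given scale.
// Page dimensions are in PDF points; scale 1 is assumed to be 1 point = 1 pixel.
function estimateBitmapBytes(widthPt: number, heightPt: number, dpiScale: number): number {
  const widthPx = Math.ceil(widthPt * dpiScale);
  const heightPx = Math.ceil(heightPt * dpiScale);
  return widthPx * heightPx * BYTES_PER_PIXEL;
}

// US Letter is 612 x 792 points.
for (const scale of [1, 2, 3]) {
  const mib = estimateBitmapBytes(612, 792, scale) / (1024 * 1024);
  console.log(`scale ${scale}: ~${mib.toFixed(1)} MiB per page`);
}
```

Under these assumptions a US Letter page comes to roughly 1.8 MiB at scale 1, 7.4 MiB at the default scale of 2, and 16.6 MiB at scale 3. The size grows with the square of the scale, so dropping from 2 to 1 cuts the per-page bitmap to a quarter.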

Publishing

To publish this node to npm, see the Publishing Guide, which gives step-by-step instructions on how to:

  • Prepare your package for publication
  • Publish to npm registry
  • Install in n8n instances
  • Maintain and update your package
  • Follow security best practices

Resources