n8n-nodes-pdf-parse

v1.6.1

N8N community node for parsing PDF files to text with advanced configuration options

N8N PDF Parse Node

A robust N8N community node for parsing PDF files and extracting text content with advanced configuration options.

Features

  • 🤖 AI-Optimized Text Extraction: Enhanced pdf-parse engine with superior AI-friendly formatting
  • 🖼️ PDF to Image Conversion: High-quality PDF to PNG/JPEG conversion using pdf2pic with GraphicsMagick backend
  • Raw Mode (Default): Preserves all line breaks and document structure for optimal AI processing
  • Multiple Formatting Options: Raw, Smart, Minimal, Structured, Visual, and Compact modes
  • Perfect for Document Analysis: Purchase orders, invoices, forms, and tables maintain layout
  • Enhanced Line Break Preservation: Keeps document structure intact for LLM processing
  • Dual Operations: Text parsing and image conversion in one node
  • Multiple Input Sources: Binary data and URL sources
  • Advanced Options: Page ranges, DPI control, custom dimensions, format selection
  • Comprehensive Output: Text, images, metadata, and statistics
  • Robust Error Handling: Detailed validation and graceful failure handling
  • TypeScript: Full type safety and IntelliSense support

Installation

System Requirements

PDF to Image conversion requires two system dependencies, GraphicsMagick and Ghostscript. Install them for your environment:

Docker Environment (N8N Docker)

Option 1: Using docker-compose with custom Dockerfile (Recommended)

Create a Dockerfile in your project directory:

FROM n8nio/n8n:latest
USER root
# Alpine-based image uses apk package manager
RUN apk update && apk add --no-cache \
    graphicsmagick \
    ghostscript
USER node

Create a docker-compose.yml file:

version: '3.8'

services:
  n8n:
    build: .
    container_name: n8n
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_HOST=localhost
      - N8N_PORT=5678
      - N8N_PROTOCOL=http
      - NODE_ENV=production
      - WEBHOOK_URL=http://localhost:5678/
      - GENERIC_TIMEZONE=America/New_York
      # Allow this package to be imported inside Code nodes (not needed to load the community node itself)
      - NODE_FUNCTION_ALLOW_EXTERNAL=n8n-nodes-pdf-parse
    volumes:
      - n8n_data:/home/node/.n8n
      - ./custom-nodes:/home/node/.n8n/nodes
    networks:
      - n8n-network

volumes:
  n8n_data:

networks:
  n8n-network:
    driver: bridge

Then run:

docker-compose up -d

Option 2: Quick installation in running container (Temporary)

If you're already running n8n in Docker and need a quick fix:

# Access the container (replace 'n8n' with your container name)
docker exec -it --user root n8n /bin/sh

# Install dependencies
apk update && apk add --no-cache graphicsmagick ghostscript

# Exit container
exit

Note: This method is temporary. The packages are lost when the container is recreated (for example after pulling a new image or running docker-compose down). Use Option 1 for a permanent solution.

Ubuntu/Debian

sudo apt-get update
sudo apt-get install graphicsmagick ghostscript

CentOS/RHEL/Fedora

sudo yum install GraphicsMagick ghostscript
# or for newer versions:
sudo dnf install GraphicsMagick ghostscript

macOS

brew install graphicsmagick ghostscript

Windows

Download and install GraphicsMagick and Ghostscript using their official Windows installers.

Node Installation for Self-Hosted n8n

For Docker-Based n8n (Most Common)

If you're running n8n with Docker, you need to install the node inside the container. Add this to your docker-compose.yml:

version: '3.8'

services:
  n8n:
    image: n8nio/n8n:latest
    container_name: n8n
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_HOST=localhost
      - N8N_PORT=5678
      - N8N_PROTOCOL=http
      - NODE_ENV=production
      - WEBHOOK_URL=http://localhost:5678/
      - GENERIC_TIMEZONE=America/New_York
    volumes:
      - n8n_data:/home/node/.n8n
      - ./custom-nodes:/home/node/.n8n/nodes
    command: >
      /bin/sh -c "
        npm install -g n8n-nodes-pdf-parse &&
        n8n start
      "
    networks:
      - n8n-network

volumes:
  n8n_data:

networks:
  n8n-network:
    driver: bridge

Note: For PDF to Image conversion to work, you'll also need GraphicsMagick and Ghostscript. See the "System Requirements" section above for Docker setup with these dependencies.

Option 1: Install via npm (For Non-Docker Installations)

npm install n8n-nodes-pdf-parse

Option 2: Manual Installation

  1. Navigate to your N8N installation directory
  2. Go to the ~/.n8n/custom directory (create if it doesn't exist)
  3. Clone or download this repository
  4. Install dependencies and build:
cd n8n-pdf-parse  # directory created by cloning the repository
npm install
npm run build

Option 3: Global Installation

npm install -g n8n-nodes-pdf-parse

After installation, restart your N8N instance to load the new node.

Configuration

Environment Variables

For self-hosted N8N instances, you can set these environment variables:

# Allow community nodes
N8N_NODES_INCLUDE=["n8n-nodes-pdf-parse"]

# Or allow all community nodes
N8N_NODES_EXCLUDE=[]

Usage

Basic Usage

  1. Add the "PDF Parse" node to your workflow
  2. Connect it to a node that provides PDF data (e.g., HTTP Request, File Read)
  3. Configure the source type (Binary Data or URL)
  4. Set the binary property name or URL
  5. Configure additional options as needed

Node Parameters

Required Parameters

  • Operation: Choose between "Parse PDF" or "Convert to Image"
  • PDF Source: Choose between "Binary Data" or "URL"
  • Binary Property: Name of the binary property containing the PDF (when using binary data)
  • URL: URL of the PDF file to parse (when using URL source)
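The required parameters depend on the chosen source. As a rough sketch of that rule (not the node's actual code), the check below uses the field names from the example workflows in this README; the `PdfSourceParams` type and `validateSourceParams` function are hypothetical names introduced only for illustration.

```typescript
// Hypothetical parameter shape, modeled on the example workflows in this README.
interface PdfSourceParams {
  source: "binary" | "url";
  binaryPropertyName?: string;
  url?: string;
}

// Returns a list of problems; an empty list means the parameters look usable.
function validateSourceParams(params: PdfSourceParams): string[] {
  const problems: string[] = [];
  if (params.source === "binary") {
    // Binary input needs the name of the property that holds the PDF.
    if (!params.binaryPropertyName || params.binaryPropertyName.trim() === "") {
      problems.push("binaryPropertyName is required when source is 'binary'");
    }
    return problems;
  }
  // URL input needs a syntactically valid http(s) URL.
  if (!params.url) {
    problems.push("url is required when source is 'url'");
    return problems;
  }
  try {
    const parsed = new URL(params.url);
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      problems.push("url must use http or https");
    }
  } catch {
    problems.push("url is not a valid URL");
  }
  return problems;
}
```

Restricting URLs to http and https is an assumption of this sketch; the README only says that the URL format is validated.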

Optional Parameters

  • Output Property Name: Property name to store the result (default: "result")
  • Max Pages: Maximum number of pages to process (0 = all pages)
  • Page Range Start: Starting page number (1-based)
  • Page Range End: Ending page number (0 = last page)
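To make the conventions above concrete (1-based start, 0 meaning "last page" or "all pages"), here is a minimal sketch of how the three options could be combined into one page selection. How the node itself combines Max Pages with the page range is not documented here, so the rule used below (Max Pages caps the length of the selected range) is an assumption, and `resolvePageSelection` is a hypothetical helper.

```typescript
interface PageSelection {
  first: number; // 1-based, inclusive
  last: number;  // 1-based, inclusive
}

// Assumed semantics: rangeEnd = 0 means "last page", maxPages = 0 means "no cap".
function resolvePageSelection(
  totalPages: number,
  rangeStart = 1,
  rangeEnd = 0,
  maxPages = 0,
): PageSelection | null {
  if (totalPages < 1) return null;
  const first = Math.max(1, rangeStart);
  let last = rangeEnd === 0 ? totalPages : Math.min(rangeEnd, totalPages);
  if (first > last) return null; // range starts past the end of the document
  if (maxPages > 0) {
    last = Math.min(last, first + maxPages - 1);
  }
  return { first, last };
}
```

For a 25-page document, `resolvePageSelection(25, 5, 15)` selects pages 5 to 15, and `resolvePageSelection(25, 1, 0, 10)` selects the first ten pages.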

Text Parsing Options (Parse PDF Operation)

  • Text Formatting: Choose formatting style:
    • Raw (Best for AI): Preserves all line breaks and document structure
    • Smart Layout: Intelligent layout preservation with enhanced spacing
    • Visual Layout: Universal layout preservation - replicates human text selection patterns
    • Minimal Cleanup: Removes extra spaces but keeps line breaks
    • Structured: Cleans formatting while preserving structure
    • Compact: Removes most whitespace for compact text
  • Include Metadata: Include PDF metadata in output
  • Split by Pages: Return text split by pages as an array
  • Version: PDF.js version to use for parsing
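The difference between the formatting styles is easiest to see on a small input. The sketch below approximates two of them, Minimal Cleanup and Compact, from their one-line descriptions above; it is an illustration only, and the function names `minimalCleanup` and `compactText` are hypothetical rather than part of the node.

```typescript
// "Minimal Cleanup": collapse runs of spaces/tabs within each line, keep line breaks.
function minimalCleanup(text: string): string {
  return text
    .split("\n")
    .map((line) => line.replace(/[ \t]+/g, " ").trim())
    .join("\n");
}

// "Compact": collapse all whitespace, including line breaks, into single spaces.
function compactText(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

const sample = "Item   Qty\n  Widget    2  ";
console.log(JSON.stringify(minimalCleanup(sample))); // "Item Qty\nWidget 2"
console.log(JSON.stringify(compactText(sample)));    // "Item Qty Widget 2"
```

Raw mode would return `sample` unchanged, which is why it keeps table-like layouts readable for downstream LLM processing.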

Image Conversion Options (Convert to Image Operation)

  • Image Format: Choose between PNG or JPEG output
    • PNG: Better quality, transparency support, larger files
    • JPEG: Smaller files, no transparency, good for photos
  • DPI (Resolution): 72-600 dots per inch (default: 150)
    • Higher DPI = better quality but larger files
    • 72 DPI = screen resolution, 300 DPI = print quality
  • Width: Custom width in pixels (0 = auto based on DPI)
  • Height: Custom height in pixels (0 = auto based on DPI)
  • Preserve Aspect Ratio: Maintain original proportions when resizing (default: true)
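The relationship between DPI and output size follows from PDF page sizes being measured in points (72 points per inch): pixels = points / 72 × DPI. The sketch below applies that formula and one plausible reading of the Width, Height, and Preserve Aspect Ratio options. The exact resizing behavior depends on pdf2pic and GraphicsMagick, so treat `targetPixelSize` as a hypothetical estimate rather than the node's logic.

```typescript
interface PixelSize {
  width: number;
  height: number;
}

// Page size is given in PDF points (1 pt = 1/72 inch). width/height of 0 mean "auto".
function targetPixelSize(
  pageWidthPt: number,
  pageHeightPt: number,
  dpi: number,
  width = 0,
  height = 0,
  preserveAspectRatio = true,
): PixelSize {
  const autoWidth = Math.round((pageWidthPt / 72) * dpi);
  const autoHeight = Math.round((pageHeightPt / 72) * dpi);
  if (width === 0 && height === 0) {
    return { width: autoWidth, height: autoHeight };
  }
  if (!preserveAspectRatio) {
    // Use whatever was given and fall back to the DPI-based value for the rest.
    return { width: width || autoWidth, height: height || autoHeight };
  }
  if (width > 0 && height > 0) {
    // Assumed behavior: fit the page inside the requested box.
    const scale = Math.min(width / autoWidth, height / autoHeight);
    return { width: Math.round(autoWidth * scale), height: Math.round(autoHeight * scale) };
  }
  const aspect = pageWidthPt / pageHeightPt;
  if (width > 0) {
    return { width, height: Math.round(width / aspect) };
  }
  return { width: Math.round(height * aspect), height };
}
```

A US Letter page (612 × 792 pt) at the default 150 DPI comes out at 1275 × 1650 pixels; at 300 DPI it is 2550 × 3300, four times as many pixels.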

Example Workflows

Example 1: Parse PDF from URL

{
  "nodes": [
    {
      "parameters": {
        "operation": "parse",
        "source": "url",
        "url": "https://example.com/document.pdf",
        "outputProperty": "extractedText",
        "additionalOptions": {
          "normalizeWhitespace": true,
          "includeMetadata": true
        }
      },
      "name": "PDF Parse",
      "type": "n8n-nodes-pdf-parse.pdfParse"
    }
  ]
}

Example 2: Parse PDF from Binary Data

{
  "nodes": [
    {
      "parameters": {
        "operation": "parse",
        "source": "binary",
        "binaryPropertyName": "data",
        "outputProperty": "pdfText",
        "additionalOptions": {
          "maxPages": 10,
          "splitByPages": true
        }
      },
      "name": "PDF Parse",
      "type": "n8n-nodes-pdf-parse.pdfParse"
    }
  ]
}

Example 3: Parse Specific Page Range

{
  "nodes": [
    {
      "parameters": {
        "operation": "parse",
        "source": "binary",
        "binaryPropertyName": "document",
        "additionalOptions": {
          "pageRangeStart": 5,
          "pageRangeEnd": 15,
          "normalizeWhitespace": true
        }
      },
      "name": "PDF Parse",
      "type": "n8n-nodes-pdf-parse.pdfParse"
    }
  ]
}

Output Format

Standard Output

{
  "text": "Extracted PDF text content...",
  "numPages": 25,
  "pdfStats": {
    "textLength": 15420,
    "wordCount": 2156,
    "pageCount": 25
  }
}

With Metadata

{
  "text": "Extracted PDF text content...",
  "numPages": 25,
  "pdfMetadata": {
    "numPages": 25,
    "info": {
      "Title": "Document Title",
      "Author": "Document Author",
      "Creator": "PDF Creator",
      "Producer": "PDF Producer",
      "CreationDate": "D:20231201120000Z",
      "ModDate": "D:20231201120000Z"
    },
    "metadata": "Additional metadata...",
    "version": "1.7"
  },
  "pdfStats": {
    "textLength": 15420,
    "wordCount": 2156,
    "pageCount": 25
  }
}
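The `CreationDate` and `ModDate` values in the metadata use the PDF date string format (`D:YYYYMMDDHHmmSS` followed by `Z` or a `+HH'mm'` style offset) rather than ISO 8601. If a workflow needs real timestamps, a small converter along these lines may help. `parsePdfDate` is a hypothetical helper, not something the node provides, and it only covers the common forms of the format.

```typescript
// Converts a PDF date string such as "D:20231201120000Z" into a Date.
// Returns null when the string does not match the expected pattern.
function parsePdfDate(value: string): Date | null {
  const pattern =
    /^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$/;
  const match = pattern.exec(value.trim());
  if (!match) return null;
  // Fields after the year are optional; unmatched groups fall back to defaults.
  const [, year, month = "01", day = "01", hour = "00", minute = "00", second = "00", , sign, offH = "00", offM = "00"] = match;
  const localAsUtc = Date.UTC(
    Number(year),
    Number(month) - 1,
    Number(day),
    Number(hour),
    Number(minute),
    Number(second),
  );
  // A "+10'00'" offset means local time is ahead of UTC, so subtract it.
  const offsetMinutes = sign
    ? (sign === "-" ? -1 : 1) * (Number(offH) * 60 + Number(offM))
    : 0;
  return new Date(localAsUtc - offsetMinutes * 60000);
}
```

Dates without an explicit offset are treated as UTC in this sketch, which is a simplification.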

Split by Pages

{
  "text": [
    "Page 1 text content...",
    "Page 2 text content...",
    "Page 3 text content..."
  ],
  "numPages": 3,
  "pdfStats": {
    "textLength": 2340,
    "wordCount": 456,
    "pageCount": 3
  }
}
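Because `text` is a string in the standard output and an array of strings when Split by Pages is enabled, downstream code has to handle both shapes. The sketch below models the `pdfStats` object as a TypeScript type and shows one way such statistics could be derived. The counting rules (whitespace-separated words, pages joined with a newline) are assumptions, and `computePdfStats` is a hypothetical name.

```typescript
// Mirrors the "pdfStats" object shown in the output examples.
interface PdfStats {
  textLength: number;
  wordCount: number;
  pageCount: number;
}

// Accepts either the single-string output or the split-by-pages array.
function computePdfStats(text: string | string[], pageCount: number): PdfStats {
  const joined = Array.isArray(text) ? text.join("\n") : text;
  // Split on any whitespace and drop empty fragments so "" counts as zero words.
  const words = joined.trim().split(/\s+/).filter((word) => word.length > 0);
  return { textLength: joined.length, wordCount: words.length, pageCount };
}

console.log(computePdfStats(["Page one", "Page two"], 2));
```

The numbers the node reports may differ slightly if it counts characters or words differently.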

Error Handling

The node includes comprehensive error handling:

  • Invalid PDF files: Validates PDF magic number
  • Network errors: Handles URL fetch failures
  • Empty files: Detects and reports empty PDF files
  • Invalid URLs: Validates URL format
  • Missing properties: Validates required parameters

When "Continue on Fail" is enabled, errors are added to the output data:

{
  "error": "Error message describing what went wrong"
}
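Two of the checks listed above, the empty-file check and the PDF magic number check, can be pre-applied in a workflow before the node runs. A PDF file starts with the bytes `%PDF-`, so a minimal version of the check looks like this. `checkPdfBytes` is a hypothetical helper and the error messages are illustrative, not the node's actual wording.

```typescript
// Returns null when the bytes look like a PDF, otherwise a short error message.
function checkPdfBytes(bytes: Uint8Array): string | null {
  if (bytes.length === 0) {
    return "PDF file is empty";
  }
  // ASCII codes for "%PDF-", the signature at the start of every PDF file.
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d];
  const hasMagic =
    bytes.length >= magic.length && magic.every((byte, index) => bytes[index] === byte);
  return hasMagic ? null : "File does not start with the %PDF- header";
}

// Shape of the item emitted when "Continue on Fail" is enabled, as shown above.
const problem = checkPdfBytes(new Uint8Array([0x3c, 0x68, 0x74, 0x6d, 0x6c])); // "<html"
if (problem !== null) {
  console.log(JSON.stringify({ error: problem }));
}
```

A common cause of "Invalid PDF" errors with URL sources is a server returning an HTML error page instead of the file, which this check catches.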

Supported PDF Features

  • ✅ Text extraction from standard PDFs
  • ✅ Multi-page documents
  • ✅ Password-protected PDFs (basic support)
  • ✅ Various PDF versions (1.0 - 2.0)
  • ✅ Embedded fonts and text encoding
  • ⚠️ OCR for scanned documents (not supported - text-based PDFs only)
  • ⚠️ Complex layouts with tables/forms (basic support)

Performance Considerations

  • Large PDFs: Use page range options to limit processing
  • Memory usage: Large PDFs may require more memory
  • Processing time: Scales with document size and complexity
  • Network timeouts: URLs should be accessible and responsive
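One way to follow the page-range advice for very large documents is to process them in batches, running the node once per batch. The sketch below generates batch boundaries using the `pageRangeStart` and `pageRangeEnd` option names from the example workflows; `pageBatches` itself is a hypothetical helper, and the batch size is something to tune for your memory budget.

```typescript
interface PageBatch {
  pageRangeStart: number; // 1-based, inclusive
  pageRangeEnd: number;   // 1-based, inclusive
}

// Splits a document of totalPages into consecutive ranges of at most batchSize pages.
function pageBatches(totalPages: number, batchSize: number): PageBatch[] {
  if (totalPages < 1 || batchSize < 1) return [];
  const batches: PageBatch[] = [];
  for (let start = 1; start <= totalPages; start += batchSize) {
    batches.push({
      pageRangeStart: start,
      pageRangeEnd: Math.min(start + batchSize - 1, totalPages),
    });
  }
  return batches;
}

console.log(pageBatches(25, 10));
```

For a 25-page document and a batch size of 10, this yields the ranges 1 to 10, 11 to 20, and 21 to 25.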

Dependencies

Node.js Dependencies

  • pdf-parse: Enhanced PDF parsing library with AI-optimized text extraction
  • pdfjs-dist: Mozilla's PDF.js library for reliable PDF text parsing
  • pdf2pic: Robust PDF to image conversion library
  • n8n-workflow: N8N workflow types and utilities

System Dependencies (Image Conversion Only)

  • GraphicsMagick: High-performance image processing library
  • Ghostscript: PostScript and PDF interpreter (required by GraphicsMagick for PDF handling)

Note: Text parsing operations require no system dependencies - only Node.js packages. Image conversion operations require GraphicsMagick and Ghostscript to be installed on the system.

Development

Setup

git clone https://github.com/ConniAU/n8n-pdf-parse.git
cd n8n-pdf-parse
npm install

Build

npm run build

Lint and Format

npm run lint
npm run format

Test

npm test

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/new-feature
  3. Commit your changes: git commit -am 'Add new feature'
  4. Push to the branch: git push origin feature/new-feature
  5. Submit a pull request

Complete Docker Setup Example

For self-hosted n8n users who want both the PDF Parse node AND image conversion capabilities, here's a complete setup:

Step 1: Create a Dockerfile

FROM n8nio/n8n:latest

USER root

# Install system dependencies for PDF to Image conversion
RUN apk update && apk add --no-cache \
    graphicsmagick \
    ghostscript \
    nodejs \
    npm

# Install the PDF Parse node globally
RUN npm install -g n8n-nodes-pdf-parse

USER node

Step 2: Create docker-compose.yml

version: '3.8'

services:
  n8n:
    build: .
    container_name: n8n-with-pdf-parse
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      # Basic n8n configuration
      - N8N_HOST=${N8N_HOST:-localhost}
      - N8N_PORT=5678
      - N8N_PROTOCOL=${N8N_PROTOCOL:-http}
      - NODE_ENV=production
      - WEBHOOK_URL=${WEBHOOK_URL:-http://localhost:5678/}
      - GENERIC_TIMEZONE=${TIMEZONE:-America/New_York}
      
      # Database configuration (optional - uses SQLite by default)
      - DB_TYPE=sqlite
      # For PostgreSQL, uncomment and configure:
      # - DB_TYPE=postgresdb
      # - DB_POSTGRESDB_HOST=postgres
      # - DB_POSTGRESDB_PORT=5432
      # - DB_POSTGRESDB_DATABASE=${DB_POSTGRESDB_DATABASE:-n8n}
      # - DB_POSTGRESDB_USER=${DB_POSTGRESDB_USER:-n8n}
      # - DB_POSTGRESDB_PASSWORD=${DB_POSTGRESDB_PASSWORD:-n8n}
      
      # Security
      - N8N_BASIC_AUTH_ACTIVE=${N8N_BASIC_AUTH_ACTIVE:-false}
      - N8N_BASIC_AUTH_USER=${N8N_BASIC_AUTH_USER:-}
      - N8N_BASIC_AUTH_PASSWORD=${N8N_BASIC_AUTH_PASSWORD:-}
      
      # Allow external and built-in modules inside Code nodes (not needed to load community nodes)
      - NODE_FUNCTION_ALLOW_EXTERNAL=*
      - NODE_FUNCTION_ALLOW_BUILTIN=*
      
    volumes:
      # Persist n8n data
      - n8n_data:/home/node/.n8n
      # Optional: Mount local workflows directory
      # - ./workflows:/home/node/.n8n/workflows
      # Optional: Mount custom nodes directory
      # - ./custom-nodes:/home/node/.n8n/nodes
      
    networks:
      - n8n-network
    
    # Health check
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:5678/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

  # Optional: PostgreSQL database for production use
  # postgres:
  #   image: postgres:14-alpine
  #   container_name: n8n-postgres
  #   restart: unless-stopped
  #   environment:
  #     - POSTGRES_USER=${DB_POSTGRESDB_USER:-n8n}
  #     - POSTGRES_PASSWORD=${DB_POSTGRESDB_PASSWORD:-n8n}
  #     - POSTGRES_DB=${DB_POSTGRESDB_DATABASE:-n8n}
  #   volumes:
  #     - postgres_data:/var/lib/postgresql/data
  #   networks:
  #     - n8n-network

volumes:
  n8n_data:
  # postgres_data:

networks:
  n8n-network:
    driver: bridge

Step 3: Create .env file (optional)

# n8n Configuration
N8N_HOST=localhost
N8N_PROTOCOL=http
WEBHOOK_URL=http://localhost:5678/
TIMEZONE=America/New_York

# Authentication (uncomment to enable)
# N8N_BASIC_AUTH_ACTIVE=true
# N8N_BASIC_AUTH_USER=admin
# N8N_BASIC_AUTH_PASSWORD=your-secure-password

# Database (for PostgreSQL)
# DB_POSTGRESDB_DATABASE=n8n
# DB_POSTGRESDB_USER=n8n
# DB_POSTGRESDB_PASSWORD=your-secure-password

Step 4: Deploy

# Build and start the services
docker-compose up -d

# View logs
docker-compose logs -f n8n

# Stop services
docker-compose down

# Stop and remove volumes (caution: deletes all data)
docker-compose down -v

Step 5: Access n8n

Open your browser and navigate to http://localhost:5678

The PDF Parse node should be available in the node selection panel under the "Transform" category.

Troubleshooting

Common Issues

  1. Node not appearing in N8N

    • Ensure the package is properly installed
    • Restart N8N after installation
    • Check N8N logs for loading errors
  2. "Invalid PDF" errors

    • Verify the file is actually a PDF
    • Check if the PDF is corrupted
    • Try with a different PDF file
  3. Image conversion errors (Missing system dependencies)

    • Error: Command failed: execvp failed, errno = 2 (No such file or directory) gm identify
    • Cause: GraphicsMagick not installed
    • Solution: Install GraphicsMagick and Ghostscript (see System Requirements above)

    Docker Environment: Use custom Dockerfile with dependencies:

    FROM n8nio/n8n:latest
    USER root
    RUN apk update && apk add --no-cache graphicsmagick ghostscript
    USER node
  4. Memory issues with large PDFs

    • Use page range options to limit processing
    • Increase the Node.js memory limit, for example by setting NODE_OPTIONS=--max-old-space-size=4096
  5. Network timeout errors

    • Check URL accessibility
    • Verify network connectivity
    • Consider downloading the file first

Debug Mode

Enable debug logging by setting the environment variable:

N8N_LOG_LEVEL=debug

License

MIT License - see LICENSE file for details.

Changelog

Version 1.0.0

  • Initial release
  • PDF text extraction with pdf-parse
  • Support for binary data and URL sources
  • Advanced parsing options
  • Comprehensive error handling
  • TypeScript implementation

Support

For issues, questions, or contributions, use the project's GitHub repository: https://github.com/ConniAU/n8n-pdf-parse

Acknowledgments

  • Built with pdf-parse
  • Designed for N8N workflow automation
  • Inspired by the N8N community's needs for PDF processing