n8n-nodes-extract-from-document

v0.0.5

Published 4 months ago

Extracts text content from documents using n8n.

Vulnerabilities: 0 high, 0 medium, 0 low

melvinotieno

n8n-community-node-package n8n-extract-from-document

Extract from Document Node

The Extract from Document node for n8n extracts text content from PDF and DOCX files in your workflows and provides options for handling the extracted content.

Features

- Extracts text content from PDF and DOCX documents
- Supports password-protected PDF files
- Option to merge all pages into a single text or return text per page
- Configurable maximum page limit
- Flexible source data preservation options
- Returns document metadata alongside extracted text

Supported File Types

| Format | MIME Type | Description |
| ------ | --------- | ----------- |
| PDF | application/pdf | Portable Document Format |
| DOCX | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Microsoft Word Document |
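Because the node rejects unsupported formats, it can help to check an item's MIME type before it reaches the node, for example in a preceding Code node. The following is a minimal sketch of such a check; `SUPPORTED_MIME_TYPES` and `isSupportedMimeType` are hypothetical names and are not part of this package.

```typescript
// MIME types copied from the Supported File Types table.
// The constant and function names are hypothetical, not part of the package.
const SUPPORTED_MIME_TYPES: ReadonlySet<string> = new Set([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]);

// Returns true when the MIME type is one of the two listed formats.
// Parameters such as "; charset=..." are dropped and case is normalized.
function isSupportedMimeType(mimeType: string): boolean {
  const [base = ""] = mimeType.split(";");
  return SUPPORTED_MIME_TYPES.has(base.trim().toLowerCase());
}
```

Items that fail this check could be routed to a separate branch instead of triggering an error in the node.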

Properties

| Property | Description |
| -------- | ----------- |
| Input Binary Field | The name of the binary field containing the document to extract text from |
| Merge Pages (option) | Whether to merge text from all pages into a single string or return an array of text per page |
| Max Pages (option) | Maximum number of pages to process. Use -1 for no limit. |
| Password (option) | Password for protected PDF documents |
| Keep Source (option) | Choose what data to preserve from the input: JSON, Binary, or Both |
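To illustrate how Merge Pages and Max Pages interact, here is a small sketch of the behavior the table describes. It is not the node's source code: the `PageOptions` interface, the `shapePages` function, and the newline used to join pages are all assumptions made for the example.

```typescript
// Hypothetical option names mirroring the Merge Pages and Max Pages options.
interface PageOptions {
  mergePages: boolean;
  maxPages: number; // -1 means no limit
}

// Applies the page limit first, then either joins the remaining pages into
// one string or returns them as an array with one entry per page.
// Assumption: pages are joined with a newline; the node may use another separator.
// Assumption: any negative value is treated like -1 (no limit).
function shapePages(pages: string[], options: PageOptions): string | string[] {
  const limited =
    options.maxPages < 0 ? pages : pages.slice(0, options.maxPages);
  return options.mergePages ? limited.join("\n") : limited;
}
```

With three pages, `{ mergePages: true, maxPages: 2 }` would yield one string built from the first two pages, while `{ mergePages: false, maxPages: -1 }` would yield an array of all three.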

Example Usage

1. Connect a node that outputs binary document data (e.g., HTTP Request, Read Binary File)
2. Add the Extract from Document node
3. Set the Input Binary Field to match your binary field name
4. Configure extraction options as needed
5. Process the extracted text in subsequent nodes

Example Output

{
  "text": "Extracted document content...",
  "numpages": 5,
  "info": {
    "PDFFormatVersion": "1.4",
    "IsAcroFormPresent": false
  },
  "metadata": {
    "Title": "Document Title",
    "Author": "Document Author"
  },
  "version": "3.4.120"
}
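When processing this output in later steps, a typed view of it can make the handling more predictable. The sketch below takes its field names from the example output; the `ExtractedDocument` interface, the helper functions, the optional fields, and the assumption that per-page text arrives as an array under `text` are illustrative and not confirmed by the package.

```typescript
// Field names follow the example output. Marking most fields optional is an
// assumption, since DOCX results may not include the same fields.
interface ExtractedDocument {
  text: string | string[]; // assumed to be string[] when pages are not merged
  numpages?: number;
  info?: Record<string, unknown>;
  metadata?: Record<string, unknown>;
  version?: string;
}

// Collapses the text to a single string regardless of the Merge Pages setting.
function fullText(doc: ExtractedDocument): string {
  return Array.isArray(doc.text) ? doc.text.join("\n") : doc.text;
}

// Returns the metadata Title when it is present and is a string.
function documentTitle(doc: ExtractedDocument): string | undefined {
  const title = doc.metadata?.["Title"];
  return typeof title === "string" ? title : undefined;
}
```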

Error Handling

The node will throw an error if:

- The specified binary field doesn't exist
- The file format is not supported
- The document is corrupted or unreadable
- A password-protected PDF is accessed without the correct password

Enable "Continue on Fail" in the node settings to handle errors gracefully and continue processing other items.
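With "Continue on Fail" enabled, a later step usually needs to tell failed items apart from successful ones. The sketch below assumes that a failed item carries a string `error` field in its JSON, which is a common n8n convention but is not confirmed for this node; the `Item` interface and `partitionByError` function are hypothetical.

```typescript
// Minimal item shape, loosely modeled on n8n items; not imported from n8n.
interface Item {
  json: Record<string, unknown>;
}

// Splits items into successes and failures.
// Assumption: a failed item has a string "error" field in its JSON.
function partitionByError(items: Item[]): { ok: Item[]; failed: Item[] } {
  const ok: Item[] = [];
  const failed: Item[] = [];
  for (const item of items) {
    if (typeof item.json["error"] === "string") {
      failed.push(item);
    } else {
      ok.push(item);
    }
  }
  return { ok, failed };
}
```

The `failed` list could then be logged or retried, for example with a different password, while the `ok` list continues through the workflow.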
