n8n-nodes-docx-converter-enhanced

v1.0.0

Published

8 months ago

Enhanced n8n community node for DOCX to text conversion with RAG capabilities, page-aware chunking, and metadata extraction. Fork of n8n-nodes-docx-converter with advanced features for AI/ML workflows.

widji.santoso

n8n-community-node-package docx text-extraction rag chunking metadata page-aware document-processing ai ml

🚀 Enhanced fork of n8n-nodes-docx-converter with advanced RAG capabilities!

This n8n community node converts DOCX files to text and adds support for RAG (Retrieval-Augmented Generation) pipelines: page-aware chunking and metadata extraction for AI/ML workflows.

✨ New Features (Enhanced Version)

📄 Page-Aware Chunking: Intelligent text chunking that preserves page boundaries
🧠 RAG-Ready Output: Optimized for AI/ML and RAG systems
📊 Metadata Extraction: Document properties, word count, estimated pages
🏗️ Structure Analysis: Heading detection and document structure mapping
🔄 Multiple Output Modes: Legacy text-only, enhanced metadata, or RAG chunks
⚡ Backward Compatible: Works with existing workflows

n8n is a fair-code licensed workflow automation platform.

📋 Table of Contents

Installation
Operations
Enhanced Features
Credentials
Compatibility
Usage
Attribution
Resources
Version History

Installation

Follow the installation guide in the n8n community nodes documentation.

Operations

DOCX to Text (Legacy)

Convert a DOCX file to plain text (backward compatible)

DOCX to Text Enhanced

Convert DOCX with metadata extraction
Page-aware chunking for RAG systems
Document structure analysis
Multiple output formats

Enhanced Features

🎯 Output Modes

Text Only (Legacy): Simple text extraction for backward compatibility
Enhanced with Metadata: Text + document metadata + structure analysis
RAG-Ready Chunks: Page-aware chunks optimized for AI/ML workflows

📊 Metadata Extraction

Document title, author, creation/modification dates
Word count and estimated page count
Subject and description fields
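
The word count and estimated page count can be illustrated with a short sketch. A DOCX file is reflowed by whatever application renders it, so a page count derived from text can only be an estimate. The helper names and the figure of 250 words per page below are assumptions chosen for illustration, not values taken from this node's source.

```typescript
// Hypothetical helpers: the node's actual implementation may differ.
// ASSUMED_WORDS_PER_PAGE is an illustrative assumption, not a documented setting.
const ASSUMED_WORDS_PER_PAGE = 250;

// Count whitespace-separated words; an empty or blank string has zero words.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Estimate pages by rounding up, so a partially filled page still counts.
// A document is never reported as having fewer than one page.
function estimatePages(
  wordCount: number,
  wordsPerPage: number = ASSUMED_WORDS_PER_PAGE,
): number {
  if (wordsPerPage <= 0) {
    throw new RangeError("wordsPerPage must be positive");
  }
  return Math.max(1, Math.ceil(wordCount / wordsPerPage));
}
```

Under this assumption, a document of 1250 words is estimated at 5 pages, and one more word pushes the estimate to 6.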

🧩 Page-Aware Chunking

Configurable chunk size (words)
Overlapping chunks for context preservation
Page boundary preservation
Section and heading awareness
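
The interaction between chunk size and overlap can be sketched as an overlapping word window. This is a minimal illustration only: the function name `chunkByWords` is hypothetical, `position` is assumed to hold word offsets with an exclusive end, and the page and section tracking that this node performs is deliberately left out.

```typescript
// Hypothetical sketch of overlapping word-window chunking.
// Page-boundary and heading awareness are omitted for brevity.
interface WordChunk {
  content: string;
  chunkIndex: number;
  // Assumed to be word offsets into the document, end exclusive.
  position: { start: number; end: number };
}

function chunkByWords(text: string, chunkSize: number, overlap: number): WordChunk[] {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be at least 0 and smaller than chunkSize");
  }

  const words = text.trim().split(/\s+/).filter((word) => word.length > 0);
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: WordChunk[] = [];

  for (let start = 0; start < words.length; start += step) {
    const end = Math.min(start + chunkSize, words.length);
    chunks.push({
      content: words.slice(start, end).join(" "),
      chunkIndex: chunks.length,
      position: { start, end },
    });
    // Stop once the window has reached the last word, so no chunk
    // consists solely of text already covered by the previous one.
    if (end === words.length) break;
  }
  return chunks;
}
```

Each chunk repeats the last `overlap` words of its predecessor, which is what keeps a sentence that straddles a chunk boundary retrievable from either side.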

🏗️ Structure Analysis

Heading detection and hierarchy
Section counting
Document outline extraction
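
Heading detection can be sketched against an HTML rendering of the document. The version below uses a regular expression so that it has no dependencies; that is a simplification chosen for this example, and a real HTML parser would be more robust against unusual markup. The `Heading` type and `extractHeadings` name are hypothetical.

```typescript
// Hypothetical sketch: collect h1-h6 headings from an HTML string.
// A regex is used here only to keep the example dependency-free.
interface Heading {
  level: number; // 1 for <h1> through 6 for <h6>
  text: string;
}

function extractHeadings(html: string): Heading[] {
  // Match an opening heading tag, its content, and the matching closing tag.
  const pattern = /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi;
  const headings: Heading[] = [];
  let match: RegExpExecArray | null;

  while ((match = pattern.exec(html)) !== null) {
    const text = match[2]
      .replace(/<[^>]+>/g, "") // drop inline tags such as <em> or <a>
      .replace(/\s+/g, " ")
      .trim();
    if (text !== "") {
      headings.push({ level: Number(match[1]), text });
    }
  }
  return headings;
}
```

The heading levels give the hierarchy, and the ordered list of heading texts serves as a simple document outline.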

Credentials

No credentials are required for this node.

Compatibility

This node requires n8n version 1.0.0 or higher. It has been tested with the latest version of n8n.

Usage

Basic Usage (Legacy Mode)

Add the "DOCX to Text" or "DOCX to Text Enhanced" node to your workflow
Configure the input binary field containing your DOCX file
Choose "Text Only (Legacy)" output mode for simple text extraction

Enhanced Usage (RAG Mode)

Add the "DOCX to Text Enhanced" node
Set output mode to "RAG-Ready Chunks"
Configure chunk size (default: 300 words)
Set chunk overlap (default: 50 words)
Enable HTML conversion for better structure preservation
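
The two chunking settings above constrain each other: the overlap has to be smaller than the chunk size, or the chunk window would never advance. A small validation sketch, using the defaults stated in the steps, might look as follows. The names `ChunkOptions`, `chunkSize`, `chunkOverlap`, and `resolveChunkOptions` are hypothetical and are not the node's actual parameter names.

```typescript
// Hypothetical option names; only the default values (300 and 50 words)
// come from this README.
interface ChunkOptions {
  chunkSize: number; // words per chunk
  chunkOverlap: number; // words shared between consecutive chunks
}

const DEFAULT_CHUNK_OPTIONS: ChunkOptions = { chunkSize: 300, chunkOverlap: 50 };

// Fill in defaults for missing values and reject combinations that cannot work.
function resolveChunkOptions(input: Partial<ChunkOptions> = {}): ChunkOptions {
  const chunkSize = input.chunkSize ?? DEFAULT_CHUNK_OPTIONS.chunkSize;
  const chunkOverlap = input.chunkOverlap ?? DEFAULT_CHUNK_OPTIONS.chunkOverlap;

  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  if (!Number.isInteger(chunkOverlap) || chunkOverlap < 0) {
    throw new RangeError("chunkOverlap must be a non-negative integer");
  }
  if (chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must be smaller than chunkSize");
  }
  return { chunkSize, chunkOverlap };
}
```

With the defaults, each new chunk contributes 250 words that the previous chunk did not contain.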

Output Examples

Enhanced Mode Output:

{
  "text": "Full document text...",
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "wordCount": 1250,
    "pageCount": 5
  },
  "structure": {
    "headings": ["Introduction", "Methods", "Results"],
    "sections": 3,
    "estimatedPages": 5
  }
}

RAG Chunks Output:

{
  "chunks": [
    {
      "content": "Chunk text content...",
      "pageStart": 1,
      "pageEnd": 1,
      "section": "Introduction",
      "chunkIndex": 0,
      "position": { "start": 0, "end": 300 }
    }
  ],
  "metadata": { ... },
  "totalChunks": 15
}
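
A downstream step would typically flatten this output into one record per chunk before embedding or indexing it. The sketch below models the shape shown in the sample with TypeScript types and maps it to such records. The type names, the `toEmbeddingRecords` helper, and the `documentId#chunkIndex` ID format are assumptions made for illustration; the contents of `metadata` are elided in the sample, so it is typed loosely here.

```typescript
// Types modeled on the sample output above; field names come from the sample,
// everything else is a hypothetical illustration.
interface RagChunk {
  content: string;
  pageStart: number;
  pageEnd: number;
  section: string;
  chunkIndex: number;
  position: { start: number; end: number };
}

interface RagOutput {
  chunks: RagChunk[];
  metadata: Record<string, unknown>; // elided in the sample, so left untyped
  totalChunks: number;
}

interface EmbeddingRecord {
  id: string;
  text: string;
  metadata: { pages: string; section: string; chunkIndex: number };
}

// Turn each chunk into a flat record that keeps its page and section context.
function toEmbeddingRecords(output: RagOutput, documentId: string): EmbeddingRecord[] {
  return output.chunks.map((chunk) => ({
    id: `${documentId}#${chunk.chunkIndex}`,
    text: chunk.content,
    metadata: {
      pages:
        chunk.pageStart === chunk.pageEnd
          ? `${chunk.pageStart}`
          : `${chunk.pageStart}-${chunk.pageEnd}`,
      section: chunk.section,
      chunkIndex: chunk.chunkIndex,
    },
  }));
}
```

Keeping the page range and section on every record lets a retrieval step cite where in the document an answer came from.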

Attribution

🙏 This project is a fork of n8n-nodes-docx-converter by Blake Martin.

Original Repository: https://github.com/cre8tiv/n8n-docx-converter
Original Author: Blake Martin
License: MIT

We extend our gratitude to the original author for creating the foundation that made these enhancements possible.

Resources

Version History

1.0.0 (Enhanced Fork)

🚀 Major Enhancement Release
✨ Added RAG-ready chunking with page awareness
📊 Comprehensive metadata extraction
🏗️ Document structure analysis
🔄 Multiple output modes (legacy, enhanced, RAG chunks)
📄 Page boundary preservation in chunks
🧠 Optimized for AI/ML workflows
⚡ Maintained backward compatibility
🛠️ Added new dependencies: jszip, cheerio
📝 Enhanced documentation and examples

0.1.3 (Original)

Use input and output destinations

0.1.0 (Original)

Initial release by Blake Martin
