
@mazix/n8n-nodes-converter-documents

v1.1.2


n8n node to convert various document formats (DOCX, XML, YML, XLS, XLSX, CSV, PDF, TXT, PPT, PPTX, HTML, JSON, ODT, ODP, ODS) to JSON or text format


📄 n8n Document Converter Node


🚀 n8n community node for converting various document formats to JSON/text with AI-friendly output




✨ Features

🎯 Core Features

  • 12+ file formats supported
  • ✅ Automatic file type detection
  • ✅ Hybrid processing (primary + fallback)
  • ✅ Stream processing for large files
  • ✅ Promise pooling for concurrency control
  • ✅ Comprehensive error handling
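
To illustrate what "promise pooling" means in practice, here is a minimal sketch of a pool that caps how many conversions run at once. `runPool` is a hypothetical helper written for this example; it is not the node's internal API, and the real implementation may differ.

```typescript
// Minimal promise pool: runs at most `limit` tasks at once and keeps results
// in input order. `runPool` is a hypothetical name, not part of this package.
async function runPool<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Never start more runners than there are items, and always at least one.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

With a limit of 2, for example, a batch of ten files would be converted two at a time instead of all at once, which keeps peak memory closer to the size of two documents.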

🔒 Security & Performance

  • ✅ Input validation & sanitization
  • ✅ XSS protection (sanitize-html)
  • ✅ Path traversal protection
  • ✅ Memory-efficient streaming
  • ✅ Configurable file size limits (up to 100MB)
  • ✅ JSON structure normalization

📚 Supported Formats

| Category | Formats | Status |
|----------|---------|--------|
| Text Documents | DOCX, ODT, TXT, PDF | ✅ Full Support |
| Spreadsheets | XLSX, ODS, CSV | ✅ Multi-sheet support |
| Presentations | PPTX, ODP | ✅ Full Support |
| Web & Data | HTML, HTM, XML, JSON | ✅ Full Support |
| E-commerce | YML (Yandex Market) | ✅ Specialized parsing |
| Legacy | DOC, PPT, XLS | ❌ Not supported* |

*Legacy formats require conversion to modern formats (DOCX, PPTX, XLSX)


📊 DOCX to HTML Conversion (v1.0.21+)

Latest: Node renamed to "Document Converter" in v1.0.22

🎨 Choose Your Output Format

Plain Text (default)

Best for:

  • Simple text extraction
  • Minimal output size
  • Maximum speed
  • Backward compatibility

Output size: ~3,600 chars

HTML

Best for:

  • Documents with tables
  • AI/LLM processing
  • Preserving formatting
  • Structured content

Output size: ~58,000 chars (+1,591%)

📋 Usage in n8n

1. Add "Document Converter" node
2. Select "Output Format (DOCX)" parameter:
   • Plain Text → Simple extraction
   • HTML → Tables + formatting preserved

💡 Example Output

Plain Text:

{
  "text": "Situation: Often search by one field\nAction: Create index on that field"
}

HTML:

{
  "text": "<table><tr><td><strong>Situation</strong></td><td><strong>Action</strong></td></tr><tr><td>Often search by one field</td><td>Create index on that field</td></tr></table>"
}

🎯 HTML Format Features

| Feature | Description |
|---------|-------------|
| Tables | <table>, <tr>, <td> - full structure preserved |
| Formatting | <strong>, <em>, <h1>-<h6> |
| Lists | <ul>, <ol>, <li> |
| Paragraphs | <p> tags for structure |
| AI-Friendly | ✅ Understood by ChatGPT, Claude, Gemini |


New in v1.1.0: Enhanced Controls

🧠 HTML Table Preservation

When converting HTML or DOCX (in HTML mode), tables are now preserved in the output. This is critical for RAG/LLM contexts, allowing AI models to understand structured data instead of flattened text.

⚙️ Advanced CSV & Excel Control

  • CSV Delimiter: Manually select comma (,), semicolon (;), tab (\t), or pipe (|), or keep Auto.
  • Max Excel Rows: Limit rows per sheet (e.g., 1000) to prevent memory crashes on huge files.
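
As a rough illustration of the two controls above, the sketch below guesses a delimiter from a text sample and caps a row list. Both `guessDelimiter` and `limitRows` are hypothetical helpers; the node's actual Auto mode may rely on its CSV parser's own detection, and treating a limit of 0 as "unlimited" is an assumption based on the defaults listed in this README.

```typescript
// Candidate delimiters, matching the options offered by the node parameter.
const CANDIDATES = [",", ";", "\t", "|"] as const;

// Hypothetical heuristic: pick the delimiter that occurs the same non-zero
// number of times on every sampled line. Quoted fields are not handled here.
function guessDelimiter(sample: string, maxLines = 10): string {
  const lines = sample
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .slice(0, maxLines);
  let best: string = ",";
  let bestScore = 0;
  for (const delimiter of CANDIDATES) {
    const counts = lines.map((line) => line.split(delimiter).length - 1);
    const first = counts[0] ?? 0;
    const consistent = first > 0 && counts.every((count) => count === first);
    if (consistent && first > bestScore) {
      best = delimiter;
      bestScore = first;
    }
  }
  return best; // falls back to "," when nothing matches
}

// Keep at most `maxRows` rows; 0 (or a negative value) is treated as no limit.
function limitRows<T>(rows: readonly T[], maxRows: number): T[] {
  return maxRows > 0 ? rows.slice(0, maxRows) : [...rows];
}
```

A heuristic like this can misfire on files whose fields contain the delimiter inside quotes, which is one reason a manual override is useful.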

XLSX Multi-Sheet Processing

🗂️ How It Works

{
  "sheets": {
    "Products": [
      { "A": "ID", "B": "Name", "C": "Price" },
      { "A": 1, "B": "Apple", "C": 100 },
      { "A": 2, "B": "Banana", "C": 50 }
    ],
    "Orders": [
      { "A": "Order", "B": "Quantity" },
      { "A": 101, "B": 5 }
    ]
  }
}

📌 Key Features

| Feature | Details |
|---------|---------|
| Multiple Sheets | Each sheet = separate array in sheets object |
| Column Names | A, B, C... Z (Excel-style) |
| Row Format | Array of objects (rows) |
| Empty Cells | Skipped (only filled cells included) |
| Size Limit | Configurable (default: 0 / unlimited) |
| Memory Safe | Large files auto-limited to prevent OOM |
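
Because rows are keyed by column letter, downstream steps often want to re-key them by the header row. The sketch below shows one way to do that, for example inside an n8n Code node. The types and the `rowsToRecords` helper are assumptions made for illustration, and it presumes the first row of each sheet holds the column names.

```typescript
// Assumed shape of the node's spreadsheet output, based on the example above.
type CellValue = string | number | boolean | null;
type SheetRow = Record<string, CellValue>;
interface SheetsOutput {
  sheets: Record<string, SheetRow[]>;
}

// Hypothetical helper: use the first row as headers and re-key the remaining rows.
function rowsToRecords(rows: readonly SheetRow[]): Record<string, CellValue>[] {
  const [header, ...body] = rows;
  if (header === undefined) return [];
  return body.map((row) => {
    const record: Record<string, CellValue> = {};
    for (const [column, name] of Object.entries(header)) {
      // Empty cells are skipped in the output, so fill the gaps with null.
      record[String(name)] = row[column] ?? null;
    }
    return record;
  });
}

const output: SheetsOutput = {
  sheets: {
    Products: [
      { A: "ID", B: "Name", C: "Price" },
      { A: 1, B: "Apple", C: 100 },
      { A: 2, B: "Banana", C: 50 },
    ],
  },
};

const products = rowsToRecords(output.sheets["Products"] ?? []);
// products[0] is { ID: 1, Name: "Apple", Price: 100 }
```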


🚀 Installation

Option 1: npm Package (Recommended)

Via n8n web interface:

Settings → Community nodes → Install
Package name: @mazix/n8n-nodes-converter-documents

Or via command line:

npm install @mazix/n8n-nodes-converter-documents

Option 2: Standalone Version

# 1. Clone and build
git clone https://github.com/mazixs/n8n-node-converter-documents.git
cd n8n-node-converter-documents
npm install
npm run standalone

# 2. Copy to n8n
cp -r ./standalone ~/.n8n/custom-nodes/n8n-node-converter-documents
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm install

# 3. Restart n8n

Option 3: Manual Installation

mkdir -p ~/.n8n/custom-nodes/n8n-node-converter-documents
cp dist/*.js dist/*.svg ~/.n8n/custom-nodes/n8n-node-converter-documents/
cp package.json ~/.n8n/custom-nodes/n8n-node-converter-documents/
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm install --production

📖 Usage Examples

Text Document Output

{
  "text": "Extracted text content...",
  "metadata": {
    "fileName": "document.docx",
    "fileSize": 12345,
    "fileType": "docx",
    "processedAt": "2024-06-01T12:00:00.000Z"
  }
}

Excel Spreadsheet Output

{
  "sheets": {
    "Sheet1": [
      { "A": "Name", "B": "Age", "C": "City" },
      { "A": "Alice", "B": 30, "C": "Moscow" },
      { "A": "Bob", "B": 25, "C": "SPB" }
    ]
  },
  "metadata": {
    "fileName": "data.xlsx",
    "fileSize": 23456,
    "fileType": "xlsx"
  }
}

JSON Normalization

Input:

{
  "user": {
    "name": "John",
    "address": { "city": "Moscow" }
  }
}

Output (flattened):

{
  "text": "{\n  \"user.name\": \"John\",\n  \"user.address.city\": \"Moscow\"\n}",
  "warning": "Multi-level JSON structure was converted to flat object"
}
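
The flattening shown above can be sketched as a small recursive function that joins nested keys with dots. This is an illustration only: `flatten` is a hypothetical name, and keeping arrays as leaf values is an assumption, since the README does not say how the node treats them.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical sketch: collapse nested objects into a single level with
// dot-separated keys. Arrays and primitives are stored as leaf values.
function flatten(
  value: Json,
  prefix = "",
  out: Record<string, Json> = {},
): Record<string, Json> {
  if (value !== null && typeof value === "object" && !Array.isArray(value)) {
    for (const [key, child] of Object.entries(value)) {
      flatten(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else if (prefix) {
    out[prefix] = value;
  }
  return out;
}

const flat = flatten({ user: { name: "John", address: { city: "Moscow" } } });
// Serializing with two-space indentation gives the "text" value shown above.
const text = JSON.stringify(flat, null, 2);
```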

🏗️ Architecture

Strategy Pattern Implementation

DOCX Processing Flow:
┌─────────────────────────────────────┐
│ 1. If outputFormat === 'html':     │
│    → mammoth.convertToHtml()       │
│    → [Success] Return HTML          │
│    → [Fail] Fallback to text       │
│                                     │
│ 2. Text mode (default):            │
│    → officeparser (primary)        │
│    → mammoth.extractRawText (fb)   │
│    → XML direct parsing (last)     │
└─────────────────────────────────────┘
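
The text-mode chain in the diagram (primary parser, then fallback, then last resort) can be sketched as a loop over extractors that stops at the first non-empty result. `extractWithFallback` and the `Extractor` type are hypothetical; in the real node the entries would wrap officeparser, mammoth, and direct XML parsing, which are left out here so the example stays self-contained.

```typescript
// An extractor takes raw file bytes and resolves to plain text.
type Extractor = (data: Uint8Array) => Promise<string>;

// Hypothetical sketch: try each extractor in order and return the first
// non-empty result, collecting failure reasons for the final error.
async function extractWithFallback(
  data: Uint8Array,
  extractors: readonly { name: string; run: Extractor }[],
): Promise<{ text: string; usedParser: string }> {
  const failures: string[] = [];
  for (const { name, run } of extractors) {
    try {
      const text = await run(data);
      if (text.trim().length > 0) return { text, usedParser: name };
      failures.push(`${name}: empty result`);
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      failures.push(`${name}: ${reason}`);
    }
  }
  throw new Error(`All parsers failed (${failures.join("; ")})`);
}
```

Treating an empty string as a failure is a design choice in this sketch: some parsers succeed technically but return nothing for unusual documents, and moving on to the next parser gives those files another chance.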

Technology Stack

Core Libraries

  • officeparser (v5.1.1) - Primary parser
  • mammoth (v1.9.1) - DOCX processor
  • exceljs (v4.4.0) - Excel handler
  • pdf-parse (v1.1.1) - PDF fallback
  • papaparse (v5.5.3) - CSV parser

Build & Quality

  • TypeScript 5.8 (strict mode)
  • Jest (80 tests passing)
  • ESLint (TypeScript rules)
  • Webpack bundling
  • CommonJS modules

Security Features

| Feature | Implementation |
|---------|----------------|
| Input Validation | Strict type & structure checks |
| XSS Protection | sanitize-html library |
| Path Traversal | File name sanitization |
| Memory Limits | 10K rows/sheet, 50MB default |
| Dependency Audit | Regular npm audit checks |
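
To make the "File name sanitization" row concrete, here is a minimal sketch of how a file name might be reduced to a safe base name. `sanitizeFileName` and the `"unnamed"` placeholder are assumptions for illustration; the node's actual rules are not documented here and may be stricter.

```typescript
// Hypothetical sketch: strip directory components and risky characters so a
// user-supplied name cannot point outside the intended directory.
function sanitizeFileName(input: string, maxLength = 255): string {
  // Keep only the last path segment, whichever separator was used.
  const base = input.split(/[\\/]/).pop() ?? "";
  const cleaned = base
    .replace(/[\u0000-\u001f\u007f]/g, "") // drop control characters
    .replace(/[<>:"|?*]/g, "_") // replace characters reserved on Windows
    .replace(/^\.+/, "") // drop leading dots ("..", hidden files)
    .trim()
    .slice(0, maxLength);
  return cleaned.length > 0 ? cleaned : "unnamed";
}
```

For instance, an input such as `../../etc/passwd` would be reduced to `passwd`, and a name consisting only of dots would become the placeholder.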


💻 Development

Quick Start

npm install        # Install dependencies
npm run dev        # Watch mode
npm run build      # Compile
npm test           # Run 80 tests
npm run lint       # Check code quality

Build Commands

| Command | Description |
|---------|-------------|
| npm run build | TypeScript → JavaScript |
| npm run bundle | Webpack bundling |
| npm run standalone | Standalone with deps |
| npm run test:coverage | Coverage report |
| npm run lint:fix | Auto-fix issues |

Project Structure

├── src/
│   ├── FileToJsonNode.node.ts  # Main node (Strategy Pattern)
│   ├── helpers.ts               # Utilities
│   └── errors.ts                # Custom errors
├── test/
│   ├── unit/                    # Unit tests
│   ├── integration/             # Integration tests
│   └── samples/                 # Test files
├── docs/                        # Documentation
│   ├── SOLUTION.md
│   ├── HTML_CONVERSION_PLAN.md
│   └── MAMMOTH_ANALYSIS.md
└── dist/                        # Compiled output

📈 Latest Updates

🎉 v1.0.22 (2025-10-10)

🎨 UI & Quality

  • Node renamed: "Document Converter"
  • Icon fixed: 60×60 (proper size)
  • Code refactored: -78 lines
  • Zero duplication: 100% eliminated
  • Full error handling: PPTX fixed

📚 Docs & Tests

  • README redesign: Badges, TOC, tables
  • 80 tests passing (+7 XLSX)
  • Full JSDoc: All functions documented
  • Better IntelliSense: IDE support improved
  • Professional look: Visual tables & icons

What's New:

+ Node renamed to "Document Converter" (better UX)
+ Icon size fixed: 2048×1853 → 60×60
+ Code quality: eliminated all duplication
+ BaseConverterError class (DRY principle)
+ checkCFBFormat() helper (unified CFB check)
+ processViaOfficeParser() helper (unified error handling)
+ Full JSDoc documentation added
+ README complete visual redesign
+ 7 new XLSX multi-sheet tests

Previous Versions

  • DOCX to HTML conversion with table support
  • outputFormat parameter (text | html)
  • Table preservation in HTML
  • AI/LLM friendly output
  • 73 tests passing
  • Extract text from TextBoxes and shapes
  • ONLYOFFICE document fix
  • 62 tests passing
  • Fixed XML namespace extraction
  • No more schema URLs in output
  • 61 tests passing

📚 Documentation

| Document | Description |
|----------|-------------|
| CHANGELOG.md | Complete version history |
| SOLUTION.md | Architecture overview |
| HTML_CONVERSION_PLAN.md | DOCX to HTML implementation |
| MAMMOTH_ANALYSIS.md | Library research findings |
| optimization_plan.md | Performance strategies |
| security.md | Security features |


🔧 Troubleshooting

Common Issues

Error: Cannot find module 'exceljs'

# Solution 1: Use standalone version (recommended)
npm run standalone

# Solution 2: Check dependencies
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm list
npm install

Large files causing OOM

  • Split files into smaller parts
  • Reduce maxFileSize parameter
  • Use streaming for CSV/TXT formats

⚠️ Limitations

| Limitation | Details | Workaround |
|------------|---------|------------|
| Legacy formats | DOC, PPT, XLS not supported | Convert to DOCX, PPTX, XLSX |
| Memory | Large PDF/XLSX load into RAM | Split files or increase memory |
| File size | Default 50MB limit | Configurable up to 100MB |


📊 Statistics

  • 12+ file formats supported
  • 80 tests passing
  • 5 specialized parsers
  • 10K rows per sheet limit
  • 100MB max file size
  • 0 critical vulnerabilities

🤝 Contributing

Issues and pull requests are welcome!


📝 License

MIT © mazix



Made with ❤️ for the n8n community

If you find this helpful, please ⭐ star the repository!