@thasmorato/docx-parser

v1.4.0

Published

8 months ago

A modern JavaScript library for parsing and processing Microsoft Word DOCX documents with support for both buffer and stream operations. Features incremental parsing, checkbox detection, footnote support, and document validation.

Downloads

309

DOCX Parser

A modern JavaScript/TypeScript library for parsing DOCX documents using generator architecture and Clean Architecture.

🚀 Features

Streaming: Processes DOCX documents incrementally using async generators
Memory Efficient: Doesn't load the entire document into memory
TypeScript: Complete typing with well-defined interfaces
Clean Architecture: Clear separation between layers (Domain, Application, Infrastructure, Interfaces)
Flexible: Extracts text, images, tables, metadata, and formatting
Configurable: Advanced options to control parsing
Checkbox Detection: Automatically detects and parses checkbox states in lists
Footnote Support: Identifies and extracts footnotes with proper references
Header Levels: Detects document structure with header hierarchy (H1-H6)
Document Validation: Built-in validation for DOCX file integrity
Page Elements: Supports headers, footers, and page breaks
List Processing: Handles numbered, bulleted, and checkbox lists

📦 Installation

npm install @thasmorato/docx-parser
# or
yarn add @thasmorato/docx-parser
# or
pnpm add @thasmorato/docx-parser

📋 Main Exports

The library exports only essential items for a clean API:

// Main parsing functions
import {
  parseDocx,           // Buffer → AsyncGenerator
  parseDocxStream,     // ReadStream → AsyncGenerator
  parseDocxHttpStream, // HTTP Readable Stream → AsyncGenerator
  parseDocxReadable,   // Node.js Readable Stream → AsyncGenerator
  parseDocxWebStream,  // ReadableStream → AsyncGenerator
  parseDocxFile,       // File path → AsyncGenerator
  parseDocxToArray,    // Any source → Promise<Array>
  extractText,         // Extract only text
  extractImages,       // Extract only images
  getMetadata          // Extract only metadata
} from 'docx-parser';

// Essential types
import type {
  DocumentElement,     // Union of all element types
  ParseOptions,        // Configuration options
  ValidationResult,    // Validation response
  MetadataElement,     // Document metadata
  ParagraphElement,    // Text paragraphs
  ImageElement,        // Images with metadata
  TableElement,        // Tables with rows/cells
  HeaderElement,       // Document headers
  FooterElement,       // Document footers
  ListElement,         // Lists (bullet/numbered)
  PageBreakElement,    // Page breaks
  SectionElement       // Document sections
} from 'docx-parser';

// Error handling
import { DocxParseError } from 'docx-parser';

// Advanced utilities
import { StreamAdapter } from 'docx-parser';
import { ValidateDocumentUseCaseImpl } from 'docx-parser';

🔥 Basic Usage

Incremental Parsing (Recommended)

import { parseDocx } from 'docx-parser';
import { readFileSync } from 'fs';

const buffer = readFileSync('document.docx');

// Process document incrementally
for await (const element of parseDocx(buffer)) {
  console.log(`Type: ${element.type}`);

  if (element.type === 'paragraph') {
    console.log(`Text: ${element.content}`);
  } else if (element.type === 'image') {
    console.log(`Image: ${element.metadata?.filename}`);
  } else if (element.type === 'table') {
    console.log(`Table with ${element.content.length} rows`);
  }
}

File Parsing

import { parseDocxFile } from 'docx-parser';

// Read file directly from filesystem
for await (const element of parseDocxFile('./document.docx')) {
  console.log(element.type, element.content);
}

Stream Parsing

import { parseDocxStream, parseDocxHttpStream, parseDocxWebStream } from 'docx-parser';
import { createReadStream } from 'fs';
import axios from 'axios';

// Using Node.js ReadStream (recommended for local files)
const fileStream = createReadStream('./document.docx');
for await (const element of parseDocxStream(fileStream)) {
  console.log(element.type);
}

// Using HTTP streams (axios, fetch, etc.) - NEW!
const response = await axios({
  url: 'https://example.com/document.docx',
  responseType: 'stream'
});
for await (const element of parseDocxHttpStream(response.data)) {
  console.log(element.type);
}

// Using Web ReadableStream (for fetch, responses, etc.)
const fetchResponse = await fetch('https://example.com/document.docx');
const webStream = fetchResponse.body!;
for await (const element of parseDocxWebStream(webStream)) {
  console.log(element.type);
}

StreamAdapter Utility

import { StreamAdapter } from 'docx-parser';
import { createReadStream } from 'fs';

// Convert ReadStream to Web ReadableStream
const nodeStream = createReadStream('./document.docx');
const webStream = StreamAdapter.toWebStream(nodeStream);

// Convert ReadableStream to Buffer
const buffer = await StreamAdapter.toBuffer(webStream);

// Create ReadableStream from Buffer
const streamFromBuffer = StreamAdapter.fromBuffer(buffer);

🛠️ Complete API

Main Functions

`parseDocx(buffer, options?)`

Processes a DOCX Buffer incrementally.

import { parseDocx } from 'docx-parser';
import { readFileSync } from 'fs';

const buffer = readFileSync('doc.docx');
for await (const element of parseDocx(buffer, {
  includeImages: true,
  includeTables: true,
  normalizeWhitespace: true
})) {
  // Process each element
}

`parseDocxStream(stream, options?)`

Processes a DOCX ReadStream (Node.js) incrementally.

import { parseDocxStream } from 'docx-parser';
import { createReadStream } from 'fs';

const stream = createReadStream('doc.docx');
for await (const element of parseDocxStream(stream)) {
  // Process each element
}

`parseDocxReadable(stream, options?)`

Processes a DOCX from any Node.js Readable stream incrementally.

import { parseDocxReadable } from 'docx-parser';
import { Readable } from 'stream';

// Custom Readable stream
const customStream = new Readable({
  read() {
    // Custom stream logic
  }
});

for await (const element of parseDocxReadable(customStream)) {
  // Process each element
}

// Transform streams, pipeline streams, etc.
import { Transform } from 'stream';
const transformedStream = someStream.pipe(new Transform({
  transform(chunk, encoding, callback) {
    // Transform logic
    callback(null, chunk);
  }
}));

for await (const element of parseDocxReadable(transformedStream)) {
  // Process transformed stream
}

`parseDocxHttpStream(stream, options?)`

Processes a DOCX from HTTP Readable streams (axios, fetch, etc.) incrementally.

import { parseDocxHttpStream } from 'docx-parser';
import axios from 'axios';

const response = await axios({
  url: 'https://example.com/doc.docx',
  responseType: 'stream'
});

for await (const element of parseDocxHttpStream(response.data)) {
  // Process each element
}

`parseDocxWebStream(stream, options?)`

Processes a DOCX ReadableStream (Web API) incrementally.

import { parseDocxWebStream } from 'docx-parser';

const response = await fetch('document.docx');
const stream = response.body!;
for await (const element of parseDocxWebStream(stream)) {
  // Process each element
}

`parseDocxToArray(source, options?)`

Returns all elements as an array (non-streaming).

import { parseDocxToArray } from 'docx-parser';
import { createReadStream } from 'fs';

// From buffer
const elements = await parseDocxToArray(buffer);

// From ReadStream
const stream = createReadStream('doc.docx');
const elements2 = await parseDocxToArray(stream);

console.log(`Document has ${elements.length} elements`);

`extractText(source, options?)`

Extracts only text from the document.

import { extractText } from 'docx-parser';
import { createReadStream } from 'fs';

// From buffer
const text = await extractText(buffer, {
  preserveFormatting: true
});

// From ReadStream
const stream = createReadStream('doc.docx');
const text2 = await extractText(stream);

console.log(text);

`extractImages(source)`

Extracts only images from the document.

import { extractImages } from 'docx-parser';
import { createReadStream } from 'fs';

// From buffer
for await (const image of extractImages(buffer)) {
  console.log(`Image: ${image.metadata?.filename}`);
}

// From ReadStream
const stream = createReadStream('doc.docx');
for await (const image of extractImages(stream)) {
  // image.content contains the image Buffer
}

`getMetadata(source)`

Extracts only metadata from the document.

import { getMetadata } from 'docx-parser';
import { createReadStream } from 'fs';

// From buffer
const metadata = await getMetadata(buffer);

// From ReadStream
const stream = createReadStream('doc.docx');
const metadata2 = await getMetadata(stream);

console.log(`Title: ${metadata.title}`);
console.log(`Author: ${metadata.author}`);
console.log(`Created: ${metadata.created}`);

Document Validation

`ValidateDocumentUseCaseImpl`

Validates DOCX document structure and integrity.

import { ValidateDocumentUseCaseImpl } from 'docx-parser';

const validator = new ValidateDocumentUseCaseImpl();
const result = await validator.validate(buffer);

if (!result.isValid) {
  console.log('Invalid document:');
  result.errors.forEach(error => {
    console.log(`- ${error.message} (${error.code})`);
  });
} else {
  console.log('Valid document!');
}

🌐 HTTP Streams Support

The library now supports HTTP streams from popular libraries like axios, fetch, and others. This resolves the common error "stream.getReader is not a function" when working with HTTP responses.

Using with Axios

import axios from 'axios';
import { parseDocxHttpStream, parseDocxToArray } from 'docx-parser';

// Method 1: Specific HTTP stream function (recommended)
async function parseFromUrl(url: string) {
  const response = await axios({
    method: 'get',
    url: url,
    responseType: 'stream'  // Important: use 'stream', not 'buffer'
  });

  for await (const element of parseDocxHttpStream(response.data)) {
    console.log(`${element.type}: ${element.content}`);
  }
}

// Method 2: Automatic detection
async function parseToArray(url: string) {
  const response = await axios({
    method: 'get',
    url: url,
    responseType: 'stream'
  });

  // parseDocxToArray automatically detects Node.js streams
  const elements = await parseDocxToArray(response.data, {
    includeImages: true,
    includeTables: true
  });

  return elements;
}

Using with Fetch API

// Fetch API (Node.js 18+)
const response = await fetch('https://example.com/document.docx');
if (!response.ok) throw new Error(`HTTP error! status: ${response.status}`);

// response.body is a Web ReadableStream - works automatically
const elements = await parseDocxToArray(response.body);

Error Handling for HTTP Streams

import axios from 'axios';
import { parseDocxHttpStream } from 'docx-parser';

async function safeParseFromUrl(url: string) {
  try {
    const response = await axios({
      method: 'get',
      url: url,
      responseType: 'stream',
      timeout: 30000,  // 30 seconds timeout
      headers: {
        'Accept': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
      }
    });

    const elements = [];
    for await (const element of parseDocxHttpStream(response.data)) {
      elements.push(element);
    }

    return elements;

  } catch (error) {
    if (axios.isAxiosError(error)) {
      console.error('HTTP Error:', error.response?.status, error.message);
    } else if (error.name === 'DocxParseError') {
      console.error('DOCX Parse Error:', error.message);
    } else {
      console.error('Unknown Error:', error);
    }
    throw error;
  }
}

📖 For complete HTTP streams documentation, see: project/HTTP_STREAMS_GUIDE.md

⚙️ Configuration Options

interface ParseOptions {
  // Content control
  includeMetadata?: boolean;     // Include metadata (default: true)
  includeImages?: boolean;       // Include images (default: true)
  includeTables?: boolean;       // Include tables (default: true)
  includeHeaders?: boolean;      // Include headers (default: false)
  includeFooters?: boolean;      // Include footers (default: false)

  // Image processing
  imageFormat?: 'buffer' | 'base64' | 'stream';  // Image format
  maxImageSize?: number;         // Maximum size in bytes (default: 10MB)

  // Text formatting
  preserveFormatting?: boolean;  // Preserve formatting (default: true)
  normalizeWhitespace?: boolean; // Normalize whitespace (default: true)

  // Performance
  chunkSize?: number;           // Chunk size (default: 64KB)
  concurrent?: boolean;         // Parallel processing (default: false)
}

📋 Element Types

MetadataElement

{
  type: 'metadata',
  id: string,
  position: { page: number, section: number, order: number },
  content: {
    title?: string,
    author?: string,
    subject?: string,
    created?: Date,
    modified?: Date,
    // ... other metadata
  }
}

ParagraphElement

{
  type: 'paragraph',
  id: string,
  position: { page: number, section: number, order: number },
  content: string,
  formatting?: {
    fontFamily?: string,
    fontSize?: number,
    bold?: boolean,
    italic?: boolean,
    underline?: boolean,
    color?: string,
    highlight?: string
  },
  // Special properties
  checkbox?: {                 // For checkbox list items
    checked: boolean
  },
  isFootnote?: boolean,        // If this is a footnote
  footnoteId?: string          // Footnote identifier
}

ImageElement

{
  type: 'image',
  id: string,
  position: { page: number, section: number, order: number },
  content: Buffer,
  metadata?: {
    filename?: string,
    format: 'png' | 'jpg' | 'gif' | 'svg' | 'wmf' | 'emf',
    width: number,
    height: number
  },
  positioning?: {
    inline?: boolean,
    x?: number,
    y?: number
  }
}

TableElement

{
  type: 'table',
  id: string,
  position: { page: number, section: number, order: number },
  content: Array<{
    cells: Array<{ content: string }>,
    isHeader: boolean
  }>
}

HeaderElement

{
  type: 'header',
  id: string,
  position: { page: number, section: number, order: number },
  content: string,
  level: 1 | 2 | 3 | 4 | 5 | 6,
  hasPageNumber?: boolean      // If header contains page numbers
}

FooterElement

{
  type: 'footer',
  id: string,
  position: { page: number, section: number, order: number },
  content: string
}

ListElement

{
  type: 'list',
  id: string,
  position: { page: number, section: number, order: number },
  content: string[],
  listType: 'bullet' | 'number',
  level: number
}

Additional Elements

// Page breaks
{
  type: 'pageBreak',
  id: string,
  position: { page: number, section: number, order: number },
  content: null
}

// Document sections
{
  type: 'section',
  id: string,
  position: { page: number, section: number, order: number },
  content: string,
  sectionType: 'header' | 'footer' | 'body'
}

🏗️ Advanced Examples

Filtering Content Types

import { parseDocx } from 'docx-parser';

// Only process text and tables
for await (const element of parseDocx(buffer, {
  includeImages: false,
  includeMetadata: false,
  includeTables: true
})) {
  if (element.type === 'paragraph') {
    console.log('Paragraph:', element.content);
  } else if (element.type === 'table') {
    console.log('Table found');
    element.content.forEach((row, i) => {
      console.log(`Row ${i}:`, row.cells.map(c => c.content).join(' | '));
    });
  }
}

Working with Checkboxes

import { parseDocx } from 'docx-parser';

for await (const element of parseDocx(buffer)) {
  if (element.type === 'paragraph' && element.checkbox) {
    const status = element.checkbox.checked ? '✅' : '☐';
    console.log(`${status} ${element.content}`);
  }
}

Processing Footnotes

import { parseDocx } from 'docx-parser';

const footnotes = [];
for await (const element of parseDocx(buffer)) {
  if (element.type === 'paragraph' && element.isFootnote) {
    footnotes.push({
      id: element.footnoteId,
      content: element.content
    });
  }
}

console.log('Found footnotes:', footnotes);

Header Level Detection

import { parseDocx } from 'docx-parser';

for await (const element of parseDocx(buffer)) {
  if (element.type === 'header') {
    const indent = '  '.repeat(element.level - 1);
    console.log(`${indent}H${element.level}: ${element.content}`);

    if (element.hasPageNumber) {
      console.log(`${indent}  (contains page number)`);
    }
  }
}

Working with Node.js Readable Streams

import { parseDocxReadable } from 'docx-parser';
import { Readable, Transform, pipeline } from 'stream';
import { promisify } from 'util';

// Custom stream processing
const customStream = new Readable({
  read() {
    // Custom logic to read DOCX data
  }
});

for await (const element of parseDocxReadable(customStream)) {
  console.log(`${element.type}: ${element.content}`);
}

// Pipeline with transforms
const pipelineAsync = promisify(pipeline);
const transformStream = new Transform({
  transform(chunk, encoding, callback) {
    // Apply transformations if needed
    callback(null, chunk);
  }
});

// Process transformed stream
const transformedStream = someDocxStream.pipe(transformStream);
for await (const element of parseDocxReadable(transformedStream)) {
  console.log('Transformed element:', element);
}

Image Processing

import { writeFileSync } from 'fs';
import { extractImages } from 'docx-parser';

let imageCount = 0;
for await (const image of extractImages(buffer)) {
  const filename = image.metadata?.filename || `image_${++imageCount}.${image.metadata?.format}`;
  writeFileSync(`./output/${filename}`, image.content);
  console.log(`Image saved: ${filename}`);
}

Document Validation

import { ValidateDocumentUseCaseImpl } from 'docx-parser';

const validator = new ValidateDocumentUseCaseImpl();
const result = await validator.validate(buffer);

if (!result.isValid) {
  console.log('Invalid document:');
  result.errors.forEach(error => {
    console.log(`- ${error.message} (${error.code})`);
  });
} else {
  console.log('Valid document!');
}

🔧 Architecture

The library follows Clean Architecture with 4 layers:

Domain: Types, interfaces, and business rules
Application: Use cases and application logic
Infrastructure: Concrete implementations (JSZip, XML parsing)
Interfaces: Public API and controllers

src/
├── domain/           # Entities and business rules
├── application/      # Use cases and application logic
├── infrastructure/   # Adapters and implementations
└── interfaces/       # Public API

🚨 Error Handling

import { DocxParseError } from 'docx-parser';

try {
  for await (const element of parseDocx(invalidBuffer)) {
    console.log(element);
  }
} catch (error) {
  if (error instanceof DocxParseError) {
    console.log('DOCX-specific error:', error.message);
  } else {
    console.log('General error:', error);
  }
}

📝 Performance Tips

Use streaming: Prefer parseDocx() over parseDocxToArray() for large documents
Filter content: Disable includeImages if you don't need images
Chunk size: Adjust chunkSize based on document sizes
Limit images: Use maxImageSize to avoid very large images

🤝 Contributing

Fork the project
Create a feature branch
Commit your changes
Push to the branch
Open a Pull Request

Development Setup

# Clone the repository
git clone https://github.com/ThaSMorato/docx-parser.git
cd docx-parser

# Install dependencies
pnpm install

# Run tests
pnpm test

# Run linting
pnpm lint

# Build the project
pnpm build

Automated Publishing

This project uses GitHub Actions for automated NPM publishing with semantic versioning:

MAJOR: MAJOR: breaking change or BREAKING CHANGE in commit
MINOR: MINOR: new feature or feat: prefix
PATCH: fix:, docs:, chore: or any other commit type

See .github/VERSIONING.md for detailed information about the versioning system.

📄 License

MIT License - see LICENSE for details.

🐛 Reporting Bugs

Open an issue on GitHub with:

Library version
Sample DOCX file (if possible)
Code that reproduces the problem
Complete error message

Made with ❤️ and TypeScript

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

DOCX Parser

🚀 Features

📦 Installation

📋 Main Exports

🔥 Basic Usage

Incremental Parsing (Recommended)

File Parsing

Stream Parsing

StreamAdapter Utility

🛠️ Complete API

Main Functions

parseDocx(buffer, options?)

parseDocxStream(stream, options?)

parseDocxReadable(stream, options?)

parseDocxHttpStream(stream, options?)

parseDocxWebStream(stream, options?)

parseDocxToArray(source, options?)

extractText(source, options?)

extractImages(source)

getMetadata(source)

Document Validation

ValidateDocumentUseCaseImpl

🌐 HTTP Streams Support

Using with Axios

Using with Fetch API

Error Handling for HTTP Streams

⚙️ Configuration Options

📋 Element Types

MetadataElement

ParagraphElement

ImageElement

TableElement

HeaderElement

FooterElement

ListElement

Additional Elements

🏗️ Advanced Examples

Filtering Content Types

Working with Checkboxes

Processing Footnotes

Header Level Detection

Working with Node.js Readable Streams

Image Processing

Document Validation

🔧 Architecture

🚨 Error Handling

📝 Performance Tips

🤝 Contributing

Development Setup

Automated Publishing

📄 License

🐛 Reporting Bugs

`parseDocx(buffer, options?)`

`parseDocxStream(stream, options?)`

`parseDocxReadable(stream, options?)`

`parseDocxHttpStream(stream, options?)`

`parseDocxWebStream(stream, options?)`

`parseDocxToArray(source, options?)`

`extractText(source, options?)`

`extractImages(source)`

`getMetadata(source)`

`ValidateDocumentUseCaseImpl`