@thasmorato/docx-parser
v1.4.0
Published
A modern JavaScript library for parsing and processing Microsoft Word DOCX documents with support for both buffer and stream operations. Features incremental parsing, checkbox detection, footnote support, and document validation.
Downloads
511
Maintainers
Readme
DOCX Parser
A modern JavaScript/TypeScript library for parsing DOCX documents using generator architecture and Clean Architecture.
🚀 Features
- Streaming: Processes DOCX documents incrementally using async generators
- Memory Efficient: Doesn't load the entire document into memory
- TypeScript: Complete typing with well-defined interfaces
- Clean Architecture: Clear separation between layers (Domain, Application, Infrastructure, Interfaces)
- Flexible: Extracts text, images, tables, metadata, and formatting
- Configurable: Advanced options to control parsing
- Checkbox Detection: Automatically detects and parses checkbox states in lists
- Footnote Support: Identifies and extracts footnotes with proper references
- Header Levels: Detects document structure with header hierarchy (H1-H6)
- Document Validation: Built-in validation for DOCX file integrity
- Page Elements: Supports headers, footers, and page breaks
- List Processing: Handles numbered, bulleted, and checkbox lists
📦 Installation
npm install @thasmorato/docx-parser
# or
yarn add @thasmorato/docx-parser
# or
pnpm add @thasmorato/docx-parser📋 Main Exports
The library exports only essential items for a clean API:
// Main parsing functions
import {
parseDocx, // Buffer → AsyncGenerator
parseDocxStream, // ReadStream → AsyncGenerator
parseDocxHttpStream, // HTTP Readable Stream → AsyncGenerator
parseDocxReadable, // Node.js Readable Stream → AsyncGenerator
parseDocxWebStream, // ReadableStream → AsyncGenerator
parseDocxFile, // File path → AsyncGenerator
parseDocxToArray, // Any source → Promise<Array>
extractText, // Extract only text
extractImages, // Extract only images
getMetadata // Extract only metadata
} from 'docx-parser';
// Essential types
import type {
DocumentElement, // Union of all element types
ParseOptions, // Configuration options
ValidationResult, // Validation response
MetadataElement, // Document metadata
ParagraphElement, // Text paragraphs
ImageElement, // Images with metadata
TableElement, // Tables with rows/cells
HeaderElement, // Document headers
FooterElement, // Document footers
ListElement, // Lists (bullet/numbered)
PageBreakElement, // Page breaks
SectionElement // Document sections
} from 'docx-parser';
// Error handling
import { DocxParseError } from 'docx-parser';
// Advanced utilities
import { StreamAdapter } from 'docx-parser';
import { ValidateDocumentUseCaseImpl } from 'docx-parser';🔥 Basic Usage
Incremental Parsing (Recommended)
import { parseDocx } from 'docx-parser';
import { readFileSync } from 'fs';
const buffer = readFileSync('document.docx');
// Process document incrementally
for await (const element of parseDocx(buffer)) {
console.log(`Type: ${element.type}`);
if (element.type === 'paragraph') {
console.log(`Text: ${element.content}`);
} else if (element.type === 'image') {
console.log(`Image: ${element.metadata?.filename}`);
} else if (element.type === 'table') {
console.log(`Table with ${element.content.length} rows`);
}
}File Parsing
import { parseDocxFile } from 'docx-parser';
// Read file directly from filesystem
for await (const element of parseDocxFile('./document.docx')) {
console.log(element.type, element.content);
}Stream Parsing
import { parseDocxStream, parseDocxHttpStream, parseDocxWebStream } from 'docx-parser';
import { createReadStream } from 'fs';
import axios from 'axios';
// Using Node.js ReadStream (recommended for local files)
const fileStream = createReadStream('./document.docx');
for await (const element of parseDocxStream(fileStream)) {
console.log(element.type);
}
// Using HTTP streams (axios, fetch, etc.) - NEW!
const response = await axios({
url: 'https://example.com/document.docx',
responseType: 'stream'
});
for await (const element of parseDocxHttpStream(response.data)) {
console.log(element.type);
}
// Using Web ReadableStream (for fetch, responses, etc.)
const fetchResponse = await fetch('https://example.com/document.docx');
const webStream = fetchResponse.body!;
for await (const element of parseDocxWebStream(webStream)) {
console.log(element.type);
}StreamAdapter Utility
import { StreamAdapter } from 'docx-parser';
import { createReadStream } from 'fs';
// Convert ReadStream to Web ReadableStream
const nodeStream = createReadStream('./document.docx');
const webStream = StreamAdapter.toWebStream(nodeStream);
// Convert ReadableStream to Buffer
const buffer = await StreamAdapter.toBuffer(webStream);
// Create ReadableStream from Buffer
const streamFromBuffer = StreamAdapter.fromBuffer(buffer);🛠️ Complete API
Main Functions
parseDocx(buffer, options?)
Processes a DOCX Buffer incrementally.
import { parseDocx } from 'docx-parser';
import { readFileSync } from 'fs';
const buffer = readFileSync('doc.docx');
for await (const element of parseDocx(buffer, {
includeImages: true,
includeTables: true,
normalizeWhitespace: true
})) {
// Process each element
}parseDocxStream(stream, options?)
Processes a DOCX ReadStream (Node.js) incrementally.
import { parseDocxStream } from 'docx-parser';
import { createReadStream } from 'fs';
const stream = createReadStream('doc.docx');
for await (const element of parseDocxStream(stream)) {
// Process each element
}parseDocxReadable(stream, options?)
Processes a DOCX from any Node.js Readable stream incrementally.
import { parseDocxReadable } from 'docx-parser';
import { Readable } from 'stream';
// Custom Readable stream
const customStream = new Readable({
read() {
// Custom stream logic
}
});
for await (const element of parseDocxReadable(customStream)) {
// Process each element
}
// Transform streams, pipeline streams, etc.
import { Transform } from 'stream';
const transformedStream = someStream.pipe(new Transform({
transform(chunk, encoding, callback) {
// Transform logic
callback(null, chunk);
}
}));
for await (const element of parseDocxReadable(transformedStream)) {
// Process transformed stream
}parseDocxHttpStream(stream, options?)
Processes a DOCX from HTTP Readable streams (axios, fetch, etc.) incrementally.
import { parseDocxHttpStream } from 'docx-parser';
import axios from 'axios';
const response = await axios({
url: 'https://example.com/doc.docx',
responseType: 'stream'
});
for await (const element of parseDocxHttpStream(response.data)) {
// Process each element
}parseDocxWebStream(stream, options?)
Processes a DOCX ReadableStream (Web API) incrementally.
import { parseDocxWebStream } from 'docx-parser';
const response = await fetch('document.docx');
const stream = response.body!;
for await (const element of parseDocxWebStream(stream)) {
// Process each element
}parseDocxToArray(source, options?)
Returns all elements as an array (non-streaming).
import { parseDocxToArray } from 'docx-parser';
import { createReadStream } from 'fs';
// From buffer
const elements = await parseDocxToArray(buffer);
// From ReadStream
const stream = createReadStream('doc.docx');
const elements2 = await parseDocxToArray(stream);
console.log(`Document has ${elements.length} elements`);extractText(source, options?)
Extracts only text from the document.
import { extractText } from 'docx-parser';
import { createReadStream } from 'fs';
// From buffer
const text = await extractText(buffer, {
preserveFormatting: true
});
// From ReadStream
const stream = createReadStream('doc.docx');
const text2 = await extractText(stream);
console.log(text);extractImages(source)
Extracts only images from the document.
import { extractImages } from 'docx-parser';
import { createReadStream } from 'fs';
// From buffer
for await (const image of extractImages(buffer)) {
console.log(`Image: ${image.metadata?.filename}`);
}
// From ReadStream
const stream = createReadStream('doc.docx');
for await (const image of extractImages(stream)) {
// image.content contains the image Buffer
}getMetadata(source)
Extracts only metadata from the document.
import { getMetadata } from 'docx-parser';
import { createReadStream } from 'fs';
// From buffer
const metadata = await getMetadata(buffer);
// From ReadStream
const stream = createReadStream('doc.docx');
const metadata2 = await getMetadata(stream);
console.log(`Title: ${metadata.title}`);
console.log(`Author: ${metadata.author}`);
console.log(`Created: ${metadata.created}`);Document Validation
ValidateDocumentUseCaseImpl
Validates DOCX document structure and integrity.
import { ValidateDocumentUseCaseImpl } from 'docx-parser';
const validator = new ValidateDocumentUseCaseImpl();
const result = await validator.validate(buffer);
if (!result.isValid) {
console.log('Invalid document:');
result.errors.forEach(error => {
console.log(`- ${error.message} (${error.code})`);
});
} else {
console.log('Valid document!');
}🌐 HTTP Streams Support
The library now supports HTTP streams from popular libraries like axios, fetch, and others. This resolves the common error "stream.getReader is not a function" when working with HTTP responses.
Using with Axios
import axios from 'axios';
import { parseDocxHttpStream, parseDocxToArray } from 'docx-parser';
// Method 1: Specific HTTP stream function (recommended)
async function parseFromUrl(url: string) {
const response = await axios({
method: 'get',
url: url,
responseType: 'stream' // Important: use 'stream', not 'buffer'
});
for await (const element of parseDocxHttpStream(response.data)) {
console.log(`${element.type}: ${element.content}`);
}
}
// Method 2: Automatic detection
async function parseToArray(url: string) {
const response = await axios({
method: 'get',
url: url,
responseType: 'stream'
});
// parseDocxToArray automatically detects Node.js streams
const elements = await parseDocxToArray(response.data, {
includeImages: true,
includeTables: true
});
return elements;
}Using with Fetch API
// Fetch API (Node.js 18+)
const response = await fetch('https://example.com/document.docx');
if (!response.ok) throw new Error(`HTTP error! status: ${response.status}`);
// response.body is a Web ReadableStream - works automatically
const elements = await parseDocxToArray(response.body);Error Handling for HTTP Streams
import axios from 'axios';
import { parseDocxHttpStream } from 'docx-parser';
async function safeParseFromUrl(url: string) {
try {
const response = await axios({
method: 'get',
url: url,
responseType: 'stream',
timeout: 30000, // 30 seconds timeout
headers: {
'Accept': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
}
});
const elements = [];
for await (const element of parseDocxHttpStream(response.data)) {
elements.push(element);
}
return elements;
} catch (error) {
if (axios.isAxiosError(error)) {
console.error('HTTP Error:', error.response?.status, error.message);
} else if (error.name === 'DocxParseError') {
console.error('DOCX Parse Error:', error.message);
} else {
console.error('Unknown Error:', error);
}
throw error;
}
}📖 For complete HTTP streams documentation, see: project/HTTP_STREAMS_GUIDE.md
⚙️ Configuration Options
interface ParseOptions {
// Content control
includeMetadata?: boolean; // Include metadata (default: true)
includeImages?: boolean; // Include images (default: true)
includeTables?: boolean; // Include tables (default: true)
includeHeaders?: boolean; // Include headers (default: false)
includeFooters?: boolean; // Include footers (default: false)
// Image processing
imageFormat?: 'buffer' | 'base64' | 'stream'; // Image format
maxImageSize?: number; // Maximum size in bytes (default: 10MB)
// Text formatting
preserveFormatting?: boolean; // Preserve formatting (default: true)
normalizeWhitespace?: boolean; // Normalize whitespace (default: true)
// Performance
chunkSize?: number; // Chunk size (default: 64KB)
concurrent?: boolean; // Parallel processing (default: false)
}📋 Element Types
MetadataElement
{
type: 'metadata',
id: string,
position: { page: number, section: number, order: number },
content: {
title?: string,
author?: string,
subject?: string,
created?: Date,
modified?: Date,
// ... other metadata
}
}ParagraphElement
{
type: 'paragraph',
id: string,
position: { page: number, section: number, order: number },
content: string,
formatting?: {
fontFamily?: string,
fontSize?: number,
bold?: boolean,
italic?: boolean,
underline?: boolean,
color?: string,
highlight?: string
},
// Special properties
checkbox?: { // For checkbox list items
checked: boolean
},
isFootnote?: boolean, // If this is a footnote
footnoteId?: string // Footnote identifier
}ImageElement
{
type: 'image',
id: string,
position: { page: number, section: number, order: number },
content: Buffer,
metadata?: {
filename?: string,
format: 'png' | 'jpg' | 'gif' | 'svg' | 'wmf' | 'emf',
width: number,
height: number
},
positioning?: {
inline?: boolean,
x?: number,
y?: number
}
}TableElement
{
type: 'table',
id: string,
position: { page: number, section: number, order: number },
content: Array<{
cells: Array<{ content: string }>,
isHeader: boolean
}>
}HeaderElement
{
type: 'header',
id: string,
position: { page: number, section: number, order: number },
content: string,
level: 1 | 2 | 3 | 4 | 5 | 6,
hasPageNumber?: boolean // If header contains page numbers
}FooterElement
{
type: 'footer',
id: string,
position: { page: number, section: number, order: number },
content: string
}ListElement
{
type: 'list',
id: string,
position: { page: number, section: number, order: number },
content: string[],
listType: 'bullet' | 'number',
level: number
}Additional Elements
// Page breaks
{
type: 'pageBreak',
id: string,
position: { page: number, section: number, order: number },
content: null
}
// Document sections
{
type: 'section',
id: string,
position: { page: number, section: number, order: number },
content: string,
sectionType: 'header' | 'footer' | 'body'
}🏗️ Advanced Examples
Filtering Content Types
import { parseDocx } from 'docx-parser';
// Only process text and tables
for await (const element of parseDocx(buffer, {
includeImages: false,
includeMetadata: false,
includeTables: true
})) {
if (element.type === 'paragraph') {
console.log('Paragraph:', element.content);
} else if (element.type === 'table') {
console.log('Table found');
element.content.forEach((row, i) => {
console.log(`Row ${i}:`, row.cells.map(c => c.content).join(' | '));
});
}
}Working with Checkboxes
import { parseDocx } from 'docx-parser';
for await (const element of parseDocx(buffer)) {
if (element.type === 'paragraph' && element.checkbox) {
const status = element.checkbox.checked ? '✅' : '☐';
console.log(`${status} ${element.content}`);
}
}Processing Footnotes
import { parseDocx } from 'docx-parser';
const footnotes = [];
for await (const element of parseDocx(buffer)) {
if (element.type === 'paragraph' && element.isFootnote) {
footnotes.push({
id: element.footnoteId,
content: element.content
});
}
}
console.log('Found footnotes:', footnotes);Header Level Detection
import { parseDocx } from 'docx-parser';
for await (const element of parseDocx(buffer)) {
if (element.type === 'header') {
const indent = ' '.repeat(element.level - 1);
console.log(`${indent}H${element.level}: ${element.content}`);
if (element.hasPageNumber) {
console.log(`${indent} (contains page number)`);
}
}
}Working with Node.js Readable Streams
import { parseDocxReadable } from 'docx-parser';
import { Readable, Transform, pipeline } from 'stream';
import { promisify } from 'util';
// Custom stream processing
const customStream = new Readable({
read() {
// Custom logic to read DOCX data
}
});
for await (const element of parseDocxReadable(customStream)) {
console.log(`${element.type}: ${element.content}`);
}
// Pipeline with transforms
const pipelineAsync = promisify(pipeline);
const transformStream = new Transform({
transform(chunk, encoding, callback) {
// Apply transformations if needed
callback(null, chunk);
}
});
// Process transformed stream
const transformedStream = someDocxStream.pipe(transformStream);
for await (const element of parseDocxReadable(transformedStream)) {
console.log('Transformed element:', element);
}Image Processing
import { writeFileSync } from 'fs';
import { extractImages } from 'docx-parser';
let imageCount = 0;
for await (const image of extractImages(buffer)) {
const filename = image.metadata?.filename || `image_${++imageCount}.${image.metadata?.format}`;
writeFileSync(`./output/${filename}`, image.content);
console.log(`Image saved: ${filename}`);
}Document Validation
import { ValidateDocumentUseCaseImpl } from 'docx-parser';
const validator = new ValidateDocumentUseCaseImpl();
const result = await validator.validate(buffer);
if (!result.isValid) {
console.log('Invalid document:');
result.errors.forEach(error => {
console.log(`- ${error.message} (${error.code})`);
});
} else {
console.log('Valid document!');
}🔧 Architecture
The library follows Clean Architecture with 4 layers:
- Domain: Types, interfaces, and business rules
- Application: Use cases and application logic
- Infrastructure: Concrete implementations (JSZip, XML parsing)
- Interfaces: Public API and controllers
src/
├── domain/ # Entities and business rules
├── application/ # Use cases and application logic
├── infrastructure/ # Adapters and implementations
└── interfaces/ # Public API🚨 Error Handling
import { DocxParseError } from 'docx-parser';
try {
for await (const element of parseDocx(invalidBuffer)) {
console.log(element);
}
} catch (error) {
if (error instanceof DocxParseError) {
console.log('DOCX-specific error:', error.message);
} else {
console.log('General error:', error);
}
}📝 Performance Tips
- Use streaming: Prefer
parseDocx()overparseDocxToArray()for large documents - Filter content: Disable
includeImagesif you don't need images - Chunk size: Adjust
chunkSizebased on document sizes - Limit images: Use
maxImageSizeto avoid very large images
🤝 Contributing
- Fork the project
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
Development Setup
# Clone the repository
git clone https://github.com/ThaSMorato/docx-parser.git
cd docx-parser
# Install dependencies
pnpm install
# Run tests
pnpm test
# Run linting
pnpm lint
# Build the project
pnpm buildAutomated Publishing
This project uses GitHub Actions for automated NPM publishing with semantic versioning:
- MAJOR:
MAJOR: breaking changeorBREAKING CHANGEin commit - MINOR:
MINOR: new featureorfeat:prefix - PATCH:
fix:,docs:,chore:or any other commit type
See .github/VERSIONING.md for detailed information about the versioning system.
📄 License
MIT License - see LICENSE for details.
🐛 Reporting Bugs
Open an issue on GitHub with:
- Library version
- Sample DOCX file (if possible)
- Code that reproduces the problem
- Complete error message
Made with ❤️ and TypeScript
