@omer-go/docx-parser-converter-ts
v1.0.0
A TypeScript library to convert DOCX files to WYSIWYG HTML or plain text formats while preserving styles.
DOCX Parser Converter - TypeScript/JavaScript
TypeScript/JavaScript implementation of the DOCX parser and converter. Works in both browser and Node.js environments with ES Modules, UMD, and IIFE builds available.
For installation and quick start, see the main README.
Quick Start
import { docxToHtml, docxToText } from '@omer-go/docx-parser-converter-ts';
// Convert to HTML
const html = await docxToHtml(buffer);
// Convert to plain text
const text = await docxToText(buffer);
Environment Compatibility
Browser
The library is fully browser-compatible using standard Web APIs:
// From file input
const file = document.getElementById('fileInput').files[0];
const htmlFromFile = await docxToHtml(file);
// From ArrayBuffer
const arrayBuffer = await file.arrayBuffer();
const htmlFromArrayBuffer = await docxToHtml(arrayBuffer);
// From Blob
const blob = new Blob([arrayBuffer]);
const htmlFromBlob = await docxToHtml(blob);
Node.js
In Node.js, you can also use file paths:
import { readFileSync } from 'fs';
import { docxToHtml, docxToText } from '@omer-go/docx-parser-converter-ts';
// File path (Node.js only)
const htmlFromPath = await docxToHtml('document.docx');
// Save directly to file (Node.js only)
await docxToHtml('document.docx', undefined, { outputPath: 'output.html' });
// Bytes also work
const bytes = readFileSync('document.docx');
const htmlFromBytes = await docxToHtml(bytes);
Configuration
Use ConversionConfig to customize the conversion:
import { docxToHtml, docxToText, ConversionConfig } from '@omer-go/docx-parser-converter-ts';
const config: ConversionConfig = {
  // HTML-specific options
  title: 'My Document', // Document title in <title> tag
  language: 'en', // HTML lang attribute
  styleMode: 'inline', // "inline", "class", or "none"
  useSemanticTags: false, // Use CSS spans (false) vs <strong>, <em> (true)
  fragmentOnly: false, // Output just content without HTML wrapper
  customCss: 'body { margin: 2em; }', // Custom CSS to include
  responsive: true, // Include viewport meta tag
  // Text-specific options
  textFormatting: 'plain', // "plain" or "markdown"
  tableMode: 'auto', // "auto", "ascii", "tabs", or "plain"
  paragraphSeparator: '\n\n', // Separator between paragraphs
};
const html = await docxToHtml(buffer, config);
const text = await docxToText(buffer, config);
Configuration Options
HTML Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| styleMode | "inline" \| "class" \| "none" | "inline" | How to output CSS styles |
| useSemanticTags | boolean | false | Use semantic tags (<strong>, <em>) vs CSS spans |
| preserveWhitespace | boolean | false | Preserve whitespace in content |
| includeDefaultStyles | boolean | true | Include default CSS styles |
| title | string | "" | Document title for HTML output |
| language | string | "en" | HTML lang attribute |
| fragmentOnly | boolean | false | Output only content, no HTML wrapper |
| customCss | string \| null | null | Custom CSS to include |
| cssFiles | string[] | [] | External CSS files to reference |
| useCssVariables | boolean | false | Use CSS custom properties |
| responsive | boolean | true | Include viewport meta tag |
| includePrintStyles | boolean | false | Include print media query styles |
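As an illustration of how the defaults in this table combine with caller overrides, here is a minimal sketch. `HtmlOptionsSketch`, `HTML_DEFAULTS`, and `withHtmlDefaults` are hypothetical names that only mirror the table; they are not the library's own types, and the way the library merges options internally is an assumption.

```typescript
// Local mirror of the HTML options table; NOT the library's own type.
type StyleMode = 'inline' | 'class' | 'none';

interface HtmlOptionsSketch {
  styleMode: StyleMode;
  useSemanticTags: boolean;
  preserveWhitespace: boolean;
  includeDefaultStyles: boolean;
  title: string;
  language: string;
  fragmentOnly: boolean;
  customCss: string | null;
  cssFiles: string[];
  useCssVariables: boolean;
  responsive: boolean;
  includePrintStyles: boolean;
}

// Values copied from the "Default" column of the table.
const HTML_DEFAULTS: HtmlOptionsSketch = {
  styleMode: 'inline',
  useSemanticTags: false,
  preserveWhitespace: false,
  includeDefaultStyles: true,
  title: '',
  language: 'en',
  fragmentOnly: false,
  customCss: null,
  cssFiles: [],
  useCssVariables: false,
  responsive: true,
  includePrintStyles: false,
};

// Hypothetical helper: overlay caller-supplied options on the documented defaults.
function withHtmlDefaults(overrides: Partial<HtmlOptionsSketch> = {}): HtmlOptionsSketch {
  return {
    ...HTML_DEFAULTS,
    ...overrides,
    // Copy the array so the shared defaults object is never mutated by callers.
    cssFiles: [...(overrides.cssFiles ?? HTML_DEFAULTS.cssFiles)],
  };
}

// Options one might choose when embedding the output inside an existing page.
const embedConfig = withHtmlDefaults({ fragmentOnly: true, styleMode: 'class' });
```

Only the overridden keys change; everything else keeps the value listed in the table.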
Text Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| textFormatting | "plain" \| "markdown" | "plain" | Output format |
| tableMode | "auto" \| "ascii" \| "tabs" \| "plain" | "auto" | Table rendering mode |
| paragraphSeparator | string | "\n\n" | Separator between paragraphs |
| preserveEmptyParagraphs | boolean | true | Preserve empty paragraphs |
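To make the effect of paragraphSeparator and preserveEmptyParagraphs concrete, the following sketch joins already-extracted paragraph strings. `joinParagraphs` and `TextJoinOptions` are hypothetical; the library's actual text converter may behave differently in edge cases.

```typescript
// Hypothetical options object using the two option names from the table.
interface TextJoinOptions {
  paragraphSeparator?: string;
  preserveEmptyParagraphs?: boolean;
}

// Sketch: join paragraph strings, optionally dropping blank ones first.
function joinParagraphs(paragraphs: string[], options: TextJoinOptions = {}): string {
  // Defaults follow the table: "\n\n" and true.
  const { paragraphSeparator = '\n\n', preserveEmptyParagraphs = true } = options;
  const kept = preserveEmptyParagraphs
    ? paragraphs
    : paragraphs.filter((p) => p.trim() !== '');
  return kept.join(paragraphSeparator);
}

const sampleParagraphs = ['Title', '', 'Body text.'];

// Defaults: the empty paragraph is kept, so two separators appear in a row.
const withEmpty = joinParagraphs(sampleParagraphs);

// Compact output: single newline, empty paragraphs dropped.
const compact = joinParagraphs(sampleParagraphs, {
  paragraphSeparator: '\n',
  preserveEmptyParagraphs: false,
});
```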
Table Rendering Modes
- auto: Automatically selects ASCII for tables with visible borders, tabs for others
- ascii: ASCII box-drawing characters (+, -, |)
- tabs: Tab-separated columns
- plain: Space-separated columns
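The modes above can be sketched as a small renderer over a grid of cell strings. This is an illustration only: `renderTable` and `pickMode` are hypothetical helpers, and the column-sizing rule (pad to the widest cell) is an assumption rather than the library's documented behavior.

```typescript
type TableModeSketch = 'ascii' | 'tabs' | 'plain';

// Sketch: render rows of cell text in one of the three concrete modes.
function renderTable(rows: string[][], mode: TableModeSketch): string {
  if (mode === 'tabs') return rows.map((r) => r.join('\t')).join('\n');
  if (mode === 'plain') return rows.map((r) => r.join(' ')).join('\n');

  // ascii: size each column to its widest cell, then draw +, -, | borders.
  const columnCount = Math.max(...rows.map((r) => r.length));
  const widths: number[] = [];
  for (let c = 0; c < columnCount; c++) {
    widths.push(Math.max(...rows.map((r) => (r[c] ?? '').length)));
  }
  const rule = '+' + widths.map((w) => '-'.repeat(w + 2)).join('+') + '+';
  const lines = [rule];
  for (const row of rows) {
    const cells = widths.map((w, c) => ' ' + (row[c] ?? '').padEnd(w) + ' ');
    lines.push('|' + cells.join('|') + '|', rule);
  }
  return lines.join('\n');
}

// Hypothetical stand-in for "auto": ASCII when borders are visible, tabs otherwise.
function pickMode(hasVisibleBorders: boolean): TableModeSketch {
  return hasVisibleBorders ? 'ascii' : 'tabs';
}

const tableRows = [
  ['Header 1', 'Header 2'],
  ['Cell 1', 'Cell 2'],
];
const asciiTable = renderTable(tableRows, pickMode(true));
const tabTable = renderTable(tableRows, pickMode(false));
```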
Example ASCII table output:
+----------+----------+
| Header 1 | Header 2 |
+----------+----------+
| Cell 1   | Cell 2   |
+----------+----------+
Markdown Formatting
When using textFormatting: "markdown", formatting is preserved:
const config: ConversionConfig = { textFormatting: 'markdown' };
const text = await docxToText(buffer, config);
// Output: "This is **bold** and *italic* text."
Module Formats
ES Modules (Recommended)
import { docxToHtml, docxToText } from '@omer-go/docx-parser-converter-ts';
UMD (Browser <script> tag)
<script src="path/to/dist/docx-parser-converter.umd.js"></script>
<script>
const { docxToHtml, docxToText } = window.DocxParserConverter;
</script>
IIFE (Browser global)
<script src="path/to/dist/docx-parser-converter.iife.js"></script>
<script>
const { docxToHtml, docxToText } = DocxParserConverter;
</script>
Browser Usage Example
<input type="file" id="docxFile" accept=".docx" />
<button onclick="handleConvert()">Convert</button>
<div id="htmlOutput"></div>
<pre id="textOutput"></pre>
<script type="module">
  import { docxToHtml, docxToText } from '@omer-go/docx-parser-converter-ts';
  window.handleConvert = async function() {
    const fileInput = document.getElementById('docxFile');
    if (!fileInput.files?.length) {
      alert('Please select a DOCX file.');
      return;
    }
    const file = fileInput.files[0];
    // Convert to HTML
    const html = await docxToHtml(file, { title: 'Converted Document' });
    document.getElementById('htmlOutput').innerHTML = html;
    // Convert to Plain Text
    const text = await docxToText(file);
    document.getElementById('textOutput').textContent = text;
  }
</script>
Input Types
The library accepts multiple input types:
// File (from input element)
const file = inputElement.files[0];
const htmlFromFile = await docxToHtml(file);
// ArrayBuffer
const arrayBuffer = await file.arrayBuffer();
const htmlFromArrayBuffer = await docxToHtml(arrayBuffer);
// Uint8Array
const uint8 = new Uint8Array(arrayBuffer);
const htmlFromUint8 = await docxToHtml(uint8);
// Blob
const blob = new Blob([arrayBuffer]);
const htmlFromBlob = await docxToHtml(blob);
// File path (Node.js only)
const htmlFromPath = await docxToHtml('document.docx');
// null/undefined returns empty output
const emptyHtml = await docxToHtml(null); // Returns empty HTML document
const emptyText = await docxToText(null); // Returns ""
Supported DOCX Elements
Text Formatting
- Bold, italic, underline, strikethrough
- Subscript, superscript
- Highlight colors
- Font family, size, and color
- All caps, small caps
- Various underline styles (single, double, dotted, dashed, wave, etc.) with color support
Paragraph Formatting
- Alignment (left, center, right, justify)
- Indentation (left, right, first line, hanging)
- Spacing (before, after, line spacing)
- Borders and shading
- Keep with next, keep lines together, page break before
Lists and Numbering
- Bullet lists
- Numbered lists (decimal, roman, letters, ordinal)
- Multi-level lists with various formats
- List restart and override support
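For readers unfamiliar with how list counters turn into markers, here is a small sketch of formatting a counter value in a few of the formats named above. The format names follow the OOXML `w:numFmt` values (`decimal`, `lowerLetter`, `upperRoman`, and so on); `formatCounter` itself is a hypothetical helper, not part of the library's API, and the letter scheme (z is followed by aa, bb, ...) reflects Word's convention as I understand it.

```typescript
type NumberFormatSketch =
  | 'decimal'
  | 'lowerLetter'
  | 'upperLetter'
  | 'lowerRoman'
  | 'upperRoman';

// Convert a positive integer to an uppercase Roman numeral.
function toRoman(n: number): string {
  const table: Array<[number, string]> = [
    [1000, 'M'], [900, 'CM'], [500, 'D'], [400, 'CD'],
    [100, 'C'], [90, 'XC'], [50, 'L'], [40, 'XL'],
    [10, 'X'], [9, 'IX'], [5, 'V'], [4, 'IV'], [1, 'I'],
  ];
  let rest = n;
  let out = '';
  for (const [value, symbol] of table) {
    while (rest >= value) {
      out += symbol;
      rest -= value;
    }
  }
  return out;
}

// Word-style letters: 1 -> a, 26 -> z, 27 -> aa, 28 -> bb (the letter repeats).
function toLetters(n: number): string {
  const letter = String.fromCharCode('a'.charCodeAt(0) + ((n - 1) % 26));
  return letter.repeat(Math.floor((n - 1) / 26) + 1);
}

// Hypothetical helper: format a 1-based list counter.
function formatCounter(n: number, format: NumberFormatSketch): string {
  switch (format) {
    case 'decimal': return String(n);
    case 'lowerLetter': return toLetters(n);
    case 'upperLetter': return toLetters(n).toUpperCase();
    case 'lowerRoman': return toRoman(n).toLowerCase();
    case 'upperRoman': return toRoman(n);
  }
}

const thirdLevelMarker = formatCounter(4, 'lowerRoman') + '.';
```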
Tables
- Simple and complex tables
- Cell merging (horizontal and vertical)
- Full border support (outer borders, inside grid lines, per-cell borders)
- Cell-level border overrides (tcBorders override tblBorders)
- Cell shading and backgrounds
- Column widths and table alignment
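The cell-level override rule (tcBorders over tblBorders) can be pictured as a per-side merge. The sketch below is a simplification under stated assumptions: `BorderSketch`, `BorderSet`, and `resolveCellBorders` are hypothetical, and inside grid lines (insideH/insideV) are left out for brevity.

```typescript
type Side = 'top' | 'right' | 'bottom' | 'left';

// Hypothetical, simplified border model.
interface BorderSketch {
  style: string;
  widthPt: number;
  color: string;
}

type BorderSet = Partial<Record<Side, BorderSketch>>;

// Sketch: any side defined on the cell (tcBorders) replaces the table-level side.
function resolveCellBorders(tblBorders: BorderSet, tcBorders: BorderSet): BorderSet {
  return { ...tblBorders, ...tcBorders };
}

const tableLevel: BorderSet = {
  top: { style: 'single', widthPt: 0.5, color: '000000' },
  bottom: { style: 'single', widthPt: 0.5, color: '000000' },
};
const cellLevel: BorderSet = {
  bottom: { style: 'double', widthPt: 1.5, color: 'FF0000' },
};

// top comes from the table, bottom from the cell.
const effectiveBorders = resolveCellBorders(tableLevel, cellLevel);
```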
Images
- Inline images with dimensions and alt text
- Floating/anchored images with positioning (left, right, center)
- Embedded as base64 data URLs in HTML output
Hyperlinks
- External URLs resolved from relationships
TypeScript-Specific Limitations
The TypeScript implementation does not currently support:
- Headers and footers
- Footnotes and endnotes
- Comments
- Custom XML parts
See the main README for the full feature comparison.
Error Handling
The library provides specific exceptions for different error cases:
import {
  docxToHtml,
  DocxNotFoundError,
  DocxReadError,
  DocxValidationError
} from '@omer-go/docx-parser-converter-ts';
try {
  const html = await docxToHtml('document.docx');
} catch (e) {
  if (e instanceof DocxNotFoundError) {
    console.error('File not found');
  } else if (e instanceof DocxReadError) {
    console.error('Cannot read file:', e.message);
  } else if (e instanceof DocxValidationError) {
    console.error('Invalid DOCX:', e.message);
  } else {
    // Not one of the handled DOCX errors: rethrow instead of swallowing it
    throw e;
  }
}
Exception Types
| Exception | Description |
|-----------|-------------|
| DocxParserError | Base class for all DOCX errors |
| DocxNotFoundError | File not found (Node.js only) |
| DocxReadError | Cannot read file |
| DocxValidationError | Invalid DOCX structure |
| DocxEncryptedError | Password-protected file |
| DocxMissingPartError | Required part missing (e.g., document.xml) |
| XmlParseError | XML parsing failed |
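Because the table describes DocxParserError as the base class, a handler can test for specific subclasses first and fall back to the base class. The sketch below uses local stand-in classes (suffixed `Sketch`) so it runs without the package installed; the real classes may carry extra fields, and the exact inheritance of each exception is an assumption beyond what the table states.

```typescript
// Local stand-ins for the library's exceptions; NOT the real classes.
class DocxParserErrorSketch extends Error {
  constructor(message: string) {
    super(message);
    this.name = new.target.name;
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}
class DocxEncryptedErrorSketch extends DocxParserErrorSketch {}
class DocxValidationErrorSketch extends DocxParserErrorSketch {}

// Check the most specific classes first, then the shared base class.
function describeFailure(e: unknown): string {
  if (e instanceof DocxEncryptedErrorSketch) return 'password-protected file';
  if (e instanceof DocxValidationErrorSketch) return `invalid DOCX: ${e.message}`;
  if (e instanceof DocxParserErrorSketch) return `DOCX error: ${e.message}`;
  return 'unexpected error';
}

const encryptedMessage = describeFailure(new DocxEncryptedErrorSketch('encrypted'));
const genericMessage = describeFailure(new DocxParserErrorSketch('bad part'));
const foreignMessage = describeFailure(new TypeError('not a DOCX problem'));
```

Placing the base-class check last matters: if it came first, every subclass would match it and the specific branches would never run.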
Advanced API
For more control, use parseDocx directly:
import { parseDocx, HTMLConverter, TextConverter } from '@omer-go/docx-parser-converter-ts';
// Parse DOCX to get document model and metadata
const [document, metadata] = await parseDocx(buffer);
if (document) {
  // Use converters directly
  const htmlConverter = new HTMLConverter(document, {
    styles: metadata.styles,
    numbering: metadata.numbering,
    relationships: metadata.relationships,
    imageData: metadata.imageData,
  });
  const html = htmlConverter.convert({
    styleMode: 'class',
    title: 'My Document',
  });
}
Architecture
The library follows a three-phase conversion process:
- Parse: Unzip DOCX and parse XML parts (document.xml, styles.xml, numbering.xml) into typed models
- Resolve: Apply hierarchical style inheritance (direct formatting > character style > paragraph style > defaults)
- Convert: Transform models to HTML or plain text output
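The precedence in the Resolve phase (direct formatting > character style > paragraph style > defaults) can be sketched as a layered merge in which later layers win. `RunPropsSketch` and `resolveRunProps` are hypothetical; the library's real resolver works on its own typed models and also walks style inheritance chains, which this sketch omits.

```typescript
// Hypothetical, trimmed-down run properties.
interface RunPropsSketch {
  bold?: boolean;
  italic?: boolean;
  fontSizePt?: number;
  color?: string;
}

// Sketch: spread from lowest to highest priority so higher layers overwrite lower ones.
function resolveRunProps(
  defaults: RunPropsSketch,
  paragraphStyle: RunPropsSketch,
  characterStyle: RunPropsSketch,
  direct: RunPropsSketch,
): RunPropsSketch {
  return { ...defaults, ...paragraphStyle, ...characterStyle, ...direct };
}

const resolvedRun = resolveRunProps(
  { fontSizePt: 11, color: '000000' }, // document defaults
  { fontSizePt: 14, bold: true },      // paragraph style (e.g. a heading)
  { italic: true, color: '1F4E79' },   // character style
  { bold: false },                     // direct formatting on the run
);
// fontSizePt comes from the paragraph style, color and italic from the
// character style, and bold from direct formatting.
```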
Project Structure
docx_parser_converter_ts/
├── src/
│ ├── index.ts # Public exports
│ ├── api.ts # Main API (docxToHtml, docxToText, parseDocx)
│ ├── config.ts # ConversionConfig interface
│ ├── core/ # Core utilities
│ │ ├── docx-reader.ts # DOCX file opening and validation
│ │ ├── xml-extractor.ts # XML content extraction
│ │ ├── constants.ts # XML namespaces
│ │ └── exceptions.ts # Custom exceptions
│ ├── models/ # TypeScript interfaces
│ │ ├── common/ # Shared models (Color, Border, Spacing, etc.)
│ │ ├── document/ # Document models (Paragraph, Run, Table, etc.)
│ │ ├── numbering/ # Numbering definitions
│ │ └── styles/ # Style definitions
│ ├── parsers/ # XML to model conversion
│ │ ├── document/ # Document element parsers
│ │ ├── numbering/ # Numbering parsers
│ │ └── styles/ # Style parsers
│ └── converters/ # Model to output conversion
│ ├── common/ # Style resolution, numbering tracking
│ ├── html/ # HTML conversion
│ └── text/ # Text conversion
└── tests/ # Test suite
Development
Setup
# Clone the repository
git clone https://github.com/omer-go/docx-parser-converter.git
cd docx-parser-converter/docx_parser_converter_ts
# Install dependencies (using pnpm)
pnpm install
Build
# Build all formats (ES, UMD, IIFE)
pnpm build
# Type checking
pnpm tsc --noEmit
Testing
# Run all tests
pnpm test
# Run with coverage
pnpm test --coverage
# Run specific test file
pnpm test src/__tests__/api.test.ts
Linting
# Run ESLint
pnpm lint
Known Limitations
Not Currently Supported
- Headers and footers: Document headers/footers are not included
- Footnotes and endnotes: These are not extracted
- Comments and track changes: Revision marks are not processed
- OLE objects: Embedded Excel charts, etc. are not supported
- Text boxes: Floating text boxes and shapes are not extracted
- RTL/BiDi text: Right-to-left text may not render correctly
- Password-protected files: Encrypted documents cannot be opened
Partial Support
- Themes: Theme colors and fonts are not resolved
- Custom XML: Custom document properties are not extracted
License
MIT License
Related Documentation
- XML to CSS Conversion - XML to CSS conversion reference
- XML Structure Guide - OOXML structure reference
