html-docxjs-compiler

v2.0.0

Published

3 months ago

Convert HTML to DOCXjs elements.

0High
0Medium
0Low

janisha

html docx converter html-to-docx document

HTML to DOCXjs Compiler

A powerful and flexible TypeScript library that converts HTML strings into DOCXjs XmlComponent format. Built on top of the excellent docx library, this package parses HTML using cheerio and transforms it into XmlComponent objects that can be seamlessly integrated with the docx API.

🚀 Features

✅ Comprehensive HTML Support - Handles headings, paragraphs, lists, tables, images, and inline formatting
✅ CSS Styling - Supports inline styles (colors, alignment, text decoration, etc.)
✅ Image Handling - Base64 data URIs, HTTP/HTTPS URLs, and custom image download strategies
✅ Extensible Architecture - Strategy pattern for custom image sources (Firebase, S3, Azure, etc.)
✅ Type Safe - Written in TypeScript with full type definitions
✅ Zero Config - Works out of the box with sensible defaults
✅ Production Ready - Battle-tested with proper error handling

📦 Installation

npm install html-docxjs-compiler docx

Peer Dependencies

This package requires docx as a peer dependency:

npm install docx@^9.5.0

🎯 Quick Start

Basic Usage

import { transformHtmlToDocx } from 'html-docxjs-compiler';
import { Document, Packer } from 'docx';
import * as fs from 'fs';

async function createDocument() {
  const html = `
    <h1>My Document</h1>
    <p>This is a paragraph with <strong>bold</strong> and <em>italic</em> text.</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
  `;
  
  // Transform HTML to DOCX elements
  const elements = await transformHtmlToDocx(html);
  
  // Create a document with the elements
  const doc = new Document({
    sections: [{
      children: elements
    }]
  });
  
  // Generate and save the document
  const buffer = await Packer.toBuffer(doc);
  fs.writeFileSync('output.docx', buffer);
}

createDocument();

📖 How It Works

The library uses a three-stage process:

Parse HTML - Uses cheerio to parse HTML into a DOM-like structure
Transform to XmlComponents - Recursively processes each HTML element and converts it to docx XmlComponent objects (Paragraph, TextRun, Table, ImageRun, etc.)
Integration with docx - Returns array of XmlComponent[] that can be used directly in the docx Document API

HTML String → cheerio Parser → Element Handlers → XmlComponent[] → docx Document

Configuration is Optional

All configuration options are completely optional:

✅ Works out-of-the-box with no configuration
✅ Base64 images work without any setup
✅ URL-based images require explicit strategyManager configuration
✅ Graceful degradation: missing features log warnings, don't throw errors

// Simple usage - works immediately
const elements = await transformHtmlToDocx('<p>Hello World</p>');

// With image URL support - requires configuration
import { ImageDownloadStrategyManager, HttpImageDownloadStrategy } from 'html-docxjs-compiler';

const htmlWithImages = await transformHtmlToDocx('<p><img src="imageurl...">></p>');

const strategyManager = new ImageDownloadStrategyManager([
  new HttpImageDownloadStrategy()
]);

const elements = await transformHtmlToDocx(htmlWithImages, { strategyManager });

🎨 Examples

Example 1: Document with Formatting

import { transformHtmlToDocx } from 'html-docxjs-compiler';
import { Document, Packer } from 'docx';

const html = `
  <h1>Project Report</h1>
  <h2>Executive Summary</h2>
  <p style="text-align: center; color: #333333;">
    This report provides an overview of our <strong>Q4 2024</strong> performance.
  </p>
  
  <h3>Key Highlights</h3>
  <ul>
    <li>Revenue increased by <strong>25%</strong></li>
    <li>Customer satisfaction: <em>95%</em></li>
    <li>New product launch was <u>successful</u></li>
  </ul>
`;

async function generateReport() {
  const elements = await transformHtmlToDocx(html);
  
  const doc = new Document({
    sections: [{
      children: elements
    }]
  });
  
  const buffer = await Packer.toBuffer(doc);
  // Save or send buffer...
}

Example 2: Tables with Styling

const html = `
  <h2>Sales Data</h2>
  <table>
    <thead>
      <tr>
        <th style="background-color: #4472C4; color: white;">Product</th>
        <th style="background-color: #4472C4; color: white;">Q3</th>
        <th style="background-color: #4472C4; color: white;">Q4</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Widget A</td>
        <td style="text-align: right;">$50,000</td>
        <td style="text-align: right;">$65,000</td>
      </tr>
      <tr>
        <td>Widget B</td>
        <td style="text-align: right;">$30,000</td>
        <td style="text-align: right;">$42,000</td>
      </tr>
    </tbody>
  </table>
`;

const elements = await transformHtmlToDocx(html);

Example 3: Images with HTTP URLs

const html = `
  <h1>Product Catalog</h1>
  <p>Check out our latest products:</p>
  <img src="https://example.com/product-image.jpg" alt="Product" />
  <p>Available in multiple colors.</p>
`;

const elements = await transformHtmlToDocx(html);

Example 4: Base64 Images

const html = `
  <img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA..." />
  <p>Company Logo</p>
`;

const elements = await transformHtmlToDocx(html);

🔧 Advanced Configuration

Custom Image Download Strategies

The library uses a Strategy Pattern for image downloads, allowing you to customize how images are fetched from different sources.

Creating Custom Image Strategy

import { ImageDownloadStrategy } from 'html-docxjs-compiler';
import axios from 'axios';

// Custom strategy for S3 signed URLs
class S3ImageDownloadStrategy implements ImageDownloadStrategy {
  canHandle(url: string): boolean {
    return url.includes('s3.amazonaws.com') || url.includes('s3-');
  }

  async download(url: string): Promise<string> {
    const response = await axios.get(url, {
      responseType: 'arraybuffer',
      headers: {
        'Authorization': 'Bearer YOUR_TOKEN' // Add custom headers
      }
    });

    const base64 = Buffer.from(response.data, 'binary').toString('base64');
    return `data:image/png;base64,${base64}`;
  }
}

// Use your custom strategy
const s3Strategy = new S3ImageDownloadStrategy();
const strategyManager = new ImageDownloadStrategyManager([s3Strategy]);

const elements = await transformHtmlToDocx(html, { strategyManager });

Multiple Strategies (Chain of Responsibility)

import { 
  ImageDownloadStrategyManager,
  HttpImageDownloadStrategy,
} from 'html-docxjs-compiler';

// Strategies are tried in order until one can handle the URL
const strategyManager = new ImageDownloadStrategyManager([
  new FirebaseImageDownloadStrategy('firebase-bucket.appspot.com'),
  new S3ImageDownloadStrategy(),
  new HttpImageDownloadStrategy() // Fallback for any HTTP/HTTPS URL
]);

const elements = await transformHtmlToDocx(html, { strategyManager });

📋 Supported HTML Elements

Block Elements

| Element | Description | Styling Support | |---------|-------------|-----------------| | h1 - h6 | Headings (converted to DOCX heading styles) | ✅ | | p | Paragraphs | ✅ text-align, color, etc. | | div | Division container | ✅ | | ul, ol | Unordered/Ordered lists | ✅ Nested lists supported | | li | List items | ✅ | | table | Tables | ✅ | | tr | Table rows | ✅ | | td, th | Table cells/headers | ✅ colspan, rowspan, background-color, vertical-align | | thead, tbody | Table sections | ✅ |

Inline Elements

| Element | Description | Styling Support | |---------|-------------|-----------------| | strong, b | Bold text | ✅ | | em, i | Italic text | ✅ | | u | Underlined text | ✅ | | s | Strikethrough text | ✅ | | sub | Subscript | ✅ | | sup | Superscript | ✅ | | span | Inline container | ✅ color, background-color, etc. | | a | Hyperlinks | ✅ Creates clickable links | | br | Line break | ✅ | | img | Images | ✅ Auto-resize, multiple sources |

Supported CSS Styles

Colors: 147+ named colors + hex values (e.g., #FF0000, red, darkblue)
Text Alignment: left, center, right, justify
Vertical Alignment: top, middle, bottom (table cells)
Background Color: For table cells and spans
Font Styles: bold, italic, underline, strikethrough

🖼️ Image Handling

Image Constraints

Images are automatically resized to fit within these constraints while maintaining aspect ratio:

Maximum Width: 600px
Maximum Height: 900px

Supported Image Sources

Base64 Data URIs - Always supported without configuration (data:image/png;base64,...)
HTTP/HTTPS URLs - Requires strategyManager with appropriate strategies
Custom Sources - Implement ImageDownloadStrategy interface

Note: If no strategyManager is provided:

Base64 images will work normally
URL-based images will be skipped with a console warning
No errors will be thrown

📚 API Reference

Main Functions

`transformHtmlToDocx(html: string, options?: HtmlToDocxOptions): Promise<XmlComponent[]>`

Primary function to convert HTML to DOCX elements.

Parameters:

html (string): HTML string to convert
options (optional): Configuration options
- strategyManager (ImageDownloadStrategyManager, optional): Custom image download strategy manager
  - If not provided, only base64 images will work
  - URL-based images will be skipped with a warning

Returns:

Promise<XmlComponent[]>: Array of docx components ready to use in Document

Example:

// Without images or with base64 images only
const elements = await transformHtmlToDocx('<p>Hello</p>');

// With URL-based image support
const strategyManager = new ImageDownloadStrategyManager([
  new HttpImageDownloadStrategy()
]);
const elements = await transformHtmlToDocx('<p>Hello</p>', {
  strategyManager
});

`transformHtmlToDocxSimple(html: string, options?: HtmlToDocxOptions): Promise<XmlComponent[]>`

Simplified transformation for basic text rendering (wraps all content in paragraphs).

Parameters:

Same as transformHtmlToDocx

Returns:

Promise<XmlComponent[]>

Use Case: Simple text content without complex structure

`textToDocx(text: string): Promise<XmlComponent[]>`

Converts plain text to DOCX, preserving line breaks as <br /> tags.

Parameters:

text (string): Plain text string

Returns:

Promise<XmlComponent[]>

Strategy Classes

`ImageDownloadStrategyManager`

Manages multiple image download strategies using Chain of Responsibility pattern.

Constructor:

new ImageDownloadStrategyManager(strategies?: ImageDownloadStrategy[])

Methods:

addStrategy(strategy: ImageDownloadStrategy): void - Add a new strategy
download(url: string): Promise<string> - Download image using first matching strategy

`HttpImageDownloadStrategy`

Default strategy for HTTP/HTTPS URLs.

Methods:

canHandle(url: string): boolean - Returns true for http:// or https:// URLs
download(url: string): Promise<string> - Downloads and returns base64 data URI

`ImageDownloadStrategy` Interface

Implement this interface to create custom image download strategies.

interface ImageDownloadStrategy {
  canHandle(url: string): boolean;
  download(url: string): Promise<string>;
}

🔍 How XmlComponents Work

The docx library uses XmlComponent objects to build Word documents. This package transforms HTML elements into these components:

// HTML
<p>Hello <strong>world</strong>!</p>

// Becomes
new Paragraph({
  children: [
    new TextRun({ text: "Hello " }),
    new TextRun({ text: "world", bold: true }),
    new TextRun({ text: "!" })
  ]
})

Common XmlComponent Types

Paragraph - Block of text (from <p>, <div>, <h1>, etc.)
TextRun - Styled text segment (from <span>, <strong>, etc.)
Table - Table structure (from <table>)
TableRow - Table row (from <tr>)
TableCell - Table cell (from <td>, <th>)
ImageRun - Embedded image (from <img>)
ExternalHyperlink - Clickable link (from <a>)

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

📄 License

This project is dual-licensed:

Personal / Non-Commercial Use
Free under an MIT-style non-commercial license.
You can use it for personal, educational, and other non-commercial projects.
Commercial Use
Commercial use requires a paid license (per legal entity / organization).
See LICENSE and LICENSE-COMMERCIAL.md for full terms.

Commercial Licensing (Overview)

Standard and Enterprise one-time licenses
Per company/organization
Perpetual, unlimited use in internal and client projects

After purchase you receive a license key and your payment receipt, which together serve as proof of license.

Is your use commercial?
If you're using this in a business, company, SaaS, client work, or any for-profit context, you should obtain a commercial license.

Questions or edge cases? Email [email protected].

🙏 Acknowledgments

docx - Excellent library for generating DOCX files
cheerio - HTML parser

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

HTML to DOCXjs Compiler

🚀 Features

📦 Installation

Peer Dependencies

🎯 Quick Start

Basic Usage

📖 How It Works

Configuration is Optional

🎨 Examples

Example 1: Document with Formatting

Example 2: Tables with Styling

Example 3: Images with HTTP URLs

Example 4: Base64 Images

🔧 Advanced Configuration

Custom Image Download Strategies

Creating Custom Image Strategy

Multiple Strategies (Chain of Responsibility)

📋 Supported HTML Elements

Block Elements

Inline Elements

Supported CSS Styles

🖼️ Image Handling

Image Constraints

Supported Image Sources

📚 API Reference

Main Functions

transformHtmlToDocx(html: string, options?: HtmlToDocxOptions): Promise<XmlComponent[]>

transformHtmlToDocxSimple(html: string, options?: HtmlToDocxOptions): Promise<XmlComponent[]>

textToDocx(text: string): Promise<XmlComponent[]>

Strategy Classes

ImageDownloadStrategyManager

HttpImageDownloadStrategy

ImageDownloadStrategy Interface

🔍 How XmlComponents Work

Common XmlComponent Types

🤝 Contributing

📄 License

Commercial Licensing (Overview)

🙏 Acknowledgments

`transformHtmlToDocx(html: string, options?: HtmlToDocxOptions): Promise<XmlComponent[]>`

`transformHtmlToDocxSimple(html: string, options?: HtmlToDocxOptions): Promise<XmlComponent[]>`

`textToDocx(text: string): Promise<XmlComponent[]>`

`ImageDownloadStrategyManager`

`HttpImageDownloadStrategy`

`ImageDownloadStrategy` Interface