@file-viewer/docx

v0.3.14

Pure HTML DOCX renderer fork for browser-based Word previews.

wybaby168

word docx docxjs html-renderer word-preview

Pure HTML DOCX renderer for browser-based Word previews. This package maintains the final non-canvas docx-viewer line as a stable maintenance track. It converts WordprocessingML into HTML while keeping semantic structure, page layout, headers/footers, numbering, fields, tables, images and common DrawingML content as close to Word's output as browser layout allows.

This fork includes production hardening for large Word documents: Worker-based parsing, asynchronous rendering yields, Word-saved pagination support, dynamic overflow pagination and layout telemetry.

Installation

npm install @file-viewer/docx

For the standalone browser build, ship these files together and keep the Worker URL same-origin:

dist/docx-preview.js
dist/docx-preview.worker.js
dist/jszip.min.js
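
Because the Worker URL has to be same-origin, a quick check before passing workerUrl can catch a misconfigured path early. The sketch below uses only the standard URL API; isSameOrigin is a hypothetical helper and not part of this package.

```javascript
// Hypothetical helper (not part of @file-viewer/docx): checks whether a
// Worker script URL resolves to the same origin as the page that loads it.
function isSameOrigin(workerUrl, pageUrl) {
  // Relative URLs such as "dist/docx-preview.worker.js" resolve against pageUrl.
  const resolved = new URL(workerUrl, pageUrl);
  return resolved.origin === new URL(pageUrl).origin;
}
```

In a browser, pass location.href as pageUrl and warn (or fall back to main-thread parsing) when the check fails.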

Basic usage

<script src="dist/jszip.min.js"></script>
<script src="dist/docx-preview.js"></script>
<div id="container"></div>
<script>
const options = {
  useWorker: true,
  workerUrl: "dist/docx-preview.worker.js",
  workerJsZipUrl: "dist/jszip.min.js",
  awaitLayout: true,
  strictWordCompatibility: true,
  ignoreLastRenderedPageBreak: false,
  preserveComplexFieldResults: true,
  updatePageReferences: false,
  hideWebHiddenContent: false,
  progress: ev => console.log(ev.phase, ev.current, ev.total, ev.message)
};

docx.renderAsync(fileOrArrayBuffer, document.getElementById("container"), null, options)
  .then(() => console.log("docx: finished"));
</script>

ES module usage:

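import { renderAsync } from "@file-viewer/docx";

// new URL(path, import.meta.url) resolves relative to this module. Whether a
// package path like the ones below is resolved depends on your bundler; if it
// is not, point both URLs at wherever you serve the dist/ files.
await renderAsync(fileOrArrayBuffer, container, null, {
  useWorker: true,
  workerUrl: new URL("@file-viewer/docx/dist/docx-preview.worker.js", import.meta.url).toString(),
  workerJsZipUrl: new URL("@file-viewer/docx/dist/jszip.min.js", import.meta.url).toString(),
  awaitLayout: true
});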
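
The progress callback in the examples above logs the raw event fields. A small formatter can turn them into a single status line. This is a sketch that assumes the event shape shown in the basic usage example ({ phase, current, total, message }) and that current and total are numbers when present; formatProgress is an illustrative name, not a package export.

```javascript
// Formats a progress event into a single status line.
// ASSUMPTION: the event looks like { phase, current, total, message } and any
// field may be missing; this shape is inferred from the usage example only.
function formatProgress(ev) {
  const hasCounts =
    Number.isFinite(ev.current) && Number.isFinite(ev.total) && ev.total > 0;
  const parts = [ev.phase || "working"];
  if (hasCounts) {
    const percent = Math.round((ev.current / ev.total) * 100);
    parts.push(`${ev.current}/${ev.total} (${percent}%)`);
  }
  if (ev.message) parts.push(ev.message);
  return parts.join(" | ");
}
```

It can then be wired in as progress: ev => console.log(formatProgress(ev)), or used to update a status element instead of the console.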

API

renderAsync(
  document: Blob | ArrayBuffer | Uint8Array,
  bodyContainer: HTMLElement,
  styleContainer?: HTMLElement,
  options?: Partial<Options>
): Promise<WordDocument>

parseAsync(
  document: Blob | ArrayBuffer | Uint8Array,
  options?: Partial<Options>
): Promise<WordDocument>

parseAsyncInWorker(
  document: Blob | ArrayBuffer | Uint8Array,
  options?: Partial<Options>
): Promise<WordDocument>

renderDocument(
  wordDocument: WordDocument,
  options?: Partial<Options>
): Promise<Node[]>

awaitRenderedLayout(
  container: HTMLElement,
  options?: Partial<Options>
): Promise<LayoutSnapshot>

collectLayoutSnapshot(
  container: HTMLElement,
  options?: Partial<Options>
): LayoutSnapshot
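
The lower-level functions can be composed by hand when parsing and rendering need to be separated, for example to cache the parsed document. The following is a minimal sketch based only on the signatures listed above. It assumes that the nodes returned by renderDocument must be attached by the caller (as in upstream docx-preview); previewDocx and the api parameter are illustrative names.

```javascript
// Sketch: compose the lower-level calls instead of using renderAsync.
// `api` is the imported module object, e.g.
//   import * as api from "@file-viewer/docx";
// ASSUMPTION: renderDocument only creates nodes and the caller attaches them;
// verify this against the package source before relying on it.
async function previewDocx(api, data, container, options = {}) {
  const wordDocument = await api.parseAsync(data, options);
  const nodes = await api.renderDocument(wordDocument, options);

  // Clear any previous preview and attach the freshly rendered nodes.
  container.replaceChildren(...nodes);

  // Wait for images and fonts, then collect layout telemetry.
  const snapshot = await api.awaitRenderedLayout(container, options);
  return { wordDocument, snapshot };
}
```

Passing the module in as a parameter keeps the function easy to exercise with a stub in unit tests.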

Important options

{
  className: "docx",
  inWrapper: true,
  hideWrapperOnPrint: false,
  ignoreWidth: false,
  ignoreHeight: false,
  ignoreFonts: false,
  breakPages: true,

  // Word fidelity / pagination
  ignoreLastRenderedPageBreak: false,
  strictWordCompatibility: true,
  paginationTolerance: 2,
  maxDynamicPaginationPasses: 1000,
  awaitLayout: true,

  // Worker and responsiveness
  useWorker: true,
  workerUrl: "dist/docx-preview.worker.js",
  workerJsZipUrl: "dist/jszip.min.js",
  workerFallback: true,
  workerTimeout: 120000,
  renderPageBatchSize: 2,
  renderYieldEveryMs: 16,
  progress: ev => {},             // callback; receives { phase, current, total, message }

  // Field and print-layout compatibility
  preserveComplexFieldResults: true,
  updatePageReferences: false,
  hideWebHiddenContent: false,

  // Content switches
  renderHeaders: true,
  renderFooters: true,
  renderFootnotes: true,
  renderEndnotes: true,
  renderComments: false,
  renderAltChunks: true,
  renderChanges: false,
  experimental: false,
  trimXmlDeclaration: true,
  useBase64URL: false,
  debug: false
}
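
One way to manage this many switches is to keep a small preset and override individual values per call. The sketch below copies option names and values from the list above; the preset and the buildOptions helper are illustrative and not part of the package.

```javascript
// Hypothetical preset for Word-authored documents. Names and values are
// copied from the option list above; the grouping itself is an assumption.
const WORD_FIDELITY_PRESET = Object.freeze({
  useWorker: true,
  workerFallback: true,
  awaitLayout: true,
  strictWordCompatibility: true,
  ignoreLastRenderedPageBreak: false,
  preserveComplexFieldResults: true,
  hideWebHiddenContent: false
});

// Later spreads win, so caller overrides replace preset values.
function buildOptions(overrides = {}) {
  return { ...WORD_FIDELITY_PRESET, ...overrides };
}
```

For example, buildOptions({ renderComments: true, workerUrl: "dist/docx-preview.worker.js" }) keeps the fidelity settings and adds the two per-page values.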

Large-document rendering path

The production path is intentionally split into stages:

1. The main thread starts a Worker using workerUrl.
2. The Worker loads JSZip using workerJsZipUrl, unzips the package, parses XML parts, resolves relationships and serializes a compact document snapshot.
3. The main thread restores the snapshot into WordDocument and renders HTML pages in batches, yielding to requestIdleCallback, requestAnimationFrame or setTimeout between page batches.
4. awaitRenderedLayout waits for images and fonts, runs dynamic overflow pagination and returns a LayoutSnapshot.
5. collectLayoutSnapshot can be used in CI or telemetry to detect overflow pages, unresolved media, page count, aggregate text length, fields and floating objects.

This avoids the former demo-page freeze caused by synchronous ZIP/XML parsing and very large DOM construction on the UI thread.
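
The batched rendering step can be illustrated with a small scheduler. This is a generic sketch of the technique, not the renderer's actual code: it yields between batches using the first available primitive and takes a caller-supplied renderPage function. The names yieldToMain and renderInBatches are hypothetical.

```javascript
// Resolve on the first available scheduling primitive. Browsers get
// requestIdleCallback or requestAnimationFrame; other environments fall
// back to setTimeout.
function yieldToMain() {
  return new Promise(resolve => {
    if (typeof requestIdleCallback === "function") requestIdleCallback(() => resolve());
    else if (typeof requestAnimationFrame === "function") requestAnimationFrame(() => resolve());
    else setTimeout(resolve, 0);
  });
}

// Render `pages` in batches of `batchSize`, pausing between batches so the
// UI thread can handle input and paint. `renderPage` is caller-supplied.
async function renderInBatches(pages, renderPage, batchSize = 2, pause = yieldToMain) {
  const rendered = [];
  for (let start = 0; start < pages.length; start += batchSize) {
    for (const page of pages.slice(start, start + batchSize)) {
      rendered.push(renderPage(page));
    }
    // No need to yield after the final batch.
    if (start + batchSize < pages.length) await pause();
  }
  return rendered;
}
```

The default batch size of 2 mirrors renderPageBatchSize from the option list; a time-based budget in the spirit of renderYieldEveryMs would be an alternative design.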

Pagination, headers/footers and TOC

The renderer supports:

explicit page breaks (w:br w:type="page");
Word-saved page break positions (w:lastRenderedPageBreak), enabled by default;
paragraph pageBreakBefore, keepNext, keepLines and widowControl hints;
section page size, orientation, columns, margins, headers and footers;
table row splitting with w:tblHeader repeat-header preservation;
post-render dynamic pagination for content that still overflows after structural page splitting;
layout-time updates for truly page-local fields such as PAGE, NUMPAGES, SECTIONPAGES and SECTION;
preservation of Word's stored complex-field results for TOC, PAGEREF, REF, SEQ, IF, MERGEFIELD and related fields by default, so a Word-authored table of contents keeps its cached page numbers and tab leaders instead of being recalculated incorrectly by the browser.
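
The idea behind post-render overflow pagination can be shown with a deliberately simplified model. The sketch below works on a list of measured block heights instead of real DOM nodes and ignores keepNext, keepLines, widow control and table rows, so it is not the renderer's algorithm; paginateByHeight is a hypothetical name, and its tolerance parameter plays the role of paginationTolerance.

```javascript
// Simplified model of overflow pagination: given measured block heights (px),
// start a new page whenever adding a block would exceed the page's content
// height by more than `tolerance`. Illustrative only.
function paginateByHeight(blockHeights, pageHeight, tolerance = 2) {
  const pages = [[]];
  let used = 0;
  for (const height of blockHeights) {
    const current = pages[pages.length - 1];
    const overflows = used + height > pageHeight + tolerance;
    // Never leave a page empty: an oversized block still gets placed.
    if (overflows && current.length > 0) {
      pages.push([height]);
      used = height;
    } else {
      current.push(height);
      used += height;
    }
  }
  return pages;
}
```

The tolerance absorbs sub-pixel rounding differences so that a block overshooting the page by a pixel or two does not force an extra page.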

For Word-authored documents, keep ignoreLastRenderedPageBreak: false. Those markers preserve the positions calculated by the Word-compatible editor that last saved the document and are especially important for long tables and large Chinese technical documents. Keep preserveComplexFieldResults: true unless you intentionally want to recompute cross-reference fields; the default matches Word print-layout behavior for existing TOC/PAGEREF results.

w:webHidden is not hidden by default in this renderer because it only applies to Word's Web Layout view. In Print Layout, Word still displays the TOC tab before the page number and the cached PAGEREF result even when those runs are marked w:webHidden. Set hideWebHiddenContent: true only for an explicit Web Layout style preview.

Header/footer selection follows the WordprocessingML print-layout rules: first header/footer references are used only when the section has w:titlePg, and even references are used only when document settings contain w:evenAndOddHeaders; otherwise the default/odd header is used. This prevents even-page empty headers from hiding the normal header in documents that merely contain unused even header references.
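
The selection rule above fits in a few lines. This is a sketch of the rule as described, with an assumed input shape (pageNumber, isFirstPageOfSection, titlePg, evenAndOddHeaders); selectHeaderFooterType is a hypothetical function and does not handle missing or inherited references.

```javascript
// Pick which header/footer reference type applies to a page.
// ASSUMPTION: the input shape is invented for illustration; titlePg comes
// from the section properties and evenAndOddHeaders from document settings.
function selectHeaderFooterType({ pageNumber, isFirstPageOfSection, titlePg, evenAndOddHeaders }) {
  // "first" is only honoured when the section declares w:titlePg.
  if (isFirstPageOfSection && titlePg) return "first";
  // "even" is only honoured when settings contain w:evenAndOddHeaders.
  if (evenAndOddHeaders && pageNumber % 2 === 0) return "even";
  // Otherwise the default reference is used; it doubles as the odd header.
  return "default";
}
```

With evenAndOddHeaders unset, page 2 resolves to "default", which is the case that keeps an unused empty even header from hiding the normal one.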

Complex-script formatting is kept script-aware. w:iCs and w:bCs now affect RTL/complex-script spans, but they are not applied to East Asian TOC text; this avoids turning Chinese TOC level 3 entries italic when the DOCX only specified complex-script italics.
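
In WordprocessingML, w:i and w:b apply to non-complex-script text, while w:iCs and w:bCs apply to complex-script text. A script-aware resolver might look like the sketch below; the run-property object and the script labels ("latin", "eastAsia", "complex") are assumptions made for illustration, and effectiveEmphasis is a hypothetical name.

```javascript
// Decide whether a run is italic/bold for a given script class.
// ASSUMPTION: runProps is a plain object with optional boolean flags
// i, b, iCs, bCs, and `script` is one of "latin", "eastAsia", "complex".
function effectiveEmphasis(runProps, script) {
  const complex = script === "complex";
  return {
    italic: Boolean(complex ? runProps.iCs : runProps.i),
    bold: Boolean(complex ? runProps.bCs : runProps.b)
  };
}
```

A run that only sets iCs therefore stays upright for East Asian text, which matches the TOC behavior described above.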

Regression fixture

tests/regression/database-design.docx is a large Chinese database-design document used to validate production behavior. The fixture has a large word/document.xml, thousands of paragraphs, many tables, TOC/PAGEREF fields, saved page breaks and multiple header/footer references.

Serve the repository from a local HTTP server and open this page to run the browser regression:

tests/regression/database-design-render.html

The Node regression can be run with:

node tests/regression/check-database-design.cjs

It renders tests/regression/database-design.docx and asserts that TOC entries keep Word's cached page numbers, such as 引言 (Introduction) 5, 各系统与数据库对应关系 (System-to-database mapping) 6, 表间关系 (Relationships between tables) 16 and 数据恢复策略 (Data recovery strategy) 103, instead of collapsing to page 1. It also verifies that w:tab right-tab leaders are emitted as measurable docx-tab-stop elements with dotted leader styling, so the dot leaders before page numbers in the TOC and figure list remain visible.

The regression also checks print-layout TOC hyperlink styling: TOC hyperlinks keep their anchors, but nested Hyperlink character-style runs are forced to inherit the paragraph color and text decoration, matching Word's black print-layout TOC rather than a browser-blue link.

A successful browser run updates the fixed status banner with page count, overflow pages, unresolved media and text length.

Build

This repository includes a lightweight build script that emits CommonJS, ESM and Worker bundles without requiring a full Rollup install in constrained environments:

npm run build

The emitted .min.* files are functionally equivalent to the regular builds but are not size-optimized. Run a production minifier over them if you need a smaller bundle.

Notes on fidelity

The renderer follows OOXML semantics and uses browser-native layout. It is designed for production preview and regression testing and does not claim byte-for-byte or pixel-perfect identity with Microsoft Word's private layout engine. Use LayoutSnapshot and the supplied regression fixture to monitor the remaining differences, which depend on browser fonts, font fallback and shaping engines.
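
A CI gate over the snapshot could be as small as the sketch below. The field names used here (pageCount, overflowPages, unresolvedMedia, all numeric) are assumptions chosen to match the quantities mentioned in this README; check the package's LayoutSnapshot type for the real shape. findLayoutProblems is a hypothetical helper.

```javascript
// Hypothetical CI gate over a layout snapshot.
// ASSUMPTION: snapshot has numeric pageCount, overflowPages and
// unresolvedMedia fields; the real LayoutSnapshot may differ.
function findLayoutProblems(snapshot, { expectedPages, pageTolerance = 0 } = {}) {
  const problems = [];
  if (snapshot.overflowPages > 0) {
    problems.push(`${snapshot.overflowPages} page(s) overflow`);
  }
  if (snapshot.unresolvedMedia > 0) {
    problems.push(`${snapshot.unresolvedMedia} media item(s) unresolved`);
  }
  if (expectedPages !== undefined &&
      Math.abs(snapshot.pageCount - expectedPages) > pageTolerance) {
    problems.push(`page count ${snapshot.pageCount} differs from expected ${expectedPages}`);
  }
  return problems;
}
```

Allowing a small pageTolerance is one way to keep the gate stable across machines with different installed fonts.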

Numbering suffix and heading/list numbering

OOXML separates numbering text from the suffix after the numbering symbol. w:lvlText contains the numbering pattern, w:numFmt defines the format used to expand %1, %2, etc., and w:suff defines the content inserted between the generated number and the paragraph text. When w:suff is omitted, OOXML treats it as tab. The renderer emits this suffix as a valid CSS Unicode escape rather than as escaped visible text, so headings and lists no longer show a literal \9 or \a0 after the numbering.
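
The interplay of w:lvlText, w:numFmt and w:suff can be sketched as a pure function. This is an illustration under stated assumptions, not the renderer's code: it supports only five common formats, produces real separator characters instead of CSS escapes, and maps the space suffix to a no-break space to mirror the \a0 token mentioned above. All function names are hypothetical.

```javascript
// Separator characters for w:suff (tab is the default when omitted).
const SUFFIX_CHARS = { tab: "\t", space: "\u00a0", nothing: "" };

function toRoman(n) {
  const table = [[1000, "M"], [900, "CM"], [500, "D"], [400, "CD"], [100, "C"],
    [90, "XC"], [50, "L"], [40, "XL"], [10, "X"], [9, "IX"], [5, "V"], [4, "IV"], [1, "I"]];
  let out = "";
  for (const [value, symbol] of table) {
    while (n >= value) { out += symbol; n -= value; }
  }
  return out;
}

// Letter formats repeat the letter after z: 27 -> "aa", 28 -> "bb".
function toLetter(n) {
  const letter = String.fromCharCode(97 + ((n - 1) % 26));
  return letter.repeat(Math.floor((n - 1) / 26) + 1);
}

const FORMATTERS = {
  decimal: n => String(n),
  lowerLetter: n => toLetter(n),
  upperLetter: n => toLetter(n).toUpperCase(),
  lowerRoman: n => toRoman(n).toLowerCase(),
  upperRoman: n => toRoman(n)
};

// Expand a pattern such as "%1.%2" using per-level counters and formats,
// then append the suffix separator. Unknown formats fall back to decimal.
function formatNumberLabel(lvlText, counters, numFmts, suff = "tab") {
  const label = lvlText.replace(/%(\d)/g, (_, digit) => {
    const level = Number(digit) - 1;
    const format = FORMATTERS[numFmts[level]] || FORMATTERS.decimal;
    return format(counters[level]);
  });
  return label + (SUFFIX_CHARS[suff] ?? SUFFIX_CHARS.tab);
}
```

When such a label is written into a CSS content string, the separator has to be emitted as an escape the CSS parser understands, which is the failure mode the focused regression guards against.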

Run the focused regression with:

node tests/regression/check-numbering-suffix.cjs

The large database-design regression uses the cached rendered snapshot by default to avoid spending minutes rebuilding the full document in constrained CI. Set DOCXJS_RERENDER_REGRESSION=1 when you need to regenerate it from the DOCX.
