otomate

v0.1.0

Otomate

Universal document diffing library — structure-aware, string-level, multi-format.

otomate evaluates differences across documents with three core properties that existing libraries lack:

Structure tracking: tree-based diffing that understands document hierarchy, not just flat text
String tracking: character- and word-level granularity within text nodes via the Myers algorithm
Lossless multi-format conversion: HTML, docx, and JSON with round-trip fidelity

Architecture

HTML ──→ ┌─────────────────────┐ ──→ HTML
         │                     │
docx ──→ │  Universal Document │ ──→ docx
         │     Model (UDM)     │
 CSS ──→ │                     │ ──→ JSON
         └─────────┬───────────┘
                   │
              ┌────▼────┐
              │  Diff   │  GumTree-inspired
              │  Engine │  3-phase algorithm
              └────┬────┘
                   │
           DiffResult (typed operations)

Universal Document Model (UDM)

A JSON-serializable AST that any document format converts to/from.

ProseMirror-style marks: Text("hello", marks: [{type: "strong"}]) instead of nested Strong > Text("hello"). Changing bold to italic is a single mark update, not a delete + insert + reparent.
classes as a first-class field: HTML classes, docx styles, and Markdown attributes all map to classes: string[] on any node. Classes affect the content hash and show up in diffs.
Format-specific metadata: the data.html and data.docx namespaces preserve round-trip fidelity per format.
Lossless docx round-trip: an otomate-udm.json file embedded in the docx ZIP enables exact reconstruction when both writer and reader are otomate.
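
To make the mark model concrete, here is a minimal sketch in plain JavaScript. The node shape (type, value, marks, classes) and the replaceMark helper are assumptions for illustration; they are not the actual @otomate/core builders.

```javascript
// Hypothetical UDM text node: formatting lives in `marks`, not in wrapper nodes.
const boldText = {
  type: "text",
  value: "hello",
  marks: [{ type: "strong" }],
  classes: [],
};

// Swap one mark type for another without touching the node's text or position.
// Returns a new node; the input is left unchanged.
function replaceMark(node, fromType, toType) {
  return {
    ...node,
    marks: node.marks.map((m) => (m.type === fromType ? { ...m, type: toType } : m)),
  };
}

const italicText = replaceMark(boldText, "strong", "emphasis");

// The tree shape is identical before and after; only one mark differs,
// so a diff can report a single update rather than delete + insert + reparent.
console.log(JSON.stringify(italicText.marks)); // [{"type":"emphasis"}]
```

With nested wrapper nodes, the same change would replace the Strong parent with an Emphasis parent and move the text node between them, which is why the flat mark list is easier to diff.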

Diff Algorithm

Three-phase GumTree-inspired approach:

| Phase | Algorithm | What it catches |
|-------|-----------|----------------|
| 1. Top-down | FNV-1a content hash matching | Identical subtrees (unchanged content) |
| 2. Bottom-up | Dice coefficient + inner matching | Modified nodes, moves, reorders |
| 3. Text diff | Myers O(ND) | Word/character-level changes within text nodes |
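
As a rough illustration of phase 1, the sketch below implements the standard 32-bit FNV-1a hash and uses it to fingerprint a subtree. The fnv1a function follows the published algorithm; hashNode and the node fields it reads are assumptions, and the real @otomate/core hashing may combine different fields.

```javascript
// Standard 32-bit FNV-1a over the UTF-8 bytes of a string.
function fnv1a(str) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of new TextEncoder().encode(str)) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return hash >>> 0;
}

// Hypothetical subtree hash: a node's own content plus its children's hashes.
// Equal hashes let the top-down phase treat whole subtrees as unchanged.
function hashNode(node) {
  const childHashes = (node.children || []).map(hashNode);
  return fnv1a(JSON.stringify([node.type, node.classes || [], node.value ?? "", childHashes]));
}

const original = {
  type: "paragraph",
  classes: ["intro"],
  children: [{ type: "text", value: "Hello" }],
};
const copy = JSON.parse(JSON.stringify(original));
const restyled = { ...original, classes: ["highlighted"] };

console.log(hashNode(original) === hashNode(copy));     // same content, same hash
console.log(hashNode(original) === hashNode(restyled)); // class change alters the hash
```

Including classes in the hashed content is what makes a class-only change visible to the diff, as described in the UDM section.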

Output: serializable DiffResult with typed operations — insert, delete, move, update, updateText.
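
Phase 2 in the table above relies on a Dice coefficient to decide whether two non-identical nodes are similar enough to be treated as the same node with modifications. The sketch below shows the coefficient itself over sets of child labels. The label strings and the 0.5 threshold are assumptions for illustration, not values taken from @otomate/diff.

```javascript
// Dice coefficient between two collections treated as sets:
// 2 * |A ∩ B| / (|A| + |B|), from 0 (disjoint) to 1 (identical).
function dice(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 1; // two empty sets count as identical
  let common = 0;
  for (const item of setA) {
    if (setB.has(item)) common++;
  }
  return (2 * common) / (setA.size + setB.size);
}

// Hypothetical child labels of an old and a new paragraph.
const oldChildren = ["text:Hello ", "text:World", "text:!"];
const newChildren = ["text:Hello ", "text:Everyone", "text:!"];

const score = dice(oldChildren, newChildren); // 2 * 2 / (3 + 3)
const MATCH_THRESHOLD = 0.5; // assumed cut-off
const isSameNode = score >= MATCH_THRESHOLD;

console.log(score.toFixed(3), isSameNode); // 0.667 true
```

Nodes that clear the threshold are paired, and their remaining differences are then reported as update or updateText operations instead of a delete followed by an insert.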

Packages

| Package | Description |
|---------|-------------|
| @otomate/core | UDM types, node builders, traversal, FNV-1a hashing |
| @otomate/diff | Tree diff engine + Myers string-level diffing |
| @otomate/html | HTML ↔ UDM adapter (via hast) |
| @otomate/docx | docx ↔ UDM adapter (via jszip + fast-xml-parser) |
| @otomate/inject | Template injection: fill {{placeholders}} with data |
| @otomate/css-docx | CSS properties → OOXML style mapping |
| @otomate/ui | Web UI for testing conversions and diffs |

Installation

npm / pnpm (Node.js or bundler)

npm install otomate

import { readHtml, writeDocx, diff } from "otomate";

CDN (browser, no build step)

<script src="https://cdn.jsdelivr.net/npm/otomate/dist/otomate.umd.cjs"></script>
<script>
  const tree = otomate.readHtml("<p>Hello <strong>world</strong></p>");
  const html = otomate.writeHtml(tree);
  console.log(tree); // UDM tree
</script>

Or with ES modules:

<script type="module">
  import { readHtml, writeDocx, diff } from "https://cdn.jsdelivr.net/npm/otomate/dist/otomate.js";

  const tree = readHtml("<h1>Hello</h1><p>World</p>");
  const docxBuffer = await writeDocx(tree);
</script>

Individual packages (tree-shakeable)

npm install @otomate/core @otomate/diff @otomate/html @otomate/docx

Quick Start

Parse HTML to UDM

import { readHtml, writeHtml } from "@otomate/html";

const tree = readHtml(`
  <h1>Hello World</h1>
  <p>This is <strong>bold</strong> and <em>italic</em> text.</p>
`);

// tree is a UDM Root node:
// root
//   heading depth=1
//     text "Hello World"
//   paragraph
//     text "This is "
//     text "bold" [strong]
//     text " and "
//     text "italic" [emphasis]
//     text " text."
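
The outline in the comment above can be reproduced with a short recursive walk. This is a sketch over a hand-written tree; the field names (type, depth, value, marks, children) are assumed from the outline and may not match the real UDM types exactly.

```javascript
// Hand-written stand-in for the tree that readHtml would return.
const sampleTree = {
  type: "root",
  children: [
    { type: "heading", depth: 1, children: [{ type: "text", value: "Hello World" }] },
    {
      type: "paragraph",
      children: [
        { type: "text", value: "This is " },
        { type: "text", value: "bold", marks: [{ type: "strong" }] },
        { type: "text", value: " and " },
        { type: "text", value: "italic", marks: [{ type: "emphasis" }] },
        { type: "text", value: " text." },
      ],
    },
  ],
};

// Depth-first walk that returns one indented line per node.
function outline(node, indent = 0) {
  let label = node.type;
  if (node.depth !== undefined) label += ` depth=${node.depth}`;
  if (node.type === "text") {
    label += ` ${JSON.stringify(node.value)}`;
    const marks = (node.marks || []).map((m) => m.type);
    if (marks.length > 0) label += ` [${marks.join(", ")}]`;
  }
  const lines = ["  ".repeat(indent) + label];
  for (const child of node.children || []) {
    lines.push(...outline(child, indent + 1));
  }
  return lines;
}

console.log(outline(sampleTree).join("\n"));
```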

Parse with CSS styling

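import { readHtml } from "@otomate/html";

// Example input; any HTML string works here
const html = `<h1>Report</h1><p class="intro">Summary</p>`;

const tree = readHtml(html, {
  css: `
    h1 { font-size: 28pt; color: #1e3a5f; }
    .intro { font-style: italic; color: #374151; }
    .highlighted { background-color: #dbeafe; }
  `
});
// CSS rules are stored on tree.data.css and used by the docx writer for Word styles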

Convert to docx

import fs from "node:fs";
import { writeDocx } from "@otomate/docx";

const buffer = await writeDocx(tree);
// buffer is a valid .docx file with:
// - CSS-derived Word styles (fonts, colors, sizes, backgrounds)
// - Embedded otomate-udm.json for lossless round-trip
fs.writeFileSync("output.docx", buffer);

Lossless round-trip

import { readDocx } from "@otomate/docx";

const tree2 = await readDocx(buffer);
// tree2 is identical to tree — all classes, hierarchy, marks preserved
// via embedded otomate-udm.json (invisible to Word, used by otomate)

Diff two documents

import { diff } from "@otomate/diff";

const result = diff(oldTree, newTree);
// result.operations: [
//   { type: "updateText", path: [0,0], changes: [
//     { type: "equal", value: "Hello " },
//     { type: "delete", value: "World" },
//     { type: "insert", value: "Everyone" }
//   ]},
//   { type: "insert", path: [2], node: { type: "paragraph", ... } },
//   { type: "move", from: [1], to: [3] }
// ]
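
The changes array inside an updateText operation can be produced by a word-level text diff and replayed to recover either side. The sketch below uses a longest-common-subsequence table as a simpler stand-in for the Myers algorithm that the library names; it yields the same kind of equal/delete/insert runs. All function names here are hypothetical.

```javascript
// Split into words and whitespace runs so that joining the tokens restores the input.
function tokenize(text) {
  return text.match(/\s+|\S+/g) || [];
}

// Word-level diff via an LCS table (a stand-in for Myers, not the library's code).
function wordDiff(oldStr, newStr) {
  const a = tokenize(oldStr);
  const b = tokenize(newStr);
  // lcs[i][j] = LCS length of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j] ? lcs[i + 1][j + 1] + 1 : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  const changes = [];
  const push = (type, value) => {
    const last = changes[changes.length - 1];
    if (last && last.type === type) last.value += value; // merge adjacent runs
    else changes.push({ type, value });
  };
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { push("equal", a[i]); i++; j++; }
    else if (lcs[i + 1][j] >= lcs[i][j + 1]) { push("delete", a[i]); i++; }
    else { push("insert", b[j]); j++; }
  }
  while (i < a.length) push("delete", a[i++]);
  while (j < b.length) push("insert", b[j++]);
  return changes;
}

// Replay a change list: skip inserts to get the old text, skip deletes to get the new.
const oldTextOf = (changes) => changes.filter((c) => c.type !== "insert").map((c) => c.value).join("");
const newTextOf = (changes) => changes.filter((c) => c.type !== "delete").map((c) => c.value).join("");

const textChanges = wordDiff("Hello World", "Hello Everyone");
console.log(textChanges);
console.log(oldTextOf(textChanges)); // Hello World
console.log(newTextOf(textChanges)); // Hello Everyone
```

The LCS table costs O(N·M) time and memory, which is fine for a single text node; Myers reaches the same result in O(ND) and is the better choice for long inputs.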

Cross-format diff

import { readHtml, readDocx, diff } from "otomate";

// Diff HTML against a docx; both are parsed to UDM first
const htmlTree = readHtml(htmlString);
const docxTree = await readDocx(docxBuffer);
const result = diff(htmlTree, docxTree);
// Structural comparison regardless of source format

Template injection

Fill {{placeholders}} in a document template with data from any source.

import { readDocx, writeDocx, inject } from "otomate";

// 1. Read template (HTML or docx with content controls)
const tree = await readDocx(templateBuffer);

// 2. Inject data — any JSON object, keys match {{placeholders}}
const filled = inject(tree, {
  name: "John Doe",
  position: "Senior Engineer",
  benefits: [
    { name: "Health", description: "Full medical coverage" },
    { name: "401k", description: "6% company match" },
  ],
  showRelocation: true,
  relocationAmount: "$15,000",
});

// 3. Export filled document
const output = await writeDocx(filled);

Placeholder syntax:

| Syntax | Description |
|--------|-------------|
| {{fieldName}} | Replace with value (inherits formatting) |
| {{obj.nested}} | Dot-path into nested objects |
| {{#each items}}...{{/each}} | Repeat block for each array item |
| {{#if condition}}...{{/if}} | Conditional block |
| {{#if x}}...{{else}}...{{/if}} | Conditional with else |
| {{@richField}} | Replace paragraph with block-level content |
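
The first two rows of the table can be illustrated on a plain string. This sketch covers only {{fieldName}} and {{obj.nested}}; block helpers such as #each and #if are left untouched. The fillPlaceholders and resolvePath names are hypothetical, the real inject operates on UDM trees rather than strings, and leaving unknown placeholders in place is an assumed behavior.

```javascript
// Follow a dot-path such as "role.title" into a nested object.
function resolvePath(data, path) {
  return path.split(".").reduce((value, key) => (value == null ? undefined : value[key]), data);
}

// Replace simple {{path}} placeholders; anything unresolved is kept verbatim.
function fillPlaceholders(template, data) {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match, path) => {
    const value = resolvePath(data, path);
    return value === undefined ? match : String(value);
  });
}

const filledText = fillPlaceholders(
  "Dear {{name}}, welcome as {{role.title}}. {{missing}}",
  { name: "John Doe", role: { title: "Senior Engineer" } }
);

console.log(filledText); // Dear John Doe, welcome as Senior Engineer. {{missing}}
```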

Word content controls: When reading a .docx created with Word's Developer tab content controls, otomate automatically detects them and converts each control's tag name into a {{placeholder}}. No manual placeholder typing needed.

Data format: Free-form JSON — no fixed schema. The keys in your data object map directly to the placeholder names in your template.

Interactive demo: Open examples/inject-demo.html to try it live with editable JSON and instant preview.

Web UI

A built-in test interface for interactive conversion and diffing:

cd packages/ui
pnpm dev
# Opens at http://localhost:5555

Four panels:

Input — HTML editor or docx file upload
CSS Stylesheet — element and class rules (applied to preview and docx export)
UDM Tree — live parsed document structure
Output — HTML preview, HTML source, or UDM JSON

Features:

Live CSS → preview (edit a rule, see it instantly)
Export .docx with CSS-derived Word styles
Re-import .docx with lossless round-trip
Diff two documents with typed operation view

Node Types

Block nodes

| UDM Type | HTML | docx | Description |
|----------|------|------|-------------|
| root | <body> | document | Document root |
| paragraph | <p> | w:p | Paragraph |
| heading | <h1>–<h6> | w:pStyle Heading1-6 | Heading (depth 1-6) |
| blockquote | <blockquote> | indented w:p + left border | Block quote |
| list | <ul>/<ol> | w:numPr | List (ordered/unordered) |
| listItem | <li> | list paragraph | List item |
| codeBlock | <pre><code> | w:pStyle Code | Code block with optional lang |
| table | <table> | w:tbl | Table |
| tableRow | <tr> | w:tr | Table row |
| tableCell | <td>/<th> | w:tc | Table cell |
| thematicBreak | <hr> | bottom border | Horizontal rule |
| div | <div> | transparent container | Generic container |
| image | <img> | w:drawing | Image |

Inline marks

| Mark | HTML | docx | Description |
|------|------|------|-------------|
| strong | <strong>/<b> | w:b | Bold |
| emphasis | <em>/<i> | w:i | Italic |
| underline | <u> | w:u | Underline |
| strikethrough | <del>/<s> | w:strike | Strikethrough |
| superscript | <sup> | w:vertAlign superscript | Superscript |
| subscript | <sub> | w:vertAlign subscript | Subscript |
| code | <code> | monospace font | Inline code |
| link | <a> | w:hyperlink | Hyperlink (with URL) |
| highlight | <mark> | w:highlight | Highlighted text |

CSS → docx Style Mapping

When CSS rules are provided, otomate generates real Word styles with OOXML formatting:

| CSS Property | OOXML | Notes |
|---|---|---|
| font-family | w:rFonts | First font in stack |
| font-size | w:sz | Converted to half-points |
| font-weight: bold | w:b | ≥700 or "bold" |
| font-style: italic | w:i | |
| color | w:color | Hex, rgb(), named colors |
| background-color | w:shd (paragraph) + w:shd (run) | |
| text-decoration: underline | w:u | |
| text-decoration: line-through | w:strike | |
| text-align | w:jc | left, center, right, justify→both |
| margin-top/bottom | w:spacing before/after | Converted to twips |
| margin-left | w:ind left | Converted to twips |
| line-height | w:spacing line | |

Container CSS (on div elements) cascades to all child paragraphs, headings, and list items.
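
The unit conversions named in the table follow from fixed ratios: OOXML font sizes are stored in half-points, a twip is one twentieth of a point, and a CSS pixel is 0.75 pt. The sketch below shows that arithmetic. The helper names and the set of supported units are assumptions, not the @otomate/css-docx API.

```javascript
// Points per CSS unit (CSS defines 1in = 72pt = 96px).
const POINTS_PER_UNIT = { pt: 1, px: 0.75, in: 72 };

// Parse a CSS length such as "28pt" or "16px" into points.
function toPoints(cssLength) {
  const match = /^(-?\d*\.?\d+)(pt|px|in)$/.exec(cssLength.trim());
  if (!match) throw new Error(`Unsupported CSS length: ${cssLength}`);
  return parseFloat(match[1]) * POINTS_PER_UNIT[match[2]];
}

// w:sz stores font size in half-points.
const toHalfPoints = (cssLength) => Math.round(toPoints(cssLength) * 2);

// w:spacing and w:ind store distances in twips (1pt = 20 twips).
const toTwips = (cssLength) => Math.round(toPoints(cssLength) * 20);

// w:color takes six hex digits without the leading "#".
function toOoxmlColor(hex) {
  let digits = hex.replace(/^#/, "");
  if (digits.length === 3) digits = digits.split("").map((d) => d + d).join("");
  return digits.toUpperCase();
}

console.log(toHalfPoints("28pt"));    // 56
console.log(toTwips("12pt"));         // 240
console.log(toTwips("16px"));         // 240
console.log(toOoxmlColor("#1e3a5f")); // 1E3A5F
```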

Lossless Round-Trip Strategy

otomate uses a two-tier system for format conversion:

Tier 1 — Universal structure (diffable across formats): Text, paragraphs, headings, lists, tables, images, bold, italic, links, classes.

Tier 2 — Format-specific metadata (preserved per format):

data.html — id, data-* attributes, style, aria-*, ol type
data.docx — spacing, indent, alignment, fonts, colors, raw XML parts

otomate-to-otomate round-trip: The docx writer embeds word/otomate-udm.json (the full UDM tree) and word/otomate-css.json (CSS rules) inside the ZIP. Word ignores these files. When otomate reads the docx back, it finds the snapshot and reconstructs the tree perfectly — all classes, hierarchy, marks, and CSS rules intact.
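
The read path described above can be sketched as "use the embedded snapshot if present, otherwise parse the Word XML". In this sketch a Map stands in for the ZIP archive and stubParser for the real XML reader; readTree and its return shape are hypothetical, and only the word/otomate-udm.json path comes from the description above.

```javascript
const SNAPSHOT_PATH = "word/otomate-udm.json";

// Prefer the embedded UDM snapshot; fall back to parsing document.xml.
function readTree(entries, parseDocumentXml) {
  const snapshot = entries.get(SNAPSHOT_PATH);
  if (snapshot !== undefined) {
    return { tree: JSON.parse(snapshot), lossless: true };
  }
  return { tree: parseDocumentXml(entries.get("word/document.xml")), lossless: false };
}

const writtenTree = {
  type: "root",
  children: [{ type: "paragraph", classes: ["intro"], children: [] }],
};

// A docx written by otomate carries the snapshot next to the Word XML.
const fromOtomate = new Map([
  ["word/document.xml", "<w:document/>"],
  [SNAPSHOT_PATH, JSON.stringify(writtenTree)],
]);
// A docx written by Word has no snapshot.
const fromWord = new Map([["word/document.xml", "<w:document/>"]]);

// Stand-in for the real XML-to-UDM parser.
const stubParser = () => ({ type: "root", children: [] });

const viaSnapshot = readTree(fromOtomate, stubParser);
const viaXml = readTree(fromWord, stubParser);

console.log(viaSnapshot.lossless, viaXml.lossless); // true false
```

One consequence of this design is that a document edited in Word after export may no longer agree with its snapshot, so a real reader would need some way to detect a stale snapshot before trusting it.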

Development

# Install dependencies
pnpm install

# Build all packages
pnpm -r build

# Run tests
pnpm -r test

# Start the UI dev server
cd packages/ui && pnpm dev

License

Business Source License 1.1 — free for non-production use. Production use permitted except for offering otomate as a competitive hosted/embedded service. Converts to MIT on 2030-04-03.

All dependencies are MIT or Apache-2.0 licensed.