@origints/mammoth
v0.1.1
DOCX to HTML conversion for Origins using mammoth.js
DOCX to HTML/text conversion for Origins using mammoth.js.
Why
Word documents are everywhere in enterprise workflows, but extracting their content programmatically is challenging. You need to convert them to a usable format while preserving semantic structure.
This package wraps mammoth.js and exposes it as Origins transforms. Convert DOCX files to clean HTML or plain text, with full control over style mapping and conversion options.
Features
- Convert DOCX to semantic HTML
- Convert DOCX to plain text
- Custom style mapping for headings, lists, and more
- Configurable image handling
- Conversion warnings and messages
- Integrates with Origins transform registry
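The conversion warnings and messages listed above can be triaged before the HTML is used. A minimal sketch, assuming each message keeps the shape that mammoth.js documents (`{ type, message }`, with `type` being `"warning"` or `"error"`); whether @origints/mammoth passes messages through unchanged is an assumption, and `groupMessages` is a hypothetical helper, not part of this package:

```javascript
// Hypothetical helper: group conversion messages by severity.
// Assumes messages look like { type: "warning" | "error", message: string }.
function groupMessages(messages) {
  const grouped = { warning: [], error: [], other: [] };
  for (const m of messages) {
    // Anything that is not a warning or an error goes into "other".
    const bucket = m.type === "warning" || m.type === "error" ? m.type : "other";
    grouped[bucket].push(m.message);
  }
  return grouped;
}

// Sample data for illustration only, not real converter output.
const sampleMessages = [
  { type: "warning", message: "Example warning text" },
  { type: "error", message: "Example error text" },
  { type: "warning", message: "Another example warning" },
];

const grouped = groupMessages(sampleMessages);
console.log(`${grouped.warning.length} warning(s), ${grouped.error.length} error(s)`);
```

Grouping this way makes it easy to fail a pipeline on errors while only logging warnings.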
Quick Start
npm install @origints/mammoth @origints/core

import { Planner, loadFile, run, globalRegistry } from "@origints/core";
import { docxToHtml, registerMammothTransforms } from "@origints/mammoth";

// Register the DOCX transforms with the Origins registry.
registerMammothTransforms(globalRegistry);

// Build a plan: load the file, convert it to HTML, and emit the result.
const plan = Planner.in(loadFile("document.docx"))
  .mapIn(docxToHtml())
  .emit((out, $) => out.add("html", $.get("html").asString()))
  .compile();

// Run the plan and print the HTML if it succeeded.
const result = await run(plan, {}, globalRegistry);
if (result.ok) {
  console.log(result.value.html);
}

Expected output:
<h1>Document Title</h1><p>Content here...</p>

Installation
- Supported platforms:
  - macOS / Linux / Windows
- Runtime requirements:
  - Node.js >= 18
- Package managers:
  - npm, pnpm, yarn
- Peer dependencies:
  - @origints/core ^0.1.0
npm install @origints/mammoth @origints/core
# or
pnpm add @origints/mammoth @origints/core

Usage
Basic HTML conversion
import { Planner, loadFile, globalRegistry } from "@origints/core";
import { docxToHtml, registerMammothTransforms } from "@origints/mammoth";

registerMammothTransforms(globalRegistry);

const plan = Planner.in(loadFile("report.docx"))
  .mapIn(docxToHtml())
  .emit((out, $) => {
    out.add("html", $.get("html").asString());
    out.add("messages", $.get("messages").asArray());
  })
  .compile();

Custom style mapping
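Each `styleMap` entry in this section is a string of the form `p[style-name='<Word style>'] => <HTML selector>`. When the mappings live in configuration rather than in code, the strings can be generated. A minimal sketch, where `buildParagraphStyleMap` is a hypothetical helper and not part of @origints/mammoth; it covers paragraph styles only and assumes style names contain no single quotes:

```javascript
// Hypothetical helper: turn { "Word style name": "html selector" } into
// styleMap strings in the format used by the examples in this README.
function buildParagraphStyleMap(mapping) {
  return Object.entries(mapping).map(
    ([styleName, target]) => `p[style-name='${styleName}'] => ${target}`
  );
}

const styleMap = buildParagraphStyleMap({
  Title: "h1.document-title",
  "Heading 1": "h1",
  Quote: "blockquote",
});

// One entry per line, in the order the keys were declared:
//   p[style-name='Title'] => h1.document-title
//   p[style-name='Heading 1'] => h1
//   p[style-name='Quote'] => blockquote
console.log(styleMap.join("\n"));
```

The resulting array can then be passed as `docxToHtml({ styleMap })`.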
const plan = Planner.in(loadFile("document.docx"))
  .mapIn(
    docxToHtml({
      styleMap: [
        "p[style-name='Title'] => h1.document-title",
        "p[style-name='Heading 1'] => h1",
        "p[style-name='Heading 2'] => h2",
        "p[style-name='Quote'] => blockquote",
      ],
    })
  )
  .emit((out, $) => out.add("html", $.get("html").asString()))
  .compile();

Convert to plain text
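The plain-text output of `docxToText` is often split into paragraphs for downstream processing. A minimal sketch, assuming the text follows the mammoth.js raw-text convention of ending each paragraph with two newlines (not confirmed for this package); `splitParagraphs` is a hypothetical helper:

```javascript
// Hypothetical helper: split extracted text into non-empty paragraphs.
// Assumes paragraphs are separated by two or more newlines.
function splitParagraphs(text) {
  return text
    .split(/\n{2,}/)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0);
}

// Sample text for illustration only, not real converter output.
const sampleText = "Document Title\n\nFirst paragraph.\n\nSecond paragraph.\n\n";
const paragraphs = splitParagraphs(sampleText);
// paragraphs is ["Document Title", "First paragraph.", "Second paragraph."]
console.log(paragraphs.length);
```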
import { docxToText } from "@origints/mammoth";

const plan = Planner.in(loadFile("document.docx"))
  .mapIn(docxToText())
  .emit((out, $) => out.add("text", $.get("text").asString()))
  .compile();

Image handling options
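The `imageHandling` option in this section decides whether images survive into the HTML. One rough way to confirm the effect is to count inline images in the output. A minimal sketch, assuming `"base64"` embeds images as `data:` URIs in `<img>` tags (the mammoth.js default behavior) and `"omit"` leaves no such tags; `countInlineImages` is a hypothetical helper:

```javascript
// Hypothetical check: count <img> tags whose src is an inline data URI.
function countInlineImages(html) {
  const matches = html.match(/<img\b[^>]*\bsrc=["']data:[^"']*["']/gi);
  return matches ? matches.length : 0;
}

// Sample HTML for illustration only, not real converter output.
const htmlWithImage = '<p>Logo</p><img src="data:image/png;base64,iVBORw0KGgo=" />';
const htmlWithoutImage = "<p>Logo</p>";

const inlineCount = countInlineImages(htmlWithImage); // 1
const omittedCount = countInlineImages(htmlWithoutImage); // 0
console.log(inlineCount, omittedCount);
```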
const plan = Planner.in(loadFile("document.docx"))
  .mapIn(
    docxToHtml({
      imageHandling: "omit", // or "base64"
    })
  )
  .emit((out, $) => out.add("html", $.get("html").asString()))
  .compile();

Project Status
- Experimental - APIs may change
Non-Goals
- Not a DOCX writer/generator
- Not a full Word document parser (no styles, comments, etc.)
- Not a PDF converter
Documentation
- See @origints/core for Origins concepts
- See mammoth.js for conversion details
Contributing
- Open an issue before large changes
- Keep PRs focused
- Tests required for new features
License
MIT
