@origints/mammoth
v0.3.2
DOCX to HTML conversion for Origins using mammoth.js
@origints/mammoth
DOCX to HTML/text conversion for Origins using mammoth.js.
Features
- Convert DOCX to semantic HTML
- Convert DOCX to plain text
- Custom style mapping for headings, lists, and more
- Configurable image handling
- Conversion warnings and messages
- Integrates with Origins transform registry
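To make the style-mapping feature more concrete: mammoth.js style maps are lists of strings such as `p[style-name='Heading 1'] => h1`. The sketch below builds such a list from a plain object. `buildStyleMap` is a hypothetical helper, not something this package exports; it covers paragraph styles only and assumes style names contain no single quotes.

```javascript
// Hypothetical helper (not part of @origints/mammoth): turns a
// { 'Word style name': 'html selector' } object into mammoth.js
// style-map strings for paragraph styles.
function buildStyleMap(paragraphStyles) {
  return Object.entries(paragraphStyles).map(
    ([styleName, selector]) => `p[style-name='${styleName}'] => ${selector}`
  )
}

const styleMap = buildStyleMap({
  Title: 'h1.document-title',
  'Heading 1': 'h1',
  Quote: 'blockquote',
})

console.log(styleMap)
```

The resulting array has the same shape as the `styleMap` option used in the custom style mapping example, so it could presumably be passed to `docxToHtml` in its place.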
Installation
npm install @origints/mammoth @origints/core

Usage with Planner
Convert a DOCX file and extract the HTML
import { Planner, loadFile, run } from '@origints/core'
import { docxToHtml } from '@origints/mammoth'

const plan = new Planner()
  .in(loadFile('document.docx'))
  .mapIn(docxToHtml())
  .emit((out, $) => out.add('html', $.get('html').string()))
  .compile()

// readFile and registry are not defined in this snippet; supply your own
const result = await run(plan, { readFile, registry })
// result.value: { html: '<h1>Title</h1><p>Content...</p>' }

Convert with custom style mapping
import { Planner, loadFile } from '@origints/core'
import { docxToHtml } from '@origints/mammoth'

const plan = new Planner()
  .in(loadFile('report.docx'))
  .mapIn(
    docxToHtml({
      // Map Word paragraph styles to HTML elements
      styleMap: [
        "p[style-name='Title'] => h1.document-title",
        "p[style-name='Heading 1'] => h1",
        "p[style-name='Heading 2'] => h2",
        "p[style-name='Quote'] => blockquote",
      ],
      idPrefix: 'doc-',
    })
  )
  .emit((out, $) => out.add('content', $.get('html').string()))
  .compile()

Extract plain text from a DOCX file
import { Planner, loadFile, run } from '@origints/core'
import { docxToText } from '@origints/mammoth'

const plan = new Planner()
  .in(loadFile('document.docx'))
  .mapIn(docxToText())
  .emit((out, $) => out.add('text', $.get('text').string()))
  .compile()

// readFile and registry are not defined in this snippet; supply your own
const result = await run(plan, { readFile, registry })
// result.value: { text: 'Document Title\nContent here...' }

Combine DOCX with other sources
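import { Planner, loadFile } from '@origints/core'
import { docxToHtml } from '@origints/mammoth'
// parseJson is a JSON-parsing transform; its import is not shown in this example

const plan = new Planner()
  // First source: the DOCX report, converted to HTML
  .in(loadFile('report.docx'))
  .mapIn(docxToHtml())
  .emit((out, $) => out.add('reportHtml', $.get('html').string()))
  // Second source: JSON metadata
  .in(loadFile('metadata.json'))
  .mapIn(parseJson())
  .emit((out, $) =>
    out
      .add('author', $.get('author').string())
      .add('date', $.get('date').string())
  )
  .compile()

Standalone usage (without Planner)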
import * as fs from 'fs'
import { docxToHtmlImpl, docxToTextImpl } from '@origints/mammoth'

const buffer = fs.readFileSync('document.docx')

// Convert to HTML
const htmlResult = await docxToHtmlImpl.execute(buffer)
console.log(htmlResult.html)

// Log conversion warnings
for (const msg of htmlResult.messages) {
  console.warn(msg.message)
}

// Convert to plain text
const textResult = await docxToTextImpl.execute(buffer)
console.log(textResult.text)

Image handling
import { Planner, loadFile } from '@origints/core'
import { docxToHtml } from '@origints/mammoth'

// Omit images
const plan = new Planner()
  .in(loadFile('document.docx'))
  .mapIn(docxToHtml({ imageHandling: 'omit' }))
  .emit((out, $) => out.add('html', $.get('html').string()))
  .compile()

API
| Export | Description |
| ------------------------------------- | -------------------------------------------------- |
| docxToHtml(options?) | Create a transform AST for HTML conversion |
| docxToText(options?) | Create a transform AST for text conversion |
| docxToHtmlImpl | Async transform implementation for HTML conversion |
| docxToTextImpl | Async transform implementation for text conversion |
| registerMammothTransforms(registry) | Register all mammoth transforms with a registry |
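
The standalone example logs every entry in `messages` the same way. In mammoth.js itself, each message carries a `type` (such as `"warning"` or `"error"`) alongside `message`; assuming `docxToHtmlImpl` passes those objects through unchanged, which this README does not confirm, they could be grouped as in the sketch below. `groupMessagesByType` is a hypothetical helper and the sample messages are invented for illustration.

```javascript
// Invented sample data shaped like mammoth.js messages ({ type, message }).
// Assumption: @origints/mammoth exposes the same shape on result.messages.
const sampleMessages = [
  { type: 'warning', message: 'Unrecognised paragraph style: Fancy' },
  { type: 'error', message: 'Could not read image' },
  { type: 'warning', message: 'Unrecognised run style: Emphasis2' },
]

// Hypothetical helper: collects message texts under their `type`.
function groupMessagesByType(messages) {
  const groups = {}
  for (const msg of messages) {
    if (!groups[msg.type]) groups[msg.type] = []
    groups[msg.type].push(msg.message)
  }
  return groups
}

const grouped = groupMessagesByType(sampleMessages)
console.log(grouped)
```

Grouping this way makes it straightforward to treat errors as failures while only logging warnings.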
License
MIT
