@lifeaitools/regen-mde

v0.6.1

Published

18 days ago

Convert between DOCX/PPTX and Markdown — Word OMML equations to KaTeX TeX on the way in, LaTeX to native OMML on the way out.


lifeaiuser

docx markdown word omml katex converter

regen-mde

regen-mde is the Windows editor and conversion suite for Markdown, Word, and PowerPoint files. Its build-corpus CLI converts .docx, .pptx, and .ppt files to Markdown while preserving the pieces that usually break in generic converters:

Word OMML equations as KaTeX-readable TeX
embedded images as local assets, base64 data URIs, or S3/R2-hosted URLs
Markdown tables for simple Word tables
HTML table fallback for complex tables
headings, lists, links, bold, italic, inline code, and code-style paragraphs
PowerPoint slide extraction with slide title detection, table mapping, and repetitive footer suppression

Install

build-corpus is a dual package: the same version ships to both PyPI (Python-native) and npm (Node wrapper). Pick the channel that fits your OS.

| OS | Recommended | Command | What you get |
|----|-------------|---------|--------------|
| Ubuntu / Debian / Linux | PyPI via pipx | pipx install build-corpus | native build-corpus CLI, no Node, isolated from system Python (PEP 668-safe) |
| macOS | PyPI via pipx | pipx install build-corpus | native build-corpus CLI, no Node |
| Windows | npm (full kit) | npm install -g regen-mde | build-corpus CLI plus the regen-mde editor and Explorer right-click menus |

Ubuntu / Debian / Linux

The CLI is pure Python; the editor is Windows-only, so on Linux you install just the converter.

# prerequisites
sudo apt update
sudo apt install -y python3 python3-pip pipx
sudo apt install -y libreoffice        # optional — only needed to convert legacy .ppt

# install the CLI (isolated venv, survives Ubuntu's externally-managed Python)
pipx install build-corpus
build-corpus --help

The npm package (regen-mde) also installs on Linux (npm install -g regen-mde) — it shells out to python3 for conversion. On Ubuntu 24.04+ the postinstall installs the Python dependencies into your user site; if that is blocked it prints the pipx install build-corpus fallback and still completes. The regen-mde editor is not built on Linux.

Windows

Python is the native runtime:

pip install build-corpus

The npm package ships the Windows installer plus the conversion CLI:

npm pack regen-mde

Extract the package and run dist\release\regen-mde-<version>-win-x64-setup.exe for a normal Windows install. The installer creates Start Menu entries for regen-mde and Uninstall regen-mde, registers right-click Explorer verbs for .docx and .md, and removes those entries during uninstall.

The legacy global npm command path is still supported for automation:

npm install -g regen-mde

On Windows, the installer and supported automation paths add right-click Explorer menus for .docx, .pptx, .ppt, .md, and folders:

Life AI -> Open in regen-mde
  opens .md directly and opens .docx by converting it into editable Markdown first
Life AI -> Convert to Markdown
  runs build-corpus "%1" --out-same-dir for .docx, .pptx, and .ppt
  writes .md, assets, and reports beside the source document
Life AI -> Convert to Word
  runs build-corpus "%1" --to word --out-same-dir
  writes .docx and export report beside the source document
Life AI -> Inline Markdown Images
  runs build-corpus "%1" --inline-images
  writes <name>.inline.md with local or HTTP image references embedded as data URIs
Folder: Convert Documents to Markdown
  runs build-corpus "%V" --out-same-dir
  converts all .docx, .pptx, and .ppt files in the selected folder tree

The installer also registers .md under Explorer's New menu so you can create a blank Markdown document directly from New.

Set BUILD_CORPUS_SKIP_WINDOWS_MENU=1 before a global npm install if you do not want the Explorer menu. Set BUILD_CORPUS_SKIP_EDITOR=1 before a global npm install if you want the CLI conversion verbs but not the editor open verbs.

To remove the Windows Explorer menus without uninstalling the package:

build-corpus --uninstall-windows

If you uninstall the global npm package, build-corpus now removes those Explorer menu entries automatically during uninstall.

For a project-local install, use npx:

npm install regen-mde
npx build-corpus --help

On Windows, if build-corpus launches a Python executable and fails with ModuleNotFoundError, a stale pip install is shadowing the npm command. Remove it with:

py -3 -m pip uninstall build-corpus

For S3/R2 image upload support:

pip install "build-corpus[s3]"

Basic Usage

build-corpus input.docx --out out
build-corpus deck.pptx --out out
build-corpus input.md --to word --out out
build-corpus input.md --to word --word-template C:\path\custom.dotx --out out
regen-mde input.md
regen-mdeditor input.md
build-corpus editor input.md
build-corpus editor input.docx

regen-mde

regen-mde is a Windows WebView2 desktop app bundled with the package. It uses the same local Build Corpus conversion engine as the CLI:

Markdown opens directly.
Word and PowerPoint files open by converting into Markdown.
Save writes Markdown.
Save As writes a new Markdown file.
Export DOCX writes Word output through the Markdown-to-Word route.

Build the Windows executable locally:

npm run editor:windows

The executable is written to:

dist\windows-editor\BuildCorpusEditor.exe

Convert every .docx in a folder:

build-corpus ./word-files --out ./markdown

Convert every supported file type in a folder (.docx, .pptx, .ppt):

build-corpus ./source-files --out ./markdown

Convert specific selected files or folders from automation:

build-corpus .\a.docx .\deck.pptx .\folder --out-same-dir

Move successfully processed source .docx, .pptx, and .ppt files into a sources folder beside each file:

build-corpus ./source-files --out-same-dir --move-sources

Write Markdown beside each source document:

build-corpus ./word-files --out-same-dir

Image Modes

Local asset files, the default:

build-corpus input.docx --images assets

Single-file Markdown with base64 image data URIs:

build-corpus input.docx --images base64

Re-merge an existing Markdown file that references local or HTTP-hosted images into a single Markdown file with inline image data:

build-corpus input.md --inline-images
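To make the data-URI idea concrete, here is a minimal Python sketch of inlining local images into Markdown. It is not the build-corpus implementation: the function name inline_local_images and the regex are hypothetical, and this sketch deliberately skips HTTP-hosted images, which the real --inline-images mode is described as handling.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches Markdown image syntax ![alt](target) without a title part.
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def inline_local_images(markdown: str, base_dir: Path) -> str:
    """Replace local image targets with base64 data URIs (illustrative only)."""

    def to_data_uri(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        # Remote and already-inlined images are left untouched in this sketch.
        if target.startswith(("http://", "https://", "data:")):
            return match.group(0)
        path = base_dir / target
        if not path.is_file():
            return match.group(0)  # missing file: keep the original reference
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"![{alt}](data:{mime};base64,{payload})"

    return IMAGE_RE.sub(to_data_uri, markdown)
```

Base64 inflates each image by roughly a third, so single-file output trades size for portability.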

Upload images to S3-compatible storage and write public URLs:

build-corpus input.docx --images s3 --config examples\build-corpus.config.example.json

Cloudflare R2 uses the same s3 mode. Set endpoint_url to:

https://ACCOUNT_ID.r2.cloudflarestorage.com

Config

Copy examples/build-corpus.config.example.json and edit it for your environment.

{
  "conversion": {
    "equations": "tex",
    "images": "s3"
  },
  "output": {
    "out": "out",
    "out_same_dir": false
  },
  "s3": {
    "bucket": "build-corpus-assets",
    "public_base_url": "https://assets.example.com",
    "prefix": "knowledge-base",
    "endpoint_url": "https://ACCOUNT_ID.r2.cloudflarestorage.com",
    "region_name": "auto",
    "access_key_id": "%R2_ACCESS_KEY_ID%",
    "secret_access_key": "%R2_SECRET_ACCESS_KEY%"
  }
}

Build Corpus expands environment variables in JSON string values, so credentials do not need to be committed.
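As an illustration of what such expansion can look like, the following Python sketch resolves %NAME% placeholders in every string value of a parsed JSON config. The helper names expand_env and load_config are hypothetical, and leaving unknown variables untouched is an assumption of this sketch, not documented Build Corpus behavior.

```python
import json
import os
import re

# %NAME% placeholders, in the style used by the example config.
PLACEHOLDER_RE = re.compile(r"%([A-Za-z_][A-Za-z0-9_]*)%")


def expand_env(value, env=os.environ):
    """Recursively expand %NAME% placeholders in JSON string values."""
    if isinstance(value, str):
        # Assumption: unknown variables stay as-is so a missing secret is easy to spot.
        return PLACEHOLDER_RE.sub(lambda m: env.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value  # numbers, booleans, and null pass through unchanged


def load_config(text: str, env=os.environ) -> dict:
    """Parse JSON config text and expand placeholders in its string values."""
    return expand_env(json.loads(text), env)
```

With R2_ACCESS_KEY_ID set in the environment, load_config would return the real key under s3.access_key_id while the file on disk keeps only the placeholder.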

Output Placement

There are two output modes.

Write all converted Markdown into one output tree:

{
  "output": {
    "out": "./markdown",
    "out_same_dir": false
  }
}

Write each .md, asset folder, and report beside the source .docx:

{
  "output": {
    "out_same_dir": true
  }
}

The same-dir mode is equivalent to:

build-corpus ./word-files --out-same-dir

Markdown to Word Templates

Markdown -> Word conversion uses this template precedence:

--word-template <path>
word.template in the JSON config
the bundled installed package template
built-in fallback styles if no template can be found

Template files are treated as style sources. Build Corpus creates a fresh output document body, then applies the template's Word styles, numbering, theme, fonts, and settings. It does not reuse the template body content as the exported document.
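The precedence order can be sketched as a small Python resolver. This is illustrative only: resolve_word_template is a hypothetical name, and falling through to the next level when a path does not exist is an assumption, since the README does not say whether a missing template is an error.

```python
from pathlib import Path
from typing import Optional


def resolve_word_template(
    cli_template: Optional[str],
    config: dict,
    bundled_template: Optional[Path],
) -> Optional[Path]:
    """Return the first usable template, or None for built-in fallback styles."""
    candidates = [
        cli_template,                             # 1. --word-template <path>
        config.get("word", {}).get("template"),  # 2. word.template in the JSON config
        bundled_template,                         # 3. bundled installed package template
    ]
    for candidate in candidates:
        # Assumption: a path that does not exist falls through to the next level.
        if candidate and Path(candidate).is_file():
            return Path(candidate)
    return None  # 4. caller applies built-in fallback styles
```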

Equations

Equations convert to real, editable math in both directions:

DOCX → Markdown — Word OMML equations are converted to KaTeX-readable TeX (via omml2latex). The default mode is parseable TeX:

build-corpus input.docx --equations tex

Equation images are only for visual debugging:

build-corpus input.docx --equations image

Markdown → Word — inline $...$ and display $$...$$ LaTeX are converted to native Office Math (OMML) that Word renders as real equations — not raw text in a math font. The pipeline is latex2mathml → mathml2omml, so commands like \sum, \int, \frac, \Delta, \rightarrow, and \leq render correctly:

build-corpus notes.md --to word --out out

If a fragment cannot be parsed as LaTeX, it falls back to the literal text in Cambria Math and is flagged in the export report's warnings. Fence display equations with $$ on their own lines and no blank lines inside the fence.
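A quick pre-flight check for the fencing rule can be sketched in Python. This is a hypothetical helper, not part of build-corpus; it only treats a line consisting of $$ alone as a fence and reports blank lines found while a fence is open.

```python
def find_broken_display_math(markdown: str) -> list[int]:
    """Return 1-based line numbers of blank lines inside $$ ... $$ fences."""
    problems = []
    inside = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.strip() == "$$":
            inside = not inside      # a lone $$ line opens or closes a fence
        elif inside and not line.strip():
            problems.append(number)  # blank line inside an open fence
    return problems
```

Running it over notes.md before an export would point at the exact lines to fix, rather than discovering a fallback in the export report afterwards.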

Fidelity report (md → word)

Every md→word export writes export-report.json (and a build-corpus-batch-report.json across a batch) so you can confirm nothing was silently dropped or altered. Beyond the raw output stats, the report carries:

fidelity_ok — top-level ship gate. true only when every reconciliation row matches (and zero equations fell back). The batch summary prints all_fidelity_ok plus the list of fidelity_failures.

reconciliation — input vs output per element type:

"reconciliation": {
  "tables":    { "in": 1, "out": 1, "ok": true },
  "equations": { "in": 3, "out_omml": 2, "fell_back": 1, "ok": false },
  "images":    { "in": 2, "out": 0, "failed": 2, "ok": false },
  "code_blocks": { "in": 0, "out": 0, "ok": true },
  "headings":  { "in": 1, "out": 1, "ok": true },
  "links":     { "in": 1, "out": 1, "ok": true }
}

issues — one entry per problem with the source line: { "type", "line", "source"|"target", "reason" }.
text_fixups — markdown escapes the engine resolved on your content, e.g. { "total": 2, "currency_unescaped": 2 }. Escaped currency like \$252.3B is kept as literal text ($252.3B), never mistaken for inline math.
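The currency rule can be illustrated with a short Python sketch that separates escaped dollar signs from inline math. The regex and the function name split_currency_and_math are assumptions made for illustration; the engine's actual parsing rules are not documented here and are likely more thorough.

```python
import re

# An unescaped "$", math content with no "$" or newline, an unescaped closing "$".
INLINE_MATH_RE = re.compile(r"(?<!\\)\$([^$\n]+?)(?<!\\)\$")


def split_currency_and_math(text: str):
    """Return (text with currency unescaped, math fragments, fixup count)."""
    parts = INLINE_MATH_RE.split(text)  # even indexes: prose, odd indexes: math
    pieces, math, fixups = [], [], 0
    for index, part in enumerate(parts):
        if index % 2:
            math.append(part)
            pieces.append(f"${part}$")  # keep the math span in place
        else:
            fixups += part.count("\\$")  # each \$ becomes a literal $
            pieces.append(part.replace("\\$", "$"))
    return "".join(pieces), math, fixups
```

In this sketch the returned count plays the role of currency_unescaped in the report.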

A one-line stdout digest for a quick CLI glance:

[OK] tables 1/1  [!!] equations 2/3 (1 fell back)  [!!] images 0/2 (2 failed)  …  -> fidelity_ok=false
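For automation, the same gate can be read from the JSON report. The Python sketch below uses only the field names shown above (fidelity_ok, reconciliation, ok); the helper names failing_rows and check_report are hypothetical, and treating a missing field as a failure is a design choice of this sketch.

```python
import json
from pathlib import Path


def failing_rows(report: dict) -> list[str]:
    """Names of reconciliation rows whose "ok" flag is false or missing."""
    rows = report.get("reconciliation", {})
    return [name for name, row in rows.items() if not row.get("ok", False)]


def check_report(path: Path) -> bool:
    """Print failing rows and return the report's fidelity_ok value."""
    report = json.loads(path.read_text(encoding="utf-8"))
    for name in failing_rows(report):
        print(f"[!!] {name}: {report['reconciliation'][name]}")
    return bool(report.get("fidelity_ok", False))
```

A CI step could call check_report(Path("out/export-report.json")) and fail the build when it returns False; the path here is only an example.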

Image failures carry a specific reason so you know how to react:

| reason | meaning | fix |
|--------|---------|-----|
| missing-file | target path not found | correct the path |
| unsupported-on-platform | EMF/WMF that needs metafile→PNG conversion | install LibreOffice / run on Windows |
| unsupported-format | .html/.jsx/.svg etc.; cannot be embedded | pre-render to PNG via a render pipeline |
| skipped-remote | http(s)/data: target | localize the asset first |

build-corpus does not rasterize HTML/JSX — that belongs to a separate render step (e.g. a headless-browser screenshot). It flags them and moves on.

PowerPoint Notes

.pptx is processed directly.
.ppt is converted to .pptx first using LibreOffice (soffice --headless --convert-to pptx).
Repeated boilerplate blocks that appear on most slides are removed from the emitted Markdown.
Slide images are exported from the original package binaries (ppt/media/*), not screen-captured display rasters.
Markdown output uses size-aware HTML image tags (<img ... width= height=>) based on OOXML display extents (a:xfrm/a:ext).
The export report includes low_dpi_images to flag images whose effective on-slide DPI is under 150.
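Effective DPI follows from the image's pixel size and its displayed size. OOXML extents are in English Metric Units (914,400 per inch), so a rough Python sketch looks like the following. The function names are hypothetical, and taking the lower of the horizontal and vertical values is an assumption; the README does not say how build-corpus combines the two axes.

```python
EMU_PER_INCH = 914_400  # OOXML measures a:ext cx/cy in English Metric Units


def effective_dpi(pixel_width: int, pixel_height: int, cx_emu: int, cy_emu: int) -> float:
    """Lower of the horizontal and vertical DPI at the displayed size."""
    horizontal = pixel_width / (cx_emu / EMU_PER_INCH)
    vertical = pixel_height / (cy_emu / EMU_PER_INCH)
    return min(horizontal, vertical)


def is_low_dpi(pixel_width: int, pixel_height: int, cx_emu: int, cy_emu: int,
               threshold: float = 150.0) -> bool:
    """True when the image would be flagged under the 150 DPI threshold."""
    return effective_dpi(pixel_width, pixel_height, cx_emu, cy_emu) < threshold
```

For example, a 600 x 300 pixel image shown at 4 x 2 inches (3,657,600 x 1,828,800 EMU) works out to 150 DPI and is not flagged, while a 300 x 150 pixel image at the same size comes to 75 DPI and is.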

Validation

The package includes a KaTeX validator for emitted Markdown math:

build-corpus-katex out

Repeatable Test Wrappers

Run a single known DOCX through conversion plus validators:

.\scripts\run-smoke.ps1 -Docx ".\fixtures\sample.docx" -Out ".tmp\smoke" -Images assets

Run a whole folder corpus:

.\scripts\run-corpus.ps1 -Source ".\fixtures\wordtest" -Out ".tmp\wordtest" -Images base64

Build a public online DOCX corpus for regression testing:

python .\tools\collect_online_docx_corpus.py --out ".tmp\online-docx\source-docx" --target 50
.\scripts\run-corpus.ps1 -Source ".tmp\online-docx\source-docx" -Out ".tmp\online-docx\markdown"

Build a public online PPTX corpus and compare input/output extraction:

python .\tools\collect_online_pptx_corpus.py --out ".tmp\online-pptx\source-pptx" --target 20
.\scripts\run-corpus.ps1 -Source ".tmp\online-pptx\source-pptx" -Out ".tmp\online-pptx\markdown"
python .\tools\compare_pptx_inputs_outputs.py --manifest ".tmp\online-pptx\source-pptx\online-pptx-manifest.json" --out ".tmp\online-pptx\markdown" --report ".tmp\online-pptx\markdown\pptx-io-compare.json"

Failed Documents

If a document does not convert correctly, open an issue with:

the .docx file if it is safe to share
the generated .md
the export-report.json
the command and config used
a screenshot of the expected Word output if layout is the issue

For confidential files, strip or replace sensitive content before sharing. The useful part is the broken DOCX structure, not the private text.
