build-corpus

v0.2.1

Convert DOCX to Markdown with tables, images, and KaTeX-readable Word equations.

Build Corpus

Build Corpus converts .docx, .pptx, and .ppt files to Markdown while preserving the pieces that usually break in generic converters:

  • Word OMML equations as KaTeX-readable TeX
  • embedded images as local assets, base64 data URIs, or S3/R2-hosted URLs
  • Markdown tables for simple Word tables
  • HTML table fallback for complex tables
  • headings, lists, links, bold, italic, inline code, and code-style paragraphs
  • PowerPoint slide extraction with slide title detection, table mapping, and repetitive footer suppression

Install

Python is the native runtime:

pip install build-corpus

The npm package is a convenience wrapper around the Python CLI:

npm install -g build-corpus

On Windows, a global npm install also adds right-click Explorer menus for .docx and .md files under Life AI:

  • Life AI -> Open in Regen.MDeditor
      • runs build-corpus-editor "%1"
      • opens .md directly and opens .docx by converting it into editable Markdown first
  • Life AI -> Convert to Markdown
      • runs build-corpus "%1" --out-same-dir
      • writes .md, assets, and reports beside the source document
  • Life AI -> Convert to Word
      • runs build-corpus "%1" --to word --out-same-dir
      • writes .docx and export report beside the source document

Set BUILD_CORPUS_SKIP_WINDOWS_MENU=1 before install if you do not want the Explorer menu. Set BUILD_CORPUS_SKIP_EDITOR=1 before install if you want the CLI conversion verbs but not the editor open verbs. Run regen-mdeditor-uninstall before npm uninstall -g build-corpus to remove Windows Explorer verbs on npm versions that no longer run uninstall lifecycle hooks.

To remove the Windows Explorer menus without uninstalling the package:

build-corpus --uninstall-windows

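If you uninstall the global npm package with an npm version that still runs uninstall lifecycle hooks, build-corpus removes those Explorer menu entries automatically during uninstall. On npm versions that no longer run those hooks, remove the menus manually before uninstalling.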

For a project-local install, use npx:

npm install build-corpus
npx build-corpus --help

On Windows, if build-corpus launches a Python executable and fails with ModuleNotFoundError, a stale pip install is shadowing the npm command. Remove it with:

py -3 -m pip uninstall build-corpus

For S3/R2 image upload support:

pip install "build-corpus[s3]"

Basic Usage

build-corpus input.docx --out out
build-corpus deck.pptx --out out
build-corpus input.md --to word --out out
build-corpus input.md --to word --word-template C:\path\custom.dotx --out out
build-corpus editor input.md
build-corpus editor input.docx

Regen.MDeditor

Regen.MDeditor is a Windows WebView2 desktop app bundled with the package. It uses the same local Build Corpus conversion engine as the CLI:

  • Markdown opens directly.
  • Word and PowerPoint files open by converting into Markdown.
  • Save writes Markdown.
  • Save As writes a new Markdown file.
  • Export DOCX writes Word output through the Markdown-to-Word route.

Build the Windows executable locally:

npm run editor:windows

The executable is written to:

dist\windows-editor\BuildCorpusEditor.exe

Convert every .docx in a folder:

build-corpus ./word-files --out ./markdown

Convert every supported file type in a folder (.docx, .pptx, .ppt):

build-corpus ./source-files --out ./markdown

Write Markdown beside each source document:

build-corpus ./word-files --out-same-dir
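
If you drive folder conversions from a script, the two invocations above can be wrapped in a small helper. This is a minimal sketch, not part of the package: build_command and convert are hypothetical names, and the helper only uses the --out and --out-same-dir flags shown in this README. Rejecting both options at once is a choice made by this sketch, not documented CLI behavior.

```python
import subprocess
from typing import List, Optional


def build_command(
    source: str,
    out: Optional[str] = None,
    same_dir: bool = False,
) -> List[str]:
    """Assemble a build-corpus argument list (hypothetical helper)."""
    # Sketch-level rule: pick one output mode per run.
    if same_dir and out is not None:
        raise ValueError("choose either out or same_dir, not both")
    cmd = ["build-corpus", source]
    if same_dir:
        cmd.append("--out-same-dir")
    elif out is not None:
        cmd += ["--out", out]
    return cmd


def convert(source: str, out: Optional[str] = None, same_dir: bool = False) -> None:
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(source, out=out, same_dir=same_dir), check=True)
```

Calling convert("./word-files", out="./markdown") runs the same command as the folder example above, assuming build-corpus is on PATH.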

Image Modes

Local asset files, the default:

build-corpus input.docx --images assets

Single-file Markdown with base64 image data URIs:

build-corpus input.docx --images base64

Upload images to S3-compatible storage and write public URLs:

build-corpus input.docx --images s3 --config examples\build-corpus.config.example.json

Cloudflare R2 uses the same s3 mode. Set endpoint_url to:

https://ACCOUNT_ID.r2.cloudflarestorage.com

Config

Copy examples/build-corpus.config.example.json and edit it for your environment.

{
  "conversion": {
    "equations": "tex",
    "images": "s3"
  },
  "output": {
    "out": "out",
    "out_same_dir": false
  },
  "s3": {
    "bucket": "build-corpus-assets",
    "public_base_url": "https://assets.example.com",
    "prefix": "knowledge-base",
    "endpoint_url": "https://ACCOUNT_ID.r2.cloudflarestorage.com",
    "region_name": "auto",
    "access_key_id": "%R2_ACCESS_KEY_ID%",
    "secret_access_key": "%R2_SECRET_ACCESS_KEY%"
  }
}

Build Corpus expands environment variables in JSON string values, so credentials do not need to be committed.
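
To illustrate what that expansion means for the %NAME% placeholders in the example config, here is a minimal sketch. It is not Build Corpus's own implementation, which this README does not show. The names expand_env and load_config are hypothetical, and leaving unknown placeholders untouched is an assumption of the sketch.

```python
import json
import os
import re
from typing import Any

# Matches Windows-style placeholders such as %R2_ACCESS_KEY_ID%.
_PERCENT_VAR = re.compile(r"%([A-Za-z_][A-Za-z0-9_]*)%")


def expand_env(value: Any) -> Any:
    """Recursively expand %NAME% placeholders in JSON string values."""
    if isinstance(value, str):
        # Unknown variables are left as-is so a missing one is easy to spot.
        return _PERCENT_VAR.sub(
            lambda m: os.environ.get(m.group(1), m.group(0)), value
        )
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    return value  # numbers, booleans, and null pass through unchanged


def load_config(text: str) -> dict:
    """Parse JSON config text and expand placeholders in its strings."""
    return expand_env(json.loads(text))
```

With this approach the committed config file carries only placeholder names, and the real keys live in the environment of the machine that runs the conversion.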

Output Placement

There are two output modes.

Write all converted Markdown into one output tree:

{
  "output": {
    "out": "./markdown",
    "out_same_dir": false
  }
}

Write each .md, asset folder, and report beside the source .docx:

{
  "output": {
    "out_same_dir": true
  }
}

The same-dir mode is equivalent to:

build-corpus ./word-files --out-same-dir
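
The difference between the two modes can be sketched as a path calculation. This is an illustration only: markdown_path is a hypothetical function, and mirroring the source subfolder layout under the output tree is an assumption, since this README does not describe how the tree is organized.

```python
from pathlib import Path
from typing import Optional


def markdown_path(
    source: Path,
    source_root: Path,
    out: Optional[Path] = None,
    out_same_dir: bool = False,
) -> Path:
    """Return where the .md for one source document would be written."""
    if out_same_dir:
        # Same-dir mode: the Markdown file sits beside the source document.
        return source.with_suffix(".md")
    if out is None:
        raise ValueError("an output folder is required unless out_same_dir is set")
    # Output-tree mode (assumed layout): keep the path relative to the source root.
    return out / source.relative_to(source_root).with_suffix(".md")
```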

Markdown to Word Templates

Markdown -> Word conversion uses this template precedence:

  1. --word-template <path>
  2. word.template in the JSON config
  3. the bundled installed package template
  4. built-in fallback styles if no template can be found

Template files are treated as style sources. Build Corpus creates a fresh output document body, then applies the template's Word styles, numbering, theme, fonts, and settings. It does not reuse the template body content as the exported document.
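
The precedence list above can be sketched as a first-match lookup. This is a hedged illustration, not the package's code: resolve_word_template and its parameters are hypothetical names. Falling through to the next candidate when a given path does not exist is an assumption. The real CLI may instead report an error for a missing --word-template path.

```python
from pathlib import Path
from typing import Optional


def resolve_word_template(
    cli_template: Optional[str],
    config: dict,
    bundled_template: Optional[Path],
) -> Optional[Path]:
    """Return the first template that exists, or None for built-in styles."""
    candidates = [
        cli_template,                            # 1. --word-template <path>
        config.get("word", {}).get("template"),  # 2. word.template in the JSON config
        bundled_template,                        # 3. bundled package template
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_file():
            return Path(candidate)
    return None  # 4. caller applies built-in fallback styles
```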

Equations

The default equation mode is parseable TeX:

build-corpus input.docx --equations tex

Equation images are only for visual debugging:

build-corpus input.docx --equations image

PowerPoint Notes

  • .pptx is processed directly.
  • .ppt is converted to .pptx first using LibreOffice (soffice --headless --convert-to pptx).
  • Repeated boilerplate blocks that appear on most slides are removed from the emitted Markdown.
  • Slide images are exported from the original package binaries (ppt/media/*), not screen-captured display rasters.
  • Markdown output uses size-aware HTML image tags (<img ... width= height=>) based on OOXML display extents (a:xfrm/a:ext).
  • The export report includes low_dpi_images to flag images whose effective on-slide DPI is under 150.
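
For readers checking the image numbers by hand, the sizing and DPI notes above can be sketched as follows. OOXML extents are measured in EMU, with 914400 EMU per inch. Everything else here is an assumption of the sketch rather than documented behavior: the function names, the use of 96 pixels per inch for the width and height attributes, and taking the lower of the horizontal and vertical DPI.

```python
from html import escape
from typing import Tuple

EMU_PER_INCH = 914_400  # OOXML English Metric Units per inch
PX_PER_INCH = 96        # assumed density for HTML width/height attributes


def display_size_px(cx_emu: int, cy_emu: int) -> Tuple[int, int]:
    """Convert an a:ext extent (cx, cy) from EMU to whole pixels."""
    return (
        round(cx_emu * PX_PER_INCH / EMU_PER_INCH),
        round(cy_emu * PX_PER_INCH / EMU_PER_INCH),
    )


def effective_dpi(pixel_width: int, pixel_height: int, cx_emu: int, cy_emu: int) -> float:
    """Image pixels per displayed inch, taking the lower of the two axes."""
    if cx_emu <= 0 or cy_emu <= 0:
        raise ValueError("display extents must be positive")
    return min(
        pixel_width / (cx_emu / EMU_PER_INCH),
        pixel_height / (cy_emu / EMU_PER_INCH),
    )


def is_low_dpi(pixel_width: int, pixel_height: int, cx_emu: int, cy_emu: int,
               threshold: float = 150.0) -> bool:
    """True when the effective on-slide DPI is under the threshold."""
    return effective_dpi(pixel_width, pixel_height, cx_emu, cy_emu) < threshold


def img_tag(src: str, cx_emu: int, cy_emu: int) -> str:
    """Build a size-aware HTML image tag from the display extents."""
    width, height = display_size_px(cx_emu, cy_emu)
    return f'<img src="{escape(src, quote=True)}" width="{width}" height="{height}">'
```

As a worked case, a 200 x 100 pixel image shown at 2 x 1 inches (1828800 x 914400 EMU) has an effective DPI of 100 and would be flagged, while a 300 x 150 pixel image at the same size sits exactly at 150 and would not.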

Validation

The package includes a KaTeX validator for emitted Markdown math:

build-corpus-katex out

Repeatable Test Wrappers

Run a single known DOCX through conversion plus validators:

.\scripts\run-smoke.ps1 -Docx ".\fixtures\sample.docx" -Out ".tmp\smoke" -Images assets

Run a whole folder corpus:

.\scripts\run-corpus.ps1 -Source ".\fixtures\wordtest" -Out ".tmp\wordtest" -Images base64

Build a public online DOCX corpus for regression testing:

python .\tools\collect_online_docx_corpus.py --out ".tmp\online-docx\source-docx" --target 50
.\scripts\run-corpus.ps1 -Source ".tmp\online-docx\source-docx" -Out ".tmp\online-docx\markdown"

Build a public online PPTX corpus and compare input/output extraction:

python .\tools\collect_online_pptx_corpus.py --out ".tmp\online-pptx\source-pptx" --target 20
.\scripts\run-corpus.ps1 -Source ".tmp\online-pptx\source-pptx" -Out ".tmp\online-pptx\markdown"
python .\tools\compare_pptx_inputs_outputs.py --manifest ".tmp\online-pptx\source-pptx\online-pptx-manifest.json" --out ".tmp\online-pptx\markdown" --report ".tmp\online-pptx\markdown\pptx-io-compare.json"

Failed Documents

If a document does not convert correctly, open an issue with:

  • the .docx file if it is safe to share
  • the generated .md
  • the export-report.json
  • the command and config used
  • a screenshot of the expected Word output if layout is the issue

For confidential files, strip or replace sensitive content before sharing. The useful part is the broken DOCX structure, not the private text.