mnhm-data-extractor

v2.0.0

Word Document Data Extractor

This project extracts structured data and embedded images from .docx records used by the Diekirch Military Museum workflow. It supports both the original interactive walkthrough and a non-interactive batch mode designed for larger imports.

Requirements

  • Node.js
  • npm

Install dependencies with:

npm install

Directory Layout

Default input and output paths live under Files/:

Files/
  inputs/                 Source .docx files
  outputs/
    txt/                  Extracted text files
    json/                 Parsed JSON records
    images/               Extracted images grouped per record
    all/                  Consolidated per-record folders
    review/               Per-record review files for anomalies/failures
    reports/              Run report and manifest

Usage

Interactive mode:

npm run start

Local npx usage from this repository:

npx . --non-interactive --clean

Published-package usage:

npx mnhm-data-extractor --non-interactive --clean

Batch mode:

node cli.js --non-interactive --clean

Batch mode also activates automatically if you pass any batch flag such as --input-dir or --resume.

To make the tool available through npx, publish the package to npm with the bin entry in package.json intact. The executable name is mnhm-data-extractor.

Batch Flags

--input-dir <path>      Override the default input directory
--output-dir <path>     Override the default output directory
--clean                 Remove the output directory before processing
--report <path>         Write the run report to a custom path
--manifest <path>       Write the manifest to a custom path
--limit <n>             Process only the first n valid .docx files
--resume                Reuse the manifest and skip unchanged successful files
--concurrency <n>       Process up to n files in parallel
--log-level <level>     debug, info, warn, or error
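
The flags above can be parsed with Node's built-in `util.parseArgs`. The following is a minimal sketch, not the project's actual parser: the flag names come from the list above, while `parseBatchArgs`, the returned shape, and the default concurrency of 1 are assumptions made for illustration.

```javascript
// Sketch only: flag names are from the README; the function name,
// return shape, and defaults are assumed for illustration.
const { parseArgs } = require('node:util');

const BATCH_OPTIONS = {
  'non-interactive': { type: 'boolean' },
  'input-dir': { type: 'string' },
  'output-dir': { type: 'string' },
  clean: { type: 'boolean' },
  report: { type: 'string' },
  manifest: { type: 'string' },
  limit: { type: 'string' },
  resume: { type: 'boolean' },
  concurrency: { type: 'string' },
  'log-level': { type: 'string' },
};

function parseBatchArgs(argv) {
  // strict mode makes parseArgs throw on unknown flags.
  const { values } = parseArgs({ args: argv, options: BATCH_OPTIONS, strict: true });

  // Passing any batch flag switches to batch mode (see Usage).
  const batchMode = Object.keys(values).length > 0;

  return {
    ...values,
    batchMode,
    // Numeric flags arrive as strings and are converted here.
    limit: values.limit === undefined ? undefined : Number.parseInt(values.limit, 10),
    // Assumed default: one file at a time when --concurrency is omitted.
    concurrency: values.concurrency === undefined ? 1 : Number.parseInt(values.concurrency, 10),
  };
}
```

Calling `parseBatchArgs(process.argv.slice(2))` would then yield an options object whose `batchMode` property tells the caller whether to skip the interactive walkthrough.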

Command Prompt (cmd.exe) example:

node cli.js ^
  --input-dir ".\Files\inputs" ^
  --output-dir ".\Files\outputs" ^
  --clean ^
  --concurrency 4 ^
  --resume

PowerShell example:

node .\cli.js `
  --input-dir .\Files\inputs `
  --output-dir .\Files\outputs `
  --clean `
  --concurrency 4 `
  --resume

Batch Behavior

  • Input validation rejects non-files, empty files, and unsupported extensions before parsing.
  • One bad file does not stop the run. Failures are recorded and the batch continues.
  • A run report is written to outputs/reports/run-report.json by default.
  • A manifest is written to outputs/reports/run-manifest.json by default.
  • Review files are written to outputs/review/ when a record fails or needs manual inspection.
  • Re-running the batch with the same inputs skips unchanged successful files when the manifest and expected outputs are still present.
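
The validation and failure-isolation rules above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: `validateInputFile`, `runBatch`, the reason strings, and the result shape are all hypothetical names.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: returns { ok, reason } instead of throwing, so a
// bad file can be recorded while the rest of the batch continues.
function validateInputFile(filePath) {
  let stats;
  try {
    stats = fs.statSync(filePath);
  } catch (error) {
    return { ok: false, reason: 'missing_or_unreadable' };
  }
  if (!stats.isFile()) return { ok: false, reason: 'not_a_file' };
  if (stats.size === 0) return { ok: false, reason: 'empty_file' };
  if (path.extname(filePath).toLowerCase() !== '.docx') {
    return { ok: false, reason: 'unsupported_extension' };
  }
  return { ok: true, reason: null };
}

// Runs processFile on every valid path. Rejections and thrown errors are
// collected in the result list rather than aborting the run.
function runBatch(filePaths, processFile) {
  const results = [];
  for (const filePath of filePaths) {
    const check = validateInputFile(filePath);
    if (!check.ok) {
      results.push({ filePath, status: 'rejected', reason: check.reason });
      continue;
    }
    try {
      processFile(filePath);
      results.push({ filePath, status: 'success', reason: null });
    } catch (error) {
      results.push({ filePath, status: 'failed', reason: error.message });
    }
  }
  return results;
}
```

Returning a result object per file, instead of throwing, is what lets a single corrupt document end up in the report without interrupting the remaining files.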

Output Schema

Each JSON file contains:

  • schemaVersion
  • person
  • source
  • fields
  • other
  • parsing
  • review
  • metadata

High-level example:

{
  "schemaVersion": "3.0.0",
  "person": {
    "id": "rec_1234567890ab",
    "name": "Example Person"
  },
  "source": {
    "documentFileName": "example person.docx",
    "id": "src_1234567890ab",
    "textFileName": "example-person.txt"
  },
  "fields": {
    "wehrmacht_service": {
      "label": "Wehrmacht",
      "provenance": [
        {
          "normalizedValue": "Main field content",
          "rawValue": "Main field content",
          "reviewStatus": "auto_accepted",
          "sourceDocument": "example person.docx",
          "sourceFieldLabel": "Wehrmacht",
          "sourceTextFile": "example-person.txt",
          "sourceType": "field"
        }
      ],
      "value": "Main field content",
      "subfields": {
        "deserted": {
          "label": "Desertéiert",
          "provenance": [],
          "value": "Subfield content"
        }
      }
    }
  },
  "other": "Unmapped trailing content",
  "parsing": {
    "anomalies": [],
    "missingRequiredFields": [],
    "unknownFields": []
  },
  "review": {
    "reasons": [],
    "status": "auto_accepted"
  },
  "metadata": {
    "outputFileName": "example-person.json",
    "recordId": "rec_1234567890ab",
    "sourceDocxFile": "example person.docx",
    "sourceId": "src_1234567890ab",
    "sourceTextFile": "example-person.txt"
  }
}
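
A consumer of these records might walk the `fields` object like this. The sketch assumes only the shape shown in the example above. The helper names are hypothetical, and treating every status other than `auto_accepted` as needing review is an assumption, since the other possible status values are not listed here.

```javascript
// Flatten a record's fields and subfields into "key: value" lines.
// Assumes the shape from the example above; subfields may be absent.
function listFieldValues(record) {
  const lines = [];
  for (const [key, field] of Object.entries(record.fields ?? {})) {
    lines.push(`${key}: ${field.value}`);
    for (const [subKey, subfield] of Object.entries(field.subfields ?? {})) {
      lines.push(`${key}.${subKey}: ${subfield.value}`);
    }
  }
  return lines;
}

// Assumption: any status other than "auto_accepted" means a person
// should look at the record before it is used.
function needsReview(record) {
  return record.review.status !== 'auto_accepted';
}
```

A record loaded with `JSON.parse(fs.readFileSync(file, 'utf8'))` can be passed to both helpers directly.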

Run Report

The run report contains:

  • batch options
  • per-file status
  • success and failure counts
  • skipped count
  • review-required count
  • unknown field and image failure totals
  • benchmark data such as elapsed time, average file time, throughput, concurrency, and total input bytes
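
To make the benchmark figures concrete, here is one way such a summary could be derived from per-file results. The input shape (`status`, `bytes`, `elapsedMs`) and the output property names are hypothetical and may not match the keys in the actual run-report.json.

```javascript
// Hypothetical per-file result shape: { status, bytes, elapsedMs }.
// elapsedMs (second argument) is the wall-clock time of the whole run.
function summarizeRun(results, elapsedMs, concurrency) {
  const count = (status) => results.filter((r) => r.status === status).length;
  // Skipped files did no work, so they are excluded from the timing averages.
  const processed = results.filter((r) => r.status !== 'skipped');
  const totalInputBytes = results.reduce((sum, r) => sum + r.bytes, 0);
  const totalFileMs = processed.reduce((sum, r) => sum + r.elapsedMs, 0);

  return {
    successCount: count('success'),
    failureCount: count('failed'),
    skippedCount: count('skipped'),
    benchmark: {
      elapsedMs,
      averageFileMs: processed.length > 0 ? totalFileMs / processed.length : 0,
      filesPerSecond: elapsedMs > 0 ? processed.length / (elapsedMs / 1000) : 0,
      concurrency,
      totalInputBytes,
    },
  };
}
```

With concurrency above 1, the sum of per-file times can exceed the wall-clock time, which is why the average file time and the throughput are computed from different inputs.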

Recommended Workflow

  1. Place source .docx files in the input directory.
  2. Run a clean batch for the first import.
  3. Review outputs/reports/run-report.json.
  4. Inspect any files in outputs/review/.
  5. Re-run with --resume for subsequent passes.
  6. Consume the JSON files from outputs/json/ or the consolidated folders in outputs/all/.
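
Step 5 depends on deciding whether a file can be skipped. The sketch below shows one plausible check, assuming that "unchanged" is detected with a content hash. How change detection really works is not stated here, and the manifest shape (`status`, `sha256`, `outputs`) is hypothetical.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');

// SHA-256 of the file contents, as a hex string.
function hashFile(filePath) {
  return crypto.createHash('sha256').update(fs.readFileSync(filePath)).digest('hex');
}

// manifest: hypothetical map of input path -> { status, sha256, outputs }.
// A file is skipped only if it previously succeeded, its content hash is
// unchanged, and every expected output file still exists.
function shouldSkip(filePath, manifest) {
  const entry = manifest[filePath];
  if (!entry || entry.status !== 'success') return false;
  if (entry.sha256 !== hashFile(filePath)) return false;
  return entry.outputs.every((outputPath) => fs.existsSync(outputPath));
}
```

Checking the outputs as well as the hash mirrors the rule that a file is skipped only when the manifest and the expected outputs are both still present.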

Development

Lint the project:

npm run lint

Run automated tests:

npm test

The test suite includes parser behavior checks and snapshot-style batch output verification using the sample .docx fixtures already in the repository.

License

MIT. See LICENSE.