`mnhm-data-extractor` v2.0.0
# Word Document Data Extractor
This project extracts structured data and embedded images from .docx records used by the Diekirch Military Museum workflow. It supports both the original interactive walkthrough and a non-interactive batch mode designed for larger imports.
## Requirements
- Node.js
- npm
Install dependencies with:

```bash
npm install
```

## Directory Layout
Default input and output paths live under `Files/`:

```
Files/
  inputs/     Source .docx files
  outputs/
    txt/      Extracted text files
    json/     Parsed JSON records
    images/   Extracted images grouped per record
    all/      Consolidated per-record folders
    review/   Per-record review files for anomalies/failures
    reports/  Run report and manifest
```

## Usage
Interactive mode:

```bash
npm run start
```

Local npx usage from this repository:

```bash
npx . --non-interactive --clean
```

Published-package usage:

```bash
npx mnhm-data-extractor --non-interactive --clean
```

Batch mode:

```bash
node cli.js --non-interactive --clean
```

Batch mode also activates automatically if you pass any batch flag such as `--input-dir` or `--resume`.
To publish this as an npx tool, publish the package to npm with the `bin` entry intact. The executable name will be `mnhm-data-extractor`.
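For reference, the relevant `package.json` wiring can be sketched as follows; the entry-point path is an assumption based on the `node cli.js` examples below, not a confirmed detail of this package:

```json
{
  "name": "mnhm-data-extractor",
  "version": "2.0.0",
  "bin": {
    "mnhm-data-extractor": "./cli.js"
  }
}
```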
## Batch Flags
- `--input-dir <path>` Override the default input directory
- `--output-dir <path>` Override the default output directory
- `--clean` Remove the output directory before processing
- `--report <path>` Write the run report to a custom path
- `--manifest <path>` Write the manifest to a custom path
- `--limit <n>` Process only the first `n` valid `.docx` files
- `--resume` Reuse the manifest and skip unchanged successful files
- `--concurrency <n>` Process up to `n` files in parallel
- `--log-level <level>` One of `debug`, `info`, `warn`, or `error`

Example (Windows `cmd.exe`):
```bat
node cli.js ^
  --input-dir ".\Files\inputs" ^
  --output-dir ".\Files\outputs" ^
  --clean ^
  --concurrency 4 ^
  --resume
```

PowerShell example:
```powershell
node .\cli.js `
  --input-dir .\Files\inputs `
  --output-dir .\Files\outputs `
  --clean `
  --concurrency 4 `
  --resume
```

## Batch Behavior
- Input validation rejects non-files, empty files, and unsupported extensions before parsing.
- One bad file does not stop the run. Failures are recorded and the batch continues.
- A run report is written to `outputs/reports/run-report.json` by default.
- A manifest is written to `outputs/reports/run-manifest.json` by default.
- Review files are written to `outputs/review/` when a record fails or needs manual inspection.
- Re-running the batch with the same inputs skips unchanged successful files when the manifest and expected outputs are still present.
## Output Schema
Each JSON file contains these top-level keys: `schemaVersion`, `person`, `source`, `fields`, `other`, `parsing`, `review`, and `metadata`.
High-level example:
```json
{
  "schemaVersion": "3.0.0",
  "person": {
    "id": "rec_1234567890ab",
    "name": "Example Person"
  },
  "source": {
    "documentFileName": "example person.docx",
    "id": "src_1234567890ab",
    "textFileName": "example-person.txt"
  },
  "fields": {
    "wehrmacht_service": {
      "label": "Wehrmacht",
      "provenance": [
        {
          "normalizedValue": "Main field content",
          "rawValue": "Main field content",
          "reviewStatus": "auto_accepted",
          "sourceDocument": "example person.docx",
          "sourceFieldLabel": "Wehrmacht",
          "sourceTextFile": "example-person.txt",
          "sourceType": "field"
        }
      ],
      "value": "Main field content",
      "subfields": {
        "deserted": {
          "label": "Desertéiert",
          "provenance": [],
          "value": "Subfield content"
        }
      }
    }
  },
  "other": "Unmapped trailing content",
  "parsing": {
    "anomalies": [],
    "missingRequiredFields": [],
    "unknownFields": []
  },
  "review": {
    "reasons": [],
    "status": "auto_accepted"
  },
  "metadata": {
    "outputFileName": "example-person.json",
    "recordId": "rec_1234567890ab",
    "sourceDocxFile": "example person.docx",
    "sourceId": "src_1234567890ab",
    "sourceTextFile": "example-person.txt"
  }
}
```

## Run Report
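As an illustration of consuming this schema, a small helper (not part of the extractor) could walk `fields` and collect every provenance entry whose `reviewStatus` is anything other than `auto_accepted`:

```javascript
// Illustrative consumer of the record schema: returns a flat list of
// field paths (including subfields) whose provenance entries still need
// human review.
function fieldsNeedingReview(record) {
  const flagged = [];
  const scan = (name, node) => {
    for (const entry of node.provenance ?? []) {
      if (entry.reviewStatus !== 'auto_accepted') {
        flagged.push({ field: name, status: entry.reviewStatus });
      }
    }
    // Recurse into subfields, building dotted field paths.
    for (const [subName, sub] of Object.entries(node.subfields ?? {})) {
      scan(`${name}.${subName}`, sub);
    }
  };
  for (const [name, field] of Object.entries(record.fields ?? {})) {
    scan(name, field);
  }
  return flagged;
}
```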
The run report contains:
- batch options
- per-file status
- success and failure counts
- skipped count
- review-required count
- unknown field and image failure totals
- benchmark data such as elapsed time, average file time, throughput, concurrency, and total input bytes
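The benchmark figures relate to each other in a simple way; a hypothetical helper (function and field names are illustrative, not the actual run-report keys) shows the arithmetic:

```javascript
// Derive average file time and throughput from a run's elapsed time and
// file count. Guards against division by zero for empty runs.
function benchmarkStats(fileCount, elapsedMs) {
  return {
    averageFileMs: fileCount > 0 ? elapsedMs / fileCount : 0,
    filesPerSecond: elapsedMs > 0 ? (fileCount / elapsedMs) * 1000 : 0,
  };
}
```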
## Recommended Workflow
- Place source `.docx` files in the input directory.
- Run a clean batch for the first import.
- Review `outputs/reports/run-report.json`.
- Inspect any files in `outputs/review/`.
- Re-run with `--resume` for subsequent passes.
- Consume the JSON files from `outputs/json/` or the consolidated folders in `outputs/all/`.
## Development
Lint the project:

```bash
npm run lint
```

Run automated tests:

```bash
npm test
```

The test suite includes parser behavior checks and snapshot-style batch output verification using the sample `.docx` fixtures already in the repository.
## License
MIT. See LICENSE.
