@mvasin/doc-converter
v0.0.2
Published
`@mvasin/doc-converter` converts `.doc`, `.docx`, and `.pdf` resumes into Markdown files, optionally clearing prior output before writing.
PDF/DOC(X) Converter
Usage
Run without installation:
    npx @mvasin/doc-converter --input <file|directory> --output <file|directory> [--clear-output]

- `--input` (required): path to a single `.doc`, `.docx`, or `.pdf`, or a directory containing those files.
- `--output` (required): a file path when the input is a single file, or a directory path when the input is a folder.
- `--clear-output` (optional): when provided, deletes existing output files/folders before conversion; otherwise, existing `.md` files are preserved and conversion skips them.
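The `--output` and `--clear-output` rules above can be sketched roughly as follows. This is an illustration only, not the package's actual source: `planConversions` and `selectPending` are hypothetical helper names, and the choice to name each Markdown file after its source file is an assumption.

```javascript
const fs = require('node:fs');
const path = require('node:path');

const SUPPORTED = new Set(['.doc', '.docx', '.pdf']);

// Hypothetical helper: map an --input path to { source, target } pairs.
function planConversions(input, output) {
  if (!fs.statSync(input).isDirectory()) {
    // Single file: --output is the Markdown file path itself.
    return [{ source: input, target: output }];
  }
  // Directory: --output is a directory; assume <name>.md for each source file.
  return fs.readdirSync(input)
    .filter((name) => SUPPORTED.has(path.extname(name).toLowerCase()))
    .sort()
    .map((name) => ({
      source: path.join(input, name),
      target: path.join(output, `${path.parse(name).name}.md`),
    }));
}

// Hypothetical helper: apply the --clear-output / skip-existing rule.
function selectPending(plan, output, clearOutput) {
  if (clearOutput) {
    // Delete prior output (file or folder); the caller recreates it on write.
    fs.rmSync(output, { recursive: true, force: true });
    return plan;
  }
  // Without the flag, existing .md files are preserved and skipped.
  return plan.filter(({ target }) => !fs.existsSync(target));
}
```

Note that under this naming assumption `resume.doc` and `resume.docx` in the same folder would map to the same `resume.md`; how the real tool handles that case is not stated here.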
When testing locally without publishing, link the package and execute the bin:
    npm link
    doc-converter --input ./input/resume.docx --output ./resume.md
Requirements
- Node.js (see `package.json` for engine/dep versions).
- LibreOffice (for headless `.doc` → `.docx` conversion).
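Since LibreOffice is an external requirement, a script that wraps the converter may want to check for it up front. The sketch below is one possible preflight check, not something the package is known to do; `isCommandAvailable` and `hasLibreOffice` are hypothetical names, and it assumes `soffice` is on the `PATH` and answers `--version`.

```javascript
const { spawnSync } = require('node:child_process');

// Hypothetical helper: true when the command starts and exits with status 0.
function isCommandAvailable(command, args) {
  const result = spawnSync(command, args, { stdio: 'ignore', timeout: 10000 });
  // A missing executable is reported through result.error (ENOENT), not thrown.
  return !result.error && result.status === 0;
}

// Assumption: LibreOffice exposes its CLI as `soffice` on this host.
function hasLibreOffice() {
  return isCommandAvailable('soffice', ['--version']);
}
```

Calling `hasLibreOffice()` before processing any `.doc` input gives a clear early failure instead of an error partway through a batch.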
What it does under the hood
- Cleans the `output/` directory when `--clear-output` is passed, so the run starts fresh.
- Accepts `.doc`, `.docx`, and `.pdf` resumes from the path given to `--input`.
- Converts `.doc` files to `.docx` via `soffice --headless --convert-to docx` (LibreOffice must be installed on the host).
- Pipes every `.docx` through `mammoth` and `turndown` to emit Markdown, stripping inline images.
- Extracts text from PDFs using `pdf-parse` and formats it with paragraph spacing for readability.
- Logs success/failure for each file and leaves `.md` files in `output/`.
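Two of the steps above, the LibreOffice call and the PDF paragraph spacing, can be sketched with Node.js built-ins alone. This is a sketch under stated assumptions: the function names are hypothetical, the `--outdir` argument is added here for illustration, and the paragraph heuristic (blank lines separate paragraphs, lines inside a paragraph are joined) is a guess at what "paragraph spacing" means, not the package's confirmed logic.

```javascript
const { execFileSync } = require('node:child_process');
const path = require('node:path');

// Build the LibreOffice arguments for a headless .doc -> .docx conversion.
function sofficeArgs(docPath, outDir) {
  return ['--headless', '--convert-to', 'docx', '--outdir', outDir, docPath];
}

// Run the conversion and return the path LibreOffice is expected to write.
// Requires `soffice` on the PATH; throws if the command fails.
function convertDocToDocx(docPath, outDir) {
  execFileSync('soffice', sofficeArgs(docPath, outDir), { stdio: 'ignore' });
  return path.join(outDir, `${path.parse(docPath).name}.docx`);
}

// Assumed heuristic: normalize raw PDF text into paragraphs separated by
// exactly one blank line, trimming stray whitespace on each line.
function formatPdfText(raw) {
  const paragraphs = raw
    .replace(/\r\n?/g, '\n')
    .split(/\n\s*\n+/)
    .map((block) => block.split('\n').map((line) => line.trim()).filter(Boolean).join(' '))
    .filter(Boolean);
  return paragraphs.join('\n\n') + '\n';
}
```

Passing the arguments as an array to `execFileSync` avoids shell quoting problems with file names that contain spaces.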
NPM dependencies
- `mammoth`: reads `.docx` and produces HTML.
- `turndown`: converts Mammoth’s HTML into Markdown while allowing custom rules like image stripping.
- `pdf-parse`: pulls raw text from PDFs for Markdown output.
- Node.js built-ins: `fs`, `path`, `os`, and `child_process` for file handling and LibreOffice calls.
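How these dependencies might chain together can be sketched as below. It is a minimal illustration, not the package's source: the function names and the `stripImages` rule name are hypothetical, the `headingStyle` option is an arbitrary choice, and the `pdf-parse` call assumes the 1.x function-style API. The third-party modules are required inside each function so the file loads even where they are not installed.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// .docx -> HTML (mammoth) -> Markdown (turndown), dropping images.
async function docxToMarkdown(docxPath) {
  const mammoth = require('mammoth');
  const TurndownService = require('turndown');
  const { value: html } = await mammoth.convertToHtml({ path: docxPath });
  const turndown = new TurndownService({ headingStyle: 'atx' });
  // Custom rule: replace every <img> element with nothing.
  turndown.addRule('stripImages', { filter: 'img', replacement: () => '' });
  return turndown.turndown(html);
}

// .pdf -> raw text (assumes the pdf-parse 1.x call style).
async function pdfToText(pdfPath) {
  const pdfParse = require('pdf-parse');
  const { text } = await pdfParse(await fs.readFile(pdfPath));
  return text;
}

// Pick a converter by file extension; undefined for unsupported types.
function converterFor(filePath) {
  const byExtension = { '.docx': docxToMarkdown, '.pdf': pdfToText };
  return byExtension[path.extname(filePath).toLowerCase()];
}
```

A `.doc` file has no direct entry here because, per the steps described earlier, it is first converted to `.docx` by LibreOffice and then follows the `.docx` path.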
