pdf2md-cli

v1.0.0

Published

12 days ago

Convert PDFs to Markdown from the command line — supports local files and web URLs

Downloads

131

nexacrawl

pdf markdown cli converter pdf-to-markdown developer-tools

pdf2md-cli

Convert any PDF to clean Markdown — from the command line.
Smart heading detection, page separators, and full source metadata. Works with local files and remote URLs.

Install

npm install -g pdf2md-cli

Requires Node.js 18+.

Usage

Convert a local PDF

pdf2md local ./report.pdf

Fetch and convert a remote PDF

pdf2md web https://arxiv.org/pdf/2103.00020.pdf

Flags

| Flag | Short | Description |
|---|---|---|
| --output <file> | -o | Custom output filename (default: same name as PDF) |
| --clipboard | -c | Copy Markdown to clipboard after saving |

Examples

# Save to a custom path
pdf2md local ./research.pdf --output ./notes/research.md

# Fetch remote PDF and copy result to clipboard
pdf2md web https://example.com/paper.pdf --clipboard

# Full example with both flags
pdf2md local ./invoice.pdf -o ./invoices/invoice.md -c

Output Format

Every conversion produces a structured Markdown file:

<!-- Source: report.pdf | Extracted: Apr 18, 2026, 06:00 PM -->

## Page 1

# Document Title

## Section Heading

Paragraph text flows here with proper line breaks
and paragraph spacing preserved from the original PDF.

---

## Page 2

### Subsection

Continued content...
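
As a rough illustration of how this layout could be assembled, here is a minimal sketch. It is not the tool's actual source: `buildMarkdown` and its parameters are hypothetical names, and the timestamp is passed in as a preformatted string to keep the example deterministic.

```javascript
// Hypothetical helper (not from pdf2md-cli): assembles the final Markdown
// from an array of per-page bodies.
// `source` is the filename or URL; `extractedAt` is a preformatted timestamp.
function buildMarkdown(source, extractedAt, pages) {
  const header = `<!-- Source: ${source} | Extracted: ${extractedAt} -->`;
  const sections = pages.map(
    (body, index) => `## Page ${index + 1}\n\n${body.trim()}`
  );
  // Pages are separated by a horizontal rule, matching the sample output.
  return `${header}\n\n${sections.join("\n\n---\n\n")}\n`;
}

const markdown = buildMarkdown("report.pdf", "Apr 18, 2026, 06:00 PM", [
  "# Document Title\n\n## Section Heading\n\nParagraph text.",
  "### Subsection\n\nContinued content...",
]);
console.log(markdown);
```

Joining the page sections with `---` keeps the separator between pages only, so the file does not end with a dangling rule.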

Heading detection reads each text item's font size from the PDF's transform matrix and maps it to a heading level:

| Font size | Markdown heading |
|---|---|
| ≥ 22px | # H1 |
| ≥ 18px | ## H2 |
| ≥ 15px | ### H3 |
| > 14px (distinct size) | #### H4 |
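
The thresholds above can be sketched as a small lookup. This is an illustration under stated assumptions, not the project's real code: it assumes each text item carries a six-element `transform` array (as pdf.js text items do), it takes the vertical scale of that matrix as the font size, and it simplifies the "distinct size" rule to a plain `> 14` check. The function names are hypothetical.

```javascript
// Approximate the font size from a transform [a, b, c, d, e, f].
// For upright text the matrix is [size, 0, 0, size, x, y], so the
// length of the (c, d) column gives the vertical scale.
function fontSizeFromTransform(transform) {
  return Math.hypot(transform[2], transform[3]);
}

// Thresholds copied from the table above; 0 means body text.
// Assumption: the "distinct size" condition for H4 is ignored here.
function headingLevel(fontSize) {
  if (fontSize >= 22) return 1;
  if (fontSize >= 18) return 2;
  if (fontSize >= 15) return 3;
  if (fontSize > 14) return 4;
  return 0;
}

// Prefix a line with the matching number of '#' characters.
function toMarkdownLine(text, transform) {
  const level = headingLevel(fontSizeFromTransform(transform));
  return level === 0 ? text : `${"#".repeat(level)} ${text}`;
}

console.log(toMarkdownLine("Document Title", [24, 0, 0, 24, 72, 700]));
console.log(toMarkdownLine("Body text", [11, 0, 0, 11, 72, 600]));
```

Checking the largest threshold first means each size falls into exactly one bucket without needing upper bounds.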

Features

Smart heading detection: reads raw font-size data from the PDF transform matrix rather than guessing
Y-sorted text extraction: items are sorted by visual position (top→bottom, left→right) before parsing
Page separators: each page is a clearly labeled ## Page N section, with --- dividers between pages
Source metadata: the output header includes the filename or URL and the extraction timestamp
Colored terminal output: chalk-powered ✔ success, ✖ error, and ⚠ warning messages
Spinner feedback: an ora spinner shows progress during extraction
Graceful error handling: specific messages for missing files, encrypted PDFs, 404s, and timeouts
Works offline: local mode requires no internet connection
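
To make the Y-sorted extraction idea concrete, here is a minimal sketch of one way to order text items by visual position. It assumes items shaped like `{ str, x, y }` with `y` growing upward (the usual PDF coordinate space). The `tolerance` parameter and the function name are hypothetical, not taken from the tool.

```javascript
// Group text items into visual lines, top to bottom, then left to right.
// Assumption: y grows upward, so a larger y is nearer the top of the page.
function sortIntoLines(items, tolerance = 2) {
  // Copy before sorting so the caller's array is left untouched.
  const byY = [...items].sort((a, b) => b.y - a.y);
  const lines = [];
  for (const item of byY) {
    const current = lines[lines.length - 1];
    // An item joins the current line when its y is within `tolerance`
    // of the first item that opened that line.
    if (current && Math.abs(current.y - item.y) <= tolerance) {
      current.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }
  // Within each line, read left to right.
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map((item) => item.str)
      .join(" ")
  );
}

const lines = sortIntoLines([
  { str: "world", x: 120, y: 700.4 },
  { str: "Second line", x: 72, y: 680 },
  { str: "Hello", x: 72, y: 700 },
]);
console.log(lines); // [ 'Hello world', 'Second line' ]
```

Grouping by a tolerance, instead of comparing exact y values, keeps fragments of one line together when their baselines differ by a fraction of a unit.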

Error Handling

| Scenario | Behavior |
|---|---|
| File not found | Clear error with resolved path |
| Not a .pdf file | Rejects before attempting extraction |
| Encrypted / password-protected PDF | Specific error message |
| Scanned / image-only PDF | Detects no text content, explains why |
| Unreachable URL | Distinguishes DNS failure vs. refused connection |
| HTTP 404 / 403 | Status-specific error messages |
| Request timeout (30s) | Clean timeout message |
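
The network rows of this table could be handled along the following lines. This is a sketch only: it assumes an axios-style error object (`error.response.status` for HTTP failures, `error.code` for transport failures such as `ENOTFOUND`, `ECONNREFUSED`, or `ECONNABORTED`). The function name and message wording are hypothetical and not the tool's actual output.

```javascript
// Hypothetical mapper from an axios-style error to a user-facing message.
function describeFetchError(error, url) {
  // HTTP-level failures carry a response with a status code.
  const status = error.response && error.response.status;
  if (status === 404) return `Not found (404): ${url}`;
  if (status === 403) return `Access forbidden (403): ${url}`;
  if (status) return `Request failed with HTTP ${status}: ${url}`;

  // Transport-level failures carry a Node.js or axios error code.
  switch (error.code) {
    case "ENOTFOUND":
      return `DNS lookup failed for ${url}`;
    case "ECONNREFUSED":
      return `Connection refused by ${url}`;
    case "ECONNABORTED": // axios default code when its timeout elapses
    case "ETIMEDOUT":
      return `Request timed out after 30s: ${url}`;
    default:
      return `Could not fetch ${url}: ${error.message}`;
  }
}

console.log(describeFetchError({ code: "ENOTFOUND" }, "https://example.com/paper.pdf"));
console.log(describeFetchError({ response: { status: 404 } }, "https://example.com/paper.pdf"));
```

Checking the HTTP status before the error code keeps the two failure classes separate: a response means the server was reached, while a code means it was not.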

Tech Stack

| Package | Version | Role |
|---|---|---|
| commander | ^12 | CLI commands & flags |
| pdf-parse | ^1.1 | PDF text extraction (Node.js) |
| axios | ^1.7 | Fetch remote PDFs |
| chalk | ^5 | Colored terminal output |
| ora | ^8 | Loading spinner |

Why This Project

Most PDF-to-Markdown tools treat extraction as a text dump — they grab raw characters and call it done. The result is flat, unstructured Markdown that loses all the hierarchy the original document had.

I originally built two Tampermonkey userscripts — one for browser-tab PDFs, one for local files — that solved this by reading the PDF's raw transform matrix to detect font sizes, then mapping those sizes to proper heading levels. After using them daily and finding them more reliable than any existing tool I tried, I ported the core logic to a Node.js CLI so it can live in developer workflows, automation scripts, and CI pipelines.

The result: Markdown that actually reflects the document's structure — with real headings, proper paragraph breaks, and page markers — ready to paste into Obsidian, Notion, or any Markdown editor.

Local Development

git clone https://github.com/Kevork-Nexacrawl-dev/pdf2md-cli.git
cd pdf2md-cli
npm install
npm link          # makes pdf2md available globally

pdf2md local ./test.pdf

Contributing

PRs are welcome. Open an issue first for any significant changes.
