@pdf2md/core
v0.2.0
Convert PDF files to Markdown — client-side and Node.js
Convert PDF files to Markdown. Runs client-side in the browser or in Node.js.
Install
npm install @pdf2md/core
Library Usage
import { convert } from "@pdf2md/core";
import { readFile } from "node:fs/promises";
const buffer = await readFile("document.pdf");
const result = await convert(buffer.buffer);
console.log(result.markdown);
console.log(result.stats); // { pageCount, wordCount, processingMs }
Options
const controller = new AbortController(); // Call controller.abort() to cancel the conversion
const result = await convert(buffer.buffer, {
  maxPages: 10, // Limit number of pages to convert
  includeMetadata: true, // Extract title, author, etc.
  signal: controller.signal, // AbortSignal for cancellation
  onProgress: (progress) => {
    console.log(`${progress.stage}: page ${progress.currentPage}/${progress.totalPages}`);
  },
});
Result
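Because the result interface below defines three `status` values, callers will usually branch on `status` before using `markdown`. The following is a minimal sketch of that branching, not part of the package: `markdownOrThrow` is a hypothetical helper, the interface is copied locally so the snippet is self-contained, and `ConversionMessage` is left opaque because its fields are not documented here. It also assumes that a `"partial"` result still carries usable Markdown.

```typescript
// Local copy of the documented result shape, so the sketch compiles on its own.
// ConversionMessage is left opaque: its fields are not documented in this README.
type ConversionMessage = unknown;

interface ConversionResult {
  status: "success" | "partial" | "failed";
  markdown: string;
  messages: ConversionMessage[];
  stats: { pageCount: number; wordCount: number; processingMs: number };
  metadata?: { title?: string; author?: string; subject?: string; keywords?: string[]; creationDate?: string };
}

// Hypothetical helper: returns the Markdown to keep, or throws if conversion failed.
function markdownOrThrow(result: ConversionResult): string {
  if (result.status === "failed") {
    throw new Error(`Conversion failed: ${JSON.stringify(result.messages)}`);
  }
  if (result.status === "partial") {
    // Assumption: a partial result is still worth keeping, so only warn.
    console.warn(`Partial conversion with ${result.messages.length} message(s)`);
  }
  return result.markdown;
}
```

In real code the locally declared types would be replaced by whatever the package exports.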
interface ConversionResult {
  status: "success" | "partial" | "failed";
  markdown: string;
  messages: ConversionMessage[]; // Errors and warnings
  stats: { pageCount: number; wordCount: number; processingMs: number };
  metadata?: { title?: string; author?: string; subject?: string; keywords?: string[]; creationDate?: string };
}
CLI Usage
npx @pdf2md/core document.pdf > output.md
Prints Markdown to stdout. Warnings and errors go to stderr.
Conversion Pipeline
- Parse PDF via PDF.js
- Extract text items with position/font metadata
- Extract link annotations
- Detect and strip repeated headers/footers
- Build font size histogram for heading detection (H1-H6)
- Group text into blocks by vertical proximity
- Classify blocks: heading, list-item, or paragraph
- Match link annotations to text by bounding box overlap
- Apply bold/italic from font name heuristics
- Assemble Markdown output
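Two of the steps above, the font-size histogram for heading detection and grouping by vertical proximity, can be sketched as follows. This illustrates the general technique under stated assumptions and is not the package's actual implementation: the `TextItem` shape, the function names, the "most frequent size is body text" rule, and the `maxGap` threshold are all hypothetical.

```typescript
// Hypothetical shape for an extracted text item; not the package's real type.
interface TextItem {
  text: string;
  fontSize: number;
  y: number; // vertical position in page units
}

// Count characters per rounded font size, so the body text size dominates.
function buildFontHistogram(items: TextItem[]): Map<number, number> {
  const histogram = new Map<number, number>();
  for (const item of items) {
    const size = Math.round(item.fontSize);
    histogram.set(size, (histogram.get(size) ?? 0) + item.text.length);
  }
  return histogram;
}

// Treat the most frequent size as body text, then map every larger size
// to a heading level: the largest becomes H1, the next H2, capped at H6.
function headingLevels(histogram: Map<number, number>): Map<number, number> {
  let bodySize = 0;
  let bodyCount = -1;
  histogram.forEach((count, size) => {
    if (count > bodyCount) {
      bodySize = size;
      bodyCount = count;
    }
  });
  const larger = Array.from(histogram.keys())
    .filter((size) => size > bodySize)
    .sort((a, b) => b - a);
  const levels = new Map<number, number>();
  larger.forEach((size, index) => levels.set(size, Math.min(index + 1, 6)));
  return levels;
}

// Start a new block whenever the vertical gap to the previous item exceeds maxGap.
// Items are assumed to arrive in reading order.
function groupByVerticalProximity(items: TextItem[], maxGap: number): TextItem[][] {
  const blocks: TextItem[][] = [];
  let previous: TextItem | undefined;
  for (const item of items) {
    const current = blocks[blocks.length - 1];
    if (current === undefined || previous === undefined || Math.abs(item.y - previous.y) > maxGap) {
      blocks.push([item]);
    } else {
      current.push(item);
    }
    previous = item;
  }
  return blocks;
}
```

Weighting the histogram by character count rather than item count keeps a page with many short headings from being mistaken for body text.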
Limits
- 15 MB maximum file size
- No OCR support (scanned/image-based PDFs return an error)
- Password-protected PDFs are not supported
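Given the size limit above, callers may want to reject oversized input before starting a conversion. The sketch below is one way to do that. `MAX_PDF_BYTES` and `isWithinSizeLimit` are hypothetical names, and the constant assumes "15 MB" means 15 × 1024 × 1024 bytes, which this README does not specify.

```typescript
// Assumption: "15 MB" is read as 15 * 1024 * 1024 bytes; the exact
// threshold the package enforces is not stated in the README.
const MAX_PDF_BYTES = 15 * 1024 * 1024;

// Hypothetical pre-check: true when the input is small enough to attempt conversion.
function isWithinSizeLimit(byteLength: number): boolean {
  return byteLength <= MAX_PDF_BYTES;
}
```

In Node.js the argument could be `buffer.byteLength`; in the browser, `file.size` from a `File` object.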
License
MIT
