mdize
v0.2.0
Convert documents (PDF, DOCX, PPTX, XLSX, HTML, CSV, images, XML) to Markdown
Mdize
Convert documents to Markdown, preserving their structure (headings, lists, tables, links, images). The Markdown output is designed to be processed by AI, not humans.
An attempt to port Microsoft's markitdown Python library to TypeScript.
Supported Formats
| Format | Extensions | Notes |
|--------|-----------|-------|
| PDF | .pdf | Rich text (headings, bold, italic, links), borderless table detection |
| DOCX | .docx | Via mammoth → HTML → Markdown |
| PPTX | .pptx | Slides, tables, charts, images, notes |
| XLSX | .xlsx | All sheets as Markdown tables |
| HTML | .html, .htm | Strips scripts/styles, preserves structure |
| CSV | .csv | Markdown table with charset auto-detection |
| Images | .jpg, .png | EXIF metadata + optional OCR with table detection |
| XML/RSS | .xml, .rss, .atom | RSS and Atom feed parsing |
| Plain text | .txt, .md, .json | Passthrough with charset handling |
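A caller may want to check whether a file is covered by the table above before handing it to the converter. The following is a minimal sketch of such a pre-check. `SUPPORTED_EXTENSIONS`, `extensionOf`, and `detectFormat` are hypothetical names that are not part of mdize, and the case-insensitive match is a choice made for this sketch, not documented mdize behavior.

```typescript
// Extensions copied from the "Supported Formats" table.
const SUPPORTED_EXTENSIONS: Record<string, string> = {
  ".pdf": "PDF",
  ".docx": "DOCX",
  ".pptx": "PPTX",
  ".xlsx": "XLSX",
  ".html": "HTML",
  ".htm": "HTML",
  ".csv": "CSV",
  ".jpg": "Images",
  ".png": "Images",
  ".xml": "XML/RSS",
  ".rss": "XML/RSS",
  ".atom": "XML/RSS",
  ".txt": "Plain text",
  ".md": "Plain text",
  ".json": "Plain text",
};

// Hypothetical helper: lowercased extension including the leading dot,
// or "" when the file name has no extension (dotfiles count as none).
function extensionOf(path: string): string {
  const lastSeparator = Math.max(path.lastIndexOf("/"), path.lastIndexOf("\\"));
  const name = path.slice(lastSeparator + 1);
  const dot = name.lastIndexOf(".");
  return dot > 0 ? name.slice(dot).toLowerCase() : "";
}

// Hypothetical helper: format name from the table, or undefined if unlisted.
function detectFormat(path: string): string | undefined {
  return SUPPORTED_EXTENSIONS[extensionOf(path)];
}
```

A result of `undefined` only means the extension is not listed in the table; it says nothing about whether a custom converter has been registered for it.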
Installation
npm install mdize
Requires Node.js >= 18.
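If a script may run on older runtimes, the version requirement can be checked up front. This is a small sketch of such a guard; `isSupportedNode` is a hypothetical helper and not something mdize exports.

```typescript
// Hypothetical guard: true when the major version is at least minMajor.
// Accepts strings such as "18.17.0" or "v20.1.0".
function isSupportedNode(version: string, minMajor = 18): boolean {
  // parseInt stops at the first ".", so only the major component is read.
  const major = Number.parseInt(version.replace(/^v/, ""), 10);
  // An unparsable version yields NaN, which compares false.
  return major >= minMajor;
}
```

In a Node.js script this could be called as `isSupportedNode(process.versions.node)` before importing the library.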
Usage
import { Mdize } from "mdize";
const converter = new Mdize();
// Convert a file
const result = await converter.convertFile("document.pdf");
console.log(result.markdown);
// Convert a buffer
import { readFile } from "node:fs/promises";
const buffer = await readFile("spreadsheet.xlsx");
const result2 = await converter.convertBuffer(buffer, { extension: ".xlsx" });
console.log(result2.markdown);
// Auto-detect: string = file path, Buffer = raw data
const result3 = await converter.convert("presentation.pptx");
Options
// Keep full data URIs (e.g. base64 images in DOCX/PPTX)
const docxResult = await converter.convertFile("doc.docx", { keepDataUris: true });
// Enable OCR for images (requires tesseract.js)
const ocrResult = await converter.convertFile("invoice.jpg", { ocr: true });
// Provide a charset hint for non-UTF-8 files (csvBuffer is a Buffer holding the file contents)
const csvResult = await converter.convertBuffer(csvBuffer, {
  extension: ".csv",
  charset: "cp932",
});
Custom Converters
import { Mdize, DocumentConverter, PRIORITY_SPECIFIC } from "mdize";
class MyConverter extends DocumentConverter {
  accepts(input, info) {
    return info.extension === ".custom";
  }
  async convert(input, info, options) {
    return { markdown: input.toString("utf-8") };
  }
}
const converter = new Mdize();
converter.register(new MyConverter(), PRIORITY_SPECIFIC);
API
Mdize
| Method | Description |
|--------|-------------|
| convert(source, options?) | Auto-detect: file path (string) or Buffer |
| convertFile(path, options?) | Convert a local file |
| convertBuffer(buffer, info?, options?) | Convert a Buffer with optional metadata |
| register(converter, priority?) | Register a custom converter |
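The `register(converter, priority?)` row implies that converters are consulted in some priority order and that the first one whose `accepts` returns true handles the input. The sketch below illustrates that idea only; it is not mdize's implementation. `ConverterRegistry`, `select`, `SPECIFIC`, and `GENERIC` are hypothetical names, and the rules "lower value is tried first" and "later registration wins a tie" are assumptions of this sketch.

```typescript
// Minimal stand-in for the StreamInfo shape documented in this README.
interface StreamInfo {
  extension?: string;
}

// Reduced converter shape: only what the selection step needs.
interface Converter {
  name: string;
  accepts(input: Uint8Array, info: StreamInfo): boolean;
}

// Assumed convention for this sketch: lower values are tried first.
const SPECIFIC = 0;
const GENERIC = 10;

class ConverterRegistry {
  private entries: { converter: Converter; priority: number; order: number }[] = [];
  private counter = 0;

  register(converter: Converter, priority: number = SPECIFIC): void {
    this.entries.push({ converter, priority, order: this.counter++ });
    // Sort by priority; on a tie, the most recently registered goes first.
    this.entries.sort((a, b) => a.priority - b.priority || b.order - a.order);
  }

  // Returns the first converter that accepts the input, or undefined.
  select(input: Uint8Array, info: StreamInfo): Converter | undefined {
    return this.entries.find((e) => e.converter.accepts(input, info))?.converter;
  }
}
```

Under these assumptions, a catch-all converter registered with `GENERIC` acts as a fallback, while a format-specific one registered with `SPECIFIC` is asked first.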
ConversionResult
interface ConversionResult {
  markdown: string; // The converted Markdown
  title?: string;   // Document title (from HTML <title>, etc.)
}
StreamInfo
interface StreamInfo {
  filename?: string;
  extension?: string; // e.g. ".pdf"
  mimetype?: string;  // e.g. "application/pdf"
  charset?: string;   // e.g. "utf-8", "cp932"
}
ConvertOptions
interface ConvertOptions {
  url?: string;           // URL context for the document
  keepDataUris?: boolean; // Keep full base64 data URIs
  ocr?: boolean;          // Enable OCR for images
}
Development
npm test          # Run tests
npm run build     # Build ESM + CJS
npm run typecheck # Type check
License
MIT
