markitdownllm
v0.1.5
Browser-native document-to-Markdown converter for LLM pipelines. Converts PDF, DOCX, XLSX, PPTX, HTML, CSV, EPUB and more — entirely client-side, no server required.
MarkItDownLLM
Browser-native document → Markdown converter for LLM pipelines
markitdownllm converts PDF, DOCX, XLSX, PPTX, HTML, CSV, EPUB and more to clean, structured Markdown — entirely in the browser, with zero server calls and zero data leaving the device.
markitdownllm.com - Live Site
Ported from Microsoft's MarkItDown Python library. Same architecture, same converter pipeline, same LLM-optimised output — running natively in any modern browser.
Why Markdown for LLMs?
Markdown is the most token-efficient way to feed structured documents into LLMs:
| Signal | Impact |
| --- | --- |
| Token reduction | 20–45% fewer tokens vs unstructured text extraction |
| Tables | ~35% fewer tokens than describing the same data in prose |
| Headings | LLMs can skip to relevant sections without reading everything |
| Training alignment | GPT-4o, Claude, and Gemini are all trained heavily on Markdown, so they produce more accurate outputs on structured input |
Quick Start
npm install markitdownllm
Note: PDF conversion requires pdfjs-dist as a peer dependency. Copy node_modules/pdfjs-dist/build/pdf.worker.min.mjs to your public/ directory and set workerSrc (see PDF Setup).
import { MarkItDown } from 'markitdownllm';
const converter = new MarkItDown();
// From a browser File object (drag-and-drop, <input type="file">)
const result = await converter.convert(file);
console.log(result.markdown);
Supported Formats
| Format | Extensions | Notes |
| --- | --- | --- |
| PDF | .pdf | Text extraction with table detection |
| Word | .docx | Headings, lists, tables, images (via mammoth) |
| Excel | .xlsx, .xls | Multi-sheet, aligned GFM tables |
| PowerPoint | .pptx | Slides, speaker notes, embedded tables, image captions |
| HTML | .html, .htm | Main content extraction, full GFM output |
| CSV / TSV | .csv, .tsv | GFM table with separator row |
| EPUB | .epub | Spine traversal, chapter-by-chapter extraction |
| Plain text | .txt, .md, .rst | Passthrough |
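To make "GFM table with separator row" concrete, here is a minimal sketch of turning CSV text into a GFM table. It is not the package's CsvConverter: the names `csvToGfmSketch` and `parseCsvLine` are hypothetical, and the parser assumes each record sits on one line (quoted fields may contain delimiters but not newlines).

```typescript
// Hypothetical helper: split one CSV line into cells, honoring double quotes.
function parseCsvLine(line: string, delimiter: string = ','): string[] {
  const cells: string[] = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"'; // an escaped quote ("") inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      cells.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  cells.push(current);
  return cells;
}

// Hypothetical sketch: first row becomes the header, followed by a --- separator row.
function csvToGfmSketch(csv: string, delimiter: string = ','): string {
  const rows = csv
    .split(/\r?\n/)
    .filter((line) => line.trim() !== '')
    .map((line) => parseCsvLine(line, delimiter));
  if (rows.length === 0) return '';

  const width = Math.max(...rows.map((row) => row.length));
  const render = (cells: string[]): string => {
    // Pad short rows so every row has the same number of columns.
    const padded = Array.from({ length: width }, (_, i) => cells[i] ?? '');
    // A literal pipe would end the cell early, so escape it.
    return '| ' + padded.map((c) => c.trim().replace(/\|/g, '\\|')).join(' | ') + ' |';
  };

  const header = rows[0] ?? [];
  const separator = '| ' + new Array<string>(width).fill('---').join(' | ') + ' |';
  return [render(header), separator, ...rows.slice(1).map(render)].join('\n');
}
```

Passing `'\t'` as the delimiter covers the TSV case in the same way.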
API
new MarkItDown(options?)
import { MarkItDown } from 'markitdownllm';
const converter = new MarkItDown({
// Optional: enable AI image captioning for PPTX embedded images
llmConfig: {
provider: 'anthropic', // 'anthropic' | 'openai'
apiKey: 'sk-ant-...',
model: 'claude-haiku-4-5-20251001', // optional, defaults to fastest model
},
// Optional: custom mammoth style map for DOCX
docxStyleMap: `
p[style-name='Custom Heading'] => h2:fresh
`,
});
const result = await converter.convert(file);
// result.markdown: clean, LLM-ready Markdown string
converter.convert(file: File): Promise<DocumentConverterResult>
Converts a browser File object to Markdown. The correct converter is selected automatically based on file extension and MIME type. Falls back to the next converter on error.
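As a rough illustration of how the extension and MIME type might be derived before a converter is chosen, consider the sketch below. It is an assumption about the general approach, not the package's getStreamInfo() implementation; `FileLike`, `StreamInfoSketch`, and `getStreamInfoSketch` are hypothetical names, and `FileLike` stands in for the two browser File fields the sketch reads.

```typescript
// Hypothetical stand-in for the browser File fields used here (file.name, file.type).
interface FileLike {
  name: string;
  type: string;
}

// Hypothetical result shape, loosely modeled on the StreamInfo fields in this README.
interface StreamInfoSketch {
  filename: string;
  extension: string | undefined;
  mimetype: string | undefined;
}

function getStreamInfoSketch(file: FileLike): StreamInfoSketch {
  const dot = file.name.lastIndexOf('.');
  // dot > 0 skips names with no dot and dotfiles such as ".gitignore".
  const extension = dot > 0 ? file.name.slice(dot).toLowerCase() : undefined;
  // Browsers report an empty string when they cannot determine the MIME type.
  const mimetype = file.type !== '' ? file.type : undefined;
  return { filename: file.name, extension, mimetype };
}
```

Lowercasing the extension lets a converter compare against '.pdf' without worrying about files named 'REPORT.PDF'.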
const result = await converter.convert(file);
console.log(result.markdown); // GFM Markdown string
converter.register(converter, priority)
Register a custom converter. Lower priority = tried first.
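The sketch below shows one way a priority-sorted registry with fallback on error could work. It is a simplified model written for this README, not the package's internals: `RegistrySketch`, `ConverterSketch`, `FileLike`, and the priority values 0 and 10 used in the usage note are all hypothetical.

```typescript
// Hypothetical minimal types so the sketch runs without the package or a browser.
interface FileLike {
  name: string;
  type: string;
}
interface InfoSketch {
  extension: string | undefined;
}
interface ResultSketch {
  markdown: string;
}
interface ConverterSketch {
  accepts(file: FileLike, info: InfoSketch): boolean;
  convert(file: FileLike, info: InfoSketch): Promise<ResultSketch>;
}

class RegistrySketch {
  private entries: { converter: ConverterSketch; priority: number }[] = [];

  register(converter: ConverterSketch, priority: number): void {
    this.entries.push({ converter, priority });
    // Ascending sort: the lowest priority value is tried first.
    this.entries.sort((a, b) => a.priority - b.priority);
  }

  // Converters that accept the file, in the order they would be tried.
  candidates(file: FileLike, info: InfoSketch): ConverterSketch[] {
    return this.entries.map((e) => e.converter).filter((c) => c.accepts(file, info));
  }

  async convert(file: FileLike): Promise<ResultSketch> {
    const dot = file.name.lastIndexOf('.');
    const info: InfoSketch = {
      extension: dot > 0 ? file.name.slice(dot).toLowerCase() : undefined,
    };
    const failures: unknown[] = [];
    for (const converter of this.candidates(file, info)) {
      try {
        return await converter.convert(file, info);
      } catch (error) {
        failures.push(error); // fall through to the next accepting converter
      }
    }
    throw new Error(`No converter succeeded for ${file.name} (${failures.length} failed)`);
  }
}
```

In this model, a format-specific converter registered with priority 0 is tried before a generic one registered with priority 10, and the generic one still runs if the specific one throws.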
import { MarkItDown, PRIORITY_SPECIFIC } from 'markitdownllm';
const md = new MarkItDown();
md.register(
{
accepts: (file, info) => info.extension === '.custom',
convert: async (file, info) => ({
markdown: `Custom output for ${file.name}`,
}),
},
PRIORITY_SPECIFIC
);
Individual Converters
All converters are exported and can be used directly:
import {
PdfConverter,
DocxConverter,
XlsxConverter,
XlsConverter,
PptxConverter,
HtmlConverter,
CsvConverter,
EpubConverter,
PlainTextConverter,
} from 'markitdownllm';
const pdfConverter = new PdfConverter();
const result = await pdfConverter.convert(file, { extension: '.pdf' });
captionImage(bytes, mimeType, llmConfig, prompt?)
Caption an image using an LLM vision API. Used internally by the PPTX converter when llmConfig is set.
import { captionImage } from 'markitdownllm';
const caption = await captionImage(imageBytes, 'image/png', {
provider: 'anthropic',
apiKey: 'sk-ant-...',
});
normalizeWhitespace(text)
Post-processor applied to all converter output: strips trailing spaces per line, collapses 3+ blank lines to 2, trims document edges.
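The three steps described above can be sketched as follows. This is an illustrative reimplementation under the stated rules, not the package's source; `normalizeWhitespaceSketch` is a hypothetical name, and the sketch assumes "\n" line endings.

```typescript
function normalizeWhitespaceSketch(text: string): string {
  return text
    .split('\n')
    // 1. Strip trailing spaces and tabs from every line.
    .map((line) => line.replace(/[ \t]+$/, ''))
    .join('\n')
    // 2. Three or more blank lines means four or more consecutive newlines;
    //    collapse them to three newlines, which leaves two blank lines.
    .replace(/\n{4,}/g, '\n\n\n')
    // 3. Trim leading and trailing whitespace at the document edges.
    .trim();
}
```

Trailing spaces are stripped first so that lines containing only spaces count as blank when the blank-line runs are collapsed.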
import { normalizeWhitespace } from 'markitdownllm';
const clean = normalizeWhitespace(rawText);
LLM Image Captioning
When llmConfig is provided, the PPTX converter sends embedded images to an LLM vision API and uses the response as alt text. Supports both Anthropic and OpenAI.
const converter = new MarkItDown({
llmConfig: {
provider: 'anthropic',
apiKey: process.env.ANTHROPIC_API_KEY!,
model: 'claude-sonnet-4-6', // better quality captions
},
});
const result = await converter.convert(pptxFile);
// Embedded images are emitted with the LLM-generated caption as their alt text

| Provider | Default model | Notes |
| --- | --- | --- |
| anthropic | claude-haiku-4-5-20251001 | Fast and cheap |
| openai | gpt-4o-mini | Widely available |
Without llmConfig, images fall back to their embedded alt text or, failing that, the filename.
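The fallback order described here (LLM caption, then embedded alt text, then filename) can be modeled with a small helper. This is a sketch under assumptions: `ImageSketch` and `imageToMarkdownSketch` are hypothetical names, and the exact Markdown the PPTX converter emits, including what it uses as the image target, may differ.

```typescript
// Hypothetical description of one embedded image.
interface ImageSketch {
  path: string; // assumed image target, e.g. a media path inside the PPTX
  filename: string;
  embeddedAlt?: string; // alt text stored in the document, if any
  llmCaption?: string; // caption returned by the vision API, if llmConfig was set
}

function imageToMarkdownSketch(image: ImageSketch): string {
  // Preference order: LLM caption, then embedded alt text, then filename.
  const candidates = [image.llmCaption, image.embeddedAlt, image.filename];
  const alt = candidates.find((t) => t !== undefined && t.trim() !== '') ?? image.filename;
  const safeAlt = alt
    .trim()
    // Captions may span lines; alt text reads best on one line.
    .replace(/\s+/g, ' ')
    // Square brackets would end the alt text early, so escape them.
    .replace(/[\[\]]/g, '\\$&');
  return `![${safeAlt}](${image.path})`;
}
```

A blank or whitespace-only caption is treated as missing, so an empty API response does not produce an image with empty alt text.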
PDF Setup
PDF conversion uses pdfjs-dist as a peer dependency. The worker file must be served as a static asset.
npm install pdfjs-dist
Vite / React:
// vite.config.ts
import { viteStaticCopy } from 'vite-plugin-static-copy';
export default {
plugins: [
viteStaticCopy({
targets: [{ src: 'node_modules/pdfjs-dist/build/pdf.worker.min.mjs', dest: '' }],
}),
],
};
Next.js:
# Copy worker to public directory
cp node_modules/pdfjs-dist/build/pdf.worker.min.mjs public/
import { PdfConverter } from 'markitdownllm';
import * as pdfjsLib from 'pdfjs-dist';
pdfjsLib.GlobalWorkerOptions.workerSrc = '/pdf.worker.min.mjs';
Architecture
markitdownllm mirrors the Python MarkItDown architecture exactly:
MarkItDown
├── converter registry (priority-sorted)
├── getStreamInfo() — detects MIME type and extension from File
├── convert(file) — dispatches to first accepting converter
│ └── normalizeWhitespace() applied to all output
│
├── PdfConverter — pdfjs-dist, spatial table detection
├── DocxConverter — mammoth + style_map + HtmlConverter
├── XlsxConverter — SheetJS AOA → GFM tables directly
├── XlsConverter — SheetJS (xlrd engine)
├── PptxConverter — JSZip + XML parse, shape sort, LLM image captions
├── EpubConverter — JSZip + OPF spine traversal + HtmlConverter
├── HtmlConverter — Turndown + @joplin/turndown-plugin-gfm
├── CsvConverter — native CSV parser → GFM table
└── PlainTextConverter — passthrough
Converter Interface
interface DocumentConverter {
accepts(file: File, streamInfo: StreamInfo): boolean;
convert(file: File, streamInfo: StreamInfo): Promise<DocumentConverterResult>;
}
interface DocumentConverterResult {
markdown: string;
title?: string;
}
interface StreamInfo {
mimetype?: string;
extension?: string;
filename?: string;
charset?: string;
}
Tests
npm test # run all tests
npm run test:watch # watch mode
npm run test:coverage # coverage report
✓ tests/post-process.test.ts (8 tests)
✓ tests/plain-text.test.ts (7 tests)
✓ tests/csv.test.ts (10 tests)
✓ tests/html.test.ts (16 tests)
✓ tests/xlsx.test.ts (7 tests)
✓ tests/markitdown.test.ts (10 tests)
Tests 58 passed (58)
Related
- MarkItDown Python — the original Python library this is ported from
- MarkItDownLLM Tool — ready-to-use drag-and-drop UI built on this package
License
MIT © 2026
