
markitdownllm

v0.1.5


Browser-native document-to-Markdown converter for LLM pipelines. Converts PDF, DOCX, XLSX, PPTX, HTML, CSV, EPUB and more — entirely client-side, no server required.


MarkItDownLLM

Browser-native document → Markdown converter for LLM pipelines



markitdownllm converts PDF, DOCX, XLSX, PPTX, HTML, CSV, EPUB and more to clean, structured Markdown — entirely in the browser, with zero server calls and zero data leaving the device.

Live site: markitdownllm.com

Ported from Microsoft's MarkItDown Python library. Same architecture, same converter pipeline, same LLM-optimised output — running natively in any modern browser.


Why Markdown for LLMs?

Markdown is the most token-efficient way to feed structured documents into LLMs:

| Signal | Impact |
| --- | --- |
| Token reduction | 20–45% fewer tokens vs unstructured text extraction |
| Tables | ~35% fewer tokens than describing the same data in prose |
| Headings | LLMs can skip to the relevant sections without reading everything |
| Training alignment | GPT-4o, Claude, and Gemini are all trained heavily on Markdown; they produce more accurate outputs on structured input |


Quick Start

npm install markitdownllm

Note: PDF conversion requires pdfjs-dist as a peer dependency. Copy node_modules/pdfjs-dist/build/pdf.worker.min.mjs to your public/ directory and set workerSrc (see PDF setup).

import { MarkItDown } from 'markitdownllm';

const converter = new MarkItDown();

// From a browser File object (drag-and-drop, <input type="file">)
const result = await converter.convert(file);
console.log(result.markdown);
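
In an LLM pipeline you often convert several files and join the results into one prompt context. The following is a minimal sketch of that step, not part of the package: `buildContext`, `ConverterLike`, and `NamedFile` are hypothetical names, and the converter is typed structurally so the sketch runs without the library or a browser. A `MarkItDown` instance and real `File` objects should fit the same shape, assuming `convert(file)` resolves to an object with a `markdown` string as shown above.

```typescript
// Structural stand-ins (assumptions) so the sketch runs without the library.
interface NamedFile {
  name: string;
}
interface ConvertResult {
  markdown: string;
}
interface ConverterLike<F extends NamedFile> {
  convert(file: F): Promise<ConvertResult>;
}

// Convert files one at a time and join them under per-file headings.
// A failed conversion is recorded by name instead of aborting the batch.
async function buildContext<F extends NamedFile>(
  converter: ConverterLike<F>,
  files: F[],
): Promise<{ context: string; failed: string[] }> {
  const sections: string[] = [];
  const failed: string[] = [];
  for (const file of files) {
    try {
      const { markdown } = await converter.convert(file);
      sections.push(`## ${file.name}\n\n${markdown.trim()}`);
    } catch {
      failed.push(file.name);
    }
  }
  return { context: sections.join("\n\n---\n\n"), failed };
}
```

Converting sequentially keeps memory use predictable for large PDFs; `Promise.all` would be a reasonable alternative for small files.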

Supported Formats

| Format | Extensions | Notes |
| --- | --- | --- |
| PDF | .pdf | Text extraction with table detection |
| Word | .docx | Headings, lists, tables, images (via mammoth) |
| Excel | .xlsx, .xls | Multi-sheet, aligned GFM tables |
| PowerPoint | .pptx | Slides, speaker notes, embedded tables, image captions |
| HTML | .html, .htm | Main content extraction, full GFM output |
| CSV / TSV | .csv, .tsv | GFM table with separator row |
| EPUB | .epub | Spine traversal, chapter-by-chapter extraction |
| Plain text | .txt, .md, .rst | Passthrough |
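
To illustrate what "GFM table with separator row" means for tabular inputs such as CSV or spreadsheet rows, here is a small sketch. It is not the package's converter code: `rowsToGfmTable` and `escapeCell` are hypothetical helpers, and the choice to treat the first row as the header and to pad short rows with empty cells is an assumption.

```typescript
// Escape characters that would break a GFM table cell.
function escapeCell(value: string): string {
  return value.replace(/\|/g, "\\|").replace(/\r?\n/g, " ").trim();
}

// Render rows as a GFM table; the first row is treated as the header
// (an assumption of this sketch). Short rows are padded with empty cells.
function rowsToGfmTable(rows: string[][]): string {
  if (rows.length === 0) return "";
  const width = Math.max(...rows.map((r) => r.length));
  const render = (row: string[]): string => {
    const cells = Array.from({ length: width }, (_, i) => escapeCell(row[i] ?? ""));
    return `| ${cells.join(" | ")} |`;
  };
  const separator = `| ${Array.from({ length: width }, () => "---").join(" | ")} |`;
  const [header, ...body] = rows;
  return [render(header), separator, ...body.map(render)].join("\n");
}
```

The separator row is what makes a GFM renderer, or an LLM trained on Markdown, read the block as a table rather than as lines of pipe-delimited text.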


API

new MarkItDown(options?)

import { MarkItDown } from 'markitdownllm';

const converter = new MarkItDown({
  // Optional: enable AI image captioning for PPTX embedded images
  llmConfig: {
    provider: 'anthropic',       // 'anthropic' | 'openai'
    apiKey: 'sk-ant-...',
    model: 'claude-haiku-4-5-20251001', // optional, defaults to fastest model
  },
  // Optional: custom mammoth style map for DOCX
  docxStyleMap: `
    p[style-name='Custom Heading'] => h2:fresh
  `,
});

const result = await converter.convert(file);
// result.markdown — clean, LLM-ready Markdown string

converter.convert(file: File): Promise<DocumentConverterResult>

Converts a browser File object to Markdown. The correct converter is selected automatically based on file extension and MIME type. Falls back to the next converter on error.

const result = await converter.convert(file);
console.log(result.markdown); // GFM Markdown string

converter.register(converter, priority)

Register a custom converter. Converters with a lower priority value are tried first.

import { MarkItDown, PRIORITY_SPECIFIC } from 'markitdownllm';

const md = new MarkItDown();
md.register(
  {
    accepts: (file, info) => info.extension === '.custom',
    convert: async (file, info) => ({
      markdown: `Custom output for ${file.name}`,
    }),
  },
  PRIORITY_SPECIFIC
);
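
The interaction between priority ordering and the fall-back-on-error behaviour described for `convert` can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `RegistrySketch`, `ConverterSketch`, and `FileLike` are hypothetical names, the numeric priorities `0` and `10` are placeholders (the real value of `PRIORITY_SPECIFIC` is not given here), and the tie-breaking rule (registration order) is an assumption.

```typescript
// Hypothetical stand-ins; the real interfaces also receive a StreamInfo argument.
interface FileLike {
  name: string;
}
interface ResultSketch {
  markdown: string;
}
interface ConverterSketch {
  accepts(file: FileLike): boolean;
  convert(file: FileLike): Promise<ResultSketch>;
}

class RegistrySketch {
  private entries: { converter: ConverterSketch; priority: number }[] = [];

  // Lower priority values sort first; ties keep registration order
  // because Array.prototype.sort is stable (assumed tie-break rule).
  register(converter: ConverterSketch, priority: number): void {
    this.entries.push({ converter, priority });
    this.entries.sort((a, b) => a.priority - b.priority);
  }

  // Try each accepting converter in priority order; if one throws,
  // remember the error and move on to the next.
  async convert(file: FileLike): Promise<ResultSketch> {
    const errors: string[] = [];
    for (const { converter } of this.entries) {
      if (!converter.accepts(file)) continue;
      try {
        return await converter.convert(file);
      } catch (err) {
        errors.push(err instanceof Error ? err.message : String(err));
      }
    }
    const detail = errors.join("; ") || "no converter accepted the file";
    throw new Error(`Conversion failed for ${file.name}: ${detail}`);
  }
}
```

With this shape, a format-specific converter registered at a low value gets the first attempt, and a generic converter registered at a higher value acts as the safety net.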

Individual Converters

All converters are exported and can be used directly:

import {
  PdfConverter,
  DocxConverter,
  XlsxConverter,
  XlsConverter,
  PptxConverter,
  HtmlConverter,
  CsvConverter,
  EpubConverter,
  PlainTextConverter,
} from 'markitdownllm';

const pdfConverter = new PdfConverter();
const result = await pdfConverter.convert(file, { extension: '.pdf' });

captionImage(bytes, mimeType, llmConfig, prompt?)

Caption an image using an LLM vision API. Used internally by the PPTX converter when llmConfig is set.

import { captionImage } from 'markitdownllm';

const caption = await captionImage(imageBytes, 'image/png', {
  provider: 'anthropic',
  apiKey: 'sk-ant-...',
});

normalizeWhitespace(text)

Post-processor applied to all converter output: strips trailing spaces per line, collapses 3+ blank lines to 2, trims document edges.

import { normalizeWhitespace } from 'markitdownllm';
const clean = normalizeWhitespace(rawText);
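
The three rules above can be sketched in a few lines. This is a re-implementation for illustration, not the package's source: `normalizeWhitespaceSketch` is a hypothetical name, and it assumes "collapses 3+ blank lines to 2" means that a run of three or more empty lines becomes exactly two empty lines.

```typescript
// Illustrative re-implementation of the described rules (not the library's code).
function normalizeWhitespaceSketch(text: string): string {
  return text
    .replace(/\r\n?/g, "\n") // normalise line endings first
    .split("\n")
    .map((line) => line.replace(/[ \t]+$/, "")) // strip trailing spaces and tabs
    .join("\n")
    .replace(/\n{4,}/g, "\n\n\n") // 3+ blank lines (4+ newlines) become 2 blank lines
    .trim(); // trim document edges
}
```

For example, `normalizeWhitespaceSketch("a  \n\n\n\n\nb\t\n")` returns `"a\n\n\nb"`: trailing whitespace is gone, four blank lines collapse to two, and the final newline is trimmed.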

LLM Image Captioning

When llmConfig is provided, the PPTX converter sends embedded images to an LLM vision API and uses the response as alt text. Supports both Anthropic and OpenAI.

const converter = new MarkItDown({
  llmConfig: {
    provider: 'anthropic',
    apiKey: process.env.ANTHROPIC_API_KEY!,
    model: 'claude-sonnet-4-6', // better quality captions
  },
});

const result = await converter.convert(pptxFile);
// Images become: ![AI-generated description of the chart](slide1_image.png)

| Provider | Default model | Notes |
| --- | --- | --- |
| anthropic | claude-haiku-4-5-20251001 | Fast and cheap |
| openai | gpt-4o-mini | Widely available |

Without llmConfig, images fall back to embedded alt text or filename: ![image_name.png](image_name.png).
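
The fallback order described here (LLM caption, then embedded alt text, then filename) can be expressed as a small pure function. This is a sketch under assumptions: `ImageRef` and `imageToMarkdown` are hypothetical names, and stripping square brackets from the alt text is a choice made here to keep the Markdown image syntax valid, not documented behaviour.

```typescript
// Hypothetical description of one embedded image.
interface ImageRef {
  filename: string;
  embeddedAlt?: string; // alt text stored in the slide, if any
  caption?: string; // LLM-generated caption, if captioning ran
}

// Pick the first non-empty candidate: caption, embedded alt text, filename.
function imageToMarkdown(img: ImageRef): string {
  const pick =
    [img.caption, img.embeddedAlt, img.filename]
      .map((s) => (s ?? "").replace(/\s+/g, " ").trim())
      .find((s) => s.length > 0) ?? img.filename;
  // Square brackets would terminate the alt text early, so drop them.
  const alt = pick.replace(/[\[\]]/g, "");
  return `![${alt}](${img.filename})`;
}
```

Calling `imageToMarkdown({ filename: "image_name.png" })` yields `![image_name.png](image_name.png)`, matching the no-`llmConfig` case above.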


PDF Setup

PDF conversion uses pdfjs-dist as a peer dependency. The worker file must be served as a static asset.

npm install pdfjs-dist

Vite / React:

// vite.config.ts
import { viteStaticCopy } from 'vite-plugin-static-copy';

export default {
  plugins: [
    viteStaticCopy({
      targets: [{ src: 'node_modules/pdfjs-dist/build/pdf.worker.min.mjs', dest: '' }],
    }),
  ],
};

Next.js:

# Copy worker to public directory
cp node_modules/pdfjs-dist/build/pdf.worker.min.mjs public/

Then point pdfjs-dist at the copied worker before converting:

import { PdfConverter } from 'markitdownllm';
import * as pdfjsLib from 'pdfjs-dist';
pdfjsLib.GlobalWorkerOptions.workerSrc = '/pdf.worker.min.mjs';

Architecture

markitdownllm mirrors the Python MarkItDown architecture exactly:

MarkItDown
├── converter registry (priority-sorted)
├── getStreamInfo()  — detects MIME type and extension from File
├── convert(file)    — dispatches to first accepting converter
│   └── normalizeWhitespace() applied to all output
│
├── PdfConverter     — pdfjs-dist, spatial table detection
├── DocxConverter    — mammoth + style_map + HtmlConverter
├── XlsxConverter    — SheetJS AOA → GFM tables directly
├── XlsConverter     — SheetJS (xlrd engine)
├── PptxConverter    — JSZip + XML parse, shape sort, LLM image captions
├── EpubConverter    — JSZip + OPF spine traversal + HtmlConverter
├── HtmlConverter    — Turndown + @joplin/turndown-plugin-gfm
├── CsvConverter     — native CSV parser → GFM table
└── PlainTextConverter — passthrough
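
As a rough idea of what the `getStreamInfo()` step involves, the sketch below derives an extension and MIME type from a file's `name` and `type`. It is an assumption-laden illustration, not the package's code: `streamInfoSketch` and `MIME_BY_EXTENSION` are hypothetical, and the rule of preferring the browser-reported type and falling back to an extension lookup is a guess at reasonable behaviour. Browsers report an empty `type` string when they cannot determine one, which is why the fallback matters.

```typescript
interface StreamInfoSketch {
  mimetype?: string;
  extension?: string;
  filename?: string;
}

// Hypothetical fallback table for files the browser reports without a MIME type.
const MIME_BY_EXTENSION: Record<string, string> = {
  ".pdf": "application/pdf",
  ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
  ".html": "text/html",
  ".htm": "text/html",
  ".csv": "text/csv",
  ".epub": "application/epub+zip",
  ".txt": "text/plain",
  ".md": "text/markdown",
};

// Accepts anything with the `name` and `type` fields of a browser File.
function streamInfoSketch(file: { name: string; type: string }): StreamInfoSketch {
  const dot = file.name.lastIndexOf(".");
  // A leading dot (".gitignore") is a hidden file, not an extension.
  const extension = dot > 0 ? file.name.slice(dot).toLowerCase() : undefined;
  const mimetype = file.type || (extension ? MIME_BY_EXTENSION[extension] : undefined);

  const info: StreamInfoSketch = { filename: file.name };
  if (extension) info.extension = extension;
  if (mimetype) info.mimetype = mimetype;
  return info;
}
```

Lower-casing the extension means `Report.PDF` and `report.pdf` reach the same converter.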

Converter Interface

interface DocumentConverter {
  accepts(file: File, streamInfo: StreamInfo): boolean;
  convert(file: File, streamInfo: StreamInfo): Promise<DocumentConverterResult>;
}

interface DocumentConverterResult {
  markdown: string;
  title?: string;
}

interface StreamInfo {
  mimetype?: string;
  extension?: string;
  filename?: string;
  charset?: string;
}
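
To show how a converter could satisfy these interfaces, here is a hypothetical JSON converter that renders a top-level object as a bullet list. The interfaces are re-declared locally so the sketch is self-contained, and `FileLike` stands in for the browser `File` (only `name` and `text()` are used). `jsonConverter` and its output format are inventions for illustration; the package does not list JSON among its supported formats.

```typescript
interface StreamInfo {
  mimetype?: string;
  extension?: string;
  filename?: string;
  charset?: string;
}
interface DocumentConverterResult {
  markdown: string;
  title?: string;
}
// Minimal stand-in for the browser File members used in this sketch.
interface FileLike {
  name: string;
  text(): Promise<string>;
}
interface DocumentConverterSketch {
  accepts(file: FileLike, streamInfo: StreamInfo): boolean;
  convert(file: FileLike, streamInfo: StreamInfo): Promise<DocumentConverterResult>;
}

// Hypothetical converter: top-level JSON keys become a Markdown bullet list.
const jsonConverter: DocumentConverterSketch = {
  accepts: (_file, info) =>
    info.extension === ".json" || info.mimetype === "application/json",
  convert: async (file, info) => {
    const data: unknown = JSON.parse(await file.text());
    const isPlainObject = data !== null && typeof data === "object" && !Array.isArray(data);
    // Non-object JSON (arrays, numbers, strings) is shown as a single "value" entry.
    const entries: [string, unknown][] = isPlainObject
      ? Object.entries(data as Record<string, unknown>)
      : [["value", data]];
    const title = info.filename ?? file.name;
    const lines = entries.map(([key, value]) => `- **${key}**: ${JSON.stringify(value)}`);
    return { title, markdown: `# ${title}\n\n${lines.join("\n")}` };
  },
};
```

An object of this shape could be passed to `register` in the same way as the `.custom` example earlier, assuming a real `File` is accepted wherever `FileLike` is used.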

Tests

npm test              # run all tests
npm run test:watch    # watch mode
npm run test:coverage # coverage report

 ✓ tests/post-process.test.ts  (8 tests)
 ✓ tests/plain-text.test.ts    (7 tests)
 ✓ tests/csv.test.ts           (10 tests)
 ✓ tests/html.test.ts          (16 tests)
 ✓ tests/xlsx.test.ts          (7 tests)
 ✓ tests/markitdown.test.ts    (10 tests)

 Tests  58 passed (58)

License

MIT © 2026