@lifeng688/document-converter-mcp

v1.0.1

Published

3 days ago

A local-first MCP server for converting documents between Markdown, PDF, DOCX, and HTML


lifeng688

mcp server document-converter markdown pdf docx html pandoc markitdown

@lifeng688/document-converter-mcp

A local-first MCP server for converting documents between Markdown, PDF, DOCX, and HTML, with AI-friendly Markdown output and safe file access.

This project focuses on AI-friendly document conversion, not pixel-perfect layout reconstruction.

Features

6 tools: Markdown → PDF, Markdown ↔ DOCX, Markdown → HTML, PDF → Markdown, plus batch conversion
Dual engine support: Pandoc (primary) + MarkItDown (enhanced PDF/DOCX extraction)
Safe file access: Workspace-isolated path validation, sensitive file blocking, no-overwrite-by-default
Secure command execution: Spawn-based, no shell injection, structured errors with timeouts
AI-friendly output: Optional cleanForLLM flag for cleaner Markdown
Batch processing: Convert entire directories with per-file error tolerance
Structured results: Consistent JSON response format across all tools
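As an illustration of the "secure command execution" item above, here is a minimal sketch of spawn-based execution with an argument array, a timeout, and a structured result. The `runCommand` function and the `CommandResult` shape are hypothetical names chosen for this example; the server's actual implementation and result format may differ.

```typescript
import { spawn } from "node:child_process";

// Hypothetical result shape, not the server's documented response format.
interface CommandResult {
  ok: boolean;
  exitCode: number | null;
  stdout: string;
  stderr: string;
  timedOut: boolean;
}

// Runs a command without a shell. Arguments are passed as an array, so
// characters such as ";" or "$(...)" in file names are never interpreted.
export function runCommand(
  command: string,
  args: string[],
  timeoutMs = 60_000,
): Promise<CommandResult> {
  return new Promise((resolve) => {
    const child = spawn(command, args, { shell: false });
    let stdout = "";
    let stderr = "";
    let timedOut = false;
    let settled = false;

    // Kill the child if it runs longer than the allowed time.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill();
    }, timeoutMs);

    // Resolve exactly once, whether the process closed or failed to start.
    const finish = (exitCode: number | null, extraError = "") => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      resolve({
        ok: exitCode === 0 && !timedOut,
        exitCode,
        stdout,
        stderr: stderr + extraError,
        timedOut,
      });
    };

    child.stdout.on("data", (chunk: Buffer) => (stdout += chunk.toString()));
    child.stderr.on("data", (chunk: Buffer) => (stderr += chunk.toString()));
    child.on("error", (err) => finish(null, err.message));
    child.on("close", (code) => finish(code));
  });
}
```

The promise never rejects in this sketch; callers inspect `ok`, `timedOut`, and `stderr` instead, which keeps error handling uniform across tools.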

Supported Formats

| Source | Targets |
|--------|---------|
| Markdown (.md) | PDF, DOCX, HTML |
| DOCX (.docx) | Markdown |
| PDF (.pdf) | Markdown |

Installation

Prerequisites

Node.js >= 18.0.0
Pandoc >= 3.0
Python 3 >= 3.8 (optional, for MarkItDown)

PDF Engine (required for Markdown → PDF)

Pandoc can convert Markdown to PDF, but it requires an external PDF engine.

| Engine | Install | Notes |
|--------|---------|-------|
| pdflatex (default) | MiKTeX (Windows), TeX Live (Linux/macOS) | Most common, ~2 GB install |
| xelatex | TeX Live / MiKTeX | Recommended for Chinese/CJK documents |
| lualatex | TeX Live / MiKTeX | Lua-based LaTeX engine |
| wkhtmltopdf | apt install wkhtmltopdf / brew install wkhtmltopdf | Lightweight HTML-to-PDF engine |
| weasyprint | pip install weasyprint | Python-based HTML-to-PDF |
| typst | cargo install --locked typst-cli | Modern, fast typesetting system |

Chinese documents: Use pdfEngine: "xelatex" with a TeX Live / MiKTeX installation that includes the ctex package.

Install Pandoc

macOS:

brew install pandoc

Ubuntu/Debian:

sudo apt-get update && sudo apt-get install -y pandoc

Windows: Download from https://pandoc.org/installing.html

Verify:

pandoc --version

Install MarkItDown (optional, recommended for PDF → Markdown)

pip install markitdown

Verify:

python3 -c "import markitdown; print('ok')"

PDF support requires optional dependencies:
# For PDF extraction only:
python -m pip install -U "markitdown[pdf]"

# For all optional converters (PDF, EPUB, HTML, etc.):
python -m pip install -U "markitdown[all]"
Note: having markitdown installed does not guarantee that PDF support is installed.
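The manual verification commands above could also be run from Node.js, for example as a startup check. The sketch below is an assumption about how such a check might look, not code from this package; `isCommandAvailable` and `checkPrerequisites` are hypothetical names. Note that a successful `import markitdown` only confirms the base package, not the `[pdf]` extra.

```typescript
import { spawnSync } from "node:child_process";

// Returns true when `command` can be started and exits with status 0.
// Arguments are passed as an array, so no shell is involved.
export function isCommandAvailable(
  command: string,
  args: string[] = ["--version"],
): boolean {
  const result = spawnSync(command, args, { stdio: "ignore", timeout: 10_000 });
  // `error` is set when the process could not be spawned (e.g. ENOENT) or timed out.
  return result.error === undefined && result.status === 0;
}

// Hypothetical startup check mirroring the manual verification commands.
// On Windows the interpreter may be named "python" rather than "python3".
export function checkPrerequisites(): { pandoc: boolean; markitdown: boolean } {
  return {
    pandoc: isCommandAvailable("pandoc"),
    markitdown: isCommandAvailable("python3", ["-c", "import markitdown"]),
  };
}
```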

Install the Server

npm install -g @lifeng688/document-converter-mcp

Or use directly via npx:

npx @lifeng688/document-converter-mcp

For development, clone the repo and build locally:

git clone https://github.com/guanweiqiang/document-convert-mcp.git
cd document-convert-mcp
npm install
npm run build

MCP Client Configuration

Install the package globally if you plan to call the binary directly (the npx configuration below does not require this):

npm install -g @lifeng688/document-converter-mcp

Claude Desktop

Edit your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS, or %APPDATA%\Claude\claude_desktop_config.json on Windows):

{
  "mcpServers": {
    "document-converter": {
      "command": "npx",
      "args": ["-y", "@lifeng688/document-converter-mcp"],
      "env": {
        "DOC_CONVERTER_WORKSPACE": "E:/MCPWorkDir"
      }
    }
  }
}

Or, if the package is installed globally, call the binary directly:

{
  "mcpServers": {
    "document-converter": {
      "command": "document-converter-mcp",
      "env": {
        "DOC_CONVERTER_WORKSPACE": "E:/MCPWorkDir"
      }
    }
  }
}

Sample configs are in examples/:

mcp.json — MCP Inspector config
claude-desktop-config.json — Claude Desktop config

Tools

1. `markdown_to_pdf`

Convert Markdown to PDF using Pandoc.

Note: Pandoc requires an external PDF engine (LaTeX distribution or alternative) to generate PDFs. See Installation for setup instructions.
Chinese documents: pdflatex does not support Chinese Unicode characters. To convert Chinese Markdown to PDF, use pdfEngine: "xelatex" (recommended) or "lualatex".

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| inputPath | string | Yes | - | Input Markdown file path |
| outputPath | string | No | Auto-derived | Output PDF path |
| title | string | No | - | PDF document title |
| toc | boolean | No | false | Include table of contents |
| pageSize | enum | No | A4 | Page size: A4 or Letter |
| theme | enum | No | default | Theme: default, github, academic |
| pdfEngine | enum | No | Pandoc default | PDF engine: pdflatex, xelatex, lualatex, wkhtmltopdf, weasyprint, typst. Leave unset to let Pandoc choose. |
| cjkMainFont | string | No | - | CJK main font name for Chinese/Japanese/Korean documents (e.g. "Microsoft YaHei", "SimSun", "Noto Sans CJK SC"). Passed as -V CJKmainfont:<font>. |
| preserveSource | boolean | No | false | Save original Markdown as sidecar files (sample.pdf.source.md, sample.pdf.meta.json) for accurate PDF-to-Markdown recovery. |
| strictMarkdown | boolean | No | false | Reject input if Markdown has structural issues like unclosed code blocks. |
| overwrite | boolean | No | false | Allow overwriting existing files |
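To make the parameters above more concrete, here is a rough sketch of how some of them might translate into a Pandoc argument list. Only the `-V CJKmainfont:<font>` mapping is stated in this README; the other mappings (`--toc`, `--pdf-engine`, `--metadata title=...`, `-V papersize:...`) are standard Pandoc options assumed for illustration, and `buildPandocArgs` and `MarkdownToPdfOptions` are hypothetical names.

```typescript
// Hypothetical subset of the markdown_to_pdf parameters.
interface MarkdownToPdfOptions {
  inputPath: string;
  outputPath: string;
  title?: string;
  toc?: boolean;
  pageSize?: "A4" | "Letter";
  pdfEngine?: "pdflatex" | "xelatex" | "lualatex" | "wkhtmltopdf" | "weasyprint" | "typst";
  cjkMainFont?: string;
}

// Builds an argument array suitable for spawn("pandoc", args).
export function buildPandocArgs(opts: MarkdownToPdfOptions): string[] {
  const args: string[] = [opts.inputPath, "-o", opts.outputPath];
  if (opts.title) args.push("--metadata", `title=${opts.title}`);
  if (opts.toc) args.push("--toc");
  // Assumed mapping: papersize is a template variable used by LaTeX-based engines.
  args.push("-V", `papersize:${(opts.pageSize ?? "A4").toLowerCase()}`);
  // Leaving pdfEngine unset lets Pandoc pick its default engine.
  if (opts.pdfEngine) args.push(`--pdf-engine=${opts.pdfEngine}`);
  // Documented in the table above: passed as -V CJKmainfont:<font>.
  if (opts.cjkMainFont) args.push("-V", `CJKmainfont:${opts.cjkMainFont}`);
  return args;
}
```

Because each value is a separate array element, a font name containing spaces such as "Microsoft YaHei" needs no quoting or escaping.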

2. `markdown_to_docx`

Convert Markdown to DOCX using Pandoc.

3. `docx_to_markdown`

Convert DOCX to Markdown using Pandoc or MarkItDown.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| inputPath | string | Yes | - | Input DOCX file path |
| outputPath | string | No | Auto-derived | Output Markdown path |
| extractImages | boolean | No | false | Extract embedded images |
| imageDir | string | No | - | Directory for extracted images |
| engine | enum | No | pandoc | Engine: pandoc or markitdown |
| markdownFlavor | enum | No | gfm | Markdown dialect: gfm (GitHub Flavored), commonmark, or pandoc |
| cleanForLLM | boolean | No | false | Clean Markdown for AI consumption |
| overwrite | boolean | No | false | Allow overwriting existing files |

4. `pdf_to_markdown`

Extract text from PDF to Markdown.

Warning: This is content extraction, not layout or semantic structure reconstruction. Ordinary PDFs usually do not store Markdown semantics, so headings, tables, code blocks, lists, and reading order may not be recovered reliably. Scanned PDFs, complex tables, two-column papers, and mathematical formulas are especially prone to errors. Scanned PDFs require OCR, which is not included.
MarkItDown PDF support: By default pip install markitdown installs only core text/DOCX support. PDF extraction requires the optional [pdf] extra.
Sidecar recovery: If the PDF was generated by this server with preserveSource: true, the original Markdown is available as a sidecar file. The default preferSourceSidecar: true will automatically find and return it.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| inputPath | string | Yes | - | Input PDF file path |
| outputPath | string | No | Auto-derived | Output Markdown path |
| engine | enum | No | markitdown | Engine: markitdown or pandoc |
| cleanForLLM | boolean | No | false | Clean Markdown for AI consumption |
| preferSourceSidecar | boolean | No | true | First check for a .source.md sidecar file. If found, return original Markdown instead of extracting PDF text. |
| overwrite | boolean | No | false | Allow overwriting existing files |

5. `markdown_to_html`

Convert Markdown to HTML using Pandoc.

6. `batch_convert`

Convert all matching files in a directory.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| inputDir | string | Yes | - | Source directory |
| outputDir | string | Yes | - | Destination directory |
| from | enum | Yes | - | Source format: md, markdown, docx, pdf |
| to | enum | Yes | - | Target format: md, markdown, docx, pdf, html |
| recursive | boolean | No | false | Traverse subdirectories |
| overwrite | boolean | No | false | Overwrite existing files |
| cleanForLLM | boolean | No | false | Clean Markdown for AI consumption |
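The per-file error tolerance mentioned for batch processing can be sketched as a loop that records each failure and moves on. This is an illustrative assumption about the pattern, not the package's code; `convertAll`, `ConvertFn`, and `BatchItemResult` are hypothetical names, and the single-file converter is injected so the sketch stays independent of any engine.

```typescript
// Hypothetical converter signature: takes an input file, returns the output path.
type ConvertFn = (inputFile: string) => Promise<string>;

// Hypothetical per-file result; the server's actual JSON shape may differ.
interface BatchItemResult {
  input: string;
  ok: boolean;
  output?: string;
  error?: string;
}

// Converts files one by one; a failure in one file does not stop the batch.
export async function convertAll(
  files: string[],
  convert: ConvertFn,
): Promise<BatchItemResult[]> {
  const results: BatchItemResult[] = [];
  for (const input of files) {
    try {
      const output = await convert(input);
      results.push({ input, ok: true, output });
    } catch (err) {
      // Record the failure and continue with the remaining files.
      const message = err instanceof Error ? err.message : String(err);
      results.push({ input, ok: false, error: message });
    }
  }
  return results;
}
```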

Usage Examples

Convert a single Markdown file to PDF

Tool: markdown_to_pdf
Args: {
  "inputPath": "reports/quarterly.md",
  "toc": true,
  "pageSize": "Letter"
}

Convert Markdown to PDF with xelatex (for Chinese documents)

pdflatex is not suitable for Chinese documents and fails with LaTeX Error: Unicode character not set up for use with LaTeX. For Chinese Markdown documents, use pdfEngine: "xelatex" and set cjkMainFont to a CJK font installed on the system:
Windows: cjkMainFont: "Microsoft YaHei", "SimSun", or "SimHei"
macOS: cjkMainFont: "Songti SC" or "Heiti SC"
Linux: cjkMainFont: "Noto Sans CJK SC" (requires the fonts-noto-cjk package)

Tool: markdown_to_pdf
Args: {
  "inputPath": "sample.md",
  "outputPath": "sample.pdf",
  "toc": true,
  "pageSize": "A4",
  "pdfEngine": "xelatex",
  "cjkMainFont": "Microsoft YaHei",
  "preserveSource": true,
  "overwrite": true
}

Extract text from a PDF for AI analysis

Tool: pdf_to_markdown
Args: {
  "inputPath": "papers/research.pdf",
  "engine": "markitdown",
  "cleanForLLM": true
}

Batch convert all Markdown files to PDF

Tool: batch_convert
Args: {
  "inputDir": "docs/source",
  "outputDir": "docs/published",
  "from": "md",
  "to": "pdf",
  "recursive": true,
  "overwrite": true
}

Security

This server implements strict security measures:

Workspace isolation: All file access is confined to a configured workspace directory
Path traversal prevention: .. sequences and absolute path escapes are blocked
Sensitive file blocking: .env, .ssh/, .npmrc, etc. are never accessible
File size limits: Input files over 50 MB are rejected by default
No shell injection: All commands use spawn() with argument arrays
No overwrite by default: Existing files are protected unless explicitly allowed
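As a rough illustration of the workspace isolation, path traversal prevention, and sensitive file blocking items above, the following sketch resolves a user-supplied path against a workspace root and rejects anything that escapes it. It is a simplified assumption, not the server's implementation: `resolveInWorkspace` and `BLOCKED_NAMES` are hypothetical names, the blocked list only contains the examples named above, and symlink resolution is not handled.

```typescript
import * as path from "node:path";

// Example entries taken from the list above; a real blocklist would be longer.
const BLOCKED_NAMES = new Set([".env", ".ssh", ".npmrc"]);

// Resolves `userPath` inside `workspace`, or throws if it escapes the
// workspace or touches a blocked file or directory name.
export function resolveInWorkspace(workspace: string, userPath: string): string {
  const root = path.resolve(workspace);
  const resolved = path.resolve(root, userPath);
  const relative = path.relative(root, resolved);

  // Outside the workspace: the relative path climbs upward, or is absolute
  // (which happens on Windows when the target is on another drive).
  const escapes =
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    path.isAbsolute(relative);
  if (escapes) {
    throw new Error(`Path is outside the workspace: ${userPath}`);
  }

  // Reject any path segment that matches a blocked name exactly.
  if (relative.split(path.sep).some((segment) => BLOCKED_NAMES.has(segment))) {
    throw new Error(`Access to sensitive path is blocked: ${userPath}`);
  }
  return resolved;
}
```

Comparing resolved paths rather than scanning the input string for ".." means that inputs such as "a/../../b" and absolute paths are handled by the same check.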

See docs/security.md for full details.

Recommended Workflows

Good

Markdown → PDF — High-quality PDF output with Pandoc
Markdown → DOCX — High-quality Word output
Markdown → HTML — High-quality HTML output
DOCX → Markdown — Good text extraction
PDF → Markdown — For text extraction only. See Conversion Quality for limitations.

Not recommended

Markdown → PDF → Markdown for structure recovery
- PDFs do not preserve Markdown semantics (headings, tables, code blocks, lists, reading order)
- The round-trip will lose structural information

Accurate recovery from PDF

If you need to recover the original Markdown from a PDF generated by this server, use preserveSource: true when calling markdown_to_pdf:

{
  "inputPath": "sample.md",
  "outputPath": "sample.pdf",
  "preserveSource": true,
  "overwrite": true
}

This generates sidecar files (sample.pdf.source.md, sample.pdf.meta.json). Then when calling pdf_to_markdown, the default preferSourceSidecar: true will automatically find and return the original Markdown.
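The sidecar lookup described above can be sketched as follows. The `.source.md` suffix appended to the PDF path comes from this README; the function names `sidecarPathFor` and `readSourceSidecar` are hypothetical, and the real server may also consult the `.meta.json` file, which this sketch ignores.

```typescript
import * as fs from "node:fs";

// sample.pdf -> sample.pdf.source.md, following the naming shown above.
export function sidecarPathFor(pdfPath: string): string {
  return `${pdfPath}.source.md`;
}

// Returns the original Markdown if a sidecar exists next to the PDF,
// or null so the caller can fall back to PDF text extraction.
export function readSourceSidecar(pdfPath: string): string | null {
  const sidecar = sidecarPathFor(pdfPath);
  if (!fs.existsSync(sidecar)) return null;
  return fs.readFileSync(sidecar, "utf8");
}
```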

Conversion Quality

This project focuses on AI-friendly document conversion, not pixel-perfect layout reconstruction.

See docs/conversion-quality.md for format-specific quality notes and engine comparisons.

Development

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run in development mode (hot reload)
npm run dev

# Type check without emitting
npm run typecheck

License

MIT
