@lifeng688/document-converter-mcp
v1.0.1
Published
A local-first MCP server for converting documents between Markdown, PDF, DOCX, and HTML
Maintainers
Readme
@lifeng688/document-converter-mcp
A local-first MCP server for converting documents between Markdown, PDF, DOCX, and HTML, with AI-friendly Markdown output and safe file access.
English: This project focuses on AI-friendly document conversion, not pixel-perfect layout reconstruction.
中文: 本项目重点是 AI 友好的文档转换,而不是像素级版式还原。
Features
- 6 conversion tools: Markdown ↔ PDF, Markdown ↔ DOCX, Markdown ↔ HTML, PDF → Markdown
- Dual engine support: Pandoc (primary) + MarkItDown (enhanced PDF/DOCX extraction)
- Safe file access: Workspace-isolated path validation, sensitive file blocking, no-overwrite-by-default
- Secure command execution: Spawn-based, no shell injection, structured errors with timeouts
- AI-friendly output: Optional
cleanForLLMflag for cleaner Markdown - Batch processing: Convert entire directories with per-file error tolerance
- Structured results: Consistent JSON response format across all tools
Supported Formats
| Source | Targets |
|--------|---------|
| Markdown (.md) | PDF, DOCX, HTML |
| DOCX (.docx) | Markdown |
| PDF (.pdf) | Markdown |
Installation
Prerequisites
- Node.js >= 18.0.0
- Pandoc >= 3.0
- Python 3 >= 3.8 (optional, for MarkItDown)
PDF Engine (required for Markdown → PDF)
Pandoc can convert Markdown to PDF, but it requires an external PDF engine.
| Engine | Install | Notes |
|--------|---------|-------|
| pdflatex (default) | MiKTeX (Windows), TeX Live (Linux/macOS) | Most common, ~2 GB install |
| xelatex | TeX Live / MiKTeX | Recommended for Chinese/CJK documents |
| lualatex | TeX Live / MiKTeX | Lua-based LaTeX engine |
| wkhtmltopdf | apt install wkhtmltopdf / brew install wkhtmltopdf | Lightweight HTML-to-PDF engine |
| weasyprint | pip install weasyprint | Python-based HTML-to-PDF |
| typst | cargo install typst | Modern, fast typesetting system |
Chinese documents: Use
pdfEngine: "xelatex"with a TeX Live / MiKTeX installation that includes thectexpackage.
Install Pandoc
macOS:
brew install pandocUbuntu/Debian:
sudo apt-get update && sudo apt-get install -y pandocWindows: Download from https://pandoc.org/installing.html
Verify:
pandoc --versionInstall MarkItDown (optional, recommended for PDF → Markdown)
pip install markitdownVerify:
python3 -c "import markitdown; print('ok')"PDF support requires optional dependencies:
# For PDF extraction only: python -m pip install -U "markitdown[pdf]" # For all optional converters (PDF, EPUB, HTML, etc.): python -m pip install -U "markitdown[all]"
markitdownexists does not guarantee PDF support is installed.
Install the Server
npm install -g @lifeng688/document-converter-mcpOr use directly via npx:
npx @lifeng688/document-converter-mcpFor development, clone the repo and build locally:
git clone https://github.com/guanweiqiang/document-convert-mcp.git
cd document-convert-mcp
npm install
npm run buildMCP Client Configuration
Install the package globally first:
npm install -g @lifeng688/document-converter-mcpClaude Desktop
Edit your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS, or %APPDATA%\Claude\claude_desktop_config.json on Windows):
{
"mcpServers": {
"document-converter": {
"command": "npx",
"args": ["-y", "@lifeng688/document-converter-mcp"],
"env": {
"DOC_CONVERTER_WORKSPACE": "E:/MCPWorkDir"
}
}
}
}Or if installed globally, use the local path:
{
"mcpServers": {
"document-converter": {
"command": "document-converter-mcp",
"env": {
"DOC_CONVERTER_WORKSPACE": "E:/MCPWorkDir"
}
}
}
}Sample configs are in examples/:
mcp.json— MCP Inspector configclaude-desktop-config.json— Claude Desktop config
Tools
1. markdown_to_pdf
Convert Markdown to PDF using Pandoc.
Note: Pandoc requires an external PDF engine (LaTeX distribution or alternative) to generate PDFs. See Installation for setup instructions.
中文文档:
pdflatex不支持中文 Unicode 字符。中文 Markdown 转 PDF 请使用pdfEngine: "xelatex"(推荐)或lualatex。
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| inputPath | string | Yes | — | Input Markdown file path |
| outputPath | string | No | Auto-derived | Output PDF path |
| title | string | No | — | PDF document title |
| toc | boolean | No | false | Include table of contents |
| pageSize | enum | No | A4 | Page size: A4 or Letter |
| theme | enum | No | default | Theme: default, github, academic |
| pdfEngine | enum | No | Pandoc default | PDF engine: pdflatex, xelatex, lualatex, wkhtmltopdf, weasyprint, typst. Leave unset to let Pandoc choose. |
| cjkMainFont | string | No | — | CJK main font name for Chinese/Japanese/Korean documents (e.g. "Microsoft YaHei", "SimSun", "Noto Sans CJK SC"). Passed as -V CJKmainfont:<font>. |
| preserveSource | boolean | No | false | Save original Markdown as sidecar files (sample.pdf.source.md, sample.pdf.meta.json) for accurate PDF-to-Markdown recovery. |
| strictMarkdown | boolean | No | false | Reject input if Markdown has structural issues like unclosed code blocks. |
| overwrite | boolean | No | false | Allow overwriting existing files |
2. markdown_to_docx
Convert Markdown to DOCX using Pandoc.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| inputPath | string | Yes | — | Input Markdown file path |
| outputPath | string | No | Auto-derived | Output DOCX path |
| referenceDocx | string | No | — | Word template file |
| toc | boolean | No | false | Include table of contents. Note: when converting back to Markdown, the TOC may appear as plain content. |
| strictMarkdown | boolean | No | false | Reject input if Markdown has structural issues like unclosed code blocks. |
| overwrite | boolean | No | false | Allow overwriting existing files |
3. docx_to_markdown
Convert DOCX to Markdown using Pandoc or MarkItDown.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| inputPath | string | Yes | — | Input DOCX file path |
| outputPath | string | No | Auto-derived | Output Markdown path |
| extractImages | boolean | No | false | Extract embedded images |
| imageDir | string | No | — | Directory for extracted images |
| engine | enum | No | pandoc | Engine: pandoc or markitdown |
| markdownFlavor | enum | No | gfm | Markdown dialect: gfm (GitHub Flavored), commonmark, or pandoc |
| cleanForLLM | boolean | No | false | Clean Markdown for AI consumption |
| overwrite | boolean | No | false | Allow overwriting existing files |
4. pdf_to_markdown
Extract text from PDF to Markdown.
Warning: This is content extraction, not layout reconstruction. Scanned PDFs, complex tables, two-column papers, and mathematical formulas may not convert reliably. For scanned PDFs, OCR is required (not included).
PDF → Markdown is content extraction, not layout or semantic structure reconstruction.
PDF 转 Markdown 是内容提取,不是版式或语义结构还原。
普通 PDF 通常不保存 Markdown 语义。
标题、表格、代码块、列表、阅读顺序都可能无法可靠恢复。
MarkItDown PDF support: By default
pip install markitdowninstalls only core text/DOCX support. PDF extraction requires the optional[pdf]extra.Sidecar recovery: If the PDF was generated by this server with
preserveSource: true, the original Markdown is available as a sidecar file. The defaultpreferSourceSidecar: truewill automatically find and return it.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| inputPath | string | Yes | — | Input PDF file path |
| outputPath | string | No | Auto-derived | Output Markdown path |
| engine | enum | No | markitdown | Engine: markitdown or pandoc |
| cleanForLLM | boolean | No | false | Clean Markdown for AI consumption |
| preferSourceSidecar | boolean | No | true | First check for a .source.md sidecar file. If found, return original Markdown instead of extracting PDF text. |
| overwrite | boolean | No | false | Allow overwriting existing files |
5. markdown_to_html
Convert Markdown to HTML using Pandoc.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| inputPath | string | Yes | — | Input Markdown file path |
| outputPath | string | No | Auto-derived | Output HTML path |
| cssPath | string | No | — | External CSS file path |
| standalone | boolean | No | true | Generate complete HTML document |
| strictMarkdown | boolean | No | false | Reject input if Markdown has structural issues like unclosed code blocks. |
| overwrite | boolean | No | false | Allow overwriting existing files |
6. batch_convert
Convert all matching files in a directory.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| inputDir | string | Yes | — | Source directory |
| outputDir | string | Yes | — | Destination directory |
| from | enum | Yes | — | Source format: md, markdown, docx, pdf |
| to | enum | Yes | — | Target format: md, markdown, docx, pdf, html |
| recursive | boolean | No | false | Traverse subdirectories |
| overwrite | boolean | No | false | Overwrite existing files |
| cleanForLLM | boolean | No | false | Clean Markdown for AI consumption |
Usage Examples
Convert a single Markdown file to PDF
Tool: markdown_to_pdf
Args: {
"inputPath": "reports/quarterly.md",
"toc": true,
"pageSize": "Letter"
}Convert Markdown to PDF with xelatex (for Chinese documents)
pdflatex 不适合中文文档,会报
LaTeX Error: Unicode character not set up for use with LaTeX。 中文 Markdown 转 PDF 推荐使用xelatex,并指定 CJK 字体。
- Windows:
cjkMainFont: "Microsoft YaHei"或"SimSun"或"SimHei"- macOS:
cjkMainFont: "Songti SC"或"Heiti SC"- Linux:
cjkMainFont: "Noto Sans CJK SC"(需安装 fonts-noto-cjk 包)For Chinese Markdown documents, use
pdfEngine='xelatex'and setcjkMainFont. On Windows, recommended fonts are Microsoft YaHei, SimSun, or SimHei.
Tool: markdown_to_pdf
Args: {
"inputPath": "sample.md",
"outputPath": "sample.pdf",
"toc": true,
"pageSize": "A4",
"pdfEngine": "xelatex",
"cjkMainFont": "Microsoft YaHei",
"preserveSource": true,
"overwrite": true
}Extract text from a PDF for AI analysis
Tool: pdf_to_markdown
Args: {
"inputPath": "papers/research.pdf",
"engine": "markitdown",
"cleanForLLM": true
}Batch convert all Markdown files to PDF
Tool: batch_convert
Args: {
"inputDir": "docs/source",
"outputDir": "docs/published",
"from": "md",
"to": "pdf",
"recursive": true,
"overwrite": true
}Security
This server implements strict security measures:
- Workspace isolation: All file access is confined to a configured workspace directory
- Path traversal prevention:
..sequences and absolute path escapes are blocked - Sensitive file blocking:
.env,.ssh/,.npmrc, etc. are never accessible - File size limits: Input files over 50 MB are rejected by default
- No shell injection: All commands use
spawn()with argument arrays - No overwrite by default: Existing files are protected unless explicitly allowed
See docs/security.md for full details.
Recommended Workflows
Good
- Markdown → PDF — High-quality PDF output with Pandoc
- Markdown → DOCX — High-quality Word output
- Markdown → HTML — High-quality HTML output
- DOCX → Markdown — Good text extraction
- PDF → Markdown — For text extraction only. See Conversion Quality for limitations.
Not recommended
- Markdown → PDF → Markdown for structure recovery
- PDFs do not preserve Markdown semantics (headings, tables, code blocks, lists, reading order)
- The round-trip will lose structural information
Accurate recovery from PDF
If you need to recover the original Markdown from a PDF generated by this server, use preserveSource: true when calling markdown_to_pdf:
{
"inputPath": "sample.md",
"outputPath": "sample.pdf",
"preserveSource": true,
"overwrite": true
}This generates sidecar files (sample.pdf.source.md, sample.pdf.meta.json). Then when calling pdf_to_markdown, the default preferSourceSidecar: true will automatically find and return the original Markdown.
推荐工作流
推荐
- Markdown → PDF — 高质量的 PDF 输出
- Markdown → DOCX — 高质量的 Word 输出
- Markdown → HTML — 高质量的 HTML 输出
- DOCX → Markdown — 良好的文本提取
- PDF → Markdown — 仅用于内容提取。有关限制请参见 Conversion Quality 部分。
不推荐
- Markdown → PDF → Markdown 用于结构恢复
- PDF 不保存 Markdown 语义(标题、表格、代码块、列表、阅读顺序)
- 往返转换将丢失结构信息
从 PDF 精确恢复
如果需要从本工具生成的 PDF 恢复原始 Markdown,请在生成 PDF 时启用 preserveSource: true:
{
"inputPath": "sample.md",
"outputPath": "sample.pdf",
"preserveSource": true,
"overwrite": true
}这将生成 sidecar 文件(sample.pdf.source.md, sample.pdf.meta.json)。然后在调用 pdf_to_markdown 时,默认的 preferSourceSidecar: true 会自动查找并返回原始 Markdown。
Conversion Quality
This project focuses on AI-friendly document conversion, not pixel-perfect layout reconstruction.
See docs/conversion-quality.md for format-specific quality notes and engine comparisons.
Development
# Install dependencies
npm install
# Build TypeScript
npm run build
# Run in development mode (hot reload)
npm run dev
# Type check without emitting
npm run typecheckLicense
MIT
