scmcp v1.0.0
Privacy-first document conversion & research MCP server — convert any document to Markdown and back, entirely offline
ScMCP — Privacy-First Document Conversion & Research MCP Server
Convert any document to Markdown (so models can read it) and Markdown back to human formats (PDF, DOCX, HTML) — 100% offline, zero telemetry, zero API calls.
Installation
Option 1: Install from npm (recommended)
```shell
npm install -g scmcp
```
Option 2: Use directly with npx (no install needed)
```shell
npx -y scmcp
```
Option 3: Build from source
```shell
git clone https://github.com/microsoft/scmcp.git
cd scmcp
npm install
npm run build
```
Requirements
- Node.js 18 or later
- npm 8 or later
- GitHub Copilot CLI (for MCP integration) — install guide
Setting Up with Copilot CLI
Step 1: Launch Copilot CLI
```shell
copilot
```
Step 2: Add ScMCP as an MCP server
Inside the Copilot CLI interactive session, type:
```
/mcp
```
Then select "Add new MCP server" and fill in:
| Field | Value |
|-------|-------|
| Server Name | scmcp |
| Server Type | 2 (STDIO) |
| Command | npx -y scmcp |
| Environment Variables | (leave empty) |
| Tools | * |
If you installed globally or built from source, use one of these commands instead:
- Global: `scmcp`
- From source: `node C:\path\to\ScMCP\dist\index.js`
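Many MCP clients also accept a JSON configuration for stdio servers. The exact file location and schema depend on the client — the `mcpServers` key below follows a common MCP client convention and is an assumption, not documented Copilot CLI behavior:

```json
{
  "mcpServers": {
    "scmcp": {
      "command": "npx",
      "args": ["-y", "scmcp"]
    }
  }
}
```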
Step 3: Verify
Run `/mcp` again — you should see:
```
MCP Server: scmcp
Status: ✓ Connected
Tools: 17 tools available
```
That's it! You can now ask Copilot to convert documents, scrape web pages, summarize text, and more.
What It Does
Agents speak Markdown. Humans use PDFs, Word docs, spreadsheets, presentations. ScMCP bridges that gap with 17 tools:
- 📄 PDF → Markdown — chunked/paginated for large documents
- 📝 DOCX → Markdown — handles images, tables, headings, lists
- 📊 XLSX → Markdown — Excel sheets to Markdown tables
- 📑 PPTX → Markdown — slide text + speaker notes
- 🌐 HTML ↔ Markdown — bidirectional conversion
- 📋 CSV ↔ JSON — data format interchange
- 📖 RTF → Markdown — Rich Text Format support
- 📚 EPUB → Markdown — with chapter selection
- 🔤 OCR — image to text (tesseract.js, fully offline)
- 🖼️ Image conversion — PNG, JPG, WebP, GIF, TIFF (sharp)
- 📤 Markdown → PDF — styled PDF generation (puppeteer)
- 📤 Markdown → DOCX — Word document generation
- 🔍 Web scraping — URL to clean text/markdown/HTML
- 📝 Summarization — extractive summarization (TF-IDF, fully local)
- 📚 Citations — APA, MLA, Chicago style management
Privacy
| What | Network? | Details |
|------|----------|---------|
| All document conversions | ❌ None | Pure local file parsing |
| OCR | ❌ None | WASM engine, no network |
| Summarization | ❌ None | Local TF-IDF algorithm |
| Image conversion | ❌ None | Pre-built native bindings |
| Web scrape | ✅ User-initiated only | Only fetches URLs you provide |
No telemetry. No analytics. No phoning home. Ever.
Tools Reference
Document → Markdown
| Tool | Input | Output |
|------|-------|--------|
| pdf_to_md | { input_path, page_start?, page_end?, max_chars? } | Chunked Markdown + metadata |
| docx_to_md | { input_path, output_path? } | Markdown content or file |
| xlsx_to_md | { input_path, sheet_name?, output_path? } | Markdown tables |
| pptx_to_md | { input_path, output_path? } | Markdown with slide text + notes |
| html_to_md | { input_path \| html_string, output_path? } | Markdown content or file |
| csv_to_json | { input_path, output_path? } | JSON array |
| rtf_to_md | { input_path, output_path? } | Markdown content or file |
| epub_to_md | { input_path, chapter?, output_path? } | Markdown with chapter support |
| ocr_image | { input_path, language? } | Extracted text |
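To illustrate the csv_to_json transformation: the header row becomes the object keys and each subsequent row becomes one object. A minimal sketch (the `csvToJson` helper is illustrative, not scmcp's API, and it ignores quoted fields, which real CSV parsing must handle):

```typescript
// Naive CSV-to-JSON conversion: header row -> keys, data rows -> objects.
// Does not handle quoted fields containing commas or newlines.
function csvToJson(csv: string): Record<string, string>[] {
  const [header, ...rows] = csv.trim().split(/\r?\n/);
  const keys = header.split(",");
  return rows.map(row => {
    const cells = row.split(",");
    return Object.fromEntries(keys.map((k, i) => [k, cells[i] ?? ""]));
  });
}
```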
Markdown → Human Formats
| Tool | Input | Output |
|------|-------|--------|
| md_to_pdf | { input_path \| md_string, output_path?, css_path? } | Styled PDF |
| md_to_html | { input_path \| md_string, output_path? } | HTML content or file |
| md_to_docx | { input_path \| md_string, output_path? } | Word DOCX |
| json_to_csv | { input_path, output_path? } | CSV content or file |
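As a rough picture of what an md_to_html conversion does, a tiny sketch that handles only headings and paragraphs (real Markdown conversion covers far more; `mdToHtml` here is an illustration, not scmcp's implementation):

```typescript
// Minimal Markdown-to-HTML sketch: blocks are separated by blank lines;
// "#"-prefixed blocks become headings, everything else a paragraph.
function mdToHtml(md: string): string {
  return md
    .split(/\n{2,}/)
    .map(block => {
      const h = block.match(/^(#{1,6})\s+(.*)$/);
      if (h) return `<h${h[1].length}>${h[2]}</h${h[1].length}>`;
      return `<p>${block}</p>`;
    })
    .join("\n");
}
```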
Research Tools
| Tool | Input | Output |
|------|-------|--------|
| web_scrape | { url, format: "text" \| "markdown" \| "html" } | Clean web content |
| summarize_text | { text, sentence_count? } | Summary |
| manage_citations | { action, format?, ...data } | Formatted citations |
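The manage_citations tool formats stored references into a chosen style. As a rough illustration of APA-style output (the `Citation` fields and exact punctuation below are assumptions for the sketch, not scmcp's schema):

```typescript
// Illustrative citation record and APA-style formatter; field names and
// punctuation are assumptions, not scmcp's actual data model.
interface Citation {
  authors: string[];
  year: number;
  title: string;
  source: string;
}

function formatAPA(c: Citation): string {
  return `${c.authors.join(", & ")} (${c.year}). ${c.title}. ${c.source}.`;
}
```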
Image Tools
| Tool | Input | Output |
|------|-------|--------|
| convert_image | { input_path, output_format, width?, height?, quality? } | Converted image file |
Large Document Handling
All document-to-text tools support chunked responses:
```json
{
  "content": "extracted text...",
  "metadata": {
    "total_pages": 200,
    "page_start": 1,
    "page_end": 25,
    "has_more": true,
    "total_chars": 500000,
    "chunk_chars": 50000
  }
}
```
- Default chunk: 50,000 characters
- Request specific page ranges with `page_start`/`page_end`
- Use `output_path` to write the full document without chunking
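The chunking contract above can be sketched as a small function: given the full text, an offset, and a chunk size, return one chunk plus metadata telling the caller whether more remains. `chunkText` is a hypothetical illustration of the response shape, not scmcp's implementation:

```typescript
// Sketch of the chunked-response contract used by the document-to-text
// tools: slice out one chunk and report whether more text remains.
interface ChunkResult {
  content: string;
  metadata: { total_chars: number; chunk_chars: number; has_more: boolean };
}

function chunkText(full: string, offset = 0, maxChars = 50_000): ChunkResult {
  const content = full.slice(offset, offset + maxChars);
  return {
    content,
    metadata: {
      total_chars: full.length,
      chunk_chars: content.length,
      has_more: offset + content.length < full.length,
    },
  };
}
```

A client loops, advancing the offset by `chunk_chars`, until `has_more` is false.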
Usage Examples
Convert a PDF so the agent can read it:
"Read the PDF at C:\docs\report.pdf and summarize it"
Create a Word doc from Markdown:
"Convert my notes.md file to a DOCX"
Scrape a webpage:
"Scrape https://example.com and give me the content as markdown"
Extract text from an image:
"OCR this screenshot at C:\images\whiteboard.png"
Generate a styled PDF:
"Take this markdown and create a PDF with nice formatting"
Development
```shell
git clone https://github.com/microsoft/scmcp.git
cd scmcp
npm install
npm run build   # Compile TypeScript
npm run dev     # Watch mode with tsx
npm start       # Run the server
```
License
MIT
