scmcp v1.0.0
Privacy-first document conversion & research MCP server — convert any document to Markdown and back, entirely offline
ScMCP — Privacy-First Document Conversion & Research MCP Server
Convert any document to Markdown (so models can read it) and Markdown back to human formats (PDF, DOCX, HTML) — 100% offline, zero telemetry, zero API calls.
Installation
Option 1: Install from npm (recommended)
```shell
npm install -g scmcp
```
Option 2: Use directly with npx (no install needed)
```shell
npx -y scmcp
```
Option 3: Build from source
```shell
git clone https://github.com/microsoft/scmcp.git
cd scmcp
npm install
npm run build
```
Requirements
- Node.js 18 or later
- npm 8 or later
- GitHub Copilot CLI (for MCP integration) — install guide
Setting Up with Copilot CLI
Step 1: Launch Copilot CLI
```shell
copilot
```
Step 2: Add ScMCP as an MCP server
Inside the Copilot CLI interactive session, type:
```
/mcp
```
Then select "Add new MCP server" and fill in:
| Field | Value |
|-------|-------|
| Server Name | scmcp |
| Server Type | 2 (STDIO) |
| Command | npx -y scmcp |
| Environment Variables | (leave empty) |
| Tools | * |
If you installed globally or built from source, use one of these commands instead:
- Global: `scmcp`
- From source: `node C:\path\to\ScMCP\dist\index.js`
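Many MCP clients also accept a JSON configuration for stdio servers. The exact file location and schema depend on the client — the `mcpServers` key below follows a common MCP client convention and is an assumption, not documented Copilot CLI behavior:

```json
{
  "mcpServers": {
    "scmcp": {
      "command": "npx",
      "args": ["-y", "scmcp"]
    }
  }
}
```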
Step 3: Verify
Run `/mcp` again — you should see:
```
MCP Server: scmcp
Status: ✓ Connected
Tools: 17 tools available
```
That's it! You can now ask Copilot to convert documents, scrape web pages, summarize text, and more.
What It Does
Agents speak Markdown. Humans use PDFs, Word docs, spreadsheets, presentations. ScMCP bridges that gap with 17 tools:
- 📄 PDF → Markdown — chunked/paginated for large documents
- 📝 DOCX → Markdown — handles images, tables, headings, lists
- 📊 XLSX → Markdown — Excel sheets to Markdown tables
- 📑 PPTX → Markdown — slide text + speaker notes
- 🌐 HTML ↔ Markdown — bidirectional conversion
- 📋 CSV ↔ JSON — data format interchange
- 📖 RTF → Markdown — Rich Text Format support
- 📚 EPUB → Markdown — with chapter selection
- 🔤 OCR — image to text (tesseract.js, fully offline)
- 🖼️ Image conversion — PNG, JPG, WebP, GIF, TIFF (sharp)
- 📤 Markdown → PDF — styled PDF generation (puppeteer)
- 📤 Markdown → DOCX — Word document generation
- 🔍 Web scraping — URL to clean text/markdown/HTML
- 📝 Summarization — extractive summarization (TF-IDF, fully local)
- 📚 Citations — APA, MLA, Chicago style management
Privacy
| What | Network? | Details |
|------|----------|---------|
| All document conversions | ❌ None | Pure local file parsing |
| OCR | ❌ None | WASM engine, no network |
| Summarization | ❌ None | Local TF-IDF algorithm |
| Image conversion | ❌ None | Pre-built native bindings |
| Web scrape | ✅ User-initiated only | Only fetches URLs you provide |
No telemetry. No analytics. No phoning home. Ever.
Tools Reference
Document → Markdown
| Tool | Input | Output |
|------|-------|--------|
| pdf_to_md | { input_path, page_start?, page_end?, max_chars? } | Chunked Markdown + metadata |
| docx_to_md | { input_path, output_path? } | Markdown content or file |
| xlsx_to_md | { input_path, sheet_name?, output_path? } | Markdown tables |
| pptx_to_md | { input_path, output_path? } | Markdown with slide text + notes |
| html_to_md | { input_path \| html_string, output_path? } | Markdown content or file |
| csv_to_json | { input_path, output_path? } | JSON array |
| rtf_to_md | { input_path, output_path? } | Markdown content or file |
| epub_to_md | { input_path, chapter?, output_path? } | Markdown with chapter support |
| ocr_image | { input_path, language? } | Extracted text |
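To illustrate the csv_to_json transformation: the header row becomes the object keys and each subsequent row becomes one object. A minimal sketch (the `csvToJson` helper is illustrative, not scmcp's API, and it ignores quoted fields, which real CSV parsing must handle):

```typescript
// Naive CSV-to-JSON conversion: header row -> keys, data rows -> objects.
// Does not handle quoted fields containing commas or newlines.
function csvToJson(csv: string): Record<string, string>[] {
  const [header, ...rows] = csv.trim().split(/\r?\n/);
  const keys = header.split(",");
  return rows.map(row => {
    const cells = row.split(",");
    return Object.fromEntries(keys.map((k, i) => [k, cells[i] ?? ""]));
  });
}
```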
Markdown → Human Formats
| Tool | Input | Output |
|------|-------|--------|
| md_to_pdf | { input_path \| md_string, output_path?, css_path? } | Styled PDF |
| md_to_html | { input_path \| md_string, output_path? } | HTML content or file |
| md_to_docx | { input_path \| md_string, output_path? } | Word DOCX |
| json_to_csv | { input_path, output_path? } | CSV content or file |
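As a rough picture of what an md_to_html conversion does, a tiny sketch that handles only headings and paragraphs (real Markdown conversion covers far more; `mdToHtml` here is an illustration, not scmcp's implementation):

```typescript
// Minimal Markdown-to-HTML sketch: blocks are separated by blank lines;
// "#"-prefixed blocks become headings, everything else a paragraph.
function mdToHtml(md: string): string {
  return md
    .split(/\n{2,}/)
    .map(block => {
      const h = block.match(/^(#{1,6})\s+(.*)$/);
      if (h) return `<h${h[1].length}>${h[2]}</h${h[1].length}>`;
      return `<p>${block}</p>`;
    })
    .join("\n");
}
```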
Research Tools
| Tool | Input | Output |
|------|-------|--------|
| web_scrape | { url, format: "text" \| "markdown" \| "html" } | Clean web content |
| summarize_text | { text, sentence_count? } | Summary |
| manage_citations | { action, format?, ...data } | Formatted citations |
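The manage_citations tool formats stored references into a chosen style. As a rough illustration of APA-style output (the `Citation` fields and exact punctuation below are assumptions for the sketch, not scmcp's schema):

```typescript
// Illustrative citation record and APA-style formatter; field names and
// punctuation are assumptions, not scmcp's actual data model.
interface Citation {
  authors: string[];
  year: number;
  title: string;
  source: string;
}

function formatAPA(c: Citation): string {
  return `${c.authors.join(", & ")} (${c.year}). ${c.title}. ${c.source}.`;
}
```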
Image Tools
| Tool | Input | Output |
|------|-------|--------|
| convert_image | { input_path, output_format, width?, height?, quality? } | Converted image file |
Large Document Handling
All document-to-text tools support chunked responses:
```json
{
  "content": "extracted text...",
  "metadata": {
    "total_pages": 200,
    "page_start": 1,
    "page_end": 25,
    "has_more": true,
    "total_chars": 500000,
    "chunk_chars": 50000
  }
}
```
- Default chunk: 50,000 characters
- Request specific page ranges with `page_start`/`page_end`
- Use `output_path` to write the full document without chunking
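The chunking contract above can be sketched as a small function: given the full text, an offset, and a chunk size, return one chunk plus metadata telling the caller whether more remains. `chunkText` is a hypothetical illustration of the response shape, not scmcp's implementation:

```typescript
// Sketch of the chunked-response contract used by the document-to-text
// tools: slice out one chunk and report whether more text remains.
interface ChunkResult {
  content: string;
  metadata: { total_chars: number; chunk_chars: number; has_more: boolean };
}

function chunkText(full: string, offset = 0, maxChars = 50_000): ChunkResult {
  const content = full.slice(offset, offset + maxChars);
  return {
    content,
    metadata: {
      total_chars: full.length,
      chunk_chars: content.length,
      has_more: offset + content.length < full.length,
    },
  };
}
```

A client loops, advancing the offset by `chunk_chars`, until `has_more` is false.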
Usage Examples
Convert a PDF so the agent can read it:
"Read the PDF at C:\docs\report.pdf and summarize it"
Create a Word doc from Markdown:
"Convert my notes.md file to a DOCX"
Scrape a webpage:
"Scrape https://example.com and give me the content as markdown"
Extract text from an image:
"OCR this screenshot at C:\images\whiteboard.png"
Generate a styled PDF:
"Take this markdown and create a PDF with nice formatting"
Development
```shell
git clone https://github.com/microsoft/scmcp.git
cd scmcp
npm install
npm run build   # Compile TypeScript
npm run dev     # Watch mode with tsx
npm start       # Run the server
```
License
MIT
