doc2md-cli
v0.1.0
Document conversion CLI - Convert PDF/EPUB/DOCX/HTML to KB-ready Markdown with AI vision transcription
doc2md
Document conversion CLI tool - Convert PDF/EPUB/DOCX/HTML to KB-ready Markdown with AI vision transcription.
A document conversion tool designed for Coding Agents: agents such as Claude Code and OpenCode can drive the complete document conversion workflow.
✨ Features
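- 📄 Multi-format support: PDF, EPUB, DOCX, HTML, URL
- 🎯 Vision transcription: PDFs and legacy DOC files are transcribed with an AI vision model
- 🔧 Smart cleanup: automatically removes page numbers, tables of contents, noise, and similar clutter
- 📝 Three-file output: Full / Brief / QA Checklist
- ⚙️ Flexible configuration: three layers (CLI arguments, config file, environment variables)
- 🤖 Agent-friendly: suited to integration into Coding Agent workflows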
🚀 Quick Start
Installation
Option 1: Use with npx (Recommended - No installation)
npx doc2md-cli extract document.pdf -o ./output
Option 2: Global install
npm install -g doc2md-cli
doc2md extract document.pdf -o ./output
# or
doc2md-cli extract document.pdf -o ./output
Option 3: Development install (from source)
# Clone repository
git clone https://github.com/Neuma-Inc/Neumina-doc2md.git
cd Neumina-doc2md
# Install dependencies
pnpm install
# Build
pnpm build
# Optional: Link globally
pnpm link --global
Usage
# Extract a PDF document
doc2md extract document.pdf -o ./output
# Extract with specific mode
doc2md extract document.pdf --mode both --provider openai
# Extract from URL
doc2md extract https://example.com/page.html -o ./output
# Dry run (preview only)
doc2md extract document.pdf --dry-run
# Show configuration
doc2md config
# Create sample config
doc2md init -o ./doc2md.config.json
⚙️ Configuration
Three configuration methods are supported (highest priority first):
- CLI arguments: --provider openai --mode both
- Config file: doc2md.config.json or .doc2mdrc.json
- Environment variables: OPENAI_API_KEY, GEMINI_API_KEY
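The three-layer precedence above can be sketched as a simple merge in which higher-priority layers overwrite lower ones. This is a minimal illustration, not doc2md's actual implementation: the `Layer` type, `mergeLayers`, `resolveConfig`, and the flat key names (`openaiApiKey`, `geminiApiKey`) are all hypothetical, and the real config file nests its keys differently.

```typescript
// Hypothetical flat view of one configuration layer (not doc2md's real schema).
type Layer = Record<string, string | undefined>;

// Merge layers given from lowest to highest priority.
// Undefined values are skipped so an unset higher layer never erases a lower one.
function mergeLayers(...layers: Layer[]): Record<string, string> {
  const merged: Record<string, string> = {};
  for (const layer of layers) {
    for (const key of Object.keys(layer)) {
      const value = layer[key];
      if (value !== undefined) {
        merged[key] = value;
      }
    }
  }
  return merged;
}

// Assumed mapping from environment variables to config keys (illustrative only).
function resolveConfig(cli: Layer, file: Layer, env: Layer): Record<string, string> {
  const fromEnv: Layer = {
    openaiApiKey: env["OPENAI_API_KEY"],
    geminiApiKey: env["GEMINI_API_KEY"],
  };
  // Priority: environment variables < config file < CLI arguments.
  return mergeLayers(fromEnv, file, cli);
}

const resolved = resolveConfig(
  { provider: "gemini" },                  // CLI arguments
  { provider: "openai", mode: "both" },    // config file
  { OPENAI_API_KEY: "env-key" },           // environment
);
console.log(resolved);
```

Here the CLI value for `provider` overrides the one from the config file, `mode` comes from the file, and the API key falls back to the environment because no higher layer sets it.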
Example Configuration
{
  "outputDir": "./output",
  "mode": "both",
  "vision": {
    "provider": "openai",
    "openai": {
      "model": "gpt-4o",
      "apiKey": "sk-..."
    },
    "gemini": {
      "model": "gemini-2.5-flash"
    }
  },
  "extraction": {
    "pdf": {
      "dpi": 150,
      "imageFormat": "png"
    },
    "html": {
      "removeScripts": true,
      "removeStyles": true
    }
  },
  "defaultMetadata": {
    "documentType": "GENERAL",
    "sourceSubtype": "TEXTBOOK",
    "authorityRank": 3,
    "kbDomains": ["general"],
    "language": "zh-CN"
  }
}
📋 Output Format
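Each document produces a set of three files:
- <name>.full.md: the complete Markdown content
- <name>.brief.md: a summary made of the first 24 paragraphs
- <name>.qa-checklist.md: a quality checklist
Frontmatter Fields
The generated Markdown includes the complete SOP metadata: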
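As a rough illustration of the three-file naming and the "first 24 paragraphs" brief, the sketch below derives the output names and truncates a Markdown string. The helper names `outputNames` and `firstParagraphs` are hypothetical, and it assumes a paragraph is a block of text separated by blank lines; the tool's real paragraph detection may differ.

```typescript
// Hypothetical helper: the three output file names for a document base name.
function outputNames(name: string): { full: string; brief: string; qaChecklist: string } {
  return {
    full: `${name}.full.md`,
    brief: `${name}.brief.md`,
    qaChecklist: `${name}.qa-checklist.md`,
  };
}

// Assumption: paragraphs are separated by one or more blank lines.
// Keeps the first `limit` non-empty paragraphs and rejoins them.
function firstParagraphs(markdown: string, limit: number = 24): string {
  const paragraphs = markdown
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  return paragraphs.slice(0, limit).join("\n\n");
}

const names = outputNames("document");
console.log(names.full, names.brief, names.qaChecklist);
console.log(firstParagraphs("Intro.\n\nDetails.\n\nMore.", 2));
```

A document shorter than the limit passes through unchanged apart from whitespace normalization between paragraphs.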
---
title: Document Title
document_type: GENERAL | GUIDELINE
source_subtype: TEXTBOOK | ARTICLE | GUIDELINE
authority_rank: 3
kb_domains: [health, nutrition]
question_modules: [symptom_assessment]
menopause_stages: [perimenopause]
special_populations: []
citation_short: "Source (2024)"
source_org: "Organization Name"
source_country: "CN"
language: "zh-CN"
publication_date: "2024-01-01"
version_label: "v1.0"
evidence_system: "GRADE"
---
🔌 Vision Providers
Three vision transcription providers are supported:
OpenAI (Recommended)
export OPENAI_API_KEY="sk-..."
doc2md extract document.pdf --provider openai
Google Gemini
export GEMINI_API_KEY="..."
doc2md extract document.pdf --provider gemini
Agent CLI
Automatically detects and uses a locally installed Agent CLI (such as opencode or codex):
doc2md extract document.pdf --provider agent
🛠️ Development
# Install dependencies
pnpm install
# Development mode (watch)
pnpm dev
# Run tests
pnpm test
# Type check
pnpm typecheck
# Lint
pnpm lint
📚 Documentation
📄 License
MIT
🤝 Contributing
Contributions are welcome! Please read our Contributing Guide for details.
Made for AI Agents 🤖
