crawlmd
v2.0.0
Recursive website-to-markdown crawler with stealth anti-bot, Readability article extraction OR full-page capture, contact info harvesting, and image download.
🕷️ Crawlmd
Recursive website-to-markdown crawler. Readability article extraction OR full-page capture, contact harvesting, stealth bypass, and image download.
Crawlmd is a recursive crawler that turns websites into clean, LLM-ready markdown. It runs as a CLI or as an MCP tool.
Quick Start
# Article mode (Readability — strips nav/ads, keeps core content)
npx crawlmd https://example.com --bundle
# Full page mode (preserves navigation, footer, CTAs, everything)
npx crawlmd https://example.com --bundle --full
# Extract contact info (emails, phones, addresses, social links)
npx crawlmd https://example.com --bundle --extract --full
# Custom output directory
npx crawlmd https://example.com --output ./my-data
MCP (for Claude Desktop / Cursor)
{
  "mcpServers": {
    "crawlmd": {
      "command": "npx",
      "args": ["-y", "crawlmd", "--mcp"]
    }
  }
}
Options
| Flag | Default | Description |
|------|---------|-------------|
| --bundle | off | Single markdown file with all pages |
| --full | off | Capture entire page (not just Readability article) |
| --extract | off | Extract emails, phones, addresses, social links |
| --output <dir> | output/<domain>/ | Custom output directory |
| --max-pages <n> | 50 | Page crawl limit |
| --mcp | off | Run as MCP server |
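The flags in the table can be combined in a single invocation. As a minimal sketch, the helper below assembles an `npx crawlmd` argument list from Python. The function name `build_crawlmd_command` and its parameters are hypothetical and are not part of crawlmd; only the flags listed in the table are used, and the sketch assumes their order on the command line does not matter.

```python
import shlex
from typing import List, Optional


def build_crawlmd_command(
    url: str,
    bundle: bool = False,
    full: bool = False,
    extract: bool = False,
    output: Optional[str] = None,
    max_pages: Optional[int] = None,
) -> List[str]:
    """Assemble an `npx crawlmd` argument list (hypothetical helper, not shipped with crawlmd)."""
    if max_pages is not None and max_pages < 1:
        raise ValueError("max_pages must be a positive integer")

    cmd = ["npx", "crawlmd", url]
    # Boolean flags are appended only when enabled; all of them default to off.
    if bundle:
        cmd.append("--bundle")
    if full:
        cmd.append("--full")
    if extract:
        cmd.append("--extract")
    # Value flags are followed by their argument as a separate list item.
    if output is not None:
        cmd += ["--output", output]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    return cmd


if __name__ == "__main__":
    cmd = build_crawlmd_command(
        "https://example.com", bundle=True, extract=True, max_pages=10
    )
    # Print the shell-quoted command; nothing is executed here.
    print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would start the crawl. Keeping the arguments as a list avoids shell-quoting problems with URLs that contain `&` or `?`.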
Output
output/example.com/
├── markdown/ # Individual .md files (when --bundle off)
├── images/ # Downloaded images
├── example.com_bundle.md # All pages combined (when --bundle on)
└── example.com_extracted.json # Contact info (when --extract on)
Extracted JSON example
{
  "_domain": "example.com",
  "email": ["[email protected]", "[email protected]"],
  "phone": ["+32 477 12 34 56"],
  "address": ["22 Grand Place, 1400 Nivelles"],
  "social": ["facebook.com/example"]
}
Build
git clone https://github.com/antoineghigny/crawlmd
cd crawlmd
npm install
npx playwright install chromium
License
MIT — Antoine Ghigny
