@leoflores/web2md
v1.2.0
Render web pages with a locally installed Chromium-family browser (Puppeteer) and convert the main content to clean Markdown.
web2md
Convert web pages to clean Markdown for AI/LLM consumption.
About
Built for feeding documentation, articles, and web content into AI assistants and LLMs. Renders JavaScript-heavy sites with a real browser, extracts the main content using Mozilla's Readability, and outputs clean GitHub-flavored Markdown.
Use cases:
- Feed documentation pages to Claude, ChatGPT, or other LLMs
- Build knowledge bases from web content for RAG pipelines
- Convert technical docs for AI-assisted coding workflows
- Archive articles in a format that's easy to search and process
Features
- AI-Ready Output - Clean Markdown optimized for LLM context windows
- Smart Extraction - Uses Readability to identify and extract main article content
- Clean Markdown - Converts to GFM with tables, strikethrough, and code blocks
- JS-Rendered Pages - Full Puppeteer support for SPAs and dynamic content
- Interactive Mode - Pause for captchas, logins, or human verification
- YAML Frontmatter - Includes title, source URL, and publication date
- Lazy Load Support - Auto-scroll triggers lazy-loaded images and content
- No Bundled Browser - Uses your existing Chrome/Chromium installation
Installation
Run directly without installing:
npx @leoflores/web2md <url> --print
Or install globally:
npm install -g @leoflores/web2md
Requirements
- Node.js >= 20
- Chrome, Chromium, or Edge installed locally
Usage
web2md <url> [options]
Quick Examples
# Print markdown to stdout
web2md https://example.com/article --print
# Save to a specific file
web2md https://example.com/article --out ./article.md
# Save to a directory (filename is derived from the page title)
web2md https://example.com/article --out ./out/
# JS-heavy pages: wait for content to load
web2md https://example.com/app --wait-until domcontentloaded --wait-ms 2000 --out ./app.md
# Interactive mode for login/captcha
web2md https://example.com --interactive --user-data-dir ./tmp/chrome-profile --out ./out/
Options Reference
Output Options
| Option | Description | Default |
|--------|-------------|---------|
| --out <path> | Output file or directory | stdout |
| --print | Print to stdout | false |
| --frontmatter | Include YAML frontmatter | true |
| --no-frontmatter | Omit YAML frontmatter | - |
| --title <title> | Override title in output | - |
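
To make the frontmatter options concrete, here is a minimal TypeScript sketch of how a frontmatter block with a title, source URL, and publication date could be assembled. The `PageMeta` shape, the key names (`title`, `source`, `published`), and the helper functions are assumptions for illustration only; the actual schema emitted by web2md may differ.

```typescript
// Hypothetical metadata shape; the key names are assumptions, not web2md's confirmed schema.
interface PageMeta {
  title: string;
  source: string;
  published?: string; // e.g. an ISO date string, when the page exposes one
}

// Emit a YAML double-quoted scalar, escaping backslashes, quotes, and newlines.
function yamlString(value: string): string {
  const escaped = value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
  return '"' + escaped + '"';
}

// Build the frontmatter block; `titleOverride` plays the role of --title.
function buildFrontmatter(meta: PageMeta, titleOverride?: string): string {
  const lines = [
    "---",
    `title: ${yamlString(titleOverride ?? meta.title)}`,
    `source: ${yamlString(meta.source)}`,
  ];
  if (meta.published) {
    lines.push(`published: ${yamlString(meta.published)}`);
  }
  lines.push("---", ""); // trailing "" yields a final newline after the closing fence
  return lines.join("\n");
}

// Prepend frontmatter unless it is disabled (the role of --no-frontmatter).
function renderDocument(
  markdown: string,
  meta: PageMeta,
  opts: { frontmatter: boolean; title?: string },
): string {
  if (!opts.frontmatter) return markdown;
  return buildFrontmatter(meta, opts.title) + "\n" + markdown;
}
```

Quoting every value keeps titles that contain colons or quotes from producing invalid YAML, which matters when the output is parsed by downstream tooling.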
Browser Options
| Option | Description | Default |
|--------|-------------|---------|
| --chrome-path <path> | Chrome executable path | auto-detect |
| --headful | Show browser window | false (headless) |
| --interactive | Pause for human verification | false |
| --user-data-dir <path> | Chrome profile directory | - |
| --user-agent <ua> | Override user agent | - |
| --no-sandbox | Disable Chrome sandbox (CI/containers) | - |
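
The auto-detect default for --chrome-path can be pictured as a search over well-known install locations. The sketch below is an assumption about how such a lookup might work, not web2md's actual detection logic; `CANDIDATE_PATHS` and `resolveBrowserPath` are hypothetical names, and the path list is illustrative rather than exhaustive.

```typescript
// Common default install locations per platform (illustrative, non-exhaustive).
const CANDIDATE_PATHS: Record<string, string[]> = {
  darwin: [
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    "/Applications/Chromium.app/Contents/MacOS/Chromium",
    "/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge",
  ],
  linux: [
    "/usr/bin/google-chrome",
    "/usr/bin/chromium",
    "/usr/bin/chromium-browser",
    "/usr/bin/microsoft-edge",
  ],
  win32: [
    "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
    "C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe",
  ],
};

// Hypothetical resolver: an explicit path (the role of --chrome-path) wins;
// otherwise the first existing candidate for the platform is used.
// `exists` is injected so the function stays free of filesystem imports.
function resolveBrowserPath(
  platform: string,
  exists: (path: string) => boolean,
  explicitPath?: string,
): string {
  if (explicitPath !== undefined) {
    if (!exists(explicitPath)) {
      throw new Error(`Browser not found at ${explicitPath}`);
    }
    return explicitPath;
  }
  const candidates = CANDIDATE_PATHS[platform] ?? [];
  for (const candidate of candidates) {
    if (exists(candidate)) return candidate;
  }
  throw new Error("No Chrome, Chromium, or Edge installation found; pass --chrome-path");
}
```

In a Node.js program, `process.platform` and `fs.existsSync` would be natural arguments for `platform` and `exists`.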
Navigation Options
| Option | Description | Default |
|--------|-------------|---------|
| --wait-until <event> | load, domcontentloaded, networkidle0, networkidle2 | networkidle2 |
| --timeout-ms <ms> | Navigation timeout | 45000 |
| --wait-for <css> | Wait for CSS selector | - |
| --wait-ms <ms> | Extra wait after navigation | 0 |
| --no-auto-scroll | Disable lazy-load triggering | - |
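
The navigation options combine into a wait sequence. The following TypeScript sketch merges overrides with the defaults from the table and lists the resulting steps. The `NavigationOptions` type, both function names, and the ordering of the steps (selector wait, then extra delay, then auto-scroll) are assumptions for illustration; the tool's real ordering is not documented here.

```typescript
// The four values accepted by --wait-until (these match Puppeteer's lifecycle events).
type WaitUntil = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

// Hypothetical in-memory form of the navigation flags.
interface NavigationOptions {
  waitUntil: WaitUntil;
  timeoutMs: number;
  waitFor?: string; // CSS selector
  waitMs: number;
  autoScroll: boolean;
}

// Defaults taken from the options table above.
const NAV_DEFAULTS: NavigationOptions = {
  waitUntil: "networkidle2",
  timeoutMs: 45000,
  waitMs: 0,
  autoScroll: true,
};

// Merge user-supplied overrides onto the defaults and validate the numbers.
function resolveNavigation(overrides: Partial<NavigationOptions> = {}): NavigationOptions {
  const merged: NavigationOptions = { ...NAV_DEFAULTS, ...overrides };
  if (merged.timeoutMs <= 0) throw new Error("timeoutMs must be positive");
  if (merged.waitMs < 0) throw new Error("waitMs must not be negative");
  return merged;
}

// Describe the wait sequence as ordered, human-readable steps (assumed order).
function waitPlan(nav: NavigationOptions): string[] {
  const steps = [`goto(waitUntil=${nav.waitUntil}, timeout=${nav.timeoutMs}ms)`];
  if (nav.waitFor) steps.push(`waitForSelector(${nav.waitFor})`);
  if (nav.waitMs > 0) steps.push(`sleep(${nav.waitMs}ms)`);
  if (nav.autoScroll) steps.push("autoScroll()");
  return steps;
}
```

For example, `waitPlan(resolveNavigation({ waitFor: ".article-content" }))` describes a run that navigates with the default `networkidle2` event, waits for the selector, and then auto-scrolls.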
Common Workflows
Save Documentation
# Single page
web2md https://docs.example.com/guide --out ./docs/guide.md
# With shorter timeout for fast sites
web2md https://docs.example.com/api --timeout-ms 15000 --out ./docs/api.md
Handle JavaScript-Heavy Sites
# Wait for specific element
web2md https://spa.example.com --wait-for ".article-content" --out ./article.md
# Wait for network to settle + extra time
web2md https://spa.example.com --wait-until networkidle0 --wait-ms 3000 --out ./article.md
Sites with Login/Captcha
# First run: complete verification manually
web2md https://protected.example.com --interactive --user-data-dir ./tmp/profile --out ./article.md
# Browser opens, you complete login/captcha, press Enter to continue
# Subsequent runs: reuse the saved session
web2md https://protected.example.com --user-data-dir ./tmp/profile --out ./another.md
CI/Docker Environments
# Disable sandbox for containers
web2md https://example.com --no-sandbox --out ./output.md
Claude Code Skill
An optional skill wrapper is included for Claude Code users:
mkdir -p ~/.claude/skills
cp -R ./claude/web-to-markdown ~/.claude/skills/web-to-markdown
Then invoke in Claude:
use the skill web-to-markdown to convert https://example.com to markdown
Development
# Install dependencies
npm install
# Run CLI in development
npm run dev -- https://example.com --print
# Type check
npm run typecheck
# Build distributable
npm run build
How It Works
1. Render - Puppeteer launches Chrome and navigates to the URL
2. Wait - Configurable waiting strategy for JS content to load
3. Extract - Readability identifies and extracts the main article
4. Convert - Turndown transforms HTML to GitHub-flavored Markdown
5. Output - Write to file or stdout with optional YAML frontmatter
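
The stages above can be sketched as a small pipeline. In this TypeScript sketch each stage is injected as a function so the flow can be read and exercised without a browser; in the real tool those roles are filled by Puppeteer, Readability, and Turndown. The `Stages` interface and `convertPage` are hypothetical names, and the sketch makes no claim about how web2md structures its own code.

```typescript
// Result of the extraction stage; Readability returns null when it finds no article.
interface Article {
  title: string;
  content: string; // HTML of the main content
}

// Hypothetical stage interface: each function stands in for one library's role.
interface Stages {
  render: (url: string) => Promise<string>; // Render + Wait: URL to rendered HTML
  extract: (html: string) => Article | null; // Extract: main content from the page
  toMarkdown: (html: string) => string; // Convert: HTML to Markdown
}

// Run the stages in order and return what the Output step would write.
async function convertPage(
  url: string,
  stages: Stages,
): Promise<{ title: string; markdown: string }> {
  const html = await stages.render(url);
  const article = stages.extract(html);
  if (article === null) {
    throw new Error(`No main content found at ${url}`);
  }
  const markdown = stages.toMarkdown(article.content);
  return { title: article.title, markdown };
}
```

Keeping the stages behind plain function types makes it straightforward to swap in stubs for tests, or a different extractor for pages where article detection fails.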
Acknowledgments
Built with:
- puppeteer-core - Browser automation
- @mozilla/readability - Content extraction
- turndown - HTML to Markdown conversion
License
MIT License. See LICENSE for details.
