@jldb/web-to-md
v0.1.0
A CLI tool that crawls websites (documentation sites, for example) and converts them to Markdown.
🕷️ Web-to-MD: Your Friendly Neighborhood Web Crawler and Markdown Converter 🕸️
Welcome to Web-to-MD, the CLI tool that turns websites into your personal Markdown library! 🚀
🌟 Why Web-to-MD?
Ever wished you could magically transform entire websites into neatly organized Markdown files? Well, wish no more! Web-to-MD is here to save the day (and your sanity)!
🎭 Features That'll Make You Go "Wow!"
- 🔍 Crawls websites like a pro detective
- 🧙‍♂️ Magically transforms HTML into beautiful Markdown
- 🏃‍♂️ Resumes interrupted crawls (because life happens!)
- 📚 Creates separate Markdown files or one big book of knowledge
- 🎨 Shows fancy progress bars (because who doesn't love those?)
- 🚦 Respects rate limits (we're polite crawlers here!)
- 🌳 Preserves directory structure (if you're into that sort of thing)
- 🔒 Handles authentication gracefully (no trespassing allowed!)
- 👥 Multi-worker support (because teamwork makes the dream work!)
- 🔄 Smart content change detection (no need to crawl what hasn't changed!)
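The feature list does not say how content change detection works. One common approach is to hash each page's content and compare it with the hash stored from the previous crawl. The sketch below illustrates that idea only. `HashStore`, `contentHash`, and `hasChanged` are hypothetical names, not part of Web-to-MD, and the choice of SHA-256 from Node's built-in `crypto` module is an assumption.

```typescript
import { createHash } from "node:crypto";

// Hypothetical store: URL -> hash of the content seen on the previous crawl.
type HashStore = Map<string, string>;

// Hash the page content so two crawls can be compared cheaply.
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Returns true when the page is new or its content differs from the stored
// hash. Records the new hash as a side effect.
function hasChanged(store: HashStore, url: string, content: string): boolean {
  const next = contentHash(content);
  const changed = store.get(url) !== next;
  store.set(url, next);
  return changed;
}
```

A crawler built this way could skip the Markdown conversion for any page where `hasChanged` returns `false`.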
🛠️ Installation
- Clone this repo (it won't bite, promise!)
- Run `npm install` (sit back and watch the magic happen)
- Run `npm run build` to compile the TypeScript code
🚀 Usage
Fire up Web-to-MD with this incantation:

    npm start -- -u <url> -o <output_directory> [options]

🎛️ Options (Mix and Match to Your Heart's Content)
- `-u, --url <url>`: The URL of your web treasure trove (required)
- `-o, --output <output>`: Where to stash your Markdown gold (required)
- `-c, --combine`: Merge all pages into one massive scroll of knowledge
- `-e, --exclude <paths>`: Comma-separated list of paths to skip (shh, we won't tell)
- `-r, --rate <rate>`: Max pages per second (default: 5, for the speed demons)
- `-d, --depth <depth>`: How deep should we dig? (default: 3, watch out for dragons)
- `-m, --max-file-size <size>`: Max file size in MB for combined output (default: 2)
- `-n, --name <name>`: Name your combined file (get creative!)
- `-p, --preserve-structure`: Keep the directory structure (for the neat freaks)
- `-t, --timeout <timeout>`: Timeout in seconds for page navigation (default: 3.5)
- `-i, --initial-timeout <initialTimeout>`: Initial timeout for the first page (default: 60)
- `-re, --retries <retries>`: Number of retries for initial page load (default: 3)
- `-w, --workers <workers>`: Number of concurrent workers (default: 1, for the multitaskers)
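To make the `--depth` and `--exclude` options concrete, here is a minimal sketch of a breadth-first crawl plan that stops at a maximum depth and skips excluded paths. It is an illustration under assumptions, not Web-to-MD's actual code: `LinkGraph`, `CrawlLimits`, and `planCrawl` are hypothetical names, the in-memory graph stands in for real page fetches, the start page is counted as depth 0, and exclusion is a simple prefix match.

```typescript
// Hypothetical in-memory link graph standing in for real page fetches:
// each path maps to the paths it links to.
type LinkGraph = Record<string, string[]>;

interface CrawlLimits {
  maxDepth: number;       // mirrors -d, --depth
  excludePaths: string[]; // mirrors -e, --exclude
}

// Breadth-first traversal that stops at maxDepth and skips excluded prefixes.
// Assumption: the start page is depth 0.
function planCrawl(graph: LinkGraph, start: string, limits: CrawlLimits): string[] {
  // Simple prefix match; a real crawler may match paths more strictly.
  const isExcluded = (path: string): boolean =>
    limits.excludePaths.some((prefix) => path.startsWith(prefix));

  const visited = new Set<string>([start]);
  const order: string[] = [];
  let frontier: string[] = [start];

  for (let depth = 0; depth <= limits.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const path of frontier) {
      order.push(path);
      for (const link of graph[path] ?? []) {
        if (!visited.has(link) && !isExcluded(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return order;
}
```

The `visited` set keeps pages that link to each other from being queued twice.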
🌟 Example (Because We All Need a Little Guidance)

    npm start -- -u https://docs.example.com -o ./my_docs -c -d 5 -r 3 -n "ExampleDocs" -w 3

This will:
- Crawl https://docs.example.com
- Save Markdown files to ./my_docs
- Combine all pages into one file
- Crawl up to 5 levels deep
- Respect a rate limit of 3 pages per second
- Name the combined file "ExampleDocs"
- Use 3 concurrent workers for faster crawling
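A pages-per-second cap amounts to a minimum gap between request starts. The sketch below shows one way to enforce that gap. It is an assumption about how such a limit could work, not a description of Web-to-MD's internals, and `minIntervalMs` and `RateLimiter` are hypothetical names. The current time is passed in so the logic can be checked without real timers.

```typescript
// Minimum gap between request starts for a given pages-per-second cap.
function minIntervalMs(maxPagesPerSecond: number): number {
  if (maxPagesPerSecond <= 0) throw new RangeError("rate must be positive");
  return 1000 / maxPagesPerSecond;
}

// Tracks the earliest time the next request may start.
class RateLimiter {
  private nextAllowed = 0;
  private readonly intervalMs: number;

  constructor(intervalMs: number) {
    this.intervalMs = intervalMs;
  }

  // Returns how long (in ms) the caller should wait before starting a
  // request at time `now`, and reserves that slot for the caller.
  reserve(now: number): number {
    const startAt = Math.max(now, this.nextAllowed);
    this.nextAllowed = startAt + this.intervalMs;
    return startAt - now;
  }
}
```

With `-r 3`, `minIntervalMs(3)` is roughly 333 ms. If several workers shared one limiter of this kind, the cap would apply to the crawl as a whole rather than to each worker.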
🔧 Config Magic: Resuming and Customizing Your Crawls
Web-to-MD comes with a nifty config feature that lets you resume interrupted crawls and customize your crawling experience. Here's how it works:
📁 Config File
After a crawl (complete or interrupted), Web-to-MD saves a config.json file in your output directory. This file contains all the settings and state information from your last crawl.
🔄 Resuming a Crawl
To resume an interrupted crawl, simply run Web-to-MD with the same output directory. The tool will automatically detect the config.json file and pick up where it left off.
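The README does not spell out how the saved file is detected or validated. As a rough sketch, the decision between resuming and starting fresh could look like the code below. `SavedConfig`, `parseSavedConfig`, and `crawlMode` are hypothetical names. Only the `settings` key appears in the documented example, so anything else in the file is left untyped, and treating a corrupt file as absent is an assumption.

```typescript
// Shape based on the config.json example in this README. Only `settings`
// is documented, so any saved crawl state is left untyped here.
interface SavedConfig {
  settings: Record<string, unknown>;
  [key: string]: unknown;
}

// `raw` is the text of <outputDir>/config.json, or undefined when the file
// does not exist. Returns the saved config, or undefined to start fresh.
function parseSavedConfig(raw: string | undefined): SavedConfig | undefined {
  if (raw === undefined) return undefined;
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null && "settings" in parsed) {
      const settings = (parsed as { settings: unknown }).settings;
      if (typeof settings === "object" && settings !== null && !Array.isArray(settings)) {
        return parsed as SavedConfig;
      }
    }
  } catch {
    // Corrupt or partially written file: treat it as absent (assumption).
  }
  return undefined;
}

// Decide between resuming and starting fresh.
function crawlMode(raw: string | undefined): "resume" | "fresh" {
  return parseSavedConfig(raw) ? "resume" : "fresh";
}
```

The caller would read the file (for example with Node's `fs.readFileSync`) and pass `undefined` when it is missing.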
🎛️ Customizing Your Crawl
You can manually edit the config.json file to customize your next crawl. Here are the available options and their default values:
| Option | Description | Default Value |
|--------|-------------|---------------|
| url | Starting URL for the crawl | (Required) |
| outputDir | Output directory for Markdown files | (Required) |
| excludePaths | Paths to exclude from crawling | [] |
| maxPagesPerSecond | Maximum pages to crawl per second | 5 |
| maxDepth | Maximum depth to crawl | 3 |
| maxFileSizeMB | Maximum file size in MB for combined output | 2 |
| combine | Combine all pages into a single file | false |
| name | Name for the combined output file | undefined |
| preserveStructure | Preserve directory structure | false |
| timeout | Timeout in seconds for page navigation | 3.5 |
| initialTimeout | Initial timeout in seconds for the first page load | 60 |
| retries | Number of retries for initial page load | 3 |
| numWorkers | Number of concurrent workers | 1 |
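The table above maps naturally onto a typed settings object. The following sketch uses the field names and default values from the table. The names `CrawlSettings`, `DEFAULTS`, and `withDefaults` are hypothetical, and the merge shown is only one plausible way to fill in missing values, not necessarily how Web-to-MD does it.

```typescript
// Field names and defaults are taken from the table above.
interface CrawlSettings {
  url: string;
  outputDir: string;
  excludePaths: string[];
  maxPagesPerSecond: number;
  maxDepth: number;
  maxFileSizeMB: number;
  combine: boolean;
  name?: string; // default: undefined
  preserveStructure: boolean;
  timeout: number;        // seconds
  initialTimeout: number; // seconds
  retries: number;
  numWorkers: number;
}

type RequiredKeys = "url" | "outputDir";

const DEFAULTS: Omit<CrawlSettings, RequiredKeys> = {
  excludePaths: [],
  maxPagesPerSecond: 5,
  maxDepth: 3,
  maxFileSizeMB: 2,
  combine: false,
  preserveStructure: false,
  timeout: 3.5,
  initialTimeout: 60,
  retries: 3,
  numWorkers: 1,
};

// Fill in every option the user left out. Later spreads win, so values in
// `partial` override the defaults.
function withDefaults(
  partial: Pick<CrawlSettings, RequiredKeys> & Partial<CrawlSettings>,
): CrawlSettings {
  return { ...DEFAULTS, ...partial };
}
```

For example, `withDefaults({ url: "https://docs.example.com", outputDir: "./my_docs", maxDepth: 4 })` keeps the default rate of 5 pages per second while raising the depth to 4.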
For example, this config.json excludes two paths, raises the crawl depth to 4, and uses three workers:

    {
      "settings": {
        "url": "https://docs.example.com",
        "outputDir": "./my_docs",
        "excludePaths": ["/blog", "/forum"],
        "maxPagesPerSecond": 5,
        "maxDepth": 4,
        "numWorkers": 3
      }
    }

🌟 Example Workflow
- Start an initial crawl: `npm start -- -u https://docs.example.com -o ./my_docs -d 3 -w 2`
- If the crawl is interrupted, Web-to-MD will save the state in `./my_docs/config.json`.
- To resume, simply run: `npm start -- -o ./my_docs`
- To customize, edit `./my_docs/config.json` to change the crawl settings as needed. For example:

      {
        "settings": {
          "url": "https://docs.example.com",
          "outputDir": "./my_docs",
          "excludePaths": ["/blog", "/forum"],
          "maxPagesPerSecond": 5,
          "maxDepth": 4,
          "numWorkers": 3
        }
      }

- Run the crawl again with the updated config: `npm start -- -o ./my_docs`
This workflow allows you to fine-tune your crawls and easily pick up where you left off!
🎭 Contributing
Got ideas? Found a bug? We're all ears! Open an issue or send a pull request. Let's make Web-to-MD even more awesome together! 🤝
📜 License
ISC (It's So Cool) License
🙏 Acknowledgements
A big thank you to all the open-source projects that made Web-to-MD possible. You rock! 🎸
Now go forth and crawl some docs! 🕷️📚
