Link Crawler CLI
A TypeScript-based command-line tool to crawl websites and extract links to downloadable files (PDFs, documents, archives, etc.). The crawler intelligently stays within the target domain and organizes findings into categorized lists.
✨ Features
- 🔍 Smart Crawling: Processes HTML pages efficiently, records file links without downloading
- 🎯 Domain-Focused: Stays within the target domain, automatically filters external links
- 📁 Organized Output: Categorizes files by type (PDF, Excel, Word, archives, etc.)
- ⚡ High Performance: Configurable concurrency, timeouts, and crawl limits
- 📊 Detailed Reporting: Real-time progress tracking and comprehensive summaries
- 🛡️ Type Safe: Built with TypeScript for reliability and maintainability
- 🔧 Easy to Use: Simple CLI interface with comprehensive help
🚀 Quick Start
Installation
# Install globally
npm install -g @suniltaneja/link-crawler-cli
# Or use with npx (no installation required)
npx @suniltaneja/link-crawler-cli --help
Basic Usage
# Crawl a website
npx @suniltaneja/link-crawler-cli https://example.com
# With custom options
npx @suniltaneja/link-crawler-cli https://example.com --max-pages 1000 --verbose
📖 Command Line Options
| Option | Alias | Description | Default |
|--------|-------|-------------|---------|
| --help | -h | Show help message | - |
| --url <URL> | -u | Starting URL to crawl | Required |
| --max-pages <N> | -m | Maximum pages to crawl | 10000 |
| --concurrency <N> | -c | Concurrent requests | 10 |
| --timeout <N> | -t | Request timeout (seconds) | 120 |
| --output <DIR> | -o | Output directory | ./data |
| --verbose | -v | Enable detailed logging | false |
💡 Examples
# Basic crawl
link-crawler https://example.com
# Limit pages with verbose output
link-crawler https://example.com --max-pages 500 --verbose
# Custom settings
link-crawler https://example.com -o ./results -c 5 -t 60
# Quick survey (limited scope)
link-crawler https://example.com -m 100 -c 3 -v
# Show help
link-crawler --help
📁 Output Structure
The crawler creates organized output in your specified directory:
data/
├── lists/
│ ├── html_files.txt # All HTML pages discovered
│ ├── all_files.txt # Complete list of downloadable files
│ ├── pdf_links.txt # PDF files only
│ ├── excel_files.txt # Excel files (.xlsx, .xls)
│ ├── word_files.txt # Word documents (.docx, .doc)
│ ├── other_files.txt # Other supported file types
│ └── crawl_summary.json # Detailed summary with metadata
└── pdfs/ # (Reserved for future features)
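Because the pdfs/ directory is reserved for a future download feature, fetching the listed files is currently left to you. Below is a minimal sketch of one way to do that, assuming Node 18+ (for the built-in fetch), that pdf_links.txt contains one URL per line, and a hypothetical helper script (download-pdfs.ts) that is not part of this package.

```typescript
// download-pdfs.ts - hypothetical helper, not part of the CLI.
// Assumes data/lists/pdf_links.txt contains one URL per line and Node 18+ (built-in fetch).
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { basename } from "node:path";

async function downloadListedPdfs(listPath: string, outDir: string): Promise<void> {
  const urls = (await readFile(listPath, "utf8"))
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean);

  await mkdir(outDir, { recursive: true });

  for (const url of urls) {
    const res = await fetch(url);
    if (!res.ok) {
      console.warn(`Skipping ${url}: HTTP ${res.status}`);
      continue;
    }
    const name = basename(new URL(url).pathname) || "download.pdf";
    await writeFile(`${outDir}/${name}`, Buffer.from(await res.arrayBuffer()));
    console.log(`Saved ${name}`);
  }
}

downloadListedPdfs("data/lists/pdf_links.txt", "data/pdfs").catch(console.error);
```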
📄 Supported File Types
| Category | Extensions | Description |
|----------|------------|-------------|
| Documents | .pdf, .docx, .doc, .pptx, .ppt | PDF and Office documents |
| Spreadsheets | .xlsx, .xls, .csv | Excel and CSV files |
| Archives | .zip, .rar, .7z, .tar, .gz | Compressed archives |
| Text | .txt, .rtf | Plain text documents |
| OpenDocument | .odt, .ods, .odp | Open Office formats |
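As an illustration of how these categories map to extensions, here is a small TypeScript sketch; the CATEGORIES map and categorize function are hypothetical names used only for this example, not the CLI's internal API.

```typescript
// categorize.ts - illustrative only; mirrors the table above, not the CLI's internal code.
const CATEGORIES: Record<string, string[]> = {
  Documents: [".pdf", ".docx", ".doc", ".pptx", ".ppt"],
  Spreadsheets: [".xlsx", ".xls", ".csv"],
  Archives: [".zip", ".rar", ".7z", ".tar", ".gz"],
  Text: [".txt", ".rtf"],
  OpenDocument: [".odt", ".ods", ".odp"],
};

// Returns the category for a URL, or undefined if the extension is not supported.
function categorize(url: string): string | undefined {
  const path = new URL(url).pathname.toLowerCase();
  return Object.keys(CATEGORIES).find((category) =>
    CATEGORIES[category].some((ext) => path.endsWith(ext))
  );
}

console.log(categorize("https://example.com/reports/q3.pdf")); // "Documents"
console.log(categorize("https://example.com/page.html"));      // undefined
```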
⚙️ Configuration
Performance Tuning
| Setting | Recommendation | Use Case |
|---------|----------------|----------|
| Concurrency (-c) | 1-5 | Respectful crawling, small sites |
| | 10-15 | Balanced performance |
| | 15-20 | Fast crawling, robust sites |
| Timeout (-t) | 30-60s | Fast sites |
| | 120s (default) | Most sites |
| | 180-300s | Slow-responding sites |
| Max Pages (-m) | 50-200 | Quick surveys |
| | 1000-5000 | Medium sites |
| | 10000+ | Comprehensive crawls |
🔧 How It Works
- Initialize: Starts crawling from the provided URL
- Discover: Processes HTML pages within the same domain
- Extract: Finds downloadable file links using intelligent parsing
- Filter: Excludes external links, keeps only same-domain files
- Categorize: Organizes files by type and extension
- Report: Generates organized lists and detailed summaries
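Conceptually, this amounts to a breadth-first loop over same-domain pages. The sketch below illustrates the steps above under simplifying assumptions (regex-based href extraction, no concurrency, no timeouts, no retries); it is not the project's actual implementation.

```typescript
// crawl-sketch.ts - simplified illustration of the steps above, not the actual implementation.
const FILE_EXT = /\.(pdf|docx?|pptx?|xlsx?|csv|zip|rar|7z|tar|gz|txt|rtf|od[tsp])$/i;

// Resolve a (possibly relative) href against the current page, ignoring malformed URLs.
function resolve(href: string, base: string): URL | null {
  try {
    return new URL(href, base);
  } catch {
    return null;
  }
}

async function crawl(startUrl: string, maxPages = 100): Promise<Set<string>> {
  const origin = new URL(startUrl).origin;
  const queue: string[] = [startUrl];
  const visited = new Set<string>();
  const files = new Set<string>();

  while (queue.length > 0 && visited.size < maxPages) {
    const page = queue.shift()!;
    if (visited.has(page)) continue;
    visited.add(page);

    const html = await (await fetch(page)).text();

    // Extract href values (a real crawler would use a proper HTML parser).
    for (const [, href] of html.matchAll(/href="([^"#]+)"/g)) {
      const link = resolve(href, page);
      if (!link || link.origin !== origin) continue; // Filter: same domain only
      if (FILE_EXT.test(link.pathname)) {
        files.add(link.href);                        // Record downloadable file
      } else {
        queue.push(link.href);                       // Discover more HTML pages
      }
    }
  }
  return files;
}

crawl("https://example.com", 50).then((found) => console.log([...found]));
```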
🐛 Troubleshooting
Common Issues
| Problem | Solution |
|---------|----------|
| Timeouts | Increase timeout: --timeout 300 |
| Too few results | Increase max pages: --max-pages 5000 |
| Too many results | Decrease max pages: --max-pages 500 |
| Permission errors | Check write permissions for output directory |
| Unexpected external links | Use --verbose to inspect how links are filtered |
Debug Mode
Enable detailed logging to troubleshoot issues:
link-crawler https://example.com --verbose
Verbose output includes:
- ✅ Each HTML page processed
- 🔗 File links found or skipped
- 🚫 External links filtered out
- 📊 Real-time progress updates
🚀 Development
Local Development
# Clone the repository
git clone https://github.com/suniltaneja/link-crawler-cli.git
cd link-crawler-cli
# Install dependencies
npm install
# Run in development mode
npm run dev -- https://example.com --verbose
# Build for production
npm run build
# Test the built version
npm start -- https://example.com
Scripts
- npm run build - Compile TypeScript to JavaScript
- npm run dev - Run directly from TypeScript source
- npm start - Build and run compiled version
- npm run clean - Remove compiled files
📝 License
MIT License - see LICENSE file for details.
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📧 Support
If you encounter any issues or have questions, please open an issue on GitHub.
