Link Crawler CLI
A TypeScript-based command-line tool to crawl websites and extract links to downloadable files (PDFs, documents, archives, etc.). The crawler intelligently stays within the target domain and organizes findings into categorized lists.
✨ Features
- 🔍 Smart Crawling: Processes HTML pages efficiently, records file links without downloading
- 🎯 Domain-Focused: Stays within the target domain, automatically filters external links
- 📁 Organized Output: Categorizes files by type (PDF, Excel, Word, archives, etc.)
- ⚡ High Performance: Configurable concurrency, timeouts, and crawl limits
- 📊 Detailed Reporting: Real-time progress tracking and comprehensive summaries
- 🛡️ Type Safe: Built with TypeScript for reliability and maintainability
- 🔧 Easy to Use: Simple CLI interface with comprehensive help
🚀 Quick Start
Installation
# Install globally
npm install -g @suniltaneja/link-crawler-cli
# Or use with npx (no installation required)
npx @suniltaneja/link-crawler-cli --help
Basic Usage
# Crawl a website
npx @suniltaneja/link-crawler-cli https://example.com
# With custom options
npx @suniltaneja/link-crawler-cli https://example.com --max-pages 1000 --verbose
📖 Command Line Options
| Option | Alias | Description | Default |
|--------|-------|-------------|---------|
| --help | -h | Show help message | - |
| --url <URL> | -u | Starting URL to crawl | Required |
| --max-pages <N> | -m | Maximum pages to crawl | 10000 |
| --concurrency <N> | -c | Concurrent requests | 10 |
| --timeout <N> | -t | Request timeout (seconds) | 120 |
| --output <DIR> | -o | Output directory | ./data |
| --verbose | -v | Enable detailed logging | false |
💡 Examples
# Basic crawl
link-crawler https://example.com
# Limit pages with verbose output
link-crawler https://example.com --max-pages 500 --verbose
# Custom settings
link-crawler https://example.com -o ./results -c 5 -t 60
# Quick survey (limited scope)
link-crawler https://example.com -m 100 -c 3 -v
# Show help
link-crawler --help
📁 Output Structure
The crawler creates organized output in your specified directory:
data/
├── lists/
│ ├── html_files.txt # All HTML pages discovered
│ ├── all_files.txt # Complete list of downloadable files
│ ├── pdf_links.txt # PDF files only
│ ├── excel_files.txt # Excel files (.xlsx, .xls)
│ ├── word_files.txt # Word documents (.docx, .doc)
│ ├── other_files.txt # Other supported file types
│ └── crawl_summary.json # Detailed summary with metadata
└── pdfs/ # (Reserved for future features)
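Because the pdfs/ directory is reserved for a future download feature, fetching the listed files is currently left to you. Below is a minimal sketch of one way to do that, assuming Node 18+ (for the built-in fetch), that pdf_links.txt contains one URL per line, and a hypothetical helper script (download-pdfs.ts) that is not part of this package.

```typescript
// download-pdfs.ts - hypothetical helper, not part of the CLI.
// Assumes data/lists/pdf_links.txt contains one URL per line and Node 18+ (built-in fetch).
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { basename } from "node:path";

async function downloadListedPdfs(listPath: string, outDir: string): Promise<void> {
  const urls = (await readFile(listPath, "utf8"))
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean);

  await mkdir(outDir, { recursive: true });

  for (const url of urls) {
    const res = await fetch(url);
    if (!res.ok) {
      console.warn(`Skipping ${url}: HTTP ${res.status}`);
      continue;
    }
    const name = basename(new URL(url).pathname) || "download.pdf";
    await writeFile(`${outDir}/${name}`, Buffer.from(await res.arrayBuffer()));
    console.log(`Saved ${name}`);
  }
}

downloadListedPdfs("data/lists/pdf_links.txt", "data/pdfs").catch(console.error);
```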
📄 Supported File Types
| Category | Extensions | Description |
|----------|------------|-------------|
| Documents | .pdf, .docx, .doc, .pptx, .ppt | PDF and Office documents |
| Spreadsheets | .xlsx, .xls, .csv | Excel and CSV files |
| Archives | .zip, .rar, .7z, .tar, .gz | Compressed archives |
| Text | .txt, .rtf | Plain text documents |
| OpenDocument | .odt, .ods, .odp | Open Office formats |
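As an illustration of how these categories map to extensions, here is a small TypeScript sketch; the CATEGORIES map and categorize function are hypothetical names used only for this example, not the CLI's internal API.

```typescript
// categorize.ts - illustrative only; mirrors the table above, not the CLI's internal code.
const CATEGORIES: Record<string, string[]> = {
  Documents: [".pdf", ".docx", ".doc", ".pptx", ".ppt"],
  Spreadsheets: [".xlsx", ".xls", ".csv"],
  Archives: [".zip", ".rar", ".7z", ".tar", ".gz"],
  Text: [".txt", ".rtf"],
  OpenDocument: [".odt", ".ods", ".odp"],
};

// Returns the category for a URL, or undefined if the extension is not supported.
function categorize(url: string): string | undefined {
  const path = new URL(url).pathname.toLowerCase();
  return Object.keys(CATEGORIES).find((category) =>
    CATEGORIES[category].some((ext) => path.endsWith(ext))
  );
}

console.log(categorize("https://example.com/reports/q3.pdf")); // "Documents"
console.log(categorize("https://example.com/page.html"));      // undefined
```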
⚙️ Configuration
Performance Tuning
| Setting | Recommendation | Use Case |
|---------|----------------|----------|
| Concurrency (-c) | 1-5 | Respectful crawling, small sites |
| | 10-15 | Balanced performance |
| | 15-20 | Fast crawling, robust sites |
| Timeout (-t) | 30-60s | Fast sites |
| | 120s (default) | Most sites |
| | 180-300s | Slow-responding sites |
| Max Pages (-m) | 50-200 | Quick surveys |
| | 1000-5000 | Medium sites |
| | 10000+ | Comprehensive crawls |
🔧 How It Works
- Initialize: Starts crawling from the provided URL
- Discover: Processes HTML pages within the same domain
- Extract: Finds downloadable file links using intelligent parsing
- Filter: Excludes external links, keeps only same-domain files
- Categorize: Organizes files by type and extension
- Report: Generates organized lists and detailed summaries
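Conceptually, this amounts to a breadth-first loop over same-domain pages. The sketch below illustrates the steps above under simplifying assumptions (regex-based href extraction, no concurrency, no timeouts, no retries); it is not the project's actual implementation.

```typescript
// crawl-sketch.ts - simplified illustration of the steps above, not the actual implementation.
const FILE_EXT = /\.(pdf|docx?|pptx?|xlsx?|csv|zip|rar|7z|tar|gz|txt|rtf|od[tsp])$/i;

// Resolve a (possibly relative) href against the current page, ignoring malformed URLs.
function resolve(href: string, base: string): URL | null {
  try {
    return new URL(href, base);
  } catch {
    return null;
  }
}

async function crawl(startUrl: string, maxPages = 100): Promise<Set<string>> {
  const origin = new URL(startUrl).origin;
  const queue: string[] = [startUrl];
  const visited = new Set<string>();
  const files = new Set<string>();

  while (queue.length > 0 && visited.size < maxPages) {
    const page = queue.shift()!;
    if (visited.has(page)) continue;
    visited.add(page);

    const html = await (await fetch(page)).text();

    // Extract href values (a real crawler would use a proper HTML parser).
    for (const [, href] of html.matchAll(/href="([^"#]+)"/g)) {
      const link = resolve(href, page);
      if (!link || link.origin !== origin) continue; // Filter: same domain only
      if (FILE_EXT.test(link.pathname)) {
        files.add(link.href);                        // Record downloadable file
      } else {
        queue.push(link.href);                       // Discover more HTML pages
      }
    }
  }
  return files;
}

crawl("https://example.com", 50).then((found) => console.log([...found]));
```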
🐛 Troubleshooting
Common Issues
| Problem | Solution |
|---------|----------|
| Timeouts | Increase timeout: --timeout 300 |
| Too few results | Increase max pages: --max-pages 5000 |
| Too many results | Decrease max pages: --max-pages 500 |
| Permission errors | Check write permissions for output directory |
| Unexpected external links | Use --verbose to inspect how links are filtered |
Debug Mode
Enable detailed logging to troubleshoot issues:
link-crawler https://example.com --verbose
Verbose output includes:
- ✅ Each HTML page processed
- 🔗 File links found or skipped
- 🚫 External links filtered out
- 📊 Real-time progress updates
🚀 Development
Local Development
# Clone the repository
git clone https://github.com/suniltaneja/link-crawler-cli.git
cd link-crawler-cli
# Install dependencies
npm install
# Run in development mode
npm run dev -- https://example.com --verbose
# Build for production
npm run build
# Test the built version
npm start -- https://example.com
Scripts
- npm run build - Compile TypeScript to JavaScript
- npm run dev - Run directly from TypeScript source
- npm start - Build and run compiled version
- npm run clean - Remove compiled files
📝 License
MIT License - see LICENSE file for details.
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📧 Support
If you encounter any issues or have questions, please open an issue on GitHub.
