🕷️ @harshvz/crawler
A powerful web scraping tool built with Playwright that crawls websites using BFS or DFS algorithms, captures screenshots, and extracts content.
📋 Table of Contents
- Features
- Installation
- Usage
- CLI Commands
- API Documentation
- Configuration
- Output Structure
- Examples
- Development
- Contributing
- License
✨ Features
- 🔍 Intelligent Crawling: Choose between BFS (Breadth-First Search) or DFS (Depth-First Search) algorithms
- 📸 Full Page Screenshots: Automatically captures full-page screenshots of each visited page
- 📝 Content Extraction: Extracts metadata, headings, paragraphs, and text content
- 🎯 Domain-Scoped: Only crawls internal links within the same domain
- 🚀 Interactive CLI: User-friendly command-line interface with input validation
- 💾 Organized Storage: Saves screenshots and content in a structured directory format
- 🔄 Duplicate Prevention: Tracks visited URLs to avoid redundant scraping
- 🎨 SEO Metadata: Extracts Open Graph, Twitter Cards, and other meta tags
- ⏱️ Timeout Handling: Built-in timeout management for unresponsive pages
📦 Installation
As a Global CLI Tool
```bash
npm install -g @harshvz/crawler
```
Note: Chromium browser will be automatically downloaded during installation (approximately 300 MB). This is required for web scraping functionality.
As a Project Dependency
```bash
npm install @harshvz/crawler
```
Note: The postinstall script will automatically download the Chromium browser.
Manual Browser Installation (if needed)
If the automatic installation fails, you can manually install browsers:
```bash
npx playwright install chromium
```
From Source
```bash
git clone https://github.com/harshvz/crawler.git
cd crawler
npm install
npm run build
npm install -g .
```
🚀 Usage
CLI Mode (Interactive)
Simply run the command and follow the prompts:
```bash
# Primary command (recommended)
crawler

# Alternative (for backward compatibility)
scraper
```
You'll be prompted to enter:
- URL: The website URL to scrape (e.g., https://example.com)
- Algorithm: Choose between `bfs` or `dfs` (default: `bfs`)
- Output Directory: Custom save location (default: `~/knowledgeBase`)
Command-Line Flags
```bash
# Show version
crawler --version
crawler -v

# Show help
crawler --help
crawler -h
```
Note: Both `crawler` and `scraper` commands work identically. We recommend using `crawler` for new projects.
Programmatic Usage
```ts
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com', 2); // depth limit of 2

// Using BFS
await scraper.bfsScrape('/');

// Using DFS
await scraper.dfsScrape('/');
```
🛠️ CLI Commands
Development
```bash
# Run in development mode with auto-reload
npm run dev

# Build the project
npm run build

# Start the built version (uses crawler command)
npm start
```
📚 API Documentation
ScrapperServices
Main class for web scraping operations.
Constructor
```ts
new ScrapperServices(website: string, depth?: number, customPath?: string)
```
Parameters:
- `website` (string): The base URL of the website to scrape
- `depth` (number, optional): Maximum depth to crawl (0 = unlimited, default: 0)
- `customPath` (string, optional): Custom output directory path (default: `~/knowledgeBase`)
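For instance (the URL and output path below are placeholders, not values from the package):
```ts
import ScrapperServices from '@harshvz/crawler';

// Unlimited depth, default output directory (~/knowledgeBase)
const defaults = new ScrapperServices('https://example.com');

// Depth limited to 2 and a custom output directory (illustrative path)
const custom = new ScrapperServices('https://example.com', 2, '/custom/output/path');
```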
Methods
bfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>): Promise<void>
Crawls the website using Breadth-First Search algorithm.
Parameters:
- `endpoint` (string): Starting path (default: `"/"`)
- `results` (string[]): Array to collect visited endpoints
- `visited` (Record<string, boolean>): Object to track visited URLs
dfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>): Promise<void>
Crawls the website using Depth-First Search algorithm.
Parameters:
- `endpoint` (string): Starting path (default: `"/"`)
- `results` (string[]): Array to collect visited endpoints
- `visited` (Record<string, boolean>): Object to track visited URLs
buildFilePath(endpoint: string): string
Generates a file path for storing screenshots.
buildContentPath(endpoint: string): string
Generates a file path for storing extracted content.
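A rough sketch of how these two helpers might be used; the endpoint is illustrative and the exact returned paths follow the hostname-based naming shown under Output Structure below:
```ts
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com');

// Where the screenshot and content for /about would be written.
// Exact names follow the folder and endpoint naming under Output Structure
// (e.g. /about -> _about.png and _about.md).
const screenshotPath = scraper.buildFilePath('/about');
const contentPath = scraper.buildContentPath('/about');

console.log(screenshotPath);
console.log(contentPath);
```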
getLinks(page: Page): Promise<string[]>
Extracts all internal links from the current page.
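A minimal sketch of calling `getLinks` against a page you open yourself with Playwright; this assumes the method can be used standalone with an externally created `Page`, which is not spelled out above:
```ts
import { chromium } from 'playwright';
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com');

// Open a page manually, then collect the same-domain links found on it.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

const internalLinks = await scraper.getLinks(page);
console.log(internalLinks);

await browser.close();
```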
⚙️ Configuration
Timeout
The default timeout for page navigation is 60 seconds. You can change it by setting the timeout property on a ScrapperServices instance:
```ts
const scraper = new ScrapperServices('https://example.com');
scraper.timeout = 30000; // 30 seconds
```
Storage Location
By default, all scraped data is stored in:
```
~/knowledgeBase/
```
Each website gets its own folder based on its hostname.
📁 Output Structure
```
~/knowledgeBase/
└── examplecom/
    ├── home.png        # Screenshot of homepage
    ├── home.md         # Extracted content from homepage
    ├── _about.png      # Screenshot of /about page
    ├── _about.md       # Extracted content from /about
    ├── _contact.png    # Screenshot of /contact page
    └── _contact.md     # Extracted content from /contact
```
Content File Format (.md)
Each .md file contains:
- JSON metadata (first line):
  - Page title
  - Meta description
  - Robots directives
  - Open Graph tags
  - Twitter Card tags
- Extracted text content (subsequent lines):
  - All text from h1-h6, p, and span elements
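Assuming that layout, a saved content file could be read back roughly like this (the hostname folder and file name are placeholders; this is a sketch, not part of the package API):
```ts
import { readFileSync } from 'node:fs';
import { homedir } from 'node:os';
import { join } from 'node:path';

// Default location described above: ~/knowledgeBase/<hostname>/<page>.md
const file = join(homedir(), 'knowledgeBase', 'examplecom', 'home.md');
const raw = readFileSync(file, 'utf8');

// First line: JSON metadata; remaining lines: extracted text content.
const newlineIndex = raw.indexOf('\n');
const metadata = JSON.parse(newlineIndex === -1 ? raw : raw.slice(0, newlineIndex));
const textContent = newlineIndex === -1 ? '' : raw.slice(newlineIndex + 1);

console.log(metadata);
console.log(textContent.slice(0, 200));
```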
📖 Examples
Example 1: Basic Usage
```ts
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://docs.example.com');
await scraper.bfsScrape('/');
```
Example 2: Limited Depth Crawl
```ts
const scraper = new ScrapperServices('https://blog.example.com', 2);
await scraper.dfsScrape('/');
// Only crawls 2 levels deep from the starting page
```
Example 3: Custom Endpoint
```ts
const scraper = new ScrapperServices('https://example.com');
const results: string[] = [];
const visited: Record<string, boolean> = {};

await scraper.bfsScrape('/docs', results, visited);
console.log(`Scraped ${results.length} pages`);
```
Example 4: Custom Output Directory
```ts
const scraper = new ScrapperServices(
  'https://example.com',
  0,                      // No depth limit
  '/custom/output/path'   // Custom save location
);
await scraper.bfsScrape('/');
// Files will be saved to /custom/output/path instead of ~/knowledgeBase
```
🔧 Development
Prerequisites
- Node.js >= 16.x
- npm >= 7.x
Setup
# Clone the repository
git clone https://github.com/harshvz/crawler.git
# Navigate to directory
cd crawler
# Install dependencies
npm install
# Run in development mode
npm run devProject Structure
```
crawler/
├── src/
│   ├── index.ts                 # CLI entry point
│   └── Services/
│       └── ScrapperServices.ts  # Main scraping logic
├── dist/                        # Compiled JavaScript
├── package.json
├── tsconfig.json
└── README.md
```
Building
```bash
npm run build
```
This compiles TypeScript files to JavaScript in the `dist/` directory.
🤝 Contributing
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
📝 License
ISC © Harshvz
🙏 Acknowledgments
- Built with Playwright
- CLI powered by Inquirer.js
Made with ❤️ by harshvz
