🕷️ @harshvz/crawler

A powerful web scraping tool built with Playwright that crawls websites using BFS or DFS algorithms, captures screenshots, and extracts content.



✨ Features

  • 🔍 Intelligent Crawling: Choose between BFS (Breadth-First Search) or DFS (Depth-First Search) algorithms
  • 📸 Full Page Screenshots: Automatically captures full-page screenshots of each visited page
  • 📝 Content Extraction: Extracts metadata, headings, paragraphs, and text content
  • 🎯 Domain-Scoped: Only crawls internal links within the same domain
  • 🚀 Interactive CLI: User-friendly command-line interface with input validation
  • 💾 Organized Storage: Saves screenshots and content in a structured directory format
  • 🔄 Duplicate Prevention: Tracks visited URLs to avoid redundant scraping
  • 🎨 SEO Metadata: Extracts Open Graph, Twitter Cards, and other meta tags
  • ⏱️ Timeout Handling: Built-in timeout management for unresponsive pages

📦 Installation

As a Global CLI Tool

npm install -g @harshvz/crawler

Note: The Chromium browser is downloaded automatically during installation (approximately 300 MB). It is required for web scraping.

As a Project Dependency

npm install @harshvz/crawler

Note: The postinstall script will automatically download the Chromium browser.

Manual Browser Installation (if needed)

If the automatic installation fails, you can manually install browsers:

npx playwright install chromium

From Source

git clone https://github.com/harshvz/crawler.git
cd crawler
npm install
npm run build
npm install -g .

🚀 Usage

CLI Mode (Interactive)

Simply run the command and follow the prompts:

# Primary command (recommended)
crawler

# Alternative (for backward compatibility)
scraper

You'll be prompted to enter:

  1. URL: The website URL to scrape (e.g., https://example.com)
  2. Algorithm: Choose between bfs or dfs (default: bfs)
  3. Output Directory: Custom save location (default: ~/knowledgeBase)

Command-Line Flags

# Show version
crawler --version
crawler -v

# Show help
crawler --help
crawler -h

Note: Both crawler and scraper commands work identically. We recommend using crawler for new projects.

Programmatic Usage

import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com', 2); // depth limit of 2

// Using BFS
await scraper.bfsScrape('/');

// Using DFS
await scraper.dfsScrape('/');
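
The snippet above relies on top-level await, which works in an ES module. If your setup does not support top-level await, a minimal sketch is to wrap the calls in an async function (the error handling here is only illustrative):

import ScrapperServices from '@harshvz/crawler';

async function main() {
  // Crawl at most 2 levels deep, starting from the site root
  const scraper = new ScrapperServices('https://example.com', 2);
  await scraper.bfsScrape('/');
}

main().catch((error) => {
  console.error('Crawl failed:', error);
  process.exit(1);
});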

🛠️ CLI Commands

Development

# Run in development mode with auto-reload
npm run dev

# Build the project
npm run build

# Start the built version (uses crawler command)
npm start

📚 API Documentation

ScrapperServices

Main class for web scraping operations.

Constructor

new ScrapperServices(website: string, depth?: number, customPath?: string)

Parameters:

  • website (string): The base URL of the website to scrape
  • depth (number, optional): Maximum depth to crawl (0 = unlimited, default: 0)
  • customPath (string, optional): Custom output directory path (default: ~/knowledgeBase)
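
Putting the parameters together, a constructor call with all three arguments might look like this (the depth and path are illustrative; Example 4 below shows the same pattern):

// Crawl up to 3 levels deep and store output under /tmp/crawl-output
const scraper = new ScrapperServices('https://example.com', 3, '/tmp/crawl-output');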

Methods

bfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>): Promise<void>

Crawls the website using Breadth-First Search algorithm.

Parameters:

  • endpoint (string): Starting path (default: "/")
  • results (string[]): Array to collect visited endpoints
  • visited (Record<string, boolean>): Object to track visited URLs

dfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>): Promise<void>

Crawls the website using Depth-First Search algorithm.

Parameters:

  • endpoint (string): Starting path (default: "/")
  • results (string[]): Array to collect visited endpoints
  • visited (Record<string, boolean>): Object to track visited URLs

buildFilePath(endpoint: string): string

Generates a file path for storing screenshots.

buildContentPath(endpoint: string): string

Generates a file path for storing extracted content.
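
Judging from the Output Structure section below, these helpers map an endpoint to file paths inside the site's folder. A rough sketch of how they could be called (the exact returned paths depend on the implementation):

const scraper = new ScrapperServices('https://example.com');

// For '/about', the paths should land alongside the other output files,
// e.g. somewhere under ~/knowledgeBase/examplecom/
const screenshotPath = scraper.buildFilePath('/about');
const contentPath = scraper.buildContentPath('/about');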

getLinks(page: Page): Promise<string[]>

Extracts all internal links from the current page.
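
Because getLinks takes a Playwright Page, calling it directly means managing the browser yourself. A minimal sketch, assuming Playwright is available in your project (you may need to add it as a dependency to import it directly):

import { chromium } from 'playwright';
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com');

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// Collect the internal links found on this page
const links = await scraper.getLinks(page);
console.log(links);

await browser.close();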

⚙️ Configuration

Timeout

The default timeout for page navigation is 60 seconds. You can change it by setting the timeout property on a ScrapperServices instance:

const scraper = new ScrapperServices('https://example.com');
scraper.timeout = 30000; // 30 seconds

Storage Location

By default, all scraped data is stored in:

~/knowledgeBase/

Each website gets its own folder based on its hostname.

📁 Output Structure

~/knowledgeBase/
└── examplecom/
    ├── home.png                 # Screenshot of homepage
    ├── home.md                  # Extracted content from homepage
    ├── _about.png              # Screenshot of /about page
    ├── _about.md               # Extracted content from /about
    ├── _contact.png            # Screenshot of /contact page
    └── _contact.md             # Extracted content from /contact

Content File Format (.md)

Each .md file contains:

  1. JSON metadata (first line):
    • Page title
    • Meta description
    • Robots directives
    • Open Graph tags
    • Twitter Card tags
  2. Extracted text content (subsequent lines):
    • All text from h1-h6, p, and span elements
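
As a rough illustration only (field names and exact formatting depend on the page and the implementation), a generated _about.md could start like this:

{"title":"About Us","description":"Who we are","robots":"index, follow","og:title":"About Us","twitter:card":"summary"}
About Us
We are a small team building developer tools.
Get in touch to learn more.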

📖 Examples

Example 1: Basic Usage

import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://docs.example.com');
await scraper.bfsScrape('/');

Example 2: Limited Depth Crawl

const scraper = new ScrapperServices('https://blog.example.com', 2);
await scraper.dfsScrape('/');
// Only crawls 2 levels deep from the starting page

Example 3: Custom Endpoint

const scraper = new ScrapperServices('https://example.com');
const results = [];
const visited = {};
await scraper.bfsScrape('/docs', results, visited);
console.log(`Scraped ${results.length} pages`);

Example 4: Custom Output Directory

const scraper = new ScrapperServices(
    'https://example.com',
    0,  // No depth limit
    '/custom/output/path'  // Custom save location
);
await scraper.bfsScrape('/');
// Files will be saved to /custom/output/path instead of ~/knowledgeBase

🔧 Development

Prerequisites

  • Node.js >= 16.x
  • npm >= 7.x

Setup

# Clone the repository
git clone https://github.com/harshvz/crawler.git

# Navigate to directory
cd crawler

# Install dependencies
npm install

# Run in development mode
npm run dev

Project Structure

crawler/
├── src/
│   ├── index.ts                    # CLI entry point
│   └── Services/
│       └── ScrapperServices.ts     # Main scraping logic
├── dist/                           # Compiled JavaScript
├── package.json
├── tsconfig.json
└── README.md

Building

npm run build

This compiles TypeScript files to JavaScript in the dist/ directory.

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

ISC © Harshvz

🙏 Acknowledgments


Made with ❤️ by harshvz