url-crawler

v1.2.0

A library to crawl and extract cleaned HTML content from URLs.

URL Crawler Library

⚠️ EDUCATIONAL PURPOSES ONLY ⚠️

This library is NOT production-ready and should only be used for educational and learning purposes. It lacks critical production features such as robust error handling, retry logic, graceful shutdown, memory management for large crawls, monitoring, and observability.

For production use cases, consider battle-tested alternatives like Puppeteer, Playwright, Scrapy (Python), or other established web crawling frameworks.


A TypeScript library for crawling and extracting cleaned HTML content from URLs. This library provides tools to fetch HTML content, parse it, extract links, and recursively crawl pages with configurable depth and page limits.

Features

  • Event-Driven Crawling: Built on Node.js EventEmitter for real-time progress tracking and error handling
  • Recursive URL Crawling: Crawl URLs with configurable depth and page limits
  • Parallel Processing: Efficiently crawl multiple pages concurrently
  • HTML Parsing: Extract cleaned text content, titles, and links from HTML
  • HTML Cleaning: Remove scripts, styles, and unnecessary whitespace using sanitize-html
  • Link Extraction: Parse and normalize URLs found in HTML content
  • HTTP Fetching: Robust HTTP helper for fetching HTML content with error handling
  • Race Condition Prevention: Safe URL tracking to prevent duplicate crawls
  • Configurable Limits: Control MAX_DEPTH and MAX_PAGES per-crawler instance
  • Robots.txt Support: Automatic robots.txt compliance with crawl delay throttling
  • Per-Host Rate Limiting: Respects crawl delays from robots.txt files using host-based throttling

Installation

npm install url-crawler

Import Patterns

The library exports the Crawler class, type definitions, and configuration:

import { Crawler } from 'url-crawler';
import type { CrawlStartedEventPayload } from 'url-crawler';
import { EVENT_TYPES, config } from 'url-crawler';

Usage

Event-Driven Crawling

The Crawler class extends EventEmitter and provides event-driven crawling with proper error handling and progress tracking:

import { Crawler } from 'url-crawler';

const crawler = new Crawler('https://example.com');

// Listen to crawl events
crawler.on('crawl:started', (data) => {
    console.log('Crawl started:', data);
});

crawler.on('page:processed', (data) => {
    console.log('Page processed:', data.url, data.title);
});

crawler.on('page:error', (data) => {
    console.log('Page error:', data.url, data.message);
});

crawler.on('crawl:completed', (data) => {
    console.log('Crawl completed. Total pages:', data.totalPages, 'Duration:', data.durationMs, 'ms');
});

crawler.on('crawl:error', (data) => {
    console.error('Critical crawl error:', data.message);
});

// Start crawling
await crawler.startCrawl();

Configuration Overrides

Pass configuration overrides to customize behavior per-crawler instance:

import { Crawler } from 'url-crawler';

const crawler = new Crawler('https://example.com', {
    MAX_DEPTH: 5,
    MAX_PAGES: 20
});

await crawler.startCrawl();

HTML Parsing

// HtmlParser is used internally by Crawler
// For custom HTML parsing, use the Crawler to fetch and process pages
const crawler = new Crawler('https://example.com');

crawler.on('page:processed', (data) => {
    const { url, title, cleanedContent, otherPageUrls } = data;
    console.log(title);
    console.log(cleanedContent);
    console.log(otherPageUrls);
});

await crawler.startCrawl();

HTTP Fetching

// HttpHelper is used internally by Crawler
// Use Crawler for HTTP operations with built-in rate limiting and robots.txt support
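
If you only need the content of a single page, one approach is to constrain the crawl to the start URL. This is a minimal sketch, assuming that a MAX_PAGES override of 1 limits the crawl to just the starting page:

import { Crawler } from 'url-crawler';

// Assumed: MAX_PAGES: 1 restricts the crawl to the start URL only.
const singlePage = new Crawler('https://example.com', { MAX_PAGES: 1 });

singlePage.on('page:processed', (data) => {
    // data.cleanedContent holds the sanitized content of the start page
    console.log(data.title, data.cleanedContent.length);
});

await singlePage.startCrawl();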

Robots.txt Parser

// RobotsTxtParser is used internally by Crawler
// Robots.txt checking is automatic - disallowed URLs are skipped during crawling

Robots.txt Crawl Delay with Throttling

The crawler automatically respects the Crawl-delay directive specified in a website's robots.txt file, which keeps crawling polite and avoids overloading the target server:

import { Crawler } from 'url-crawler';

const crawler = new Crawler('https://example.com');

// The crawler automatically:
// 1. Fetches the robots.txt file from the target domain
// 2. Extracts the Crawl-delay directive for the crawler's user agent
// 3. Applies per-host rate limiting based on the specified delay

crawler.on('page:processed', (data) => {
    console.log('Page crawled with respect to robots.txt crawl delay:', data.url);
});

await crawler.startCrawl();

How It Works:

  • robots.txt Fetching: On first access to a domain, the crawler fetches and caches the robots.txt file from the host
  • Crawl Delay Extraction: The Crawl-delay directive is parsed from the robots.txt file. If specified, this value (in seconds) determines the minimum delay between requests to that host
  • Per-Host Throttling: Using the Bottleneck rate limiting library, each host maintains its own queue with a minTime interval set to the crawl delay value
  • Default Fallback: If no Crawl-delay is specified in robots.txt, the crawler uses a default CRAWL_DELAY_MS (typically 1000ms) to avoid aggressive crawling
  • Caching: robots.txt files are cached with a configurable TTL (default ROBOTS_CACHE_TTL_MS) to avoid repeated fetches

Key Features:

  • Single Concurrent Request per Host: Only one request is in flight to a given host at a time, so requests to the same host run sequentially
  • Robots.txt Compliance: Disallowed URLs (as per robots.txt Disallow directives) are automatically skipped
  • Efficient Resource Usage: Multiple hosts can be crawled in parallel, but each host respects its rate limits
  • Graceful Degradation: If robots.txt cannot be fetched, the crawler proceeds with the default crawl delay

Example robots.txt Entry:

User-agent: *
Crawl-delay: 2
Disallow: /admin/
Disallow: /private/

In this example, the crawler will wait 2 seconds between requests to this domain and will not attempt to crawl /admin/ or /private/ URLs.
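
The crawler's internals are not exposed, but per-host throttling of this kind can be modeled with Bottleneck's keyed limiter groups. The sketch below is illustrative only, not the library's actual implementation; the 2000 ms minTime mirrors the Crawl-delay: 2 directive above, and throttledFetch is a hypothetical helper:

import Bottleneck from 'bottleneck';

// One limiter per host, keyed by hostname. Each limiter allows a single
// in-flight request and enforces a minimum gap between requests.
const perHost = new Bottleneck.Group({
    maxConcurrent: 1, // one request per host at a time
    minTime: 2000     // Crawl-delay: 2  ->  2000 ms between requests
});

async function throttledFetch(url: string): Promise<string> {
    const host = new URL(url).host;
    // Requests to the same host queue behind each other; different hosts
    // proceed in parallel, each honoring its own delay.
    return perHost.key(host).schedule(async () => {
        const res = await fetch(url);
        return res.text();
    });
}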

Configuration

Configure the crawler behavior using environment variables:

  • MAX_DEPTH: Maximum crawl depth (default: 3)
    • Alternative: RECURSION_LIMIT
  • MAX_PAGES: Maximum number of pages to crawl (default: 3)
    • Alternative: MAX_PAGES_LIMIT
  • CRAWL_DELAY_MS: Default delay between requests per host in milliseconds when robots.txt doesn't specify a crawl delay (default: 1000)
  • ROBOTS_CACHE_TTL_MS: Time-to-live for robots.txt cache entries in milliseconds (default: 86400000 = 24 hours)

Setting via Environment Variables:

# Set maximum crawl depth
export MAX_DEPTH=5

# Set maximum pages to crawl
export MAX_PAGES=20

# Set default crawl delay (in milliseconds)
export CRAWL_DELAY_MS=2000

# Set robots.txt cache TTL (in milliseconds)
export ROBOTS_CACHE_TTL_MS=3600000  # 1 hour

Accessing Configuration in Code:

import { config, ROBOTS_CACHE_TTL_MS } from 'url-crawler';

console.log(config.MAX_DEPTH);          // From MAX_DEPTH env var, or RECURSION_LIMIT fallback
console.log(config.MAX_PAGES);          // From MAX_PAGES env var, or MAX_PAGES_LIMIT fallback
console.log(config.CRAWL_DELAY_MS);     // From CRAWL_DELAY_MS env var
console.log(ROBOTS_CACHE_TTL_MS);       // From ROBOTS_CACHE_TTL_MS env var or 24 hour default

Per-Instance Configuration Overrides:

You can also override configuration for individual crawler instances without modifying environment variables:

import { Crawler } from 'url-crawler';

const crawler = new Crawler('https://example.com', {
    MAX_DEPTH: 5,
    MAX_PAGES: 20,
    CRAWL_DELAY_MS: 2000
});

await crawler.startCrawl();

Events

The Crawler class emits the following events:

  • crawl:started: Emitted when crawling begins. Payload: { url, timestamp }
  • page:processed: Emitted when a page is successfully crawled. Payload: { url, title, cleanedContent, otherPageUrls }
  • page:error: Emitted when a page fails to crawl. Payload: { url, message }
  • crawl:completed: Emitted when crawling completes successfully. Payload: { url, totalPages, durationMs }
  • crawl:error: Emitted when a critical crawl error occurs. Payload: { url, message }
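
The exported payload types pair naturally with these events. A short sketch, using the event names above and the type exports listed in the next section, to get compile-time checks on listener data:

import { Crawler } from 'url-crawler';
import type {
    PageProcessedEventPayload,
    CrawlCompletedEventPayload
} from 'url-crawler';

const crawler = new Crawler('https://example.com');

// Annotating listener parameters with the exported payload types gives
// autocomplete and type checking on the event data.
crawler.on('page:processed', (data: PageProcessedEventPayload) => {
    console.log('Processed:', data.url);
});

crawler.on('crawl:completed', (data: CrawlCompletedEventPayload) => {
    console.log(`Crawled ${data.totalPages} pages in ${data.durationMs} ms`);
});

await crawler.startCrawl();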

Available Exports

The library exports:

  • Crawler - The main web crawler class
  • EVENT_TYPES - Event type constants for event listeners
  • config - Configuration object with default values
  • ROBOTS_CACHE_TTL_MS - Cache TTL for robots.txt
  • CRAWLER_USER_AGENT - User agent string used for HTTP requests
  • Type definitions: CrawlStartedEventPayload, PageProcessedEventPayload, CrawlCompletedEventPayload, PageErrorEventPayload, CrawlErrorEventPayload, EventData

All utility services (HtmlParser, HttpHelper, RobotsTxtParser) are used internally by the Crawler and not exposed as separate exports.
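
A quick way to see what the package exposes at runtime is to import and log the named exports (the values shown will depend on your environment configuration):

import { EVENT_TYPES, config, ROBOTS_CACHE_TTL_MS, CRAWLER_USER_AGENT } from 'url-crawler';

// Inspect the constants and resolved configuration the crawler will use.
console.log('Event types:', EVENT_TYPES);
console.log('Limits:', config.MAX_DEPTH, config.MAX_PAGES, config.CRAWL_DELAY_MS);
console.log('robots.txt cache TTL (ms):', ROBOTS_CACHE_TTL_MS);
console.log('User agent:', CRAWLER_USER_AGENT);

Knowing CRAWLER_USER_AGENT is handy if you want to add a matching User-agent rule to your own site's robots.txt while testing the crawler against it.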

Project Structure

src
├── index.ts              # Library entry point
├── config
│   └── index.ts          # Configuration and event types
├── entries
│   ├── crawler.ts        # Crawler export
│   ├── config.ts         # Config export
│   └── types.ts          # Types export
├── services
│   ├── crawler.ts        # Event-driven crawler implementation
│   ├── httpHelper.ts     # HTTP utilities (fetch, URL parsing)
│   └── robotsTxt.ts      # Robots.txt parsing and URL validation
├── types
│   └── index.ts          # TypeScript interfaces and types
└── utils
    └── htmlParser.ts     # HTML parsing and link extraction utilities

Types

The shape of the parsed content extracted from each page:

type ParsedContentDetails = {
    title: string;
    cleanedContent: string;
    otherPageUrls: string[];
};
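
These fields also appear on the page:processed payload (see the HTML Parsing example above). A small sketch, assuming those field names, that collects crawl results into an in-memory index keyed by URL:

import { Crawler } from 'url-crawler';
import type { PageProcessedEventPayload } from 'url-crawler';

// Index the parsed details for each crawled page by its URL.
const pages = new Map<string, { title: string; cleanedContent: string; otherPageUrls: string[] }>();

const crawler = new Crawler('https://example.com', { MAX_PAGES: 10 });

crawler.on('page:processed', ({ url, title, cleanedContent, otherPageUrls }: PageProcessedEventPayload) => {
    pages.set(url, { title, cleanedContent, otherPageUrls });
});

crawler.on('crawl:completed', () => {
    console.log(`Indexed ${pages.size} pages`);
});

await crawler.startCrawl();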

Building

npm run build

This generates compiled JavaScript and TypeScript declaration files in the dist directory.

License

This project is licensed under the MIT License.