website-scrap-engine

Configurable website scraper library in TypeScript. Consumers provide a DownloadOptions config (which includes a ProcessingLifeCycle) and instantiate a downloader to recursively scrape websites to local disk.

Features

  • Configurable processing pipeline with hook arrays at every stage
  • Single-thread and multi-thread (native worker_threads) downloaders
  • HTML, CSS, SVG, and sitemap parsing with automatic link discovery
  • CSS url() extraction and rewriting
  • srcset, Open Graph meta tags, inline styles, and SVG xlink:href support
  • Automatic URL-to-relative-path rewriting so saved sites work offline
  • Streaming download support for large binary resources
  • PQueue-based concurrency with runtime adjustment
  • URL deduplication with configurable search-param stripping
  • Configurable retry with exponential backoff, jitter, and Retry-After header support
  • Local file:// source support for re-processing previously saved sites
  • Configurable logging via log4js with dedicated categories (skip, retry, error, notFound, etc.)

Installation

npm install website-scrap-engine

Requires Node.js >= 18.17.0.

Usage

The downloader takes a path (or file:// URL) to a module that default-exports a DownloadOptions object. This pattern allows worker threads to independently load the same configuration.

Step 1: Create an options module (e.g. my-options.js)

import {lifeCycle, options, resource} from 'website-scrap-engine';

const {defaultLifeCycle} = lifeCycle;
const {defaultDownloadOptions} = options;
const {ResourceType} = resource;

const lc = defaultLifeCycle();

// Example: skip binary resources deeper than depth 2
lc.processBeforeDownload.push((res) => {
  if (res.depth > 2 && res.type === ResourceType.Binary) return;
  return res;
});

export default defaultDownloadOptions({
  ...lc,
  localRoot: '/path/to/save',
  maxDepth: 3,
  initialUrl: ['https://example.com'],
});

Step 2: Create and run the downloader

import path from 'path';
import {downloader} from 'website-scrap-engine';

const {SingleThreadDownloader} = downloader;

// The downloader loads the options module itself, hence the file:// URL.
const d = new SingleThreadDownloader(
  'file://' + path.resolve('my-options.js')
);
d.start();
// Wait for the queue to drain, then release resources.
d.onIdle().then(() => d.dispose());

For CPU-intensive workloads, use MultiThreadDownloader instead (see Multi-Thread Processing).

You can also pass override options as the second argument to the downloader constructor, which are merged into the options module's export:

new SingleThreadDownloader('file://' + path.resolve('my-options.js'), {
  localRoot: '/different/path',
  concurrency: 8,
});

Adapter Helpers

The library provides adapter functions in lifeCycle.adapter for common customization patterns:

| Adapter | Stage | Description |
|---|---|---|
| skipProcess(fn) | linkRedirect | Skip URLs matching a predicate |
| dropResource(fn) | processBeforeDownload | Mark matching resources as discard-only (replace link but don't download) |
| preProcess(fn) | processBeforeDownload | Inspect/modify resources before download |
| requestRedirect(fn) | processBeforeDownload | Rewrite the download URL |
| redirectFilter(fn) | processAfterDownload | Rewrite or discard redirect URLs |
| processHtml(fn) | processAfterDownload | Transform the parsed HTML (cheerio $) |
| processHtmlAsync(fn) | processAfterDownload | Async version of processHtml |

import {lifeCycle, resource} from 'website-scrap-engine';

const {ResourceType} = resource;
const lc = lifeCycle.defaultLifeCycle();

// Skip all URLs containing "/api/"
lc.linkRedirect.push(lifeCycle.adapter.skipProcess(
  (url) => url.includes('/api/')
));

// Drop images from download but still rewrite their links
lc.processBeforeDownload.push(lifeCycle.adapter.dropResource(
  (res) => res.type === ResourceType.Binary && res.url.endsWith('.png')
));

Architecture

Pipeline Life Cycle

Resources are processed through a sequential pipeline of hook arrays. Each stage is an array of functions executed in order. Returning void/undefined from any function discards the resource from that stage onward.

init (once per downloader/worker startup)
 |
 v
URL
 |
 v
1. linkRedirect -----> skip or redirect URLs before processing
 |
 v
2. detectResourceType -> determine type (Html, Css, Binary, Svg, SiteMap, etc.)
 |
 v
3. createResource ----> build a Resource with save paths and relative replacement paths
 |
 v
4. processBeforeDownload -> filter/modify resources; link replacement in parent happens after this
 |
 v
5. download ----------> fetch resource via HTTP (loop ends early once body is set)
 |
 v
6. processAfterDownload -> parse content, discover child resources via submit() callback
 |
 v
7. saveToDisk --------> write to local filesystem
 |
 v
dispose (once per downloader shutdown / worker exit)

Consumers extend the pipeline by prepending or appending functions to any stage array via defaultLifeCycle(). See Usage for examples.
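For instance, a minimal sketch of both styles, using the same hook signature as the processBeforeDownload example under Usage (the logging hook is purely illustrative):

import {lifeCycle} from 'website-scrap-engine';

const lc = lifeCycle.defaultLifeCycle();

// Prepend: runs before the default handlers of the stage.
lc.processBeforeDownload.unshift((res) => {
  if (res.depth > 5) return; // returning undefined discards the resource
  return res;
});

// Append: runs after the default handlers of the same stage.
lc.processBeforeDownload.push((res) => {
  console.log('about to download', res.url);
  return res;
});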

Default Pipeline Handlers

| Stage | Default handlers |
|---|---|
| linkRedirect | skipLinks - filters out non-HTTP URI schemes (mailto, javascript, data, etc.) |
| detectResourceType | detectResourceType - infers type from element/context |
| createResource | createResource - builds Resource with URL resolution, save path, and replace path |
| download | downloadResource, downloadStreamingResource, readOrCopyLocalResource |
| processAfterDownload | processRedirectedUrl, processHtml, processHtmlMetaRefresh, processSvg, processCss, processSiteMap |
| saveToDisk | saveHtmlToDisk, saveResourceToDisk |

Resource Types

Defined in ResourceType enum:

| Type | Encoding | Description |
|---|---|---|
| Binary | null | Not parsed, saved as-is |
| Html | utf8 | Parsed with cheerio, links discovered and rewritten |
| Css | utf8 | CSS url() references extracted and rewritten |
| CssInline | utf8 | Inline <style> blocks and style attributes |
| SiteMap | utf8 | URLs discovered but not rewritten |
| Svg | utf8 | Parsed with cheerio (same as HTML) |
| StreamingBinary | null | Streamed directly to disk, for large files |
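As a sketch of how the enum might be used in a custom hook — the (type, url) hook signature assumed here is not confirmed by this README, so check the stage's actual hook type before relying on it:

import {lifeCycle, resource} from 'website-scrap-engine';

const {ResourceType} = resource;
const lc = lifeCycle.defaultLifeCycle();

// Assumed signature: (type, url) => ResourceType | void.
// Route large archives through the streaming downloader
// instead of buffering them in memory.
lc.detectResourceType.push((type, url) => {
  if (/\.(zip|tar\.gz|iso)$/.test(url)) {
    return ResourceType.StreamingBinary;
  }
  return type;
});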

HTML Source Definitions

The scraper discovers linked resources from HTML using configurable source definitions. The defaults cover:

  • Images: img[src], img[srcset], picture source[srcset]
  • Styles: link[rel="stylesheet"], <style> blocks, [style] attributes
  • Scripts: script[src]
  • Links: a[href], frame[src], iframe[src]
  • Media: video[src], video[poster], audio[src], source[src], track[src]
  • SVG: *[xlink:href], *[href]
  • Meta: meta[property="og:image"], og:audio, og:video and their variants
  • Other: embed[src], object[data], input[src], [background], link[rel*="icon"], link[rel*="preload"]

Override via options.sources with an array of {selector, attr, type} definitions.
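A sketch of an options module with custom source definitions (img[data-src] is an illustrative lazy-loading attribute; whether custom definitions replace or extend the defaults should be verified, so the standard link selector is restated here):

import {lifeCycle, options, resource} from 'website-scrap-engine';

const {defaultDownloadOptions} = options;
const {ResourceType} = resource;

export default defaultDownloadOptions({
  ...lifeCycle.defaultLifeCycle(),
  localRoot: '/path/to/save',
  maxDepth: 3,
  initialUrl: ['https://example.com'],
  sources: [
    // Lazy-loaded images exposed via a data attribute.
    {selector: 'img[data-src]', attr: 'data-src', type: ResourceType.Binary},
    // Regular page links, restated in case custom definitions replace the defaults.
    {selector: 'a[href]', attr: 'href', type: ResourceType.Html},
  ],
});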

Key Abstractions

  • Resource (src/resource.ts) - Central data object carrying URL, save path, replacement path, body, and metadata. RawResource is the serializable subset used for cross-thread communication.
  • PipelineExecutor (interface in src/life-cycle/pipeline-executor.ts, impl in src/downloader/pipeline-executor-impl.ts) - Orchestrates life cycle execution. createAndProcessResource() runs stages 1-4 in one call.
  • AbstractDownloader (src/downloader/main.ts) - Base class with PQueue-based concurrency, URL deduplication, and the download loop.
  • SingleThreadDownloader (src/downloader/single.ts) - Runs all pipeline stages in the main thread.
  • MultiThreadDownloader (src/downloader/multi.ts) - Downloads in main thread, sends to worker pool for post-processing.

Multi-Thread Processing

Use multi-thread processing when post-download work (HTML/CSS parsing, link discovery) is CPU-intensive.

Main thread:

  • Runs the download queue with PQueue concurrency control
  • Executes stages 1-5 (linkRedirect through download)
  • Transfers downloaded resources to worker threads
  • Receives discovered child resources back and enqueues non-duplicates

Worker threads:

  • Receive downloaded resources from the main thread
  • Execute stages 6-7 (processAfterDownload + saveToDisk)
  • Parse HTML/CSS/SVG, discover child resources
  • Run stages 1-4 on discovered children to prepare them
  • Send prepared child resources back to the main thread as RawResource[]

The effective worker count is Math.min(concurrency, workerCount), so the pool never exceeds the download concurrency. The worker pool uses a 2-pass water-fill algorithm to balance tasks across workers by load.
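Usage mirrors the single-thread example; passing workerCount through the override-options argument is an assumption based on the override pattern shown under Usage:

import path from 'path';
import {downloader} from 'website-scrap-engine';

const {MultiThreadDownloader} = downloader;

const d = new MultiThreadDownloader(
  'file://' + path.resolve('my-options.js'),
  {concurrency: 12, workerCount: 4} // workerCount as an override here is assumed, not confirmed
);
d.start();
d.onIdle().then(() => d.dispose());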

Logging

The library uses log4js with dedicated logger categories:

| Logger | Purpose |
|---|---|
| skip | Resources filtered/discarded at any pipeline stage |
| skipExternal | External resources skipped by scope |
| retry | HTTP retry attempts with backoff details |
| error | Download and processing errors |
| notFound | 404 responses |
| request / response | HTTP request/response logging |
| complete | Successfully processed resources |
| mkdir | Directory creation |
| adjustConcurrency | Runtime concurrency changes |

Configure logging via options.configureLogger and options.logSubDir.
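A hedged sketch of what that might look like in the options module — the exact types of configureLogger (boolean flag vs. log4js configuration callback) and logSubDir are assumptions to verify against the source:

import {lifeCycle, options} from 'website-scrap-engine';

const {defaultDownloadOptions} = options;

export default defaultDownloadOptions({
  ...lifeCycle.defaultLifeCycle(),
  localRoot: '/path/to/save',
  initialUrl: ['https://example.com'],
  // Assumed: write per-category log files under a subdirectory of
  // localRoot; configureLogger may instead accept a callback.
  logSubDir: 'logs',
  configureLogger: true,
});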

Key Dependencies

  • cheerio - HTML/SVG parsing and manipulation
  • got - HTTP client with retry logic
  • p-queue - Download concurrency control
  • urijs - URL resolution and path generation
  • css-url-parser - CSS url() extraction
  • srcset - srcset attribute parsing

License

ISC