
Web structure

A powerful and flexible web scraping library built with TypeScript and Puppeteer. It supports concurrent scraping, recursive crawling, and intelligent content extraction with DOM hierarchy awareness.

Features

  • Concurrent Processing: Parallel processing of multiple selectors and pages
  • DOM Hierarchy Aware: Smart content extraction that respects DOM structure
  • Recursive Crawling: Ability to crawl through child pages with depth control
  • Flexible Selectors: Support for both single and multiple CSS selectors
  • Retry Mechanism: Built-in retry with exponential backoff for reliability
  • Deduplication: Automatic deduplication of content and URLs
  • Structured Output: Clean, structured JSON output with metadata

Installation

npm install web-structure

Quick Start

import { scraping } from 'web-structure';

// Basic usage
const result = await scraping('https://example.com');

// Advanced usage with options
const detailed = await scraping('https://example.com', {
  maxDepth: 2,
  selectors: {
    headings: ['h1', 'h2', 'h3'],
    content: '.article-content',
    links: 'a.important-link'
  },
  excludeChildPage: (url) => url.includes('login'),
  withConsole: true
});

Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| maxDepth | number | 0 | Maximum depth for recursive crawling |
| excludeChildPage | (url: string) => boolean | () => false | Function to determine if a URL should be skipped |
| selectors | { [key: string]: string \| string[] } | See below | Selectors to extract content |
| withConsole | boolean | true | Whether to show console information |
| breakWhenFailed | boolean | false | Whether to break when a page fails |
| retryCount | number | 3 | Number of retries when scraping fails |
| waitForSelectorTimeout | number | 12000 | Timeout for waiting for a selector (ms) |
| waitForPageLoadTimeout | number | 12000 | Timeout for waiting for page load (ms) |
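
The reliability-related options can be combined in one call. All of the option names below come from the table above; the specific values are purely illustrative, not recommendations.

import { scraping } from 'web-structure';

// Illustrative values only; every option name is taken from the table above.
const result = await scraping('https://example.com', {
  retryCount: 5,                   // retry a failed page/selector up to 5 times
  waitForSelectorTimeout: 20000,   // wait up to 20 s for a selector to appear
  waitForPageLoadTimeout: 20000,   // wait up to 20 s for the page to load
  breakWhenFailed: true,           // stop the whole crawl when a page fails
  withConsole: false               // silence console output
});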

Default Selectors

{
  headings: ['h1', 'h2', 'h3', 'h4', 'h5'],
  paragraphs: 'p',
  articles: 'article',
  spans: 'span',
  orderLists: 'ol',
  lists: 'ul'
}

Output Structure

interface ScrapingResult {
  url: string;          // URL of the scraped page
  title: string;        // Page title
  data: {              // Extracted content
    [key: string]: string | string[];
  };
  timestamp: string;    // ISO timestamp of when the page was scraped
  childPages?: ScrapingResult[]; // Results from child pages (if maxDepth > 0)
}
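
Because child pages carry the same shape, results can be processed recursively. The sketch below assumes the ScrapingResult interface above is exported by the package; if it is not, an equivalent local type can be declared instead.

import { scraping, type ScrapingResult } from 'web-structure';

// Sketch: walk the crawl tree and collect every scraped URL,
// including results from child pages.
function collectUrls(result: ScrapingResult): string[] {
  const urls = [result.url];
  for (const child of result.childPages ?? []) {
    urls.push(...collectUrls(child));
  }
  return urls;
}

const tree = await scraping('https://example.com', { maxDepth: 2 });
console.log(collectUrls(tree));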

Advanced Features

DOM Hierarchy Awareness

The library intelligently handles nested elements to prevent duplicate content. If a parent element is selected, its child elements won't be included separately in the results.
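
Conceptually, this is equivalent to keeping only the top-level matches: any matched element whose ancestor was also matched is dropped. The snippet below is just an illustration of that idea using the DOM's Node.contains, not the library's actual code.

// Illustration only: keep a matched element only if no other matched
// element is one of its ancestors.
const keepTopLevel = (matches: Element[]): Element[] =>
  matches.filter(el =>
    !matches.some(other => other !== el && other.contains(el))
  );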

Concurrent Processing

  • Multiple selectors are processed concurrently
  • Array selectors (e.g., ['h1', 'h2', 'h3']) are processed in parallel
  • Child pages are processed sequentially to prevent overwhelming the target server
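
The pattern reads roughly like the sketch below; scrapeSelector and scrapePage are hypothetical stand-ins for the library's internals, shown only to make the parallel-versus-sequential split concrete.

// Hypothetical helpers standing in for the library's internals.
declare function scrapeSelector(page: unknown, selector: string): Promise<string[]>;
declare function scrapePage(url: string): Promise<unknown>;

async function processPage(
  page: unknown,
  selectors: Record<string, string | string[]>,
  childLinks: string[]
) {
  // All selectors are kicked off concurrently...
  const entries = await Promise.all(
    Object.entries(selectors).map(async ([key, selector]) => {
      // ...and array selectors fan out in parallel as well.
      const value = Array.isArray(selector)
        ? (await Promise.all(selector.map(s => scrapeSelector(page, s)))).flat()
        : await scrapeSelector(page, selector);
      return [key, value] as const;
    })
  );

  // Child pages are visited one at a time to avoid hammering the target server.
  const childPages: unknown[] = [];
  for (const url of childLinks) {
    childPages.push(await scrapePage(url));
  }

  return { data: Object.fromEntries(entries), childPages };
}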

Retry Mechanism

Built-in retry mechanism with exponential backoff:

  • Retries failed operations with increasing delays
  • Configurable retry count
  • Includes random jitter to prevent thundering herd problems
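
The behaviour above can be pictured with a generic helper like the one below; the base delay and jitter range are illustrative, not the library's exact values.

// Sketch of retry with exponential backoff and jitter (illustrative values).
async function withRetry<T>(fn: () => Promise<T>, retryCount = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retryCount; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retryCount) break;
      // Exponential backoff (1 s, 2 s, 4 s, ...) plus random jitter so that
      // many failing requests do not all retry at the same instant.
      const delay = 1000 * 2 ** attempt + Math.random() * 250;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}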

Error Handling

The library provides robust error handling:

  • Failed selector extractions don't stop the entire process
  • Each selector and page has independent error handling
  • Detailed error logging when withConsole is enabled
  • Option to break on failures with breakWhenFailed

Limitations

  • Maximum crawling depth is limited to 10 levels
  • Maximum of 5 child links per page are processed
  • Respects robots.txt and rate limiting by default
  • Requires JavaScript to be enabled on target pages

License

MIT