Web structure
A powerful and flexible web scraping library built with TypeScript and Puppeteer. It supports concurrent scraping, recursive crawling, and intelligent content extraction with DOM hierarchy awareness.
Features
- Concurrent Processing: Parallel processing of multiple selectors and pages
- DOM Hierarchy Aware: Smart content extraction that respects DOM structure
- Recursive Crawling: Ability to crawl through child pages with depth control
- Flexible Selectors: Support for both single and multiple CSS selectors
- Retry Mechanism: Built-in retry with exponential backoff for reliability
- Deduplication: Automatic deduplication of content and URLs
- Structured Output: Clean, structured JSON output with metadata
Installation
npm install web-structure
Quick Start
import { scraping } from 'web-structure';
// Basic usage
const result = await scraping('https://example.com');
// Advanced usage with options
const advancedResult = await scraping('https://example.com', {
maxDepth: 2,
selectors: {
headings: ['h1', 'h2', 'h3'],
content: '.article-content',
links: 'a.important-link'
},
excludeChildPage: (url) => url.includes('login'),
withConsole: true
});
Configuration Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| maxDepth | number | 0 | Maximum depth for recursive crawling |
| excludeChildPage | (url: string) => boolean | () => false | Function to determine if a URL should be skipped |
| selectors | { [key: string]: string \| string[] } | See below | Selectors to extract content |
| withConsole | boolean | true | Whether to show console information |
| breakWhenFailed | boolean | false | Whether to break when a page fails |
| retryCount | number | 3 | Number of retries when scraping fails |
| waitForSelectorTimeout | number | 12000 | Timeout for waiting for a selector (ms) |
| waitForPageLoadTimeout | number | 12000 | Timeout for waiting for page load (ms) |
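For example, a crawl of a slow or flaky site might loosen the timeouts and raise the retry count. The sketch below is illustrative only: the option names come from the table above, while the URL and the numeric values are placeholders to be tuned to the target site.
import { scraping } from 'web-structure';

// Illustrative values only; option names are taken from the table above.
const slowSiteResult = await scraping('https://example.com/docs', {
  maxDepth: 1,                   // also crawl one level of child pages
  retryCount: 5,                 // retry a failed page up to five times
  waitForSelectorTimeout: 20000, // allow slow-rendering selectors (ms)
  waitForPageLoadTimeout: 20000, // allow slow page loads (ms)
  breakWhenFailed: false,        // keep going if a single page fails
  withConsole: false             // suppress progress logging
});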
Default Selectors
{
headings: ['h1', 'h2', 'h3', 'h4', 'h5'],
paragraphs: 'p',
articles: 'article',
spans: 'span',
orderLists: 'ol',
lists: 'ul'
}
Output Structure
interface ScrapingResult {
url: string; // URL of the scraped page
title: string; // Page title
data: { // Extracted content
[key: string]: string | string[];
};
timestamp: string; // ISO timestamp of when the page was scraped
childPages?: ScrapingResult[]; // Results from child pages (if maxDepth > 0)
}
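Because childPages nests further results, a crawl with maxDepth > 0 returns a tree. The helper below is a small sketch, not part of web-structure, that flattens that tree into a list of url/title pairs; it declares a local structural type rather than assuming the library exports ScrapingResult for direct import.
import { scraping } from 'web-structure';

// Local structural type matching the shape documented above.
type PageNode = {
  url: string;
  title: string;
  childPages?: PageNode[];
};

// Hypothetical helper: walk the result tree and collect every scraped page.
function flattenPages(page: PageNode): { url: string; title: string }[] {
  const children = page.childPages ?? [];
  return [{ url: page.url, title: page.title }, ...children.flatMap(flattenPages)];
}

const tree = await scraping('https://example.com', { maxDepth: 2 });
console.log(flattenPages(tree as PageNode).map((p) => p.url));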
Advanced Features
DOM Hierarchy Awareness
The library intelligently handles nested elements to prevent duplicate content. If a parent element is selected, its child elements won't be included separately in the results.
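For example, selecting both a container and elements that live inside it does not duplicate the nested text. The call below illustrates the idea; the URL is a placeholder.
import { scraping } from 'web-structure';

const blogResult = await scraping('https://example.com/blog', {
  selectors: {
    articles: 'article', // each matched <article> contributes its content once
    paragraphs: 'p'      // <p> tags nested inside a selected <article> are not repeated here
  }
});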
Concurrent Processing
- Multiple selectors are processed concurrently
- Array selectors (e.g., ['h1', 'h2', 'h3']) are processed in parallel
- Child pages are processed sequentially to prevent overwhelming the target server
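Conceptually, the fan-out resembles the sketch below: selectors are awaited together with Promise.all, while child pages are visited one at a time. This is an illustration of the pattern, not the library's actual internals.
// Conceptual sketch only, not web-structure source code.
async function extractConcurrently(
  extract: (selector: string) => Promise<string[]>,
  selectors: string[]
): Promise<string[]> {
  // Every selector in the array is extracted in parallel.
  const perSelector = await Promise.all(selectors.map((s) => extract(s)));
  return perSelector.flat();
}

async function visitSequentially(
  visit: (url: string) => Promise<void>,
  childUrls: string[]
): Promise<void> {
  // Child pages are visited one after another to avoid overloading the server.
  for (const url of childUrls) {
    await visit(url);
  }
}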
Retry Mechanism
Built-in retry mechanism with exponential backoff:
- Retries failed operations with increasing delays
- Configurable retry count
- Includes random jitter to prevent thundering herd problems
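The pattern described here looks roughly like the following helper; the delays, the jitter formula, and the function name are illustrative rather than the library's exact implementation.
// Illustrative retry helper: exponential backoff with random jitter,
// giving up after `retryCount` additional attempts.
async function withRetry<T>(
  operation: () => Promise<T>,
  retryCount = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retryCount; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === retryCount) break; // out of retries
      // Double the delay each attempt and add jitter so simultaneous retries spread out.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}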
Error Handling
The library provides robust error handling:
- Failed selector extractions don't stop the entire process
- Each selector and page has independent error handling
- Detailed error logging when withConsole is enabled
- Option to break on failures with breakWhenFailed
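In practice this means a run can be made strict or lenient through the options above. The snippet below is a usage sketch with a placeholder URL; whether breakWhenFailed surfaces as a thrown error or simply stops the crawl early is up to the library, so the try/catch is defensive.
import { scraping } from 'web-structure';

try {
  // Strict run: log details and stop as soon as any page fails.
  const strictResult = await scraping('https://example.com', {
    withConsole: true,
    breakWhenFailed: true
  });
  console.log(strictResult.title);
} catch (error) {
  console.error('Scraping aborted:', error);
}

// Lenient run: individual selector or page failures are tolerated and the
// successfully scraped content is still returned.
const lenientResult = await scraping('https://example.com', {
  breakWhenFailed: false
});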
Limitations
- Maximum crawling depth is limited to 10 levels
- Maximum of 5 child links per page are processed
- Respects robots.txt and rate limiting by default
- Requires JavaScript to be enabled on target pages
License
MIT
