Web structure
A powerful and flexible web scraping library built with TypeScript and Puppeteer. It supports concurrent scraping, recursive crawling, and intelligent content extraction with DOM hierarchy awareness.
Features
- Concurrent Processing: Parallel processing of multiple selectors and pages
- DOM Hierarchy Aware: Smart content extraction that respects DOM structure
- Recursive Crawling: Ability to crawl through child pages with depth control
- Flexible Selectors: Support for both single and multiple CSS selectors
- Retry Mechanism: Built-in retry with exponential backoff for reliability
- Deduplication: Automatic deduplication of content and URLs
- Structured Output: Clean, structured JSON output with metadata
Installation
npm install web-structure
Quick Start
import { scraping } from 'web-structure';
// Basic usage
const result = await scraping('https://example.com');
// Advanced usage with options
const advancedResult = await scraping('https://example.com', {
maxDepth: 2,
selectors: {
headings: ['h1', 'h2', 'h3'],
content: '.article-content',
links: 'a.important-link'
},
excludeChildPage: (url) => url.includes('login'),
withConsole: true
});
Configuration Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| maxDepth | number | 0 | Maximum depth for recursive crawling |
| excludeChildPage | (url: string) => boolean | () => false | Function to determine if a URL should be skipped |
| selectors | { [key: string]: string \| string[] } | See below | Selectors to extract content |
| withConsole | boolean | true | Whether to show console information |
| breakWhenFailed | boolean | false | Whether to break when a page fails |
| retryCount | number | 3 | Number of retries when scraping fails |
| waitForSelectorTimeout | number | 12000 | Timeout for waiting for a selector (ms) |
| waitForPageLoadTimeout | number | 12000 | Timeout for waiting for page load (ms) |
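For example, a crawl of a slow or flaky site might loosen the timeouts and raise the retry count. The sketch below is illustrative only: the option names come from the table above, while the URL and the numeric values are placeholders to be tuned to the target site.
import { scraping } from 'web-structure';

// Illustrative values only; option names are taken from the table above.
const slowSiteResult = await scraping('https://example.com/docs', {
  maxDepth: 1,                   // also crawl one level of child pages
  retryCount: 5,                 // retry a failed page up to five times
  waitForSelectorTimeout: 20000, // allow slow-rendering selectors (ms)
  waitForPageLoadTimeout: 20000, // allow slow page loads (ms)
  breakWhenFailed: false,        // keep going if a single page fails
  withConsole: false             // suppress progress logging
});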
Default Selectors
{
headings: ['h1', 'h2', 'h3', 'h4', 'h5'],
paragraphs: 'p',
articles: 'article',
spans: 'span',
orderLists: 'ol',
lists: 'ul'
}
Output Structure
interface ScrapingResult {
url: string; // URL of the scraped page
title: string; // Page title
data: { // Extracted content
[key: string]: string | string[];
};
timestamp: string; // ISO timestamp of when the page was scraped
childPages?: ScrapingResult[]; // Results from child pages (if maxDepth > 0)
}
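Because childPages nests further results, a crawl with maxDepth > 0 returns a tree. The helper below is a small sketch, not part of web-structure, that flattens that tree into a list of url/title pairs; it declares a local structural type rather than assuming the library exports ScrapingResult for direct import.
import { scraping } from 'web-structure';

// Local structural type matching the shape documented above.
type PageNode = {
  url: string;
  title: string;
  childPages?: PageNode[];
};

// Hypothetical helper: walk the result tree and collect every scraped page.
function flattenPages(page: PageNode): { url: string; title: string }[] {
  const children = page.childPages ?? [];
  return [{ url: page.url, title: page.title }, ...children.flatMap(flattenPages)];
}

const tree = await scraping('https://example.com', { maxDepth: 2 });
console.log(flattenPages(tree as PageNode).map((p) => p.url));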
Advanced Features
DOM Hierarchy Awareness
The library intelligently handles nested elements to prevent duplicate content. If a parent element is selected, its child elements won't be included separately in the results.
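For example, selecting both a container and elements that live inside it does not duplicate the nested text. The call below illustrates the idea; the URL is a placeholder.
import { scraping } from 'web-structure';

const blogResult = await scraping('https://example.com/blog', {
  selectors: {
    articles: 'article', // each matched <article> contributes its content once
    paragraphs: 'p'      // <p> tags nested inside a selected <article> are not repeated here
  }
});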
Concurrent Processing
- Multiple selectors are processed concurrently
- Array selectors (e.g., ['h1', 'h2', 'h3']) are processed in parallel
- Child pages are processed sequentially to prevent overwhelming the target server
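Conceptually, the fan-out resembles the sketch below: selectors are awaited together with Promise.all, while child pages are visited one at a time. This is an illustration of the pattern, not the library's actual internals.
// Conceptual sketch only, not web-structure source code.
async function extractConcurrently(
  extract: (selector: string) => Promise<string[]>,
  selectors: string[]
): Promise<string[]> {
  // Every selector in the array is extracted in parallel.
  const perSelector = await Promise.all(selectors.map((s) => extract(s)));
  return perSelector.flat();
}

async function visitSequentially(
  visit: (url: string) => Promise<void>,
  childUrls: string[]
): Promise<void> {
  // Child pages are visited one after another to avoid overloading the server.
  for (const url of childUrls) {
    await visit(url);
  }
}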
Retry Mechanism
Built-in retry mechanism with exponential backoff:
- Retries failed operations with increasing delays
- Configurable retry count
- Includes random jitter to prevent thundering herd problems
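The pattern described here looks roughly like the following helper; the delays, the jitter formula, and the function name are illustrative rather than the library's exact implementation.
// Illustrative retry helper: exponential backoff with random jitter,
// giving up after `retryCount` additional attempts.
async function withRetry<T>(
  operation: () => Promise<T>,
  retryCount = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retryCount; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === retryCount) break; // out of retries
      // Double the delay each attempt and add jitter so simultaneous retries spread out.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}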
Error Handling
The library provides robust error handling:
- Failed selector extractions don't stop the entire process
- Each selector and page has independent error handling
- Detailed error logging when withConsole is enabled
- Option to break on failures with breakWhenFailed
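In practice this means a run can be made strict or lenient through the options above. The snippet below is a usage sketch with a placeholder URL; whether breakWhenFailed surfaces as a thrown error or simply stops the crawl early is up to the library, so the try/catch is defensive.
import { scraping } from 'web-structure';

try {
  // Strict run: log details and stop as soon as any page fails.
  const strictResult = await scraping('https://example.com', {
    withConsole: true,
    breakWhenFailed: true
  });
  console.log(strictResult.title);
} catch (error) {
  console.error('Scraping aborted:', error);
}

// Lenient run: individual selector or page failures are tolerated and the
// successfully scraped content is still returned.
const lenientResult = await scraping('https://example.com', {
  breakWhenFailed: false
});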
Limitations
- Maximum crawling depth is limited to 10 levels
- Maximum of 5 child links per page are processed
- Respects robots.txt and rate limiting by default
- Requires JavaScript to be enabled on target pages
License
MIT
