# html-data-scraper (v2.0.0)
A resilient, stealth-enabled web scraper built on Playwright. Scrape data from multiple URLs concurrently using declarative CSS selectors or custom JavaScript evaluation, with built-in anti-detection and automatic retries.
## Features
- Playwright-powered -- uses Chromium via Playwright for reliable, modern browser automation
- Concurrent scraping -- distributes URLs across multiple browser tabs automatically
- Declarative CSS selectors -- extract data using simple selector strings alongside custom functions
- Stealth by default -- randomized user agents, viewports, and human-like delays
- Resilient crawling -- automatic retries with exponential backoff, rate limiting, and error collection
- TypeScript-first -- fully typed API with strict null checks
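The retry behaviour listed above pairs each failed attempt with an exponentially growing delay. A minimal sketch of such a backoff schedule (the function name, base delay, and cap are illustrative assumptions, not the library's internals):

```javascript
// Compute the wait before retry attempt `attempt` (0-based), doubling a
// base delay each time and capping it at a maximum.
// Hypothetical helper -- not html-data-scraper's actual implementation.
function backoffDelay(attempt, baseMs = 500, maxMs = 8000) {
  const delay = baseMs * 2 ** attempt;
  return Math.min(delay, maxMs);
}

// Attempts 0..4 wait 500, 1000, 2000, 4000, 8000 ms respectively.
console.log([0, 1, 2, 3, 4].map((n) => backoffDelay(n)));
```

Real implementations usually add random jitter on top of the doubled delay so that concurrent tabs do not retry in lockstep.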
## Quick start

```sh
npm install html-data-scraper
```

```js
import htmlDataScraper from 'html-data-scraper';

const { results, browserInstance } = await htmlDataScraper([
  'https://en.wikipedia.org/wiki/Web_scraping',
], {
  onEvaluateForEachUrl: {
    heading: '#firstHeading', // CSS selector -> textContent
    title: () => document.title, // function -> page.evaluate()
  },
});

console.log(results[0].evaluates);
// { heading: 'Web scraping', title: 'Web scraping - Wikipedia' }

await browserInstance.close();
```

## Documentation
| Guide | Description |
|---|---|
| Getting Started | Installation, basic usage, and first scraper |
| API Reference | Full API documentation with all types and options |
| Stealth | Anti-detection features and configuration |
| Resilience | Retries, rate limiting, and error handling |
| Migration from v1 | Upgrading from v1 (Puppeteer) to v2 (Playwright) |
| Contributing | Development setup and contribution guidelines |
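In the Quick start above, `onEvaluateForEachUrl` accepts both selector strings and functions. One way to picture that dispatch, sketched against a stubbed page object (`resolveEvaluates` and the stub are hypothetical; the real library drives a Playwright `Page`):

```javascript
// For each entry: a string is treated as a CSS selector whose matched
// element's textContent is returned; a function is run in page context.
// Hypothetical sketch, not html-data-scraper's actual code.
async function resolveEvaluates(page, spec) {
  const out = {};
  for (const [key, value] of Object.entries(spec)) {
    out[key] = typeof value === 'string'
      ? await page.$eval(value, (el) => el.textContent) // selector -> text
      : await page.evaluate(value);                     // function -> result
  }
  return out;
}

// Stub standing in for a Playwright Page, for demonstration only.
const stubPage = {
  $eval: async (sel, fn) => fn({ textContent: `text of ${sel}` }),
  evaluate: async (fn) => fn(),
};

resolveEvaluates(stubPage, {
  heading: '#firstHeading',
  title: () => 'Web scraping - Wikipedia',
}).then((data) => console.log(data));
// { heading: 'text of #firstHeading', title: 'Web scraping - Wikipedia' }
```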
## Examples

Ready-to-run example projects in the examples/ folder:

| Example | Description |
|---|---|
| Wikipedia Scraper | Scrape structured data from 6 Wikipedia articles across 3 concurrent tabs using CSS selectors, functions, route interception, and progress tracking |
| News Monitor | Monitor headlines from 5 international news sites with stealth, rate limiting, retries, screenshots, and graceful error handling |
```sh
cd examples/wikipedia-scraper
npm install && npx playwright install chromium
npm start
```

### Example: scrape with error tolerance
```js
const { results } = await htmlDataScraper([
  'https://example.com/page-1',
  'https://example.com/page-2',
  'https://this-will-fail.invalid',
], {
  onEvaluateForEachUrl: {
    title: 'h1',
  },
  resilience: {
    retries: 2,
    continueOnError: true,
  },
});

// Failed URLs have an error field instead of crashing the batch
for (const result of results) {
  if (result.error) {
    console.error(`Failed: ${result.url} - ${result.error}`);
  } else {
    console.log(`${result.url}: ${result.evaluates?.title}`);
  }
}
```

## License
MIT -- Ravidhu Dissanayake
