# html-data-scraper (v2.0.0)
A resilient, stealth-enabled web scraper built on Playwright. Scrape data from multiple URLs concurrently using declarative CSS selectors or custom JavaScript evaluation, with built-in anti-detection and automatic retries.
## Features
- Playwright-powered -- uses Chromium via Playwright for reliable, modern browser automation
- Concurrent scraping -- distributes URLs across multiple browser tabs automatically
- Declarative CSS selectors -- extract data using simple selector strings alongside custom functions
- Stealth by default -- randomized user agents, viewports, and human-like delays
- Resilient crawling -- automatic retries with exponential backoff, rate limiting, and error collection
- TypeScript-first -- fully typed API with strict null checks
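The retry behaviour listed above pairs each failed attempt with an exponentially growing delay. A minimal sketch of such a backoff schedule (the function name, base delay, and cap are illustrative assumptions, not the library's internals):

```javascript
// Compute the wait before retry attempt `attempt` (0-based), doubling a
// base delay each time and capping it at a maximum.
// Hypothetical helper -- not html-data-scraper's actual implementation.
function backoffDelay(attempt, baseMs = 500, maxMs = 8000) {
  const delay = baseMs * 2 ** attempt;
  return Math.min(delay, maxMs);
}

// Attempts 0..4 wait 500, 1000, 2000, 4000, 8000 ms respectively.
console.log([0, 1, 2, 3, 4].map((n) => backoffDelay(n)));
```

Real implementations usually add random jitter on top of the doubled delay so that concurrent tabs do not retry in lockstep.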
## Quick start

```sh
npm install html-data-scraper
```

```js
import htmlDataScraper from 'html-data-scraper';

const { results, browserInstance } = await htmlDataScraper([
  'https://en.wikipedia.org/wiki/Web_scraping',
], {
  onEvaluateForEachUrl: {
    heading: '#firstHeading', // CSS selector -> textContent
    title: () => document.title, // function -> page.evaluate()
  },
});

console.log(results[0].evaluates);
// { heading: 'Web scraping', title: 'Web scraping - Wikipedia' }

await browserInstance.close();
```

## Documentation
| Guide | Description |
|---|---|
| Getting Started | Installation, basic usage, and first scraper |
| API Reference | Full API documentation with all types and options |
| Stealth | Anti-detection features and configuration |
| Resilience | Retries, rate limiting, and error handling |
| Migration from v1 | Upgrading from v1 (Puppeteer) to v2 (Playwright) |
| Contributing | Development setup and contribution guidelines |
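In the Quick start above, `onEvaluateForEachUrl` accepts both selector strings and functions. One way to picture that dispatch, sketched against a stubbed page object (`resolveEvaluates` and the stub are hypothetical; the real library drives a Playwright `Page`):

```javascript
// For each entry: a string is treated as a CSS selector whose matched
// element's textContent is returned; a function is run in page context.
// Hypothetical sketch, not html-data-scraper's actual code.
async function resolveEvaluates(page, spec) {
  const out = {};
  for (const [key, value] of Object.entries(spec)) {
    out[key] = typeof value === 'string'
      ? await page.$eval(value, (el) => el.textContent) // selector -> text
      : await page.evaluate(value);                     // function -> result
  }
  return out;
}

// Stub standing in for a Playwright Page, for demonstration only.
const stubPage = {
  $eval: async (sel, fn) => fn({ textContent: `text of ${sel}` }),
  evaluate: async (fn) => fn(),
};

resolveEvaluates(stubPage, {
  heading: '#firstHeading',
  title: () => 'Web scraping - Wikipedia',
}).then((data) => console.log(data));
// { heading: 'text of #firstHeading', title: 'Web scraping - Wikipedia' }
```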
## Examples

Ready-to-run example projects in the examples/ folder:

| Example | Description |
|---|---|
| Wikipedia Scraper | Scrape structured data from 6 Wikipedia articles across 3 concurrent tabs using CSS selectors, functions, route interception, and progress tracking |
| News Monitor | Monitor headlines from 5 international news sites with stealth, rate limiting, retries, screenshots, and graceful error handling |
```sh
cd examples/wikipedia-scraper
npm install && npx playwright install chromium
npm start
```

### Example: scrape with error tolerance
```js
const { results } = await htmlDataScraper([
  'https://example.com/page-1',
  'https://example.com/page-2',
  'https://this-will-fail.invalid',
], {
  onEvaluateForEachUrl: {
    title: 'h1',
  },
  resilience: {
    retries: 2,
    continueOnError: true,
  },
});

// Failed URLs have an error field instead of crashing the batch
for (const result of results) {
  if (result.error) {
    console.error(`Failed: ${result.url} - ${result.error}`);
  } else {
    console.log(`${result.url}: ${result.evaluates?.title}`);
  }
}
```

## License
MIT -- Ravidhu Dissanayake
