# @p-yiush07/scraper-engine
High-performance web scraper with built-in anti-bot bypass for Node.js, built in Rust.
Handles 401 (Unauthorized) and 403 (Forbidden) errors automatically by emulating real browser TLS fingerprints, rotating headers across Chrome, Firefox, Safari, and Edge, and retrying with different browser profiles.
## Installation

```sh
npm install @p-yiush07/scraper-engine
```

Prebuilt native binaries are included for macOS, Linux, and Windows.
## Quick Start

```js
const { Scraper } = require('@p-yiush07/scraper-engine');

const scraper = new Scraper();

// Auto-extract article content — no selectors needed
const content = await scraper.extractFromUrl('https://www.reuters.com/some-article');

console.log(content.headline);   // "Article Title"
console.log(content.paragraphs); // ["First paragraph...", ...]
console.log(content.links);      // [{ text: "...", url: "..." }, ...]
console.log(content.images);     // [{ src: "...", alt: "..." }, ...]
```

## API
### new Scraper(config?)
Creates a scraper instance with optional configuration.
```js
const scraper = new Scraper({
  browserEmulation: 'chrome', // "chrome" | "firefox" | "safari" | "edge"
  maxRetries: 5,              // Retries on 401/403/429 with different browser profiles (default: 3)
  retryBaseDelayMs: 1000,     // Base delay for exponential backoff (default: 1000)
  timeoutMs: 30000,           // Request timeout in ms (default: 30000)
  rateLimitPerSecond: 2,      // Max requests/sec (default: unlimited)
  followRedirects: true,      // Follow redirects (default: true)
  maxRedirects: 10,           // Max redirect hops (default: 10)
  cookieStore: true,          // Enable cookie jar (default: true)
  proxy: {                    // Optional proxy
    url: 'socks5://127.0.0.1:1080',
    username: 'user',
    password: 'pass',
  },
  customHeaders: [            // Optional extra headers
    { name: 'X-Custom', value: 'value' },
  ],
  userAgents: [               // Optional custom UA list (overrides built-in rotation)
    'Mozilla/5.0 ...',
  ],
});
```

### scraper.extractFromUrl(url): Promise&lt;ExtractedContent&gt;
Fetches a URL and auto-extracts structured content. Works on any website — no CSS selectors needed.
```js
const content = await scraper.extractFromUrl('https://example.com/article');
```

Returns:

```ts
{
  title: string;        // Page title
  headline: string;     // Article headline (h1)
  description: string;  // Meta description
  paragraphs: string[]; // Article body paragraphs
  links: LinkInfo[];    // All links: { text, url }
  images: ImageInfo[];  // All images: { src, alt }
}
```

The extraction uses a multi-strategy cascade that handles React sites, semantic HTML, and common CMS patterns, and falls back to content-density analysis.
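To illustrate what a content-density fallback means, here is a toy heuristic: score each container by how much plain text it holds relative to link text, and keep the densest block. This is a simplified sketch, not the package's actual algorithm; `extractDensest` is a hypothetical helper written for this example.

```js
// Toy content-density heuristic (illustrative only, NOT the library's
// real implementation): link-heavy blocks are usually navigation or
// boilerplate, so text inside <a> tags counts against a block's score.
function extractDensest(html) {
  // Naive regex scan over common container elements.
  const blockRe = /<(div|article|section)\b[^>]*>([\s\S]*?)<\/\1>/gi;
  let best = { score: -Infinity, text: '' };
  for (const [, , inner] of html.matchAll(blockRe)) {
    const linkText = [...inner.matchAll(/<a\b[^>]*>([\s\S]*?)<\/a>/gi)]
      .map((m) => m[1].replace(/<[^>]+>/g, ''))
      .join('');
    const text = inner.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
    // Plain text adds to the score; link text is penalized.
    const score = text.length - 2 * linkText.length;
    if (score > best.score) best = { score, text };
  }
  return best.text;
}

const html = `
  <div><a href="/a">Nav</a> <a href="/b">More nav</a></div>
  <article>The actual article body, with enough plain text
  to outweigh the navigation block above.</article>`;
console.log(extractDensest(html));
```

A production cascade would try structured signals first (semantic tags, CMS markup, JSON-LD) and only fall back to density scoring when those fail.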
### scraper.fetch(url): Promise&lt;ScrapeResponse&gt;
Fetches a URL with full anti-bot measures and returns raw HTML.
```js
const response = await scraper.fetch('https://example.com');
console.log(response.status); // 200
console.log(response.body);   // Raw HTML string
console.log(response.url);    // Final URL after redirects
```

### scraper.scrape(url, selector): Promise&lt;ParsedElement[]&gt;
Fetches a URL and extracts elements matching a CSS selector.
```js
const elements = await scraper.scrape('https://news.ycombinator.com', '.titleline > a');

for (const el of elements) {
  console.log(el.text);       // "Article title"
  console.log(el.tag);        // "a"
  console.log(el.html);       // Inner HTML
  console.log(el.attributes); // [{ name: "href", value: "https://..." }]
}
```

### scraper.fetchMany(urls, concurrency?): Promise&lt;ScrapeResponse[]&gt;
Fetches multiple URLs concurrently with bounded concurrency.
```js
const urls = ['https://example.com/1', 'https://example.com/2', 'https://example.com/3'];
const responses = await scraper.fetchMany(urls, 5); // 5 concurrent requests
```

### Scraper.parse(html, selector): ParsedElement[] (static)
Parses existing HTML with a CSS selector. No network request.
```js
const elements = Scraper.parse(htmlString, 'div.product > h2');
```

### Scraper.extract(html): ExtractedContent (static)
Auto-extracts structured content from existing HTML. No network request.
```js
const content = Scraper.extract(htmlString);
```

## How Anti-Bot Bypass Works
When a request gets blocked (401/403/429), the scraper automatically:
- Switches browser identity — rotates to a completely different browser's TLS fingerprint (e.g., Chrome → Firefox → Safari)
- Updates all headers — User-Agent, Sec-Ch-Ua, Accept, and other headers match the new browser so nothing looks suspicious
- Handles cookies — automatically stores and resends session cookies to solve challenge-response flows
- Backs off — waits with exponential backoff + jitter before retrying
- Repeats — tries up to `maxRetries` times with a different browser profile each attempt
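The steps above can be sketched as a retry loop. This is a simplified illustration, not the package's internal (Rust) implementation; `doFetch` stands in for a profile-aware transport, and the jitter value is an assumption.

```js
// Simplified sketch of retry-with-rotation (illustrative only).
const PROFILES = ['chrome', 'firefox', 'safari', 'edge'];

async function fetchWithRotation(url, doFetch, maxRetries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    // Each attempt uses a different browser profile
    // (User-Agent, Sec-Ch-Ua headers, and TLS fingerprint together).
    const profile = PROFILES[attempt % PROFILES.length];
    const res = await doFetch(url, profile);
    if (![401, 403, 429].includes(res.status)) return res;

    // Exponential backoff with jitter before the next attempt.
    const delay = baseDelayMs * 2 ** attempt + Math.random() * 250;
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  throw new Error(`Blocked after ${maxRetries} retries: ${url}`);
}
```

The key detail is that the profile switch happens only on blocking status codes; successful responses return immediately, so well-behaved requests pay no overhead.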
This works because many anti-bot systems identify scrapers by their TLS fingerprint (JA3/JA4) — the scraper emulates 100+ real browser fingerprints at the TLS level, not just the User-Agent string.
## TypeScript Support
Full TypeScript definitions are included. All types are auto-generated:
```ts
import { Scraper, ScraperConfig, ExtractedContent, ScrapeResponse, ParsedElement } from '@p-yiush07/scraper-engine';
```

## Supported Platforms
| Platform | Architecture          | Status    |
|----------|-----------------------|-----------|
| macOS    | ARM64 (Apple Silicon) | Available |
| macOS    | x86_64 (Intel)        | Available |
| Linux    | x86_64 (GNU)          | Available |
| Windows  | x86_64 (MSVC)         | Available |
Prebuilt binaries are included for all platforms, so no Rust toolchain is needed at install time.
## Issues
Report bugs and request features at scraper-engine-issues.
## License
MIT
