# @p-yiush07/scraper-engine
High-performance web scraper with built-in anti-bot bypass for Node.js, built in Rust.
Handles 401 (Unauthorized) and 403 (Forbidden) errors automatically by emulating real browser TLS fingerprints, rotating headers across Chrome, Firefox, Safari, and Edge, and retrying with different browser profiles.
## Installation

```sh
npm install @p-yiush07/scraper-engine
```

Prebuilt native binaries are included for macOS, Linux, and Windows.
## Quick Start

```js
const { Scraper } = require('@p-yiush07/scraper-engine');

const scraper = new Scraper();

// Auto-extract article content — no selectors needed
const content = await scraper.extractFromUrl('https://www.reuters.com/some-article');

console.log(content.headline);   // "Article Title"
console.log(content.paragraphs); // ["First paragraph...", ...]
console.log(content.links);      // [{ text: "...", url: "..." }, ...]
console.log(content.images);     // [{ src: "...", alt: "..." }, ...]
```

## API
### new Scraper(config?)
Creates a scraper instance with optional configuration.
```js
const scraper = new Scraper({
  browserEmulation: 'chrome', // "chrome" | "firefox" | "safari" | "edge"
  maxRetries: 5,              // Retries on 401/403/429 with different browser profiles (default: 3)
  retryBaseDelayMs: 1000,     // Base delay for exponential backoff (default: 1000)
  timeoutMs: 30000,           // Request timeout in ms (default: 30000)
  rateLimitPerSecond: 2,      // Max requests/sec (default: unlimited)
  followRedirects: true,      // Follow redirects (default: true)
  maxRedirects: 10,           // Max redirect hops (default: 10)
  cookieStore: true,          // Enable cookie jar (default: true)
  proxy: {                    // Optional proxy
    url: 'socks5://127.0.0.1:1080',
    username: 'user',
    password: 'pass',
  },
  customHeaders: [            // Optional extra headers
    { name: 'X-Custom', value: 'value' },
  ],
  userAgents: [               // Optional custom UA list (overrides built-in rotation)
    'Mozilla/5.0 ...',
  ],
});
```

### scraper.extractFromUrl(url): Promise&lt;ExtractedContent&gt;
Fetches a URL and auto-extracts structured content. Works on any website — no CSS selectors needed.
```js
const content = await scraper.extractFromUrl('https://example.com/article');
```

Returns:

```ts
{
  title: string;        // Page title
  headline: string;     // Article headline (h1)
  description: string;  // Meta description
  paragraphs: string[]; // Article body paragraphs
  links: LinkInfo[];    // All links: { text, url }
  images: ImageInfo[];  // All images: { src, alt }
}
```

The extraction uses a multi-strategy cascade that handles React sites, semantic HTML, and common CMS patterns, and falls back to content-density analysis.
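To illustrate what a content-density fallback means, here is a toy heuristic: score each container by how much plain text it holds relative to link text, and keep the densest block. This is a simplified sketch, not the package's actual algorithm; `extractDensest` is a hypothetical helper written for this example.

```js
// Toy content-density heuristic (illustrative only, NOT the library's
// real implementation): link-heavy blocks are usually navigation or
// boilerplate, so text inside <a> tags counts against a block's score.
function extractDensest(html) {
  // Naive regex scan over common container elements.
  const blockRe = /<(div|article|section)\b[^>]*>([\s\S]*?)<\/\1>/gi;
  let best = { score: -Infinity, text: '' };
  for (const [, , inner] of html.matchAll(blockRe)) {
    const linkText = [...inner.matchAll(/<a\b[^>]*>([\s\S]*?)<\/a>/gi)]
      .map((m) => m[1].replace(/<[^>]+>/g, ''))
      .join('');
    const text = inner.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
    // Plain text adds to the score; link text is penalized.
    const score = text.length - 2 * linkText.length;
    if (score > best.score) best = { score, text };
  }
  return best.text;
}

const html = `
  <div><a href="/a">Nav</a> <a href="/b">More nav</a></div>
  <article>The actual article body, with enough plain text
  to outweigh the navigation block above.</article>`;
console.log(extractDensest(html));
```

A production cascade would try structured signals first (semantic tags, CMS markup, JSON-LD) and only fall back to density scoring when those fail.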
### scraper.fetch(url): Promise&lt;ScrapeResponse&gt;
Fetches a URL with full anti-bot measures and returns raw HTML.
```js
const response = await scraper.fetch('https://example.com');
console.log(response.status); // 200
console.log(response.body);   // Raw HTML string
console.log(response.url);    // Final URL after redirects
```

### scraper.scrape(url, selector): Promise&lt;ParsedElement[]&gt;
Fetches a URL and extracts elements matching a CSS selector.
```js
const elements = await scraper.scrape('https://news.ycombinator.com', '.titleline > a');

for (const el of elements) {
  console.log(el.text);       // "Article title"
  console.log(el.tag);        // "a"
  console.log(el.html);       // Inner HTML
  console.log(el.attributes); // [{ name: "href", value: "https://..." }]
}
```

### scraper.fetchMany(urls, concurrency?): Promise&lt;ScrapeResponse[]&gt;
Fetches multiple URLs concurrently with bounded concurrency.
```js
const urls = ['https://example.com/1', 'https://example.com/2', 'https://example.com/3'];
const responses = await scraper.fetchMany(urls, 5); // 5 concurrent requests
```

### Scraper.parse(html, selector): ParsedElement[] (static)
Parses existing HTML with a CSS selector. No network request.
```js
const elements = Scraper.parse(htmlString, 'div.product > h2');
```

### Scraper.extract(html): ExtractedContent (static)
Auto-extracts structured content from existing HTML. No network request.
```js
const content = Scraper.extract(htmlString);
```

## How Anti-Bot Bypass Works
When a request gets blocked (401/403/429), the scraper automatically:
- Switches browser identity — rotates to a completely different browser's TLS fingerprint (e.g., Chrome → Firefox → Safari)
- Updates all headers — User-Agent, Sec-Ch-Ua, Accept, and other headers match the new browser so nothing looks suspicious
- Handles cookies — automatically stores and resends session cookies to solve challenge-response flows
- Backs off — waits with exponential backoff + jitter before retrying
- Repeats — tries up to `maxRetries` times with a different browser profile each attempt
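The steps above can be sketched as a retry loop. This is a simplified illustration, not the package's internal (Rust) implementation; `doFetch` stands in for a profile-aware transport, and the jitter value is an assumption.

```js
// Simplified sketch of retry-with-rotation (illustrative only).
const PROFILES = ['chrome', 'firefox', 'safari', 'edge'];

async function fetchWithRotation(url, doFetch, maxRetries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    // Each attempt uses a different browser profile
    // (User-Agent, Sec-Ch-Ua headers, and TLS fingerprint together).
    const profile = PROFILES[attempt % PROFILES.length];
    const res = await doFetch(url, profile);
    if (![401, 403, 429].includes(res.status)) return res;

    // Exponential backoff with jitter before the next attempt.
    const delay = baseDelayMs * 2 ** attempt + Math.random() * 250;
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  throw new Error(`Blocked after ${maxRetries} retries: ${url}`);
}
```

The key detail is that the profile switch happens only on blocking status codes; successful responses return immediately, so well-behaved requests pay no overhead.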
This works because many anti-bot systems identify scrapers by their TLS fingerprint (JA3/JA4) — the scraper emulates 100+ real browser fingerprints at the TLS level, not just the User-Agent string.
## TypeScript Support
Full TypeScript definitions are included. All types are auto-generated:
```ts
import { Scraper, ScraperConfig, ExtractedContent, ScrapeResponse, ParsedElement } from '@p-yiush07/scraper-engine';
```

## Supported Platforms
| Platform | Architecture          | Status    |
|----------|-----------------------|-----------|
| macOS    | ARM64 (Apple Silicon) | Available |
| macOS    | x86_64 (Intel)        | Available |
| Linux    | x86_64 (GNU)          | Available |
| Windows  | x86_64 (MSVC)         | Available |
Prebuilt binaries are included for all platforms, so no Rust toolchain is needed at install time.
## Issues
Report bugs and request features at scraper-engine-issues.
## License
MIT
