ag-webscrape

v0.0.29

Published

2 months ago

TypeScript web scraper with Playwright fallback for anti-scraping protection

scraping playwright typescript web-scraper anti-scraping

ag-webscrape

A TypeScript web scraper with an intelligent fallback strategy. It attempts a direct HTTP fetch first, then falls back to Playwright when the direct fetch fails or anti-scraping protection is detected.

Features

Dual Strategy: Direct fetch first, Playwright fallback
Anti-Scraping Detection: Automatically detects and bypasses common anti-scraping measures
Persistent Browser: Maintains browser instance for faster subsequent scrapes
Error Handling: Comprehensive error detection for 4xx/5xx responses
TypeScript Support: Full type safety and IntelliSense
Configurable: Extensive customization options

Installation

npm install ag-webscrape

Quick Start

import { WebScraper } from 'ag-webscrape';

const scraper = new WebScraper();

// Scrape a single URL
const result = await scraper.scrape('https://example.com');
console.log(result.html);

// Clean up when done
await scraper.dispose();

API Reference

WebScraper Class

Constructor

new WebScraper(options?: ScrapingOptions)

Options

interface ScrapingOptions {
  timeout?: number;           // Request timeout in ms (default: 30000)
  userAgent?: string;         // Custom user agent
  headers?: Record<string, string>; // Additional headers
  retries?: number;           // Number of retries (default: 3)
  waitForSelector?: string;   // CSS selector to wait for
  waitForTimeout?: number;    // Time to wait in ms (default: 5000)
}
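
The README does not say how constructor options and per-call options combine. As a sketch only, assuming per-call values override constructor values, which in turn override the documented defaults, the merge might look like this. `resolveOptions` and `DEFAULT_OPTIONS` are hypothetical names, not part of the package:

```typescript
// Mirrors the ScrapingOptions interface documented above.
interface ScrapingOptions {
  timeout?: number;
  userAgent?: string;
  headers?: Record<string, string>;
  retries?: number;
  waitForSelector?: string;
  waitForTimeout?: number;
}

// Default values taken from the comments in the Options listing.
const DEFAULT_OPTIONS = { timeout: 30000, retries: 3, waitForTimeout: 5000 };

// Assumption: later sources win, and headers are merged key by key.
function resolveOptions(
  constructorOptions: ScrapingOptions = {},
  callOptions: ScrapingOptions = {}
): ScrapingOptions {
  return {
    ...DEFAULT_OPTIONS,
    ...constructorOptions,
    ...callOptions,
    headers: { ...constructorOptions.headers, ...callOptions.headers },
  };
}
```

Under this assumption, `resolveOptions({ timeout: 1000 }, { waitForSelector: '.x' })` keeps the 1000 ms timeout and the default of 3 retries.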

Methods

`scrape(url: string, options?: ScrapingOptions): Promise<ScrapingResult>`

Scrapes a single URL with fallback strategy.

const result = await scraper.scrape('https://example.com', {
  timeout: 60000,
  waitForSelector: '.main-content'
});

`scrapeMultiple(urls: string[], options?: ScrapingOptions): Promise<ScrapingResult[]>`

Scrapes multiple URLs with the same scraper instance and returns an array of results.

const results = await scraper.scrapeMultiple([
  'https://example1.com',
  'https://example2.com'
]);

`dispose(): Promise<void>`

Cleans up browser resources. Always call this when done.

await scraper.dispose();
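
Since `dispose()` should always run, one option is to wrap the work in `try`/`finally` so cleanup happens even when a scrape throws. The helper below is a hypothetical sketch, not part of ag-webscrape. `withScraper` and `DisposableScraper` are names introduced here, and the code relies only on the documented `dispose(): Promise<void>` method:

```typescript
// Structural type covering only the documented dispose() method.
interface DisposableScraper {
  dispose(): Promise<void>;
}

// Runs `work` with the scraper and always disposes it afterwards,
// whether `work` resolves or throws.
async function withScraper<S extends DisposableScraper, T>(
  scraper: S,
  work: (scraper: S) => Promise<T>
): Promise<T> {
  try {
    return await work(scraper);
  } finally {
    await scraper.dispose();
  }
}
```

With the real class, this would be called as `await withScraper(new WebScraper(), (s) => s.scrape('https://example.com'))`.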

Result Object

interface ScrapingResult {
  url: string;              // Original URL
  html: string;             // HTML content
  status: number;           // HTTP status code
  method: 'fetch' | 'playwright'; // Method used
  error?: string;           // Error message if any
  redirected?: boolean;     // Whether request was redirected
  finalUrl?: string;        // Final URL after redirects
}
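
To illustrate how the fields fit together, here is a small sketch that turns a result into a one-line summary. `describeResult` is a hypothetical helper, and the interface is copied from the listing above so the snippet stands alone:

```typescript
interface ScrapingResult {
  url: string;
  html: string;
  status: number;
  method: 'fetch' | 'playwright';
  error?: string;
  redirected?: boolean;
  finalUrl?: string;
}

// Builds a short human-readable summary of a single result.
function describeResult(result: ScrapingResult): string {
  if (result.error) {
    return `${result.url}: failed (${result.error})`;
  }
  // Show the final URL only when the request was actually redirected.
  const target = result.redirected && result.finalUrl ? ` -> ${result.finalUrl}` : '';
  return `${result.url}${target}: HTTP ${result.status} via ${result.method}, ${result.html.length} chars`;
}
```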

Advanced Usage

Custom Headers and User Agent

const scraper = new WebScraper({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  headers: {
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9'
  }
});

Waiting for Content

// Wait for a specific element
const result = await scraper.scrape('https://spa-app.com', {
  waitForSelector: '.dynamic-content'
});

// Wait for a fixed amount of time
// (a separate variable name avoids redeclaring `result` in the same scope)
const slowResult = await scraper.scrape('https://slow-app.com', {
  waitForTimeout: 10000
});

Error Handling

const result = await scraper.scrape('https://example.com');

if (result.error) {
  console.error('Scraping failed:', result.error);
} else {
  console.log('Success:', result.html.length, 'characters');
}
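
The README does not state whether `error` is populated for every 4xx/5xx response, so a stricter check can look at `status` as well. This is a sketch under that assumption. `isSuccessful` and `ResultStatus` are hypothetical names:

```typescript
// Structural subset of the documented ScrapingResult fields.
interface ResultStatus {
  status: number;
  error?: string;
}

// Assumption: a result is usable only when no error is set
// and the HTTP status code is below 400.
function isSuccessful(result: ResultStatus): boolean {
  return !result.error && result.status < 400;
}
```

A `ScrapingResult` can be passed directly, since it contains both fields.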

Batch Scraping

const urls = [
  'https://news.site.com/article1',
  'https://news.site.com/article2',
  'https://news.site.com/article3'
];

const results = await scraper.scrapeMultiple(urls, {
  waitForSelector: '.article-content'
});

results.forEach((result, index) => {
  if (!result.error) {
    console.log(`Article ${index + 1}: ${result.html.length} chars`);
  }
});
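
For a retry pass, it can help to collect the URLs that failed. The sketch below assumes `results[i]` corresponds to `urls[i]`, which the example above implies but the README does not state explicitly. `failedUrls` is a hypothetical helper:

```typescript
// Structural subset of the documented ScrapingResult fields.
interface BatchResult {
  error?: string;
}

// Returns the URLs whose result is missing or carries an error.
// Assumption: results are in the same order as the input URLs.
function failedUrls(urls: string[], results: BatchResult[]): string[] {
  return urls.filter((_url, index) => {
    const result = results[index];
    return !result || Boolean(result.error);
  });
}
```

The returned list could then be passed to `scrapeMultiple` again, for example with a longer `timeout`.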

How It Works

Direct Fetch: First attempts an HTTP request using node-fetch
Anti-Scraping Detection: Checks the response for common anti-scraping patterns
Playwright Fallback: If the direct fetch fails or anti-scraping protection is detected, uses Playwright
Error Detection: Monitors for 4xx/5xx responses in both methods
Resource Management: Keeps the browser instance alive between scrapes for performance
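
The control flow described above can be sketched as follows. This is not the library's source. Both strategies are injected as plain functions so the snippet runs without node-fetch or Playwright, and treating any status of 400 or above as a failed direct fetch is an assumption. All names here (`scrapeWithFallback`, `Attempt`, `Strategy`) are hypothetical:

```typescript
type Method = 'fetch' | 'playwright';

interface Attempt {
  html: string;
  status: number;
}

// A strategy retrieves a URL by some means (direct HTTP or a browser).
type Strategy = (url: string) => Promise<Attempt>;

async function scrapeWithFallback(
  url: string,
  directFetch: Strategy,
  browserFetch: Strategy,
  looksBlocked: (attempt: Attempt) => boolean
): Promise<Attempt & { method: Method }> {
  try {
    const direct = await directFetch(url);
    // Keep the direct response only if it succeeded and was not blocked.
    if (direct.status < 400 && !looksBlocked(direct)) {
      return { ...direct, method: 'fetch' };
    }
  } catch {
    // Network-level failure: fall through to the browser strategy.
  }
  const viaBrowser = await browserFetch(url);
  return { ...viaBrowser, method: 'playwright' };
}
```

Injecting the strategies keeps the decision logic testable on its own, separate from the cost of launching a browser.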

Anti-Scraping Protection

The scraper automatically detects and handles:

Cloudflare protection
DistilNetworks
PerimeterX
DataDome
Akamai Bot Manager
CAPTCHA challenges
JavaScript requirement checks
Rate limiting
Access denied pages
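
The package's actual detection rules are not documented here. As an illustration of the general idea only, a pattern-based check might combine status codes with text markers found in block pages. The markers and the `looksBlocked` name below are hypothetical:

```typescript
// Illustrative markers only; a real detector would need vendor-specific rules.
const BLOCK_MARKERS: RegExp[] = [
  /captcha/i,
  /access denied/i,
  /enable javascript/i,
  /too many requests/i,
];

function looksBlocked(status: number, html: string): boolean {
  // 403 (Forbidden) and 429 (Too Many Requests) commonly accompany
  // bot blocks and rate limiting.
  if (status === 403 || status === 429) return true;
  return BLOCK_MARKERS.some((marker) => marker.test(html));
}
```

A text-only check like this can produce false positives, for example on an article that merely mentions CAPTCHAs, so it is best treated as a hint to retry with a browser rather than as a verdict.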

Performance

Fast: Direct fetch for simple pages
Efficient: Reuses browser instance
Robust: Fallback improves the success rate on protected pages
Intelligent: Only uses Playwright when necessary
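
Reusing one browser instance is usually done with lazy initialization: create the resource on first use, hand out the same one afterwards, and tear it down on dispose. The generic class below is a sketch of that pattern, not the library's implementation. `LazyResource` is a hypothetical name:

```typescript
// Creates a resource on first use and reuses it until dispose() is called.
class LazyResource<T> {
  private instance: Promise<T> | undefined;
  private readonly create: () => Promise<T>;
  private readonly destroy: (resource: T) => Promise<void>;

  constructor(create: () => Promise<T>, destroy: (resource: T) => Promise<void>) {
    this.create = create;
    this.destroy = destroy;
  }

  // Storing the promise (not the value) means concurrent callers
  // share a single creation instead of racing to create two.
  get(): Promise<T> {
    if (!this.instance) this.instance = this.create();
    return this.instance;
  }

  // Safe to call repeatedly; does nothing if no resource exists.
  async dispose(): Promise<void> {
    if (!this.instance) return;
    const pending = this.instance;
    this.instance = undefined;
    await this.destroy(await pending);
  }
}
```

With Playwright, this could wrap a browser as `new LazyResource(() => chromium.launch(), (browser) => browser.close())`, so simple pages served by direct fetch never pay the launch cost.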

Examples

Check out the src/example.ts file for complete usage examples.

License

MIT

Contributing

Pull requests welcome! Please ensure TypeScript compilation and tests pass.

Support

For issues and questions, please use the GitHub issue tracker.
