ag-webscrape
v0.0.19
TypeScript web scraper with Playwright fallback for anti-scraping protection
ag-webscrape
A TypeScript web scraper with an intelligent fallback strategy. It attempts a direct HTTP fetch first and falls back to Playwright when the fetch fails or anti-scraping protection is detected.
Features
- Dual Strategy: Direct fetch first, Playwright fallback
- Anti-Scraping Detection: Automatically detects and bypasses common anti-scraping measures
- Persistent Browser: Maintains browser instance for faster subsequent scrapes
- Error Handling: Comprehensive error detection for 4xx/5xx responses
- TypeScript Support: Full type safety and IntelliSense
- Configurable: Extensive customization options
Installation
npm install ag-webscrape
Quick Start
import { WebScraper } from 'ag-webscrape';
const scraper = new WebScraper();
// Scrape a single URL
const result = await scraper.scrape('https://example.com');
console.log(result.html);
// Clean up when done
await scraper.dispose();
API Reference
WebScraper Class
Constructor
new WebScraper(options?: ScrapingOptions)
Options
interface ScrapingOptions {
  timeout?: number; // Request timeout in ms (default: 30000)
  userAgent?: string; // Custom user agent
  headers?: Record<string, string>; // Additional headers
  retries?: number; // Number of retries (default: 3)
  waitForSelector?: string; // CSS selector to wait for
  waitForTimeout?: number; // Time to wait in ms (default: 5000)
}
Methods
scrape(url: string, options?: ScrapingOptions): Promise<ScrapingResult>
Scrapes a single URL with fallback strategy.
const result = await scraper.scrape('https://example.com', {
  timeout: 60000,
  waitForSelector: '.main-content'
});
scrapeMultiple(urls: string[], options?: ScrapingOptions): Promise<ScrapingResult[]>
Scrapes multiple URLs efficiently.
const results = await scraper.scrapeMultiple([
  'https://example1.com',
  'https://example2.com'
]);
dispose(): Promise<void>
Cleans up browser resources. Always call this when done.
await scraper.dispose();
Result Object
interface ScrapingResult {
  url: string; // Original URL
  html: string; // HTML content
  status: number; // HTTP status code
  method: 'fetch' | 'playwright'; // Method used
  error?: string; // Error message if any
  redirected?: boolean; // Whether request was redirected
  finalUrl?: string; // Final URL after redirects
}
Advanced Usage
Custom Headers and User Agent
const scraper = new WebScraper({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  headers: {
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9'
  }
});
Waiting for Content
// Wait for a specific element
const result = await scraper.scrape('https://spa-app.com', {
  waitForSelector: '.dynamic-content'
});
// Wait for a fixed amount of time (named differently so both snippets can share a scope)
const delayedResult = await scraper.scrape('https://slow-app.com', {
  waitForTimeout: 10000
});
Error Handling
const result = await scraper.scrape('https://example.com');
if (result.error) {
  console.error('Scraping failed:', result.error);
} else {
  console.log('Success:', result.html.length, 'characters');
}
Batch Scraping
const urls = [
  'https://news.site.com/article1',
  'https://news.site.com/article2',
  'https://news.site.com/article3'
];
const results = await scraper.scrapeMultiple(urls, {
  waitForSelector: '.article-content'
});
results.forEach((result, index) => {
  if (!result.error) {
    console.log(`Article ${index + 1}: ${result.html.length} chars`);
  }
});
How It Works
- Direct Fetch: First attempts HTTP request using node-fetch
- Anti-Scraping Detection: Checks response for common anti-scraping patterns
- Playwright Fallback: If direct fetch fails or anti-scraping detected, uses Playwright
- Error Detection: Monitors for 4xx/5xx responses in both methods
- Resource Management: Maintains browser instance for performance
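The steps above can be sketched roughly as follows. This is an illustration of the fetch-then-browser pattern only, not the package's actual implementation: `scrapeWithFallback`, `Loader`, `Attempt`, `Outcome`, and the injected `isBlocked` check are hypothetical names, and the two loaders are passed in so the sketch stays independent of node-fetch and Playwright.

```typescript
type Method = 'fetch' | 'playwright';

// Minimal data returned by either strategy (hypothetical shape).
interface Attempt {
  html: string;
  status: number;
}

interface Outcome extends Attempt {
  method: Method;
  error?: string;
}

// A loader is any async function that returns page HTML and a status code.
type Loader = (url: string) => Promise<Attempt>;

async function scrapeWithFallback(
  url: string,
  direct: Loader, // assumed: a thin wrapper around an HTTP client
  browser: Loader, // assumed: a thin wrapper around a browser page
  isBlocked: (attempt: Attempt) => boolean,
): Promise<Outcome> {
  try {
    const first = await direct(url);
    // Accept the cheap result only if it succeeded and does not look like a block page.
    if (first.status < 400 && !isBlocked(first)) {
      return { ...first, method: 'fetch' };
    }
  } catch {
    // Network-level failure: fall through to the browser strategy.
  }

  try {
    const second = await browser(url);
    // Report 4xx/5xx responses from the browser path as errors.
    if (second.status >= 400) {
      return { ...second, method: 'playwright', error: `HTTP ${second.status}` };
    }
    return { ...second, method: 'playwright' };
  } catch (err) {
    // Both strategies failed; status 0 marks "no response" in this sketch.
    return { html: '', status: 0, method: 'playwright', error: String(err) };
  }
}
```

Injecting the loaders keeps the decision logic testable without network access; in a real setup they would wrap the HTTP client and the browser page respectively.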
Anti-Scraping Protection
The scraper automatically detects and handles:
- Cloudflare protection
- DistilNetworks
- PerimeterX
- DataDome
- Akamai Bot Manager
- CAPTCHA challenges
- JavaScript requirement checks
- Rate limiting
- Access denied pages
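As a rough idea of what such detection might look like, here is a minimal sketch that flags a response by status code or by text patterns in the body. The status codes, the patterns, and the `looksBlocked` name are assumptions chosen for illustration; the package's real detection rules are not documented here and may differ.

```typescript
// Hypothetical heuristics, not the package's actual rule set.
const BLOCK_STATUS_CODES = new Set<number>([403, 429, 503]);

const BLOCK_PATTERNS: RegExp[] = [
  /checking your browser/i, // interstitial challenge pages
  /captcha/i, // CAPTCHA challenges
  /enable javascript/i, // JavaScript requirement checks
  /access denied/i, // access denied pages
  /too many requests/i, // rate limiting
];

// Returns true when the response looks like a block page rather than real content.
function looksBlocked(status: number, html: string): boolean {
  if (BLOCK_STATUS_CODES.has(status)) {
    return true;
  }
  return BLOCK_PATTERNS.some((pattern) => pattern.test(html));
}
```

Text matching of this kind can produce false positives (an article that merely mentions CAPTCHAs, for instance), so a real detector would likely combine it with other signals such as response headers or page size.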
Performance
- Fast: Direct fetch for simple pages
- Efficient: Reuses browser instance
- Robust: Fallback improves the success rate when direct fetch is blocked
- Intelligent: Only uses Playwright when necessary
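The browser reuse mentioned above is commonly done with a lazily created, shared instance. The sketch below shows that general pattern with a generic holder; `LazyResource` is a hypothetical name and this is not taken from the package's source.

```typescript
// Creates a costly resource on first use and reuses it until disposed.
class LazyResource<T> {
  private pending: Promise<T> | undefined;
  private readonly create: () => Promise<T>;
  private readonly destroy: (resource: T) => Promise<void>;

  constructor(create: () => Promise<T>, destroy: (resource: T) => Promise<void>) {
    this.create = create;
    this.destroy = destroy;
  }

  // Returns the shared instance; concurrent callers share one creation.
  get(): Promise<T> {
    if (!this.pending) {
      this.pending = this.create().catch((err) => {
        // Forget a failed creation so the next call can try again.
        this.pending = undefined;
        throw err;
      });
    }
    return this.pending;
  }

  // Releases the instance if one exists; a later get() creates a fresh one.
  async dispose(): Promise<void> {
    const current = this.pending;
    this.pending = undefined;
    if (current) {
      await this.destroy(await current);
    }
  }
}
```

With Playwright, `create` could be `() => chromium.launch()` and `destroy` could be `(browser) => browser.close()`, so that only the first scrape needing the browser pays the launch cost.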
Examples
Check out the src/example.ts file for complete usage examples.
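As a small complement, here is a sketch of post-processing a batch of results. It redeclares the `ScrapingResult` shape from the Result Object section locally so it runs on its own; `summarize` and `BatchSummary` are hypothetical helpers, not part of the package.

```typescript
// Local copy of the documented result shape, so the sketch is self-contained.
interface ScrapingResult {
  url: string;
  html: string;
  status: number;
  method: 'fetch' | 'playwright';
  error?: string;
  redirected?: boolean;
  finalUrl?: string;
}

interface BatchSummary {
  succeeded: ScrapingResult[];
  failed: ScrapingResult[];
  viaPlaywright: number; // how many results needed the browser fallback
}

// Splits a batch into successes and failures and counts fallback usage.
function summarize(results: ScrapingResult[]): BatchSummary {
  const succeeded = results.filter((r) => !r.error);
  const failed = results.filter((r) => Boolean(r.error));
  const viaPlaywright = results.filter((r) => r.method === 'playwright').length;
  return { succeeded, failed, viaPlaywright };
}
```

Passing the array returned by `scrapeMultiple` to a helper like this gives a quick view of how many pages failed and how often the fallback was needed.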
License
MIT
Contributing
Pull requests welcome! Please ensure TypeScript compilation and tests pass.
Support
For issues and questions, please use the GitHub issue tracker.
