@danilidonbeltran/webscrapper

v2.3.1

Published

20 days ago

A web scraper using Playwright to extract all text content from websites

danilidonbeltran

webscraping playwright text-extraction

🕷️ Web Scraper with Playwright

A production-ready web scraping solution built with Playwright that extracts text content from websites with multiple operation modes and intelligent content filtering.

✨ Features

Multi-browser support (Chromium, Firefox, WebKit)
Unified CLI - Auto-detects operation mode (single URL, bulk, or preset-based)
JavaScript rendering - Handles dynamic content
Structured extraction - Headings, paragraphs, links, lists, images
Bulk processing - Scrape multiple URLs efficiently with rate limiting
Configuration presets - Optimized for news, blogs, docs, e-commerce
Pre-scrape interactions - Run Playwright interaction steps before extraction
Multiple output formats - JSON, TXT, CSV
Content grouping - Group results by CSS selectors
Error handling - Robust error recovery and retry mechanisms

🚀 Quick Start

# Install dependencies
npm install

# Install browser binaries
npm run install-browsers

# Basic scraping
npm run scrape "https://example.com"

# Structured extraction
npm run scrape "https://example.com" -- --structured --output results.json

# Run tests
npm test

📖 Usage

Single URL Scraping

# Basic text extraction
npm run scrape "https://example.com"

# Structured content (headings, links, paragraphs, lists)
npm run scrape "https://example.com" -- --structured

# Group content by sections
npm run scrape "https://news-site.com" -- --structured --group-by "article"

# Run interaction steps from file before scraping
npm run scrape "https://example.com" -- --interaction-steps-file interactions.json

# Use different browser
npm run scrape "https://example.com" -- --browser firefox

# Debug mode (visible browser)
npm run scrape "https://example.com" -- --no-headless

Bulk Scraping

# Multiple URLs
npm run scrape "https://site1.com" "https://site2.com" -- --structured

# From file
npm run scrape -- --file urls.txt --output results.json

# Custom batch processing
npm run scrape -- --file urls.txt --batch-size 3 --delay 2000 --format csv

Preset-based Scraping

# List available presets
npm run list-presets

# Show preset details
npm run show-preset news

# Use preset (news, blog, ecommerce, documentation)
npm run scrape -- --preset news "https://news-site.com" --output articles.json

Available Presets

news - News articles (excludes related articles, newsletters, ads)
blog - Blog posts (excludes author bio, related posts, comments)
ecommerce - Product pages (excludes reviews, recommendations, cart)
documentation - Technical docs (excludes edit buttons, breadcrumbs, navigation)

Programmatic Usage

import { WebScraper } from './src/scraper.js';

// Basic usage
const scraper = new WebScraper();
const result = await scraper.scrapeText('https://example.com');
console.log(result.text);
await scraper.close();

// With custom configuration
const customScraper = new WebScraper({
  browser: 'chromium',
  headless: true,
  timeout: 30000,
  waitForSelector: '.main-content',
  waitUntil: 'domcontentloaded',   // Navigation wait strategy (default: 'domcontentloaded')
  excludeSelectors: ['script', 'style', '.ads', 'nav', 'footer'],
  followPermanentRedirect: false,  // Don't follow permanent redirects 301/308 (default: true)
  followTemporaryRedirect: false,  // Don't follow temporary redirects 302/303/307 (default: true)
  interactionSteps: [
    { event: 'click', target: '#expand', wait: 1000 },
    { event: 'mouseover', target: '#info', required: false }
  ]
});

// Structured extraction (the first scraper was closed above, so use customScraper)
const structured = await customScraper.scrapeTextStructured('https://example.com');
console.log(structured.headings, structured.links);

// Multiple section selectors - try multiple CSS selectors
const sectionScraper = new WebScraper({
  sectionSelectors: ['article', 'section', '.content', 'main']
});
const sections = await sectionScraper.scrapeTextStructured('https://example.com');
console.log(sections.sections); // Array of matched sections

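// Override section selectors per request
const overridden = await sectionScraper.scrapeTextStructured('https://example.com', {
  sectionSelectors: ['.post', 'article', '.entry']
});

// Release browser resources when finished
await customScraper.close();
await sectionScraper.close();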

Using Presets Programmatically

import { ConfigurableScraper } from './src/configurable-scraper.js';

const scraper = new ConfigurableScraper();
const result = await scraper.scrapeWithPreset(
  'https://news-site.com', 
  'news',
  { structured: true }
);

Bulk Processing

import { BulkScraper } from './src/bulk-scraper.js';

const bulkScraper = new BulkScraper();
const results = await bulkScraper.scrapeUrls(urls, {
  batchSize: 3,
  delay: 2000,
  structured: true,
  outputFormat: 'json'
});
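
The batchSize and delay options suggest that URLs are handled in groups with a pause between groups. The following is a minimal sketch of that idea, not BulkScraper's actual implementation: processInBatches and sleep are hypothetical names, and the choice to run each batch concurrently is an assumption.

```javascript
// Hypothetical helpers for illustration; not part of this package's API.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processInBatches(items, { batchSize = 3, delay = 2000 } = {}, worker) {
  const results = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Run one batch concurrently; allSettled keeps a single failure
    // from aborting the remaining items.
    const settled = await Promise.allSettled(batch.map((item) => worker(item)));
    settled.forEach((outcome, i) => {
      results.push(
        outcome.status === 'fulfilled'
          ? { item: batch[i], ok: true, value: outcome.value }
          : { item: batch[i], ok: false, error: String(outcome.reason) }
      );
    });
    // Pause between batches, but not after the last one.
    if (start + batchSize < items.length) await sleep(delay);
  }
  return results;
}
```

For example, `processInBatches(urls, { batchSize: 3, delay: 2000 }, (url) => scraper.scrapeText(url))` would scrape three URLs at a time with a two-second pause between batches, collecting failures alongside successes.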

🎯 Advanced Features

Pre-scrape Interaction Steps

Run a sequence of browser interactions after navigation (and after waitForSelector resolves, if one is configured) but before the scraper extracts content.

[
  { "event": "click", "target": "#expand", "wait": 1000 },
  { "event": "mouseover", "target": "#info", "required": false }
]

Use with CLI:

npm run scrape "https://example.com" -- --interaction-steps-file interactions.json

Supported events: click, dblclick, mouseover, hover, focus, fill, type, press

Each step supports:

event (required)
target (required, CSS selector)
required (optional, default true)
wait (optional milliseconds to wait after event)
timeout (optional override for that step)
value (optional; used by fill, type, and press; for press it is the key to send and must be non-empty)

When a non-required step fails, scraping continues and the result includes interactionWarnings.
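
Checking a steps file before a long run can catch mistakes early. The sketch below encodes only the rules listed above; validateInteractionStep is a hypothetical helper, and the package may validate steps differently or more strictly.

```javascript
// Events listed in this README as supported.
const SUPPORTED_EVENTS = ['click', 'dblclick', 'mouseover', 'hover', 'focus', 'fill', 'type', 'press'];

// Hypothetical validator: returns a list of problems, empty when the step looks valid.
function validateInteractionStep(step) {
  const errors = [];
  if (!SUPPORTED_EVENTS.includes(step.event)) {
    errors.push(`unsupported or missing event: ${step.event}`);
  }
  if (typeof step.target !== 'string' || step.target.trim() === '') {
    errors.push('target must be a non-empty CSS selector');
  }
  if (step.event === 'press' && (typeof step.value !== 'string' || step.value === '')) {
    errors.push('press requires a non-empty key in value');
  }
  // wait and timeout are optional, but must be sensible when present.
  for (const field of ['wait', 'timeout']) {
    if (step[field] !== undefined && !(Number.isFinite(step[field]) && step[field] >= 0)) {
      errors.push(`${field} must be a non-negative number of milliseconds`);
    }
  }
  return errors;
}
```

Running it over every entry of interactions.json and printing any non-empty result is one way to use it before starting a bulk job.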

Redirect Handling

Control how the scraper handles HTTP redirects independently for permanent (301, 308) and temporary (302, 303, 307) redirects:

import { WebScraper, RedirectError } from './src/scraper.js';

// Throw on all redirects
const scraper = new WebScraper({
  followPermanentRedirect: false,  // Don't follow 301/308 (default: true)
  followTemporaryRedirect: false   // Don't follow 302/303/307 (default: true)
});

// Follow permanent redirects only
const permanentOnlyScraper = new WebScraper({
  followPermanentRedirect: true,
  followTemporaryRedirect: false
});

try {
  const result = await scraper.scrapeText('https://example.com/old-page');
  // Normal scraping if no redirect
  console.log(result.text);
} catch (error) {
  // Redirect detected - throws RedirectError
  if (error instanceof RedirectError) {
    console.log(`Redirect ${error.status}: ${error.originalUrl} -> ${error.location}`);
    console.log(error.message);
    // error.status - HTTP status code (301, 302, etc.)
    // error.location - redirect target URL
    // error.originalUrl - original URL that redirected
    // error.timestamp - ISO timestamp
  }
}

Use cases:

✅ Detect moved or deprecated URLs
✅ Track redirect chains in bulk operations
✅ Validate URL structure without following redirects
✅ Audit SEO redirect configurations
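
The split between the two options can be illustrated with a small status-code classifier. This is a sketch of the grouping described in this section, not the package's internal logic; classifyRedirect and shouldFollow are hypothetical names.

```javascript
// Status codes grouped as described in this section.
const PERMANENT = new Set([301, 308]);
const TEMPORARY = new Set([302, 303, 307]);

function classifyRedirect(status) {
  if (PERMANENT.has(status)) return 'permanent';
  if (TEMPORARY.has(status)) return 'temporary';
  return 'none';
}

// Hypothetical decision helper mirroring the two configuration options.
function shouldFollow(status, { followPermanentRedirect = true, followTemporaryRedirect = true } = {}) {
  const kind = classifyRedirect(status);
  if (kind === 'permanent') return followPermanentRedirect;
  if (kind === 'temporary') return followTemporaryRedirect;
  return true; // not a redirect, nothing to block
}
```

In a bulk audit, the same classifier could be applied to error.status from a caught RedirectError to tally permanent and temporary redirects separately.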

Multiple Section Selectors

You can specify multiple CSS selectors to capture content from different section types. The scraper tries each selector and combines all matching sections. Each section in the results carries its selector as its id; when one selector matches several elements, the ids are numbered.

const scraper = new WebScraper({
  sectionSelectors: ['article', 'section', '.content', 'main']
});

const result = await scraper.scrapeTextStructured('https://example.com');
// Returns sections matching ANY of the selectors

This is useful when:

Different pages use different HTML structures
You want to capture multiple types of content sections
Content is split across various semantic elements

Benefits:

✅ More flexible scraping across different page layouts
✅ Fallback selectors if the primary selector doesn't match
✅ Combine multiple content areas (e.g., main article + sidebars)
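
To make the id numbering concrete, here is one way it could work. The exact id format is not documented in this README, so the "-N" suffix is purely an assumption for illustration, and assignSectionIds is a hypothetical helper.

```javascript
// ASSUMPTION: the "-N" suffix is illustrative; the package's real id format may differ.
function assignSectionIds(matches) {
  // First pass: count how many times each selector matched.
  const totals = new Map();
  for (const { selector } of matches) {
    totals.set(selector, (totals.get(selector) || 0) + 1);
  }
  // Second pass: number only the selectors that matched more than once.
  const seen = new Map();
  return matches.map(({ selector, text }) => {
    const index = (seen.get(selector) || 0) + 1;
    seen.set(selector, index);
    const id = totals.get(selector) > 1 ? `${selector}-${index}` : selector;
    return { id, text };
  });
}
```

Whatever format the package actually uses, inspecting the id values of a real result is the reliable way to confirm it before depending on them.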

Navigation Wait Strategy (`waitUntil`)

Control when Playwright considers navigation complete. Useful for pages that load content at different stages:

| Value | Wait for | Best for |
|-------|----------|----------|
| 'domcontentloaded' | HTML parsed, DOM ready (default) | Most pages; fast and reliable |
| 'load' | All resources (images, scripts, stylesheets) | Pages where initial layout matters |
| 'networkidle' | No network activity for 500ms | Heavy SPAs that fetch data on load |
| 'commit' | First byte received | Fast link-following / redirect detection |

// Default: fastest, works for most pages
const defaultScraper = new WebScraper({ waitUntil: 'domcontentloaded' });

// Wait for all assets (images, fonts, etc.)
const loadScraper = new WebScraper({ waitUntil: 'load' });

// Wait for API calls to finish on a SPA
const idleScraper = new WebScraper({ waitUntil: 'networkidle' });

Tips:

'domcontentloaded' is the default — prefer it unless content is missing.
Use 'networkidle' for SPAs that render after async data fetches, but expect slower scraping.
Combine with waitForSelector to wait for a specific element after navigation.

📊 Output Formats

JSON (default)

{
  "url": "https://example.com",
  "text": "Extracted content...",
  "length": 1234,
  "timestamp": "2025-10-10T10:00:00.000Z"
}

Structured JSON (--structured flag)

{
  "url": "https://example.com",
  "title": "Page Title",
  "headings": {"h1": ["Main"], "h2": ["Sub1", "Sub2"]},
  "paragraphs": ["Text..."],
  "links": [{"text": "Link", "href": "https://..."}],
  "lists": [{"type": "ul", "items": ["Item 1"]}],
  "images": [{"alt": "Desc", "src": "https://..."}]
}

CSV (--format csv)

Comma-separated values with headers for bulk operations.
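
The exact CSV columns are not documented here. As a rough sketch, assuming the columns mirror the fields of the default JSON output (url, length, timestamp, text), a conversion could look like the following; toCsv and csvField are hypothetical helpers, not the package's exporter.

```javascript
// Quote a field in the RFC 4180 style: wrap it in quotes when it contains
// a comma, quote, or line break, and double any embedded quotes.
function csvField(value) {
  const s = String(value ?? '');
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// ASSUMPTION: column names are taken from the JSON output shown above.
function toCsv(results, columns = ['url', 'length', 'timestamp', 'text']) {
  const header = columns.join(',');
  const rows = results.map((r) => columns.map((c) => csvField(r[c])).join(','));
  return [header, ...rows].join('\n');
}
```

Quoting matters for scraped text in particular, since page content routinely contains commas, quotes, and line breaks.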

TXT (--format txt)

Plain text concatenation for simple text output.

🛠️ CLI Command Reference

# Flags must follow the "--" separator so that npm forwards them to the script

# Basic commands
npm run scrape "URL"                             # Basic scraping
npm run scrape "URL" -- --structured             # Structured extraction
npm run scrape "URL" -- --output file.json       # Save to file

# Advanced options
npm run scrape "URL" -- --browser firefox        # Use Firefox
npm run scrape "URL" -- --no-headless            # Show browser
npm run scrape "URL" -- --timeout 60000          # 60s timeout
npm run scrape "URL" -- --group-by "selector"    # Group content
npm run scrape "URL" -- --interaction-steps-file interactions.json

# Bulk operations
npm run scrape "URL1" "URL2"                     # Multiple URLs
npm run scrape -- --file urls.txt --batch-size 3 # Custom batching
npm run scrape -- --file urls.txt --delay 2000   # 2s delay

# Preset operations
npm run list-presets                             # List presets
npm run show-preset news                         # Show preset config
npm run scrape -- --preset news "URL"            # Use preset

🚦 Best Practices

Be respectful - Use delays between requests (--delay 2000)
Handle errors - Always use try-catch blocks in code
Close resources - Call await scraper.close()
Use presets - Leverage optimized configurations
Check robots.txt - Respect website policies
Test first - Try single URL before bulk operations
Save important results - Use --output flag
Optimize performance - Use headless mode and appropriate batch sizes
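
The "handle errors" and "close resources" points combine naturally in a try/finally wrapper. This is a minimal sketch: withScraper is a hypothetical helper, and it assumes only that the scraper object exposes an async close() method, as WebScraper does in the examples above.

```javascript
// Hypothetical helper: works with any object exposing an async close() method.
async function withScraper(createScraper, task) {
  const scraper = createScraper();
  try {
    return await task(scraper);
  } finally {
    // Runs on success and on failure, so the browser is always released.
    await scraper.close();
  }
}
```

A call such as `withScraper(() => new WebScraper(), (s) => s.scrapeText('https://example.com'))` would then release the browser even when scraping throws; the error still propagates to the caller for handling.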

🔍 Troubleshooting

Common Issues

| Issue | Solution |
|-------|----------|
| Timeout errors | Increase timeout: --timeout 60000 |
| Empty results | Try preset or wait selector |
| Browser crashes | Reduce batch size: --batch-size 2 |
| Memory issues | Process fewer URLs at once |
| Preset not found | Run npm run list-presets |

📄 License

MIT License - Feel free to use in your projects!