@danilidonbeltran/webscrapper

v2.3.1

A web scraper using Playwright to extract all text content from websites

🕷️ Web Scraper with Playwright

A production-ready web scraping solution built with Playwright that extracts text content from websites with multiple operation modes and intelligent content filtering.

✨ Features

  • Multi-browser support (Chromium, Firefox, WebKit)
  • Unified CLI - Auto-detects operation mode (single URL, bulk, or preset-based)
  • JavaScript rendering - Handles dynamic content
  • Structured extraction - Headings, paragraphs, links, lists, images
  • Bulk processing - Scrape multiple URLs efficiently with rate limiting
  • Configuration presets - Optimized for news, blogs, docs, e-commerce
  • Pre-scrape interactions - Run Playwright interaction steps before extraction
  • Multiple output formats - JSON, TXT, CSV
  • Content grouping - Group results by CSS selectors
  • Error handling - Robust error recovery and retry mechanisms

🚀 Quick Start

# Install dependencies
npm install

# Install browser binaries
npm run install-browsers

# Basic scraping
npm run scrape "https://example.com"

# Structured extraction
npm run scrape "https://example.com" -- --structured --output results.json

# Run tests
npm test

📖 Usage

Single URL Scraping

# Basic text extraction
npm run scrape "https://example.com"

# Structured content (headings, links, paragraphs, lists)
npm run scrape "https://example.com" -- --structured

# Group content by sections
npm run scrape "https://news-site.com" -- --structured --group-by "article"

# Run interaction steps from file before scraping
npm run scrape "https://example.com" -- --interaction-steps-file interactions.json

# Use different browser
npm run scrape "https://example.com" -- --browser firefox

# Debug mode (visible browser)
npm run scrape "https://example.com" -- --no-headless

Bulk Scraping

# Multiple URLs
npm run scrape "https://site1.com" "https://site2.com" -- --structured

# From file
npm run scrape -- --file urls.txt --output results.json

# Custom batch processing
npm run scrape -- --file urls.txt --batch-size 3 --delay 2000 --format csv

Preset-based Scraping

# List available presets
npm run list-presets

# Show preset details
npm run show-preset news

# Use preset (news, blog, ecommerce, documentation)
npm run scrape -- --preset news "https://news-site.com" --output articles.json

Available Presets

  • news - News articles (excludes related articles, newsletters, ads)
  • blog - Blog posts (excludes author bio, related posts, comments)
  • ecommerce - Product pages (excludes reviews, recommendations, cart)
  • documentation - Technical docs (excludes edit buttons, breadcrumbs, navigation)

Programmatic Usage

import { WebScraper } from './src/scraper.js';

// Basic usage
const scraper = new WebScraper();
const result = await scraper.scrapeText('https://example.com');
console.log(result.text);

// With custom configuration
const customScraper = new WebScraper({
  browser: 'chromium',
  headless: true,
  timeout: 30000,
  waitForSelector: '.main-content',
  waitUntil: 'domcontentloaded',   // Navigation wait strategy (default: 'domcontentloaded')
  excludeSelectors: ['script', 'style', '.ads', 'nav', 'footer'],
  followPermanentRedirect: false,  // Don't follow permanent redirects 301/308 (default: true)
  followTemporaryRedirect: false,  // Don't follow temporary redirects 302/303/307 (default: true)
  interactionSteps: [
    { event: 'click', target: '#expand', wait: 1000 },
    { event: 'mouseover', target: '#info', required: false }
  ]
});

// Structured extraction
const structured = await scraper.scrapeTextStructured('https://example.com');
console.log(structured.headings, structured.links);

// Multiple section selectors - each CSS selector in the list is tried
const sectionScraper = new WebScraper({
  sectionSelectors: ['article', 'section', '.content', 'main']
});
const sections = await sectionScraper.scrapeTextStructured('https://example.com');
console.log(sections.sections); // Array of matched sections

// Override section selectors per request
const overridden = await scraper.scrapeTextStructured('https://example.com', {
  sectionSelectors: ['.post', 'article', '.entry']
});

// Close each scraper only after its last use
await scraper.close();
await sectionScraper.close();

Using Presets Programmatically

import { ConfigurableScraper } from './src/configurable-scraper.js';

const scraper = new ConfigurableScraper();
const result = await scraper.scrapeWithPreset(
  'https://news-site.com', 
  'news',
  { structured: true }
);

Bulk Processing

import { BulkScraper } from './src/bulk-scraper.js';

// URLs to process
const urls = ['https://site1.com', 'https://site2.com'];

const bulkScraper = new BulkScraper();
const results = await bulkScraper.scrapeUrls(urls, {
  batchSize: 3,
  delay: 2000,
  structured: true,
  outputFormat: 'json'
});
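To make the batchSize and delay options more concrete, here is a minimal sketch of how batching with a pause between batches can work in general. It is not the package's implementation: chunk, sleep, and processInBatches are hypothetical names, and the defaults of 3 and 2000 are only borrowed from the example above.

```javascript
// Split a list into consecutive groups of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical driver: runs `worker` on every item of a batch concurrently,
// then pauses `delay` milliseconds before starting the next batch.
async function processInBatches(items, worker, { batchSize = 3, delay = 2000 } = {}) {
  const results = [];
  const batches = chunk(items, batchSize);
  for (let i = 0; i < batches.length; i += 1) {
    // allSettled keeps one failed item from discarding the rest of its batch.
    const settled = await Promise.allSettled(
      batches[i].map(async (item) => worker(item))
    );
    results.push(...settled);
    if (i < batches.length - 1) await sleep(delay); // no pause after the last batch
  }
  return results;
}
```

Each entry in the returned array is a Promise.allSettled record, so a caller can separate successes (status 'fulfilled') from failures (status 'rejected') after the run.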

🎯 Advanced Features

Pre-scrape Interaction Steps

Interaction steps are a sequence of browser interactions that run after navigation (and after the optional waitForSelector wait), but before the scraper extracts content.

[
  { "event": "click", "target": "#expand", "wait": 1000 },
  { "event": "mouseover", "target": "#info", "required": false }
]

Use with CLI:

npm run scrape "https://example.com" -- --interaction-steps-file interactions.json

Supported events: click, dblclick, mouseover, hover, focus, fill, type, press

Each step supports:

  • event (required)
  • target (required, CSS selector)
  • required (optional, default true)
  • wait (optional milliseconds to wait after event)
  • timeout (optional override for that step)
  • value (optional; used by fill, type, and press; for press it must be a non-empty key name)

When a non-required step fails, scraping continues and the result includes interactionWarnings.
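The rules above can be checked before a run. The following is a minimal sketch of such a check, written only from the field list in this section. validateInteractionSteps is a hypothetical helper, not something the package exports, and the package's own validation may differ.

```javascript
// Events listed as supported in this README.
const SUPPORTED_EVENTS = [
  'click', 'dblclick', 'mouseover', 'hover', 'focus', 'fill', 'type', 'press'
];

// Hypothetical helper: returns a list of problems found in an
// interaction-steps array. An empty list means nothing was flagged.
function validateInteractionSteps(steps) {
  if (!Array.isArray(steps)) return ['steps must be an array'];
  const problems = [];
  steps.forEach((step, i) => {
    if (step === null || typeof step !== 'object') {
      problems.push(`step ${i}: must be an object`);
      return;
    }
    if (!SUPPORTED_EVENTS.includes(step.event)) {
      problems.push(`step ${i}: missing or unsupported event`);
    }
    if (typeof step.target !== 'string' || step.target.trim() === '') {
      problems.push(`step ${i}: target must be a non-empty CSS selector`);
    }
    if (step.event === 'press' && (typeof step.value !== 'string' || step.value === '')) {
      problems.push(`step ${i}: press needs a non-empty key in value`);
    }
    if (step.wait !== undefined && !(Number.isFinite(step.wait) && step.wait >= 0)) {
      problems.push(`step ${i}: wait must be a non-negative number of milliseconds`);
    }
    if (step.required !== undefined && typeof step.required !== 'boolean') {
      problems.push(`step ${i}: required must be a boolean`);
    }
  });
  return problems;
}
```

Running this over the contents of interactions.json before passing the file to --interaction-steps-file is one way to catch typos in event names early.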

Redirect Handling

Control how the scraper handles HTTP redirects independently for permanent (301, 308) and temporary (302, 303, 307) redirects:

import { WebScraper, RedirectError } from './src/scraper.js';

// Throw on all redirects
const scraper = new WebScraper({
  followPermanentRedirect: false,  // Don't follow 301/308 (default: true)
  followTemporaryRedirect: false   // Don't follow 302/303/307 (default: true)
});

// Follow permanent redirects only
const permanentOnlyScraper = new WebScraper({
  followPermanentRedirect: true,
  followTemporaryRedirect: false
});

try {
  const result = await scraper.scrapeText('https://example.com/old-page');
  // Normal scraping if no redirect
  console.log(result.text);
} catch (error) {
  // Redirect detected - throws RedirectError
  if (error instanceof RedirectError) {
    console.log(`Redirect ${error.status}: ${error.originalUrl} -> ${error.location}`);
    console.log(error.message);
    // error.status - HTTP status code (301, 302, etc.)
    // error.location - redirect target URL
    // error.originalUrl - original URL that redirected
    // error.timestamp - ISO timestamp
  }
}

Use cases:

  • ✅ Detect moved or deprecated URLs
  • ✅ Track redirect chains in bulk operations
  • ✅ Validate URL structure without following redirects
  • ✅ Audit SEO redirect configurations
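As a quick reference for how the two options map onto status codes, here is a small sketch of the decision described in this section. shouldFollowRedirect is a hypothetical helper written for illustration. It mirrors the documented grouping and defaults, not the package's internal code.

```javascript
// Status codes as grouped in this README.
const PERMANENT_REDIRECTS = new Set([301, 308]);
const TEMPORARY_REDIRECTS = new Set([302, 303, 307]);

// Hypothetical helper: reports whether a response status would be followed
// under the two documented options (both default to true).
function shouldFollowRedirect(status, options = {}) {
  const {
    followPermanentRedirect = true,
    followTemporaryRedirect = true
  } = options;
  if (PERMANENT_REDIRECTS.has(status)) return followPermanentRedirect;
  if (TEMPORARY_REDIRECTS.has(status)) return followTemporaryRedirect;
  return null; // not a redirect status covered by these options
}
```

With followPermanentRedirect: true and followTemporaryRedirect: false, a 301 yields true and a 302 yields false, which corresponds to the "follow permanent redirects only" configuration shown earlier.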

Multiple Section Selectors

You can specify multiple CSS selectors to capture content from different section types. The scraper tries each selector and combines all matching sections. Each section in the results includes its selector as the id. If a selector matches several elements, the resulting sections are numbered.

const scraper = new WebScraper({
  sectionSelectors: ['article', 'section', '.content', 'main']
});

const result = await scraper.scrapeTextStructured('https://example.com');
// Returns sections matching ANY of the selectors

This is useful when:

  • Different pages use different HTML structures
  • You want to capture multiple types of content sections
  • Content is split across various semantic elements

Benefits:

  • ✅ More flexible scraping across different page layouts
  • ✅ Fallback selectors if primary selector doesn't match
  • ✅ Combine multiple content areas (e.g., main article + sidebars)

Navigation Wait Strategy (waitUntil)

Control when Playwright considers navigation complete. Useful for pages that load content at different stages:

| Value | Wait for | Best for |
|-------|----------|----------|
| 'domcontentloaded' | HTML parsed, DOM ready (default) | Most pages: fast and reliable |
| 'load' | All resources (images, scripts, stylesheets) | Pages where initial layout matters |
| 'networkidle' | No network activity for 500ms | Heavy SPAs that fetch data on load |
| 'commit' | First byte received | Fast link-following / redirect detection |

// Default: fastest, works for most pages
const defaultScraper = new WebScraper({ waitUntil: 'domcontentloaded' });

// Wait for all assets (images, fonts, etc.)
const loadScraper = new WebScraper({ waitUntil: 'load' });

// Wait for API calls to finish on a SPA
const idleScraper = new WebScraper({ waitUntil: 'networkidle' });

Tips:

  • 'domcontentloaded' is the default — prefer it unless content is missing.
  • Use 'networkidle' for SPAs that render after async data fetches, but expect slower scraping.
  • Combine with waitForSelector to wait for a specific element after navigation.
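The last tip can be captured in a tiny options builder. This is only a sketch: buildWaitOptions is a hypothetical helper, and the only names it takes from this README are the waitUntil and waitForSelector options and the four values in the table above.

```javascript
// Values listed in the table above.
const WAIT_UNTIL_VALUES = ['domcontentloaded', 'load', 'networkidle', 'commit'];

// Hypothetical helper: builds an options object that uses only the
// documented option names `waitUntil` and `waitForSelector`.
function buildWaitOptions({ waitUntil = 'domcontentloaded', waitForSelector } = {}) {
  if (!WAIT_UNTIL_VALUES.includes(waitUntil)) {
    throw new TypeError(`waitUntil must be one of: ${WAIT_UNTIL_VALUES.join(', ')}`);
  }
  // Only include waitForSelector when a selector was actually given.
  return waitForSelector ? { waitUntil, waitForSelector } : { waitUntil };
}
```

The returned object could then be passed to the constructor, for example new WebScraper(buildWaitOptions({ waitForSelector: '.main-content' })), which keeps the fast default and still waits for the element that matters.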

📊 Output Formats

JSON (default)

{
  "url": "https://example.com",
  "text": "Extracted content...",
  "length": 1234,
  "timestamp": "2025-10-10T10:00:00.000Z"
}

Structured JSON (--structured flag)

{
  "url": "https://example.com",
  "title": "Page Title",
  "headings": {"h1": ["Main"], "h2": ["Sub1", "Sub2"]},
  "paragraphs": ["Text..."],
  "links": [{"text": "Link", "href": "https://..."}],
  "lists": [{"type": "ul", "items": ["Item 1"]}],
  "images": [{"alt": "Desc", "src": "https://..."}]
}
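As a small illustration of consuming a structured result, the sketch below turns the headings map into an indented outline. headingOutline is a hypothetical helper, not part of the package, and it assumes headings has the { "h1": [...], "h2": [...] } shape shown above. Because that shape groups headings by level, the outline is grouped by level as well and does not reflect document order.

```javascript
// Hypothetical helper: flattens a `headings` map such as
// { h1: ['Main'], h2: ['Sub1', 'Sub2'] } into indented outline lines.
function headingOutline(structured) {
  const headings = structured.headings ?? {};
  return Object.keys(headings)
    .sort() // 'h1' through 'h6' sort correctly as plain strings
    .flatMap((level) => {
      const depth = Number(level.slice(1)) - 1; // h1 -> 0, h2 -> 1, ...
      return headings[level].map((text) => `${'  '.repeat(depth)}${text}`);
    });
}
```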

CSV (--format csv)

Comma-separated values with headers for bulk operations.
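The exact columns the CSV output contains are not listed here. As a rough sketch of what "values with headers" means, the code below builds CSV text from result objects using the url, length, and timestamp fields from the default JSON output. The column choice and the csvField and toCsv helpers are assumptions for illustration, not the package's code.

```javascript
// Quote a value when it contains a comma, double quote, or line break,
// doubling any embedded double quotes (RFC 4180 style).
function csvField(value) {
  const text = String(value ?? '');
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Hypothetical column choice, taken from the default JSON output fields.
function toCsv(results, columns = ['url', 'length', 'timestamp']) {
  const header = columns.join(',');
  const rows = results.map((row) => columns.map((c) => csvField(row[c])).join(','));
  return [header, ...rows].join('\n');
}
```

Quoting matters once extracted text itself becomes a column, since page content routinely contains commas and line breaks.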

TXT (--format txt)

Plain text concatenation for simple text output.

🛠️ CLI Command Reference

# Basic commands
npm run scrape "URL"                             # Basic scraping
npm run scrape "URL" -- --structured             # Structured extraction
npm run scrape "URL" -- --output file.json       # Save to file

# Advanced options
npm run scrape "URL" -- --browser firefox        # Use Firefox
npm run scrape "URL" -- --no-headless            # Show browser
npm run scrape "URL" -- --timeout 60000          # 60s timeout
npm run scrape "URL" -- --group-by "selector"    # Group content
npm run scrape "URL" -- --interaction-steps-file interactions.json

# Bulk operations
npm run scrape "URL1" "URL2"                     # Multiple URLs
npm run scrape -- --file urls.txt --batch-size 3 # Custom batching
npm run scrape -- --file urls.txt --delay 2000   # 2s delay

# Preset operations
npm run list-presets                             # List presets
npm run show-preset news                         # Show preset config
npm run scrape -- --preset news "URL"            # Use preset

🚦 Best Practices

  1. Be respectful - Use delays between requests (--delay 2000)
  2. Handle errors - Always use try-catch blocks in code
  3. Close resources - Call await scraper.close()
  4. Use presets - Leverage optimized configurations
  5. Check robots.txt - Respect website policies
  6. Test first - Try single URL before bulk operations
  7. Save important results - Use --output flag
  8. Optimize performance - Use headless mode and appropriate batch sizes
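Practices 2 and 3 combine naturally into a try/finally wrapper. The sketch below assumes only that the scraper object has an async close() method, as shown in Programmatic Usage. withScraper is a hypothetical helper, not part of the package.

```javascript
// Hypothetical wrapper: runs `task` with a freshly created scraper and
// always closes it, whether the task resolves or throws.
async function withScraper(createScraper, task) {
  const scraper = createScraper();
  try {
    return await task(scraper);
  } finally {
    await scraper.close();
  }
}
```

A call might look like withScraper(() => new WebScraper(), (s) => s.scrapeText('https://example.com')). Errors from the task still propagate to the caller, so they can be handled in one surrounding try-catch.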

🔍 Troubleshooting

Common Issues

| Issue | Solution |
|-------|----------|
| Timeout errors | Increase timeout: --timeout 60000 |
| Empty results | Try preset or wait selector |
| Browser crashes | Reduce batch size: --batch-size 2 |
| Memory issues | Process fewer URLs at once |
| Preset not found | Run npm run list-presets |
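For timeout errors in code, one option is to retry the same URL with a larger timeout before giving up. The sketch below is a generic pattern rather than the package's built-in retry mechanism. retryWithTimeouts is a hypothetical helper, and the 30000 and 60000 values are borrowed from the examples in this README.

```javascript
// Hypothetical helper: calls `attempt(timeout)` with each timeout in turn
// and returns the first successful result. If every attempt fails, the
// last error is rethrown.
async function retryWithTimeouts(attempt, timeouts = [30000, 60000]) {
  let lastError = new Error('no timeouts provided');
  for (const timeout of timeouts) {
    try {
      return await attempt(timeout);
    } catch (error) {
      lastError = error; // remember the failure and try the next timeout
    }
  }
  throw lastError;
}
```

Here attempt would typically create a scraper with the given timeout option, scrape the URL, and close the scraper again. This sketch retries on any error. A real caller would probably want to retry only on timeouts.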

📄 License

MIT License - Feel free to use in your projects!