Web Scraper TypeScript

A powerful and flexible web scraper library built with TypeScript, designed for both HTTP-based and browser-based scraping.

🚀 Features

  • HTTP & Browser Scraping: Support for both lightweight HTTP requests and full browser automation
  • TypeScript First: Full TypeScript support with comprehensive type definitions
  • Rate Limiting: Built-in rate limiting to respect website policies
  • Caching: Memory and file-based caching for improved performance
  • Data Transformation: Rich set of data transformation utilities
  • Error Handling: Comprehensive error handling with retry mechanisms
  • Concurrent Processing: Support for concurrent scraping with configurable limits
  • Event-Driven: Event emitters for monitoring scraping progress
  • Extensible: Plugin-based architecture for custom functionality

📦 Installation

npm install web-scraper-ts

# For browser-based scraping
npm install puppeteer playwright

🔧 Quick Start

Basic HTTP Scraping

import { WebScraper, DataExtractor } from 'web-scraper-ts';

const scraper = new WebScraper({
  timeout: 10000,
  retries: 3,
  rateLimit: {
    requestsPerSecond: 2,
  },
});

const result = await scraper.scrape('https://example.com', {
  title: {
    selector: 'title',
    required: true,
  },
  links: {
    selector: 'a',
    attribute: 'href',
    multiple: true,
  },
  price: {
    selector: '.price',
    transform: DataExtractor.transforms.extractPrice,
  },
});

if (result.success) {
  console.log('Title:', result.data.data.title);
  console.log('Links found:', result.data.data.links.length);
}

Browser-Based Scraping

import { BrowserScraper } from 'web-scraper-ts';

const browserScraper = new BrowserScraper({
  browser: {
    headless: true,
    viewport: { width: 1920, height: 1080 },
  },
});

const result = await browserScraper.scrape('https://spa-app.com', {
  dynamicContent: {
    selector: '.loaded-content',
    required: true,
  },
});

await browserScraper.destroy();

Advanced Usage with Rules

import { ScraperManager, DataExtractor } from 'web-scraper-ts';

const manager = new ScraperManager({
  rateLimit: { requestsPerSecond: 1 },
});

// Listen to events
manager.on('job:complete', ({ jobId, rule }) => {
  console.log(`Completed: ${rule}`);
});

const rules = [
  {
    name: 'product-scraping',
    url: 'https://ecommerce.com/products',
    selectors: {
      products: {
        selector: '.product',
        multiple: true,
      },
      prices: {
        selector: '.price',
        multiple: true,
        transform: DataExtractor.transforms.extractPrice,
      },
    },
  },
];

const results = await manager.executeRulesConcurrent(rules, 3);
console.log(`Scraped ${results.summary.successful} pages successfully`);

🛠️ API Reference

WebScraper

The main class for HTTP-based scraping.

Constructor Options

interface ScraperOptions {
  timeout?: number;              // Request timeout in ms (default: 30000)
  retries?: number;             // Number of retries (default: 3)
  delay?: number;               // Delay between retries in ms
  userAgent?: string;           // Custom user agent
  headers?: Record<string, string>; // Custom headers
  rateLimit?: {
    requestsPerSecond: number;
    burstSize?: number;
  };
  cache?: {
    enabled: boolean;
    ttl?: number;                // Time to live in seconds
    storage?: 'memory' | 'file';
    path?: string;               // For file storage
  };
  proxy?: {
    host: string;
    port: number;
    username?: string;
    password?: string;
  };
}
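
As a sketch of how the cache and proxy options above can be combined (the proxy host, port, and credentials are placeholders, not real values):

import { WebScraper } from 'web-scraper-ts';

const scraper = new WebScraper({
  timeout: 15000,
  cache: {
    enabled: true,
    ttl: 600,                    // cache responses for 10 minutes
    storage: 'memory',
  },
  proxy: {
    host: 'proxy.example.com',   // placeholder proxy endpoint
    port: 8080,
    username: 'user',
    password: 'secret',
  },
});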

Methods

  • scrape(url: string, selectors?: Record<string, SelectorConfig>) - Scrape a single URL
  • scrapeMultiple(urls: string[], selectors?) - Scrape multiple URLs (see the sketch below)
  • destroy() - Clean up resources
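
A rough sketch of scrapeMultiple, assuming it resolves to one result per URL (the return shape is an assumption; only the signature is documented above):

// Sketch: one selector map applied to several URLs.
// Assumes scrapeMultiple resolves to an array of per-URL results.
const results = await scraper.scrapeMultiple(
  ['https://example.com/page-1', 'https://example.com/page-2'],
  {
    title: { selector: 'title', required: true },
  },
);

console.log(`Fetched ${results.length} pages`);

// Release sockets and other resources when finished.
await scraper.destroy();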

BrowserScraper

For JavaScript-heavy sites requiring browser automation.

Constructor Options

interface BrowserConfig {
  headless?: boolean;           // Run in headless mode (default: true)
  viewport?: {
    width: number;
    height: number;
  };
  engine?: 'puppeteer' | 'playwright'; // Browser engine
  args?: string[];              // Additional browser arguments
}
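
For example, to drive a visible Playwright browser with extra launch arguments (a sketch based only on the fields above):

import { BrowserScraper } from 'web-scraper-ts';

const debugScraper = new BrowserScraper({
  browser: {
    headless: false,                     // show the browser while debugging
    engine: 'playwright',
    viewport: { width: 1280, height: 800 },
    args: ['--disable-gpu'],             // extra launch arguments
  },
});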

Methods

  • scrape(url: string, selectors?) - Scrape with browser
  • scrapeWithScript(url: string, script: string) - Execute custom JavaScript
  • takeScreenshot(url: string, options?) - Take page screenshot (sketch below)
  • destroy() - Close browser and cleanup
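
A short sketch of the screenshot and custom-script helpers; only the signatures above are documented, so the omitted screenshot options and the script's return handling are assumptions:

// Sketch: screenshot and custom-script helpers (options omitted).
const screenshot = await browserScraper.takeScreenshot('https://example.com');

// Run a custom JavaScript string in the page context.
const pageTitle = await browserScraper.scrapeWithScript(
  'https://example.com',
  'return document.title;',
);
console.log('Title:', pageTitle);

// Close the browser when done.
await browserScraper.destroy();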

ScraperManager

High-level manager for complex scraping operations.

Methods

  • executeRule(rule: ScrapingRule) - Execute single scraping rule
  • executeRules(rules: ScrapingRule[]) - Execute multiple rules sequentially
  • executeRulesConcurrent(rules: ScrapingRule[], concurrency: number) - Execute rules concurrently
  • healthCheck() - Check system health (sketch below)
  • destroy() - Cleanup all resources
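
Complementing the concurrent example in Quick Start, a single rule can also be run on its own; the return shapes of executeRule() and healthCheck() are not documented here, so the sketch simply logs them:

// Sketch: run one rule from the earlier `rules` array, then check health.
const single = await manager.executeRule(rules[0]);
console.log('Rule result:', single);

// The healthCheck() result shape is not documented above; just log it.
const health = await manager.healthCheck();
console.log('Health:', health);

// Cleanup all underlying scrapers.
await manager.destroy();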

SelectorConfig

Configuration for data extraction:

interface SelectorConfig {
  selector: string;             // CSS selector or XPath
  attribute?: string;           // Extract attribute instead of text
  transform?: (value: string) => any; // Transform extracted data
  multiple?: boolean;           // Extract multiple elements
  required?: boolean;           // Throw error if not found
}
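
Because transform accepts any (value: string) => any function, custom transforms can sit alongside the built-ins. A small sketch (the .rating and .badge selectors are made up for illustration):

const selectors = {
  rating: {
    selector: '.rating',                 // hypothetical selector
    transform: (value: string) => parseFloat(value.replace(/[^\d.]/g, '')),
  },
  badges: {
    selector: '.badge',                  // hypothetical selector
    multiple: true,                      // collect every matching element
  },
};

const page = await scraper.scrape('https://example.com', selectors);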

Data Transformations

Built-in transformation functions:

DataExtractor.transforms = {
  toNumber: (value: string) => number,
  toDate: (value: string) => Date,
  extractPrice: (value: string) => number,
  extractEmail: (value: string) => string | null,
  extractPhone: (value: string) => string | null,
  cleanText: (value: string) => string,
  removeHtml: (value: string) => string,
  slugify: (value: string) => string,
  extractUrls: (value: string) => string[],
  // ... more transforms
};
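
Since these are plain functions, they can also be applied directly when post-processing already-scraped strings (a small sketch):

import { DataExtractor } from 'web-scraper-ts';

const price = DataExtractor.transforms.extractPrice('$1,299.99');
const slug = DataExtractor.transforms.slugify('Web Scraper TypeScript');
const text = DataExtractor.transforms.removeHtml('<p>Hello <b>world</b></p>');

console.log(price, slug, text);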

🔧 Configuration

Environment Variables

# Optional: Set default configuration
SCRAPER_DEFAULT_TIMEOUT=30000
SCRAPER_DEFAULT_RETRIES=3
SCRAPER_CACHE_PATH=./cache
SCRAPER_LOG_LEVEL=info

Configuration File

Create scraper.config.js in your project root:

module.exports = {
  defaultOptions: {
    timeout: 15000,
    retries: 2,
    rateLimit: {
      requestsPerSecond: 1,
    },
    cache: {
      enabled: true,
      ttl: 300,
      storage: 'file',
      path: './cache',
    },
  },
  logger: {
    level: 'info',
    format: 'json',
    output: 'file',
    filePath: './logs/scraper.log',
  },
};

🧪 Testing

# Run tests
npm test

# Run tests with coverage
npm run test:coverage

# Run tests in watch mode
npm run test:watch

📝 Examples

Check the examples/ directory in the repository for more comprehensive examples.

🚨 Error Handling

The library provides comprehensive error handling:

const result = await scraper.scrape(url, selectors);

if (!result.success) {
  switch (result.error?.type) {
    case 'NETWORK_ERROR':
      console.log('Network issue:', result.error.message);
      break;
    case 'TIMEOUT_ERROR':
      console.log('Request timed out');
      break;
    case 'RATE_LIMIT_ERROR':
      console.log('Rate limited, try again later');
      break;
    case 'SELECTOR_ERROR':
      console.log('Selector not found:', result.error.message);
      break;
    case 'PARSING_ERROR':
      console.log('Failed to parse content');
      break;
  }
}

🎯 Best Practices

1. Respect Robots.txt

Always check the target website's robots.txt file and respect its crawling rules.
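
web-scraper-ts does not appear to fetch robots.txt for you, so a quick manual check is worthwhile. The helper below is a rough standalone sketch (not part of the library) that uses the global fetch available in Node 18+ and ignores user-agent sections for simplicity:

// Rough sketch: fetch robots.txt and test a path against its Disallow rules.
// Real projects should use a dedicated robots.txt parser instead.
async function isPathDisallowed(origin: string, path: string): Promise<boolean> {
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return false;                       // no robots.txt: treat as allowed
  const lines = (await res.text()).split('\n');
  return lines.some((raw) => {
    const line = raw.trim();
    if (!line.toLowerCase().startsWith('disallow:')) return false;
    const rule = line.slice('disallow:'.length).trim();
    return rule.length > 0 && path.startsWith(rule);
  });
}

if (await isPathDisallowed('https://example.com', '/products')) {
  console.log('Skipping: disallowed by robots.txt');
}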

2. Use Rate Limiting

Implement appropriate rate limiting to avoid overwhelming servers:

const scraper = new WebScraper({
  rateLimit: {
    requestsPerSecond: 1, // Conservative rate
    burstSize: 3,
  },
});

3. Handle Errors Gracefully

Always implement proper error handling and retry logic:

const scraper = new WebScraper({
  retries: 3,
  delay: 1000, // 1 second between retries
});

4. Use Caching

Enable caching for frequently accessed data:

const scraper = new WebScraper({
  cache: {
    enabled: true,
    ttl: 3600, // 1 hour
    storage: 'file',
  },
});

5. Monitor Performance

Use events to monitor scraping performance:

manager.on('job:complete', ({ rule, result }) => {
  console.log(`${rule} completed in ${result.metadata.responseTime}ms`);
});

🛡️ Legal Considerations

  • Always respect robots.txt files
  • Be mindful of website terms of service
  • Implement appropriate delays between requests
  • Consider the impact on website performance
  • Respect copyright and data privacy laws

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Clone the repository
git clone https://github.com/Arifzyn19/web-scraper-ts.git
cd web-scraper-ts

# Install dependencies
npm install

# Run in development mode
npm run dev

# Run tests
npm test

# Build the project
npm run build

📊 Performance

Benchmarks

| Operation | HTTP Scraper | Browser Scraper |
|-----------|--------------|-----------------|
| Simple page | ~100ms | ~2000ms |
| Complex SPA | N/A | ~3000ms |
| Multiple pages (10) | ~1s | ~15s |

Memory Usage

  • HTTP Scraper: ~10-50MB
  • Browser Scraper: ~100-500MB (per browser instance)

🔄 Changelog

v1.0.0

  • Initial release
  • HTTP and browser-based scraping
  • Rate limiting and caching
  • Data transformation utilities
  • Comprehensive error handling

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Cheerio for HTML parsing
  • Puppeteer for browser automation
  • Axios for HTTP requests
  • The open source community for inspiration and contributions

📞 Support


Made with ❤️ by the Web Scraper Team