Web Scraper TypeScript
A powerful and flexible web scraper library built with TypeScript, designed for both HTTP-based and browser-based scraping.
🚀 Features
- HTTP & Browser Scraping: Support for both lightweight HTTP requests and full browser automation
- TypeScript First: Full TypeScript support with comprehensive type definitions
- Rate Limiting: Built-in rate limiting to respect website policies
- Caching: Memory and file-based caching for improved performance
- Data Transformation: Rich set of data transformation utilities
- Error Handling: Comprehensive error handling with retry mechanisms
- Concurrent Processing: Support for concurrent scraping with configurable limits
- Event-Driven: Event emitters for monitoring scraping progress
- Extensible: Plugin-based architecture for custom functionality
📦 Installation
npm install web-scraper-ts
# For browser-based scraping
npm install puppeteer playwright
🔧 Quick Start
Basic HTTP Scraping
import { WebScraper, DataExtractor } from 'web-scraper-ts';
const scraper = new WebScraper({
  timeout: 10000,
  retries: 3,
  rateLimit: {
    requestsPerSecond: 2,
  },
});
const result = await scraper.scrape('https://example.com', {
  title: {
    selector: 'title',
    required: true,
  },
  links: {
    selector: 'a',
    attribute: 'href',
    multiple: true,
  },
  price: {
    selector: '.price',
    transform: DataExtractor.transforms.extractPrice,
  },
});
if (result.success) {
  console.log('Title:', result.data.data.title);
  console.log('Links found:', result.data.data.links.length);
}
Browser-Based Scraping
import { BrowserScraper } from 'web-scraper-ts';
const browserScraper = new BrowserScraper({
  browser: {
    headless: true,
    viewport: { width: 1920, height: 1080 },
  },
});
const result = await browserScraper.scrape('https://spa-app.com', {
  dynamicContent: {
    selector: '.loaded-content',
    required: true,
  },
});
await browserScraper.destroy();
Advanced Usage with Rules
import { ScraperManager, DataExtractor } from 'web-scraper-ts';
const manager = new ScraperManager({
  rateLimit: { requestsPerSecond: 1 },
});
// Listen to events
manager.on('job:complete', ({ jobId, rule }) => {
  console.log(`Completed: ${rule}`);
});
const rules = [
  {
    name: 'product-scraping',
    url: 'https://ecommerce.com/products',
    selectors: {
      products: {
        selector: '.product',
        multiple: true,
      },
      prices: {
        selector: '.price',
        multiple: true,
        transform: DataExtractor.transforms.extractPrice,
      },
    },
  },
];
const results = await manager.executeRulesConcurrent(rules, 3);
console.log(`Scraped ${results.summary.successful} pages successfully`);
🛠️ API Reference
WebScraper
The main class for HTTP-based scraping.
Constructor Options
interface ScraperOptions {
  timeout?: number;                 // Request timeout in ms (default: 30000)
  retries?: number;                 // Number of retries (default: 3)
  delay?: number;                   // Delay between retries in ms
  userAgent?: string;               // Custom user agent
  headers?: Record<string, string>; // Custom headers
  rateLimit?: {
    requestsPerSecond: number;
    burstSize?: number;
  };
  cache?: {
    enabled: boolean;
    ttl?: number;                   // Time to live in seconds
    storage?: 'memory' | 'file';
    path?: string;                  // For file storage
  };
  proxy?: {
    host: string;
    port: number;
    username?: string;
    password?: string;
  };
}
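For instance, a scraper that exercises the cache and proxy options above could be configured like this (the proxy host and credentials are placeholders):
import { WebScraper } from 'web-scraper-ts';
const configuredScraper = new WebScraper({
  timeout: 15000,
  cache: {
    enabled: true,
    ttl: 600,                  // cache responses for 10 minutes
    storage: 'file',
    path: './cache',
  },
  proxy: {
    host: 'proxy.example.com', // placeholder proxy host
    port: 8080,
    username: 'user',          // placeholder credentials
    password: 'secret',
  },
});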
Methods
- scrape(url: string, selectors?: Record<string, SelectorConfig>) - Scrape a single URL
- scrapeMultiple(urls: string[], selectors?) - Scrape multiple URLs
- destroy() - Clean up resources
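A minimal sketch of scrapeMultiple, assuming it resolves to one result per URL in the same shape that scrape returns (the docs above do not spell this out):
import { WebScraper } from 'web-scraper-ts';
const scraper = new WebScraper({ rateLimit: { requestsPerSecond: 2 } });
// Assumed: scrapeMultiple resolves to an array of per-URL results
const pages = await scraper.scrapeMultiple(
  ['https://example.com/a', 'https://example.com/b'],
  { title: { selector: 'title' } },
);
for (const page of pages) {
  if (page.success) {
    console.log(page.data);
  }
}
await scraper.destroy();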
BrowserScraper
For JavaScript-heavy sites requiring browser automation.
Constructor Options
interface BrowserConfig {
  headless?: boolean;                  // Run in headless mode (default: true)
  viewport?: {
    width: number;
    height: number;
  };
  engine?: 'puppeteer' | 'playwright'; // Browser engine
  args?: string[];                     // Additional browser arguments
}
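To pick the Playwright engine explicitly, the config above is passed under the browser key, as in the Quick Start (the extra argument is illustrative):
import { BrowserScraper } from 'web-scraper-ts';
const playwrightScraper = new BrowserScraper({
  browser: {
    engine: 'playwright',
    headless: true,
    args: ['--disable-gpu'], // illustrative extra browser argument
  },
});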
Methods
- scrape(url: string, selectors?) - Scrape with the browser
- scrapeWithScript(url: string, script: string) - Execute custom JavaScript
- takeScreenshot(url: string, options?) - Take a page screenshot
- destroy() - Close the browser and clean up
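A hedged sketch of scrapeWithScript; the shape of its return value is not documented above, so it is only logged here:
import { BrowserScraper } from 'web-scraper-ts';
const browserScraper = new BrowserScraper({ browser: { headless: true } });
// The script string is evaluated in the page context (assumed behaviour)
const scripted = await browserScraper.scrapeWithScript(
  'https://spa-app.com',
  'document.querySelectorAll(".item").length',
);
console.log(scripted);
await browserScraper.destroy();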
ScraperManager
High-level manager for complex scraping operations.
Methods
- executeRule(rule: ScrapingRule) - Execute a single scraping rule
- executeRules(rules: ScrapingRule[]) - Execute multiple rules sequentially
- executeRulesConcurrent(rules: ScrapingRule[], concurrency: number) - Execute rules concurrently
- healthCheck() - Check system health
- destroy() - Clean up all resources
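A short sketch combining healthCheck with a single rule; the shape of the health report is not documented above, so it is only logged:
import { ScraperManager } from 'web-scraper-ts';
const manager = new ScraperManager({ rateLimit: { requestsPerSecond: 1 } });
console.log(await manager.healthCheck());
const single = await manager.executeRule({
  name: 'homepage',
  url: 'https://example.com',
  selectors: {
    title: { selector: 'title', required: true },
  },
});
console.log(single);
await manager.destroy();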
SelectorConfig
Configuration for data extraction:
interface SelectorConfig {
  selector: string;                   // CSS selector or XPath
  attribute?: string;                 // Extract attribute instead of text
  transform?: (value: string) => any; // Transform extracted data
  multiple?: boolean;                 // Extract multiple elements
  required?: boolean;                 // Throw error if not found
}
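A custom transform is simply a function over the extracted string (the selector and parsing logic below are illustrative):
const selectors = {
  rating: {
    selector: '.rating',
    // Turn a string such as "4.5 stars" into a number (NaN if no leading number)
    transform: (value: string) => parseFloat(value),
  },
};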
Data Transformations
Built-in transformation functions:
DataExtractor.transforms = {
  toNumber: (value: string) => number,
  toDate: (value: string) => Date,
  extractPrice: (value: string) => number,
  extractEmail: (value: string) => string | null,
  extractPhone: (value: string) => string | null,
  cleanText: (value: string) => string,
  removeHtml: (value: string) => string,
  slugify: (value: string) => string,
  extractUrls: (value: string) => string[],
  // ... more transforms
};
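The built-ins can also be composed inside a custom transform, for example stripping markup and whitespace before parsing a number (the selector is illustrative):
import { WebScraper, DataExtractor } from 'web-scraper-ts';
const scraper = new WebScraper({ retries: 2 });
const result = await scraper.scrape('https://example.com', {
  stock: {
    selector: '.stock-count',
    // Compose built-ins: strip HTML, clean whitespace, then parse the number
    transform: (value: string) =>
      DataExtractor.transforms.toNumber(
        DataExtractor.transforms.cleanText(
          DataExtractor.transforms.removeHtml(value),
        ),
      ),
  },
});
console.log(result);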
🔧 Configuration
Environment Variables
# Optional: Set default configuration
SCRAPER_DEFAULT_TIMEOUT=30000
SCRAPER_DEFAULT_RETRIES=3
SCRAPER_CACHE_PATH=./cache
SCRAPER_LOG_LEVEL=info
Configuration File
Create scraper.config.js in your project root:
module.exports = {
  defaultOptions: {
    timeout: 15000,
    retries: 2,
    rateLimit: {
      requestsPerSecond: 1,
    },
    cache: {
      enabled: true,
      ttl: 300,
      storage: 'file',
      path: './cache',
    },
  },
  logger: {
    level: 'info',
    format: 'json',
    output: 'file',
    filePath: './logs/scraper.log',
  },
};
🧪 Testing
# Run tests
npm test
# Run tests with coverage
npm run test:coverage
# Run tests in watch mode
npm run test:watch
📝 Examples
Check the examples/ directory of the repository for more comprehensive examples.
🚨 Error Handling
The library provides comprehensive error handling:
const result = await scraper.scrape(url, selectors);
if (!result.success) {
  switch (result.error?.type) {
    case 'NETWORK_ERROR':
      console.log('Network issue:', result.error.message);
      break;
    case 'TIMEOUT_ERROR':
      console.log('Request timed out');
      break;
    case 'RATE_LIMIT_ERROR':
      console.log('Rate limited, try again later');
      break;
    case 'SELECTOR_ERROR':
      console.log('Selector not found:', result.error.message);
      break;
    case 'PARSING_ERROR':
      console.log('Failed to parse content');
      break;
  }
}
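Building on those error types, a small application-level retry helper might look like the sketch below (backoff timings are arbitrary; the selectors parameter is typed via the scrape signature rather than assuming a SelectorConfig export):
import { WebScraper } from 'web-scraper-ts';
// Retry only on rate-limit errors, with exponential backoff (2s, 4s, ...)
async function scrapeWithBackoff(
  scraper: WebScraper,
  url: string,
  selectors: Parameters<WebScraper['scrape']>[1],
  attempts = 3,
) {
  let result = await scraper.scrape(url, selectors);
  for (
    let i = 1;
    i < attempts && !result.success && result.error?.type === 'RATE_LIMIT_ERROR';
    i++
  ) {
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** i));
    result = await scraper.scrape(url, selectors);
  }
  return result;
}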
🎯 Best Practices
1. Respect Robots.txt
Always check the target website's robots.txt file and respect their scraping policies.
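A deliberately naive pre-flight check is sketched below (it assumes Node 18+ for the global fetch and ignores user-agent groups, wildcards, and Allow rules; prefer a dedicated robots.txt parser in practice):
// Flags a path if any Disallow prefix in robots.txt matches it
async function isLikelyDisallowed(origin: string, path: string): Promise<boolean> {
  const response = await fetch(new URL('/robots.txt', origin));
  if (!response.ok) return false;
  const body = await response.text();
  return body.split('\n').some((line) => {
    const match = line.trim().match(/^Disallow:\s*(\S+)/i);
    return match !== null && path.startsWith(match[1]);
  });
}
if (await isLikelyDisallowed('https://example.com', '/products')) {
  console.log('Skipping: path appears to be disallowed by robots.txt');
}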
2. Use Rate Limiting
Implement appropriate rate limiting to avoid overwhelming servers:
const scraper = new WebScraper({
  rateLimit: {
    requestsPerSecond: 1, // Conservative rate
    burstSize: 3,
  },
});
3. Handle Errors Gracefully
Always implement proper error handling and retry logic:
const scraper = new WebScraper({
  retries: 3,
  delay: 1000, // 1 second between retries
});
4. Use Caching
Enable caching for frequently accessed data:
const scraper = new WebScraper({
  cache: {
    enabled: true,
    ttl: 3600, // 1 hour
    storage: 'file',
  },
});
5. Monitor Performance
Use events to monitor scraping performance:
manager.on('job:complete', ({ rule, result }) => {
  console.log(`${rule} completed in ${result.metadata.responseTime}ms`);
});
🛡️ Legal Considerations
- Always respect robots.txt files
- Be mindful of website terms of service
- Implement appropriate delays between requests
- Consider the impact on website performance
- Respect copyright and data privacy laws
🤝 Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Development Setup
# Clone the repository
git clone https://github.com/Arifzyn19/web-scraper-ts.git
cd web-scraper-ts
# Install dependencies
npm install
# Run in development mode
npm run dev
# Run tests
npm test
# Build the project
npm run build
📊 Performance
Benchmarks
| Operation | HTTP Scraper | Browser Scraper |
|-----------|--------------|-----------------|
| Simple page | ~100ms | ~2000ms |
| Complex SPA | N/A | ~3000ms |
| Multiple pages (10) | ~1s | ~15s |
Memory Usage
- HTTP Scraper: ~10-50MB
- Browser Scraper: ~100-500MB (per browser instance)
🔄 Changelog
v1.0.0
- Initial release
- HTTP and browser-based scraping
- Rate limiting and caching
- Data transformation utilities
- Comprehensive error handling
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Cheerio for HTML parsing
- Puppeteer for browser automation
- Axios for HTTP requests
- The open source community for inspiration and contributions
📞 Support
- 📧 Email: [email protected]
- 💬 Discord: Join our community
- 🐛 Issues: GitHub Issues
Made with ❤️ by the Web Scraper Team
