Web Scraper TypeScript
A powerful and flexible web scraper library built with TypeScript, designed for both HTTP-based and browser-based scraping.
🚀 Features
- HTTP & Browser Scraping: Support for both lightweight HTTP requests and full browser automation
- TypeScript First: Full TypeScript support with comprehensive type definitions
- Rate Limiting: Built-in rate limiting to respect website policies
- Caching: Memory and file-based caching for improved performance
- Data Transformation: Rich set of data transformation utilities
- Error Handling: Comprehensive error handling with retry mechanisms
- Concurrent Processing: Support for concurrent scraping with configurable limits
- Event-Driven: Event emitters for monitoring scraping progress
- Extensible: Plugin-based architecture for custom functionality
📦 Installation
npm install web-scraper-ts
# For browser-based scraping
npm install puppeteer playwright
🔧 Quick Start
Basic HTTP Scraping
import { WebScraper, DataExtractor } from 'web-scraper-ts';
const scraper = new WebScraper({
  timeout: 10000,
  retries: 3,
  rateLimit: {
    requestsPerSecond: 2,
  },
});
const result = await scraper.scrape('https://example.com', {
  title: {
    selector: 'title',
    required: true,
  },
  links: {
    selector: 'a',
    attribute: 'href',
    multiple: true,
  },
  price: {
    selector: '.price',
    transform: DataExtractor.transforms.extractPrice,
  },
});
if (result.success) {
  console.log('Title:', result.data.data.title);
  console.log('Links found:', result.data.data.links.length);
}
Browser-Based Scraping
import { BrowserScraper } from 'web-scraper-ts';
const browserScraper = new BrowserScraper({
  browser: {
    headless: true,
    viewport: { width: 1920, height: 1080 },
  },
});
const result = await browserScraper.scrape('https://spa-app.com', {
  dynamicContent: {
    selector: '.loaded-content',
    required: true,
  },
});
await browserScraper.destroy();
Advanced Usage with Rules
import { ScraperManager, DataExtractor } from 'web-scraper-ts';
const manager = new ScraperManager({
  rateLimit: { requestsPerSecond: 1 },
});
// Listen to events
manager.on('job:complete', ({ jobId, rule }) => {
  console.log(`Completed: ${rule}`);
});
const rules = [
  {
    name: 'product-scraping',
    url: 'https://ecommerce.com/products',
    selectors: {
      products: {
        selector: '.product',
        multiple: true,
      },
      prices: {
        selector: '.price',
        multiple: true,
        transform: DataExtractor.transforms.extractPrice,
      },
    },
  },
];
const results = await manager.executeRulesConcurrent(rules, 3);
console.log(`Scraped ${results.summary.successful} pages successfully`);
🛠️ API Reference
WebScraper
The main class for HTTP-based scraping.
Constructor Options
interface ScraperOptions {
  timeout?: number;                 // Request timeout in ms (default: 30000)
  retries?: number;                 // Number of retries (default: 3)
  delay?: number;                   // Delay between retries in ms
  userAgent?: string;               // Custom user agent
  headers?: Record<string, string>; // Custom headers
  rateLimit?: {
    requestsPerSecond: number;
    burstSize?: number;
  };
  cache?: {
    enabled: boolean;
    ttl?: number;                   // Time to live in seconds
    storage?: 'memory' | 'file';
    path?: string;                  // For file storage
  };
  proxy?: {
    host: string;
    port: number;
    username?: string;
    password?: string;
  };
}
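For instance, a scraper that exercises the cache and proxy options above could be configured like this (the proxy host and credentials are placeholders):
import { WebScraper } from 'web-scraper-ts';
const configuredScraper = new WebScraper({
  timeout: 15000,
  cache: {
    enabled: true,
    ttl: 600,                  // cache responses for 10 minutes
    storage: 'file',
    path: './cache',
  },
  proxy: {
    host: 'proxy.example.com', // placeholder proxy host
    port: 8080,
    username: 'user',          // placeholder credentials
    password: 'secret',
  },
});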
Methods
- scrape(url: string, selectors?: Record<string, SelectorConfig>) - Scrape a single URL
- scrapeMultiple(urls: string[], selectors?) - Scrape multiple URLs
- destroy() - Clean up resources
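A minimal sketch of scrapeMultiple, assuming it resolves to one result per URL in the same shape that scrape returns (the docs above do not spell this out):
import { WebScraper } from 'web-scraper-ts';
const scraper = new WebScraper({ rateLimit: { requestsPerSecond: 2 } });
// Assumed: scrapeMultiple resolves to an array of per-URL results
const pages = await scraper.scrapeMultiple(
  ['https://example.com/a', 'https://example.com/b'],
  { title: { selector: 'title' } },
);
for (const page of pages) {
  if (page.success) {
    console.log(page.data);
  }
}
await scraper.destroy();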
BrowserScraper
For JavaScript-heavy sites requiring browser automation.
Constructor Options
interface BrowserConfig {
  headless?: boolean;                  // Run in headless mode (default: true)
  viewport?: {
    width: number;
    height: number;
  };
  engine?: 'puppeteer' | 'playwright'; // Browser engine
  args?: string[];                     // Additional browser arguments
}
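To pick the Playwright engine explicitly, the config above is passed under the browser key, as in the Quick Start (the extra argument is illustrative):
import { BrowserScraper } from 'web-scraper-ts';
const playwrightScraper = new BrowserScraper({
  browser: {
    engine: 'playwright',
    headless: true,
    args: ['--disable-gpu'], // illustrative extra browser argument
  },
});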
Methods
- scrape(url: string, selectors?) - Scrape with the browser
- scrapeWithScript(url: string, script: string) - Execute custom JavaScript
- takeScreenshot(url: string, options?) - Take a page screenshot
- destroy() - Close the browser and clean up
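A hedged sketch of scrapeWithScript; the shape of its return value is not documented above, so it is only logged here:
import { BrowserScraper } from 'web-scraper-ts';
const browserScraper = new BrowserScraper({ browser: { headless: true } });
// The script string is evaluated in the page context (assumed behaviour)
const scripted = await browserScraper.scrapeWithScript(
  'https://spa-app.com',
  'document.querySelectorAll(".item").length',
);
console.log(scripted);
await browserScraper.destroy();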
ScraperManager
High-level manager for complex scraping operations.
Methods
- executeRule(rule: ScrapingRule) - Execute a single scraping rule
- executeRules(rules: ScrapingRule[]) - Execute multiple rules sequentially
- executeRulesConcurrent(rules: ScrapingRule[], concurrency: number) - Execute rules concurrently
- healthCheck() - Check system health
- destroy() - Clean up all resources
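A short sketch combining healthCheck with a single rule; the shape of the health report is not documented above, so it is only logged:
import { ScraperManager } from 'web-scraper-ts';
const manager = new ScraperManager({ rateLimit: { requestsPerSecond: 1 } });
console.log(await manager.healthCheck());
const single = await manager.executeRule({
  name: 'homepage',
  url: 'https://example.com',
  selectors: {
    title: { selector: 'title', required: true },
  },
});
console.log(single);
await manager.destroy();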
SelectorConfig
Configuration for data extraction:
interface SelectorConfig {
  selector: string;                   // CSS selector or XPath
  attribute?: string;                 // Extract attribute instead of text
  transform?: (value: string) => any; // Transform extracted data
  multiple?: boolean;                 // Extract multiple elements
  required?: boolean;                 // Throw error if not found
}
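A custom transform is simply a function over the extracted string (the selector and parsing logic below are illustrative):
const selectors = {
  rating: {
    selector: '.rating',
    // Turn a string such as "4.5 stars" into a number (NaN if no leading number)
    transform: (value: string) => parseFloat(value),
  },
};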
Data Transformations
Built-in transformation functions:
DataExtractor.transforms = {
  toNumber: (value: string) => number,
  toDate: (value: string) => Date,
  extractPrice: (value: string) => number,
  extractEmail: (value: string) => string | null,
  extractPhone: (value: string) => string | null,
  cleanText: (value: string) => string,
  removeHtml: (value: string) => string,
  slugify: (value: string) => string,
  extractUrls: (value: string) => string[],
  // ... more transforms
};
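The built-ins can also be composed inside a custom transform, for example stripping markup and whitespace before parsing a number (the selector is illustrative):
import { WebScraper, DataExtractor } from 'web-scraper-ts';
const scraper = new WebScraper({ retries: 2 });
const result = await scraper.scrape('https://example.com', {
  stock: {
    selector: '.stock-count',
    // Compose built-ins: strip HTML, clean whitespace, then parse the number
    transform: (value: string) =>
      DataExtractor.transforms.toNumber(
        DataExtractor.transforms.cleanText(
          DataExtractor.transforms.removeHtml(value),
        ),
      ),
  },
});
console.log(result);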
🔧 Configuration
Environment Variables
# Optional: Set default configuration
SCRAPER_DEFAULT_TIMEOUT=30000
SCRAPER_DEFAULT_RETRIES=3
SCRAPER_CACHE_PATH=./cache
SCRAPER_LOG_LEVEL=info
Configuration File
Create scraper.config.js in your project root:
module.exports = {
  defaultOptions: {
    timeout: 15000,
    retries: 2,
    rateLimit: {
      requestsPerSecond: 1,
    },
    cache: {
      enabled: true,
      ttl: 300,
      storage: 'file',
      path: './cache',
    },
  },
  logger: {
    level: 'info',
    format: 'json',
    output: 'file',
    filePath: './logs/scraper.log',
  },
};
🧪 Testing
# Run tests
npm test
# Run tests with coverage
npm run test:coverage
# Run tests in watch mode
npm run test:watch
📝 Examples
Check the examples/ directory of the repository for more comprehensive examples.
🚨 Error Handling
The library provides comprehensive error handling:
const result = await scraper.scrape(url, selectors);
if (!result.success) {
  switch (result.error?.type) {
    case 'NETWORK_ERROR':
      console.log('Network issue:', result.error.message);
      break;
    case 'TIMEOUT_ERROR':
      console.log('Request timed out');
      break;
    case 'RATE_LIMIT_ERROR':
      console.log('Rate limited, try again later');
      break;
    case 'SELECTOR_ERROR':
      console.log('Selector not found:', result.error.message);
      break;
    case 'PARSING_ERROR':
      console.log('Failed to parse content');
      break;
  }
}
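Building on those error types, a small application-level retry helper might look like the sketch below (backoff timings are arbitrary; the selectors parameter is typed via the scrape signature rather than assuming a SelectorConfig export):
import { WebScraper } from 'web-scraper-ts';
// Retry only on rate-limit errors, with exponential backoff (2s, 4s, ...)
async function scrapeWithBackoff(
  scraper: WebScraper,
  url: string,
  selectors: Parameters<WebScraper['scrape']>[1],
  attempts = 3,
) {
  let result = await scraper.scrape(url, selectors);
  for (
    let i = 1;
    i < attempts && !result.success && result.error?.type === 'RATE_LIMIT_ERROR';
    i++
  ) {
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** i));
    result = await scraper.scrape(url, selectors);
  }
  return result;
}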
🎯 Best Practices
1. Respect Robots.txt
Always check the target website's robots.txt file and respect their scraping policies.
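A deliberately naive pre-flight check is sketched below (it assumes Node 18+ for the global fetch and ignores user-agent groups, wildcards, and Allow rules; prefer a dedicated robots.txt parser in practice):
// Flags a path if any Disallow prefix in robots.txt matches it
async function isLikelyDisallowed(origin: string, path: string): Promise<boolean> {
  const response = await fetch(new URL('/robots.txt', origin));
  if (!response.ok) return false;
  const body = await response.text();
  return body.split('\n').some((line) => {
    const match = line.trim().match(/^Disallow:\s*(\S+)/i);
    return match !== null && path.startsWith(match[1]);
  });
}
if (await isLikelyDisallowed('https://example.com', '/products')) {
  console.log('Skipping: path appears to be disallowed by robots.txt');
}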
2. Use Rate Limiting
Implement appropriate rate limiting to avoid overwhelming servers:
const scraper = new WebScraper({
  rateLimit: {
    requestsPerSecond: 1, // Conservative rate
    burstSize: 3,
  },
});
3. Handle Errors Gracefully
Always implement proper error handling and retry logic:
const scraper = new WebScraper({
  retries: 3,
  delay: 1000, // 1 second between retries
});
4. Use Caching
Enable caching for frequently accessed data:
const scraper = new WebScraper({
  cache: {
    enabled: true,
    ttl: 3600, // 1 hour
    storage: 'file',
  },
});
5. Monitor Performance
Use events to monitor scraping performance:
manager.on('job:complete', ({ rule, result }) => {
  console.log(`${rule} completed in ${result.metadata.responseTime}ms`);
});
🛡️ Legal Considerations
- Always respect robots.txt files
- Be mindful of website terms of service
- Implement appropriate delays between requests
- Consider the impact on website performance
- Respect copyright and data privacy laws
🤝 Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Development Setup
# Clone the repository
git clone https://github.com/Arifzyn19/web-scraper-ts.git
cd web-scraper-ts
# Install dependencies
npm install
# Run in development mode
npm run dev
# Run tests
npm test
# Build the project
npm run build
📊 Performance
Benchmarks
| Operation | HTTP Scraper | Browser Scraper |
|-----------|--------------|-----------------|
| Simple page | ~100ms | ~2000ms |
| Complex SPA | N/A | ~3000ms |
| Multiple pages (10) | ~1s | ~15s |
Memory Usage
- HTTP Scraper: ~10-50MB
- Browser Scraper: ~100-500MB (per browser instance)
🔄 Changelog
v1.0.0
- Initial release
- HTTP and browser-based scraping
- Rate limiting and caching
- Data transformation utilities
- Comprehensive error handling
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Cheerio for HTML parsing
- Puppeteer for browser automation
- Axios for HTTP requests
- The open source community for inspiration and contributions
📞 Support
- 📧 Email: [email protected]
- 💬 Discord: Join our community
- 🐛 Issues: GitHub Issues
Made with ❤️ by the Web Scraper Team
