web-scraper-pro
v1.1.1
Professional web scraper with Puppeteer & Mozilla Readability. Extract clean content from any website with full TypeScript support.
web-scraper-pro
A professional web scraper powered by Puppeteer and Mozilla Readability. Extract clean, readable content from any website with full TypeScript support and comprehensive error handling.
🚀 Installation
npm install web-scraper-pro
📦 Quick Start
const WebScraper = require("web-scraper-pro");

// `await` is not allowed at the top level of a CommonJS script,
// so the calls are wrapped in an async function.
async function main() {
  // Method 1: Default output directory (./output)
  const scraper = new WebScraper();
  const result = await scraper.scrapeAndSave("https://example.com");

  // Method 2: Custom output directory via constructor
  const scraper2 = new WebScraper({ outputDir: "./my-downloads" });
  const result2 = await scraper2.scrapeAndSave("https://example.com");

  // Method 3: Set output directory after creation
  const scraper3 = new WebScraper();
  scraper3.setOutputDir("./custom-folder");
  const result3 = await scraper3.scrapeAndSave("https://example.com");

  // Method 4: Extract content only (no files saved)
  const content = await scraper.scrapeContentOnly("https://example.com", [
    "title",
    "content",
  ]);
  console.log(content);
}

main().catch(console.error);
📁 Project Structure
web-scraper-pro/
├── src/
│   ├── scraper.js        # Main scraper implementation
│   └── scraper.d.ts      # TypeScript definitions
├── output/               # Generated output files
│   ├── scraped_*.html    # Raw HTML files
│   └── extracted_*.txt   # Extracted content files
├── test/                 # Test files
├── index.js              # Main entry point
├── package.json
└── README.md
🔧 Usage
Command Line Interface
Run with default URL (Wikipedia):
node src/scraper.js
Run with custom URL:
node src/scraper.js "https://your-target-url.com"
Setting Custom Output Directory
There are multiple ways to specify where output files should be saved:
1. Via Constructor
const WebScraper = require("web-scraper-pro");
const scraper = new WebScraper({ outputDir: "./my-custom-folder" });
2. Via setOutputDir() Method
const scraper = new WebScraper();
scraper.setOutputDir("./downloads/scraped-data");
// Method chaining is supported
scraper.setOutputDir("./downloads").getOutputDir(); // Returns current path
3. Using Absolute Paths
const scraper = new WebScraper();
// on Windows
scraper.setOutputDir("C:/Users/Desktop/scraped-content");
// or on Linux/macOS
scraper.setOutputDir("/home/user/scraped-content");
4. Check Current Output Directory
const currentPath = scraper.getOutputDir();
console.log(`Files will be saved to: ${currentPath}`);
Note: The output directory will be created automatically if it doesn't exist.
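The automatic directory creation mentioned in the note can be pictured with a minimal sketch like the one below. It is not the package's actual code: `ensureOutputDir` is a hypothetical helper, and it relies only on Node's built-in `fs`, `os` and `path` modules.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper (not part of web-scraper-pro): resolve a relative or
// absolute path and create the directory, including parents, if it is missing.
function ensureOutputDir(outputDir = "./output") {
  const resolved = path.resolve(outputDir);
  // `recursive: true` creates missing parent folders and does nothing
  // if the directory already exists.
  fs.mkdirSync(resolved, { recursive: true });
  return resolved;
}

// Example: a nested folder under the system temp directory.
const target = path.join(os.tmpdir(), "scraped-content", "run-1");
console.log(`Files will be saved to: ${ensureOutputDir(target)}`);
```

Resolving the path first means relative and absolute inputs are handled the same way, which matches the behaviour described for `setOutputDir()`.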
📝 API Reference
Constructor
Creates a new WebScraper instance with optional configuration.
const WebScraper = require("web-scraper-pro");
// Default output directory (./output)
const scraper = new WebScraper();
// Custom output directory
const customScraper = new WebScraper({ outputDir: "./my-output" });
Parameters:
- options (object, optional): Configuration options
  - outputDir (string): Custom output directory path
setOutputDir(path)
Sets a custom output directory for saved files.
scraper.setOutputDir("./custom-folder");
// Supports method chaining
scraper.setOutputDir("./downloads").getOutputDir();
Parameters:
- path (string): Absolute or relative path to output directory
Returns: WebScraper instance (for method chaining)
getOutputDir()
Gets the current output directory path.
const currentPath = scraper.getOutputDir();
console.log(currentPath); // e.g., "C:/Users/Name/project/output"
Returns: String with current output directory path
scrapeAndSave(url, returnFields)
Scrapes a webpage and saves both HTML and extracted content to files.
const result = await scraper.scrapeAndSave("https://example.com", [
  "url",
  "title",
  "content",
]);
console.log(result.data.title); // Extracted title
console.log(result.files.htmlFile); // Path to saved HTML file
console.log(result.files.txtFile); // Path to saved text file
Parameters:
- url (string): The URL to scrape
- returnFields (array, optional): Data fields to include in output
  - Default: ['url','title','siteName','length','extractedAt','content']
Returns: Object with success, data, files, and duration properties
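As an illustration of working with the returned object, the sketch below scrapes several URLs in sequence and sorts them into successes and failures. `scrapeMany` and `fakeScraper` are hypothetical names introduced here, and the sketch assumes that a failed scrape either throws or resolves with `success` set to `false`; the package's real failure behaviour may differ.

```javascript
// Sketch: scrape URLs one after another with any object that exposes
// scrapeAndSave(url, returnFields), collecting successes and failures.
// Assumption: a failed scrape either throws or resolves with success === false.
async function scrapeMany(scraper, urls, returnFields = ["url", "title"]) {
  const succeeded = [];
  const failed = [];
  for (const url of urls) {
    try {
      const result = await scraper.scrapeAndSave(url, returnFields);
      if (result && result.success) {
        succeeded.push({ url, title: result.data.title, files: result.files });
      } else {
        failed.push({ url, reason: "scrapeAndSave reported success: false" });
      }
    } catch (error) {
      failed.push({ url, reason: error.message });
    }
  }
  return { succeeded, failed };
}

// Stand-in scraper so the sketch runs without launching a browser.
const fakeScraper = {
  async scrapeAndSave(url) {
    if (url.includes("broken")) throw new Error("navigation failed");
    return {
      success: true,
      data: { url, title: "Page Title" },
      files: { htmlFile: "page.html", txtFile: "page.txt" },
      duration: 0,
    };
  },
};

scrapeMany(fakeScraper, ["https://example.com", "https://broken.example"])
  .then((summary) => console.log(summary));
```

Passing a real `WebScraper` instance in place of `fakeScraper` should work the same way, provided the result shape matches the properties listed above.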
scrapeContentOnly(url, fields)
Scrapes a webpage and returns formatted content as a string (no files created).
const { scrapeContentOnly } = require("web-scraper-pro");
const content = await scrapeContentOnly("https://example.com", [
  "title",
  "url",
  "content",
]);
console.log(content);
// Output:
// TITLE: Page Title
// URL: https://example.com
//
// CONTENT:
// Main content text...
Parameters:
- url (string): The URL to scrape
- fields (array): Data fields to include in output string
Returns: Formatted string with extracted content
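The formatted string shown in the sample output could be assembled roughly as follows. This is a sketch inferred from that sample, not the package's implementation; `formatContent` is a hypothetical function, and the label rule (camelCase converted to upper snake case) is an assumption.

```javascript
// Hypothetical formatter, inferred from the sample output above.
function formatContent(data, fields) {
  const lines = [];
  for (const field of fields) {
    if (!(field in data)) continue; // skip fields the page did not yield
    // camelCase -> UPPER_SNAKE_CASE, e.g. siteName -> SITE_NAME
    const label = field.replace(/([a-z])([A-Z])/g, "$1_$2").toUpperCase();
    if (field === "content") {
      // The main content gets its own block, separated by a blank line.
      if (lines.length > 0) lines.push("");
      lines.push(`${label}:`, data[field]);
    } else {
      lines.push(`${label}: ${data[field]}`);
    }
  }
  return lines.join("\n");
}

const sample = {
  title: "Page Title",
  url: "https://example.com",
  content: "Main content text...",
};
console.log(formatContent(sample, ["title", "url", "content"]));
// TITLE: Page Title
// URL: https://example.com
//
// CONTENT:
// Main content text...
```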
Available Data Fields
- url - The webpage URL
- title - Page title
- content - Main content (cleaned by Readability)
- siteName - Website name
- length - Content length in characters
- extractedAt - Extraction timestamp
- excerpt - Short content summary
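To make the role of the field names concrete, here is a small sketch of how a field list might be applied to extracted data. `pickFields` is a hypothetical helper, and rejecting unknown field names is an assumption made for the example; the package may instead ignore them.

```javascript
// Field names taken from the list above; the default order follows the
// documented default of scrapeAndSave().
const AVAILABLE_FIELDS = [
  "url", "title", "content", "siteName", "length", "extractedAt", "excerpt",
];
const DEFAULT_FIELDS = [
  "url", "title", "siteName", "length", "extractedAt", "content",
];

// Hypothetical helper: keep only the requested fields, in the requested order.
function pickFields(data, fields = DEFAULT_FIELDS) {
  const unknown = fields.filter((field) => !AVAILABLE_FIELDS.includes(field));
  if (unknown.length > 0) {
    // Assumption: unknown names are treated as a caller error.
    throw new TypeError(`Unknown field(s): ${unknown.join(", ")}`);
  }
  return Object.fromEntries(
    fields.filter((field) => field in data).map((field) => [field, data[field]])
  );
}

const extracted = {
  url: "https://example.com",
  title: "Page Title",
  content: "Main content...",
  siteName: "Site Name",
  length: 1000,
  extractedAt: "2025-09-08T12:34:56.789Z",
  excerpt: "Brief summary...",
};
console.log(pickFields(extracted, ["title", "excerpt"]));
```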
⚙️ Configuration
TypeScript Support
Full TypeScript definitions are included:
import {
  scrapeAndSave,
  scrapeContentOnly,
  ExtractedData,
  ScrapeResult,
} from "web-scraper-pro";
// The second argument is the list of fields to return (see API Reference)
const result: ScrapeResult = await scrapeAndSave("https://example.com", [
  "title",
  "content",
]);
const content: string = await scrapeContentOnly("https://example.com", [
  "title",
  "content",
]);
Custom Puppeteer Settings
The scraper uses optimized Puppeteer settings by default, but you can customize them by modifying the source:
// In src/scraper.js
browser = await puppeteer.launch({
  headless: false, // Show browser for debugging
  args: ["--no-sandbox"], // Additional Chrome flags
  timeout: 60000, // Custom timeout
});
Timeout Configuration
await page.goto(url, {
  waitUntil: "networkidle2",
  timeout: 60000, // Increase timeout to 60 seconds
});
📄 Output Formats
File Output
Generated files use descriptive naming with timestamps:
extracted_example_com_2025-09-08T12-34-56-789Z.txt
scraped_example_com_2025-09-08T12-34-56-789Z.html
Text file format:
URL: https://example.com
TITLE: Page Title
SITE_NAME: Site Name
LENGTH: 1000 characters
EXTRACTED_AT: 2025-09-08T12:34:56.789Z
CONTENT:
Main content extracted by Mozilla Readability...
Programmatic Output
{
  url: "https://example.com",
  title: "Page Title",
  content: "Main content...",
  siteName: "Site Name",
  length: 1000,
  extractedAt: "2025-09-08T12:34:56.789Z",
  excerpt: "Brief summary..."
}
🛠️ Troubleshooting
Puppeteer Installation Issues
Linux:
sudo apt-get install -y gconf-service libasound2 libatk1.0-0 libcairo-gobject2 libdrm2 libgtk-3-0 libnspr4 libnss3 libx11-xcb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6
macOS (M1/M2):
arch -x86_64 npm install puppeteer
Windows: Ensure Visual Studio Build Tools are installed for native dependencies.
Bot Detection
Some websites block automated scraping. Try these solutions:
// Add user agent and delays
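await page.setUserAgent(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
);
await page.setViewport({ width: 1366, height: 768 });
// page.waitForTimeout() was removed in Puppeteer v22; use a plain timer instead
await new Promise((resolve) => setTimeout(resolve, 2000)); // Add delay
Memory Issues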
For large-scale scraping, implement proper cleanup:
// Close browser instances properly
await browser.close();
// Monitor memory usage
process.on("exit", () =>
  console.log("Process memory usage:", process.memoryUsage())
);
✨ Features
- ✅ JavaScript-rendered content - Handles dynamic pages
- ✅ Clean content extraction - Removes ads, sidebars, navigation
- ✅ Automatic file naming - Timestamp-based file organization
- ✅ Comprehensive error handling - Robust retry logic
- ✅ TypeScript support - Full type definitions included
- ✅ Dual output modes - Files + data or string-only
- ✅ Component architecture - Modular, maintainable codebase
- ✅ Professional logging - Detailed console feedback
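The timestamp-based file names shown under File Output (for example `extracted_example_com_2025-09-08T12-34-56-789Z.txt`) can be reproduced with a short sketch like this one. `buildFileName` is a hypothetical helper, and the exact sanitising rules used by the package are an assumption inferred from those examples.

```javascript
// Hypothetical helper: derive a file name from a prefix, the page URL and
// a timestamp, following the pattern of the example names in this README.
function buildFileName(prefix, pageUrl, extension, date = new Date()) {
  // "example.com" -> "example_com" (assumption: every non-alphanumeric
  // character in the hostname becomes an underscore)
  const host = new URL(pageUrl).hostname.replace(/[^a-zA-Z0-9]/g, "_");
  // ISO timestamps contain ":" and ".", which are awkward in file names.
  const stamp = date.toISOString().replace(/[:.]/g, "-");
  return `${prefix}_${host}_${stamp}.${extension}`;
}

const when = new Date("2025-09-08T12:34:56.789Z");
console.log(buildFileName("extracted", "https://example.com", "txt", when));
// extracted_example_com_2025-09-08T12-34-56-789Z.txt
console.log(buildFileName("scraped", "https://example.com", "html", when));
// scraped_example_com_2025-09-08T12-34-56-789Z.html
```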
🧪 Tested Websites
- ✅ Wikipedia (Multiple languages)
- ✅ News websites (BBC, CNN, etc.)
- ✅ Gaming sites (Perfect World Games, IGN)
- ✅ Blog platforms (Medium, Dev.to)
- ✅ Documentation sites (MDN, official docs)
- ✅ E-commerce (Product pages)
Running Tests
# Test core functionality
npm test
# Test with specific examples
node test-new-functions.js
node test-gaming-news.js
📊 Performance
- Speed: ~3-5 seconds per page (includes browser startup)
- Memory: ~50-100MB per browser instance
- Reliability: Built-in retry logic for network issues
- Output Size: Extracted text is typically 80-90% smaller than the raw HTML
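The retry behaviour listed above is not shown in this README, so the following is only a generic sketch of retrying a flaky operation with exponential backoff. `withRetry`, its option names and the delay values are all assumptions for illustration and may not match what the package does internally.

```javascript
// Generic retry sketch (hypothetical, not the package's implementation):
// run an async task, retrying on failure with exponentially growing delays.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === retries) break; // no attempts left
      const delay = baseDelayMs * 2 ** attempt; // 500ms, 1000ms, 2000ms, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Demo task that fails twice before succeeding.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) throw new Error("temporary network error");
  return "ok";
};

withRetry(flaky, { retries: 3, baseDelayMs: 10 })
  .then((value) => console.log(`Succeeded after ${calls} calls: ${value}`));
```

A scrape call such as `scraper.scrapeAndSave(url)` could be passed as the task, for example `withRetry(() => scraper.scrapeAndSave(url))`.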
🤝 Contributing
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Commit changes: git commit -am 'Add feature'
- Push to branch: git push origin feature-name
- Submit a pull request
📝 License
MIT License - see LICENSE file for details.
🔗 Links
- NPM Package: https://www.npmjs.com/package/web-scraper-pro
- GitHub Repository: https://github.com/nanpapu/web-scraper-pro
- Issues & Support: https://github.com/nanpapu/web-scraper-pro/issues
Made with ❤️ by Nanpapu
