web-scraper-pro

v1.1.1

Published 9 months ago

Professional web scraper with Puppeteer & Mozilla Readability. Extract clean content from any website with full TypeScript support.

Vulnerabilities: 0 High, 0 Medium, 0 Low

Maintainer: helmethater

Keywords: web-scraper, puppeteer, readability, content-extraction, web-scraping, html-parser, mozilla-readability, headless-browser, automation, content-parser, javascript, nodejs, typescript

web-scraper-pro

A professional web scraper powered by Puppeteer and Mozilla Readability. Extract clean, readable content from any website with full TypeScript support and comprehensive error handling.

🚀 Installation

npm install web-scraper-pro

📦 Quick Start

const WebScraper = require("web-scraper-pro");

// CommonJS has no top-level await, so the examples run inside an async function
async function main() {
  // Method 1: Default output directory (./output)
  const scraper = new WebScraper();
  const result = await scraper.scrapeAndSave("https://example.com");

  // Method 2: Custom output directory via constructor
  const scraper2 = new WebScraper({ outputDir: "./my-downloads" });
  const result2 = await scraper2.scrapeAndSave("https://example.com");

  // Method 3: Set output directory after creation
  const scraper3 = new WebScraper();
  scraper3.setOutputDir("./custom-folder");
  const result3 = await scraper3.scrapeAndSave("https://example.com");

  // Method 4: Extract content only (no files saved)
  const content = await scraper.scrapeContentOnly("https://example.com", [
    "title",
    "content",
  ]);
  console.log(content);
}

main().catch(console.error);

📁 Project Structure

web-scraper-pro/
├── src/
│   ├── scraper.js          # Main scraper implementation
│   └── scraper.d.ts        # TypeScript definitions
├── output/                 # Generated output files
│   ├── scraped_*.html     # Raw HTML files
│   └── extracted_*.txt    # Extracted content files
├── test/                  # Test files
├── index.js              # Main entry point
├── package.json
└── README.md

🔧 Usage

Command Line Interface

Run with default URL (Wikipedia):

node src/scraper.js

Run with custom URL:

node src/scraper.js "https://your-target-url.com"

Setting Custom Output Directory

There are multiple ways to specify where output files should be saved:

1. Via Constructor

const WebScraper = require("web-scraper-pro");
const scraper = new WebScraper({ outputDir: "./my-custom-folder" });

2. Via setOutputDir() Method

const scraper = new WebScraper();
scraper.setOutputDir("./downloads/scraped-data");
// Method chaining is supported
scraper.setOutputDir("./downloads").getOutputDir(); // Returns current path

3. Using Absolute Paths

const scraper = new WebScraper();
// On Windows
scraper.setOutputDir("C:/Users/Desktop/scraped-content");
// On Linux/Mac
scraper.setOutputDir("/home/user/scraped-content");

4. Check Current Output Directory

const currentPath = scraper.getOutputDir();
console.log(`Files will be saved to: ${currentPath}`);

Note: The output directory will be created automatically if it doesn't exist.
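
As an illustration of the note above, automatic directory creation can be done with Node's built-in fs module. This is a minimal sketch, not the package's actual implementation; the helper name ensureOutputDir is hypothetical.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper (not part of web-scraper-pro): resolve the directory
// to an absolute path and create it, including missing parents.
function ensureOutputDir(dir) {
  const resolved = path.resolve(dir);
  // With recursive: true, mkdirSync does not throw if the directory exists.
  fs.mkdirSync(resolved, { recursive: true });
  return resolved;
}
```

Calling ensureOutputDir("./downloads/scraped-data") twice is safe, because the recursive option makes the second call a no-op.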

📝 API Reference

Constructor

Creates a new WebScraper instance with optional configuration.

const WebScraper = require("web-scraper-pro");

// Default output directory (./output)
const scraper = new WebScraper();

// Custom output directory (separate variable name, since scraper is already declared above)
const customScraper = new WebScraper({ outputDir: "./my-output" });

Parameters:

options (object, optional): Configuration options
- outputDir (string): Custom output directory path

setOutputDir(path)

Sets a custom output directory for saved files.

scraper.setOutputDir("./custom-folder");
// Supports method chaining
scraper.setOutputDir("./downloads").getOutputDir();

Parameters:

path (string): Absolute or relative path to output directory

Returns: WebScraper instance (for method chaining)

getOutputDir()

Gets the current output directory path.

const currentPath = scraper.getOutputDir();
console.log(currentPath); // e.g., "C:/Users/Name/project/output"

Returns: String with current output directory path
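
To make the chaining behaviour of setOutputDir() and getOutputDir() concrete, here is a minimal sketch of the pattern. The class OutputDirConfig is hypothetical and only mirrors the documented behaviour; resolving to an absolute path is an assumption based on the example output shown above.

```javascript
const path = require("node:path");

// Hypothetical stand-in for the output-directory part of WebScraper.
class OutputDirConfig {
  constructor({ outputDir = "./output" } = {}) {
    // Assumption: paths are stored as absolute paths.
    this.outputDir = path.resolve(outputDir);
  }

  setOutputDir(dir) {
    if (typeof dir !== "string" || dir.length === 0) {
      throw new TypeError("Output directory must be a non-empty string");
    }
    this.outputDir = path.resolve(dir);
    return this; // returning the instance is what enables chaining
  }

  getOutputDir() {
    return this.outputDir;
  }
}
```

With this shape, new OutputDirConfig().setOutputDir("./downloads").getOutputDir() returns the absolute path of ./downloads.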

scrapeAndSave(url, returnFields)

Scrapes a webpage and saves both HTML and extracted content to files.

const result = await scraper.scrapeAndSave("https://example.com", [
  "url",
  "title",
  "content",
]);

console.log(result.data.title); // Extracted title
console.log(result.files.htmlFile); // Path to saved HTML file
console.log(result.files.txtFile); // Path to saved text file

Parameters:

url (string): The URL to scrape
returnFields (array, optional): Data fields to include in output
- Default: ['url','title','siteName','length','extractedAt','content']

Returns: Object with success, data, files, and duration properties
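
A caller will usually want to check success before reading data or files. The sketch below assumes only the four documented properties; the helper name summarizeResult is hypothetical, and the unit of duration is not documented here, so it is printed as-is.

```javascript
// Hypothetical helper: turn a scrapeAndSave()-style result into a one-line summary.
function summarizeResult(result) {
  if (!result || !result.success) {
    return "Scrape failed";
  }
  const title = result.data?.title ?? "(no title)";
  const { htmlFile = "(none)", txtFile = "(none)" } = result.files ?? {};
  // The unit of duration is not documented, so no unit is appended.
  return `"${title}" saved to ${htmlFile} and ${txtFile} (duration: ${result.duration})`;
}
```

Typical use would be console.log(summarizeResult(await scraper.scrapeAndSave(url))) inside an async function.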

scrapeContentOnly(url, fields)

Scrapes a webpage and returns formatted content as a string (no files created).

const { scrapeContentOnly } = require("web-scraper-pro");

const content = await scrapeContentOnly("https://example.com", [
  "title",
  "url",
  "content",
]);

console.log(content);
// Output:
// TITLE: Page Title
// URL: https://example.com
//
// CONTENT:
// Main content text...

Parameters:

url (string): The URL to scrape
fields (array): Data fields to include in output string

Returns: Formatted string with extracted content

Available Data Fields

url - The webpage URL
title - Page title
content - Main content (cleaned by Readability)
siteName - Website name
length - Content length in characters
extractedAt - Extraction timestamp
excerpt - Short content summary
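
The formatted string returned by scrapeContentOnly() can be pictured as a small field-to-label mapping. The following is a sketch under assumptions, not the package's code: formatFields and LABELS are hypothetical names, and the EXCERPT label is a guess because the README never shows it in output.

```javascript
// Labels as they appear in the README's sample output.
// EXCERPT is an assumption; it is not shown anywhere in the README.
const LABELS = {
  url: "URL",
  title: "TITLE",
  siteName: "SITE_NAME",
  length: "LENGTH",
  extractedAt: "EXTRACTED_AT",
  excerpt: "EXCERPT",
};

// Hypothetical formatter: header lines in the requested order,
// then a blank line and the CONTENT section if it was requested.
function formatFields(data, fields) {
  const header = fields
    .filter((f) => f !== "content" && f in LABELS && data[f] !== undefined)
    .map((f) => `${LABELS[f]}: ${data[f]}${f === "length" ? " characters" : ""}`);

  const parts = [header.join("\n")];
  if (fields.includes("content") && data.content !== undefined) {
    parts.push(`CONTENT:\n${data.content}`);
  }
  // Drop an empty header so content-only output has no leading blank lines.
  return parts.filter(Boolean).join("\n\n");
}
```

Unknown field names are ignored rather than rejected, which keeps the sketch tolerant of typos; a stricter version could throw instead.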

⚙️ Configuration

TypeScript Support

Full TypeScript definitions are included:

import {
  scrapeAndSave,
  scrapeContentOnly,
  ExtractedData,
  ScrapeResult,
} from "web-scraper-pro";

// The second argument is the returnFields array described in the API Reference
const result: ScrapeResult = await scrapeAndSave("https://example.com", [
  "url",
  "title",
  "content",
]);

const content: string = await scrapeContentOnly("https://example.com", [
  "title",
  "content",
]);

Custom Puppeteer Settings

The scraper uses optimized Puppeteer settings by default, but you can customize them by modifying the source:

// In src/scraper.js
browser = await puppeteer.launch({
  headless: false, // Show browser for debugging
  args: ["--no-sandbox"], // Additional Chrome flags
  timeout: 60000, // Custom timeout
});

Timeout Configuration

await page.goto(url, {
  waitUntil: "networkidle2",
  timeout: 60000, // Increase timeout to 60 seconds
});

📄 Output Formats

File Output

Generated files use descriptive naming with timestamps:

extracted_example_com_2025-09-08T12-34-56-789Z.txt
scraped_example_com_2025-09-08T12-34-56-789Z.html
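
File names in this pattern can be derived from the URL's hostname and an ISO timestamp. This is a minimal sketch that reproduces the example names above; buildFileName is a hypothetical helper, and the package may build its names differently.

```javascript
// Hypothetical helper: build "<prefix>_<host>_<timestamp>.<ext>".
function buildFileName(prefix, url, ext, date = new Date()) {
  // "example.com" -> "example_com"
  const host = new URL(url).hostname.replace(/[^a-zA-Z0-9]/g, "_");
  // ":" and "." are replaced because ":" is not allowed in Windows file names.
  const stamp = date.toISOString().replace(/[:.]/g, "-");
  return `${prefix}_${host}_${stamp}.${ext}`;
}
```

For example, buildFileName("extracted", "https://example.com", "txt", new Date("2025-09-08T12:34:56.789Z")) yields extracted_example_com_2025-09-08T12-34-56-789Z.txt.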

Text file format:

URL: https://example.com
TITLE: Page Title
SITE_NAME: Site Name
LENGTH: 1000 characters
EXTRACTED_AT: 2025-09-08T12:34:56.789Z

CONTENT:
Main content extracted by Mozilla Readability...
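
Because the text file uses simple "KEY: value" header lines followed by a CONTENT section, it is easy to read back. The parser below is a sketch that assumes exactly the layout shown above; parseExtractedText is a hypothetical name.

```javascript
// Hypothetical parser for the text file layout shown above.
function parseExtractedText(text) {
  // Find the "CONTENT:" marker at the start of a line.
  const match = /^CONTENT:\r?\n/m.exec(text);
  const headerText = match ? text.slice(0, match.index) : text;

  const result = {};
  for (const line of headerText.split(/\r?\n/)) {
    const sep = line.indexOf(": ");
    if (sep === -1) continue; // skip blank or malformed lines
    result[line.slice(0, sep)] = line.slice(sep + 2);
  }
  if (match) {
    result.CONTENT = text.slice(match.index + match[0].length);
  }
  return result;
}
```

Values stay as strings, so LENGTH comes back as "1000 characters" rather than a number.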

Programmatic Output

{
  url: "https://example.com",
  title: "Page Title",
  content: "Main content...",
  siteName: "Site Name",
  length: 1000,
  extractedAt: "2025-09-08T12:34:56.789Z",
  excerpt: "Brief summary..."
}

🛠️ Troubleshooting

Puppeteer Installation Issues

Linux:

sudo apt-get install -y gconf-service libasound2 libatk1.0-0 libcairo-gobject2 libdrm2 libgtk-3-0 libnspr4 libnss3 libx11-xcb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6

macOS (M1/M2):

arch -x86_64 npm install puppeteer

Windows: Ensure Visual Studio Build Tools are installed for native dependencies.

Bot Detection

Some websites block automated scraping. Try these solutions:

// Add user agent and delays
await page.setUserAgent(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
);
await page.setViewport({ width: 1366, height: 768 });
// Add delay (page.waitForTimeout() was removed in Puppeteer v22)
await new Promise((resolve) => setTimeout(resolve, 2000));
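
When scraping several pages, spacing requests out with a randomized pause is another way to look less like a bot. The sketch below is independent of Puppeteer: scrapeSequentially and its options are hypothetical names, and scrapeOne stands for any async function such as (url) => scraper.scrapeAndSave(url).

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: process URLs one at a time with a random pause in between.
async function scrapeSequentially(
  urls,
  scrapeOne,
  { minDelayMs = 1000, maxDelayMs = 3000 } = {}
) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    results.push(await scrapeOne(urls[i]));
    // No pause after the last URL.
    if (i < urls.length - 1) {
      const pause = minDelayMs + Math.random() * (maxDelayMs - minDelayMs);
      await sleep(pause);
    }
  }
  return results;
}
```

Running requests one after another is slower than running them in parallel, but it keeps the request rate predictable and low.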

Memory Issues

For large-scale scraping, implement proper cleanup:

// Close browser instances properly
await browser.close();

// Monitor memory usage
process.on("exit", () =>
  console.log("Process memory usage:", process.memoryUsage())
);
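
A try/finally wrapper guarantees that the browser is closed even when a scrape throws, which is the usual cause of leaked browser processes. This is a sketch, not the package's code: withBrowser is a hypothetical helper, and launch is any async function that returns an object with a close() method, for example () => puppeteer.launch().

```javascript
// Hypothetical helper: run a task with a browser and always close it afterwards.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // Runs on success and on error, so the browser process is never leaked.
    await browser.close();
  }
}
```

Passing the launcher in as a function keeps the helper testable with a fake browser object.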

✨ Features

✅ JavaScript-rendered content - Handles dynamic pages
✅ Clean content extraction - Removes ads, sidebars, navigation
✅ Automatic file naming - Timestamp-based file organization
✅ Comprehensive error handling - Robust retry logic
✅ TypeScript support - Full type definitions included
✅ Dual output modes - Files + data or string-only
✅ Component architecture - Modular, maintainable codebase
✅ Professional logging - Detailed console feedback

🧪 Tested Websites

✅ Wikipedia (Multiple languages)
✅ News websites (BBC, CNN, etc.)
✅ Gaming sites (Perfect World Games, IGN)
✅ Blog platforms (Medium, Dev.to)
✅ Documentation sites (MDN, official docs)
✅ E-commerce (Product pages)

Running Tests

# Test core functionality
npm test

# Test with specific examples
node test-new-functions.js
node test-gaming-news.js

📊 Performance

Speed: ~3-5 seconds per page (includes browser startup)
Memory: ~50-100MB per browser instance
Reliability: Built-in retry logic for network issues
Output Size: Extracted text is typically 80-90% smaller than the raw HTML
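
The retry behaviour mentioned above is not documented in detail. As an illustration of the general technique only, here is a retry loop with exponential backoff; the function name retry and its options are hypothetical and do not describe the package's internal logic.

```javascript
// Hypothetical helper: retry an async function with exponential backoff.
async function retry(fn, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        // Wait baseDelayMs, then 2x, then 4x, and so on.
        const delay = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}
```

A call such as retry(() => scraper.scrapeAndSave(url), { attempts: 3 }) would wrap a scrape in this loop.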

🤝 Contributing

1. Fork the repository
2. Create a feature branch: git checkout -b feature-name
3. Commit changes: git commit -am 'Add feature'
4. Push to branch: git push origin feature-name
5. Submit a pull request

📝 License

MIT License - see LICENSE file for details.

🔗 Links

NPM Package: https://www.npmjs.com/package/web-scraper-pro
GitHub Repository: https://github.com/nanpapu/web-scraper-pro
Issues & Support: https://github.com/nanpapu/web-scraper-pro/issues

Made with ❤️ by Nanpapu
