web-scraper-pro

v1.1.1

Professional web scraper with Puppeteer & Mozilla Readability. Extract clean content from any website with full TypeScript support.

A professional web scraper powered by Puppeteer and Mozilla Readability. Extract clean, readable content from any website with full TypeScript support and comprehensive error handling.

🚀 Installation

npm install web-scraper-pro

📦 Quick Start

const WebScraper = require("web-scraper-pro");

// In CommonJS, await is only valid inside an async function
async function main() {
  // Method 1: Default output directory (./output)
  const scraper = new WebScraper();
  const result = await scraper.scrapeAndSave("https://example.com");

  // Method 2: Custom output directory via constructor
  const scraper2 = new WebScraper({ outputDir: "./my-downloads" });
  const result2 = await scraper2.scrapeAndSave("https://example.com");

  // Method 3: Set output directory after creation
  const scraper3 = new WebScraper();
  scraper3.setOutputDir("./custom-folder");
  const result3 = await scraper3.scrapeAndSave("https://example.com");

  // Method 4: Extract content only (no files saved)
  const content = await scraper.scrapeContentOnly("https://example.com", [
    "title",
    "content",
  ]);
  console.log(content);
}

main().catch(console.error);

📁 Project Structure

web-scraper-pro/
├── src/
│   ├── scraper.js          # Main scraper implementation
│   └── scraper.d.ts        # TypeScript definitions
├── output/                 # Generated output files
│   ├── scraped_*.html     # Raw HTML files
│   └── extracted_*.txt    # Extracted content files
├── test/                  # Test files
├── index.js              # Main entry point
├── package.json
└── README.md

🔧 Usage

Command Line Interface

Run with default URL (Wikipedia):

node src/scraper.js

Run with custom URL:

node src/scraper.js "https://your-target-url.com"
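
The README does not show how the script chooses between the default and a custom URL. A minimal sketch of that kind of argument handling, assuming the first command-line argument is the target. `resolveTargetUrl` and `DEFAULT_URL` are hypothetical names, and the Wikipedia address is a placeholder rather than the package's actual default:

```javascript
// Hypothetical default; the package's real default URL is not shown in this README.
const DEFAULT_URL = "https://en.wikipedia.org/wiki/Web_scraping";

// Pick the target URL from CLI-style arguments, falling back to the default.
function resolveTargetUrl(args, fallback = DEFAULT_URL) {
  const candidate = args[0] || fallback;
  // The URL constructor throws a TypeError on malformed input.
  const parsed = new URL(candidate);
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return parsed.href;
}

// In a script, process.argv.slice(2) drops the node binary and script path:
//   const target = resolveTargetUrl(process.argv.slice(2));
```

Validating the argument before launching a browser keeps a typo from costing a full browser startup.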

Setting Custom Output Directory

There are multiple ways to specify where output files should be saved:

1. Via Constructor

const WebScraper = require("web-scraper-pro");
const scraper = new WebScraper({ outputDir: "./my-custom-folder" });

2. Via setOutputDir() Method

const scraper = new WebScraper();
scraper.setOutputDir("./downloads/scraped-data");
// Method chaining is supported
scraper.setOutputDir("./downloads").getOutputDir(); // Returns current path

3. Using Absolute Paths

const scraper = new WebScraper();
scraper.setOutputDir("C:/Users/Desktop/scraped-content");
// or on Linux/Mac
scraper.setOutputDir("/home/user/scraped-content");

4. Check Current Output Directory

const currentPath = scraper.getOutputDir();
console.log(`Files will be saved to: ${currentPath}`);

Note: The output directory will be created automatically if it doesn't exist.
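
How the package creates the directory is not shown here. In Node.js this is commonly done with a recursive `mkdir`. A minimal sketch, where `ensureOutputDir` is a hypothetical helper and not part of the package's API:

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: resolve the path and create it (with parents) if missing.
function ensureOutputDir(dir) {
  const resolved = path.resolve(dir);
  // recursive: true creates missing parents and does not throw if the directory exists.
  fs.mkdirSync(resolved, { recursive: true });
  return resolved;
}
```

Because the call is idempotent, it can safely run before every save.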

📝 API Reference

Constructor

Creates a new WebScraper instance with optional configuration.

const WebScraper = require("web-scraper-pro");

// Default output directory (./output)
const scraper = new WebScraper();

// Custom output directory (separate variable name, since const cannot be redeclared)
const customScraper = new WebScraper({ outputDir: "./my-output" });

Parameters:

  • options (object, optional): Configuration options
    • outputDir (string): Custom output directory path

setOutputDir(path)

Sets a custom output directory for saved files.

scraper.setOutputDir("./custom-folder");
// Supports method chaining
scraper.setOutputDir("./downloads").getOutputDir();

Parameters:

  • path (string): Absolute or relative path to output directory

Returns: WebScraper instance (for method chaining)
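
Chaining works when a setter returns the instance itself. A minimal sketch of the pattern using a hypothetical stand-in class (`OutputDirHolder` is not the package's class), assuming paths are resolved to absolute form as the `getOutputDir()` sample output in this README suggests:

```javascript
const path = require("node:path");

// Hypothetical stand-in class; it only illustrates the chaining pattern.
class OutputDirHolder {
  constructor(options = {}) {
    // Assumption: "./output" is the default, as stated in this README.
    this.outputDir = path.resolve(options.outputDir || "./output");
  }

  setOutputDir(dir) {
    this.outputDir = path.resolve(dir);
    return this; // returning the instance is what enables chaining
  }

  getOutputDir() {
    return this.outputDir;
  }
}

// Chained call: set the directory, then read it back.
const current = new OutputDirHolder().setOutputDir("./downloads").getOutputDir();
console.log(current);
```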

getOutputDir()

Gets the current output directory path.

const currentPath = scraper.getOutputDir();
console.log(currentPath); // e.g., "C:/Users/Name/project/output"

Returns: String with current output directory path

scrapeAndSave(url, returnFields)

Scrapes a webpage and saves both HTML and extracted content to files.

const result = await scraper.scrapeAndSave("https://example.com", [
  "url",
  "title",
  "content",
]);

console.log(result.data.title); // Extracted title
console.log(result.files.htmlFile); // Path to saved HTML file
console.log(result.files.txtFile); // Path to saved text file

Parameters:

  • url (string): The URL to scrape
  • returnFields (array, optional): Data fields to include in output
    • Default: ['url','title','siteName','length','extractedAt','content']

Returns: Object with success, data, files, and duration properties
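
A caller will usually check `success` before reading the other properties. A minimal sketch of consuming a result of that shape. `describeResult` is a hypothetical helper, and it assumes `success` is a boolean and that failures are reported through the result object rather than thrown, neither of which this README confirms:

```javascript
// Summarize a result object with success, data, files, and duration properties.
// Assumption: the units of `duration` are not documented, so it is printed as-is.
function describeResult(result) {
  if (!result || !result.success) {
    return "Scrape failed";
  }
  const { data = {}, files = {}, duration } = result;
  // Count only the file paths that are actually present.
  const saved = [files.htmlFile, files.txtFile].filter(Boolean);
  return `"${data.title}" saved to ${saved.length} file(s) (duration: ${duration})`;
}

// Example with a hand-written object standing in for a real scrape result.
console.log(
  describeResult({
    success: true,
    data: { title: "Page Title" },
    files: { htmlFile: "./output/page.html", txtFile: "./output/page.txt" },
    duration: 1500,
  })
);
```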

scrapeContentOnly(url, fields)

Scrapes a webpage and returns formatted content as a string (no files created).

const { scrapeContentOnly } = require("web-scraper-pro");

const content = await scrapeContentOnly("https://example.com", [
  "title",
  "url",
  "content",
]);

console.log(content);
// Output:
// TITLE: Page Title
// URL: https://example.com
//
// CONTENT:
// Main content text...

Parameters:

  • url (string): The URL to scrape
  • fields (array): Data fields to include in output string

Returns: Formatted string with extracted content

Available Data Fields

  • url - The webpage URL
  • title - Page title
  • content - Main content (cleaned by Readability)
  • siteName - Website name
  • length - Content length in characters
  • extractedAt - Extraction timestamp
  • excerpt - Short content summary
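
To make the field list concrete, here is a minimal sketch of how selected fields could be rendered into the labeled text layout shown in this README. `formatFields` and the `LABELS` map are hypothetical, the `EXCERPT` label is a guess, and this is not the package's actual formatter:

```javascript
// Labels follow the sample outputs in this README; the mapping itself is an assumption.
const LABELS = {
  url: "URL",
  title: "TITLE",
  siteName: "SITE_NAME",
  length: "LENGTH",
  extractedAt: "EXTRACTED_AT",
  excerpt: "EXCERPT", // not shown in the README samples
};

// Render the requested fields in order; content, if requested, always goes last.
function formatFields(data, fields) {
  const header = fields
    .filter((field) => field !== "content" && data[field] !== undefined)
    .map((field) => `${LABELS[field] || field.toUpperCase()}: ${data[field]}`);
  if (!fields.includes("content")) {
    return header.join("\n");
  }
  // A blank line separates the header lines from the content block.
  return [...header, "", "CONTENT:", data.content ?? ""].join("\n");
}

console.log(
  formatFields(
    { title: "Page Title", url: "https://example.com", content: "Main content text..." },
    ["title", "url", "content"]
  )
);
```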

⚙️ Configuration

TypeScript Support

Full TypeScript definitions are included:

import {
  scrapeAndSave,
  scrapeContentOnly,
  ExtractedData,
  ScrapeResult,
} from "web-scraper-pro";

// Second argument is the returnFields array documented in the API Reference
const result: ScrapeResult = await scrapeAndSave("https://example.com", [
  "title",
  "content",
]);

const content: string = await scrapeContentOnly("https://example.com", [
  "title",
  "content",
]);

Custom Puppeteer Settings

The scraper uses optimized Puppeteer settings by default, but you can customize them by modifying the source:

// In src/scraper.js
browser = await puppeteer.launch({
  headless: false, // Show browser for debugging
  args: ["--no-sandbox"], // Additional Chrome flags
  timeout: 60000, // Custom timeout
});

Timeout Configuration

await page.goto(url, {
  waitUntil: "networkidle2",
  timeout: 60000, // Increase timeout to 60 seconds
});

📄 Output Formats

File Output

Generated files use descriptive naming with timestamps:

extracted_example_com_2025-09-08T12-34-56-789Z.txt
scraped_example_com_2025-09-08T12-34-56-789Z.html
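
The sample names combine a prefix, the hostname with dots replaced by underscores, and an ISO timestamp with `:` and `.` replaced by hyphens. A minimal sketch that reproduces that pattern. `buildFileName` is a hypothetical helper inferred from the sample names, not the package's code:

```javascript
// Build a name like the samples above from a prefix, URL, timestamp, and extension.
function buildFileName(prefix, url, date, ext) {
  // Replace every non-alphanumeric hostname character (dots, hyphens) with "_".
  const host = new URL(url).hostname.replace(/[^a-z0-9]/gi, "_");
  // ":" and "." are replaced because ":" is not allowed in Windows file names.
  const stamp = date.toISOString().replace(/[:.]/g, "-");
  return `${prefix}_${host}_${stamp}.${ext}`;
}

console.log(
  buildFileName("extracted", "https://example.com", new Date("2025-09-08T12:34:56.789Z"), "txt")
);
// prints "extracted_example_com_2025-09-08T12-34-56-789Z.txt"
```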

Text file format:

URL: https://example.com
TITLE: Page Title
SITE_NAME: Site Name
LENGTH: 1000 characters
EXTRACTED_AT: 2025-09-08T12:34:56.789Z

CONTENT:
Main content extracted by Mozilla Readability...
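
Saved text files in this layout can be read back into structured data. A minimal sketch, assuming `\n` line endings, `KEY: value` header lines, and a body that follows a `CONTENT:` line. `parseExtractedText` is a hypothetical helper, not part of the package:

```javascript
// Parse text in the layout shown above into { headers, content }.
function parseExtractedText(text) {
  const marker = "\nCONTENT:\n";
  const split = text.indexOf(marker);
  const headerPart = split === -1 ? text : text.slice(0, split);
  const content = split === -1 ? "" : text.slice(split + marker.length);

  const headers = {};
  for (const line of headerPart.split("\n")) {
    // Split on the first ": " so values containing colons (URLs, timestamps) stay intact.
    const idx = line.indexOf(": ");
    if (idx > 0) {
      headers[line.slice(0, idx)] = line.slice(idx + 2);
    }
  }
  return { headers, content };
}

const parsed = parseExtractedText(
  [
    "URL: https://example.com",
    "TITLE: Page Title",
    "LENGTH: 1000 characters",
    "EXTRACTED_AT: 2025-09-08T12:34:56.789Z",
    "",
    "CONTENT:",
    "Main content...",
  ].join("\n")
);
console.log(parsed.headers.TITLE, parsed.content.length);
```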

Programmatic Output

{
  url: "https://example.com",
  title: "Page Title",
  content: "Main content...",
  siteName: "Site Name",
  length: 1000,
  extractedAt: "2025-09-08T12:34:56.789Z",
  excerpt: "Brief summary..."
}

🛠️ Troubleshooting

Puppeteer Installation Issues

Linux:

sudo apt-get install -y gconf-service libasound2 libatk1.0-0 libcairo-gobject2 libdrm2 libgtk-3-0 libnspr4 libnss3 libx11-xcb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6

macOS (M1/M2):

arch -x86_64 npm install puppeteer

Windows: Ensure Visual Studio Build Tools are installed for native dependencies.

Bot Detection

Some websites block automated scraping. Try these solutions:

// Add user agent and delays
await page.setUserAgent(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
);
await page.setViewport({ width: 1366, height: 768 });
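// Add delay (page.waitForTimeout was removed in Puppeteer v22; a plain timer works in all versions)
await new Promise((resolve) => setTimeout(resolve, 2000));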

Memory Issues

For large-scale scraping, implement proper cleanup:

// Close browser instances properly
await browser.close();

// Monitor memory usage
process.on("exit", () =>
  console.log("Process memory usage:", process.memoryUsage())
);
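
A `browser.close()` call that only runs on the success path can still leak a Chrome process when a scrape throws. A minimal sketch of a `try`/`finally` wrapper that closes the browser either way. `withBrowser` is a hypothetical helper, and the launch function is passed in so the pattern does not depend on how the package creates its browser:

```javascript
// Run a task with a browser-like resource and always close it afterwards.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // Runs whether the task resolved or threw, so no browser process is left behind.
    await browser.close();
  }
}

// With Puppeteer this could be called as:
//   const puppeteer = require("puppeteer");
//   const title = await withBrowser(() => puppeteer.launch(), async (browser) => {
//     const page = await browser.newPage();
//     await page.goto("https://example.com");
//     return page.title();
//   });
```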

✨ Features

  • JavaScript-rendered content - Handles dynamic pages
  • Clean content extraction - Removes ads, sidebars, navigation
  • Automatic file naming - Timestamp-based file organization
  • Comprehensive error handling - Robust retry logic
  • TypeScript support - Full type definitions included
  • Dual output modes - Files + data or string-only
  • Component architecture - Modular, maintainable codebase
  • Professional logging - Detailed console feedback

🧪 Tested Websites

  • Wikipedia (Multiple languages)
  • News websites (BBC, CNN, etc.)
  • Gaming sites (Perfect World Games, IGN)
  • Blog platforms (Medium, Dev.to)
  • Documentation sites (MDN, official docs)
  • E-commerce (Product pages)

Running Tests

# Test core functionality
npm test

# Test with specific examples
node test-new-functions.js
node test-gaming-news.js

📊 Performance

  • Speed: ~3-5 seconds per page (includes browser startup)
  • Memory: ~50-100MB per browser instance
  • Reliability: Built-in retry logic for network issues
  • Output Size: Extracted text is typically 80-90% smaller than the raw HTML
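
The package's retry logic is not documented in this README. For callers who want an extra layer around their own calls, here is a minimal sketch of a generic retry wrapper with exponential backoff. `withRetry` and its option names are hypothetical and do not describe the built-in behavior:

```javascript
// Retry an async operation, doubling the wait after each failed attempt.
async function withRetry(operation, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === retries) break;
      // Wait baseDelayMs, then 2x, then 4x, ... before the next attempt.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // All attempts failed: surface the most recent error to the caller.
  throw lastError;
}

// Possible use with the documented API:
//   const result = await withRetry(() => scraper.scrapeAndSave("https://example.com"));
```

With the defaults, a page that keeps failing is attempted four times in total, with waits of 500 ms, 1 s, and 2 s between attempts.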

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Commit changes: git commit -am 'Add feature'
  4. Push to branch: git push origin feature-name
  5. Submit a pull request

📝 License

MIT License - see LICENSE file for details.

🔗 Links

  • NPM Package: https://www.npmjs.com/package/web-scraper-pro
  • GitHub Repository: https://github.com/nanpapu/web-scraper-pro
  • Issues & Support: https://github.com/nanpapu/web-scraper-pro/issues

Made with ❤️ by Nanpapu