

Web Content Scraper

A powerful Node.js web scraper that extracts clean, readable content from websites while keeping everything nicely organized. Perfect for creating AI training datasets! πŸ€–

✨ Features

  • 🌐 Smart web crawling of internal links
  • πŸ”„ Automatic retry mechanism with proxy fallback
  • πŸ“ Clean content extraction using Mozilla's Readability
  • 🧹 Thorough content processing and cleaning
  • πŸ—‚οΈ Maintains original URL structure in saved files
  • 🚫 Excludes unwanted paths from scraping
  • 🚦 Configurable rate limiting and delays
  • πŸ€– AI-friendly output formats (JSONL, CSV, clean text)
  • πŸ“Š Rich metadata extraction
  • πŸ“ Combine results from multiple scrapers into a unified dataset

πŸ› οΈ Prerequisites

  • Node.js (v20 or higher)
  • npm

πŸ“¦ Dependencies

  • axios - HTTP requests master
  • jsdom - DOM parsing wizard
  • @mozilla/readability - Content extraction genius
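
Conceptually, these three libraries form the extraction pipeline: axios fetches the raw HTML, jsdom turns it into a DOM, and Readability pulls the readable article out of that DOM. A minimal standalone sketch of that idea (an illustration only, not the scraper's internal code):

const axios = require('axios');
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');

// Fetch a page, build a DOM, and extract the readable article text.
async function extractArticle(url) {
  const response = await axios.get(url);                        // 1. download raw HTML
  const dom = new JSDOM(response.data, { url });                // 2. parse into a DOM (url helps resolve relative links)
  const article = new Readability(dom.window.document).parse(); // 3. extract clean content
  return article ? { title: article.title, text: article.textContent } : null;
}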

πŸš€ Installation

npm i clean-web-scraper

# OR

git clone https://github.com/mlibre/Clean-Web-Scraper
cd Clean-Web-Scraper
# Arch Linux example: Xvfb and Chromium are needed when Puppeteer-based scraping is enabled
sudo pacman -S extra/xorg-server-xvfb chromium
npm install --ignore-scripts

πŸ’» Usage

const WebScraper = require('clean-web-scraper');

const scraper = new WebScraper({
  baseURL: 'https://example.com/news',          // Required: The website base URL to scrape
  startURL: 'https://example.com/blog',         // Optional: Custom starting URL
  excludeList: ['/admin', '/private'],          // Optional: Paths to exclude
  exactExcludeList: [
    '/specific-page',                           // Optional: Exact URLs to exclude
    /^https:\/\/host\.com\/\d{4}\/$/            // Optional: Regex patterns to exclude, e.g. https://host.com/2023/
  ],
  scrapResultPath: './example.com/website',     // Required: Where to save the content
  jsonlOutputPath: './example.com/train.jsonl', // Optional: Custom JSONL output path
  textOutputPath: "./example.com/texts",        // Optional: Custom text output path
  csvOutputPath: "./example.com/train.csv",     // Optional: Custom CSV output path
  strictBaseURL: true,                          // Optional: Only scrape URLs from same domain
  maxDepth: Infinity,                           // Optional: Maximum crawling depth
  maxArticles: Infinity,                        // Optional: Maximum articles to scrape
  crawlingDelay: 1000,                          // Optional: Delay between requests (ms)
  batchSize: 5,                                 // Optional: Number of URLs to process concurrently
  minContentLength: 400,                        // Optional: Minimum content length to consider valid

  // Network options
  axiosHeaders: {},                             // Optional: Custom HTTP headers
  axiosProxy: {                                 // Optional: HTTP/HTTPS proxy
   host: "localhost",
   port: 2080,
   protocol: "http"
  },              
  axiosMaxRetries: 5,                           // Optional: Max retry attempts
  axiosRetryDelay: 40000,                       // Optional: Delay between retries (ms)
  useProxyAsFallback: false,                    // Optional: Fallback to proxy on failure
  
  // Puppeteer options for handling dynamic content
  usePuppeteer: false,                          // Optional: Enable Puppeteer browser
});
await scraper.start();
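
Note that top-level await is only valid in ES modules; in a CommonJS file (using require as above), wrap the call in an async function. A minimal sketch showing only the two required options:

const WebScraper = require('clean-web-scraper');

// CommonJS wrapper so `await` is allowed.
async function main() {
  const scraper = new WebScraper({
    baseURL: 'https://example.com/news',
    scrapResultPath: './example.com/website'
  });
  await scraper.start();
}

main().catch(console.error);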

πŸ’» Advanced Usage: Multi-Site Scraping

const WebScraper = require('clean-web-scraper');

// Scrape documentation website
const docsScraper = new WebScraper({
  baseURL: 'https://docs.example.com',
  scrapResultPath: './datasets/docs',
  maxDepth: 3,                               // Optional: Maximum depth for recursive crawling
  includeMetadata: true,                     // Optional: Include metadata in output files
  metadataFields: ["author", "articleTitle", "pageTitle", "description", "dataScrapedDate", "url"],
   // Optional: Specify metadata fields to include
});

// Scrape blog website
const blogScraper = new WebScraper({
  baseURL: 'https://blog.example.com',
  scrapResultPath: './datasets/blog',
  maxDepth: 3,                               // Optional: Maximum depth for recursive crawling
  includeMetadata: true,                     // Optional: Include metadata in output files
  metadataFields: ["author", "articleTitle", "pageTitle", "description", "dataScrapedDate"],
   // Optional: Specify metadata fields to include
});

// Start scraping both sites
await docsScraper.start();
await blogScraper.start();

// Combine all scraped content into a single dataset
await WebScraper.combineResults('./combined', [docsScraper, blogScraper]);

Run the example script:

node example-usage.js
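
The example above scrapes the two sites one after the other. Since each instance writes to its own output paths, the scrapes could presumably also run in parallel (an assumption; only the sequential form is shown above):

// Hypothetical variation: crawl both sites concurrently, then combine,
// reusing the docsScraper and blogScraper instances defined above.
await Promise.all([docsScraper.start(), blogScraper.start()]);
await WebScraper.combineResults('./combined', [docsScraper, blogScraper]);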

πŸ“€ Output

The content is saved in a clean, structured format:

  • πŸ“ Base folder: ./folderPath/example.com/
  • πŸ“‘ Files preserve original URL paths
  • πŸ€– No HTML, no noise - just clean, structured text (.txt files)
  • πŸ“Š JSONL and CSV outputs, ready for AI consumption, model training and fine-tuning

example.com/
β”œβ”€β”€ website/
β”‚   β”œβ”€β”€ page1.txt             # Clean text content
β”‚   β”œβ”€β”€ page1.json            # Full metadata
β”‚   β”œβ”€β”€ page1.html            # Original HTML content
β”‚   └── blog/
β”‚       β”œβ”€β”€ post1.txt
β”‚       β”œβ”€β”€ post1.json
β”‚       └── post1.html
β”œβ”€β”€ texts/                    # Numbered text files
β”‚   β”œβ”€β”€ 1.txt
β”‚   └── 2.txt
β”œβ”€β”€ texts_with_metadata/      # When includeMetadata is true
β”‚   β”œβ”€β”€ 1.txt
β”‚   └── 2.txt
β”œβ”€β”€ train.jsonl               # Combined content
β”œβ”€β”€ train_with_metadata.jsonl # When includeMetadata is true
β”œβ”€β”€ train.csv                 # Clean text in CSV format
└── train_with_metadata.csv   # When includeMetadata is true

combined/
β”œβ”€β”€ texts/                    # Combined numbered text files
β”‚   β”œβ”€β”€ 1.txt
β”‚   β”œβ”€β”€ 2.txt
β”‚   └── n.txt
β”œβ”€β”€ texts_with_metadata/      # Combined metadata text files
β”‚   β”œβ”€β”€ 1.txt
β”‚   β”œβ”€β”€ 2.txt
β”‚   └── n.txt
β”œβ”€β”€ combined.jsonl            # Combined JSONL content
β”œβ”€β”€ combined_with_metadata.jsonl
β”œβ”€β”€ combined.csv              # Combined CSV content
└── combined_with_metadata.csv

πŸ“„ Output File Formats

πŸ“ Text Files (*.txt)

The actual article content starts here. This is the clean, processed text of the article that was extracted from the webpage.

πŸ“‘ Text Files with Metadata (texts_with_metadata/*.txt)

articleTitle: Palestine history
description: This is a great article about Palestine history
author: Rawan
language: en
dateScraped: 2024-01-20T10:30:00Z
url: https://palianswers.com

---

The actual article content starts here. This is the clean, processed text of the article that was extracted from the webpage.
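
The metadata header and the article body are separated by a --- line, so a file can be split back into its two parts with a few lines of code. A small sketch, assuming exactly the layout shown above:

const fs = require('fs');

// Split a texts_with_metadata/*.txt file into metadata fields and body text.
function parseMetadataFile(path) {
  const raw = fs.readFileSync(path, 'utf8');
  const [header, ...body] = raw.split('\n---\n');
  const metadata = {};
  for (const line of header.trim().split('\n')) {
    const idx = line.indexOf(':');
    if (idx > -1) metadata[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { metadata, text: body.join('\n---\n').trim() };
}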

πŸ“Š JSONL Files (train.jsonl)

{"text": "Clean article content here"}
{"text": "Another article content here"}

πŸ“ˆ JSONL with Metadata (train_with_metadata.jsonl)

{"text": "Article content", "metadata": {"articleTitle": "Page Title", "author": "John Doe"}}
{"text": "Another article", "metadata": {"articleTitle": "Second Page", "author": "Jane Smith"}}

πŸ—ƒοΈ JSON Files In Website Directory (*.json)

{
  "url": "https://example.com/page",
  "pageTitle": "Page Title",
  "description": "Page description",
  "language": "en",
  "canonicalUrl": "https://example.com/canonical",
  "ogTitle": "Open Graph Title",
  "ogDescription": "Open Graph Description",
  "ogImage": "https://example.com/image.jpg",
  "ogType": "article",
  "dataScrapedDate": "2024-01-20T10:30:00Z",
  "originalHtml": "<html>...</html>",
  "articleTitle": "Article Title",
}

πŸ“‹ CSV Files (train.csv)

text
"Clean article content here"
"Another article content here"

πŸ“Š CSV with Metadata (train_with_metadata.csv)

text,articleTitle,author,description
"Article content","Page Title","John Doe","Page description"
"Another article","Second Page","Jane Smith","Another description"

Standing with Palestine πŸ‡΅πŸ‡Έ

This project supports Palestinian rights and stands in solidarity with Palestine. We believe in the importance of documenting and preserving Palestinian narratives, history, and struggles for justice and liberation.

Free Palestine πŸ‡΅πŸ‡Έ