
@the-node-forge/simple-web-scraper

v1.1.2

Extracts structured data from web pages for automation, research, or aggregation.

Simple Web Scraper

License: MIT · Made with TypeScript

A lightweight and efficient web scraping package for JavaScript/TypeScript applications. This package helps developers fetch HTML content, parse web pages, and extract data effortlessly.


✨ Features

  • Fetch Web Content – Retrieve HTML from any URL with ease.
  • Parse and Extract Data – Utilize integrated parsing tools to extract information.
  • Configurable Options – Customize scraping behaviors using CSS selectors.
  • Headless Browser Support – Optionally use Puppeteer for JavaScript-rendered pages.
  • Lightweight & Fast – Uses Cheerio for static HTML scraping.
  • TypeScript Support – Fully typed for robust development.
  • Data Export Support – Export scraped data in JSON or CSV formats.
  • CSV Import Support – Read CSV files and convert them to JSON.

📚 Installation

Install via npm:

npm install @the-node-forge/simple-web-scraper

or using Yarn:

yarn add @the-node-forge/simple-web-scraper

🚀 Why Use Cheerio and Puppeteer?

This package leverages Cheerio and Puppeteer for powerful web scraping capabilities:

🔹 Cheerio (Fast and Lightweight)

  • Ideal for static HTML parsing (like jQuery for the backend).
  • Extremely fast and lightweight – perfect for pages without JavaScript rendering.
  • Provides easy CSS selector querying for extracting structured data.

🔹 Puppeteer (Headless Browser Automation)

  • Handles JavaScript-rendered pages – essential for scraping dynamic content.
  • Can interact with pages, click buttons, and fill out forms.
  • Allows screenshot capturing, PDF generation, and full-page automation.

Best of Both Worlds

  • Use Cheerio for speed when scraping static pages.
  • Switch to Puppeteer for JavaScript-heavy sites requiring full rendering.
  • Provides flexibility to choose the best approach for your project (see the sketch below).
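
To make that concrete, here is a minimal sketch that keeps one scraper instance per approach and picks between them per page:

import { WebScraper } from '@the-node-forge/simple-web-scraper';

// Cheerio path: fast static-HTML parsing, no browser involved
const staticScraper = new WebScraper({
  usePuppeteer: false,
  rules: { title: 'head > title' },
});

// Puppeteer path: full rendering for JavaScript-heavy pages
const dynamicScraper = new WebScraper({
  usePuppeteer: true,
  rules: { title: 'head > title' },
});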

API Reference

WebScraper Class

new WebScraper(options?: ScraperOptions)

📊 Options (ScraperOptions)

| Parameter | Type | Description |
| -------------- | ------------------------ | --------------------------------------------------------------- |
| usePuppeteer | boolean (optional) | Whether to use Puppeteer (default: true) |
| throttle | number (optional) | Delay in milliseconds between requests (default: 1000) |
| rules | Record<string, string> | CSS selectors defining data extraction rules |
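
As a quick illustration, a minimal options object exercising each documented option might look like this (the selector values are placeholders):

import { WebScraper } from '@the-node-forge/simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: true, // use a headless browser (default: true)
  throttle: 2000, // wait 2 seconds between requests (default: 1000)
  rules: {
    // each key becomes a field in the result, populated via its CSS selector
    title: 'head > title',
    description: 'meta[name="description"]',
  },
});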


Methods

scrape(url: string): Promise<Record<string, any>>

  • Scrapes the given URL based on the configured options.

exportToJSON(data: any, filePath: string): void

  • Exports the given data to a JSON file.

exportToCSV(data: any | any[], filePath: string, options?: { preserveNulls?: boolean }): void

  • Exports the given data to a CSV file. Pass { preserveNulls: true } to keep null and undefined values as null in the output.
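
Putting the pieces together, here is a sketch of scraping several pages in sequence and saving the results. It assumes the throttle delay is applied between successive scrape() calls, which the option's description suggests but this README does not state outright:

import { WebScraper, exportToJSON } from '@the-node-forge/simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: false,
  throttle: 1500, // pause between requests to stay polite
  rules: { title: 'head > title' },
});

(async () => {
  const results: Record<string, any>[] = [];
  for (const url of ['https://example.com/a', 'https://example.com/b']) {
    results.push(await scraper.scrape(url)); // sequential, so throttling can apply
  }
  exportToJSON(results, 'pages.json');
})();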

🛠️ Basic Usage

1. Scraping Web Pages

You can scrape web pages using either Puppeteer (for JavaScript-heavy pages) or Cheerio (for static HTML pages).

import { WebScraper } from '@the-node-forge/simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: false, // Set to true for dynamic pages
  rules: {
    title: 'h1',
    description: 'meta[name="description"]',
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com');
  console.log(data);
})();

2. Using Puppeteer for JavaScript-heavy Pages

To scrape pages that require JavaScript execution:

import { WebScraper } from '@the-node-forge/simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: true, // Enable Puppeteer for JavaScript-rendered content
  rules: {
    heading: 'h1',
    price: '.product-price',
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com/product');
  console.log(data);
})();

3. Exporting Data

  • Scraped data can be exported to JSON or CSV files using utility functions.

Export to JSON

import { exportToJSON } from '@the-node-forge/simple-web-scraper';

const data = { name: 'Example', value: 42 };
exportToJSON(data, 'output.json');

Export to CSV

import { exportToCSV } from '@the-node-forge/simple-web-scraper';

const data = [
  { name: 'Example 1', value: 42 },
  { name: 'Example 2', value: 99 },
];
exportToCSV(data, 'output.csv');

// Preserve null and undefined values as null
exportToCSV(data, 'output.csv', { preserveNulls: true });
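
The feature list also mentions CSV import (reading a CSV file back into JSON). This README never shows that API, so the function name below, importFromCSV, is a hypothetical placeholder; check the package's live documentation for the real export:

// NOTE: 'importFromCSV' is a hypothetical name; the actual export may differ.
import { importFromCSV } from '@the-node-forge/simple-web-scraper';

const rows = importFromCSV('output.csv'); // e.g. [{ name: 'Example 1', value: '42' }, ...]
console.log(rows);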

🖥 Backend Example - Module (import)

This example demonstrates how to use simple-web-scraper in a Node.js backend:

import express from 'express';
import { WebScraper, exportToJSON, exportToCSV } from '@the-node-forge/simple-web-scraper';

const app = express();
const scraper = new WebScraper({
  usePuppeteer: true,
  rules: { title: 'h1', content: 'p' },
});

app.get('/scrape-example', async (req, res) => {
  try {
    const url = 'https://github.com/The-Node-Forge';
    const data = await scraper.scrape(url);

    exportToJSON(data, 'output.json'); // export JSON
    exportToCSV(data, 'output.csv', { preserveNulls: true }); // export CSV

    res.status(200).json({ success: true, data });
  } catch (error) {
    res.status(500).json({ success: false, error: error.message });
  }
});

app.listen(3000); // start the server (port chosen for illustration)

🖥 Backend Example - CommonJS (require)

This example demonstrates the same backend setup using CommonJS require syntax:

const express = require('express');
const {
  WebScraper,
  exportToJSON,
  exportToCSV,
} = require('@the-node-forge/simple-web-scraper/dist');

const app = express();
const scraper = new WebScraper({
  usePuppeteer: true,
  rules: {
    fullHTML: 'html', // Entire page HTML
    title: 'head > title', // Page title
    description: 'meta[name="description"]', // Meta description
    keywords: 'meta[name="keywords"]', // Meta keywords
    favicon: 'link[rel="icon"]', // Favicon URL
    mainHeading: 'h1', // First H1 heading
    allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
    firstParagraph: 'p', // First paragraph
    allParagraphs: 'p', // All paragraphs on the page
    links: 'a', // All links on the page
    images: 'img', // All image URLs
    imageAlts: 'img', // Alternative text for images
    videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
    tables: 'table', // Capture table elements
    tableData: 'td', // Capture table cells
    lists: 'ul, ol', // Capture all lists
    listItems: 'li', // Capture all list items
    scripts: 'script', // JavaScript file sources
    stylesheets: 'link[rel="stylesheet"]', // External CSS files
    structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
    socialLinks:
      'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
    author: 'meta[name="author"]', // Author meta tag
    publishDate: 'meta[property="article:published_time"], time', // Publish date
    modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
    canonicalURL: 'link[rel="canonical"]', // Canonical URL
    openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
    openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
    openGraphImage: 'meta[property="og:image"]', // OpenGraph image
    twitterCard: 'meta[name="twitter:card"]', // Twitter card type
    twitterTitle: 'meta[name="twitter:title"]', // Twitter title
    twitterDescription: 'meta[name="twitter:description"]', // Twitter description
    twitterImage: 'meta[name="twitter:image"]', // Twitter image
  },
});

app.get('/test-scraper', async (req, res) => {
  try {
    const url = 'https://github.com/The-Node-Forge';
    const data = await scraper.scrape(url);

    exportToJSON(data, 'output.json'); // export JSON
    exportToCSV(data, 'output.csv'); // export CSV

    res.status(200).json({ success: true, data });
  } catch (error) {
    res.status(500).json({ success: false, error: error.message });
  }
});

app.listen(3000); // start the server (port chosen for illustration)

🛠️ Full Usage Example

import { WebScraper } from '@the-node-forge/simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: true, // Set to false if scraping static pages
  rules: {
    fullHTML: 'html', // Entire page HTML
    title: 'head > title', // Page title
    description: 'meta[name="description"]', // Meta description
    keywords: 'meta[name="keywords"]', // Meta keywords
    favicon: 'link[rel="icon"]', // Favicon URL
    mainHeading: 'h1', // First H1 heading
    allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
    firstParagraph: 'p', // First paragraph
    allParagraphs: 'p', // All paragraphs on the page
    links: 'a', // All links on the page
    images: 'img', // All image URLs
    imageAlts: 'img', // Alternative text for images
    videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
    tables: 'table', // Capture table elements
    tableData: 'td', // Capture table cells
    lists: 'ul, ol', // Capture all lists
    listItems: 'li', // Capture all list items
    scripts: 'script', // JavaScript file sources
    stylesheets: 'link[rel="stylesheet"]', // External CSS files
    structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
    socialLinks:
      'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
    author: 'meta[name="author"]', // Author meta tag
    publishDate: 'meta[property="article:published_time"], time', // Publish date
    modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
    canonicalURL: 'link[rel="canonical"]', // Canonical URL
    openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
    openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
    openGraphImage: 'meta[property="og:image"]', // OpenGraph image
    twitterCard: 'meta[name="twitter:card"]', // Twitter card type
    twitterTitle: 'meta[name="twitter:title"]', // Twitter title
    twitterDescription: 'meta[name="twitter:description"]', // Twitter description
    twitterImage: 'meta[name="twitter:image"]', // Twitter image
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com');
  console.log(data);
})();

📊 Rule Set Table

| Rule | CSS Selector | Target Data |
| -------------------- | ------------------------------------------------------------------------------------------------ | ------------------------------------------------ |
| fullHTML | html | The entire HTML of the page |
| title | head > title | The <title> of the page |
| description | meta[name="description"] | Meta description for SEO |
| keywords | meta[name="keywords"] | Meta keywords |
| favicon | link[rel="icon"] | Website icon |
| mainHeading | h1 | The first <h1> heading |
| allHeadings | h1, h2, h3, h4, h5, h6 | All headings (h1-h6) |
| firstParagraph | p | The first paragraph (<p>) |
| allParagraphs | p | All paragraphs on the page |
| links | a | All anchor <a> links |
| images | img | All image <img> sources |
| imageAlts | img | All image alt texts |
| videos | video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"] | Video sources (<video>, YouTube, Vimeo) |
| tables | table | All <table> elements |
| tableData | td | Individual <td> cells |
| lists | ul, ol | All ordered <ol> and unordered <ul> lists |
| listItems | li | All list <li> items |
| scripts | script | JavaScript files included (<script src="...">) |
| stylesheets | link[rel="stylesheet"] | Stylesheets (<link rel="stylesheet">) |
| structuredData | script[type="application/ld+json"] | JSON-LD structured data for SEO |
| socialLinks | a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"] | Facebook, Twitter, LinkedIn, Instagram links |
| author | meta[name="author"] | Page author |
| publishDate | meta[property="article:published_time"], time | Date article was published |
| modifiedDate | meta[property="article:modified_time"] | Last modified date |
| canonicalURL | link[rel="canonical"] | Canonical URL (avoids duplicate content) |
| openGraphTitle | meta[property="og:title"] | OpenGraph title for social sharing |
| openGraphDescription | meta[property="og:description"] | OpenGraph description |
| openGraphImage | meta[property="og:image"] | OpenGraph image URL |
| twitterCard | meta[name="twitter:card"] | Twitter card type (summary, summary_large_image) |
| twitterTitle | meta[name="twitter:title"] | Twitter title metadata |
| twitterDescription | meta[name="twitter:description"] | Twitter description metadata |
| twitterImage | meta[name="twitter:image"] | Twitter image metadata |
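
You don't have to use the full rule set. Since rules is just a Record<string, string>, a focused subset works too; for example, a minimal SEO-oriented configuration (selectors taken from the table above):

import { WebScraper } from '@the-node-forge/simple-web-scraper';

const seoScraper = new WebScraper({
  usePuppeteer: false, // static metadata usually doesn't need rendering
  rules: {
    title: 'head > title',
    description: 'meta[name="description"]',
    canonicalURL: 'link[rel="canonical"]',
    openGraphTitle: 'meta[property="og:title"]',
  },
});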


💡 Contributing

Contributions are welcome! Please submit issues or pull requests.