
ph-typescript-lib-scraping

v1.0.9

Introduction

Shared library for "ph-portals" and "ph-configurable-web-scraper" projects.
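
Since the package is published on npm, it can presumably also be installed directly as a dependency, rather than built from source as described in "Setup" below:

    npm install ph-typescript-lib-scraping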

Installations

The commands in the "Setup" section use npm, so Node.js (which ships with npm) must be installed.

Setup

  1. Prerequisites:
  • must have installed the tools from the "Installations" section
  • must have checked out / cloned the source code from this repository
  2. Open up the cloned repository/project/folder and run the following commands in the given order, one by one:

    npm install
    npm run build
  3. Both of the above commands should complete successfully and should not cause any changes to "package.json" or "package-lock.json".

Explanation

type ScraperConfig = {
  // Which HTTP client to use for requests/responses
  use: 'axios' | 'fetch';

  // Name of the scraper
  name: string;

  // Base URL of the website you want to scrape
  base: string;

  // Favicon URL of the website
  favicon: string;

  // Base/root links that other content links will be extracted from
  roots: string[];

  // Configuration object used to extract content links from the root links
  // "fetching" - method (GET, POST, PUT), headers and body for the HTTP request that retrieves the root links' contents
  // "type" - either "text", "html" or "json"
  // "selector" - either an HTML/JSON path selector string (for "html" or "json") or a regular expression (for "text")
  links: LinkExtractor;

  // Array of HTML elements to remove from the content itself, before any actual scraping starts
  remove: string[];

  // Array of configuration objects used for the actual content extraction from the previously extracted content links
  // "property" - name of the property that the extracted value will be assigned to in the final result object
  // "selector" - HTML path selector string
  // "transfomers" - array of "ContentTransformer" configuration objects that can transform the extracted data by applying various functions to it
  // -> available transformers: trim, uppercase, lowercase, substring, slice, replace, padEnd, padStart, split
  // "remove" - additional array of HTML elements to remove from the content itself
  // "type" - whether Cheerio should take the "text" or "html" value from the element
  // "take" - whether Cheerio should take the "first", "last" or "normal" (document-order) element
  scrape: ContentExtractor[];

  // Configuration object used either to save the scraped content to a file or to send it via an HTTP request to an arbitrary 3rd-party endpoint
  // "type" - either "file" or "request", to save the content to the file system or upload it to a 3rd-party web service
  // "destination" - if type=file, path to the file where the content will be saved
  // "url" - if type=request, URL of the endpoint the content will be uploaded to
  // "method" - if type=request, HTTP method to use, either POST or PUT
  // "headers" - if type=request, headers object, for adding HTTP request headers
  submit: ContentSubmitter;
};
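
For TypeScript consumers, a configuration can presumably be authored directly against this type. The following is a minimal sketch, assuming the package exports the "ScraperConfig" type under that name (the export is not confirmed by this README), mirroring the shape of the JSON example below:

import type { ScraperConfig } from "ph-typescript-lib-scraping"; // assumed export

const config: ScraperConfig = {
  use: "fetch",
  name: "Example Portal",                      // arbitrary scraper name
  base: "https://example.com",                 // hypothetical target site
  favicon: "https://example.com/favicon.ico",
  roots: ["https://example.com/news"],         // pages to collect article links from
  links: {
    fetching: { method: "GET" },
    type: "html",
    selector: "article a",                     // selector for content links
  },
  remove: ["img", "iframe"],                   // strip these before scraping
  scrape: [
    { property: "title", selector: "h1" },
    { property: "content", selector: "section.description", type: "html" },
  ],
  submit: { type: "file", destination: "./example.json" },
};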

Example

A simple, working JSON configuration object that can be used to scrape articles from the Dnevno.hr news portal into a local JSON file.

{
  "use": "axios",
  "name": "Dnevno.hr News Portal",
  "base": "https://www.dnevno.hr",
  "favicon": "https://dnevno.hr/favicon.ico",
  "roots": [
    "https://www.dnevno.hr/category/vijesti",
    "https://www.dnevno.hr/category/sport",
    "https://www.dnevno.hr/category/magazin",
    "https://www.dnevno.hr/category/gospodarstvo-i-turizam",
    "https://www.dnevno.hr/category/planet-x",
    "https://www.dnevno.hr/category/zdravlje",
    "https://www.dnevno.hr/category/domovina",
    "https://www.dnevno.hr/category/vjera"
  ],
  "links": {
    "fetching": {
      "method": "GET"
    },
    "selector": "article.post a",
    "type": "html"
  },
  "remove": [
    "img",
    "iframe",
    "div.wpipa-container",
    "div.lwdgt-container",
    "p.lwdgt-logo",
    "center",
    "blockquote",
    "figure",
    "figcaption"
  ],
  "scrape": [
    { "property": "title", "selector": "h1" },
    { "property": "lead", "selector": "a.title" },
    {
      "property": "time",
      "selector": "time.date",
      "take": "first",
      "type": "text",
      "transfomers": [
        {
          "type": "split",
          "value": ",",
          "index": 1
        },
        {
          "type": "trim"
        }
      ]
    },
    {
      "property": "author",
      "selector": "span.author",
      "take": "first",
      "transformers": [
        {
          "type": "split",
          "value": "Autor:",
          "index": 1
        },
        {
          "type": "trim"
        }
      ]
    },
    {
      "remove": [
        "div.img-holder",
        "div.heading",
        "h1",
        "style",
        "div.info",
        "div.info-holder"
      ],
      "property": "content",
      "selector": "section.description",
      "type": "html"
    }
  ],
  "submit": {
    "type": "file",
    "destination": "./dnevno.json"
  }
}
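
This README does not show how a configuration is consumed. The following is a purely hypothetical usage sketch, assuming the library exposes a runner function; the "runScraper" name and signature, and the config file path, are illustrative assumptions, not a documented API:

import { readFile } from "node:fs/promises";
// Hypothetical entry point -- not an API documented in this README
import { runScraper } from "ph-typescript-lib-scraping";

async function main() {
  // Load the JSON configuration shown above (file name is arbitrary)
  const raw = await readFile("./dnevno.config.json", "utf8");
  const config = JSON.parse(raw);

  // With submit.type = "file", the scraped articles would end up in
  // ./dnevno.json, per the "destination" in the config above
  await runScraper(config);
}

main().catch(console.error);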