get-set-fetch

v0.3.8

web crawler, parser and scraper with storage capabilities


Node.js web crawler and scraper supporting various storage options under an extensible plugin system.

Getting Started

Prerequisites

get-set-fetch handles all async operations via the JavaScript ES6 async/await syntax. It requires at least Node.js 7.10.1.

Installation

Install the get-set-fetch module:
npm install get-set-fetch --save

Install knex and sqlite3 in order to use the default sqlite:memory storage:
npm install knex sqlite3 --save

First Crawl

// import the get-set-fetch dependency
const GetSetFetch = require('get-set-fetch');

/*
the entire code is async,
declare an async function in order to make use of await
*/
async function firstCrawl() {
  // init db connection, by default in memory sqlite
  const { Site } = await GetSetFetch.init();

  /*
  load site if already present,
  otherwise create it by specifying a name and the first url to crawl,
  only links from this location down will be subject to further crawling
  */
  let site = await Site.get('simpleSite');
  if (!site) {
    site = new Site(
      'simpleSite',
      'https://simpleSite/',
    );

    await site.save();
  }

  // keep crawling the site until there are no more resources to crawl
  await site.crawl();
}

// start crawling
firstCrawl();

The above example uses a set of default plugins capable of crawling HTML content. Once firstCrawl completes, all of the site's HTML resources have been crawled, assuming they are discoverable from the initial URL.

Storage

SQL Connections

SQLite

SQLite is the default storage option if none is provided, and it consumes the least amount of resources.

Requires knex and the sqlite3 driver.
npm install knex sqlite3 --save

Init storage:

const { Site } = await GetSetFetch.init({
  "client": "sqlite3",
  "useNullAsDefault": true,
  "connection": {
      "filename": ":memory:"
  }
});

MySQL

Requires knex and the mysql driver.
npm install knex mysql --save

Init storage:

const { Site } = await GetSetFetch.init({
  "client": "mysql",
  "useNullAsDefault": true,
  "connection": {
    "host" : "localhost",
    "port": 33060,
    "user" : "get-set-fetch-user",
    "password" : "get-set-fetch-pswd",
    "database" : "get-set-fetch-db"
  }
});

PostgreSQL

Requires knex and the pg driver.
npm install knex pg --save

Init storage:

const { Site } = await GetSetFetch.init({
  "client": "pg",
  "useNullAsDefault": true,
  "connection": {
    "host" : "localhost",
    "port": 54320,
    "user" : "get-set-fetch-user",
    "password" : "get-set-fetch-pswd",
    "database" : "get-set-fetch-db"
  }
});

NoSQL Connections

MongoDB

Requires the mongodb driver.
npm install mongodb --save

Init storage:

const { Site } = await GetSetFetch.init({
  "url": "mongodb://localhost:27027",
  "dbName": "get-set-fetch-test"
});

Site

Site CRUD API

new Site(name, url, opts, createDefaultPlugins)

  • name <string> site name
  • url <string> site url
  • opts <Object> site options
    • resourceFilter <Object> bloom filter settings for filtering duplicate urls
      • maxEntries <number> maximum number of expected unique urls. Defaults to 5000.
      • probability <number> probability a url is erroneously marked as a duplicate. Defaults to 0.01.
  • createDefaultPlugins <boolean> indicate if the default plugin set should be added to the site. Defaults to true.
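
A short sketch of how these options can be passed when creating a site; the site name, url and option values below are illustrative, not recommended defaults:

// expect up to ~10000 unique urls, accept a 1% false-duplicate probability
const site = new Site('docsSite', 'https://docsSite/', {
  resourceFilter: {
    maxEntries: 10000,
    probability: 0.01,
  },
});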

Site.get(nameOrId)

  • nameOrId <string> site name or id
  • returns <Promise<Site>>

site.save()

  • returns <Promise<number>> the newly created site id
    When a new site is created, its url is also saved as the first site resource at depth 0.

site.update()

  • returns <Promise>

site.del()

  • returns <Promise>

Site.delAll()

  • returns <Promise>

Site Plugin API

site.getPlugins()

  • returns <Array<BasePlugin>> the plugins used for crawling.

site.setPlugins(plugins)

  • plugins <Array<BasePlugin>> the plugins to be used for crawling.
    The existing ones are removed.

site.addPlugins(plugins)

  • plugins <Array<BasePlugin>> additional plugins to be used for crawling.
    The existing plugins are kept unless an additional plugin is of the same type as an existing plugin. In this case the additional plugin overwrites the existing one.

site.removePlugins(pluginNames)

  • pluginNames <Array<String>> constructor names of the plugins to be removed
    Remove the matching plugins from the existing ones.

site.cleanupPlugins()

  • Some plugins (like ChromeFetchPlugin) open external processes. Each plugin is responsible for its own cleanup via plugin.cleanup().
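
A short sketch of the plugin API above, assuming site is a Site instance created with the default plugin set:

// list the constructor names of the plugins currently used for crawling
site.getPlugins().forEach(plugin => console.log(plugin.constructor.name));

// drop robots.txt filtering by removing the plugin via its constructor name
site.removePlugins(['RobotsFilterPlugin']);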

Site Crawl API

site.getResourceToCrawl()

site.saveResources(urls, depth)

  • urls <Array<String>> urls of the resources to be added
  • depth <Number> depth of the resources to be added
  • returns <Promise>
    The urls are filtered against the site bloom filter in order to remove duplicates.

site.getResourceCount()

  • returns <Promise<number>> total number of site resources
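
A short sketch combining saveResources and getResourceCount, assuming site is an already saved Site instance; the urls are illustrative:

// manually seed two extra urls at depth 1; duplicates are dropped by the bloom filter
await site.saveResources([
  'https://simpleSite/about',
  'https://simpleSite/contact',
], 1);

// total number of resources currently stored for the site
const resourceCount = await site.getResourceCount();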

site.fetchRobots(reqHeaders)

  • reqHeaders <Object>
    Retrieves the site robots.txt content and updates the site robotsTxt property.
  • returns <Promise>
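
A short sketch of the call above, assuming site is an already saved Site instance; the request header value is illustrative:

// fetch robots.txt using a custom user agent, then read the stored content
await site.fetchRobots({ 'User-Agent': 'get-set-fetch-crawler' });
console.log(site.robotsTxt);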

site.crawlResource()

  • Loops through the plugins, ordered by phase, and applies each one to the current site-resource pair. The first plugin in the SELECT phase is responsible for retrieving the resource to be crawled.

site.crawl(opts)

  • opts <Object> crawl options
    • maxConnections <number> maximum number of resources crawled in parallel. Defaults to 1.
    • maxResources <number> If set, crawling will stop once the indicated number of resources has been crawled.
    • maxDepth <number> If set, crawling will stop once there are no more resources to crawl at a depth lower than the indicated one.
    • delay <number> delay in milliseconds between consecutive crawls. Defaults to 100.
      Each time a resource finishes crawling, the crawler attempts to restore the maximum number of parallel connections in case new resources have been found and saved. Crawling stops and the returned promise resolves once there are no more resources to crawl meeting the above criteria.
  • returns <Promise>
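
A short sketch of a crawl call using these options; the values are illustrative:

// crawl up to 5 resources in parallel, stop after 100 resources or depth 3,
// waiting 200 ms between consecutive crawls
await site.crawl({
  maxConnections: 5,
  maxResources: 100,
  maxDepth: 3,
  delay: 200,
});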

site.stop()

  • No further resource crawls are initiated. The ones in progress are completed.
  • returns <Promise>
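
A short sketch of stopping a crawl after a fixed time budget, assuming site is an already saved Site instance; the timeout value is illustrative:

// start crawling without awaiting, then stop after 60 seconds
const crawlPromise = site.crawl();
setTimeout(() => site.stop(), 60000);
await crawlPromise;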

Resource

Resource CRUD API

new Resource(siteId, url, depth)

  • siteId <string> id of the site the resource belongs to
  • url <string> resource url
  • depth <number> resource depth. First site resource has depth 0.

Resource.get(urlOrId)

  • urlOrId <string> resource url or id
  • returns <Promise<Resource>>

resource.save()

  • returns <Promise<number>> the newly created resource id

resource.update()

  • returns <Promise>

resource.del()

  • returns <Promise>

Resource.delAll()

  • returns <Promise>
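
A short sketch of the Resource CRUD API; it assumes Resource is returned by GetSetFetch.init() alongside Site, which is not shown in the First Crawl example:

// assumption: init() also exposes Resource
const { Site, Resource } = await GetSetFetch.init();

// look up an already crawled resource by its url and delete it
const resource = await Resource.get('https://simpleSite/');
if (resource) {
  await resource.del();
}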

Resource Crawl API

Resource.getResourceToCrawl(siteId)

  • siteId <string> the resource will belong to the specified site id
  • returns <Promise<Resource>>

PluginManager

PluginManager.DEFAULT_PLUGINS

pluginManager.register(plugins)

  • plugins <Array<BasePlugin>|BasePlugin> registered plugins can later be instantiated from JSON strings retrieved from storage.

pluginManager.instantiate(jsonPlugins)

  • jsonPlugins <Array<string>|string> instantiate plugin(s) from their corresponding JSON strings
  • returns <Array<BasePlugin>|BasePlugin> plugin instance(s)
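
A rough sketch of the round trip these two methods describe; the pluginManager instance, the MyCustomPlugin class and the jsonPluginsFromStorage variable are assumptions not covered by this readme:

// assumption: `pluginManager` is an available PluginManager instance
// and MyCustomPlugin extends BasePlugin
pluginManager.register([MyCustomPlugin]);

// later, rebuild plugin instances from the JSON strings stored alongside a site
const plugins = pluginManager.instantiate(jsonPluginsFromStorage);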

Plugins

Default Plugins

SelectResourcePlugin

  • Selects a resource to crawl from the current site.

NodeFetchPlugin

  • Downloads a site resource using node HTTP and HTTPS libraries.

JsDomPlugin

  • Generates a jsdom document for the current resource.

ExtractUrlPlugin

  • Responsible for extracting new resources from a resource document.

RobotsFilterPlugin

  • Filters newly found resources based on robots.txt rules.

UpdateResourcePlugin

  • Updates a resource after crawling it.

InsertResourcePlugin

  • Saves newly found resources within the current site.

Optional Plugins

PersistResourcePlugin

  • Writes a resource to disk.

ChromeFetchPlugin

  • Alternative to NodeFetchPlugin; instead of just downloading a site resource, it also executes the JavaScript code (if present), returning the dynamically generated HTML content. Uses Puppeteer to control headless Chrome, which needs to be installed separately:
    npm install puppeteer --save

Logger

Logger.setLogLevel(logLevel)

  • logLevel <trace|debug|info|warn|error> sets the desired log level; levels below it will be ignored. Defaults to warn.

Logger.setLogPaths(outputPath, errorPath)

  • outputPath <string> path for the output log
  • errorPath <string> path for the error log

Logger.getLogger(cls)

  • cls <string> class / category to be appended to each log message. A log entry has the following elements: [LogLevel] [Class] [Date] [Message]. All console methods can be invoked: trace, debug, info, warn, error.
  • returns <Logger>
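
A short sketch of the Logger API above; the assumption that Logger is exposed on the main get-set-fetch export is not confirmed by this readme, and the file paths and category are illustrative:

const GetSetFetch = require('get-set-fetch');

// assumption: Logger is exposed on the main module export
const { Logger } = GetSetFetch;

// log info and above, writing output and errors to separate files
Logger.setLogLevel('info');
Logger.setLogPaths('./gsf-output.log', './gsf-error.log');

// tag entries with a custom category
const logger = Logger.getLogger('FirstCrawl');
logger.info('crawl started');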

Additional Documentation

Read the full documentation for more details.