get-set-fetch
Node.js web crawler and scraper supporting various storage options under an extendable plugin system.
Table of Contents
- Getting Started
- Storage
- Site
- Resource
- PluginManager
- Plugins
- Logger
- Additional Documentation
Getting Started
Prerequisites
get-set-fetch handles all async operations via JavaScript ES6 async / await syntax. It requires at least Node.js 7.10.1.
Installation
Install the get-set-fetch module:
npm install get-set-fetch --save
Install knex and sqlite3 in order to use the default sqlite:memory storage:
npm install knex sqlite3 --save
First Crawl
// import the get-set-fetch dependency
const GetSetFetch = require('get-set-fetch');
/*
the entire code is async,
declare an async function in order to make use of await
*/
async function firstCrawl() {
  // init db connection, by default in-memory sqlite
  const { Site } = await GetSetFetch.init();

  /*
  load site if already present,
  otherwise create it by specifying a name and the first url to crawl,
  only links from this location down will be subject to further crawling
  */
  let site = await Site.get('simpleSite');
  if (!site) {
    site = new Site(
      'simpleSite',
      'https://simpleSite/',
    );
    await site.save();
  }

  // keep crawling the site until there are no more resources to crawl
  await site.crawl();
}
// start crawling
firstCrawl();
The above example uses a set of default plugins capable of crawling HTML content. Once firstCrawl completes, all of the site's HTML resources have been crawled, assuming they are discoverable from the initial url.
Storage
SQL Connections
SQLite
The default storage option if none is provided; it consumes the least amount of resources.
Requires knex and the sqlite3 driver:
npm install knex sqlite3 --save
Init storage:
const { Site } = await GetSetFetch.init({
  "client": "sqlite3",
  "useNullAsDefault": true,
  "connection": {
    "filename": ":memory:"
  }
});
MySQL
Requires knex and the mysql driver:
npm install knex mysql --save
Init storage:
const { Site } = await GetSetFetch.init({
  "client": "mysql",
  "useNullAsDefault": true,
  "connection": {
    "host": "localhost",
    "port": 33060,
    "user": "get-set-fetch-user",
    "password": "get-set-fetch-pswd",
    "database": "get-set-fetch-db"
  }
});
PostgreSQL
Requires knex and the PostgreSQL driver:
npm install knex pg --save
Init storage:
const { Site } = await GetSetFetch.init({
  "client": "pg",
  "useNullAsDefault": true,
  "connection": {
    "host": "localhost",
    "port": 54320,
    "user": "get-set-fetch-user",
    "password": "get-set-fetch-pswd",
    "database": "get-set-fetch-db"
  }
});
NoSQL Connections
MongoDB
Requires the mongodb driver:
npm install mongodb --save
Init storage:
const { Site } = await GetSetFetch.init({
  "url": "mongodb://localhost:27027",
  "dbName": "get-set-fetch-test"
});
Site
Site CRUD API
new Site(name, url, opts, createDefaultPlugins)
- name <string> site name
- url <string> site url
- opts <Object> site options
  - resourceFilter <Object> bloom filter settings for filtering duplicate urls
    - maxEntries <number> maximum number of expected unique urls. Defaults to 5000.
    - probability <number> probability a url is erroneously marked as duplicate. Defaults to 0.01.
- createDefaultPlugins <boolean> indicate if the default plugin set should be added to the site. Defaults to true.
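A minimal sketch of creating a site with explicit options, based on the constructor signature above; the name, url and option values are placeholders:
// inside an async function, after storage has been initialized via GetSetFetch.init()
const site = new Site(
  'docsSite',
  'https://docs.example.com/',
  {
    resourceFilter: {
      // expect up to 10000 unique urls, with a 1% false-positive rate
      maxEntries: 10000,
      probability: 0.01,
    },
  },
  // attach the default plugin set
  true,
);
await site.save();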
Site.get(nameOrId)
- nameOrId <string> site name or id
- returns <Promise<Site>>
site.save()
- returns <Promise<number>> the newly created site id
When a new site is created, its url is also saved as the first site resource at depth 0.
site.update()
- returns <Promise>
site.del()
- returns <Promise>
Site.delAll()
- returns <Promise>
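A short sketch tying the CRUD calls above together; the site name and url are placeholders:
// inside an async function, after const { Site } = await GetSetFetch.init()
let site = await Site.get('exampleSite');
if (!site) {
  site = new Site('exampleSite', 'https://example.com/');
  // save() also stores the url as the first site resource at depth 0
  const siteId = await site.save();
}

// persist later changes to the site
await site.update();

// remove this site, or every stored site
await site.del();
await Site.delAll();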
Site Plugin API
site.getPlugins()
- returns <Array<BasePlugin>> the plugins used for crawling.
site.setPlugins(plugins)
- plugins <Array<BasePlugin>> the plugins to be used for crawling.
The existing plugins are removed.
site.addPlugins(plugins)
- plugins <Array<BasePlugin>> additional plugins to be used for crawling.
The existing plugins are kept unless an additional plugin is of the same type as an existing plugin. In this case the additional plugin overwrites the existing one.
site.removePlugins(pluginNames)
- pluginNames <Array<String>> constructor names of the plugins to be removed
Remove the matching plugins from the existing ones.
site.cleanupPlugins()
- Some plugins (like ChromeFetchPlugin) open external processes. Each plugin is responsible for its own cleanup via plugin.cleanup().
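A sketch of adjusting a site's plugin set using the methods above; how a custom plugin class would be defined or imported is outside the scope of this example:
// list the constructor names of the plugins currently attached to the site
const pluginNames = site.getPlugins().map(plugin => plugin.constructor.name);
console.log(pluginNames);

// drop one of the default plugins by its constructor name
site.removePlugins(['JsDomPlugin']);

// add further plugins; a plugin of the same type overwrites the existing one
// site.addPlugins([new MyCustomPlugin()]); // MyCustomPlugin is hypothetical

// release external resources (e.g. headless browser processes) when done
site.cleanupPlugins();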
Site Crawl API
site.getResourceToCrawl()
- returns <Promise<Resource>>
site.saveResources(urls, depth)
- urls <Array<String>> urls of the resources to be added
- depth <Number> depth of the resources to be added
- returns <Promise>
The urls are filtered against the site bloom filter in order to remove duplicates.
site.getResourceCount()
- returns <Promise<number>> total number of site resources
site.fetchRobots(reqHeaders)
- reqHeaders <Object>
- returns <Promise>
Retrieves the site robots.txt content and updates the site robotsTxt property.
site.crawlResource()
- Loops through the ordered (based on phase) plugins and applies each one to the current site-resource pair. The first plugin in the SELECT phase is responsible for retrieving the resource to be crawled.
site.crawl(opts)
- opts <Object> crawl options
  - maxConnections <number> maximum number of resources crawled in parallel. Defaults to 1.
  - maxResources <number> if set, crawling will stop once the indicated number of resources has been crawled.
  - maxDepth <number> if set, crawling will stop once there are no more resources with a depth lower than the indicated one.
  - delay <number> delay in milliseconds between consecutive crawls. Defaults to 100.
Each time a resource has finished crawling, an attempt is made to restore the maximum number of parallel connections, in case new resources have been found and saved. Crawling stops and the returned promise is resolved once there are no more resources to crawl meeting the above criteria.
- returns <Promise>
site.stop()
- No further resource crawls are initiated. The ones in progress are completed.
- returns <Promise>
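A sketch of a crawl with explicit limits, using the options documented above; the numbers are purely illustrative:
async function limitedCrawl() {
  const { Site } = await GetSetFetch.init();

  let site = await Site.get('limitedSite');
  if (!site) {
    site = new Site('limitedSite', 'https://example.com/');
    await site.save();
  }

  // crawl at most 100 resources, 2 levels deep, 5 in parallel, 200 ms apart
  await site.crawl({
    maxConnections: 5,
    maxResources: 100,
    maxDepth: 2,
    delay: 200,
  });

  // to abort earlier, call site.stop(); resources in progress are still completed
}

limitedCrawl();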
Resource
Resource CRUD API
new Resource(siteId, url, depth)
- siteId <string> id of the site the resource belongs to
- url <string> resource url
- depth <number> resource depth. The first site resource has depth 0.
Resource.get(urlOrId)
- urlOrId <string> resource url or id
- returns <Promise<Resource>>
resource.save()
- returns <Promise<number>> the newly created resource id
resource.update()
- returns <Promise>
resource.del()
- returns <Promise>
Resource.delAll()
- returns <Promise>
Resource Crawl API
Resource.getResourceToCrawl(siteId)
- siteId <string> the resource will belong to the specified site id
- returns <Promise<Resource>>
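A sketch of working with resources directly; normally site.crawl() drives this loop internally. Destructuring Resource from GetSetFetch.init() and reading site.id are assumptions not shown elsewhere in this README:
// inside an async function
const { Site, Resource } = await GetSetFetch.init();
const site = await Site.get('exampleSite');

// fetch the next resource to crawl for this site, if any
const resource = await Resource.getResourceToCrawl(site.id);
if (resource) {
  console.log(resource.url, resource.depth);
  // persist any changes made while processing the resource
  await resource.update();
}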
PluginManager
PluginManager.DEFAULT_PLUGINS
- returns <Array<BasePlugin>> default plugins
pluginManager.register(plugins)
- plugins <Array<BasePlugin>|BasePlugin> registered plugins can later be instantiated from JSON strings retrieved from storage.
pluginManager.instantiate(jsonPlugins)
- jsonPlugins <Array<string>|string> instantiate plugin(s) from their corresponding JSON strings
- returns <Array<BasePlugin>|BasePlugin> plugin instance(s)
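A sketch of registering and re-instantiating plugins; how the pluginManager instance is obtained and the exact shape of the stored JSON are assumptions here:
// make the default plugin classes known to the manager
// so they can be re-created from storage later
pluginManager.register(PluginManager.DEFAULT_PLUGINS);

// jsonPlugins stands for the JSON strings previously persisted alongside a site
const restoredPlugins = pluginManager.instantiate(jsonPlugins);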
Plugins
Default Plugins
SelectResourcePlugin
- Selects a resource to crawl from the current site.
NodeFetchPlugin
- Downloads a site resource using node HTTP and HTTPS libraries.
JsDomPlugin
- Generates a jsdom document for the current resource.
ExtractUrlPlugin
- Responsible for extracting new resources from a resource document.
RobotsFilterPlugin
- Filters newly found resources based on robots.txt rules.
UpdateResourcePlugin
- Updates a resource after crawling it.
InsertResourcePlugin
- Saves newly found resources within the current site.
Optional Plugins
PersistResourcePlugin
- Writes a resource to disk.
ChromeFetchPlugin
- Alternative to <NodeFetchPlugin>; instead of just downloading a site resource, it also executes the JavaScript code (if present), returning the dynamically generated HTML content. Uses Puppeteer to control headless Chrome, which needs to be installed separately:
npm install puppeteer --save
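A sketch of switching a site from the default fetch plugin to ChromeFetchPlugin; the import path of the plugin class is an assumption, adjust it to your setup:
// const { ChromeFetchPlugin } = ...; // import from wherever the plugin classes are exposed

// drop the default fetch plugin and add the Chrome based one
site.removePlugins(['NodeFetchPlugin']);
site.addPlugins([new ChromeFetchPlugin()]);

await site.crawl();

// close the headless browser opened by ChromeFetchPlugin
site.cleanupPlugins();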
Logger
Logger.setLogLevel(logLevel)
- logLevel <trace|debug|info|warn|error> set the desired log level; levels below it will be ignored. Defaults to warn.
Logger.setLogPaths(outputPath, errorPath)
- outputPath <string> path for the output log
- errorPath <string> path for the error log
Logger.getLogger(cls)
- cls <string> class / category to be appended to each log message
- returns <Logger>
A log entry has the following elements: [LogLevel] [Class] [Date] [Message]. All console methods can be invoked: trace, debug, info, warn, error.
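A short usage sketch based on the Logger API above; the import path and the log file paths are assumptions / placeholders:
// assumes Logger is exported by the main module
const { Logger } = require('get-set-fetch');

Logger.setLogLevel('info');
Logger.setLogPaths('./crawl.log', './crawl-error.log');

const log = Logger.getLogger('MyCrawler');
// produces an entry with the [LogLevel] [Class] [Date] [Message] elements
log.info('crawl started');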
Additional Documentation
Read the full documentation for more details.