get-set-fetch
Node.js web crawler and scraper supporting various storage options under an extendable plugin system.
Table of Contents
- Getting Started
- Storage
- Site
- Resource
- PluginManager
- Plugins
- Logger
- Additional Documentation
Getting Started
Prerequisites
get-set-fetch handles all async operations via JavaScript ES6 async / await syntax. It requires at least Node.js 7.10.1.
Installation
Install the get-set-fetch module:
npm install get-set-fetch --save
Install knex and sqlite3 in order to use the default sqlite:memory storage:
npm install knex sqlite3 --save
First Crawl
// import the get-set-fetch dependency
const GetSetFetch = require('get-set-fetch');
/*
the entire code is async,
declare an async function in order to make use of await
*/
async function firstCrawl() {
  // init db connection, by default in-memory sqlite
  const { Site } = await GetSetFetch.init();

  /*
  load site if already present,
  otherwise create it by specifying a name and the first url to crawl,
  only links from this location down will be subject to further crawling
  */
  let site = await Site.get('simpleSite');
  if (!site) {
    site = new Site(
      'simpleSite',
      'https://simpleSite/',
    );
    await site.save();
  }

  // keep crawling the site until there are no more resources to crawl
  await site.crawl();
}
// start crawling
firstCrawl();
The above example uses a set of default plugins capable of crawling HTML content. Once firstCrawl completes, all of the site's HTML resources have been crawled, assuming they are discoverable from the initial url.
Storage
SQL Connections
SQLite
The default storage option if none is provided; it consumes the least amount of resources.
Requires knex and the sqlite3 driver:
npm install knex sqlite3 --save
Init storage:
const { Site } = await GetSetFetch.init({
  "client": "sqlite3",
  "useNullAsDefault": true,
  "connection": {
    "filename": ":memory:"
  }
});
MySQL
Requires knex and the mysql driver:
npm install knex mysql --save
Init storage:
const { Site } = await GetSetFetch.init({
  "client": "mysql",
  "useNullAsDefault": true,
  "connection": {
    "host": "localhost",
    "port": 33060,
    "user": "get-set-fetch-user",
    "password": "get-set-fetch-pswd",
    "database": "get-set-fetch-db"
  }
});
PostgreSQL
Requires knex and the PostgreSQL driver:
npm install knex pg --save
Init storage:
const { Site } = await GetSetFetch.init({
  "client": "pg",
  "useNullAsDefault": true,
  "connection": {
    "host": "localhost",
    "port": 54320,
    "user": "get-set-fetch-user",
    "password": "get-set-fetch-pswd",
    "database": "get-set-fetch-db"
  }
});
NoSQL Connections
MongoDB
Requires the mongodb driver:
npm install mongodb --save
Init storage:
const { Site } = await GetSetFetch.init({
  "url": "mongodb://localhost:27027",
  "dbName": "get-set-fetch-test"
});
Site
Site CRUD API
new Site(name, url, opts, createDefaultPlugins)
- name <string> site name
- url <string> site url
- opts <Object> site options
  - resourceFilter <Object> bloom filter settings for filtering duplicate urls
    - maxEntries <number> maximum number of expected unique urls. Defaults to 5000.
    - probability <number> probability a url is erroneously marked as duplicate. Defaults to 0.01.
- createDefaultPlugins <boolean> indicate if the default plugin set should be added to the site. Defaults to true.
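A minimal sketch of creating a site with explicit options, based on the constructor signature above; the name, url and option values are placeholders:
// inside an async function, after storage has been initialized via GetSetFetch.init()
const site = new Site(
  'docsSite',
  'https://docs.example.com/',
  {
    resourceFilter: {
      // expect up to 10000 unique urls, with a 1% false-positive rate
      maxEntries: 10000,
      probability: 0.01,
    },
  },
  // attach the default plugin set
  true,
);
await site.save();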
Site.get(nameOrId)
- nameOrId <string> site name or id
- returns <Promise<Site>>
site.save()
- returns <Promise<number>> the newly created site id
When a new site is created, its url is also saved as the first site resource at depth 0.
site.update()
- returns <Promise>
site.del()
- returns <Promise>
Site.delAll()
- returns <Promise>
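A short sketch tying the CRUD calls above together; the site name and url are placeholders:
// inside an async function, after const { Site } = await GetSetFetch.init()
let site = await Site.get('exampleSite');
if (!site) {
  site = new Site('exampleSite', 'https://example.com/');
  // save() also stores the url as the first site resource at depth 0
  const siteId = await site.save();
}

// persist later changes to the site
await site.update();

// remove this site, or every stored site
await site.del();
await Site.delAll();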
Site Plugin API
site.getPlugins()
- returns <Array<BasePlugin>> the plugins used for crawling.
site.setPlugins(plugins)
- plugins <Array<BasePlugin>> the plugins to be used for crawling.
The existing plugins are removed.
site.addPlugins(plugins)
- plugins <Array<BasePlugin>> additional plugins to be used for crawling.
The existing plugins are kept unless an additional plugin is of the same type as an existing plugin. In this case the additional plugin overwrites the existing one.
site.removePlugins(pluginNames)
- pluginNames <Array<String>> constructor names of the plugins to be removed
Remove the matching plugins from the existing ones.
site.cleanupPlugins()
- Some plugins (like ChromeFetchPlugin) open external processes. Each plugin is responsible for its own cleanup via plugin.cleanup().
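A sketch of adjusting a site's plugin set using the methods above; how a custom plugin class would be defined or imported is outside the scope of this example:
// list the constructor names of the plugins currently attached to the site
const pluginNames = site.getPlugins().map(plugin => plugin.constructor.name);
console.log(pluginNames);

// drop one of the default plugins by its constructor name
site.removePlugins(['JsDomPlugin']);

// add further plugins; a plugin of the same type overwrites the existing one
// site.addPlugins([new MyCustomPlugin()]); // MyCustomPlugin is hypothetical

// release external resources (e.g. headless browser processes) when done
site.cleanupPlugins();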
Site Crawl API
site.getResourceToCrawl()
- returns <Promise<Resource>>
site.saveResources(urls, depth)
- urls <Array<String>> urls of the resources to be added
- depth <Number> depth of the resources to be added
- returns <Promise>
The urls are filtered against the site bloom filter in order to remove duplicates.
site.getResourceCount()
- returns <Promise<number>> total number of site resources
site.fetchRobots(reqHeaders)
- reqHeaders <Object>
- returns <Promise>
Retrieves the site robots.txt content and updates the site robotsTxt property.
site.crawlResource()
- Loops through the ordered (based on phase) plugins and applies each one to the current site-resource pair. The first plugin in the SELECT phase is responsible for retrieving the resource to be crawled.
site.crawl(opts)
- opts <Object> crawl options
  - maxConnections <number> maximum number of resources crawled in parallel. Defaults to 1.
  - maxResources <number> if set, crawling will stop once the indicated number of resources has been crawled.
  - maxDepth <number> if set, crawling will stop once there are no more resources with a depth lower than the indicated one.
  - delay <number> delay in milliseconds between consecutive crawls. Defaults to 100.
Each time a resource has finished crawling, an attempt is made to restore the maximum number of parallel connections, in case new resources have been found and saved. Crawling stops and the returned promise is resolved once there are no more resources to crawl meeting the above criteria.
- returns <Promise>
site.stop()
- No further resource crawls are initiated. The ones in progress are completed.
- returns <Promise>
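A sketch of a crawl with explicit limits, using the options documented above; the numbers are purely illustrative:
async function limitedCrawl() {
  const { Site } = await GetSetFetch.init();

  let site = await Site.get('limitedSite');
  if (!site) {
    site = new Site('limitedSite', 'https://example.com/');
    await site.save();
  }

  // crawl at most 100 resources, 2 levels deep, 5 in parallel, 200 ms apart
  await site.crawl({
    maxConnections: 5,
    maxResources: 100,
    maxDepth: 2,
    delay: 200,
  });

  // to abort earlier, call site.stop(); resources in progress are still completed
}

limitedCrawl();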
Resource
Resource CRUD API
new Resource(siteId, url, depth)
- siteId <string> id of the site the resource belongs to
- url <string> resource url
- depth <number> resource depth. The first site resource has depth 0.
Resource.get(urlOrId)
- urlOrId <string> resource url or id
- returns <Promise<Resource>>
resource.save()
- returns <Promise<number>> the newly created resource id
resource.update()
- returns <Promise>
resource.del()
- returns <Promise>
Resource.delAll()
- returns <Promise>
Resource Crawl API
Resource.getResourceToCrawl(siteId)
- siteId <string> the resource will belong to the specified site id
- returns <Promise<Resource>>
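A sketch of working with resources directly; normally site.crawl() drives this loop internally. Destructuring Resource from GetSetFetch.init() and reading site.id are assumptions not shown elsewhere in this README:
// inside an async function
const { Site, Resource } = await GetSetFetch.init();
const site = await Site.get('exampleSite');

// fetch the next resource to crawl for this site, if any
const resource = await Resource.getResourceToCrawl(site.id);
if (resource) {
  console.log(resource.url, resource.depth);
  // persist any changes made while processing the resource
  await resource.update();
}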
PluginManager
PluginManager.DEFAULT_PLUGINS
- returns <Array<BasePlugin>> default plugins
pluginManager.register(plugins)
- plugins <Array<BasePlugin>|BasePlugin> registered plugins can later be instantiated from JSON strings retrieved from storage.
pluginManager.instantiate(jsonPlugins)
- jsonPlugins <Array<string>|string> instantiate plugin(s) from their corresponding JSON strings
- returns <Array<BasePlugin>|BasePlugin> plugin instance(s)
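A sketch of registering and re-instantiating plugins; how the pluginManager instance is obtained and the exact shape of the stored JSON are assumptions here:
// make the default plugin classes known to the manager
// so they can be re-created from storage later
pluginManager.register(PluginManager.DEFAULT_PLUGINS);

// jsonPlugins stands for the JSON strings previously persisted alongside a site
const restoredPlugins = pluginManager.instantiate(jsonPlugins);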
Plugins
Default Plugins
SelectResourcePlugin
- Selects a resource to crawl from the current site.
NodeFetchPlugin
- Downloads a site resource using node HTTP and HTTPS libraries.
JsDomPlugin
- Generates a jsdom document for the current resource.
ExtractUrlPlugin
- Responsible for extracting new resources from a resource document.
RobotsFilterPlugin
- Filters newly found resources based on robots.txt rules.
UpdateResourcePlugin
- Updates a resource after crawling it.
InsertResourcePlugin
- Saves newly found resources within the current site.
Optional Plugins
PersistResourcePlugin
- Writes a resource to disk.
ChromeFetchPlugin
- Alternative to <NodeFetchPlugin>; instead of just downloading a site resource, it also executes the JavaScript code (if present), returning the dynamically generated HTML content. Uses Puppeteer to control headless Chrome, which needs to be installed separately:
npm install puppeteer --save
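A sketch of switching a site from the default fetch plugin to ChromeFetchPlugin; the import path of the plugin class is an assumption, adjust it to your setup:
// const { ChromeFetchPlugin } = ...; // import from wherever the plugin classes are exposed

// drop the default fetch plugin and add the Chrome based one
site.removePlugins(['NodeFetchPlugin']);
site.addPlugins([new ChromeFetchPlugin()]);

await site.crawl();

// close the headless browser opened by ChromeFetchPlugin
site.cleanupPlugins();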
Logger
Logger.setLogLevel(logLevel)
- logLevel <trace|debug|info|warn|error> set the desired log level; levels below it will be ignored. Defaults to warn.
Logger.setLogPaths(outputPath, errorPath)
- outputPath <string> path for the output log
- errorPath <string> path for the error log
Logger.getLogger(cls)
- cls <string> class / category to be appended to each log message
- returns <Logger>
A log entry has the following elements: [LogLevel] [Class] [Date] [Message]. All console methods can be invoked: trace, debug, info, warn, error.
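A short usage sketch based on the Logger API above; the import path and the log file paths are assumptions / placeholders:
// assumes Logger is exported by the main module
const { Logger } = require('get-set-fetch');

Logger.setLogLevel('info');
Logger.setLogPaths('./crawl.log', './crawl-error.log');

const log = Logger.getLogger('MyCrawler');
// produces an entry with the [LogLevel] [Class] [Date] [Message] elements
log.info('crawl started');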
Additional Documentation
Read the full documentation for more details.