
web-tree-crawl

v1.1.4

A simple framework for crawling/scraping websites. The result is a tree where each node is a single request.


Note to English speakers

Many comments, issues, etc. are partially written in German. If you want something translated, create an issue and I'll take care of it.

Introduction

web-tree-crawl is available on npm and on GitLab.

Idea

The crawling process is tree-shaped: you start with a single URL (the root), download a document (a node), and discover new URLs (child nodes), which in turn will be downloaded. So every crawled document is a node in the tree and every URL is an edge. The tree spans only new edges; edges to already-known URLs will be stored, but not processed.

The end result will be a tree representing the crawl process. All discovered information will be stored in this tree.

The main difference between crawlers is which URLs and which data are scraped from discovered documents. So those two scrapers need to be supplied by the user, while the library web-tree-crawl takes care of everything else.
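
To make the idea concrete, here is a minimal sketch of the tree-shaped crawl loop. This is illustrative only and not the library's actual internals: the node shape and the fetchDocument helper are assumptions, while scrapeUrls and scrapeData stand in for the two user-supplied scrapers.

"use strict";
const https = require('https');

// Hypothetical HTTP helper; the real library handles this internally.
function fetchDocument(url) {
    return new Promise((resolve, reject) => {
        https.get(url, (res) => {
            let body = '';
            res.on('data', (chunk) => { body += chunk; });
            res.on('end', () => resolve(body));
        }).on('error', reject);
    });
}

// Illustrative crawl loop: every document becomes a node, every URL an edge,
// and only edges to unseen URLs are followed.
async function crawlTree(rootUrl, maxRequests, scrapeUrls, scrapeData) {
    const seen = new Set([rootUrl]); // already-known URLs are stored, not re-crawled
    const root = { url: rootUrl, data: null, children: [] };
    const queue = [root];
    let requests = 0;

    while (queue.length > 0 && requests < maxRequests) {
        const node = queue.shift();
        const content = await fetchDocument(node.url);
        requests += 1;
        node.data = scrapeData(content, node); // what to store for this document
        for (const url of scrapeUrls(content, node.url)) {
            if (!seen.has(url)) { // only new edges span the tree
                seen.add(url);
                const child = { url, data: null, children: [] };
                node.children.push(child);
                queue.push(child);
            }
        }
    }
    return root; // the tree representing the crawl process
}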

Example

Note: all examples use ECMAScript 6 (ES6).

Let's say you want the last couple of comics from xkcd.com. All you have to do is:

"use strict";
const crawler = require('web-tree-crawl');

// configure your crawler
let ts = new crawler("https://xkcd.com");
ts.config.maxRequests = 5;
ts.config.dataScraper = crawler.builtin.dataScraper.generalHtml;
ts.config.urlScraper = crawler.builtin.urlScraper.selectorFactory('a[rel="prev"]');

// execute!
ts.buildTree(function (root) {
    // print discovered data to stdout
    console.log(JSON.stringify(crawler.builtin.treeHelper.getDataAsFlatArray(root), null, "\t"));
});

For more examples, see: https://gitlab.com/wotanii/web-tree-crawl/tree/master/examples

Details/Documentation

Use web-tree-crawl like this:

  1. create the crawler object & set the initial URL
  2. modify the config object
  3. call buildTree & wait for the callback

Config

You will always want to define these config attributes:

  • maxRequests: how many documents may be crawled?
  • dataScraper: what data do you want to find?
  • urlScraper: how does the crawler look for new URLs?

There are more, but their defaults work well on most websites and are pretty much self-explanatory (if not, let me know by opening an issue).

Url Scraper

These are functions that scrape URLs from a document. The crawler applies this function to all crawled documents to discover new documents.

Create your own URL scraper or use a builtin. All URL scrapers must have this signature (see the sketch after the list):

  • parameters
    1. string: content of current document
    2. string: url of current document
  • returns
    1. string[]: discovered URLs
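
For example, a hand-written URL scraper matching this signature might look like the following. This is a hypothetical sketch: the regex-based link extraction is my own illustration, not a builtin, and it reuses the ts object from the example above.

// Hypothetical custom URL scraper with the required signature:
// (content, url) => string[]
function myUrlScraper(content, url) {
    // naive extraction of absolute http(s) links from the document
    return content.match(/https?:\/\/[^\s"'<>]+/g) || [];
}

ts.config.urlScraper = myUrlScraper;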

Data Scraper

These are functions that scrape data from a document. The crawler applies this function to all crawled documents to decide what data to store for each document.

The crawler will not use this data in any way, so you can return whatever you want.

Create your own data scraper or use a builtin. All data scrapers must have this signature (see the sketch after the list):

  • parameters
    1. string: content of current document
    2. string: current node
  • returns
    1. anything
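
As an illustration, a hand-written data scraper matching this signature could store each page's <title>. Both the function and the regex are hypothetical, not builtins:

// Hypothetical custom data scraper with the required signature:
// (content, node) => anything
function myDataScraper(content, node) {
    const match = content.match(/<title>([^<]*)<\/title>/i);
    // whatever is returned here is stored on the node in the result tree
    return { title: match ? match[1] : null };
}

ts.config.dataScraper = myDataScraper;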

Builtin

There are some static builtin functions that you don't need to use, but they will make your life easier. Some of these functions can be used directly and some are factories that return such functions.

Url Scraper

These are functions that scrape for URLs in common ways. Use them by putting them in your config like this:

ts.config.urlScraper = crawler.builtin.urlScraper.selectorFactory('a[rel="prev"]');

Data Scraper

These are functions that scrape for data in common ways. Use them by putting them in your config like this:

ts.config.dataScraper = crawler.builtin.dataScraper.generalHtml;

Tree Helper

These are functions that help extract information from the result tree. Use them once buildTree has finished.

They will either modify your tree (e.g. crawler.builtin.treeHelper.addParentsToNodes) or extract data from your tree (e.g. crawler.builtin.treeHelper.getDataAsFlatArray).
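
A short sketch of how these helpers might be combined once buildTree has finished. The helper names come from this readme; the exact shape of the data they return is an assumption:

ts.buildTree(function (root) {
    // add a parent reference to every node so the tree can be walked upwards
    // (assumption: this mutates the tree in place)
    crawler.builtin.treeHelper.addParentsToNodes(root);

    // collect everything the data scraper stored into a flat array
    const allData = crawler.builtin.treeHelper.getDataAsFlatArray(root);
    console.log(allData.length + " documents crawled");
});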

Dev-Setup

sudo apt install npm nodejs

git clone [email protected]:wotanii/web-tree-crawl.git
cd web-tree-crawl/
npm install

npm test

If tests fail with your setup, either create an issue or comment on an existing one.