dcrawler

v0.0.8

DCrawler is a distributed web spider written in Node.js and queued with MongoDB. It gives you the full power of jQuery to parse big pages as they are downloaded, asynchronously, simplifying distributed crawling.


node-distributed-crawler

Features

  • Distributed crawler
  • Configurable URL parser and data parser
  • jQuery-style selectors using cheerio
  • Parsed data insertion into a MongoDB collection
  • Domain-wise interval configuration in a distributed environment
  • Node 0.8+ support

Note: update to the latest version (0.0.4+); don't use 0.0.1.

I am actively updating this library; feature suggestions and fork/pull requests are welcome :)

Installation

$ npm install dcrawler

Usage

var DCrawler = require("dcrawler");

var options = {
    mongodbUri:  "mongodb://0.0.0.0:27017/crawler-data", // central store for the URL queue and parsed data
    profilePath: __dirname + "/profile"                  // directory containing per-domain profile config files
};
var logs = {
    dbUri:     "mongodb://0.0.0.0:27017/crawler-log",    // centralized log store (winston-mongodb)
    storeHost: true                                      // also record each worker's host name in the logs
};
var dc = new DCrawler(options, logs);
dc.start();

Note: the MongoDB connection URIs (mongodbUri and dbUri) should point to the same server, so that URL queueing stays centralized.
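
Because the URL queue lives in MongoDB, the same bootstrap script can be deployed to several machines to form a distributed crawl. The sketch below is not from the original README: "queue-host" is a hypothetical host name for the shared MongoDB instance, and the sketch only reuses the constructor and start() calls shown above.

// worker.js – run this same script on every crawl machine (hypothetical file name)
var DCrawler = require("dcrawler");

var dc = new DCrawler({
    mongodbUri:  "mongodb://queue-host:27017/crawler-data", // identical on every worker, so all workers share one queue
    profilePath: __dirname + "/profile"                     // every worker needs the same profile directory
});
dc.start();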

The DCrawler constructor takes options and log options:

  1. options with the following properties *:
  • mongodbUri: MongoDB connection URI (e.g. 'mongodb://0.0.0.0:27017/crawler') *
  • profilePath: Location of the profile directory which contains the config files (e.g. /home/crawler/profile) *
  2. logs to store logs in a centralized location using winston-mongodb, with the following properties:
  • dbUri: MongoDB connection URI (e.g. 'mongodb://0.0.0.0:27017/crawler')
  • storeHost: Boolean, whether to store each worker's host name in the log collection.

Note: logs is only required when you want to store centralized logs in MongoDB. If you don't want to store logs, there is no need to pass the log options to the DCrawler constructor:

var dc = new DCrawler(options);

Create a config file for each domain inside the profilePath directory. See the example profile for example.com, which contains a config with the following properties (a complete profile file is sketched after this list):

  • collection: Name of the collection in which to store parsed data in MongoDB (e.g. 'products') *
  • url: URL to start crawling from. String or array of URLs (e.g. 'http://example.com' or ['http://example.com']) *
  • interval: Interval between requests in milliseconds. Default is 1000 (e.g. for a 2 second interval: 2000)
  • followUrl: Boolean, whether to fetch further URLs from the crawled page and crawl those URLs as well.
  • resume: Boolean, whether to resume crawling from previously crawled data.
  • beforeStart: Function to execute before crawling starts. The function receives a config param containing the particular profile's config object. Example function:
beforeStart: function (config) {
    console.log("started crawling example.com");
}
  • parseUrl: Function to get further URLs from the crawled page. The function receives error, the response object, and the $ jQuery object as params, and returns an array of URL strings. Example function:
parseUrl: function (error, response, $) {
    var _url = [];
    
    try {
        $("a").each(function(){
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                if (href.indexOf("http://example.com") === -1) {
                    href = "http://example.com/" + href;
                }
                _url.push(href);
            }
        });
    } catch (e) {
        console.log(e);
    }
    
    return _url;
}
  • parseData: Function to extract information from the crawled page. The function receives error, the response object, and the $ jQuery object as params, and returns a data object to insert into the collection. Example function:
parseData: function (error, response, $) {
    var _data = null;
    
    try {
        var _id = $("h1#productId").html();
        var name = $("span#productName").html();
        var price = $("label#productPrice").html();
        var url = response.uri;
        
        _data = {
            _id: _id,
            name: name,
            price: price,
            url: url
        }
    } catch (e) {
        console.log(e);
    }
    
    return _data;
}
  • onComplete: Function to execute when crawling completes. The function receives a config param containing the particular profile's config object. Example function:
onComplete: function (config) {
    console.log("completed crawling example.com");
}
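
For reference, here is a minimal sketch of a complete profile file for example.com, combining the properties and example functions listed above. It assumes each profile is a Node module that exports a single config object; the file name and export style are assumptions, since the README does not show the file format.

// <profilePath>/example.com.js (hypothetical file name)
module.exports = {
    collection: "products",          // MongoDB collection for parsed data
    url: ["http://example.com"],     // seed URL(s) to start crawling from
    interval: 2000,                  // 2 seconds between requests to this domain
    followUrl: true,                 // follow URLs returned by parseUrl
    resume: false,                   // start fresh instead of resuming a previous crawl

    beforeStart: function (config) {
        console.log("started crawling example.com");
    },

    parseUrl: function (error, response, $) {
        var _url = [];
        $("a").each(function () {
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                if (href.indexOf("http://example.com") === -1) {
                    href = "http://example.com/" + href;
                }
                _url.push(href);
            }
        });
        return _url;
    },

    parseData: function (error, response, $) {
        return {
            _id: $("h1#productId").html(),
            name: $("span#productName").html(),
            price: $("label#productPrice").html(),
            url: response.uri
        };
    },

    onComplete: function (config) {
        console.log("completed crawling example.com");
    }
};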

Chirag (blikenoother -[at]- gmail [dot] com)