dcrawler

v0.0.8

DCrawler is a distributed web spider written in Node.js and queued with MongoDB. It gives you the full power of jQuery to parse big pages as they are downloaded, asynchronously, simplifying distributed crawling.


node-distributed-crawler

Features

  • Distributed crawler
  • Configurable URL parser and data parser
  • jQuery-style selectors using cheerio
  • Parsed data insertion into a MongoDB collection
  • Domain-wise interval configuration in a distributed environment
  • Node 0.8+ support

Note: update to the latest version (0.0.4+); don't use 0.0.1.

I am actively updating this library; feature suggestions and fork/pull requests are welcome :)

Installation

$ npm install dcrawler

Usage

var DCrawler = require("dcrawler");

var options = {
    mongodbUri:  "mongodb://0.0.0.0:27017/crawler-data", // central store for the URL queue and parsed data
    profilePath: __dirname + "/profile"                  // directory containing per-domain profile config files
};
var logs = {
    dbUri:     "mongodb://0.0.0.0:27017/crawler-log",    // centralized log store (winston-mongodb)
    storeHost: true                                      // also record each worker's host name in the logs
};
var dc = new DCrawler(options, logs);
dc.start();

Note: the MongoDB connection URIs (mongodbUri and dbUri) should point to the same server, so that URL queueing stays centralized.
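
Because the URL queue lives in MongoDB, the same bootstrap script can be deployed to several machines to form a distributed crawl. The sketch below is not from the original README: "queue-host" is a hypothetical host name for the shared MongoDB instance, and the sketch only reuses the constructor and start() calls shown above.

// worker.js – run this same script on every crawl machine (hypothetical file name)
var DCrawler = require("dcrawler");

var dc = new DCrawler({
    mongodbUri:  "mongodb://queue-host:27017/crawler-data", // identical on every worker, so all workers share one queue
    profilePath: __dirname + "/profile"                     // every worker needs the same profile directory
});
dc.start();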

The DCrawler constructor takes options and log options:

  1. options with the following properties *:
  • mongodbUri: MongoDB connection URI (e.g. 'mongodb://0.0.0.0:27017/crawler') *
  • profilePath: Location of the profile directory which contains the config files (e.g. /home/crawler/profile) *
  2. logs to store logs in a centralized location using winston-mongodb, with the following properties:
  • dbUri: MongoDB connection URI (e.g. 'mongodb://0.0.0.0:27017/crawler')
  • storeHost: Boolean, whether to store each worker's host name in the log collection.

Note: logs is only required when you want to store centralized logs in MongoDB. If you don't want to store logs, there is no need to pass the log options to the DCrawler constructor:

var dc = new DCrawler(options);

Create a config file for each domain inside the profilePath directory. See the example profile for example.com, which contains a config with the following properties (a complete profile file is sketched after this list):

  • collection: Name of the collection in which to store parsed data in MongoDB (e.g. 'products') *
  • url: URL to start crawling from. String or array of URLs (e.g. 'http://example.com' or ['http://example.com']) *
  • interval: Interval between requests in milliseconds. Default is 1000 (e.g. for a 2 second interval: 2000)
  • followUrl: Boolean, whether to fetch further URLs from the crawled page and crawl those URLs as well.
  • resume: Boolean, whether to resume crawling from previously crawled data.
  • beforeStart: Function to execute before crawling starts. The function receives a config param containing the particular profile's config object. Example function:
beforeStart: function (config) {
    console.log("started crawling example.com");
}
  • parseUrl: Function to get further URLs from the crawled page. The function receives error, the response object, and the $ jQuery object as params, and returns an array of URL strings. Example function:
parseUrl: function (error, response, $) {
    var _url = [];
    
    try {
        $("a").each(function(){
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                if (href.indexOf("http://example.com") === -1) {
                    href = "http://example.com/" + href;
                }
                _url.push(href);
            }
        });
    } catch (e) {
        console.log(e);
    }
    
    return _url;
}
  • parseData: Function to extract information from the crawled page. The function receives error, the response object, and the $ jQuery object as params, and returns a data object to insert into the collection. Example function:
parseData: function (error, response, $) {
    var _data = null;
    
    try {
        var _id = $("h1#productId").html();
        var name = $("span#productName").html();
        var price = $("label#productPrice").html();
        var url = response.uri;
        
        _data = {
            _id: _id,
            name: name,
            price: price,
            url: url
        }
    } catch (e) {
        console.log(e);
    }
    
    return _data;
}
  • onComplete: Function to execute when crawling completes. The function receives a config param containing the particular profile's config object. Example function:
onComplete: function (config) {
    console.log("completed crawling example.com");
}
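
For reference, here is a minimal sketch of a complete profile file for example.com, combining the properties and example functions listed above. It assumes each profile is a Node module that exports a single config object; the file name and export style are assumptions, since the README does not show the file format.

// <profilePath>/example.com.js (hypothetical file name)
module.exports = {
    collection: "products",          // MongoDB collection for parsed data
    url: ["http://example.com"],     // seed URL(s) to start crawling from
    interval: 2000,                  // 2 seconds between requests to this domain
    followUrl: true,                 // follow URLs returned by parseUrl
    resume: false,                   // start fresh instead of resuming a previous crawl

    beforeStart: function (config) {
        console.log("started crawling example.com");
    },

    parseUrl: function (error, response, $) {
        var _url = [];
        $("a").each(function () {
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                if (href.indexOf("http://example.com") === -1) {
                    href = "http://example.com/" + href;
                }
                _url.push(href);
            }
        });
        return _url;
    },

    parseData: function (error, response, $) {
        return {
            _id: $("h1#productId").html(),
            name: $("span#productName").html(),
            price: $("label#productPrice").html(),
            url: response.uri
        };
    },

    onComplete: function (config) {
        console.log("completed crawling example.com");
    }
};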

Chirag (blikenoother -[at]- gmail [dot] com)