node-repunt

Webcrawler for node.

Installation

npm install repunt

Features

repunt is a webcrawler characterized by

  • standard event publishing through EventEmitter
  • tight integration with request
  • optional caching, enabling incremental crawling and offline analysis
  • markup analysis through cheerio
  • an extensible filter architecture, allowing inspection, modification, prevention and enqueueing of requests
  • uses the best of the best libraries: request, cheerio, lodash, q

Quick sample

This sample crawls localhost, starting at http://localhost/start and following links until at most 10 distinct pages have been crawled. For each page that cheerio can parse (task.$ is set), the title is logged to the console.

var repunt = require('repunt');
repunt({connections: 8})    // throttle to at most 8 concurrent requests
    .use(repunt.cheerio())  // parse markup using cheerio
    .use(repunt.followLinks())  // follow <a href=...>
    .use(repunt.stayInRange(['http://localhost/'])) // don't stray away from this domain
    .use(repunt.trimHashes())   // ignore hashes: /start#about becomes /start
    .use(repunt.once()) // visit each distinct url at most once
    .use(repunt.atMost(10)) // limit to 10 requests in total
    .use(repunt.fileCache('./temp/.cache')) // fetch from/save to cache
    .on('start', function (){ // called once before any other event
        console.log('SPIDER START');
    })
    .on('enqueue', function (task) {  // called once per queued url
        //console.log('ENQUEUE',task.url);
    })
    .on('init', function (task) { // called before an actual request is issued
        // console.log('INIT',task.url);
    })
    .on('complete', function (task) { // called when result from fetch is available
        if (task.$){
            console.log(task.$('title').text());
        }
    })
    .on('error', function (error /*, task - if applicable */) { // hopefully never called
        console.log('ERROR',error);
    })
    .on('done', function (){  // called once after all other events
        console.log('SPIDER DONE');
    })
    .enqueue('http://localhost/start')
    .start();

Motivation

Why another crawler? I have participated in several projects where public websites had to be migrated to other platforms. In one particular case it was an e-commerce site without a product database, so my only option was to crawl the existing site and pull the product information (descriptions, variants, related products, images, ...) out of the web pages. This is generally time-consuming and requires a lot of development work on the analysis side. Having a nice crawler with good caching support (the speedup is significant!) made my life easier.

Filter cheatsheet

.use(repunt.trimHashes())

Removes hashes from urls. http://www.mysite.com/start#index will be enqueued as http://www.mysite.com/start.

.use(repunt.ignoreQueryStrings(true))

Removes all querystrings. http://www.mysite.com/start?a=1&b=2 will be enqueued as http://www.mysite.com/start.

.use(repunt.ignoreQueryStrings(['a','b']))

Removes named querystring parameters. http://www.mysite.com/start?a=1&b=2&c=1 will be enqueued as http://www.mysite.com/start?c=1.

.use(repunt.once())

Any url will only be enqueued once. You almost always want this filter.

.use(repunt.atMost(10))

Enqueues at most 10 urls in total. Great for testing.

.use(repunt.cheerio())

The response is parsed using cheerio and the result is stored in task.$. Useful for DOM inspection.
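
For example, a complete handler can use task.$ much like a jQuery object. This is a minimal sketch to plug into a chain like the quick sample above; the selector is just illustrative:

.on('complete', function (task) {
    if (task.$) {  // only set when cheerio could parse the response
        task.$('a[href]').each(function () {
            console.log(task.url, '->', task.$(this).attr('href'));
        });
    }
})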

.use(repunt.followLinks())

If the cheerio filter is used, and the content type is something like text/*, links from <a href> are enqueued to the repunt instance.

.use(repunt.followImages())

Similar to followLinks but enqueues <img src> instead.

.use(repunt.stayInRange(['http://site1/', 'http://site2/']))

Prevents repunt from straying away from site1 and site2, even when crawled pages link elsewhere. Always use this, unless you want to crawl the whole internet!

.use(repunt.fileCache('./temp/.cache'))

Caches results (with http status 200) in the folder ./temp/.cache. Crawl the site once, throw away your network card, and you can still repeat your last run. Great for offline analysis of sites. FYI: ./temp/.cache/.index contains some useful info about what's cached.
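
A minimal offline-analysis sketch (paths and urls are just examples): run the quick sample once to populate ./temp/.cache, then run a chain like the one below with the same cache folder, and responses are served from disk instead of the network.

var repunt = require('repunt');
repunt({connections: 1})
    .use(repunt.cheerio())
    .use(repunt.followLinks())
    .use(repunt.stayInRange(['http://localhost/']))
    .use(repunt.once())
    .use(repunt.fileCache('./temp/.cache'))  // same cache folder as the original crawl
    .on('complete', function (task) {
        if (task.$) {
            console.log(task.url, task.$('h1').first().text());
        }
    })
    .enqueue('http://localhost/start')
    .start();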

Architecture

Tasks

Tasks are the objects keeping state about requests.

{
    url: <url passed to repunt.enqueue(url, referer)>,
    referer: <referer url passed to repunt.enqueue(url, referer)>,
    error: <error code set by setCompleted>,
    response: <response object set by setCompleted>,
    body: <body set by setCompleted>,
    $: <typically set by cheerio filter>,
    ext: <object, storage for filter specific data>,
    promise: <promise object allowing stuff like task.promise.then(...)>,
    cancel: function () {/* cancel further processing of this task */ },
    setCompleted: function () {/* mark this task as fully handled */ },
    setResult: function (error, response, body) {/* set result from http fetch or cache loading */ },
}

Tasks are created from within repunt.enqueue() and are then passed around to filters and events.
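
Since event handlers receive these task objects directly, the fields above can be combined. A small sketch (which fields are populated depends on the filters in use, and exactly when task.promise resolves is an assumption here):

.on('enqueue', function (task) {
    console.log('queued', task.url, 'found on', task.referer);
    task.promise.then(function () {  // presumably resolves once the task is fully handled
        console.log('finished', task.url);
    });
})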

Filters

The driving force in repunt is its filters. A filter is expected to implement some or all of the methods in the canonical do-nothing example below:

{
    start: function (next, ctx) { next(); },
    enqueue: function (task, next, ctx) { next(); },
    init: function (task, next, ctx) { next(); },
    request: function (task, next, ctx) { next(); },
    complete: function (task, next, ctx) { next(); }
}

The ctx parameter is the actual repunt instance, and next is a function that must be called for further processing of a task. Depending on the situation, further processing of a task can be prevented by (see the sketch after this list):

  • not calling next()
  • calling task.cancel()
  • calling task.setCompleted()
  • calling task.request.abort(); ... task.setResult(error,response,body)
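
Putting the contract together: the built-in filters appear to be factories returning such objects, so a custom filter is presumably just an object with these methods passed to .use(). The sketch below is hypothetical and not a standard repunt filter: it skips pdf urls at enqueue time by not calling next(), and stores filter-specific data in task.ext.

var skipPdfs = {
    enqueue: function (task, next, ctx) {
        if (/\.pdf$/i.test(task.url)) {
            return;                             // no next() => the task goes no further
        }
        task.ext.skipPdfs = { checked: true };  // per-filter scratch space
        next();
    }
};

repunt({connections: 4})
    .use(skipPdfs)
    // ... remaining filters and events as in the quick sample
    .enqueue('http://localhost/start')
    .start();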

The lifecycle is

  • start(...) is called once per filter instance. Useful for complex initial setup.
  • enqueue(...) is called when repunt.enqueue() is called. Some filters prevent further execution in this step (once, atMost, stayInRange), while others, like trimHashes and ignoreQueryStrings, modify task.url.
  • init(...) is called right before the actual request object is created
  • request(...) is called when task.request is set. This is a good place to modify headers and stuff.
  • complete(...) is called when the task finally has a result (error, response, body)

The ordering of filters is important. For the standard filters, the following order of url/request-queue manipulating filters gives meaningful results:

  1. trimHashes/ignoreQueryStrings
  2. stayInRange
  3. once
  4. atMost

Content processing filters should have the order (see the combined sketch after this list):

  1. cheerio
  2. followLinks (depends on cheerio)
  3. followImages (depends on cheerio)
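
Combined, the recommended ordering looks like this in a single chain (a sketch; urls and limits are placeholders):

repunt({connections: 8})
    // url/request-queue filters first
    .use(repunt.trimHashes())
    .use(repunt.ignoreQueryStrings(true))
    .use(repunt.stayInRange(['http://localhost/']))
    .use(repunt.once())
    .use(repunt.atMost(10))
    // content processing filters afterwards
    .use(repunt.cheerio())
    .use(repunt.followLinks())
    .use(repunt.followImages())
    .enqueue('http://localhost/start')
    .start();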

Testing

A mocha test suite can be found in the test folder.