node-repunt

Webcrawler for node.

Installation

npm install repunt

Features

repunt is a webcrawler characterized by

  • standard event publishing through EventEmitter
  • tight integration with request
  • optional caching, enabling incremental crawling and offline analysis
  • markup analysis through cheerio
  • an extensible filter architecture, allowing inspection, modification, prevention and enqueueing of requests
  • uses the best of the best libraries: request, cheerio, lodash, q

Quick sample

This sample crawls localhost, starting at http://localhost/start and following links until at most 10 distinct pages have been crawled. For each page that cheerio can parse (task.$ is set), the title is logged to the console.

var repunt = require('repunt');
repunt({connections: 8})    // throttle to at most 8 concurrent requests
    .use(repunt.cheerio())  // parse markup using cheerio
    .use(repunt.followLinks())  // follow <a href=...>
    .use(repunt.stayInRange(['http://localhost/'])) // don't stray away from this domain
    .use(repunt.trimHashes())   // ignore hashes: /start#about becomes /start
    .use(repunt.once()) // visit each distinct url at most once
    .use(repunt.atMost(10)) // limit to 10 requests in total
    .use(repunt.fileCache('./temp/.cache')) // fetch from/save to cache
    .on('start', function (){ // called once before any other event
        console.log('SPIDER START');
    })
    .on('enqueue', function (task) {  // called once per queued url
        //console.log('ENQUEUE',task.url);
    })
    .on('init', function (task) { // called before an actual request is issued
        // console.log('INIT',task.url);
    })
    .on('complete', function (task) { // called when result from fetch is available
        if (task.$){
            console.log(task.$('title').text());
        }
    })
    .on('error', function (error /*, task - if applicable */) { // hopefully never called
        console.log('ERROR',error);
    })
    .on('done', function (){  // called once after all other events
        console.log('SPIDER DONE');
    })
    .enqueue('http://localhost/start')
    .start();

Motivation

Why another crawler? I have participated in several projects where public websites had to be migrated to other platforms. In one particular case it was an e-commerce site without a product database, so my only option was to crawl the existing site and pull the product information (descriptions, variants, related products, images, ...) out of the web pages. This is generally time-consuming and requires a lot of development work on the analysis side. Having a nice crawler with good caching support (the speedup is significant!) made my life easier.

Filter cheatsheet

.use(repunt.trimHashes())

Removes hashes from urls. http://www.mysite.com/start#index will be enqueued as http://www.mysite.com/start.

.use(repunt.ignoreQueryStrings(true))

Removes all querystrings. http://www.mysite.com/start?a=1&b=2 will be enqueued as http://www.mysite.com/start.

.use(repunt.ignoreQueryStrings(['a','b']))

Removes named querystring parameters. http://www.mysite.com/start?a=1&b=2&c=1 will be enqueued as http://www.mysite.com/start?c=1.

.use(repunt.once())

Any url will only be enqueued once. You almost always want this filter.

.use(repunt.atMost(10))

Enqueues at most 10 urls in total. Great for testing.

.use(repunt.cheerio())

The response is parsed using cheerio and the result is stored in task.$. Useful for DOM inspection.
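
For example, a complete handler can use task.$ much like a jQuery object. This is a minimal sketch to plug into a chain like the quick sample above; the selector is just illustrative:

.on('complete', function (task) {
    if (task.$) {  // only set when cheerio could parse the response
        task.$('a[href]').each(function () {
            console.log(task.url, '->', task.$(this).attr('href'));
        });
    }
})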

.use(repunt.followLinks())

If the cheerio filter is used, and the content type is something like text/*, links from <a href> are enqueued to the repunt instance.

.use(repunt.followImages())

Similar to followLinks but enqueues <img src> instead.

.use(repunt.stayInRange(['http://site1/', 'http://site2/']))

Prevents repunt from straying away from site1 and site2, even when crawled pages link elsewhere. Always use this, unless you want to crawl the whole internet!

.use(repunt.fileCache('./temp/.cache'))

Caches results (with http status 200) in the folder ./temp/.cache. Crawl the site once, throw away your network card, and you can still repeat your last run. Great for offline analysis of sites. FYI: ./temp/.cache/.index contains some useful info about what's cached.
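
A minimal offline-analysis sketch (paths and urls are just examples): run the quick sample once to populate ./temp/.cache, then run a chain like the one below with the same cache folder, and responses are served from disk instead of the network.

var repunt = require('repunt');
repunt({connections: 1})
    .use(repunt.cheerio())
    .use(repunt.followLinks())
    .use(repunt.stayInRange(['http://localhost/']))
    .use(repunt.once())
    .use(repunt.fileCache('./temp/.cache'))  // same cache folder as the original crawl
    .on('complete', function (task) {
        if (task.$) {
            console.log(task.url, task.$('h1').first().text());
        }
    })
    .enqueue('http://localhost/start')
    .start();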

Architecture

Tasks

Tasks are the objects keeping state about requests.

{
    url: <url passed to repunt.enqueue(url, referer)>,
    referer: <referer url passed to repunt.enqueue(url, referer)>,
    error: <error code set by setCompleted>,
    response: <response object set by setCompleted>,
    body: <body set by setCompleted>,
    $: <typically set by cheerio filter>,
    ext: <object, storage for filter specific data>,
    promise: <promise object allowing stuff like task.promise.then(...)>,
    cancel: function () {/* cancel further processing of this task */ },
    setCompleted: function () {/* mark this task as fully handled */ },
    setResult: function (error, response, body) {/* set result from http fetch or cache loading */ },
}

Tasks are created from within repunt.enqueue() and are then passed around to filters and events.
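
Since event handlers receive these task objects directly, the fields above can be combined. A small sketch (which fields are populated depends on the filters in use, and exactly when task.promise resolves is an assumption here):

.on('enqueue', function (task) {
    console.log('queued', task.url, 'found on', task.referer);
    task.promise.then(function () {  // presumably resolves once the task is fully handled
        console.log('finished', task.url);
    });
})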

Filters

The driving force in repunt is its filters. A filter is expected to implement some or all of the methods in the canonical do-nothing example below:

{
    start: function (next, ctx) { next(); },
    enqueue: function (task, next, ctx) { next(); },
    init: function (task, next, ctx) { next(); },
    request: function (task, next, ctx) { next(); },
    complete: function (task, next, ctx) { next(); }
}

The ctx parameter is the actual repunt instance, and next is a function that must be called for further processing of a task. Depending on the situation, further processing of a task can be prevented by (see the sketch after this list):

  • not calling next()
  • calling task.cancel()
  • calling task.setCompleted()
  • calling task.request.abort(); ... task.setResult(error,response,body)
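
Putting the contract together: the built-in filters appear to be factories returning such objects, so a custom filter is presumably just an object with these methods passed to .use(). The sketch below is hypothetical and not a standard repunt filter: it skips pdf urls at enqueue time by not calling next(), and stores filter-specific data in task.ext.

var skipPdfs = {
    enqueue: function (task, next, ctx) {
        if (/\.pdf$/i.test(task.url)) {
            return;                             // no next() => the task goes no further
        }
        task.ext.skipPdfs = { checked: true };  // per-filter scratch space
        next();
    }
};

repunt({connections: 4})
    .use(skipPdfs)
    // ... remaining filters and events as in the quick sample
    .enqueue('http://localhost/start')
    .start();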

The lifecycle is

  • start(...) is called once per filter instance. Useful for complex initial setup.
  • enqueue(...) is called when repunt.enqueue() is called. Some filters prevent further execution in this step (once, atMost, stayInRange), while others, like trimHashes and ignoreQueryStrings, modify task.url.
  • init(...) is called right before the actual request object is created
  • request(...) is called when task.request is set. This is a good place to modify headers and stuff.
  • complete(...) is called when the task finally has a result (error, response, body)

The ordering of filters is important. For the standard filters, the following order of url/request-queue manipulating filters gives meaningful results:

  1. trimHashes/ignoreQueryStrings
  2. stayInRange
  3. once
  4. atMost

Content processing filters should have the order (see the combined sketch after this list):

  1. cheerio
  2. followLinks (depends on cheerio)
  3. followImages (depends on cheerio)
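
Combined, the recommended ordering looks like this in a single chain (a sketch; urls and limits are placeholders):

repunt({connections: 8})
    // url/request-queue filters first
    .use(repunt.trimHashes())
    .use(repunt.ignoreQueryStrings(true))
    .use(repunt.stayInRange(['http://localhost/']))
    .use(repunt.once())
    .use(repunt.atMost(10))
    // content processing filters afterwards
    .use(repunt.cheerio())
    .use(repunt.followLinks())
    .use(repunt.followImages())
    .enqueue('http://localhost/start')
    .start();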

Testing

A mocha test suite can be found in the test folder.