gumo v1.0.7

🕸️Gumo

"Gumo" (蜘蛛) is Japanese for "spider".

License: MIT

Overview 👓

A web-crawler (get it?) and scraper that extracts data from a family of nested dynamic webpages with added enhancements to assist in knowledge mining applications. Written in NodeJS.

Table of Contents 📖

  • Features 🌟
  • Installation 🏗️
  • Usage 👨‍💻
  • Configuration ⚙️
  • ElasticSearch ⚡
  • GraphDB ☋
  • TODO ☑️

Features 🌟

  • Crawl hyperlinks present on the pages of any domain and its subdomains.
  • Scrape meta-tags and body text from every page.
  • Store entire sitemap in a GraphDB (currently supports Neo4J).
  • Store page content in ElasticSearch for easy full-text lookup.

Installation 🏗️

NPM
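
The package is published on the npm registry, so installing it into an existing project is the standard one-liner:

npm install gumo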

Usage 👨‍💻

From code:

// 1: import the module
const gumo = require('gumo')

// 2: instantiate the crawler
let cron = new gumo()

// 3: call the configure method and pass the configuration options
cron.configure({
    'neo4j': { // replace with your details or remove if not required
        'url' : 'neo4j://localhost',
        'user' : 'neo4j',
        'password' : 'gumo123'
    },
    'elastic': { // replace with your details or remove if not required
        'url' : 'http://localhost:9200',
        'index' : 'myIndex'
    },
    'crawler': {
        'url': 'https://www.example.com',
    }
});

// 4: start crawling
cron.insert()

Note: The config params passed to cron.configure above are the default values. Please refer to the Configuration section below to learn more about the customization options that are available.

Configuration ⚙️

The behavior of the crawler can be customized by passing a custom configuration object to the configure() method. The following attributes can be configured:

| Attribute ( * - Mandatory ) | Type | Accepted Values | Description | Default Value | Default Behavior |
| :-- | :-- | :-- | :-- | :-- | :-- |
| * crawler.url | string | | Base URL to start scanning from | "" (empty string) | Module is disabled |
| crawler.Cookie | string | | Cookie string to be sent with each request (useful for pages that require auth) | "" (empty string) | Cookies will not be attached to the requests |
| crawler.saveOutputAsHtml | string | "Yes"/"No" | Whether or not to store scraped content as HTML files in the output/html/ directory | "No" | Saving output as HTML files is disabled |
| crawler.saveOutputAsJson | string | "Yes"/"No" | Whether or not to store scraped content as JSON files in the output/json/ directory | "No" | Saving output as JSON files is disabled |
| crawler.maxRequestsPerSecond | int | range: 1 to 5000 | The maximum number of requests to be sent to the target in one second | 5000 | |
| crawler.maxConcurrentRequests | int | range: 1 to 5000 | The maximum number of concurrent connections to be created with the host at any given time | 5000 | |
| crawler.whiteList | Array(string) | | If populated, only these URLs will be traversed | [] (empty array) | All URLs with the same hostname as the "url" attribute will be traversed |
| crawler.blackList | Array(string) | | If populated, these URLs will be ignored | [] (empty array) | |
| crawler.depth | int | range: 1 to 999 | Depth up to which nested hyperlinks will be followed | 3 | |
| * elastic.url | string | | URI of the ElasticSearch instance to connect to | "http://localhost:9200" | |
| * elastic.index | string | | The name of the ElasticSearch index to store results in | "myIndex" | |
| * neo4j.url | string | | The URI of a running Neo4J instance (uses the Bolt driver to connect) | "neo4j://localhost" | |
| * neo4j.user | string | | Neo4J server username | "neo4j" | |
| * neo4j.password | string | | Neo4J server password | "gumo123" | |
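
For illustration, a more fully specified configure() call might look like the sketch below. The option names all come from the table above; the values themselves are only examples and should be replaced with your own:

cron.configure({
    'neo4j': {
        'url': 'neo4j://localhost',
        'user': 'neo4j',
        'password': 'gumo123'
    },
    'elastic': {
        'url': 'http://localhost:9200',
        'index': 'myIndex'
    },
    'crawler': {
        'url': 'https://www.example.com',   // base URL to start scanning from
        'Cookie': '',                       // attach a cookie string for pages behind auth
        'saveOutputAsJson': 'Yes',          // also write scraped content to output/json/
        'maxRequestsPerSecond': 100,        // throttle the request rate
        'maxConcurrentRequests': 10,        // cap concurrent connections to the host
        'whiteList': [],                    // if populated, only these URLs are traversed
        'blackList': [],                    // if populated, these URLs are ignored
        'depth': 3                          // follow nested hyperlinks up to this depth
    }
});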

ElasticSearch ⚡

The content of each web page is stored along with its URL and a hash. The ElasticSearch index is selected through the elastic.index attribute of the configuration; if the index already exists it will be reused, otherwise it will be created.

id: hash,
index: config.index,
type: 'pages',
body: JSON.stringify(page content)
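
gumo handles the indexing itself; as a rough sketch of how the stored pages could then be queried for full-text lookup, the example below assumes the official @elastic/elasticsearch Node client (7.x, not bundled with gumo) and the default index name from the configuration section:

const { Client } = require('@elastic/elasticsearch')

const client = new Client({ node: 'http://localhost:9200' })

// full-text search across the indexed page documents; 'spider' is just an example term
client.search({
    index: 'myIndex',
    body: {
        query: { query_string: { query: 'spider' } }
    }
}).then(res => {
    res.body.hits.hits.forEach(hit => console.log(hit._id, hit._source))
}).catch(console.error)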

GraphDB ☋

The sitemap of all the traversed pages is stored as a convenient graph, using the following structure of nodes and relationships:

Nodes

  • Label: Page
  • Properties:

| Property Name | Type | Description |
| :-- | :-- | :-- |
| pid | String | UID generated by the crawler which can be used to uniquely identify a page across ElasticSearch and GraphDB |
| link | String | URL of the current page |
| parent | String | URL of the page from which the current page was accessed (typically only used while creating relationships) |
| title | String | Page title as it appears in the page header |

Relationships

| Name | Direction | Condition |
| :-- | :-- | :-- |
| links_to | (a)-[r1:links_to]->(b) | b.link = a.parent |
| links_from | (b)-[r2:links_from]->(a) | b.link = a.parent |
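
As an illustrative sketch (not part of gumo itself), the stored sitemap can be explored with the neo4j-driver package, using the Page label and links_to relationship described above together with the default connection details from the configuration section:

const neo4j = require('neo4j-driver')

const driver = neo4j.driver('neo4j://localhost', neo4j.auth.basic('neo4j', 'gumo123'))
const session = driver.session()

// list up to 25 edges of the crawled sitemap
session.run('MATCH (a:Page)-[:links_to]->(b:Page) RETURN a.link AS source, b.link AS target LIMIT 25')
    .then(result => {
        result.records.forEach(record => {
            console.log(record.get('source'), '->', record.get('target'))
        })
    })
    .finally(() => session.close().then(() => driver.close()))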

TODO ☑️

  • [ ] Make it executable from CLI
  • [x] Enable config parameters to be passed in when invoking gumo
  • [ ] Write more tests