🕸️ Gumo
"Gumo" (蜘蛛) is Japanese for "spider".
Overview 👓
A web-crawler (get it?) and scraper that extracts data from a family of nested dynamic webpages with added enhancements to assist in knowledge mining applications. Written in NodeJS.
Table of Contents 📖
- Features
- Requirements
- Installation
- Usage
- Development
- Configuration
- ElasticSearch
- GraphDB
- Changelog
- TODO
Features 🌟
- Crawl hyperlinks present on the pages of any domain and its subdomains.
- Scrape meta-tags and body text from every page.
- Store entire sitemap in a GraphDB (currently supports Neo4J).
- Store page content in ElasticSearch for easy full-text lookup.
Requirements 📋
- Node.js ≥ 24.0.0 (LTS). Pinned in `package.json` (`engines`) and `.nvmrc` for nvm users.
- Neo4j 4.0+ when using the graph (constraint syntax requires it).
Installation 🏗️
1. Use Node 24+ (e.g. `nvm use` if you have nvm and the repo's `.nvmrc`).
2. Install dependencies (uses `package-lock.json` for reproducible installs): `npm install`, or `npm ci` in CI.
Usage 👨‍💻
From code:
// 1: import the module
const gumo = require('gumo')
// 2: instantiate the crawler
let cron = new gumo()
// 3: call the configure method and pass the configuration options
cron.configure({
'neo4j': { // replace with your details or remove if not required
'url' : 'neo4j://localhost',
'user' : 'neo4j',
'password' : 'gumo123'
},
'elastic': { // replace with your details or remove if not required
'url' : 'http://localhost:9200',
'index' : 'myIndex'
},
'crawler': {
'url': 'https://www.example.com',
}
});
// 4: start crawling
cron.insert()

Note: The config params passed to cron.configure above are the default values. See Configuration for all options.
When Gumo is used as a dependency (e.g. require('gumo') with no config.json in your project), in-package defaults are used so the module still loads; pass your Elasticsearch, Neo4j, and crawler settings via configure() before calling insert().
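Conceptually, the crawler performs a depth-limited traversal of hyperlinks starting from the configured base URL. The sketch below illustrates that idea only — it is not Gumo's actual implementation (the real crawler fetches pages over HTTP and honors rate and concurrency limits); link extraction is stubbed out as a callback:

```javascript
// Illustrative depth-limited breadth-first traversal, mirroring how a
// "depth" setting bounds which nested hyperlinks are followed.
function crawl(startUrl, getLinks, maxDepth) {
  const visited = new Set();
  const queue = [{ url: startUrl, depth: 0 }];
  const order = [];
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    if (visited.has(url) || depth > maxDepth) continue; // skip revisits and too-deep pages
    visited.add(url);
    order.push(url);
    for (const link of getLinks(url)) {
      queue.push({ url: link, depth: depth + 1 }); // children are one level deeper
    }
  }
  return order;
}
```

Here `getLinks` stands in for fetching a page and extracting its anchors; in Gumo that work happens inside insert().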
Development 🛠️
| Script | Description |
| -------- | ------------------------------------ |
| npm run dev | Run the crawler (node index.js). |
| npm run lint | Run ESLint on the project (see eslint.config.js). |
| npm test | Run tests (placeholder until tests are added). |
CI runs on GitHub Actions (Node 24, lint + test) on push/PR to main/master.
Configuration ⚙️
The behavior of the crawler can be customized by passing a custom configuration object to the configure() method. The following are the attributes which can be configured:
| Attribute ( * - Mandatory ) | Type | Accepted Values | Description | Default Value | Default Behavior |
| :---------------------------- | :------------ | :--------------- | :----------------------------------------------------------------------------------------- | :---------------------- | :----------------------------------------------------------------------- |
| * crawler.url | string | | Base URL to start scanning from | "" (empty string) | Module is disabled |
| crawler.Cookie | string | | Cookie string to be sent with each request (useful for pages that require auth) | "" (empty string) | Cookies will not be attached to the requests |
| crawler.saveOutputAsHtml | string | "Yes"/"No" | Whether or not to store scraped content as HTML files in the output/html/ directory | "No" | Saving output as HTML files is disabled |
| crawler.saveOutputAsJson | string | "Yes"/"No" | Whether or not to store scraped content as JSON files in the output/json/ directory | "No" | Saving output as JSON files is disabled |
| crawler.maxRequestsPerSecond | int | range: 1 to 5000 | The maximum number of requests to be sent to the target in one second | 5000 | |
| crawler.maxConcurrentRequests | int | range: 1 to 5000 | The maximum number of concurrent connections to be created with the host at any given time | 5000 | |
| crawler.whiteList | Array(string) | | If populated, only these URLs will be traversed | [] (empty array) | All URLs with the same hostname as the "url" attribute will be traversed |
| crawler.blackList | Array(string) | | If populated, these URLs will be ignored | [] (empty array) | |
| crawler.depth | int | range: 1 to 999 | Depth up to which nested hyperlinks will be followed | 3 | |
| * elastic.url | string | | URI of the ElasticSearch instance to connect to | "http://localhost:9200" | |
| * elastic.index | string | | The name of the ElasticSearch index to store results in | "myIndex" | |
| * neo4j.url | string | | The URI of a running Neo4J instance (uses the Bolt driver to connect) | "neo4j://localhost" | |
| * neo4j.user | string | | Neo4J server username | "neo4j" | |
| * neo4j.password | string | | Neo4J server password | "gumo123" | |
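To illustrate the documented whiteList/blackList semantics, here is a small sketch of the filtering behavior described in the table above. This is not Gumo's internal code, and substring matching is an assumption — the actual matching rule (substring vs. prefix) may differ:

```javascript
// Illustrative URL filter per the documented defaults: blackList entries are
// always excluded; with an empty whiteList, any URL on the same hostname as
// the base URL is eligible. Substring matching here is an assumption.
function shouldVisit(url, baseUrl, whiteList = [], blackList = []) {
  if (blackList.some((entry) => url.includes(entry))) return false;
  if (whiteList.length > 0) return whiteList.some((entry) => url.includes(entry));
  return new URL(url).hostname === new URL(baseUrl).hostname;
}
```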
ElasticSearch ⚡
Page content is stored with the URL and a hash. The index is set via the elastic.index config (or config.json). If the index does not exist, it is created. Gumo uses the official @elastic/elasticsearch client; each page is indexed with id = hash and document = the page object (no separate type field).
GraphDB ☋
The sitemap of all the traversed pages is stored in a convenient graph. The following structure of nodes and relationships is followed:
Nodes
- Label: Page
- Properties:
| Property Name | Type | Description |
| :------------ | :----- | :---------------------------------------------------------------------------------------------------------- |
| pid | String | UID generated by the crawler which can be used to uniquely identify a page across ElasticSearch and GraphDB |
| link | String | URL of the current page |
| parent | String | URL of the page from which the current page was accessed (typically only used while creating relationships) |
| title | String | Page title as it appears in the page header |
Relationships
| Name | Direction | Condition |
| :--------- | :----------------------- | :---------------- |
| links_to | (a)-[r1:links_to]->(b) | b.link = a.parent |
| links_from | (b)-[r2:links_from]->(a) | b.link = a.parent |
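The schema above can be expressed in Cypher. The statement below is a hedged sketch of the kind of query involved — not Gumo's actual query — shown as the string a Neo4j Bolt driver session would execute:

```javascript
// Illustrative Cypher for the relationship schema above: for every pair of
// Page nodes where b.link = a.parent, create both directed relationships.
// This mirrors the documented schema; Gumo's real queries may differ.
const LINK_PAGES = `
  MATCH (a:Page), (b:Page)
  WHERE b.link = a.parent
  MERGE (a)-[:links_to]->(b)
  MERGE (b)-[:links_from]->(a)
`;
```

With the neo4j-driver package this would run as `session.run(LINK_PAGES)` against the configured neo4j.url.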
Changelog
See CHANGELOG.md for version history and upgrading notes (e.g. Node 24, Elasticsearch client, Neo4j driver in v2.0.0).
TODO ☑️
- [ ] Make it executable from CLI
- [x] Allow passing config parameters when invoking Gumo
- [x] CI (GitHub Actions, Node 24, lint + test)
- [ ] Write more tests

