🕸️ Gumo
"Gumo" (蜘蛛) is Japanese for "spider".
Overview 👓
A web-crawler (get it?) and scraper that extracts data from a family of nested dynamic webpages with added enhancements to assist in knowledge mining applications. Written in NodeJS.
Table of Contents 📖
- Features
- Requirements
- Installation
- Usage
- Development
- Configuration
- ElasticSearch
- GraphDB
- Changelog
- TODO
Features 🌟
- Crawl hyperlinks present on the pages of any domain and its subdomains.
- Scrape meta-tags and body text from every page.
- Store entire sitemap in a GraphDB (currently supports Neo4J).
- Store page content in ElasticSearch for easy full-text lookup.
Requirements 📋
- Node.js ≥ 24.0.0 (LTS). Pinned in `package.json` (`engines`) and `.nvmrc` for nvm users.
- Neo4j 4.0+ when using the graph (constraint syntax requires it).
Installation 🏗️
1. Use Node 24+ (e.g. `nvm use` if you have nvm and the repo's `.nvmrc`).
2. Install dependencies (uses `package-lock.json` for reproducible installs): `npm install`, or `npm ci` in CI.
Usage 👨‍💻
From code:
// 1: import the module
const gumo = require('gumo')
// 2: instantiate the crawler
let cron = new gumo()
// 3: call the configure method and pass the configuration options
cron.configure({
'neo4j': { // replace with your details or remove if not required
'url' : 'neo4j://localhost',
'user' : 'neo4j',
'password' : 'gumo123'
},
'elastic': { // replace with your details or remove if not required
'url' : 'http://localhost:9200',
'index' : 'myIndex'
},
'crawler': {
'url': 'https://www.example.com',
}
});
// 4: start crawling
cron.insert()

Note: The config params passed to cron.configure above are the default values. See Configuration for all options.
When Gumo is used as a dependency (e.g. require('gumo') with no config.json in your project), in-package defaults are used so the module still loads; pass your Elasticsearch, Neo4j, and crawler settings via configure() before calling insert().
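Conceptually, the crawler performs a depth-limited traversal of hyperlinks starting from the configured base URL. The sketch below illustrates that idea only — it is not Gumo's actual implementation (the real crawler fetches pages over HTTP and honors rate and concurrency limits); link extraction is stubbed out as a callback:

```javascript
// Illustrative depth-limited breadth-first traversal, mirroring how a
// "depth" setting bounds which nested hyperlinks are followed.
function crawl(startUrl, getLinks, maxDepth) {
  const visited = new Set();
  const queue = [{ url: startUrl, depth: 0 }];
  const order = [];
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    if (visited.has(url) || depth > maxDepth) continue; // skip revisits and too-deep pages
    visited.add(url);
    order.push(url);
    for (const link of getLinks(url)) {
      queue.push({ url: link, depth: depth + 1 }); // children are one level deeper
    }
  }
  return order;
}
```

Here `getLinks` stands in for fetching a page and extracting its anchors; in Gumo that work happens inside insert().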
Development 🛠️
| Script | Description |
| -------- | ------------------------------------ |
| npm run dev | Run the crawler (node index.js). |
| npm run lint | Run ESLint on the project (see eslint.config.js). |
| npm test | Run tests (placeholder until tests are added). |
CI runs on GitHub Actions (Node 24, lint + test) on push/PR to main/master.
Configuration ⚙️
The behavior of the crawler can be customized by passing a custom configuration object to the configure() method. The following are the attributes which can be configured:
| Attribute ( * - Mandatory ) | Type | Accepted Values | Description | Default Value | Default Behavior |
| :---------------------------- | :------------ | :--------------- | :----------------------------------------------------------------------------------------- | :---------------------- | :----------------------------------------------------------------------- |
| * crawler.url | string | | Base URL to start scanning from | "" (empty string) | Module is disabled |
| crawler.Cookie | string | | Cookie string to be sent with each request (useful for pages that require auth) | "" (empty string) | Cookies will not be attached to the requests |
| crawler.saveOutputAsHtml | string | "Yes"/"No" | Whether or not to store scraped content as HTML files in the output/html/ directory | "No" | Saving output as HTML files is disabled |
| crawler.saveOutputAsJson | string | "Yes"/"No" | Whether or not to store scraped content as JSON files in the output/json/ directory | "No" | Saving output as JSON files is disabled |
| crawler.maxRequestsPerSecond | int | range: 1 to 5000 | The maximum number of requests to be sent to the target in one second | 5000 | |
| crawler.maxConcurrentRequests | int | range: 1 to 5000 | The maximum number of concurrent connections to be created with the host at any given time | 5000 | |
| crawler.whiteList | Array(string) | | If populated, only these URLs will be traversed | [] (empty array) | All URLs with the same hostname as the "url" attribute will be traversed |
| crawler.blackList | Array(string) | | If populated, these URLs will be ignored | [] (empty array) | |
| crawler.depth | int | range: 1 to 999 | Depth up to which nested hyperlinks will be followed | 3 | |
| * elastic.url | string | | URI of the ElasticSearch instance to connect to | "http://localhost:9200" | |
| * elastic.index | string | | The name of the ElasticSearch index to store results in | "myIndex" | |
| * neo4j.url | string | | The URI of a running Neo4J instance (uses the Bolt driver to connect) | "neo4j://localhost" | |
| * neo4j.user | string | | Neo4J server username | "neo4j" | |
| * neo4j.password | string | | Neo4J server password | "gumo123" | |
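To illustrate the documented whiteList/blackList semantics, here is a small sketch of the filtering behavior described in the table above. This is not Gumo's internal code, and substring matching is an assumption — the actual matching rule (substring vs. prefix) may differ:

```javascript
// Illustrative URL filter per the documented defaults: blackList entries are
// always excluded; with an empty whiteList, any URL on the same hostname as
// the base URL is eligible. Substring matching here is an assumption.
function shouldVisit(url, baseUrl, whiteList = [], blackList = []) {
  if (blackList.some((entry) => url.includes(entry))) return false;
  if (whiteList.length > 0) return whiteList.some((entry) => url.includes(entry));
  return new URL(url).hostname === new URL(baseUrl).hostname;
}
```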
ElasticSearch ⚡
Page content is stored with the URL and a hash. The index is set via the elastic.index config (or config.json). If the index does not exist, it is created. Gumo uses the official @elastic/elasticsearch client; each page is indexed with id = hash and document = the page object (no separate type field).
GraphDB ☋
The sitemap of all the traversed pages is stored in a convenient graph. The following structure of nodes and relationships is followed:
Nodes
- Label: Page
- Properties:
| Property Name | Type | Description |
| :------------ | :----- | :---------------------------------------------------------------------------------------------------------- |
| pid | String | UID generated by the crawler which can be used to uniquely identify a page across ElasticSearch and GraphDB |
| link | String | URL of the current page |
| parent | String | URL of the page from which the current page was accessed (typically only used while creating relationships) |
| title | String | Page title as it appears in the page header |
Relationships
| Name | Direction | Condition |
| :--------- | :----------------------- | :---------------- |
| links_to | (a)-[r1:links_to]->(b) | b.link = a.parent |
| links_from | (b)-[r2:links_from]->(a) | b.link = a.parent |
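The schema above can be expressed in Cypher. The statement below is a hedged sketch of the kind of query involved — not Gumo's actual query — shown as the string a Neo4j Bolt driver session would execute:

```javascript
// Illustrative Cypher for the relationship schema above: for every pair of
// Page nodes where b.link = a.parent, create both directed relationships.
// This mirrors the documented schema; Gumo's real queries may differ.
const LINK_PAGES = `
  MATCH (a:Page), (b:Page)
  WHERE b.link = a.parent
  MERGE (a)-[:links_to]->(b)
  MERGE (b)-[:links_from]->(a)
`;
```

With the neo4j-driver package this would run as `session.run(LINK_PAGES)` against the configured neo4j.url.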
Changelog
See CHANGELOG.md for version history and upgrading notes (e.g. Node 24, Elasticsearch client, Neo4j driver in v2.0.0).
TODO ☑️
- [ ] Make it executable from CLI
- [x] Allow passing config parameters when invoking Gumo
- [x] CI (GitHub Actions, Node 24, lint + test)
- [ ] Write more tests

