microcrawler

v0.1.30

Published

4 years ago

Micro implementation of crawler

Vulnerabilities: 0 High, 0 Medium, 0 Low

korczis

microcrawler

Status

Screenshots

Available Official Crawlers

List of official publicly available crawlers.

Missing something? Feel free to open an issue.

Prerequisites

Installation

From npmjs.org (the easy way)

This is the easiest way. The prerequisites still need to be satisfied.

npm install -g microcrawler

From Sources

This is useful if you want to tweak the source code, implement a new crawler, etc.

# Clone repository
git clone https://github.com/ApolloCrawler/microcrawler.git

# Enter folder
cd microcrawler

# Install required packages - dependencies
npm install

# Install from local sources
npm install -g .

Usage

Show available commands

$ microcrawler

  Usage: microcrawler [options] [command]


  Commands:

    collector [args]  Run data collector
    config [args]     Run config
    exporter [args]   Run data exporter
    worker [args]     Run crawler worker
    crawl [args]      Crawl specified site
    help [cmd]        display help for [cmd]

  Options:

    -h, --help     output usage information
    -V, --version  output the version number

Check microcrawler version

$ microcrawler --version
0.1.27

Initialize config file

$ microcrawler config init
2016-09-03T10:45:13.105Z - info: Creating config file "/Users/tomaskorcak/.microcrawler/config.json"
{
    "client": "superagent",
    "timeout": 10000,
    "throttler": {
        "enabled": false,
        "active": true,
        "rate": 20,
        "ratePer": 1000,
        "concurrent": 8
    },
    "retry": {
        "count": 2
    },
    "headers": {
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "From": "googlebot(at)googlebot.com"
    },
    "proxy": {
        "enabled": false,
        "list": [
            "https://168.63.20.19:8145"
        ]
    },
    "natFaker": {
        "enabled": true,
        "base": "192.168.1.1",
        "bits": 16
    },
    "amqp": {
        "uri": "amqp://localhost",
        "queues": {
            "collector": "collector",
            "worker": "worker"
        },
        "options": {
            "heartbeat": 60
        }
    },
    "couchbase": {
        "uri": "couchbase://localhost:8091",
        "bucket": "microcrawler",
        "username": "Administrator",
        "password": "Administrator",
        "connectionTimeout": 60000000,
        "durabilityTimeout": 60000000,
        "managementTimeout": 60000000,
        "nodeConnectionTimeout": 10000000,
        "operationTimeout": 10000000,
        "viewTimeout": 10000000
    },
    "elasticsearch": {
        "uri": "localhost:9200",
        "index": "microcrawler",
        "log": "debug"
    }
}
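To read the generated file from your own Node.js scripts, something along these lines should work. This is a minimal sketch, not microcrawler's own loader: the path comes from the log line above, while `parseConfig`, `readConfig` and the list of expected sections are hypothetical names introduced here for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Location taken from the "Creating config file" log line above; adjust if yours differs.
const DEFAULT_CONFIG_PATH = path.join(os.homedir(), '.microcrawler', 'config.json');

// Top-level sections that appear in the generated config shown above.
const EXPECTED_SECTIONS = [
  'throttler', 'retry', 'headers', 'proxy',
  'natFaker', 'amqp', 'couchbase', 'elasticsearch',
];

// Hypothetical helper: parse config text and list any top-level sections that are absent.
function parseConfig(text) {
  const config = JSON.parse(text);
  const missing = EXPECTED_SECTIONS.filter((key) => !(key in config));
  return { config, missing };
}

// Hypothetical helper: read the config from disk, or return null if it was never initialized.
function readConfig(file = DEFAULT_CONFIG_PATH) {
  if (!fs.existsSync(file)) return null;
  return parseConfig(fs.readFileSync(file, 'utf8'));
}

// Demo on an inline fragment, so the sketch runs without touching the real file.
const parsed = parseConfig('{"timeout": 10000, "retry": {"count": 2}}');
console.log('retry count:', parsed.config.retry.count);
console.log('missing sections:', parsed.missing.join(', '));
```

Calling `readConfig()` without arguments reads the default location and returns `null` when `microcrawler config init` has not been run yet.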

Edit config file

$ vim ~/.microcrawler/config.json
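If you would rather script a change than edit by hand, a small helper can update one nested key and print the result. This is a sketch under assumptions: `setPath` is a hypothetical helper, not part of microcrawler, and the 4-space indentation simply mirrors the generated file.

```javascript
// Hypothetical helper: return a copy of `config` with the value at a dotted path replaced.
// The input object is left untouched.
function setPath(config, dottedPath, value) {
  const copy = JSON.parse(JSON.stringify(config)); // deep copy, fine for plain JSON data
  const keys = dottedPath.split('.');
  let node = copy;
  for (const key of keys.slice(0, -1)) {
    // Create intermediate objects when the path does not exist yet.
    if (typeof node[key] !== 'object' || node[key] === null) node[key] = {};
    node = node[key];
  }
  node[keys[keys.length - 1]] = value;
  return copy;
}

// Demo: switch the throttler on in a fragment of the config.
const original = { throttler: { enabled: false, rate: 20 } };
const updated = setPath(original, 'throttler.enabled', true);
console.log(JSON.stringify(updated, null, 4));
```

Writing the result back is then a matter of passing the same `JSON.stringify(updated, null, 4)` string to `fs.writeFileSync` with the path of your config file.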

Show config file

$ microcrawler config show
{
    "client": "superagent",
    "timeout": 10000,
    "throttler": {
        "enabled": false,
        "active": true,
        "rate": 20,
        "ratePer": 1000,
        "concurrent": 8
    },
    "retry": {
        "count": 2
    },
    "headers": {
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "From": "googlebot(at)googlebot.com"
    },
    "proxy": {
        "enabled": false,
        "list": [
            "https://168.63.20.19:8145"
        ]
    },
    "natFaker": {
        "enabled": true,
        "base": "192.168.1.1",
        "bits": 16
    },
    "amqp": {
        "uri": "amqp://example.com",
        "queues": {
            "collector": "collector",
            "worker": "worker"
        },
        "options": {
            "heartbeat": 60
        }
    },
    "couchbase": {
        "uri": "couchbase://example.com:8091",
        "bucket": "microcrawler",
        "username": "Administrator",
        "password": "Administrator",
        "connectionTimeout": 60000000,
        "durabilityTimeout": 60000000,
        "managementTimeout": 60000000,
        "nodeConnectionTimeout": 10000000,
        "operationTimeout": 10000000,
        "viewTimeout": 10000000
    },
    "elasticsearch": {
        "uri": "example.com:9200",
        "index": "microcrawler",
        "log": "debug"
    }
}
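The `throttler` values are not documented here, so the following is only one plausible reading: `rate` requests per `ratePer` milliseconds. Under that assumption (which may not match what microcrawler actually does), the sketch below works out the spacing between request starts. `minSpacingMs` and `startTimes` are hypothetical names, and `concurrent` is not modeled.

```javascript
// ASSUMPTION: `rate` requests are allowed per `ratePer` milliseconds.
// Spreading them evenly gives the minimum gap between two request starts.
function minSpacingMs(throttler) {
  return throttler.ratePer / throttler.rate;
}

// Earliest start time (in ms) for each of `n` queued requests under even spacing.
function startTimes(throttler, n) {
  const spacing = minSpacingMs(throttler);
  return Array.from({ length: n }, (_, i) => i * spacing);
}

// Values copied from the "throttler" section of the config above.
const throttlerConfig = { enabled: false, active: true, rate: 20, ratePer: 1000, concurrent: 8 };
console.log(minSpacingMs(throttlerConfig)); // 50
console.log(startTimes(throttlerConfig, 4)); // [ 0, 50, 100, 150 ]
```

With the defaults this reading gives 20 requests per second, one every 50 ms.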

Start Couchbase

TBD

Start Elasticsearch

TBD

Start Kibana

TBD

Query Elasticsearch

TBD
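As a starting point, the index named in the `elasticsearch` section of the config can be queried through Elasticsearch's standard `_search` REST endpoint. The sketch below assumes plain HTTP on the configured `uri`, Node.js 18+ for the global `fetch`, and a `match_all` query, since the document fields microcrawler stores are not described here. `buildSearchRequest` and `search` are hypothetical helpers.

```javascript
// Hypothetical helper: build the URL and fetch options for a _search request.
// The config stores the host without a scheme ("localhost:9200"), so http:// is assumed.
function buildSearchRequest(esConfig, query = { match_all: {} }, size = 10) {
  const base = /^https?:\/\//.test(esConfig.uri) ? esConfig.uri : `http://${esConfig.uri}`;
  return {
    url: `${base}/${encodeURIComponent(esConfig.index)}/_search`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query, size }),
    },
  };
}

// Hypothetical helper: send the request and return the array of matching documents.
async function search(esConfig, query) {
  const { url, options } = buildSearchRequest(esConfig, query);
  const response = await fetch(url, options);
  if (!response.ok) throw new Error(`Elasticsearch returned HTTP ${response.status}`);
  const body = await response.json();
  return body.hits.hits;
}

// Demo: only builds the request, so it runs without a live Elasticsearch node.
const searchRequest = buildSearchRequest({ uri: 'localhost:9200', index: 'microcrawler' });
console.log(searchRequest.url); // http://localhost:9200/microcrawler/_search
```

With a running node, `search({ uri: 'localhost:9200', index: 'microcrawler' }).then(console.log)` would print the first ten hits.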

Example usage

Craigslist

microcrawler crawl craiglist.index http://sfbay.craigslist.org/sfc/sss/

Firmy.cz

microcrawler crawl firmy.cz.index "https://www.firmy.cz?_escaped_fragment_="

Google

microcrawler crawl google.index "http://google.com/search?q=Buena+Vista"

Hacker News

microcrawler crawl hackernews.index https://news.ycombinator.com/

xkcd

microcrawler crawl xkcd.index http://xkcd.com

Yelp

microcrawler crawl yelp.index "http://www.yelp.com/search?find_desc=restaurants&find_loc=Los+Angeles%2C+CA&ns=1&ls=f4de31e623458437"

Youjizz

microcrawler crawl youjizz.com.index http://youjizz.com
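The commands above all follow the pattern `microcrawler crawl <crawler> <url>`. When launching them from a Node.js script, passing the URL as a separate argument avoids shell quoting problems with characters such as `?` and `&`. This is a minimal sketch assuming `microcrawler` is on your `PATH` on a Unix-like system. `crawlArgs` and `runCrawl` are hypothetical helpers, not part of the package.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: argument list for `microcrawler crawl <crawler> <url>`.
function crawlArgs(crawler, url) {
  return ['crawl', crawler, url];
}

// Hypothetical helper: run one crawl and resolve with whatever it writes to stdout.
// execFile does not spawn a shell by default, so the URL is passed through verbatim.
function runCrawl(crawler, url) {
  return new Promise((resolve, reject) => {
    execFile('microcrawler', crawlArgs(crawler, url), (error, stdout) => {
      if (error) reject(error);
      else resolve(stdout);
    });
  });
}

// Demo: only prints the argument list, so it runs without microcrawler installed.
console.log(crawlArgs('xkcd.index', 'http://xkcd.com'));
```

For example, `runCrawl('hackernews.index', 'https://news.ycombinator.com/')` starts the same crawl as the Hacker News command above.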

Credits

@pavelbinar for QA and much more.
