nescavater

v0.0.1

Published

4 years ago

JSON-driven site scraper (Scavater). It is useful when extracting information from a site requires no complex logic, only the XPath patterns of the elements that contain the information.

Maintainer: gianfebrian

Keywords: crawler, scraper

Install Package

Installing the package also downloads the Chromium binary (about 120 MB) that Puppeteer needs for the headless engine.

npm install nescavater --save

Usage Example

If you use Mongoose as the store engine, set up the connection first.

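const mongoose = require('mongoose');
// Assumption: the package exports Crawler by name; adjust to the actual export.
const { Crawler } = require('nescavater');

// The wrapper must be async because the body uses await.
(async () => {
  // Connection options for Mongoose 5; Mongoose 6 and later no longer support them.
  const connectionOption = {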
    useNewUrlParser: true,
    useUnifiedTopology: true,
    useCreateIndex: true,
    useFindAndModify: false,
  };

  await mongoose.connect(process.env.MONGODB_URI, connectionOption);

  const crawler = new Crawler({
    options: { connection: mongoose.connection },
  });

  // Preferably, keep this configuration in a JSON file
  const htmlConfig = {
    name: 'example',
    sites: [
      'https://example.com',
    ],
    engine: {
      type: 'html',
      options: {},
    },
    attributes: {
      name: {
        target: 'string',
        output: 'single',
        type: 'xpath',
        selectors: [
          {
            type: 'text',
            selector: '//x:h1[@class="page-title"]',
          },
        ],
      },
    },
  };

  const jsonConfig = JSON.stringify(htmlConfig);

  // You only need to set the config once, because it is stored in MongoDB
  await crawler.setConfig(jsonConfig);

  const url = 'https://example.com';
  const config = await crawler.getConfigByUrl(url);
  const result = await crawler.fetch(url, config);
  console.log(result);
})();

Sample output:

{
  "name": "some extracted value"
}
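
The comment in the example suggests keeping the configuration in a JSON file, and `setConfig` is shown receiving a JSON string. A minimal sketch of reading such a file follows. `readConfigJson` is a hypothetical helper, not part of nescavater, and the check for `name` and `sites` is an assumption about what makes a usable config.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of nescavater): read a config file and
// return its contents as a compact JSON string, failing early on bad input.
function readConfigJson(filePath) {
  const raw = fs.readFileSync(filePath, 'utf8');
  const parsed = JSON.parse(raw); // throws SyntaxError on malformed JSON

  // Assumed minimum: a config needs a name and a list of sites.
  const looksValid =
    parsed !== null &&
    typeof parsed === 'object' &&
    parsed.name !== undefined &&
    Array.isArray(parsed.sites);
  if (!looksValid) {
    throw new Error(`${filePath}: expected "name" and a "sites" array`);
  }

  // Re-serialize so the string handed to the crawler is normalized.
  return JSON.stringify(parsed);
}
```

With the crawler from the example above, the result could be passed as `await crawler.setConfig(readConfigJson('./example.json'))`; the file name is illustrative.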

Configuration

name: (any) -- Unique identifier of the configuration
sites: (array of string) -- Site patterns that use the extraction patterns. A group of site patterns should exist only once, e.g., ["https://example.com", "https://m.example.com"].
engine: (shape)
- type: (one of)
  - html: -- Lightweight engine for plain HTML sites only. For JavaScript-rendered sites, use headless instead.
  - headless: -- Heavyweight engine that uses Puppeteer to render the site in headless mode. It works for both plain HTML and JavaScript-rendered sites.
- options: (any of)
  - waitForXPath: -- Tells the engine to wait until a given XPath is visible before extracting (headless engine only)
attributes: (shape)
- [target attribute key]: (shape) -- The key under which the extracted value is returned (for example, "name" in the sample output).
  - target: (one of)
    - number -- Convert the value found by the engine to a number
    - string -- Convert the value found by the engine to a string
    - boolean -- Convert the value found by the engine to a boolean
  - output: (one of)
    - single: -- A single (non-array) value whose type is determined by target
    - multiple: -- An array of values whose type is determined by target
  - type: (one of)
    - xpath: -- Use an XPath selector
  - *selectors: (array of shape)
    - type: (one of)
      - text -- Get text value from the selected element
      - html -- Get HTML from the selected element
      - attr -- Get attribute value from the selected element
    - selector: (string) -- XPath selector of the target element
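
The field list above can be turned into a quick pre-flight check before a config is stored. The following is a sketch only: `validateConfig` is a hypothetical helper, not part of nescavater, and it assumes that `selectors` must be a non-empty array and that `waitForXPath` should be rejected for the html engine.

```javascript
// Allowed values taken from the field list above.
const ENGINE_TYPES = ['html', 'headless'];
const TARGETS = ['number', 'string', 'boolean'];
const OUTPUTS = ['single', 'multiple'];
const SELECTOR_TYPES = ['text', 'html', 'attr'];

// Hypothetical helper (not part of nescavater): returns a list of problems,
// which is empty when the config matches the documented shape.
function validateConfig(config) {
  const errors = [];
  if (config.name === undefined) errors.push('name is missing');
  if (!Array.isArray(config.sites) || !config.sites.every((s) => typeof s === 'string')) {
    errors.push('sites must be an array of strings');
  }

  const engine = config.engine || {};
  if (!ENGINE_TYPES.includes(engine.type)) {
    errors.push('engine.type must be one of: ' + ENGINE_TYPES.join(', '));
  }
  const options = engine.options || {};
  if (options.waitForXPath !== undefined && engine.type !== 'headless') {
    errors.push('engine.options.waitForXPath is only available for headless');
  }

  for (const [key, attr] of Object.entries(config.attributes || {})) {
    const path = `attributes.${key}`;
    if (!TARGETS.includes(attr.target)) errors.push(`${path}.target is invalid`);
    if (!OUTPUTS.includes(attr.output)) errors.push(`${path}.output is invalid`);
    if (attr.type !== 'xpath') errors.push(`${path}.type must be xpath`);

    // Assumption: at least one selector is required per attribute.
    const selectors = Array.isArray(attr.selectors) ? attr.selectors : [];
    if (selectors.length === 0) errors.push(`${path}.selectors must be a non-empty array`);
    selectors.forEach((sel, i) => {
      if (!SELECTOR_TYPES.includes(sel.type)) errors.push(`${path}.selectors[${i}].type is invalid`);
      if (typeof sel.selector !== 'string') errors.push(`${path}.selectors[${i}].selector must be a string`);
    });
  }
  return errors;
}
```

Returning a list instead of throwing on the first problem lets all mistakes in a hand-written config be reported at once.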

Sample JSON config with HTML engine:

{
  "name": "example",
  "sites": [
    "https://example.com"
  ],
  "engine": {
    "type": "html",
    "options": {}
  },
  "attributes": {
    "name": {
      "target": "string",
      "output": "single",
      "type": "xpath",
      "selectors": [
        {
          "type": "text",
          "selector": "//x:h1[@class=\"page-title\"]"
        }
      ]
    }
  }
}
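
The sample above sets `"target": "string"` and `"output": "single"`. To illustrate how those two fields could combine, here is a sketch that shapes a list of raw matches into the final value. The conversion rules (trimming, `Number(...)`, treating only the text `true` as boolean true, and returning `null` when nothing matched) are assumptions for illustration; nescavater's actual behavior may differ.

```javascript
// Hypothetical conversion rules; nescavater's real behavior may differ.
function convertValue(raw, target) {
  const text = String(raw).trim();
  if (target === 'number') return Number(text);
  if (target === 'boolean') return text.toLowerCase() === 'true';
  return text; // 'string' and anything unrecognized stay as text
}

// Apply the target conversion to every match, then pick the output form:
// 'multiple' keeps the whole array, 'single' keeps only the first match.
function shapeOutput(rawValues, target, output) {
  const converted = rawValues.map((value) => convertValue(value, target));
  if (output === 'multiple') return converted;
  return converted.length > 0 ? converted[0] : null;
}
```

Under these assumptions, `shapeOutput([' 42 ', '7'], 'number', 'single')` yields the number 42, while the same call with `'multiple'` yields an array of two numbers.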

Sample JSON config with Headless engine:

{
  "name": "example",
  "sites": [
    "https://example.com"
  ],
  "engine": {
    "type": "headless",
    "options": {
      "waitForXPath": "//div[@class=\"fotorama__stage\"]"
    }
  },
  "attributes": {
    "name": {
      "target": "string",
      "output": "single",
      "type": "xpath",
      "selectors": [
        {
          "type": "text",
          "selector": "//h1[@class=\"page-title\"]"
        }
      ]
    }
  }
}
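
Both samples above list `https://example.com` under `sites`, so they are alternatives for the same site, not two configs to store side by side. Assuming the rule that a group of site patterns should exist only once means a site pattern belongs to a single configuration, a check like the following could catch overlaps before calling `setConfig`. `findSharedSites` is a hypothetical helper, not part of nescavater.

```javascript
// Hypothetical helper (not part of nescavater): map each site pattern that
// appears in more than one configuration to the names of those configurations.
function findSharedSites(configs) {
  const owners = new Map();
  for (const config of configs) {
    // A Set ignores a pattern repeated inside the same configuration.
    for (const site of new Set(config.sites || [])) {
      if (!owners.has(site)) owners.set(site, []);
      owners.get(site).push(config.name);
    }
  }

  const shared = {};
  for (const [site, names] of owners) {
    if (names.length > 1) shared[site] = names;
  }
  return shared;
}
```

An empty result object means no site pattern is claimed by two configurations.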
