@cherastain/scraper
v2.0.0
A simple scraper
A simple scraper using Puppeteer to evaluate content and perform further page processing.
Basic usage
```typescript
import { Scraper } from "@cherastain/scraper";

const scraper = new Scraper();
const url = "https://www.npmjs.com/";

(async () => {
  const result = await scraper.process(url);
  console.log(result);
})();
// expected result (default template): { hrefs: [...] }
```
Examples
Template option
Each template property can be defined so that the result contains a property with the same name. For a result as follows:
```
{
  title: "Title of page",
  subTitles: [...],
  links: [...]
}
```
the template should be defined as:
```typescript
const template: IScraperTemplate = {
  title: { selector: "h1", format: ["unique"] },
  subTitles: "h2",
  links: { selector: "a", format: { attr: "href" } },
};
```
Remark: the default 'format' value is the DOM element's innerText.
```typescript
const scraper = new Scraper();
const url = "https://www.npmjs.com/";
const template: IScraperTemplate = {
  title: { selector: "h1", format: ["unique"] },
  subTitles: "h2",
  links: { selector: "a", format: { attr: "href" } },
};

(async () => {
  const result = await scraper.process(url, { template });
  console.log(result);
})();
```
Preprocess
The following example uses the preProcess option to:
- scroll to the bottom of the page
- change every href to "foo"
and uses a template to get a unique link href formatted as -${x}-.
```typescript
import { Scraper, IScraperTemplate, IScraperOptions } from "@cherastain/scraper";
import { Page } from "puppeteer";

const s = new Scraper();
const url = "https://www.npmjs.com/";
const template: IScraperTemplate = {
  firstLinkHref: {
    selector: "a",
    format: [{ attr: "href" }, "unique", (x) => `-${x}-`],
  },
};
const options: IScraperOptions = {
  preProcess: [
    "scrollBottom",
    async (page: Page) => {
      await page.evaluate(() => {
        //@ts-ignore
        const links = [...document.getElementsByTagName("a")];
        links.forEach((link) => {
          link.href = "foo";
        });
      });
    },
  ],
  template,
};
const result = await s.process(url, options);
// expected result: { firstLinkHref: "-foo-" }
```
Html head tags and attributes template
```typescript
const template: IScraperTemplate = {
  metas: {
    selector: "//head/meta", // meta tag
    format: [{ attr: "content" }, { attr: "property" }], // content & property attributes
  },
  title: {
    selector: "//head/title", // title tag
    format: ["unique"],
  },
};
```
Documentation
Scraper class
process method
Executes the scraping process
```typescript
scraperInstance.process(url, options);
```
| Parameter | Type            | Description   | Default                                                              |
| --------- | --------------- | ------------- | -------------------------------------------------------------------- |
| url       | string          | url to scrape |                                                                      |
| options   | IScraperOptions | (optional)    | { template: { hrefs: { selector: "a", format: { attr: "href" } } } } |
getCookies method
When the isCookiesPersisted scraper option is set to true, returns the persisted cookies once process has been called.
```typescript
scraperInstance.getCookies();
```
isVerboseEnabled static property
Set to true to globally enable verbose output.
```typescript
Scraper.isVerboseEnabled = true;
```
Contracts
IScraperOptions interface
| Property           | Type                                            | Description                                                           |
| ------------------ | ----------------------------------------------- | --------------------------------------------------------------------- |
| cookies            | Cookie[]                                        | (optional) cookies to use during the scraping process                 |
| isConsoleEnabled   | boolean                                         | (optional) enable console output from page evaluation                 |
| isCookiesPersisted | boolean                                         | (optional) enable cookies to persist from one process call to another |
| isRobotIgnored     | boolean                                         | (optional) ignore robots.txt on the scraped domain                    |
| isVerboseEnabled   | boolean                                         | (optional) enable verbose debugging messages                          |
| preProcess         | (((page: Page) => Promise) or "scrollBottom")[] | (optional) functions called before scraping occurs                    |
| template           | IScraperTemplate                                | (optional) template to use for the scrape result                      |
| userAgent          | string                                          | (optional) set the user-agent as seen by the scraped site             |
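For orientation, the options above can be combined in a single object. The values below are purely illustrative (the user-agent string and option choices are not recommendations from the library):

```typescript
import { Scraper, IScraperOptions } from "@cherastain/scraper";

// Illustrative option set; adjust every value for your own target site.
const options: IScraperOptions = {
  isConsoleEnabled: true,    // forward page console output
  isCookiesPersisted: true,  // keep cookies across process() calls
  isRobotIgnored: false,     // keep respecting robots.txt
  isVerboseEnabled: true,    // verbose debugging for this instance
  userAgent: "Mozilla/5.0 (compatible; ExampleBot/1.0)",
  preProcess: ["scrollBottom"],
  template: { hrefs: { selector: "a", format: { attr: "href" } } },
};
```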
IScraperTemplate interface
A template property can be:
- a string that is
  - a html tag name
  - a css class (prefixed with .)
  - or a xpath
- a IScraperSelectorIdentifier
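As a rough mental model, a string selector can be disambiguated by its prefix. The helper below is hypothetical (it is not part of the library's API) and only illustrates the rules listed above:

```typescript
// Hypothetical sketch, not the library's implementation:
// classify a string selector by its leading characters.
type SelectorKind = "xpath" | "cssClass" | "tag";

function classifySelector(selector: string): SelectorKind {
  if (selector.startsWith("/")) return "xpath";    // e.g. "/html/body/a" or "//head/meta"
  if (selector.startsWith(".")) return "cssClass"; // e.g. ".link"
  return "tag";                                    // e.g. "a" or "h2"
}

console.log(classifySelector("//head/meta")); // "xpath"
console.log(classifySelector(".link"));       // "cssClass"
console.log(classifySelector("a"));           // "tag"
```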
Example 1
```typescript
{
  links: "a"; // string as html tag
}
```
Example 2
```typescript
{
  links: ".link"; // string as css class
}
```
Example 3
```typescript
{
  links: "/html/body/a"; // string as xpath
}
```
Example 4
```typescript
{
  links: {
    selector: "a";
  } // IScraperSelectorIdentifier equivalent to Example 1
}
```
IScraperSelectorIdentifier interface
| Property | Type                                       | Description                                                      |
| -------- | ------------------------------------------ | ---------------------------------------------------------------- |
| selector | string                                     | can be a html tag name, a css class (prefixed with .) or a xpath |
| format   | IScraperTemplate or ScraperValueFormater[] | (optional)                                                       |
ScraperValueFormater type
By default, result values return the DOM element's innerText, but they can be formatted using:
| Format                   | Description                                                                                                                 |
| ------------------------ | --------------------------------------------------------------------------------------------------------------------------- |
| { attr: string }         | value will be the given attribute of the DOM container                                                                      |
| "html"                   | value will be the innerHTML of the DOM container                                                                            |
| "unique"                 | value will be unique (instead of an array)                                                                                  |
| ((value: any) => string) | final value will be formatted during post-processing based on the given function and the value set for the element by other formatters |
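To make the formatter chain concrete, here is a self-contained sketch that approximates how a format array could be applied to matched elements. It is not the library's implementation; applyFormat, MockElement, and the mock data are all invented for illustration, and only the chain semantics follow the table above:

```typescript
// Sketch only: approximates applying ScraperValueFormater entries in order.
type Formater = { attr: string } | "html" | "unique" | ((value: any) => string);

interface MockElement {
  innerText: string;
  innerHTML: string;
  attrs: Record<string, string>;
}

function applyFormat(elements: MockElement[], format: Formater[]): any {
  // Default value is innerText; each formatter refines the values.
  let values: any[] = elements.map((el) => el.innerText);
  let unique = false;
  for (const f of format) {
    if (typeof f === "function") {
      values = values.map(f);          // post-process with the given function
    } else if (f === "html") {
      values = elements.map((el) => el.innerHTML);
    } else if (f === "unique") {
      unique = true;                   // collapse to a single value at the end
    } else {
      values = elements.map((el) => el.attrs[f.attr]);
    }
  }
  return unique ? values[0] : values;
}

const anchors: MockElement[] = [
  { innerText: "npm", innerHTML: "<b>npm</b>", attrs: { href: "foo" } },
];
console.log(applyFormat(anchors, [{ attr: "href" }, "unique", (x) => `-${x}-`]));
// logs "-foo-", mirroring the preProcess example above
```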
Versions changelog
2.0.0
- scraping options removed from ctor
- template/selector disambiguation
- process options integrate template
- getCookies added
- verbose can be enabled globally by setting Scraper.isVerboseEnabled to true
1.2.1
- new isCookiesPersisted and cookies options added
1.1.3
- unique formatter no longer throws an error when no element is found
- dependencies update
1.1.2
- formatter can manage several attribute formats ({ attr: string })
1.1.1
- fix robots.txt check that invalidated paths starting with a disallow rule
1.1.0
- robots.txt check (enabled by default)
