@cherastain/scraper
v2.0.0
A simple scraper
A simple scraper using Puppeteer to evaluate content and perform further page processing.
Basic usage
```typescript
import { Scraper } from "@cherastain/scraper";

const scraper = new Scraper();
const url = "https://www.npmjs.com/";

(async () => {
  const result = await scraper.process(url);
  console.log(result);
})();
// expected result (default template): { hrefs: [...] }
```
Examples
Template option
Each template property can be defined so that the result contains a property with the same name. For a result as follows:
```
{
  title: "Title of page",
  subTitles: [...],
  links: [...]
}
```
the template should be defined as:
```typescript
const template: IScraperTemplate = {
  title: { selector: "h1", format: ["unique"] },
  subTitles: "h2",
  links: { selector: "a", format: { attr: "href" } },
};
```
Remark: the default 'format' value is the DOM element's innerText.
```typescript
const scraper = new Scraper();
const url = "https://www.npmjs.com/";
const template: IScraperTemplate = {
  title: { selector: "h1", format: ["unique"] },
  subTitles: "h2",
  links: { selector: "a", format: { attr: "href" } },
};

(async () => {
  const result = await scraper.process(url, { template });
  console.log(result);
})();
```
Preprocess
The following example uses the preProcess option to:
- scroll to the bottom of the page
- change every href to "foo"
and uses a template to get a unique link href formatted as -${x}-.
```typescript
import { Scraper, IScraperTemplate, IScraperOptions } from "@cherastain/scraper";
import { Page } from "puppeteer";

const s = new Scraper();
const url = "https://www.npmjs.com/";
const template: IScraperTemplate = {
  firstLinkHref: {
    selector: "a",
    format: [{ attr: "href" }, "unique", (x) => `-${x}-`],
  },
};
const options: IScraperOptions = {
  preProcess: [
    "scrollBottom",
    async (page: Page) => {
      await page.evaluate(() => {
        //@ts-ignore
        const links = [...document.getElementsByTagName("a")];
        links.forEach((link) => {
          link.href = "foo";
        });
      });
    },
  ],
  template,
};
const result = await s.process(url, options);
// expected result: { firstLinkHref: "-foo-" }
```
Html head tags and attributes template
```typescript
const template: IScraperTemplate = {
  metas: {
    selector: "//head/meta", // meta tag
    format: [{ attr: "content" }, { attr: "property" }], // content & property attributes
  },
  title: {
    selector: "//head/title", // title tag
    format: ["unique"],
  },
};
```
Documentation
Scraper class
process method
Executes the scraping process
```typescript
scraperInstance.process(url, options);
```
| Parameter | Type            | Description   | Default                                                              |
| --------- | --------------- | ------------- | -------------------------------------------------------------------- |
| url       | string          | url to scrape |                                                                      |
| options   | IScraperOptions | (optional)    | { template: { hrefs: { selector: "a", format: { attr: "href" } } } } |
getCookies method
When the isCookiesPersisted scraper option is set to true, returns the persisted cookies once process has been called.
```typescript
scraperInstance.getCookies();
```
isVerboseEnabled static property
Set to true to globally enable verbose output.
```typescript
Scraper.isVerboseEnabled = true;
```
Contracts
IScraperOptions interface
| Property           | Type                                            | Description                                                           |
| ------------------ | ----------------------------------------------- | --------------------------------------------------------------------- |
| cookies            | Cookie[]                                        | (optional) cookies to use during the scraping process                 |
| isConsoleEnabled   | boolean                                         | (optional) enable console output from page evaluation                 |
| isCookiesPersisted | boolean                                         | (optional) enable cookies to persist from one process call to another |
| isRobotIgnored     | boolean                                         | (optional) ignore robots.txt on the scraped domain                    |
| isVerboseEnabled   | boolean                                         | (optional) enable verbose debugging messages                          |
| preProcess         | (((page: Page) => Promise) or "scrollBottom")[] | (optional) functions called before scraping occurs                    |
| template           | IScraperTemplate                                | (optional) template to use for the scrape result                      |
| userAgent          | string                                          | (optional) set the user-agent as seen by the scraped site             |
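For orientation, the options above can be combined in a single object. The values below are purely illustrative (the user-agent string and option choices are not recommendations from the library):

```typescript
import { Scraper, IScraperOptions } from "@cherastain/scraper";

// Illustrative option set; adjust every value for your own target site.
const options: IScraperOptions = {
  isConsoleEnabled: true,    // forward page console output
  isCookiesPersisted: true,  // keep cookies across process() calls
  isRobotIgnored: false,     // keep respecting robots.txt
  isVerboseEnabled: true,    // verbose debugging for this instance
  userAgent: "Mozilla/5.0 (compatible; ExampleBot/1.0)",
  preProcess: ["scrollBottom"],
  template: { hrefs: { selector: "a", format: { attr: "href" } } },
};
```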
IScraperTemplate interface
A template property can be:
- a string that is
  - a html tag name
  - a css class (prefixed with .)
  - or a xpath
- a IScraperSelectorIdentifier
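As a rough mental model, a string selector can be disambiguated by its prefix. The helper below is hypothetical (it is not part of the library's API) and only illustrates the rules listed above:

```typescript
// Hypothetical sketch, not the library's implementation:
// classify a string selector by its leading characters.
type SelectorKind = "xpath" | "cssClass" | "tag";

function classifySelector(selector: string): SelectorKind {
  if (selector.startsWith("/")) return "xpath";    // e.g. "/html/body/a" or "//head/meta"
  if (selector.startsWith(".")) return "cssClass"; // e.g. ".link"
  return "tag";                                    // e.g. "a" or "h2"
}

console.log(classifySelector("//head/meta")); // "xpath"
console.log(classifySelector(".link"));       // "cssClass"
console.log(classifySelector("a"));           // "tag"
```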
Example 1
```typescript
{
  links: "a"; // string as html tag
}
```
Example 2
```typescript
{
  links: ".link"; // string as css class
}
```
Example 3
```typescript
{
  links: "/html/body/a"; // string as xpath
}
```
Example 4
```typescript
{
  links: {
    selector: "a";
  } // IScraperSelectorIdentifier equivalent to Example 1
}
```
IScraperSelectorIdentifier interface
| Property | Type                                       | Description                                                      |
| -------- | ------------------------------------------ | ---------------------------------------------------------------- |
| selector | string                                     | can be a html tag name, a css class (prefixed with .) or a xpath |
| format   | IScraperTemplate or ScraperValueFormater[] | (optional)                                                       |
ScraperValueFormater type
By default, result values return the DOM element's innerText, but they can be formatted using:
| Format                   | Description                                                                                                                 |
| ------------------------ | --------------------------------------------------------------------------------------------------------------------------- |
| { attr: string }         | value will be the given attribute of the DOM container                                                                      |
| "html"                   | value will be the innerHTML of the DOM container                                                                            |
| "unique"                 | value will be unique (instead of an array)                                                                                  |
| ((value: any) => string) | final value will be formatted during post-processing based on the given function and the value set for the element by other formatters |
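To make the formatter chain concrete, here is a self-contained sketch that approximates how a format array could be applied to matched elements. It is not the library's implementation; applyFormat, MockElement, and the mock data are all invented for illustration, and only the chain semantics follow the table above:

```typescript
// Sketch only: approximates applying ScraperValueFormater entries in order.
type Formater = { attr: string } | "html" | "unique" | ((value: any) => string);

interface MockElement {
  innerText: string;
  innerHTML: string;
  attrs: Record<string, string>;
}

function applyFormat(elements: MockElement[], format: Formater[]): any {
  // Default value is innerText; each formatter refines the values.
  let values: any[] = elements.map((el) => el.innerText);
  let unique = false;
  for (const f of format) {
    if (typeof f === "function") {
      values = values.map(f);          // post-process with the given function
    } else if (f === "html") {
      values = elements.map((el) => el.innerHTML);
    } else if (f === "unique") {
      unique = true;                   // collapse to a single value at the end
    } else {
      values = elements.map((el) => el.attrs[f.attr]);
    }
  }
  return unique ? values[0] : values;
}

const anchors: MockElement[] = [
  { innerText: "npm", innerHTML: "<b>npm</b>", attrs: { href: "foo" } },
];
console.log(applyFormat(anchors, [{ attr: "href" }, "unique", (x) => `-${x}-`]));
// logs "-foo-", mirroring the preProcess example above
```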
Versions changelog
2.0.0
- scraping options removed from ctor
- template/selector disambiguation
- process options integrate template
- getCookies added
- verbose can be enabled globally by setting Scraper.isVerboseEnabled to true
1.2.1
- new isCookiesPersisted and cookies options added
1.1.3
- unique formatter no longer throws an error when no element is found
- dependencies update
1.1.2
- formatter can manage several attribute formats ({ attr: string })
1.1.1
- fix robots.txt check that invalidated paths starting with a disallow rule
1.1.0
- robots.txt check (enabled by default)
