scrapeur

Simple high-level declarative scraper.

usage example

scrapeur config.js

example config.js:

module.exports = {
    url: 'https://loremipsum.com',

    parsers: {
        // Top-level menu categories: one {title, link} object per <li>.
        main: document =>
            document._('.menu-bar-list > li')
                .map(categoryEl => ({
                    title: categoryEl._('span')[0].textContent.trim(),
                    link: categoryEl._('a')[0].href,
                })),

        // Sidebar subcategories; the regex strips a trailing item count,
        // e.g. 'Dolor (42)' -> 'Dolor'.
        subCategories: document =>
            document._('.category-sidebar li a')
                .map(el => ({
                    title: el.textContent.trim().match(/^(.*) \(\d+\)$/)[1],
                    link: el.href,
                })),
    },

    links: {
        // Follow each payload's `link` with the subCategories parser;
        // subcategory pages then recurse into themselves.
        main: {link: 'subCategories'},
        subCategories: {link: 'subCategories'},
    },
}

Note: document._(x) is short for Array.from(document.querySelectorAll(x)). The same shorthand is available via Node.prototype._.
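
A minimal sketch of how that shorthand could be installed (an assumption about the mechanism; scrapeur presumably does something equivalent before handing the document to your parsers):

// Assumed shim (hypothetical helper, not scrapeur's actual code): attach `_`
// to documents and nodes, as the note above describes.
const installShorthand = window => {
    const shorthand = function (selector) {
        return Array.from(this.querySelectorAll(selector));
    };
    window.Document.prototype._ = shorthand;
    window.Node.prototype._ = shorthand;
};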

That was a slightly modified version of a real example. Here is how scrapeur executes it:

  1. Start by fetching the page pointed to by the given url.
  2. Parse the fetched page with parsers.main.
  3. Recursively look for links named link (as declared by links.main) in the object returned by the parser.
  4. Fetch the pages those links point to, parse them with the parser declared in links.main, and inject the results into the objects containing the link keys.
  5. Goto 3.
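
In code, the loop might look roughly like this (a sketch under assumptions: helper names are made up, limits are ignored, and installing the _ shorthand is omitted; this is not scrapeur's actual source):

const {JSDOM} = require('jsdom');

async function scrape(config) {
    // Step 1: fetch a page and parse the HTML into a DOM document.
    const fetchDocument = async url => {
        const res = await fetch(url); // Node 18+ global fetch
        return new JSDOM(await res.text()).window.document;
    };

    // Steps 3-4: find the declared links in a payload, fetch and parse each,
    // and inject the result into the object containing the link.
    const follow = async (payload, parserName) => {
        const decls = config.links[parserName] || {};
        const items = Array.isArray(payload) ? payload : [payload];
        for (const item of items) {
            for (const [linkKey, decl] of Object.entries(decls)) {
                if (!item[linkKey]) continue;
                // Normalize the shorthand form: 'aux' -> {parser: 'aux'}.
                const {parser, propName = 'children'} =
                    typeof decl === 'string' ? {parser: decl} : decl;
                const doc = await fetchDocument(item[linkKey]);
                item[propName] = config.parsers[parser](doc);
                await follow(item[propName], parser); // step 5: goto 3
            }
        }
        return payload;
    };

    // Step 2: parse the entry page with parsers.main, then follow its links.
    return follow(config.parsers.main(await fetchDocument(config.url)), 'main');
}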

The resulting object looks like this:

[
    {
        title: 'lorem',
        link: 'http://loremipsum.com/lorem',
        children: [
            {
                title: 'ipsum',
                link: 'http://loremipsum.com/ipsum',
                children: [
                    {
                        title: 'dolor',
                        link: 'http://loremipsum.com/dolor',
                        children: [],
                    },
                    ...
                ],
            },
            {
                ...
            },
        ],
    },
    {
        title: 'sit amet',
        link: 'http://loremipsum.com/sit-amet',
        children: [
            ...
        ],
    },
    ...
]

philosophy

In progress. The short version: scrapeur's declarative mini-DSL saves you from writing your own request and link-following logic.

API

Config object:

{
    url: 'http://loremipsum.com',
    parsers: {
        main: document => ...,
        aux: document => ...,
    },
    links: {
        main: {
            link: {
                parser: 'aux',
                propName: 'children',
            },
            link2: 'aux', // shorthand
        },
    },
    limit: {
        fetch: 1000000,
        level: 1000000,
    },
}
  • url: The URL to start scraping from.

  • parsers: A map of parser functions. Each takes the document as its single argument and is expected to return an array or an object. The parser that parses url must be named main.

  • links: The links to look for and follow in each parser's payload. Each link is fetched and parsed by the declared parser, and the resulting payload is injected into the object containing the link under propName (children by default, as in the example above).

  • limit: Limits for development: a maximum number of fetches (fetch) and a maximum link-following depth (level) are supported.
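
For example, a config that caps work while you iterate on your parsers might look like this (the URL and selectors are placeholders):

module.exports = {
    url: 'https://example.com',

    parsers: {
        main: document =>
            document._('nav a').map(a => ({
                title: a.textContent.trim(),
                link: a.href,
            })),
    },

    links: {
        // Full form: follow `link` with the main parser and store results
        // under `items` instead of the default `children`.
        main: {link: {parser: 'main', propName: 'items'}},
    },

    limit: {
        fetch: 10, // stop after 10 page fetches
        level: 2,  // don't follow links more than 2 levels deep
    },
}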

Where is Cheerio?

JSDOM is being used at the moment.

Cheerio is faster and leaner than JSDOM, so support for it will be added sooner or later.

The good thing about JSDOM, though, is that it implements the standard DOM API, so if you already know how to work with the DOM you don't have to learn or remember anything new.