scrabt

v1.0.0

Published

2 months ago

Scrape and parse with ease.

Vulnerabilities: 0 high, 0 medium, 0 low

mevering

web scrape scraping crawl crawling parse parsing html playwright cheerio

♋︎ scrabt

Versatile webscraping package.

About

Create scrapers and parsers with boilerplate or custom strategies.

Hook into navigation and parse events to asynchronously loop scrapers into pipelines.

Installation

npm i scrabt

Note: To use the PlaywrightScraper, also install playwright:

npm i playwright

Why Scrabt?

Scrabt is an easy way to start small and grow your scraper and parser without having to restart from scratch. Create a basic scraper config, then change the approach as your needs evolve, including link selection, scraping engine, and parsing logic. There's no need to start with an HTTP scraper and rewrite the entire thing from the ground up when you decide to switch to a headless browser.

Additionally, scrabt's event emitter makes it simple to hook scrapers into existing workflows and pipelines. Some potential approaches:

Hook responses and keep a minimal parser for link selection only to offload parsing elsewhere.
Hook parsing to push parsed data into a messaging queue for enrichment later.
Extend Strategy to inject custom logic into how the scraper and parser flow.
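
As a rough illustration of the second approach, the sketch below pushes parsed data into a queue from a parse hook. It does not import scrabt: InMemoryQueue, ParsedRecord, and enrichmentQueue are hypothetical names standing in for a real messaging client and for scrabt's own ParsedData type, which may differ. The callback has the same shape as the parseCallback option shown under Hooking Events.

```typescript
// Hypothetical shape of parsed output; scrabt's real ParsedData type may differ.
type ParsedRecord = Record<string, unknown>;

// Minimal in-memory stand-in for a message queue client.
class InMemoryQueue<T> {
    private items: T[] = [];

    publish(item: T): void {
        this.items.push(item);
    }

    // Remove and return everything published so far.
    drain(): T[] {
        const drained = this.items;
        this.items = [];
        return drained;
    }

    get size(): number {
        return this.items.length;
    }
}

const enrichmentQueue = new InMemoryQueue<ParsedRecord>();

// Same shape as the parseCallback option: receives parsed data, returns nothing.
const parseCallback = (data: ParsedRecord): void => {
    enrichmentQueue.publish({ ...data, receivedAt: Date.now() });
};

// Simulate two parse events, as a scraper run might emit them.
parseCallback({ title: "First page" });
parseCallback({ title: "Second page" });
console.log(`Queued for enrichment: ${enrichmentQueue.size}`);
```

In a real pipeline the callback would be passed in the options object, and publish would forward to whatever broker the enrichment workers consume from.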

Usage

Basic

Get going quickly with defaults.

The example below runs an HTTP request-based scraper that only scrapes links on the same domain, up to a maximum of 5 links (the default).

import { Scrabt } from "scrabt";

const scrabt = new Scrabt();
await scrabt.start("https://example.com");
console.log(scrabt.parsed);
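
To make the default behaviour more concrete, here is a minimal sketch of what same-domain link selection with a cap could look like. It is not scrabt's implementation: selectSameDomainLinks and maxLinks are hypothetical names, "same domain" is assumed to mean an exact hostname match, and the cap is assumed to apply to the selected links.

```typescript
// Keep only http(s) links on the same hostname as the start URL,
// deduplicated and capped at maxLinks. Illustrative only.
function selectSameDomainLinks(startUrl: string, hrefs: string[], maxLinks = 5): string[] {
    const startHost = new URL(startUrl).hostname;
    const selected = new Set<string>();

    for (const href of hrefs) {
        if (selected.size >= maxLinks) break;

        let resolved: URL;
        try {
            // Resolve relative hrefs against the start URL.
            resolved = new URL(href, startUrl);
        } catch {
            continue; // Skip hrefs that cannot be parsed as URLs.
        }

        const isHttp = resolved.protocol === "http:" || resolved.protocol === "https:";
        if (!isHttp || resolved.hostname !== startHost) continue;

        resolved.hash = ""; // Treat fragment variants as the same page.
        selected.add(resolved.href);
    }
    return Array.from(selected);
}

const sameDomainLinks = selectSameDomainLinks("https://example.com", [
    "/about",
    "https://example.com/about#team",
    "https://other.org/page",
    "mailto:hi@example.com",
    "contact",
]);
console.log(sameDomainLinks);
```

A different-domain strategy would flip the hostname comparison and leave the rest unchanged.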

Hooking Events

The example below expands on the one above. The scraping engine has been replaced and now uses a headless browser via playwright. The strategy has also been inverted: only different-domain URLs will be selected and enqueued.

The parser has been specified but is unchanged.

import { Scrabt } from "scrabt";

const scrabtWithCallbacks = new Scrabt(
    "bfsDifferent",
    "playwright",
    "simple",
    {
        navCallback: (response) => {
            console.log(`Navigated to: ${response.url}`);
        },
        parseCallback: (data) => {
            console.log(`Parsed data:`, data);
        }
    }
);
await scrabtWithCallbacks.start("https://example.com", 20);
console.log(scrabtWithCallbacks.parsed);

In this case two options have been added, navCallback and parseCallback. These hook into the events emitted by the scrabt strategy, allowing external operations to be executed (e.g. navigation metrics or parsed data publishing).
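
The navigation metrics idea could be sketched as a small factory that returns a callback plus the counters it updates. This is only an illustration and does not import scrabt: createNavMetrics, NavResponse, and pagesPerHost are hypothetical names, and the only response field assumed is url, as used in the example above.

```typescript
// Only `url` is taken from the example above; the real Response type may carry more.
interface NavResponse {
    url: string;
}

// Build a navCallback-shaped function that counts navigations per hostname.
function createNavMetrics() {
    const pagesPerHost = new Map<string, number>();

    const navCallback = (response: NavResponse): void => {
        const host = new URL(response.url).hostname;
        pagesPerHost.set(host, (pagesPerHost.get(host) ?? 0) + 1);
    };

    return { navCallback, pagesPerHost };
}

const navMetrics = createNavMetrics();

// Simulate three navigation events, as a scraper run might emit them.
navMetrics.navCallback({ url: "https://example.com/" });
navMetrics.navCallback({ url: "https://example.com/about" });
navMetrics.navCallback({ url: "https://other.org/" });

navMetrics.pagesPerHost.forEach((count, host) => {
    console.log(`${host}: ${count} page(s)`);
});
```

Passing navMetrics.navCallback as the navCallback option would keep the counters outside the scraper, where a reporting job can read them.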

Custom Strategies

Scrapers, parsers, and strategies can all be fully extended and customised to allow for total control.

The example below is a (somewhat nonsensical) fully custom setup.

import { Scrabt, Strategy, PlaywrightScraper, RequestOptions, Response, ParsedData, Parser } from "scrabt";


class CustomPlaywrightScraper extends PlaywrightScraper {

    // A modified version of goto that creates a new page for each navigation.
    async goto(url: string, options?: RequestOptions): Promise<Response> {
        if (this._page) {
            await this._page.close();
        }
        return await super.goto(url, options);
    }
}

export class CustomParser extends Parser {

    parse(): ParsedData {
        return {
            title: this.$("div.customTitle").text(),
            body: this.$("div.customBody").text(),
        };
    }

    getHrefs(): string[] {
        // Get text of div.customLink elements as an array.
        return this.$("div.customLink").get().map(el => this.$(el).text());
    }
}

class CustomStrategy extends Strategy {

    maxQueueSize = 10;

    // Custom handle to emit a custom event when the queue is too large.
    async handle(url: string): Promise<void> {
        await super.handle(url);
        if (this.queue.size > this.maxQueueSize) {
            this.emit("QUEUE_TOO_LARGE", this.queue.size);
        }
    }
}

const strategy = new CustomStrategy(new CustomPlaywrightScraper("UserAgent"), new CustomParser());

strategy.on("QUEUE_TOO_LARGE", (size) => {
    console.warn(`Queue size is too large: ${size}`);
});

const customScrabt = new Scrabt(strategy);
await customScrabt.start("https://somesite.com", 2);
console.log(customScrabt.parsed);

Testing

npm run test

To Do

Boilerplate AI-enabled scrapers/parsers/strategies.
Optional dynamic parsing backend based on engine (e.g. give users option to use Playwright locators for parsing if using a Playwright engine).
More involved examples to show real scraping and parsing e2e pipeline.
