scrabt
v1.0.0
Scrape and parse with ease.
♋︎ scrabt
Versatile webscraping package.
About
Create scrapers and parsers with boilerplate or custom strategies.
Hook into navigation and parse events to asynchronously loop scrapers into pipelines.
Installation
npm i scrabt
Note: To use the PlaywrightScraper, install playwright as well:
npm i playwright
Why Scrabt?
Scrabt lets you start small and grow your scraper and parser without starting over. Create a basic scraper config, then change the approach as needed, including link selection, scraping engine, and parsing logic. There's no need to start with an HTTP scraper and rewrite the whole thing from the ground up when you decide to switch to a headless browser.
Additionally, scrabt's event emitter makes it simple to hook scrapers into existing workflows and pipelines. Some potential approaches:
- Hook responses and keep a minimal parser for link selection only, offloading parsing elsewhere.
- Hook parsing to push parsed data into a messaging queue for later enrichment.
- Extend Strategy to inject custom logic into how the scraper and parser flow.
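The second approach above can be pictured with a minimal, self-contained sketch. It does not import scrabt: `InMemoryQueue`, `ParsedRecord`, and `makeParseCallback` are hypothetical names used for illustration, the queue is an in-memory stand-in for a real message broker, and the parsed-data shape is an assumption rather than scrabt's actual `ParsedData` type.

```typescript
// Hypothetical shape of a parsed record; scrabt's real ParsedData type may differ.
type ParsedRecord = Record<string, unknown>;

// Minimal in-memory stand-in for a message queue client.
class InMemoryQueue<T> {
  private readonly messages: T[] = [];

  publish(message: T): void {
    this.messages.push(message);
  }

  // Remove and return up to `max` messages, oldest first.
  drain(max: number = Infinity): T[] {
    return this.messages.splice(0, max);
  }

  get size(): number {
    return this.messages.length;
  }
}

// Build a function that could be supplied as a parse hook (assumption:
// the hook receives one parsed record per call).
function makeParseCallback(queue: InMemoryQueue<ParsedRecord>) {
  return (data: ParsedRecord): void => {
    // Tag each record with an arrival time so a later enrichment step can order them.
    queue.publish({ ...data, receivedAt: Date.now() });
  };
}

const enrichmentQueue = new InMemoryQueue<ParsedRecord>();
const parseCallback = makeParseCallback(enrichmentQueue);

// Simulate two parse events instead of running a real scrape.
parseCallback({ title: "Home", body: "Welcome" });
parseCallback({ title: "About", body: "Who we are" });
```

In a real pipeline the `publish` call would presumably hand the record to a broker client, and a separate worker would consume and enrich it.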
Usage
Basic
Get going quickly with defaults.
The example below runs an HTTP request-based scraper that only scrapes links on the same domain, up to a maximum of 5 links (the default).
import { Scrabt } from "scrabt";
const scrabt = new Scrabt();
await scrabt.start("https://example.com");
console.log(scrabt.parsed);
Hooking Events
The example below expands on the one above. The scraping engine has been replaced with a headless browser via playwright. The strategy has also been inverted: only different-domain URLs will be selected and enqueued. The parser has been specified but is unchanged.
import { Scrabt } from "scrabt";
const scrabtWithCallbacks = new Scrabt(
  "bfsDifferent",
  "playwright",
  "simple",
  {
    navCallback: (response) => {
      console.log(`Navigated to: ${response.url}`);
    },
    parseCallback: (data) => {
      console.log(`Parsed data:`, data);
    }
  }
);
await scrabtWithCallbacks.start("https://example.com", 20);
console.log(scrabtWithCallbacks.parsed);
In this case two options have been added, navCallback and parseCallback. These hook into the events emitted by the scrabt strategy, allowing external operations to be executed (e.g. navigation metrics or parsed data publishing).
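As one illustration of the navigation-metrics idea, here is a minimal sketch of a collector whose `record` method could be passed as `navCallback`. It does not import scrabt: `NavigationMetrics` and `NavResponse` are hypothetical names, and the only assumption about the response object is that it exposes a `url` string, as the example above does.

```typescript
// Assumed minimal response shape: only `url` is relied on here.
interface NavResponse {
  url: string;
}

class NavigationMetrics {
  private readonly countsByHost = new Map<string, number>();
  total = 0;

  // Arrow-function property so `this` stays bound when the method is
  // handed over as a bare callback.
  record = (response: NavResponse): void => {
    const host = new URL(response.url).hostname;
    this.countsByHost.set(host, (this.countsByHost.get(host) ?? 0) + 1);
    this.total += 1;
  };

  // Number of navigations seen for a given hostname (0 if none).
  countFor(host: string): number {
    return this.countsByHost.get(host) ?? 0;
  }
}

const metrics = new NavigationMetrics();

// Simulate navigation events instead of running a real scrape.
metrics.record({ url: "https://example.com/" });
metrics.record({ url: "https://example.com/about" });
metrics.record({ url: "https://other.example.org/" });
```

With a collector like this, the options object would carry `navCallback: metrics.record`, and the counts could be read once `start` resolves.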
Custom Strategies
Scrapers, parsers, and strategies can all be fully extended and customised to allow for total control.
The example below is a (somewhat nonsensical) fully custom setup.
import { Scrabt, Strategy, PlaywrightScraper, RequestOptions, Response, ParsedData, Parser } from "scrabt";
class CustomPlaywrightScraper extends PlaywrightScraper {
  // A modified version of goto that creates a new page for each navigation.
  async goto(url: string, options?: RequestOptions): Promise<Response> {
    if (this._page) {
      await this._page.close();
    }
    return await super.goto(url, options);
  }
}
export class CustomParser extends Parser {
  parse(): ParsedData {
    return {
      title: this.$("div.customTitle").text(),
      body: this.$("div.customBody").text(),
    };
  }
  getHrefs(): string[] {
    // Get text of div.customLink elements as an array.
    return this.$("div.customLink").get().map(el => this.$(el).text());
  }
}
class CustomStrategy extends Strategy {
  maxQueueSize = 10;
  // Custom handle to emit a custom event when the queue is too large.
  async handle(url: string): Promise<void> {
    await super.handle(url);
    if (this.queue.size > this.maxQueueSize) {
      this.emit("QUEUE_TOO_LARGE", this.queue.size);
    }
  }
}
const strategy = new CustomStrategy(new CustomPlaywrightScraper("UserAgent"), new CustomParser());
strategy.on("QUEUE_TOO_LARGE", (size) => {
console.warn(`Queue size is too large: ${size}`);
});
const customScrabt = new Scrabt(strategy);
await customScrabt.start("https://somesite.com", 2);
console.log(customScrabt.parsed);
Testing
npm run test
To Do
- Boilerplate AI-enabled scrapers/parsers/strategies.
- Optional dynamic parsing backend based on engine (e.g. give users option to use Playwright locators for parsing if using a Playwright engine).
- More involved examples showing a real end-to-end scraping and parsing pipeline.
