craworm

v0.1.0

Published

2 months ago

Fast & Easy crawling framework deeply integrated with TypeORM. Built on Crawlee.

ytrsoft

crawler scraper web-scraping orm typeorm crawlee typescript decorators playwright cheerio

CrawORM

Fast & Easy crawling framework deeply integrated with TypeORM. Built on top of Crawlee.

CrawORM lets you describe what to scrape and where to store it on the same class, using TypeScript decorators you already know from TypeORM.

@Entity()
@Crawlable({ url: 'https://news.ycombinator.com/', listSelector: 'tr.athing' })
class Story {
  @PrimaryColumn() @Selector({ selector: '', pick: 'attr', attr: 'id' }) id!: string;
  @Column()       @Selector('span.titleline > a')                       title!: string;
  @Column()       @Href('span.titleline > a')                           url!: string;
}

const orm = new CrawORM({ dataSource });
await orm.init();
await orm.crawler(Story).run();

That's it. URLs deduped, requests retried, sessions rotated, results upserted into your database.

Why CrawORM?

Existing options force you to maintain three parallel descriptions of the same data:

A TypeORM entity for the database schema
A scrape function that picks fields out of HTML
A DTO / interface so TypeScript knows the shape

CrawORM collapses all three into one decorated class. The ORM, the extractor, and the type are the same object.

Design Goals

| Goal | How |
|---|---|
| Fast | Cheerio extraction in-process; bulk transactional inserts; lazy Crawlee imports |
| Easy | One class, one decorator stack, sane defaults: await orm.crawler(X).run() works |
| Safe at scale | CrawlState table = idempotent re-runs, crash recovery, conditional GETs |
| Escape hatches | crawlRepo.orm is the raw TypeORM repository; crawleeOptions forwards to Crawlee |

Install

npm install craw-orm typeorm crawlee reflect-metadata
# Optional engines:
npm install playwright          # for JS-rendered sites
npm install pg sqlite3 mysql2   # whichever DB driver you need

tsconfig.json must enable decorators:

{
  "compilerOptions": {
    "experimentalDecorators": true,
    "emitDecoratorMetadata": true,
    "target": "ES2022"
  }
}

The Decorators

| Decorator | Purpose |
|---|---|
| @Crawlable(meta) | Marks a class as a crawl target, declares URL/engine/pagination/conflict strategy |
| @Selector(css) | Extract trimmed text from a CSS selector |
| @XPath(xpath) | XPath equivalent (Playwright engines only) |
| @Attr(name, css) | Extract an attribute |
| @Href(css) / @Src(css) | Extract href/src (auto-resolved to absolute URLs) |
| @Html(css) | Extract inner HTML |
| @Regex(pattern) | Apply a regex on top of the previous selector |
| @Constant(value) | Pin a static value to a field |
| @Nested(EntityCtor, css) | Extract child entities scoped to a container |
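To make two of the rows above concrete, here is a minimal sketch of what "auto-resolved to absolute URLs" and "apply a regex on top of the previous selector" amount to, using only the standard URL class and RegExp. The helper names resolveHref and applyRegex are hypothetical and are not CrawORM API; returning the first capture group is an assumption made for the example.

```typescript
// Hypothetical helpers for illustration; not part of CrawORM's API.

/** Resolve a possibly relative href against the URL of the page it was found on. */
function resolveHref(href: string, pageUrl: string): string {
  return new URL(href, pageUrl).href;
}

/**
 * Apply a pattern to already-extracted text.
 * Assumption: the first capture group wins, falling back to the whole match.
 */
function applyRegex(text: string, pattern: RegExp): string | null {
  const match = pattern.exec(text);
  if (match === null) return null;
  return match[1] ?? match[0];
}

// A relative link found on a listing page becomes an absolute URL.
const absolute = resolveHref('item?id=42', 'https://news.ycombinator.com/news');

// A numeric score is pulled out of a longer text node.
const points = applyRegex('128 points by someone', /(\d+) points/);
```

Here absolute is 'https://news.ycombinator.com/item?id=42' and points is '128'; text that does not match the pattern yields null.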

The CrawlState Table

CrawORM auto-creates a craw_orm_state table that tracks every URL ever crawled:

Idempotency — re-running a crawl with staleAfterMs skips fresh URLs
Recovery — resume: true picks up URLs left in pending after a crash
Auditability — query failed URLs, success rates, last-error messages

Add it to your DataSource entities:

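import { DataSource } from 'typeorm';
import { CrawlState } from 'craw-orm';

// Keep your existing connection options and entities; CrawlState is the only addition.
const dataSource = new DataSource({ /* ...connection options */ entities: [/* ...your entities, */ CrawlState] });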
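The idempotency and recovery behaviour described above boils down to a per-URL decision: fetch again or skip. The sketch below shows one way such a check could look. The StateRow shape, the status values other than pending, and the shouldFetch function are all assumptions for illustration; the real craw_orm_state columns and logic may differ.

```typescript
// Hypothetical shapes for illustration; the real craw_orm_state columns may differ.
type CrawlStatus = 'pending' | 'done' | 'failed';

interface StateRow {
  url: string;
  status: CrawlStatus;
  lastCrawledAt: number | null; // epoch milliseconds, null if never fetched
}

/** Decide whether a URL needs fetching under a staleAfterMs policy. */
function shouldFetch(row: StateRow | undefined, staleAfterMs: number, now: number): boolean {
  if (row === undefined) return true;      // never seen before
  if (row.status !== 'done') return true;  // left pending by a crash, or failed earlier
  if (row.lastCrawledAt === null) return true;
  return now - row.lastCrawledAt >= staleAfterMs; // fresh rows are skipped
}

const now = Date.UTC(2024, 0, 2);
const oneDay = 24 * 60 * 60 * 1000;

const fresh: StateRow = { url: 'https://example.com/a', status: 'done', lastCrawledAt: now - 1000 };
const stale: StateRow = { url: 'https://example.com/b', status: 'done', lastCrawledAt: now - 2 * oneDay };
const crashed: StateRow = { url: 'https://example.com/c', status: 'pending', lastCrawledAt: null };
```

With a one-day staleAfterMs, fresh is skipped while stale and crashed are fetched again. Retrying failed rows on the next run is a choice made for this sketch, not a documented guarantee.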

Conflict Strategies

Set on @Crawlable({ onConflict: ... }):

skip — first wins, subsequent crawls don't change the row
overwrite — replace all fields
merge — update only non-null fields from the new crawl
upsert — DB-level UPSERT on the primary key (default)
version — relies on @VersionColumn, increments on save
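The difference between skip, overwrite, and merge is easiest to see on plain objects. The following sketch models only the row-level effect of those three strategies; resolveConflict and the Row type are hypothetical names, and upsert and version are left out because they depend on the database and on @VersionColumn.

```typescript
// Illustration only: models the row-level effect of three strategies on plain objects.
type Strategy = 'skip' | 'overwrite' | 'merge';
type Row = Record<string, string | number | null>;

function resolveConflict(existing: Row, incoming: Row, strategy: Strategy): Row {
  switch (strategy) {
    case 'skip':
      return existing; // first write wins
    case 'overwrite':
      return { ...incoming }; // every field replaced, including nulls
    case 'merge': {
      const merged: Row = { ...existing };
      for (const [key, value] of Object.entries(incoming)) {
        if (value !== null) merged[key] = value; // a null never clobbers stored data
      }
      return merged;
    }
  }
}

const stored: Row = { id: 'p1', title: 'Old title', price: 10 };
const crawled: Row = { id: 'p1', title: 'New title', price: null };

const skipped = resolveConflict(stored, crawled, 'skip');
const overwritten = resolveConflict(stored, crawled, 'overwrite');
const merged = resolveConflict(stored, crawled, 'merge');
```

In this example the second crawl failed to extract a price. skip keeps the old title and price, overwrite takes the new title but loses the price, and merge takes the new title while keeping the stored price of 10.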

Engine Selection

| Engine | When |
|---|---|
| cheerio (default) | Static HTML, fastest |
| playwright | JS-rendered sites, anti-bot protection, XPath support |
| puppeteer | Same as Playwright, alternative driver |
| http | Raw response, no parsing; for APIs |

Set on @Crawlable({ engine: 'playwright' }) or override at runtime: orm.crawler(X, { engine: 'playwright' }).

Hooks

@Crawlable({
  hooks: {
    beforeRequest: async ({ url }) => { /* dismiss banners, log in */ },
    afterParse:    async (entity)  => { /* normalise fields */ },
    beforeSave:    (entity)        => entity.price > 0,  // return false to skip
    afterSave:     async (entity)  => { /* notify, index, etc. */ },
    onError:       async (err)     => { /* alerting */ },
  }
})
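The hook names suggest an order around persistence: parse, filter, save, notify. The sketch below is one plausible reading of that order, written as a standalone runner so the beforeSave "return false to skip" rule is explicit. persistWithHooks, the Hooks interface, and the in-memory store are hypothetical; CrawORM's actual call order and signatures are not confirmed here.

```typescript
// Hypothetical runner showing one plausible hook order; not CrawORM's implementation.
interface Hooks<T> {
  afterParse?: (entity: T) => Promise<void> | void;
  beforeSave?: (entity: T) => Promise<boolean | void> | boolean | void;
  afterSave?: (entity: T) => Promise<void> | void;
}

async function persistWithHooks<T>(
  entity: T,
  hooks: Hooks<T>,
  save: (entity: T) => Promise<void>,
): Promise<boolean> {
  await hooks.afterParse?.(entity);
  // Only an explicit false skips the entity; undefined means "go ahead".
  if ((await hooks.beforeSave?.(entity)) === false) return false;
  await save(entity);
  await hooks.afterSave?.(entity);
  return true;
}

interface Product { name: string; price: number }

const saved: Product[] = [];     // stands in for the database
const notified: string[] = [];   // records afterSave calls

const hooks: Hooks<Product> = {
  afterParse: (p) => { p.name = p.name.trim(); },
  beforeSave: (p) => p.price > 0,
  afterSave: (p) => { notified.push(p.name); },
};

const store = async (p: Product): Promise<void> => { saved.push(p); };
```

Calling persistWithHooks({ name: '  Lamp ', price: 20 }, hooks, store) resolves to true and stores a product named 'Lamp'; the same call with price 0 resolves to false and neither store nor afterSave runs.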

Escape Hatches

When you need raw access:

// Raw TypeORM repository — for queries, query builder, transactions:
orm.repository(Product).orm.createQueryBuilder('p')...

// Raw Crawlee config — proxy pools, session rotation, fingerprints:
orm.crawler(Product, {
  crawleeOptions: {
    sessionPoolOptions: { maxPoolSize: 200 },
    browserPoolOptions: { /* ... */ },
  }
});

Performance Notes

@Selector extraction is in-memory Cheerio — ~50µs per field on typical pages
Persistence batches in chunks (default 100) inside transactions — ~2-3ms per row on Postgres
For >10k pages/day, prefer Postgres over SQLite (write contention)
For >1M pages/day, consider sharding by tag and running parallel crawler instances
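Batched persistence as described above starts with splitting rows into fixed-size chunks, each of which would then be written in its own transaction. A minimal sketch of that splitting step follows; chunk is a generic helper written for this example, not a CrawORM export, and the size of 100 simply mirrors the default quoted above.

```typescript
// Generic chunking helper for illustration; not a CrawORM export.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size)); // the last batch may be shorter
  }
  return batches;
}

// 250 extracted rows split with the default batch size of 100.
const rows = Array.from({ length: 250 }, (_, i) => ({ id: i }));
const batches = chunk(rows, 100);
const batchSizes = batches.map((b) => b.length); // [100, 100, 50]
```

For a sense of scale, 10,000 pages a day averages roughly 0.12 pages per second, and 1,000,000 pages a day roughly 11.6 per second.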

Status

Experimental. API may change. Pin to a specific version in production.

License

MIT.