craworm

v0.1.0

Fast & Easy crawling framework deeply integrated with TypeORM. Built on Crawlee.

CrawORM

Fast & Easy crawling framework deeply integrated with TypeORM. Built on top of Crawlee.

CrawORM lets you describe what to scrape and where to store it on the same class, using TypeScript decorators you already know from TypeORM.

@Entity()
@Crawlable({ url: 'https://news.ycombinator.com/', listSelector: 'tr.athing' })
class Story {
  @PrimaryColumn() @Selector({ selector: '', pick: 'attr', attr: 'id' }) id!: string;
  @Column()       @Selector('span.titleline > a')                       title!: string;
  @Column()       @Href('span.titleline > a')                           url!: string;
}

const orm = new CrawORM({ dataSource });
await orm.init();
await orm.crawler(Story).run();

That's it. URLs deduped, requests retried, sessions rotated, results upserted into your database.


Why CrawORM?

Existing options force you to maintain three parallel descriptions of the same data:

  1. A TypeORM entity for the database schema
  2. A scrape function that picks fields out of HTML
  3. A DTO / interface so TypeScript knows the shape

CrawORM collapses all three into one decorated class. The ORM, the extractor, and the type are the same object.
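
To make the "one class" idea concrete, here is a minimal sketch of how a property decorator can record extraction rules next to the class that will hold the data. It is not CrawORM's implementation: `selectorRegistry`, this `Selector`, and `extract` are hypothetical stand-ins, the "page" is a plain lookup table rather than real HTML, and the decorator is applied by calling it directly so the snippet runs without `experimentalDecorators`.

```typescript
// Hypothetical sketch: how a property decorator can attach extraction rules to a class.
// None of these names are CrawORM internals.
type FieldRule = { property: string; selector: string };

const selectorRegistry = new Map<Function, FieldRule[]>();

// Legacy-style property decorator factory: receives the prototype and the property name.
function Selector(selector: string) {
  return (target: object, property: string): void => {
    const ctor = target.constructor;
    const rules = selectorRegistry.get(ctor) ?? [];
    rules.push({ property, selector });
    selectorRegistry.set(ctor, rules);
  };
}

class Story {
  title!: string;
  url!: string;
}

// Equivalent to writing @Selector('...') above each field when decorators are enabled.
Selector('span.titleline > a')(Story.prototype, 'title');
Selector('span.titleline > a[href]')(Story.prototype, 'url');

// Stand-in extractor: looks each selector up in a fake page instead of parsing HTML.
function extract<T extends object>(ctor: new () => T, page: Record<string, string>): T {
  const entity = new ctor();
  for (const rule of selectorRegistry.get(ctor) ?? []) {
    (entity as Record<string, unknown>)[rule.property] = page[rule.selector] ?? '';
  }
  return entity;
}

const story = extract(Story, {
  'span.titleline > a': 'Show HN: Example',
  'span.titleline > a[href]': 'https://example.com/',
});
console.log(story.title, story.url);
```

The class is the TypeScript type, the registry holds the scrape rules, and an ORM decorator could register column metadata the same way, which is why a single declaration can serve all three roles.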

Design Goals

| Goal | How |
|---|---|
| Fast | Cheerio extraction in-process; bulk transactional inserts; lazy Crawlee imports |
| Easy | One class, one decorator stack, sane defaults (await orm.crawler(X).run() works) |
| Safe at scale | CrawlState table = idempotent re-runs, crash recovery, conditional GETs |
| Escape hatches | crawlRepo.orm is the raw TypeORM repository; crawleeOptions forwards to Crawlee |

Install

npm install craw-orm typeorm crawlee reflect-metadata
# Optional engines:
npm install playwright          # for JS-rendered sites
npm install pg sqlite3 mysql2   # whichever DB driver you need

tsconfig.json must enable decorators:

{
  "compilerOptions": {
    "experimentalDecorators": true,
    "emitDecoratorMetadata": true,
    "target": "ES2022"
  }
}

The Decorators

| Decorator | Purpose |
|---|---|
| @Crawlable(meta) | Marks a class as a crawl target, declares URL/engine/pagination/conflict strategy |
| @Selector(css) | Extract trimmed text from a CSS selector |
| @XPath(xpath) | XPath equivalent (Playwright engines only) |
| @Attr(name, css) | Extract an attribute |
| @Href(css) / @Src(css) | Extract href/src (auto-resolved to absolute URLs) |
| @Html(css) | Extract inner HTML |
| @Regex(pattern) | Apply a regex on top of the previous selector |
| @Constant(value) | Pin a static value to a field |
| @Nested(EntityCtor, css) | Extract child entities scoped to a container |
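
Two of the rows above describe post-processing rather than plain extraction: resolving href/src values to absolute URLs and applying a regex to an already extracted string. The sketch below shows what those steps amount to using only the standard `URL` and `RegExp` APIs. `resolveHref` and `applyRegex` are illustrative helper names, not CrawORM functions, and returning the first capture group from the regex step is an assumption.

```typescript
// Illustrative helpers only; CrawORM's internals may differ.

// Resolve a possibly relative href against the page URL using the standard URL API.
function resolveHref(href: string, pageUrl: string): string {
  return new URL(href, pageUrl).href;
}

// Assumption: a regex step yields the first capture group if there is one,
// otherwise the whole match, and null when nothing matches.
function applyRegex(text: string, pattern: RegExp): string | null {
  const match = pattern.exec(text);
  if (match === null) return null;
  return match[1] ?? match[0];
}

console.log(resolveHref('item?id=1', 'https://news.ycombinator.com/'));
console.log(resolveHref('/about', 'https://example.com/blog/post'));
console.log(applyRegex('Price: $19.99', /\$([\d.]+)/));
```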

The CrawlState Table

CrawORM auto-creates a craw_orm_state table that tracks every URL ever crawled:

  • Idempotency: re-running a crawl with staleAfterMs skips URLs that are still fresh
  • Recovery: passing resume: true picks up URLs left in pending after a crash
  • Auditability: query failed URLs, success rates, and last-error messages

Add it to your DataSource entities:

import { DataSource } from 'typeorm';
import { CrawlState } from 'craw-orm';
const dataSource = new DataSource({ entities: [..., CrawlState] });
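
The idempotency and recovery behaviour described above can be pictured as two small checks against the state table. This is a sketch under assumptions: the `StateRow` shape, the status values, and the helper names are hypothetical, and the real craw_orm_state columns may differ.

```typescript
// Hypothetical row shape; the real craw_orm_state columns may differ.
type CrawlStatus = 'pending' | 'success' | 'failed';

interface StateRow {
  url: string;
  status: CrawlStatus;
  lastCrawledAt: number | null; // epoch milliseconds, null if never completed
}

// Idempotency: a URL is skipped when its last successful crawl is newer than staleAfterMs.
function isFresh(row: StateRow, now: number, staleAfterMs: number): boolean {
  return (
    row.status === 'success' &&
    row.lastCrawledAt !== null &&
    now - row.lastCrawledAt < staleAfterMs
  );
}

// Recovery: URLs still marked pending were enqueued but never finished.
function pendingUrls(rows: readonly StateRow[]): string[] {
  return rows.filter((row) => row.status === 'pending').map((row) => row.url);
}

const HOUR = 60 * 60 * 1000;
const now = 10 * HOUR;
const rows: StateRow[] = [
  { url: 'https://example.com/a', status: 'success', lastCrawledAt: now - HOUR / 2 },
  { url: 'https://example.com/b', status: 'success', lastCrawledAt: now - 2 * HOUR },
  { url: 'https://example.com/c', status: 'pending', lastCrawledAt: null },
];

// URLs that a run with staleAfterMs = 1 hour would still fetch.
console.log(rows.filter((row) => !isFresh(row, now, HOUR)).map((row) => row.url));
// URLs that a resumed run would pick up.
console.log(pendingUrls(rows));
```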

Conflict Strategies

Set on @Crawlable({ onConflict: ... }):

  • skip — first wins, subsequent crawls don't change the row
  • overwrite — replace all fields
  • merge — update only non-null fields from the new crawl
  • upsert — DB-level UPSERT on the primary key (default)
  • version — relies on @VersionColumn, increments on save
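
The difference between skip, overwrite, and merge is easiest to see on plain objects. The function below is a sketch of the semantics as described in the list, not the library's code; `resolveConflict` and the `Row` type are hypothetical. The upsert and version strategies are left out because they depend on the database and on TypeORM's @VersionColumn rather than on in-memory logic.

```typescript
// Hypothetical in-memory model of three of the conflict strategies.
type Row = Record<string, unknown>;
type InMemoryStrategy = 'skip' | 'overwrite' | 'merge';

function resolveConflict(strategy: InMemoryStrategy, existing: Row, incoming: Row): Row {
  switch (strategy) {
    case 'skip':
      // First write wins: the stored row is returned untouched.
      return existing;
    case 'overwrite':
      // Every field is replaced, including fields that are now null.
      return { ...incoming };
    case 'merge': {
      // Only non-null incoming fields replace stored values.
      const merged: Row = { ...existing };
      for (const [key, value] of Object.entries(incoming)) {
        if (value !== null && value !== undefined) merged[key] = value;
      }
      return merged;
    }
  }
}

const existing: Row = { id: '1', title: 'Old title', price: 10 };
const incoming: Row = { id: '1', title: 'New title', price: null };

console.log(resolveConflict('skip', existing, incoming));
console.log(resolveConflict('overwrite', existing, incoming));
console.log(resolveConflict('merge', existing, incoming));
```

With this input, merge keeps the stored price because the new crawl returned null for it, while overwrite would replace it with null.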

Engine Selection

| Engine | When |
|---|---|
| cheerio (default) | Static HTML, fastest |
| playwright | JS-rendered sites, anti-bot protection, XPath support |
| puppeteer | Same as Playwright, alternative driver |
| http | Raw response, no parsing; for APIs |

Set on @Crawlable({ engine: 'playwright' }) or override at runtime: orm.crawler(X, { engine: 'playwright' }).

Hooks

@Crawlable({
  hooks: {
    beforeRequest: async ({ url }) => { /* dismiss banners, log in */ },
    afterParse:    async (entity)  => { /* normalise fields */ },
    beforeSave:    (entity)        => entity.price > 0,  // return false to skip
    afterSave:     async (entity)  => { /* notify, index, etc. */ },
    onError:       async (err)     => { /* alerting */ },
  }
})
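
As a rough mental model of how the save-related hooks might fit together, the sketch below runs afterParse, then beforeSave, then the save itself, then afterSave, and skips the save when beforeSave returns false. The ordering, the `SaveHooks` interface, and `persistWithHooks` are assumptions made for illustration; the request and error hooks are omitted.

```typescript
// Hypothetical sketch of hook ordering around a save; not CrawORM's actual pipeline.
interface Product {
  name: string;
  price: number;
}

interface SaveHooks<T> {
  afterParse?: (entity: T) => void | Promise<void>;
  beforeSave?: (entity: T) => boolean | void | Promise<boolean | void>;
  afterSave?: (entity: T) => void | Promise<void>;
}

// Runs the hooks in order and reports whether the entity was persisted.
async function persistWithHooks<T>(
  entity: T,
  hooks: SaveHooks<T>,
  save: (entity: T) => Promise<void>,
): Promise<boolean> {
  await hooks.afterParse?.(entity);
  const keep = await hooks.beforeSave?.(entity);
  if (keep === false) return false; // an explicit false skips the save
  await save(entity);
  await hooks.afterSave?.(entity);
  return true;
}

const saved: Product[] = [];
const hooks: SaveHooks<Product> = {
  afterParse: (p) => { p.name = p.name.trim(); },
  beforeSave: (p) => p.price > 0,
};

async function demo(): Promise<boolean[]> {
  const save = async (p: Product): Promise<void> => { saved.push(p); };
  return [
    await persistWithHooks({ name: ' Widget ', price: 5 }, hooks, save),
    await persistWithHooks({ name: 'Freebie', price: 0 }, hooks, save),
  ];
}

demo().then((results) => console.log(results, saved));
```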

Escape Hatches

When you need raw access:

// Raw TypeORM repository — for queries, query builder, transactions:
orm.repository(Product).orm.createQueryBuilder('p')...

// Raw Crawlee config — proxy pools, session rotation, fingerprints:
orm.crawler(Product, {
  crawleeOptions: {
    sessionPoolOptions: { maxPoolSize: 200 },
    browserPoolOptions: { /* ... */ },
  }
});

Performance Notes

  • @Selector extraction is in-memory Cheerio — ~50µs per field on typical pages
  • Persistence batches in chunks (default 100) inside transactions — ~2-3ms per row on Postgres
  • For >10k pages/day, prefer Postgres over SQLite (write contention)
  • For >1M pages/day, consider sharding by tag and running parallel crawler instances
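
The chunked persistence mentioned above comes down to splitting the extracted rows into fixed-size batches before each transaction. A minimal sketch follows, with a hypothetical `chunk` helper and the default batch size of 100 taken from the note above; the transaction itself is left out.

```typescript
// Split rows into batches of at most `size` items, preserving order.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 250 extracted rows become three batches: two full ones and a remainder.
const extractedRows = Array.from({ length: 250 }, (_, i) => ({ id: i }));
const batches = chunk(extractedRows, 100);
console.log(batches.map((batch) => batch.length)); // [ 100, 100, 50 ]
```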

Status

Experimental. API may change. Pin to a specific version in production.

License

MIT.