@isdk/proxy-crawlee

v0.3.1

Published 2 months ago

Crawlee interceptor adapter for @isdk/proxy, providing a seamless caching layer for network requests.


isdk

crawlee interceptor cache proxy @isdk/proxy adapter

@isdk/proxy-crawlee

A caching adapter for Crawlee that integrates the @isdk/proxy caching engine into your web scraping workflows.

Features

🚀 Universal Hook: Use a single preNavigationHooks entry for both CheerioCrawler and PlaywrightCrawler.
🧠 Environment-Aware: Automatically detects the crawler engine and applies the most efficient interception strategy.
🛡️ Request Collapsing: Prevents cache stampede in high-concurrency scraping sessions.
🌊 Native Streaming: Efficiently caches large documents without high memory overhead.
🔄 SWR Support: Stale-while-revalidate serves cached data to the crawler immediately and refreshes stale entries in the background.
🔌 Flexible Fetcher: Built-in got-scraping support for fingerprinting, but fully customizable.
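
Request collapsing is commonly implemented by sharing one in-flight promise per cache key, so concurrent requests for the same URL trigger a single fetch. The sketch below illustrates that general technique only; it is not the adapter's code, and RequestCollapser and run are hypothetical names.

```typescript
// Generic request-collapsing sketch (hypothetical names, not the @isdk/proxy API).
class RequestCollapser<T> {
  // One pending promise per cache key.
  private inFlight = new Map<string, Promise<T>>();

  run(key: string, task: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) {
      // A fetch for this key is already running: share its result.
      return existing;
    }
    // Start the fetch and forget it once it settles, success or failure.
    const promise = task().finally(() => {
      this.inFlight.delete(key);
    });
    this.inFlight.set(key, promise);
    return promise;
  }
}
```

Two callers that pass the same key while the first fetch is still pending receive the same promise, so the origin server sees one request instead of many.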

Installation

pnpm add @isdk/proxy-crawlee @isdk/proxy

Note: This adapter requires crawlee (and optionally got-scraping) to be installed in your project.

Quick Start

import { SmartCache } from '@isdk/proxy';
import { createCrawleeCacheHook } from '@isdk/proxy-crawlee';
import { CheerioCrawler } from 'crawlee';

// 1. Initialize the cache
const cache = new SmartCache({ storagePath: './.cache' });

// 2. Create the hook
const cacheHook = createCrawleeCacheHook({
  cache, // SmartCache instance
  config: {
    methods: ['GET'],
    forceCache: true,
  }
});

// 3. Apply to any crawler
const crawler = new CheerioCrawler({
  preNavigationHooks: [ cacheHook ],
  requestHandler: async ({ request, body }) => {
    console.log(`Fetched ${request.url} (Cache: ${request.headers['x-proxy-cache']})`);
  },
});

await crawler.run(['https://example.com']);

Configuration Options

CrawleeCacheOptions extends FetchWithCacheOptions from @isdk/proxy, so every FetchWithCacheOptions option is available alongside the adapter-specific ones.

Interception Strategies

Cheerio / JSDOM (HTTP-only)

The adapter injects a custom handler into gotOptions.handlers. It intercepts the request at the lowest level, preventing got-scraping from making a network call if a cache hit occurs.
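
The handler mechanism described above can be sketched as follows. This is a simplified illustration, not the adapter's implementation: the RequestOptions and CachedResponse shapes, the createCacheHandler name, the in-memory Map store, and the HIT/MISS header values are all assumptions made to keep the example self-contained. Only the (options, next) handler shape mirrors got's handlers option.

```typescript
// Simplified stand-ins for the real got types (assumptions for this sketch).
interface RequestOptions {
  url: string;
  method: string;
}
interface CachedResponse {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}
type Next = (options: RequestOptions) => Promise<CachedResponse>;
type Handler = (options: RequestOptions, next: Next) => Promise<CachedResponse>;

// Hypothetical factory: returns a handler that answers from `store` when it can.
function createCacheHandler(store: Map<string, CachedResponse>): Handler {
  return async (options, next) => {
    const key = `${options.method} ${options.url}`;
    const hit = store.get(key);
    if (hit) {
      // Cache hit: `next` is never called, so no network request is made.
      return { ...hit, headers: { ...hit.headers, 'x-proxy-cache': 'HIT' } };
    }
    // Cache miss: delegate to the next handler, which performs the request.
    const fresh = await next(options);
    if (options.method === 'GET' && fresh.statusCode === 200) {
      store.set(key, fresh);
    }
    return { ...fresh, headers: { ...fresh.headers, 'x-proxy-cache': 'MISS' } };
  };
}
```

Because the handler returns before calling next on a hit, the HTTP client never opens a connection for cached URLs.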

Playwright (Browser)

The adapter uses Playwright's page.route to intercept navigation requests and fulfills them directly from the cache, so the main document bypasses the browser's network stack.
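
A route handler along these lines could look like the sketch below. It is an illustration under assumptions, not the adapter's code: RouteLike and RequestLike are structural subsets of Playwright's Route and Request declared locally so the example is self-contained, and createRouteHandler, CachedPage, and the Map store are hypothetical. It shows only the cache-hit path; storing responses on a miss is omitted.

```typescript
// Local structural subsets of Playwright's Request and Route (assumed shapes).
interface RequestLike {
  url(): string;
  method(): string;
  isNavigationRequest(): boolean;
}
interface CachedPage {
  status: number;
  headers: Record<string, string>;
  body: string;
}
interface RouteLike {
  request(): RequestLike;
  fulfill(response: CachedPage): Promise<void>;
  continue(): Promise<void>;
}

// Hypothetical factory: serves cached main documents, lets everything else through.
function createRouteHandler(store: Map<string, CachedPage>) {
  return async (route: RouteLike): Promise<void> => {
    const request = route.request();
    const cacheable = request.method() === 'GET' && request.isNavigationRequest();
    const hit = cacheable ? store.get(request.url()) : undefined;
    if (hit) {
      // Answer from the cache; the browser never sends the request.
      await route.fulfill({ status: hit.status, headers: hit.headers, body: hit.body });
      return;
    }
    // Not cached, or not a navigation: proceed over the network.
    await route.continue();
  };
}

// With a real Playwright page, registration would look like:
//   await page.route('**/*', createRouteHandler(store));
```

Restricting the hit path to navigation requests matches the description above: only the main document is served from the cache, while subresources go through the browser as usual.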

Offline Mode

Offline mode disables network access and serves responses only from the local cache. On a cache miss, the crawler throws OfflineCacheMissError.

For this error to properly fail your crawler, you must configure throwHttpErrors: true in your Crawlee options:

const crawler = new CheerioCrawler({
  preNavigationHooks: [ cacheHook ],
  requestHandler: async ({ request, body }) => {
    console.log(`Fetched ${request.url}`);
  },
  throwHttpErrors: true, // Required for OfflineCacheMissError
});
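
When running offline, it can help to separate cache misses from other failures, for example to list the URLs that still need to be fetched online. The helper below is a hedged sketch: it assumes the thrown error's name property is 'OfflineCacheMissError', which this README does not confirm, and isOfflineCacheMiss and collectMisses are hypothetical names.

```typescript
// Assumption: the error reports its class name through `error.name`.
function isOfflineCacheMiss(error: unknown): boolean {
  return error instanceof Error && error.name === 'OfflineCacheMissError';
}

// Hypothetical helper: collect the URLs whose failure was an offline cache miss.
function collectMisses(failures: Array<{ url: string; error: unknown }>): string[] {
  return failures
    .filter((failure) => isOfflineCacheMiss(failure.error))
    .map((failure) => failure.url);
}

// In a crawler, the check could be used inside failedRequestHandler, e.g.:
//   failedRequestHandler: async ({ request }, error) => {
//     if (isOfflineCacheMiss(error)) missing.push(request.url);
//   },
```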

License

MIT
