@isdk/proxy-crawlee
v0.3.1
Crawlee interceptor adapter for @isdk/proxy, providing a seamless caching layer for network requests.
@isdk/proxy-crawlee
A caching adapter for Crawlee that integrates the @isdk/proxy caching engine into your web scraping workflows.
Features
- 🚀 Universal Hook: Use a single preNavigationHooks for both CheerioCrawler and PlaywrightCrawler.
- 🧠 Environment-Aware: Automatically detects the crawler engine and applies the most efficient interception strategy.
- 🛡️ Request Collapsing: Prevents cache stampede in high-concurrency scraping sessions.
- 🌊 Native Streaming: Efficiently caches large documents without high memory overhead.
- 🔄 SWR Support: Background revalidation ensures your crawler always gets data instantly while keeping the cache fresh.
- 🔌 Flexible Fetcher: Built-in got-scraping support for fingerprinting, but fully customizable.
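
To make the request-collapsing feature concrete, here is a minimal, dependency-free sketch of the general pattern: concurrent callers asking for the same key share one in-flight promise instead of each hitting the origin. The `RequestCollapser` class and `collapseDemo` function are hypothetical names used for illustration only; they are not part of @isdk/proxy-crawlee, and the adapter's internals may differ.

```typescript
// Hypothetical illustration of request collapsing; not the adapter's actual code.
class RequestCollapser<T> {
  // One pending promise per cache key while a load is in flight.
  private readonly inFlight = new Map<string, Promise<T>>();

  run(key: string, load: () => Promise<T>): Promise<T> {
    const pending = this.inFlight.get(key);
    if (pending) return pending; // join the load that is already running

    const started = load().finally(() => {
      // Clear the slot so later requests can trigger a new load.
      this.inFlight.delete(key);
    });
    this.inFlight.set(key, started);
    return started;
  }
}

// Three concurrent requests for the same key trigger a single origin fetch.
async function collapseDemo(): Promise<number> {
  let originHits = 0;
  const collapser = new RequestCollapser<string>();
  const load = async (): Promise<string> => {
    originHits += 1;
    return '<html></html>';
  };
  const key = 'GET https://example.com';
  await Promise.all([
    collapser.run(key, load),
    collapser.run(key, load),
    collapser.run(key, load),
  ]);
  return originHits;
}
```

In this sketch `collapseDemo()` resolves to 1, because the second and third callers reuse the promise created by the first.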
Installation

    pnpm add @isdk/proxy-crawlee @isdk/proxy

Note: This adapter requires crawlee (and optionally got-scraping) to be installed in your project.
Quick Start

    import { SmartCache } from '@isdk/proxy';
    import { createCrawleeCacheHook } from '@isdk/proxy-crawlee';
    import { CheerioCrawler } from 'crawlee';

    // 1. Initialize the cache
    const cache = new SmartCache({ storagePath: './.cache' });

    // 2. Create the hook
    const cacheHook = createCrawleeCacheHook({
      cache, // SmartCache instance
      config: {
        methods: ['GET'],
        forceCache: true,
      },
    });

    // 3. Apply to any crawler
    const crawler = new CheerioCrawler({
      preNavigationHooks: [cacheHook],
      requestHandler: async ({ request, body }) => {
        console.log(`Fetched ${request.url} (Cache: ${request.headers['x-proxy-cache']})`);
      },
    });

    await crawler.run(['https://example.com']);

Configuration Options
CrawleeCacheOptions extends FetchWithCacheOptions from @isdk/proxy. All options from FetchWithCacheOptions are available, plus the following:
| Option | Type | Description |
| :--- | :--- | :--- |
| cache | SmartCache | Required (from FetchWithCacheOptions). The SmartCache instance from @isdk/proxy. |
| config | ProxySiteConfig | Required (from FetchWithCacheOptions). Site-level cache configuration (rules, fingerprinting, etc.). For detailed options like methods, rules, forceCache, see @isdk/proxy. |
| fetcher | Function | Optional. Custom fetcher for real network requests. Defaults to got-scraping. |
| backgroundUpdate | boolean | From FetchWithCacheOptions. Enable SWR (Stale-While-Revalidate). Default: true. |
| refresh | boolean | From FetchWithCacheOptions. Force refresh: Ignores existing cache and always fetches from source. Useful for bypassing bot verification. |
| navigationOnly | boolean | Only cache the main document (Browser only). Default: true. |
| activeCacheWrites | Map | Shared map for request collapsing across crawler instances. |
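
The backgroundUpdate option enables stale-while-revalidate (SWR). As a rough illustration of that pattern in general, and not of how @isdk/proxy implements it, the sketch below serves a stale entry immediately and refreshes it in the background. `SwrCache`, its `maxAgeMs` parameter, and the injectable `now` clock are hypothetical names introduced for this example.

```typescript
// Hypothetical SWR sketch; the real engine's storage and freshness rules may differ.
interface Entry<T> {
  value: T;
  storedAt: number;
}

class SwrCache<T> {
  private readonly entries = new Map<string, Entry<T>>();
  private readonly refreshing = new Set<string>();
  private readonly maxAgeMs: number;
  private readonly now: () => number;

  constructor(maxAgeMs: number, now: () => number = () => Date.now()) {
    this.maxAgeMs = maxAgeMs;
    this.now = now;
  }

  async get(key: string, fetchFresh: () => Promise<T>): Promise<T> {
    const entry = this.entries.get(key);
    if (!entry) {
      // Cold miss: the caller has to wait for the origin once.
      return this.store(key, await fetchFresh());
    }
    const isStale = this.now() - entry.storedAt > this.maxAgeMs;
    if (isStale && !this.refreshing.has(key)) {
      this.refreshing.add(key);
      // Revalidate in the background; on failure keep serving the stale value.
      void fetchFresh()
        .then((value) => this.store(key, value))
        .catch(() => undefined)
        .finally(() => this.refreshing.delete(key));
    }
    return entry.value; // answered instantly, even when stale
  }

  private store(key: string, value: T): T {
    this.entries.set(key, { value, storedAt: this.now() });
    return value;
  }
}
```

The trade-off is that a caller may receive one stale response while the refresh is in flight, in exchange for never blocking on the network after the first fetch.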
Interception Strategies
Cheerio / JSDOM (HTTP-only)
The adapter injects a custom handler into gotOptions.handlers. It intercepts the request at the lowest level, preventing got-scraping from making a network call if a cache hit occurs.
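
The handler idea can be sketched without any dependencies. got handlers have the shape `(options, next) => ...`, where calling `next(options)` performs the real request; a caching handler returns early on a hit and never calls `next`. The types below are simplified local stand-ins for got's real types, and `createCacheHandler`, the Map-based store, and the HIT/MISS header values are assumptions for illustration, not the adapter's actual implementation.

```typescript
// Simplified stand-ins for got's option and response types (assumptions).
interface RequestOptions {
  url: string;
  method: string;
}
interface ResponseLike {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}
type Next = (options: RequestOptions) => Promise<ResponseLike>;
type Handler = (options: RequestOptions, next: Next) => Promise<ResponseLike>;

// Hypothetical caching handler: short-circuits the chain on a cache hit.
function createCacheHandler(store: Map<string, ResponseLike>): Handler {
  return async (options, next) => {
    const key = `${options.method} ${options.url}`;
    const cached = store.get(key);
    if (cached) {
      // Hit: `next` is never called, so no network request is made.
      return { ...cached, headers: { ...cached.headers, 'x-proxy-cache': 'HIT' } };
    }
    // Miss: perform the real request, then remember the response.
    const fresh = await next(options);
    store.set(key, fresh);
    return { ...fresh, headers: { ...fresh.headers, 'x-proxy-cache': 'MISS' } };
  };
}
```

In a Crawlee pre-navigation hook, a handler of this kind would be appended to `gotOptions.handlers`; a real one must also produce a response object that got accepts, which this sketch glosses over.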
Playwright (Browser)
The adapter uses page.route (Playwright) to intercept navigation requests. It fulfills the request directly from the cache, bypassing the browser's network stack for the main document.
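
As a rough sketch of this strategy, the route handler below fulfills navigation requests from a cache and lets everything else continue to the network. It uses only Playwright calls that exist on the real objects (`route.request()`, `request.isNavigationRequest()`, `request.url()`, `route.fulfill()`, `route.continue()`), but declares minimal local interfaces so it runs without Playwright installed. `createRouteHandler`, `CachedDocument`, and the Map-based store are hypothetical; storing responses on a miss is omitted for brevity.

```typescript
// Minimal structural stand-ins for the parts of Playwright's Route/Request used here.
interface RequestLike {
  url(): string;
  isNavigationRequest(): boolean;
}
interface CachedDocument {
  status: number;
  headers: Record<string, string>;
  body: string;
}
interface RouteLike {
  request(): RequestLike;
  fulfill(response: CachedDocument): Promise<void>;
  continue(): Promise<void>;
}

// Hypothetical handler: serve cached main documents, pass everything else through.
function createRouteHandler(store: Map<string, CachedDocument>, navigationOnly = true) {
  return async (route: RouteLike): Promise<void> => {
    const request = route.request();
    if (navigationOnly && !request.isNavigationRequest()) {
      await route.continue(); // sub-resources (images, scripts) are not cached
      return;
    }
    const cached = store.get(request.url());
    if (cached) {
      await route.fulfill(cached); // answered from cache, no network round trip
      return;
    }
    await route.continue(); // cache miss: let the browser fetch normally
  };
}
```

With a real Playwright page, a handler like this would be registered with `page.route('**/*', handler)`.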
Offline Mode
Offline mode disables network access and serves responses only from the local cache. On a cache miss, the crawler throws OfflineCacheMissError.
For this error to properly fail your crawler, you must configure throwHttpErrors: true in your Crawlee options:

    const crawler = new CheerioCrawler({
      preNavigationHooks: [cacheHook],
      requestHandler: async ({ request, body }) => {
        console.log(`Fetched ${request.url}`);
      },
      throwHttpErrors: true, // Required for OfflineCacheMissError
    });

License
MIT
