@rafikidota/scoutee
v0.19.1
🕵️ @rafikidota/scoutee
"Sometimes, the best way to solve your own problems is to help someone else."
Scoutee is a NestJS library that wraps Crawlee crawlers into injectable, environment-driven modules. It gives you production-ready HttpCrawler, CheerioCrawler, PlaywrightCrawler, and stealth Camoufox crawlers — all wired up with pre/post navigation hooks, structured logging, and full ConfigService integration out of the box.
📦 Installation
pnpm add @rafikidota/scoutee
Peer dependencies
Install the crawlers you actually need:
# HTTP / Cheerio (lightweight)
pnpm add crawlee
# Playwright (full browser)
pnpm add crawlee @crawlee/playwright playwright
# Camoufox (stealth browser — anti-bot fingerprint spoofing)
pnpm add crawlee @crawlee/playwright playwright camoufox-js
Scoutee also requires a NestJS application context:
pnpm add @nestjs/common @nestjs/core @nestjs/config
🗂️ Package exports
Each crawler ships as a separate entry point so you only bundle what you use:
| Import path | What you get |
|---|---|
| @rafikidota/scoutee | All four modules |
| @rafikidota/scoutee/http | HttpModule + HttpService |
| @rafikidota/scoutee/cheerio | CheerioModule + CheerioService |
| @rafikidota/scoutee/playwright | PlaywrightModule + PlaywrightService |
| @rafikidota/scoutee/camoufox | CamoufoxModule + CamoufoxService |
🚀 Quick start
1. Register the module
Import only the modules you need. Each one is self-contained.
// app.module.ts
import { Module } from '@nestjs/common';
import { ConfigModule } from '@nestjs/config';
import { PlaywrightModule } from '@rafikidota/scoutee/playwright';

@Module({
  imports: [
    ConfigModule.forRoot({ isGlobal: true }),
    PlaywrightModule,
  ],
})
export class AppModule {}

2. Inject the service and create a crawler
import { Injectable } from '@nestjs/common';
import { PlaywrightService } from '@rafikidota/scoutee/playwright';
import { Dataset } from 'crawlee';

@Injectable()
export class ScraperService {
  constructor(private readonly playwright: PlaywrightService) {}

  async run() {
    const crawler = this.playwright.create({
      async requestHandler({ page, request }) {
        const title = await page.title();
        await Dataset.pushData({ url: request.url, title });
      },
    });
    await crawler.run(['https://example.com']);
  }
}

🧩 Modules
🌐 HttpModule
Thin wrapper around Crawlee's HttpCrawler. Best for raw HTTP requests without a browser.
import { HttpModule, HttpService } from '@rafikidota/scoutee/http';
// or from '@rafikidota/scoutee'
Service method:
const crawler = httpService.create(options: HttpCrawlerOptions): HttpCrawler
Environment variables:
| Variable | Description |
|---|---|
| CRAWLEE_HTTP_MAX_CONCURRENCY | Maximum parallel requests |
| CRAWLEE_HTTP_MIN_CONCURRENCY | Minimum parallel requests |
| CRAWLEE_HTTP_MAX_REQUEST_RETRIES | Retry count per request |
| CRAWLEE_HTTP_TIMEOUT_SECS | Request handler timeout (seconds) |
| CRAWLEE_HTTP_MAX_REQUESTS | Total request cap per run |
| CRAWLEE_HTTP_INITIAL_PAGE | Starting page number |
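
The variables above presumably end up as crawler options. The following is a minimal sketch of one way such a mapping could work, not Scoutee's actual implementation. The option names (`maxConcurrency`, `minConcurrency`, `maxRequestRetries`, `requestHandlerTimeoutSecs`, `maxRequestsPerCrawl`) are Crawlee's; the pairing of each variable with an option, and the helper names `readInt` and `httpOptionsFromEnv`, are assumptions made for illustration. `CRAWLEE_HTTP_INITIAL_PAGE` is left out because it has no obvious Crawlee counterpart.

```typescript
type Env = Record<string, string | undefined>;

// Subset of crawler options that the HTTP variables plausibly control.
interface HttpEnvOptions {
  maxConcurrency?: number;
  minConcurrency?: number;
  maxRequestRetries?: number;
  requestHandlerTimeoutSecs?: number;
  maxRequestsPerCrawl?: number;
}

// Assumed pairing of environment variables with option names.
const HTTP_ENV_MAP: Array<[string, keyof HttpEnvOptions]> = [
  ["CRAWLEE_HTTP_MAX_CONCURRENCY", "maxConcurrency"],
  ["CRAWLEE_HTTP_MIN_CONCURRENCY", "minConcurrency"],
  ["CRAWLEE_HTTP_MAX_REQUEST_RETRIES", "maxRequestRetries"],
  ["CRAWLEE_HTTP_TIMEOUT_SECS", "requestHandlerTimeoutSecs"],
  ["CRAWLEE_HTTP_MAX_REQUESTS", "maxRequestsPerCrawl"],
];

// Accept only strings made of digits; anything else counts as unset.
function readInt(env: Env, key: string): number | undefined {
  const raw = (env[key] || "").trim();
  return /^\d+$/.test(raw) ? parseInt(raw, 10) : undefined;
}

// Build an options object containing only the variables that were set.
function httpOptionsFromEnv(env: Env): HttpEnvOptions {
  const options: HttpEnvOptions = {};
  for (const [variable, option] of HTTP_ENV_MAP) {
    const value = readInt(env, variable);
    if (value !== undefined) options[option] = value;
  }
  return options;
}
```

Leaving unset variables out of the result, rather than storing `undefined`, means the object can be spread over a set of defaults without erasing them.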
🍋 CheerioModule
Wrapper around Crawlee's CheerioCrawler. Automatically parses HTML with Cheerio — ideal for static or server-rendered pages.
import { CheerioModule, CheerioService } from '@rafikidota/scoutee/cheerio';
Service method:
const crawler = cheerioService.create(options: CheerioCrawlerOptions): CheerioCrawler
Environment variables:
| Variable | Description |
|---|---|
| CRAWLEE_CHEERIO_MAX_CONCURRENCY | Maximum parallel requests |
| CRAWLEE_CHEERIO_MIN_CONCURRENCY | Minimum parallel requests |
| CRAWLEE_CHEERIO_MAX_REQUEST_RETRIES | Retry count per request |
| CRAWLEE_CHEERIO_TIMEOUT_SECS | Request handler timeout (seconds) |
| CRAWLEE_CHEERIO_MAX_REQUESTS | Total request cap per run |
| CRAWLEE_CHEERIO_INITIAL_PAGE | Starting page number |
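
Because all of these settings arrive as strings, a mistyped value or a minimum that exceeds the maximum is easy to miss. The sketch below shows a sanity check you could run at startup. It is not part of Scoutee; `checkCrawlerEnv` is a hypothetical helper, and the rules it enforces (non-negative integers, minimum not above maximum, non-zero timeout) are assumptions about what a sensible configuration looks like.

```typescript
type Env = Record<string, string | undefined>;

// Returns a list of human-readable problems; an empty list means the values look sane.
function checkCrawlerEnv(prefix: string, env: Env): string[] {
  const problems: string[] = [];

  // Read one variable; unset values are skipped, malformed ones are reported.
  const read = (suffix: string): number | undefined => {
    const name = `${prefix}_${suffix}`;
    const raw = (env[name] || "").trim();
    if (raw === "") return undefined;
    if (!/^\d+$/.test(raw)) {
      problems.push(`${name} must be a non-negative integer, got "${raw}"`);
      return undefined;
    }
    return parseInt(raw, 10);
  };

  const min = read("MIN_CONCURRENCY");
  const max = read("MAX_CONCURRENCY");
  read("MAX_REQUEST_RETRIES");
  const timeout = read("TIMEOUT_SECS");
  read("MAX_REQUESTS");

  // Cross-field checks that a per-variable parser cannot catch.
  if (min !== undefined && max !== undefined && min > max) {
    problems.push(`${prefix}_MIN_CONCURRENCY (${min}) exceeds ${prefix}_MAX_CONCURRENCY (${max})`);
  }
  if (timeout === 0) {
    problems.push(`${prefix}_TIMEOUT_SECS must be greater than zero`);
  }
  return problems;
}
```

The prefix parameter lets the same check cover the other modules, for example `checkCrawlerEnv("CRAWLEE_HTTP", process.env)` in a Node.js application.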
🎭 PlaywrightModule
Full browser automation via Crawlee's PlaywrightCrawler. Supports Chromium, Firefox, and WebKit with session pooling, fingerprinting, and built-in Cloudflare challenge handling.
import { PlaywrightModule, PlaywrightService } from '@rafikidota/scoutee/playwright';
Service methods:
// Create a crawler instance
const crawler = playwrightService.create(options: PlaywrightCrawlerOptions): PlaywrightCrawler
// Get a raw browser instance
const browser = await playwrightService.getBrowser()
Environment variables:
| Variable | Description |
|---|---|
| CRAWLEE_PLAYWRIGHT_BROWSER | Browser engine: chromium, firefox, or webkit |
| CRAWLEE_PLAYWRIGHT_MAX_CONCURRENCY | Maximum parallel browser pages |
| CRAWLEE_PLAYWRIGHT_MIN_CONCURRENCY | Minimum parallel browser pages |
| CRAWLEE_PLAYWRIGHT_MAX_REQUEST_RETRIES | Retry count per request |
| CRAWLEE_PLAYWRIGHT_TIMEOUT_SECS | Navigation and handler timeout (seconds) |
| CRAWLEE_PLAYWRIGHT_MAX_REQUESTS | Total request cap per run |
| CRAWLEE_PLAYWRIGHT_INITIAL_PAGE | Starting page number |
| CRAWLEE_PLAYWRIGHT_HEADLESS | Run browser headless (true or false) |
| CRAWLEE_PLAYWRIGHT_USE_INCOGNITO_PAGES | Use incognito context (true or false) |
| CRAWLEE_PLAYWRIGHT_HANDLE_CLOUDFLARE_CHALLENGE | Auto-solve Cloudflare challenges (true or false) |
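
The boolean flags and the browser name reach the application as strings, and in JavaScript the string "false" is truthy, so they need explicit parsing. Here is a minimal sketch of that parsing, assuming nothing about how Scoutee does it internally. `parseBoolean` and `parseBrowser` are hypothetical helpers, the fallback to Chromium is an assumption, and `BrowserType` is a local stand-in that mirrors the values of the library's enum rather than importing it.

```typescript
// Local stand-in for the library's BrowserType values.
const BrowserType = {
  CHROMIUM: "chromium",
  FIREFOX: "firefox",
  WEBKIT: "webkit",
} as const;
type BrowserType = (typeof BrowserType)[keyof typeof BrowserType];

// Only the literal strings "true" and "false" (any case) are recognised;
// anything else, including an unset variable, yields the fallback.
function parseBoolean(raw: string | undefined, fallback: boolean): boolean {
  const value = (raw || "").trim().toLowerCase();
  if (value === "true") return true;
  if (value === "false") return false;
  return fallback;
}

// Unknown or missing engine names yield the fallback instead of throwing.
function parseBrowser(
  raw: string | undefined,
  fallback: BrowserType = BrowserType.CHROMIUM,
): BrowserType {
  const value = (raw || "").trim().toLowerCase();
  switch (value) {
    case BrowserType.CHROMIUM:
      return BrowserType.CHROMIUM;
    case BrowserType.FIREFOX:
      return BrowserType.FIREFOX;
    case BrowserType.WEBKIT:
      return BrowserType.WEBKIT;
    default:
      return fallback;
  }
}
```

Falling back silently is one design choice; failing fast on an unrecognised value is equally defensible when a typo should stop the application from starting.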
Browser types (BrowserType enum):
import { BrowserType } from '@rafikidota/scoutee/playwright';
BrowserType.CHROMIUM // 'chromium'
BrowserType.FIREFOX // 'firefox'
BrowserType.WEBKIT // 'webkit'
🦊 CamoufoxModule
Stealth browser powered by Camoufox — a hardened Firefox fork designed to bypass bot detection. Uses PlaywrightCrawler under the hood with fingerprint spoofing, GeoIP emulation, WebRTC blocking, and human-like behavior simulation.
import { CamoufoxModule, CamoufoxService } from '@rafikidota/scoutee/camoufox';
Service methods:
// Create a stealth crawler instance
const crawler = await camoufoxService.create(options: PlaywrightCrawlerOptions): Promise<PlaywrightCrawler>
// Get a raw Camoufox browser instance
const browser = await camoufoxService.getBrowser()
Environment variables:
| Variable | Description |
|---|---|
| CRAWLEE_CAMOUFOX_MAX_CONCURRENCY | Maximum parallel browser pages |
| CRAWLEE_CAMOUFOX_MIN_CONCURRENCY | Minimum parallel browser pages |
| CRAWLEE_CAMOUFOX_MAX_REQUEST_RETRIES | Retry count per request |
| CRAWLEE_CAMOUFOX_TIMEOUT_SECS | Navigation and handler timeout (seconds) |
| CRAWLEE_CAMOUFOX_MAX_REQUESTS | Total request cap per run |
| CRAWLEE_CAMOUFOX_INITIAL_PAGE | Starting page number |
| CRAWLEE_CAMOUFOX_HEADLESS | Run browser headless (true or false) |
| CRAWLEE_CAMOUFOX_EXECUTABLE_PATH | Custom Camoufox binary path (optional) |
| CRAWLEE_CAMOUFOX_HANDLE_CLOUDFLARE_CHALLENGE | Auto-solve Cloudflare challenges (true or false) |
| CRAWLEE_CAMOUFOX_USE_INCOGNITO_PAGES | Use incognito context (true or false) |
| CRAWLEE_CAMOUFOX_GEOIP | Enable GeoIP emulation (true or false) |
| CRAWLEE_CAMOUFOX_OS | Spoof OS fingerprint: windows, macos, or linux |
| CRAWLEE_CAMOUFOX_BLOCK_WEBRTC | Block WebRTC leaks (true or false) |
| CRAWLEE_CAMOUFOX_HUMANIZE | Human-like mouse delay multiplier (number) |
| CRAWLEE_CAMOUFOX_BLOCK_IMAGES | Block image loading for speed (true or false) |
| CRAWLEE_CAMOUFOX_ENABLE_CACHE | Enable browser cache (true or false) |
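
The stealth variables mix booleans, a number, an enumerated string, and an optional path. As a rough sketch, they could be collected into a single settings object like this. The function `stealthSettingsFromEnv` and the `StealthSettings` shape are hypothetical; the snake_case keys are modelled on Camoufox's launcher option names as an assumption, and nothing here reflects how Scoutee's `BrowserService` is actually written.

```typescript
type Env = Record<string, string | undefined>;

const CAMOUFOX_OS = ["windows", "macos", "linux"] as const;
type CamoufoxOs = (typeof CAMOUFOX_OS)[number];

// Hypothetical settings shape; key names are an assumption.
interface StealthSettings {
  geoip: boolean;
  block_webrtc: boolean;
  block_images: boolean;
  enable_cache: boolean;
  humanize: number | false;
  os?: CamoufoxOs;
  executable_path?: string;
}

// A flag is on only when the variable is literally "true" (any case).
function flag(env: Env, name: string): boolean {
  return (env[name] || "").trim().toLowerCase() === "true";
}

function stealthSettingsFromEnv(env: Env): StealthSettings {
  const settings: StealthSettings = {
    geoip: flag(env, "CRAWLEE_CAMOUFOX_GEOIP"),
    block_webrtc: flag(env, "CRAWLEE_CAMOUFOX_BLOCK_WEBRTC"),
    block_images: flag(env, "CRAWLEE_CAMOUFOX_BLOCK_IMAGES"),
    enable_cache: flag(env, "CRAWLEE_CAMOUFOX_ENABLE_CACHE"),
    humanize: false,
  };

  // The multiplier is used only when it parses to a positive number.
  const humanize = parseFloat(env["CRAWLEE_CAMOUFOX_HUMANIZE"] || "");
  if (!isNaN(humanize) && humanize > 0) settings.humanize = humanize;

  // The OS is set only when it matches one of the documented values.
  const os = (env["CRAWLEE_CAMOUFOX_OS"] || "").trim().toLowerCase();
  for (const candidate of CAMOUFOX_OS) {
    if (candidate === os) settings.os = candidate;
  }

  // The binary path is optional and omitted when empty.
  const path = (env["CRAWLEE_CAMOUFOX_EXECUTABLE_PATH"] || "").trim();
  if (path !== "") settings.executable_path = path;

  return settings;
}
```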
OS spoof options (CamoufoxOS enum):
import { CamoufoxOS } from '@rafikidota/scoutee/camoufox';
CamoufoxOS.WINDOWS // 'windows'
CamoufoxOS.MACOS // 'macos'
CamoufoxOS.LINUX // 'linux'
⚙️ Environment file example
# --- HTTP ---
CRAWLEE_HTTP_MAX_CONCURRENCY=5
CRAWLEE_HTTP_MIN_CONCURRENCY=1
CRAWLEE_HTTP_MAX_REQUEST_RETRIES=3
CRAWLEE_HTTP_TIMEOUT_SECS=30
CRAWLEE_HTTP_MAX_REQUESTS=100
CRAWLEE_HTTP_INITIAL_PAGE=1
# --- Cheerio ---
CRAWLEE_CHEERIO_MAX_CONCURRENCY=5
CRAWLEE_CHEERIO_MIN_CONCURRENCY=1
CRAWLEE_CHEERIO_MAX_REQUEST_RETRIES=3
CRAWLEE_CHEERIO_TIMEOUT_SECS=30
CRAWLEE_CHEERIO_MAX_REQUESTS=100
CRAWLEE_CHEERIO_INITIAL_PAGE=1
# --- Playwright ---
CRAWLEE_PLAYWRIGHT_BROWSER=chromium
CRAWLEE_PLAYWRIGHT_MAX_CONCURRENCY=3
CRAWLEE_PLAYWRIGHT_MIN_CONCURRENCY=1
CRAWLEE_PLAYWRIGHT_MAX_REQUEST_RETRIES=2
CRAWLEE_PLAYWRIGHT_TIMEOUT_SECS=60
CRAWLEE_PLAYWRIGHT_MAX_REQUESTS=50
CRAWLEE_PLAYWRIGHT_INITIAL_PAGE=1
CRAWLEE_PLAYWRIGHT_HEADLESS=true
CRAWLEE_PLAYWRIGHT_USE_INCOGNITO_PAGES=false
CRAWLEE_PLAYWRIGHT_HANDLE_CLOUDFLARE_CHALLENGE=false
# --- Camoufox ---
CRAWLEE_CAMOUFOX_MAX_CONCURRENCY=2
CRAWLEE_CAMOUFOX_MIN_CONCURRENCY=1
CRAWLEE_CAMOUFOX_MAX_REQUEST_RETRIES=2
CRAWLEE_CAMOUFOX_TIMEOUT_SECS=60
CRAWLEE_CAMOUFOX_MAX_REQUESTS=50
CRAWLEE_CAMOUFOX_INITIAL_PAGE=1
CRAWLEE_CAMOUFOX_HEADLESS=true
CRAWLEE_CAMOUFOX_HANDLE_CLOUDFLARE_CHALLENGE=true
CRAWLEE_CAMOUFOX_USE_INCOGNITO_PAGES=false
CRAWLEE_CAMOUFOX_GEOIP=true
CRAWLEE_CAMOUFOX_OS=linux
CRAWLEE_CAMOUFOX_BLOCK_WEBRTC=true
CRAWLEE_CAMOUFOX_HUMANIZE=1
CRAWLEE_CAMOUFOX_BLOCK_IMAGES=false
CRAWLEE_CAMOUFOX_ENABLE_CACHE=false
🏗️ Architecture overview
@rafikidota/scoutee
├── HttpModule → HttpService (HttpCrawler)
├── CheerioModule → CheerioService (CheerioCrawler)
├── PlaywrightModule → PlaywrightService (PlaywrightCrawler)
│   ├── BrowserService → browser launcher selection
│   ├── ConfigService → env-driven configuration
│   └── HookService → pre/post navigation hooks + logging
└── CamoufoxModule → CamoufoxService (PlaywrightCrawler + Camoufox)
    ├── BrowserService → Camoufox launch options
    ├── ConfigService → env-driven configuration
    └── HookService → pre/post navigation hooks + Cloudflare handling

Every module ships with:
- 📋 ConfigService: reads all settings from @nestjs/config's ConfigService
- 🪝 HookService: injects default pre/post navigation hooks (URL logging, HTTP status logging, Cloudflare challenge handling)
- 🏭 Service: exposes a create() factory that merges default options with any overrides you pass in
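
To make the merge step of `create()` concrete, here is a minimal sketch of how defaults and overrides could be combined. It is an illustration under stated assumptions, not the library's code: `mergeOptions` and `CrawlerOptionsSketch` are hypothetical names, and the choice to concatenate hook arrays (defaults first) instead of letting overrides replace them is an assumption about the intended behaviour.

```typescript
// Simplified hook signature for the sake of the example.
type Hook = (context: { url: string }) => void | Promise<void>;

interface CrawlerOptionsSketch {
  maxConcurrency?: number;
  maxRequestRetries?: number;
  preNavigationHooks?: Hook[];
  postNavigationHooks?: Hook[];
}

function mergeOptions(
  defaults: CrawlerOptionsSketch,
  overrides: CrawlerOptionsSketch,
): CrawlerOptionsSketch {
  return {
    // Scalar options: a key present in overrides wins over the default.
    ...defaults,
    ...overrides,
    // Hook arrays: keep the default hooks and append the caller's hooks after them.
    preNavigationHooks: (defaults.preNavigationHooks || []).concat(
      overrides.preNavigationHooks || [],
    ),
    postNavigationHooks: (defaults.postNavigationHooks || []).concat(
      overrides.postNavigationHooks || [],
    ),
  };
}
```

One caveat of the spread approach: an override key explicitly set to `undefined` still replaces the default, so callers should omit keys they do not want to change.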
📋 Choosing a crawler
| Scenario | Recommended module |
|---|---|
| Fast data extraction, no JS needed | 🌐 HttpModule |
| Static HTML with CSS selectors | 🍋 CheerioModule |
| JavaScript-heavy SPAs | 🎭 PlaywrightModule |
| Anti-bot / Cloudflare protected sites | 🦊 CamoufoxModule |
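
The table above can be read as a decision order: check the most demanding requirement first and fall through to the lightest option. The sketch below encodes that reading; `recommendModule` and the `TargetSite` fields are hypothetical names chosen for illustration, and real sites may call for testing more than one module.

```typescript
type ScouteeModule =
  | "HttpModule"
  | "CheerioModule"
  | "PlaywrightModule"
  | "CamoufoxModule";

// Hypothetical description of the site you want to crawl.
interface TargetSite {
  botProtected: boolean;     // anti-bot or Cloudflare protection in place
  needsJavaScript: boolean;  // content is rendered client-side
  needsHtmlParsing: boolean; // data is extracted with CSS selectors
}

// Pick the lightest module that still satisfies the site's requirements.
function recommendModule(site: TargetSite): ScouteeModule {
  if (site.botProtected) return "CamoufoxModule";
  if (site.needsJavaScript) return "PlaywrightModule";
  if (site.needsHtmlParsing) return "CheerioModule";
  return "HttpModule";
}
```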
🛠️ Development
# Install dependencies
pnpm install
# Build
pnpm run build
# Lint & format
pnpm run lint
pnpm run format
Publishing is automated via GitHub Actions on every v* tag push.
