crawlee-ghost-fetch
v0.7.1
Crawlee BaseHttpClient that delegates every HTTP request to the ghost-fetch unblocker (impit Tier 1 + stealth Chromium fallback). Plugs into CheerioCrawler / HttpCrawler with a one-line swap; SessionPool drives rotation; opt-in false-block recovery for WA
crawlee-ghost-fetch
A Crawlee BaseHttpClient that delegates every HTTP request to the
ghost-fetch unblocker. Drop
it into a CheerioCrawler (or any HttpCrawler) and your crawl gets
ghost-fetch's full unblocking cascade — Tier 1 impit, Tier 2 stealth
Chromium, country-aware locale, sticky sessions — without any of the
plumbing in your actor.
When to use
You're building (or maintaining) an Apify standby actor and:
- The target site has bot protection — Akamai BMP, Cloudflare, DataDome, or similar.
- You want a single sticky session per actor process so the first call pays the browser-tier warmup and every subsequent call rides the same warmed cookie jar through ghost-fetch's fast Tier 1 (impit) path.
- You're happy to keep the rest of your actor (request builders, handlers, extractors, route mapping) the way you have it today — this library only swaps out the HTTP fetcher.
If you're scraping an unprotected site, you don't need this — vanilla
CheerioCrawler is fine.
Install
npm install crawlee-ghost-fetch
crawlee, @crawlee/core, and apify are peer dependencies;
your actor already has them.
You also need a ghost-fetch endpoint to talk to. Either deploy your
own ghost-fetch standby on Apify (see the
ghost-fetch repo) or run it
locally for development. Pass the URL via ghostFetchUrl on the
client config or the GHOST_FETCH_URL env var. There is no default —
the lib throws on construction if neither is set.
Quick-start (single-country)
import { CheerioCrawler } from 'crawlee';
import {
GhostFetchHttpClient,
ghostFetchCrawlerOptions,
ghostFetchPreNavigationHook,
} from 'crawlee-ghost-fetch';
const httpClient = new GhostFetchHttpClient({
name: 'overstock',
defaultCountry: 'US',
});
const crawler = new CheerioCrawler({
keepAlive: true,
httpClient,
requestHandlerTimeoutSecs: 180,
navigationTimeoutSecs: 120,
maxConcurrency: 20,
maxRequestRetries: 3,
...ghostFetchCrawlerOptions({
// Hook composes through the helper arg — see "Crawler config recipe".
preNavigationHooks: [ghostFetchPreNavigationHook],
}),
requestHandler: async ({ $, request }) => {
// your handler
},
});
await crawler.run(['https://www.overstock.com/some/page']);
Multi-country (e.g. kaufland)
When the country is encoded in the URL — TLD, query string, subdomain —
pass a countryFromUrl resolver. The library invokes it per request
and forwards the result to ghost-fetch's country argument:
const KAUFLAND_TLDS = new Set(['DE', 'CZ', 'PL', 'SK', 'AT', 'IT', 'FR']);
const httpClient = new GhostFetchHttpClient({
name: 'kaufland',
defaultCountry: 'DE',
countryFromUrl: (url) => {
try {
const tld = new URL(url).hostname.split('.').pop()?.toUpperCase();
return tld && KAUFLAND_TLDS.has(tld) ? tld : undefined;
} catch {
return undefined;
}
},
});
Returning undefined falls back to defaultCountry.
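The same resolver shape works when the country lives in a subdomain or a query parameter rather than the TLD. A minimal sketch, assuming a hypothetical site that uses `de.example.com`-style subdomains and an optional `?country=` parameter (both are illustrative, not something the library prescribes):

```typescript
// Hypothetical allow-list; replace with the countries your target serves.
const SUPPORTED = new Set(['DE', 'CZ', 'PL']);

// Same signature as countryFromUrl: URL string in, country code or undefined out.
function countryFromSubdomainOrQuery(url: string): string | undefined {
  let parsed: URL;
  try {
    parsed = new URL(url);
  } catch {
    return undefined; // unparseable URL: fall back to defaultCountry
  }
  // 1. An explicit query parameter wins, e.g. ?country=cz
  const fromQuery = parsed.searchParams.get('country')?.toUpperCase();
  if (fromQuery && SUPPORTED.has(fromQuery)) return fromQuery;
  // 2. Otherwise try the left-most hostname label, e.g. de.example.com
  const fromSubdomain = parsed.hostname.split('.')[0]?.toUpperCase();
  return fromSubdomain && SUPPORTED.has(fromSubdomain) ? fromSubdomain : undefined;
}
```

Anything the resolver does not recognise (including `www`) comes back as undefined, so the client's defaultCountry still applies.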
Per-URL warmup and tier hint
Two more optional resolvers, same shape as countryFromUrl:
- warmupFromUrl(url) → string | undefined: returns a URL ghost-fetch should navigate to before fetching the target. Useful for sites whose API endpoints require cookies established by a real HTML page nav. Pass undefined to skip.
- hintFromUrl(url) → 'fast' | 'stealth' | 'residential' | undefined: forces a tier per request. Use 'stealth' when you've measured that Tier 1 always fails for that URL pattern (saves the failed round-trip). Use 'fast' for known-trivial URLs where you want fail-fast behavior. Don't pass 'residential'; it's a no-op in current ghost-fetch.
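As an illustration of the warmup resolver, the sketch below routes API requests through the site's home page first and skips warmup for everything else. The `/api/` path prefix is an assumption about a hypothetical target, not a library convention.

```typescript
// Same shape as warmupFromUrl: return a URL to visit first, or undefined to skip.
function warmupFromUrl(url: string): string | undefined {
  try {
    const parsed = new URL(url);
    // Hypothetical rule: API endpoints need cookies set by a real HTML page.
    if (parsed.pathname.startsWith('/api/')) {
      return `${parsed.origin}/`;
    }
    return undefined; // HTML pages establish their own cookies
  } catch {
    return undefined; // malformed URL: no warmup
  }
}
```

The function would be passed as `warmupFromUrl` on the client config, next to the other resolvers.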
new GhostFetchHttpClient({
name: 'mysite',
defaultCountry: 'US',
hintFromUrl: (url) => (url.includes('/api/heavy/') ? 'stealth' : undefined),
});
Camoufox (Firefox tier)
ghost-fetch's Tier 2 has two browser backends: cloakbrowser (Chromium, the default) and camoufox (anti-detect Firefox). Pick per client or per URL when a target is known to block Chromium fingerprints — Firefox ships a different JA3/JA4 profile and a separate fingerprint stack.
Site-wide Firefox:
new GhostFetchHttpClient({
name: 'datadome-target',
defaultCountry: 'DE',
browser: 'firefox',
fox: {
fox_os: 'windows',
fox_humanize: true,
fox_block_images: true,
},
});
Per-URL routing:
new GhostFetchHttpClient({
name: 'multi-target',
defaultCountry: 'US',
browserFromUrl: (url) =>
url.includes('hard-target.com') ? 'firefox' : undefined,
foxFromUrl: (url) =>
url.includes('hard-target.com')
? { fox_os: 'macos', fox_locale: 'en-US' }
: undefined,
});
fox_* options are Firefox-only and are dropped on the wire when the
resolved browser is not 'firefox' — safe to set cfg.fox defaults
even when most URLs go through Chromium.
Requires a ghost-fetch deployment with the camoufox tier enabled
(server-side browser / fox_* schema). Cookie cascade behaves
identically across backends — SessionPool rotation, sticky sessions
and the treatAsSuccess recovery path are unchanged.
See FoxOptions in src/types.ts for the full param list (OS, locale,
WebGL config, fonts, addons, blocking toggles, raw fox_config).
Crawler config recipe
ghostFetchCrawlerOptions(opts?) returns the Crawlee opinions that pair
with the client, and ghostFetchPreNavigationHook binds the active
Crawlee session id to the ghost-fetch request. Since 0.7.0, both pre/post
navigation hooks compose through the helper arg, not by spread-after:
import {
GhostFetchHttpClient,
ghostFetchCrawlerOptions,
ghostFetchPreNavigationHook,
} from 'crawlee-ghost-fetch';
const crawler = new CheerioCrawler({
httpClient: new GhostFetchHttpClient({ name: 'site', defaultCountry: 'US' }),
...ghostFetchCrawlerOptions({
preNavigationHooks: [ghostFetchPreNavigationHook],
// Optional retire predicate (see below). Receives status +
// headers only — body content is not yet parsed at this point.
// shouldRetire: (_, r) => r.statusCode === 403,
}),
requestHandler,
});
Do not spread caller hooks AFTER ghostFetchCrawlerOptions(). That
overwrites the helper's hooks (including the generated retire hook when
shouldRetire is set). Always pass them through the helper arg.
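The overwrite is plain object-spread behaviour rather than anything specific to Crawlee. The sketch below uses stand-in objects (the hook names are hypothetical) to show why a later `postNavigationHooks` key discards the helper's array instead of merging with it:

```typescript
type Hook = () => void;

const generatedRetireHook: Hook = () => {};
const myHook: Hook = () => {};

// Stand-in for what the helper returns when shouldRetire is set.
const helperOutput = {
  useSessionPool: true,
  postNavigationHooks: [generatedRetireHook],
};

// Wrong: the later key replaces the whole array, so the retire hook is gone.
const overwritten = {
  ...helperOutput,
  postNavigationHooks: [myHook],
};

// What passing hooks through the helper arg achieves: both hooks survive,
// with the generated hook first.
const composed = {
  ...helperOutput,
  postNavigationHooks: [...helperOutput.postNavigationHooks, myHook],
};
```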
The helper expands to:
{
additionalMimeTypes: ['application/octet-stream'],
useSessionPool: true,
persistCookiesPerSession: true,
sessionPoolOptions: { /* sticky-by-default — see below */ },
preNavigationHooks: [...callerHooks],
postNavigationHooks: [...generatedRetireHook?, ...callerHooks],
}
Sticky-by-default session pool
sessionPoolOptions defaults to ONE session per actor process, never
auto-retired on usage / age / 4xx:
| Knob | Default | Why |
|---|---|---|
| maxPoolSize | 1 | All requests share one session → one IP, one cookie jar, one JA4. Hot-path Tier 1 amortizes fully. |
| sessionOptions.maxUsageCount | Number.MAX_SAFE_INTEGER | Don't auto-retire on usage count. |
| sessionOptions.maxErrorScore | Number.MAX_SAFE_INTEGER | Don't auto-retire on error score. |
| sessionOptions.maxAgeSecs | 31_536_000 (1 year) | Finite (Crawlee builds a Date from this; MAX_SAFE_INTEGER overflows). |
| blockedStatusCodes | [] | 4xx is often cookie-rotation noise on WAFs that bind cookies per-response, not a real block. Use shouldRetire for status + header decisions Crawlee's enum can't express. |
Operator overrides via opts.sessionPoolOptions, deep-merged with these
defaults (opts.sessionPoolOptions.sessionOptions merges nested-key-wise,
not replaces).
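To make the merge semantics concrete, here is a minimal sketch of a nested-key-wise merge. It is not the library's implementation: `mergeSessionPoolOptions` and the trimmed-down option types are hypothetical, and only the default values are taken from the table above.

```typescript
interface SessionOptions {
  maxUsageCount?: number;
  maxErrorScore?: number;
  maxAgeSecs?: number;
}
interface PoolOptions {
  maxPoolSize?: number;
  blockedStatusCodes?: number[];
  sessionOptions?: SessionOptions;
}

const DEFAULTS: PoolOptions = {
  maxPoolSize: 1,
  blockedStatusCodes: [],
  sessionOptions: {
    maxUsageCount: Number.MAX_SAFE_INTEGER,
    maxErrorScore: Number.MAX_SAFE_INTEGER,
    maxAgeSecs: 31_536_000,
  },
};

function mergeSessionPoolOptions(overrides: PoolOptions = {}): PoolOptions {
  return {
    ...DEFAULTS,
    ...overrides,
    // Merge the nested object key by key instead of replacing it wholesale.
    sessionOptions: { ...DEFAULTS.sessionOptions, ...overrides.sessionOptions },
  };
}

// Overriding one nested key leaves the other nested defaults in place.
const merged = mergeSessionPoolOptions({
  maxPoolSize: 5,
  sessionOptions: { maxUsageCount: 50 },
});
```

Under this model, raising `maxPoolSize` or capping `maxUsageCount` does not silently reset `maxAgeSecs` or `maxErrorScore`.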
Do not enable retryOnBlocked: true on the crawler. With blockedStatusCodes: [], HttpCrawler.isRequestBlocked() falls back to Crawlee's built-in default [401, 403, 429] when the pool list is empty, which reintroduces automatic retire outside your shouldRetire predicate.
shouldRetire — retire-and-retry on status + headers
Pass a predicate to retire the active session and re-queue the request
when ghost-fetch's response looks bad on signals Crawlee's
blockedStatusCodes enum can't express:
ghostFetchCrawlerOptions({
preNavigationHooks: [ghostFetchPreNavigationHook],
// Predicate sees status code + headers only; the body has not been
// parsed at this point. For body-shape decisions, do them inside
// the request handler and call `ctx.session.retire()` + throw yourself.
shouldRetire: (_ctx, response) => response.statusCode === 403,
})
The predicate's response arg is a RetirableResponse:
interface RetirableResponse {
statusCode?: number;
headers?: Record<string, string | string[] | undefined>;
}
Body parsing happens AFTER postNavigationHooks (where the generated
retire hook lives), so any body-shape check inside shouldRetire would
see undefined / unparsed data. If you need to retire based on parsed
body content, do it inside requestHandler:
requestHandler: async (ctx) => {
if (looksLikeChallenge(ctx.$('body').text())) {
ctx.session?.retire();
throw new Error('challenge body — retiring session');
}
// ...normal extraction
}
Returning true from shouldRetire → the active session is marked
bad, retired, and the request is thrown so Crawlee re-queues it with a
fresh session (new session.id → new ghost-fetch session arg → fresh
exit IP + empty cookie jar).
The generated hook runs before any caller-supplied
postNavigationHooks, so a known-blocked response can't trigger caller
side-effects before the retry throw.
Each shouldRetire: true consumes a maxRequestRetries slot. For
block-heavy sites, bump maxRequestRetries from the default 3 to ~8 so
a temporary streak of retires doesn't kill the request.
shouldRetire v1 only supports retire-and-retry semantics. There is no
"retire but let this response flow to the handler" mode: Crawlee's
session.markGood() call after a successful handler subtracts 0.5 from
errorScore, undoing Session.retire()'s max-errorScore set. The
retire decision would silently evaporate.
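A predicate that uses both status and headers could look like the sketch below. The policy is an assumption chosen for illustration (retire on 403, and on 429 only when the server sends no `Retry-After`), and it assumes header names arrive lower-cased. The `RetirableResponse` shape is copied from above; the first argument is typed as `unknown` because this predicate ignores it.

```typescript
interface RetirableResponse {
  statusCode?: number;
  headers?: Record<string, string | string[] | undefined>;
}

function shouldRetire(_ctx: unknown, response: RetirableResponse): boolean {
  const status = response.statusCode ?? 0;
  if (status === 403) return true; // treat as a hard block
  if (status === 429) {
    // Hypothetical policy: a Retry-After header suggests plain rate limiting,
    // so keep the session; without it, rotate identity.
    return response.headers?.['retry-after'] === undefined;
  }
  return false; // everything else flows through to the handler
}
```

Because every `true` costs one retry slot, a narrow predicate like this keeps the retry budget for responses that really need a new identity.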
What ghostFetchPreNavigationHook does
A one-line hook that copies ctx.session.id into
ctx.request.userData.sid. GhostFetchHttpClient reads it from there
and forwards as ghost-fetch's session arg, binding the Apify
residential exit IP, cookie jar, and (via ghost-fetch 0.4.0's
session-engine recording) JA4 to the Crawlee session lifecycle. Required
for retire-driven rotation to actually change identity downstream.
Without it, the client falls back to a static per-process sessionToken (one IP for the actor's life). That works, but it means Crawlee can't rotate identity on a real block: every retried request hits the same exit IP and the same cookie jar.
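For intuition, an equivalent hook can be sketched in a few lines. This is an illustration based on the description above, not the library's source; `bindSessionId` and the `HookContext` type are hypothetical, pared-down stand-ins for the real hook and for Crawlee's crawling context.

```typescript
interface HookContext {
  session?: { id: string };
  request: { userData: Record<string, unknown> };
}

// Copies the active Crawlee session id to the place the client reads it from.
function bindSessionId(ctx: HookContext): void {
  if (ctx.session) {
    ctx.request.userData['sid'] = ctx.session.id;
  }
}
```

When no session is attached, the sketch leaves `userData` untouched, which corresponds to the static per-process fallback described above.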
Recommended caller-side knobs:
| Option | Recommendation | Why |
|---|---|---|
| maxConcurrency | 20 | Higher and ghost-fetch's session lock will serialize anyway |
| maxRequestRetries | 3 (or 8 with shouldRetire) | One cold-start retry + headroom; bump for retire-heavy sites |
| navigationTimeoutSecs | 120 | Cold-start browser warmup can take ~30-60s |
| requestHandlerTimeoutSecs | 180 | Outer envelope above |
| keepAlive | true | Standby actors are long-lived |
| retryOnBlocked | false (don't set true) | Bypasses our blockedStatusCodes: [] and reintroduces Crawlee's built-in [401, 403, 429] retire |
False-block recovery (treatAsSuccess)
Some WAFs return an error status (typically 403) alongside the
real page body as a deception layer. The wire-level status says
"blocked", but the body contains exactly the data the actor wants —
Product JSON-LD, full listing tiles, etc.
Without intervention this round-trips badly through Crawlee:
- The HTTP client returns a response with statusCode: 403 (the truth from the wire).
- Crawlee's CheerioCrawler checks the status against blockedStatusCodes (default [401, 403, 429]).
- On match, it calls session.retire() and throws before the request handler runs.
- The request retries with a fresh session: fresh exit IP, fresh empty cookie jar, fresh cold-warmup cost. The session that just succeeded (the body was real) and the warm cookie jar tied to it are both discarded.
GhostFetchHttpClient exposes an opt-in predicate to recover from
this:
import { GhostFetchHttpClient } from 'crawlee-ghost-fetch';
new GhostFetchHttpClient({
name: 'site',
defaultCountry: 'US',
treatAsSuccess: (payload) => {
const c = payload.content ?? '';
// Length floor rejects hard-block challenge bodies (~1 KB).
// Endpoint-specific marker rejects WAF blank-shell error pages
// (right outer shape, every field empty).
return c.length > 50_000
&& /"@type":\s*"Product"[^]*?"name":\s*"[^"]+"/.test(c);
},
});
When the predicate returns true for a 4xx/5xx response:
- statusCode is rewritten to 200 so SessionPool keeps the session and the request handler runs against the real body.
- statusMessage becomes OK (upstream <orig>).
- The original upstream status is preserved at response.headers['x-ghost-fetch-upstream-status'] for handlers that care about wire-level truth.
Unset or returning false → ghost-fetch's status flows straight
through (the safe default; legitimate blocks must reach SessionPool
so it can rotate to a fresh IP).
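The rewrite described above can be modelled as a small pure function. This is a sketch of the documented behaviour, not the client's actual code: `applyTreatAsSuccess`, `Payload`, and `ClientResponse` are hypothetical names, and storing the upstream status as a string is an assumption.

```typescript
interface Payload {
  content?: string;
}
interface ClientResponse {
  statusCode: number;
  statusMessage: string;
  headers: Record<string, string>;
}

function applyTreatAsSuccess(
  response: ClientResponse,
  payload: Payload,
  treatAsSuccess?: (payload: Payload) => boolean,
): ClientResponse {
  // Only error statuses are candidates; with no predicate nothing changes.
  if (response.statusCode < 400 || !treatAsSuccess?.(payload)) return response;
  return {
    statusCode: 200,
    statusMessage: `OK (upstream ${response.statusCode})`,
    headers: {
      ...response.headers,
      // Keep the wire-level truth available to handlers.
      'x-ghost-fetch-upstream-status': String(response.statusCode),
    },
  };
}
```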
Ghost-fetch telemetry headers
GhostFetchHttpClient exposes selected ghost-fetch payload metadata on
the Crawlee response headers so request handlers can record how a page was
resolved:
| Header | Source payload field |
|---|---|
| x-ghost-fetch-resolved-via | resolved_via |
| x-ghost-fetch-latency-ms | latency_ms |
| x-ghost-fetch-effective-browser | effective_browser |
| x-ghost-fetch-blocked | blocked |
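If a handler records these values, a small helper keeps the parsing in one place. The sketch below is hypothetical (`readTelemetry` is not part of the library) and assumes the values arrive as strings, as header values usually do, with the latency parsed to a number.

```typescript
type HeaderMap = Record<string, string | string[] | undefined>;

interface GhostFetchTelemetry {
  resolvedVia: string | undefined;
  latencyMs: number | undefined;
  effectiveBrowser: string | undefined;
  blocked: string | undefined;
}

// Header values may be repeated; take the first one.
const first = (v: string | string[] | undefined): string | undefined =>
  Array.isArray(v) ? v[0] : v;

function readTelemetry(headers: HeaderMap): GhostFetchTelemetry {
  const latency = Number(first(headers['x-ghost-fetch-latency-ms']));
  return {
    resolvedVia: first(headers['x-ghost-fetch-resolved-via']),
    // A missing or non-numeric header yields NaN, which is reported as undefined.
    latencyMs: Number.isFinite(latency) ? latency : undefined,
    effectiveBrowser: first(headers['x-ghost-fetch-effective-browser']),
    blocked: first(headers['x-ghost-fetch-blocked']),
  };
}
```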
Example:
crawler.router.addDefaultHandler(async ({ response }) => {
const via = response?.headers['x-ghost-fetch-resolved-via'];
const latency = response?.headers['x-ghost-fetch-latency-ms'];
// Push into dataset metadata, logs, or run metrics.
});
How to design the predicate
Two components, both required:
Body-length floor. Hard-block challenge bodies are tiny: 500 B to 5 KB depending on the WAF. Real pages are 50+ KB. A floor of ~30–50 KB rules out the obvious hard-block class without thinking.
Endpoint-specific success marker. This is the load-bearing check. Length alone false-passes WAF blank-shell error pages, which look like the right shape but render every field empty. Pick a marker that's only present when the data is real:
- Product detail page: Product JSON-LD with a non-empty name. The blank-shell page has "name": "", so "name":\s*"[^"]+" is enough. Don't just match "@type":\s*"Product"; the shell has that too.
- Listing / search results: at least one product card / tile element from your Phase 2 selector. Challenge and shell pages don't render the card grid.
- Detail page with API-loaded data: a bound DOM attribute that only the real-data path produces (data-product-id="…" matching a non-empty value, etc.).
- JSON API: if the API returns application/json, parse it and check for the expected envelope key. Don't string-match; valid API errors often have the same outer shape.
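For the JSON API case, a parse-and-check predicate might look like the sketch below. It assumes `payload.content` holds the raw response body as a string (as in the examples in this README) and a hypothetical success envelope of the form `{ "products": [...] }`; substitute the key your API actually returns.

```typescript
interface Payload {
  content?: string;
}

// Hypothetical envelope: { "products": [ ... ] } on success.
function jsonApiLooksReal(payload: Payload): boolean {
  const raw = payload.content ?? '';
  if (raw.length === 0) return false;
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // challenge pages are HTML, not JSON
  }
  if (typeof parsed !== 'object' || parsed === null) return false;
  const products = (parsed as Record<string, unknown>)['products'];
  // An error envelope or an empty list does not count as real data.
  return Array.isArray(products) && products.length > 0;
}
```

JSON payloads are often far smaller than HTML pages, so the length floor from the HTML case may need to be lowered or dropped for such endpoints.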
When an actor serves multiple endpoints
Ideally you would dispatch on the URL, but the predicate receives the full payload and not the request URL. Either derive the endpoint from your own config, or keep the markers permissive so that any one matching marker counts as success:
treatAsSuccess: (p) => {
const c = p.content ?? '';
if (c.length < 30_000) return false;
if (/"@type":\s*"Product"[^]*?"name":\s*"[^"]+"/.test(c)) return true;
if (/data-tile-id="[^"]+"/.test(c)) return true;
return false;
},
If the actor talks to a mix of WAF'd and unprotected hosts (e.g. main storefront + a CDN-hosted reviews API), the predicate is only consulted when status ≥ 400, so unprotected hosts (which return 200) bypass it naturally.
What this is NOT
Not a substitute for retry policy. Real blocks still need rotation. The predicate is a discriminator between "the body has what we asked for despite the status code" and "the body is a challenge / error page". Get the predicate wrong on the conservative side and a real block flows through to the handler → handler crashes / returns nulls → actor loses a retry budget. Get it wrong on the permissive side and SessionPool hangs on to a dead session → every subsequent request through that session fails.
Not a replacement for the warmup hook. A WAF that demands cookies before serving content still needs warmupFromUrl. The predicate only handles cases where the warmup worked but the WAF is cosmetically masking success.
Not free. The predicate runs on every response. Keep it regex-based and small; don't parse the full DOM.
Environment variables
| Var | Required | Default | Purpose |
|---|---|---|---|
| GHOST_FETCH_URL | yes (or pass ghostFetchUrl on the client) | — | Points at your ghost-fetch deployment, e.g. https://<user>--ghost-fetch.apify.actor. The lib throws on construction if neither this nor the constructor arg is set. |
| APIFY_TOKEN | yes when ghost-fetch is the Apify standby (Apify-hosted actors auto-inject) | — | Apify auth for the standby HTTP frontend in front of ghost-fetch. Set explicitly only for local dev or non-Apify hosting. |
| GHOST_FETCH_SESSION | no | ${name}.{uuid} per process | Pin a session token across process restarts. Rarely useful in production — SessionPool drives rotation. |
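The "config or env, otherwise throw" rule for the endpoint can be sketched as follows. `resolveGhostFetchUrl` is a hypothetical helper, not the library's code, and the precedence shown (config before env) is an assumption; the error text mirrors the message listed under Troubleshooting.

```typescript
interface ClientConfig {
  ghostFetchUrl?: string;
}

function resolveGhostFetchUrl(
  cfg: ClientConfig,
  env: Record<string, string | undefined>,
): string {
  // Assumed precedence: explicit config first, then the environment.
  const url = cfg.ghostFetchUrl ?? env['GHOST_FETCH_URL'];
  if (!url) {
    // There is no default endpoint to fall back to.
    throw new Error('crawlee-ghost-fetch: ghostFetchUrl is required');
  }
  return url;
}
```

In an actor the second argument would typically be `process.env`; taking it as a parameter keeps the sketch easy to test.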
Cold-start behavior
The first request through the client triggers ghost-fetch's browser
tier (cloak_fetch) to mint a validated session cookie set —
typically 20-40s wall clock. Every subsequent request in the same
process reuses that session token, so ghost-fetch routes them through
its Tier 1 impit path with the warmed cookies — typically 1-3s.
For an actor's first user-facing call to succeed in spite of the cold
start, the standard Crawlee retry config (maxRequestRetries: 3)
bridges the gap: if the cold call returns cloak_fetch 403 (Akamai
sometimes rejects the very first request even with valid sensor data),
the retry sees freshly-banked cookies and lands on Tier 1.
Apify deployment notes
The Apify-hosted actor build runs npm install against the public
npm registry — no extra setup. APIFY_TOKEN is auto-injected at
runtime for any Apify-hosted actor. Set GHOST_FETCH_URL on the
actor (Settings → Environment variables) pointing at your ghost-fetch
standby.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| crawlee-ghost-fetch: ghostFetchUrl is required thrown when the client is constructed | Neither ghostFetchUrl config nor GHOST_FETCH_URL env is set | Pass one. There is no default. |
| First call times out (>120s) | Cold-start browser warmup ran into an interactive challenge (Turnstile / hCaptcha) | Try a different country — Apify's residential pools vary; CZ/DE pools often clean for EU sites |
| Every call resolves via cloak_fetch, never escalates to http_impersonate | Tier 1 fingerprint mismatch with the residential exit's geo | Likely a ghost-fetch issue — open a ticket. Confirmed working: locale-aware Accept-Language is set per-country (ghost-fetch v0.1.16+) |
| First call returns 500 from your actor with Request failed: ... | Crawlee aborted on application/octet-stream content-type before handler ran | Confirm you spread ...ghostFetchCrawlerOptions() into your crawler config |
| crawlee-ghost-fetch: APIFY_TOKEN env var (or apifyToken config) is required | No token visible to the actor at runtime | Ensure APIFY_TOKEN is in the actor's runtime env (Apify auto-injects it for Apify-hosted actors, set it manually for local dev) |
| Multi-country actor: every request goes to defaultCountry | countryFromUrl returning undefined for valid URLs | Log inside the resolver. A common bug is checking tld === 'de' (lowercase) when the toUpperCase() comparison expects 'DE' |
| SessionPool keeps retiring sessions even when handlers extract data fine | WAF returns 4xx alongside the real page body; default blockedStatusCodes retires the session before your handler can confirm success | Set treatAsSuccess on the client with a length floor + endpoint-specific marker |
| Cold call rate is fine but the actor's IP-quality lottery never settles — every request pays the cold warmup | Same as above: the warm session that minted real cookies was retired the moment it returned 4xx, so nothing in the pool ever ages | Same fix; verify by inspecting response.headers['x-ghost-fetch-upstream-status'] in the handler — if the rewrite is firing, the same session id should appear across consecutive successful calls |
Migration from an inline ghost-fetch client
If your actor already has a hand-rolled ghost-fetch-client.ts, the
diff is roughly:
- import { GhostFetchHttpClient } from './utils/ghost-fetch-client.js';
+ import {
+ GhostFetchHttpClient,
+ ghostFetchCrawlerOptions,
+ ghostFetchPreNavigationHook,
+ } from 'crawlee-ghost-fetch';
const crawler = new CheerioCrawler({
- httpClient: new GhostFetchHttpClient(),
+ httpClient: new GhostFetchHttpClient({ name: 'overstock', defaultCountry: 'US' }),
- additionalMimeTypes: ['application/octet-stream'],
- useSessionPool: false,
+ ...ghostFetchCrawlerOptions({
+ preNavigationHooks: [ghostFetchPreNavigationHook],
+ }),
...
});
Then delete src/crawler/utils/ghost-fetch-client.ts from your actor.
For multi-country actors that previously used a preNavigationHook to
stamp x-ghost-country onto request headers, drop the hook entirely
— countryFromUrl runs inside the client's sendRequest, no header
round-trip needed.
Versioning
This package follows semver. While it is on the 0.x line, any release
may contain breaking changes; we plan to cut 1.0.0 once the API has
stabilized across at least three production actors.
See also
- ghost-fetch — the unblocker this library talks to.
- Crawlee — the framework this library plugs into.
- Apify standby actors — the deployment shape both reference actors use.
