crawlee-ghost-fetch

v0.7.1

Published

a day ago

Crawlee BaseHttpClient that delegates every HTTP request to the ghost-fetch unblocker (impit Tier 1 + stealth Chromium fallback). Plugs into CheerioCrawler / HttpCrawler with a one-line swap; SessionPool drives rotation; opt-in false-block recovery for WA


europa6

crawlee ghost-fetch mcp scraping unblocker apify

crawlee-ghost-fetch

A Crawlee BaseHttpClient that delegates every HTTP request to the ghost-fetch unblocker. Drop it into a CheerioCrawler (or any HttpCrawler) and your crawl gets ghost-fetch's full unblocking cascade — Tier 1 impit, Tier 2 stealth Chromium, country-aware locale, sticky sessions — without any of the plumbing in your actor.

When to use

You're building (or maintaining) an Apify standby actor and:

The target site has bot protection — Akamai BMP, Cloudflare, DataDome, or similar.
You want a single sticky session per actor process so the first call pays the browser-tier warmup and every subsequent call rides the same warmed cookie jar through ghost-fetch's fast Tier 1 (impit) path.
You're happy to keep the rest of your actor (request builders, handlers, extractors, route mapping) the way you have it today — this library only swaps out the HTTP fetcher.

If you're scraping an unprotected site, you don't need this — vanilla CheerioCrawler is fine.

Install

npm install crawlee-ghost-fetch

crawlee, @crawlee/core, and apify are peer dependencies — your actor already has them.

You also need a ghost-fetch endpoint to talk to. Either deploy your own ghost-fetch standby on Apify (see the ghost-fetch repo) or run it locally for development. Pass the URL via ghostFetchUrl on the client config or the GHOST_FETCH_URL env var. There is no default — the lib throws on construction if neither is set.

Quick-start (single-country)

import { CheerioCrawler } from 'crawlee';
import {
    GhostFetchHttpClient,
    ghostFetchCrawlerOptions,
    ghostFetchPreNavigationHook,
} from 'crawlee-ghost-fetch';

const httpClient = new GhostFetchHttpClient({
    name: 'overstock',
    defaultCountry: 'US',
});

const crawler = new CheerioCrawler({
    keepAlive: true,
    httpClient,
    requestHandlerTimeoutSecs: 180,
    navigationTimeoutSecs: 120,
    maxConcurrency: 20,
    maxRequestRetries: 3,
    ...ghostFetchCrawlerOptions({
        // Hook composes through the helper arg — see "Crawler config recipe".
        preNavigationHooks: [ghostFetchPreNavigationHook],
    }),
    requestHandler: async ({ $, request }) => {
        // your handler
    },
});

await crawler.run(['https://www.overstock.com/some/page']);

Multi-country (e.g. kaufland)

When the country is encoded in the URL — TLD, query string, subdomain — pass a countryFromUrl resolver. The library invokes it per request and forwards the result to ghost-fetch's country argument:

const KAUFLAND_TLDS = new Set(['DE', 'CZ', 'PL', 'SK', 'AT', 'IT', 'FR']);

const httpClient = new GhostFetchHttpClient({
    name: 'kaufland',
    defaultCountry: 'DE',
    countryFromUrl: (url) => {
        try {
            const tld = new URL(url).hostname.split('.').pop()?.toUpperCase();
            return tld && KAUFLAND_TLDS.has(tld) ? tld : undefined;
        } catch {
            return undefined;
        }
    },
});

Returning undefined falls back to defaultCountry.
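The TLD case is shown above; for the query-string and subdomain cases, a resolver might look like the sketch below. The `country` query key, the leading-subdomain convention, and the `SUPPORTED` set are assumptions for illustration, so adapt them to how your target site actually encodes the country.

```typescript
// Hypothetical resolver: the `country` query key and the leading-subdomain
// convention are assumptions, not something ghost-fetch prescribes.
const SUPPORTED = new Set(['DE', 'CZ', 'PL', 'US']);

function countryFromUrl(url: string): string | undefined {
    try {
        const parsed = new URL(url);
        // 1. Query string, e.g. https://shop.example.com/p/1?country=cz
        const fromQuery = parsed.searchParams.get('country')?.toUpperCase();
        if (fromQuery && SUPPORTED.has(fromQuery)) return fromQuery;
        // 2. Leading subdomain label, e.g. https://de.shop.example.com/p/1
        const fromSubdomain = parsed.hostname.split('.')[0]?.toUpperCase();
        if (fromSubdomain && SUPPORTED.has(fromSubdomain)) return fromSubdomain;
        return undefined; // no match: the client uses defaultCountry
    } catch {
        return undefined; // malformed URL: the client uses defaultCountry
    }
}
```

Normalizing to upper case before the set lookup avoids a mismatch between lowercase URL fragments and upper-case country codes.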

Per-URL warmup and tier hint

Two more optional resolvers, same shape as countryFromUrl:

warmupFromUrl(url) → string | undefined — returns a URL ghost-fetch should navigate to before fetching the target. Useful for sites whose API endpoints require cookies established by a real HTML page nav. Pass undefined to skip.
hintFromUrl(url) → 'fast' | 'stealth' | 'residential' | undefined — forces a tier per request. Use 'stealth' when you've measured that Tier 1 always fails for that URL pattern (saves the failed round-trip). Use 'fast' for known-trivial URLs where you want fail-fast behavior. Don't pass 'residential' — it's a no-op in current ghost-fetch.

new GhostFetchHttpClient({
    name: 'mysite',
    defaultCountry: 'US',
    hintFromUrl: (url) => (url.includes('/api/heavy/') ? 'stealth' : undefined),
});
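A `warmupFromUrl` resolver can follow the same pattern. The sketch below assumes that API endpoints live under `/api/` and that the site's home page sets the cookies they need; both are illustrative assumptions, not facts about any particular site.

```typescript
// Hypothetical warmup resolver: the /api/ prefix and the choice of the
// home page as the warmup target are assumptions for illustration.
function warmupFromUrl(url: string): string | undefined {
    try {
        const parsed = new URL(url);
        // Only API calls need a prior HTML navigation in this sketch.
        return parsed.pathname.startsWith('/api/') ? `${parsed.origin}/` : undefined;
    } catch {
        return undefined; // malformed URL: skip warmup
    }
}
```

It would be passed as `warmupFromUrl` in the client config, alongside `hintFromUrl`.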

Camoufox (Firefox tier)

ghost-fetch's Tier 2 has two browser backends: cloakbrowser (Chromium, the default) and camoufox (anti-detect Firefox). Pick per client or per URL when a target is known to block Chromium fingerprints — Firefox ships a different JA3/JA4 profile and a separate fingerprint stack.

Site-wide Firefox:

new GhostFetchHttpClient({
    name: 'datadome-target',
    defaultCountry: 'DE',
    browser: 'firefox',
    fox: {
        fox_os: 'windows',
        fox_humanize: true,
        fox_block_images: true,
    },
});

Per-URL routing:

new GhostFetchHttpClient({
    name: 'multi-target',
    defaultCountry: 'US',
    browserFromUrl: (url) =>
        url.includes('hard-target.com') ? 'firefox' : undefined,
    foxFromUrl: (url) =>
        url.includes('hard-target.com')
            ? { fox_os: 'macos', fox_locale: 'en-US' }
            : undefined,
});

fox_* options are Firefox-only and are dropped on the wire when the resolved browser is not 'firefox' — safe to set cfg.fox defaults even when most URLs go through Chromium.

Requires a ghost-fetch deployment with the camoufox tier enabled (server-side browser / fox_* schema). Cookie cascade behaves identically across backends — SessionPool rotation, sticky sessions and the treatAsSuccess recovery path are unchanged.

See FoxOptions in src/types.ts for the full param list (OS, locale, WebGL config, fonts, addons, blocking toggles, raw fox_config).

Crawler config recipe

ghostFetchCrawlerOptions(opts?) returns the opinionated Crawlee options that pair with the client, and ghostFetchPreNavigationHook binds the active Crawlee session id to the ghost-fetch request. Since 0.7.0, both pre- and post-navigation hooks compose through the helper argument rather than by spreading them after the helper:

import {
    GhostFetchHttpClient,
    ghostFetchCrawlerOptions,
    ghostFetchPreNavigationHook,
} from 'crawlee-ghost-fetch';

const crawler = new CheerioCrawler({
    httpClient: new GhostFetchHttpClient({ name: 'site', defaultCountry: 'US' }),
    ...ghostFetchCrawlerOptions({
        preNavigationHooks: [ghostFetchPreNavigationHook],
        // Optional retire predicate (see below). Receives status +
        // headers only — body content is not yet parsed at this point.
        // shouldRetire: (_, r) => r.statusCode === 403,
    }),
    requestHandler,
});

Do not spread caller hooks AFTER ghostFetchCrawlerOptions(). That overwrites the helper's hooks (including the generated retire hook when shouldRetire is set). Always pass them through the helper arg.
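The reason is plain object-spread semantics: a later key replaces an earlier array wholesale instead of concatenating. The sketch below uses stand-in objects (`helperOptions`, `helperRetireHook`, and `myHook` are hypothetical names), not the real helper output.

```typescript
type Hook = () => void;

const helperRetireHook: Hook = () => {}; // stand-in for the generated retire hook
const myHook: Hook = () => {};           // stand-in for a caller hook

// Stand-in for what the helper returns when shouldRetire is set.
const helperOptions = { postNavigationHooks: [helperRetireHook] };

// Wrong: the later key replaces the helper's array, dropping the retire hook.
const wrong = { ...helperOptions, postNavigationHooks: [myHook] };

// What composing through the helper arg is meant to yield instead.
const right = { postNavigationHooks: [helperRetireHook, myHook] };
```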

The helper expands to:

{
    additionalMimeTypes: ['application/octet-stream'],
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: { /* sticky-by-default — see below */ },
    preNavigationHooks: [...callerHooks],
    postNavigationHooks: [...generatedRetireHook?, ...callerHooks],
}

Sticky-by-default session pool

sessionPoolOptions defaults to ONE session per actor process, never auto-retired on usage / age / 4xx:

| Knob | Default | Why |
|---|---|---|
| maxPoolSize | 1 | All requests share one session → one IP, one cookie jar, one JA4. Hot-path Tier 1 amortizes fully. |
| sessionOptions.maxUsageCount | Number.MAX_SAFE_INTEGER | Don't auto-retire on usage count. |
| sessionOptions.maxErrorScore | Number.MAX_SAFE_INTEGER | Don't auto-retire on error score. |
| sessionOptions.maxAgeSecs | 31_536_000 (1 year) | Finite (Crawlee builds a Date from this; MAX_SAFE_INTEGER overflows). |
| blockedStatusCodes | [] | 4xx is often cookie-rotation noise on WAFs that bind cookies per-response, not a real block. Use shouldRetire for status + header decisions Crawlee's enum can't express. |

Operators can override these via opts.sessionPoolOptions, which is deep-merged with the defaults (opts.sessionPoolOptions.sessionOptions is merged key by key rather than replaced).
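To make the merge semantics concrete, here is a minimal sketch of a key-wise merge over the defaults listed above. `mergePoolOptions` and the two interfaces are hypothetical names written for this example; the library's own merge code may differ.

```typescript
interface SessionOptions {
    maxUsageCount?: number;
    maxErrorScore?: number;
    maxAgeSecs?: number;
}
interface PoolOptions {
    maxPoolSize?: number;
    blockedStatusCodes?: number[];
    sessionOptions?: SessionOptions;
}

// Defaults taken from the table above.
const DEFAULTS: PoolOptions = {
    maxPoolSize: 1,
    blockedStatusCodes: [],
    sessionOptions: {
        maxUsageCount: Number.MAX_SAFE_INTEGER,
        maxErrorScore: Number.MAX_SAFE_INTEGER,
        maxAgeSecs: 31_536_000,
    },
};

// Top-level keys are overridden; sessionOptions is merged key by key.
function mergePoolOptions(overrides: PoolOptions = {}): PoolOptions {
    return {
        ...DEFAULTS,
        ...overrides,
        sessionOptions: { ...DEFAULTS.sessionOptions, ...overrides.sessionOptions },
    };
}

const merged = mergePoolOptions({ maxPoolSize: 5, sessionOptions: { maxUsageCount: 50 } });
```

Here `merged.sessionOptions` keeps the default `maxAgeSecs` and `maxErrorScore` even though only `maxUsageCount` was overridden.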

Do not enable retryOnBlocked: true on the crawler. With blockedStatusCodes: [], HttpCrawler.isRequestBlocked() falls back to Crawlee's built-in default [401, 403, 429] when the pool list is empty — reintroducing automatic retire outside your shouldRetire predicate.

`shouldRetire` — retire-and-retry on status + headers

Pass a predicate to retire the active session and re-queue the request when ghost-fetch's response looks bad on signals Crawlee's blockedStatusCodes enum can't express:

ghostFetchCrawlerOptions({
    preNavigationHooks: [ghostFetchPreNavigationHook],
    // Predicate sees status code + headers only; the body has not been
    // parsed at this point. For body-shape decisions, do them inside
    // the request handler and call `ctx.session.retire()` + throw yourself.
    shouldRetire: (_ctx, response) => response.statusCode === 403,
})

The predicate's response arg is a RetirableResponse:

interface RetirableResponse {
    statusCode?: number;
    headers?: Record<string, string | string[] | undefined>;
}
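A predicate that also inspects headers could look like the sketch below. The `cf-mitigated` header check is an assumption about what your WAF sends on challenge responses; verify the actual header names against real traffic before relying on it.

```typescript
interface RetirableResponse {
    statusCode?: number;
    headers?: Record<string, string | string[] | undefined>;
}

// Hypothetical predicate: retires on 429, or on 403 when a challenge
// header is present. The header name and value are assumptions.
function shouldRetire(_ctx: unknown, response: RetirableResponse): boolean {
    if (response.statusCode === 429) return true;
    const raw = response.headers?.['cf-mitigated'];
    const value = Array.isArray(raw) ? raw[0] : raw; // headers may be repeated
    return response.statusCode === 403 && value === 'challenge';
}
```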

Body parsing happens AFTER postNavigationHooks (where the generated retire hook lives), so any body-shape check inside shouldRetire would see undefined / unparsed data. If you need to retire based on parsed body content, do it inside requestHandler:

requestHandler: async (ctx) => {
    if (looksLikeChallenge(ctx.$('body').text())) {
        ctx.session?.retire();
        throw new Error('challenge body — retiring session');
    }
    // ...normal extraction
}

Returning true from shouldRetire → the active session is marked bad, retired, and the request is thrown so Crawlee re-queues it with a fresh session (new session.id → new ghost-fetch session arg → fresh exit IP + empty cookie jar).

The generated hook runs before any caller-supplied postNavigationHooks, so a known-blocked response can't trigger caller side-effects before the retry throw.

Each time shouldRetire returns true, one maxRequestRetries slot is consumed. For block-heavy sites, raise maxRequestRetries from the default 3 to about 8 so that a temporary streak of retires doesn't kill the request.

shouldRetire v1 only supports retire-and-retry semantics. There is no "retire but let this response flow to the handler" mode: Crawlee's session.markGood() call after a successful handler subtracts 0.5 from errorScore, undoing Session.retire()'s max-errorScore set. The retire decision would silently evaporate.
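The arithmetic can be seen in a toy model of the session bookkeeping. `ToySession` is a simplified stand-in written for this example, not Crawlee's actual Session class, and it uses a small `maxErrorScore` of 3 so the numbers are easy to follow.

```typescript
// Toy model: retire() adds maxErrorScore, markGood() subtracts the decrement,
// and a session counts as blocked while errorScore >= maxErrorScore.
class ToySession {
    errorScore = 0;
    constructor(
        readonly maxErrorScore: number,
        readonly decrement = 0.5,
    ) {}
    retire(): void {
        this.errorScore += this.maxErrorScore;
    }
    markGood(): void {
        if (this.errorScore > 0) this.errorScore -= this.decrement;
    }
    isBlocked(): boolean {
        return this.errorScore >= this.maxErrorScore;
    }
}

const s = new ToySession(3);
s.retire();                              // errorScore is now 3
const blockedAfterRetire = s.isBlocked(); // true
s.markGood();                            // errorScore is now 2.5
const blockedAfterMarkGood = s.isBlocked(); // false: the retire no longer holds
```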

What `ghostFetchPreNavigationHook` does

A one-line hook that copies ctx.session.id into ctx.request.userData.sid. GhostFetchHttpClient reads it from there and forwards it as ghost-fetch's session arg, binding the Apify residential exit IP, cookie jar, and (via ghost-fetch 0.4.0's session-engine recording) JA4 to the Crawlee session lifecycle. Required for retire-driven rotation to actually change identity downstream.

Without it, the client falls back to a static per-process sessionToken (one IP for the actor's life). That works but means Crawlee can't rotate identity on a real block: every retried request hits the same exit IP and the same cookie jar.
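Based on that description, the hook's behavior can be approximated as follows. `HookContext` and `bindSessionId` are hypothetical names with minimal structural types so the sketch stands alone; the packaged hook may differ in detail.

```typescript
// Minimal structural stand-in for Crawlee's crawling context.
interface HookContext {
    session?: { id: string };
    request: { userData: Record<string, unknown> };
}

// Copies the active session id onto the request so the HTTP client can
// forward it as the ghost-fetch session arg.
function bindSessionId(ctx: HookContext): void {
    if (ctx.session) ctx.request.userData.sid = ctx.session.id;
}

const ctx: HookContext = { session: { id: 'session_abc' }, request: { userData: {} } };
bindSessionId(ctx);
```

After a retire, Crawlee hands the retried request a session with a new id, so the value in `userData.sid` changes and the downstream identity changes with it.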

Recommended caller-side knobs:

| Option | Recommendation | Why |
|---|---|---|
| maxConcurrency | 20 | Higher and ghost-fetch's session lock will serialize anyway |
| maxRequestRetries | 3 (or 8 with shouldRetire) | One cold-start retry + headroom; bump for retire-heavy sites |
| navigationTimeoutSecs | 120 | Cold-start browser warmup can take ~30-60s |
| requestHandlerTimeoutSecs | 180 | Outer envelope above |
| keepAlive | true | Standby actors are long-lived |
| retryOnBlocked | false (don't set true) | Bypasses our blockedStatusCodes: [] and reintroduces Crawlee's built-in [401, 403, 429] retire |

False-block recovery (`treatAsSuccess`)

Some WAFs return an error status (typically 403) alongside the real page body as a deception layer. The wire-level status says "blocked", but the body contains exactly the data the actor wants — Product JSON-LD, full listing tiles, etc.

Without intervention this round-trips badly through Crawlee:

1. The HTTP client returns a response with statusCode: 403 (the truth from the wire).
2. Crawlee's CheerioCrawler checks the status against blockedStatusCodes (default [401, 403, 429]).
3. On match, it calls session.retire() and throws before the request handler runs.
4. The request retries with a fresh session: fresh exit IP, fresh empty cookie jar, fresh cold-warmup cost. The session that just succeeded (the body was real) and the warm cookie jar tied to it are both discarded.

GhostFetchHttpClient exposes an opt-in predicate to recover from this:

import { GhostFetchHttpClient } from 'crawlee-ghost-fetch';

new GhostFetchHttpClient({
    name: 'site',
    defaultCountry: 'US',
    treatAsSuccess: (payload) => {
        const c = payload.content ?? '';
        // Length floor rejects hard-block challenge bodies (~1 KB).
        // Endpoint-specific marker rejects WAF blank-shell error pages
        // (right outer shape, every field empty).
        return c.length > 50_000
            && /"@type":\s*"Product"[^]*?"name":\s*"[^"]+"/.test(c);
    },
});

When the predicate returns true for a 4xx/5xx response:

statusCode is rewritten to 200 so SessionPool keeps the session and the request handler runs against the real body.
statusMessage becomes OK (upstream <orig>).
The original upstream status is preserved at response.headers['x-ghost-fetch-upstream-status'] for handlers that care about wire-level truth.

Unset or returning false → ghost-fetch's status flows straight through (the safe default; legitimate blocks must reach SessionPool so it can rotate to a fresh IP).
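The rewrite rules above can be summarized in a small sketch. `applyTreatAsSuccess` is a hypothetical function written for this example and is not the client's internal code; in particular, serializing the upstream status as a string header value is an assumption.

```typescript
interface Payload {
    content?: string;
}
interface RewrittenResponse {
    statusCode: number;
    statusMessage?: string;
    headers: Record<string, string>;
}

// Illustrative only: mirrors the documented behavior for 4xx/5xx responses.
function applyTreatAsSuccess(
    upstreamStatus: number,
    payload: Payload,
    treatAsSuccess?: (p: Payload) => boolean,
): RewrittenResponse {
    // The predicate is only consulted for error statuses.
    if (upstreamStatus >= 400 && treatAsSuccess?.(payload)) {
        return {
            statusCode: 200,
            statusMessage: `OK (upstream ${upstreamStatus})`,
            headers: { 'x-ghost-fetch-upstream-status': String(upstreamStatus) },
        };
    }
    // Unset predicate or a false result: the status flows straight through.
    return { statusCode: upstreamStatus, headers: {} };
}
```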

Ghost-fetch telemetry headers

GhostFetchHttpClient exposes selected ghost-fetch payload metadata on the Crawlee response headers so request handlers can record how a page was resolved:

| Header | Source payload field |
|---|---|
| x-ghost-fetch-resolved-via | resolved_via |
| x-ghost-fetch-latency-ms | latency_ms |
| x-ghost-fetch-effective-browser | effective_browser |
| x-ghost-fetch-blocked | blocked |

Example:

crawler.router.addDefaultHandler(async ({ response }) => {
    const via = response?.headers['x-ghost-fetch-resolved-via'];
    const latency = response?.headers['x-ghost-fetch-latency-ms'];
    // Push into dataset metadata, logs, or run metrics.
});
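Header values arrive as strings (or string arrays), so a small normalizing helper can be convenient before pushing them into metrics. `readGhostFetchTelemetry` is a hypothetical helper, and treating the blocked header as the literal string 'true' is an assumption about how the value is serialized.

```typescript
type HeaderValue = string | string[] | undefined;

// Takes the first value when a header is repeated.
function first(value: HeaderValue): string | undefined {
    return Array.isArray(value) ? value[0] : value;
}

// Hypothetical helper: converts the telemetry headers into a typed record.
function readGhostFetchTelemetry(headers: Record<string, HeaderValue>) {
    const latency = Number(first(headers['x-ghost-fetch-latency-ms']));
    return {
        resolvedVia: first(headers['x-ghost-fetch-resolved-via']),
        latencyMs: Number.isFinite(latency) ? latency : undefined,
        effectiveBrowser: first(headers['x-ghost-fetch-effective-browser']),
        // Assumption: the flag is serialized as the string 'true'.
        blocked: first(headers['x-ghost-fetch-blocked']) === 'true',
    };
}
```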

How to design the predicate

Two components, both required:

1. Body-length floor. Hard-block challenge bodies are tiny: 500 B to 5 KB depending on the WAF. Real pages are 50+ KB. A floor of ~30–50 KB rules out the obvious hard-block class without thinking.
2. Endpoint-specific success marker. This is the load-bearing check. Length alone false-passes WAF blank-shell error pages, which look like the right shape but render every field empty. Pick a marker that's only present when the data is real:
    - Product detail page: Product JSON-LD with a non-empty name. The blank-shell page has "name": "", so "name":\s*"[^"]+" is enough. Don't just match "@type":\s*"Product", because the shell has that too.
    - Listing / search results: at least one product card / tile element from your Phase 2 selector. Challenge and shell pages don't render the card grid.
    - Detail page with API-loaded data: a bound DOM attribute that only the real-data path produces (data-product-id="…" matching a non-empty value, etc.).
    - JSON API: if the API returns application/json, parse it and check for the expected envelope key. Don't string-match, because valid API errors often have the same outer shape.
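For the JSON API case, a parse-then-check marker might look like the sketch below. The envelope shape (`data.items` as a non-empty array) is a hypothetical example; substitute whatever key your API only populates on real data, and combine it with a length floor suited to that endpoint.

```typescript
// Hypothetical envelope: assumes a real response looks like
// { "data": { "items": [ ... ] } } with at least one item.
function jsonApiLooksReal(payload: { content?: string }): boolean {
    const raw = payload.content ?? '';
    if (raw.length === 0) return false;
    try {
        const parsed: unknown = JSON.parse(raw);
        const items = (parsed as { data?: { items?: unknown } } | null)?.data?.items;
        return Array.isArray(items) && items.length > 0;
    } catch {
        return false; // not JSON at all, e.g. an HTML challenge page
    }
}
```

Parsing instead of string-matching means an error body that merely mentions the envelope key, or carries an empty list, does not pass.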

When an actor serves multiple endpoints

Dispatch per endpoint inside the predicate. The predicate receives the full payload but not the request URL, so either pull the URL from your config or keep the markers permissive (any marker satisfies):

treatAsSuccess: (p) => {
    const c = p.content ?? '';
    if (c.length < 30_000) return false;
    if (/"@type":\s*"Product"[^]*?"name":\s*"[^"]+"/.test(c)) return true;
    if (/data-tile-id="[^"]+"/.test(c)) return true;
    return false;
},

If the actor talks to a mix of WAF'd and unprotected hosts (e.g. main storefront + a CDN-hosted reviews API), the predicate is only consulted when status ≥ 400, so unprotected hosts (which return 200) bypass it naturally.

What this is NOT

Not a substitute for retry policy. Real blocks still need rotation. The predicate is a discriminator between "the body has what we asked for despite the status code" and "the body is a challenge / error page". Get the predicate wrong on the conservative side and a real block flows through to the handler → handler crashes / returns nulls → actor loses a retry budget. Get it wrong on the permissive side and SessionPool hangs on to a dead session → every subsequent request through that session fails.
Not a replacement for the warmup hook. A WAF that demands cookies before serving content still needs warmupFromUrl. The predicate only handles cases where the warmup worked but the WAF is cosmetically masking success.
Not free. The predicate runs on every response. Keep it regex-based and small — don't parse the full DOM.

Environment variables

| Var | Required | Default | Purpose |
|---|---|---|---|
| GHOST_FETCH_URL | yes (or pass ghostFetchUrl on the client) | none | Points at your ghost-fetch deployment, e.g. https://<user>--ghost-fetch.apify.actor. The lib throws on construction if neither this nor the constructor arg is set. |
| APIFY_TOKEN | yes when ghost-fetch is the Apify standby (Apify-hosted actors auto-inject) | none | Apify auth for the standby HTTP frontend in front of ghost-fetch. Set explicitly only for local dev or non-Apify hosting. |
| GHOST_FETCH_SESSION | no | ${name}.{uuid} per process | Pin a session token across process restarts. Rarely useful in production, since SessionPool drives rotation. |
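The URL resolution rule can be sketched as a pure function. `resolveGhostFetchUrl` is a hypothetical name, and giving the constructor argument precedence over the env var is an assumption; the README only states that at least one of the two must be set.

```typescript
type Env = Record<string, string | undefined>;

// Illustrative only: config value first (assumed precedence), then the
// env var, otherwise fail fast with the documented error message.
function resolveGhostFetchUrl(configUrl: string | undefined, env: Env): string {
    const url = configUrl ?? env.GHOST_FETCH_URL;
    if (!url) {
        throw new Error('crawlee-ghost-fetch: ghostFetchUrl is required');
    }
    return url;
}
```

In an actor, the second argument would be `process.env`.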

Cold-start behavior

The first request through the client triggers ghost-fetch's browser tier (cloak_fetch) to mint a validated session cookie set — typically 20-40s wall clock. Every subsequent request in the same process reuses that session token, so ghost-fetch routes them through its Tier 1 impit path with the warmed cookies — typically 1-3s.

For an actor's first user-facing call to succeed in spite of the cold start, the standard Crawlee retry config (maxRequestRetries: 3) bridges the gap: if the cold call returns cloak_fetch 403 (Akamai sometimes rejects the very first request even with valid sensor data), the retry sees freshly-banked cookies and lands on Tier 1.

Apify deployment notes

The Apify-hosted actor build runs npm install against the public npm registry — no extra setup. APIFY_TOKEN is auto-injected at runtime for any Apify-hosted actor. Set GHOST_FETCH_URL on the actor (Settings → Environment variables) pointing at your ghost-fetch standby.

Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| crawlee-ghost-fetch: ghostFetchUrl is required thrown on first request | Neither ghostFetchUrl config nor GHOST_FETCH_URL env is set | Pass one. There is no default. |
| First call times out (>120s) | Cold-start browser warmup ran into an interactive challenge (Turnstile / hCaptcha) | Try a different country: Apify's residential pools vary, and CZ/DE pools are often clean for EU sites |
| Every call resolves via cloak_fetch, never drops back to http_impersonate | Tier 1 fingerprint mismatch with the residential exit's geo | Likely a ghost-fetch issue; open a ticket. Confirmed working: locale-aware Accept-Language is set per-country (ghost-fetch v0.1.16+) |
| First call returns 500 from your actor with Request failed: ... | Crawlee aborted on application/octet-stream content-type before handler ran | Confirm you spread ...ghostFetchCrawlerOptions() into your crawler config |
| crawlee-ghost-fetch: APIFY_TOKEN env var (or apifyToken config) is required | No token visible to the actor at runtime | Ensure APIFY_TOKEN is in the actor's runtime env (Apify auto-injects it for Apify-hosted actors, set it manually for local dev) |
| Multi-country actor: every request goes to defaultCountry | countryFromUrl returning undefined for valid URLs | Log inside the resolver; a common bug is checking tld === 'de' (lowercase) when the toUpperCase() comparison expects 'DE' |
| SessionPool keeps retiring sessions even when handlers extract data fine | WAF returns 4xx alongside the real page body; default blockedStatusCodes retires the session before your handler can confirm success | Set treatAsSuccess on the client with a length floor + endpoint-specific marker |
| Cold call rate is fine but the actor's IP-quality lottery never settles, and every request pays the cold warmup | Same as above: the warm session that minted real cookies was retired the moment it returned 4xx, so nothing in the pool ever ages | Same fix; verify by inspecting response.headers['x-ghost-fetch-upstream-status'] in the handler. If the rewrite is firing, the same session id should appear across consecutive successful calls |

Migration from an inline ghost-fetch client

If your actor already has a hand-rolled ghost-fetch-client.ts, the diff is roughly:

- import { GhostFetchHttpClient } from './utils/ghost-fetch-client.js';
+ import {
+     GhostFetchHttpClient,
+     ghostFetchCrawlerOptions,
+     ghostFetchPreNavigationHook,
+ } from 'crawlee-ghost-fetch';

  const crawler = new CheerioCrawler({
-     httpClient: new GhostFetchHttpClient(),
+     httpClient: new GhostFetchHttpClient({ name: 'overstock', defaultCountry: 'US' }),
-     additionalMimeTypes: ['application/octet-stream'],
-     useSessionPool: false,
+     ...ghostFetchCrawlerOptions({
+         preNavigationHooks: [ghostFetchPreNavigationHook],
+     }),
      ...
  });

Then delete src/crawler/utils/ghost-fetch-client.ts from your actor.

For multi-country actors that previously used a preNavigationHook to stamp x-ghost-country onto request headers, drop the hook entirely — countryFromUrl runs inside the client's sendRequest, no header round-trip needed.

Versioning

This package follows semver. While it is on the 0.x line, releases may include breaking changes; we plan to cut 1.0.0 once the API has stabilized across at least three production actors.
