@icjia/forgecrawl
A self-contained MCP server that fetches a URL and returns clean Markdown — Mozilla Readability + Turndown for static pages, headless Chromium (Playwright) for SPAs. No external service. No API key. Stdio in, Markdown out.
Why?
A typical web page is 50–500 KB of HTML. That's tens of thousands of tokens you'd be sending to Claude just to read what's on the page — most of it nav chrome, scripts, styles, and tracking. Forgecrawl runs the same extraction pipeline a human reader-mode tool runs (Mozilla's Readability, the same engine Firefox uses) and hands Claude only the meaningful content as Markdown.
The other half of the win is output discipline at the boundary. Forgecrawl defaults to a summary response — title, excerpt, word count, structured metadata, no body. Claude can decide whether the page is worth pulling in full before paying for it.
You: "Read these 10 articles and tell me which mention WCAG 2.2"
Claude: [calls scrape_url x10 with mode=summary] ← cheap; ~5k tokens total
Claude: "Three look relevant. Pulling those in full."
Claude: [calls scrape_url x3 with mode=markdown] ← only pays for what matters
What it does
- Fetches URLs with built-in `fetch()` and converts the result to Markdown
- Uses Mozilla Readability as the primary extractor; falls back to a cleaned full-body extraction when Readability captures less than 40% of the page text (common on landing pages, gov sites, docs hubs) — see the sketch after this list
- Renders single-page apps via headless Chromium when needed — `render: "auto"` (default) tries static fetch first and falls back automatically; `render: "js"` forces the browser path
- Returns three verbosity tiers — `summary` (no body), `preview` (first ~1k chars), `markdown` (full)
- Plumbs CSS `selectors.include`/`selectors.exclude` through to trim the DOM before Markdown conversion
- Strips `<a>` and `<img>` by default (toggle on with `includeLinks`/`includeImages`)
- Caps output at 50,000 characters — matches the convention in sibling ICJIA servers (`lightcap`, `axecap`)
- Caches results in-memory (5 min TTL, 50 entries) keyed on the full request shape; surfaces `cached: true` so Claude knows when a call was free
- SSRF-protected via CIDR-class IP classification (link-local / unspecified / multicast / reserved always blocked; private/loopback/CGNAT off by default, opt-in via `FORGECRAWL_BLOCK_PRIVATE=1`)
- Detects `sitemap.xml` and reports URL counts for crawl scoping
- Standalone CLI for use outside MCP clients (`forgecrawl scrape`, `forgecrawl sitemap`, `forgecrawl status`)
- Runs as a local MCP server over stdio (no HTTP, no ports, no remote attack surface)
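A minimal sketch of that static-first extraction step, assuming the linkedom + Readability pairing listed under Dependencies. The 40% capture-ratio check is the documented behavior; the function shape and names are illustrative, not forgecrawl's actual internals:

```js
import { Readability } from "@mozilla/readability";
import { parseHTML } from "linkedom";

// Hypothetical helper: the real extractor.js may differ in structure.
function extractWithFallback(html) {
  const { document } = parseHTML(html);
  // Readability mutates the document, so capture the page text first.
  const pageText = (document.body?.textContent ?? "").trim();
  const article = new Readability(document).parse();
  const captured = article?.textContent?.trim().length ?? 0;

  // Fall back to a cleaned full-body extraction when Readability
  // captured less than 40% of the visible page text.
  if (pageText.length === 0 || captured / pageText.length < 0.4) {
    return { content: pageText, extractor: "fallback" };
  }
  return { content: article.textContent.trim(), extractor: "readability" };
}
```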
Installation
Prerequisites
- Node.js >= 20 (check with `node --version`)
- Disk space for Chromium (~200 MB — used for SPA rendering; downloaded automatically by Playwright on first install). Skip with `PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1` if you only need static scraping.
Option 1: npx (recommended, no install needed)
npx -y @icjia/forgecrawl --help
Option 2: Global install
npm install -g @icjia/forgecrawl
Option 3: Clone for development
git clone https://github.com/ICJIA/forgecrawl-mcp.git
cd forgecrawl-mcp
npm install
Setup with Claude Code
Claude Code manages MCP server lifecycle automatically — register once, and Claude Code starts/stops it with each session.
Using npx (recommended)
# All projects (user-level)
claude mcp add forgecrawl -s user -- npx -y @icjia/forgecrawl
# Current project only
claude mcp add forgecrawl -s project -- npx -y @icjia/forgecrawl
Using a local clone
claude mcp add forgecrawl -s user -- node /absolute/path/to/forgecrawl-mcp/src/server.js
Manual config
Edit ~/.claude/settings.json:
{
"mcpServers": {
"forgecrawl": {
"command": "npx",
"args": ["-y", "@icjia/forgecrawl"]
}
}
}
Then run /mcp in Claude Code to verify it's listed, and try:
"Use forgecrawl to summarize https://en.wikipedia.org/wiki/Markdown"
Setup with Cursor
Global
~/.cursor/mcp.json:
{
"mcpServers": {
"forgecrawl": {
"command": "npx",
"args": ["-y", "@icjia/forgecrawl"]
}
}
}
Project-level
.cursor/mcp.json in your project root:
{
"mcpServers": {
"forgecrawl": {
"command": "npx",
"args": ["-y", "@icjia/forgecrawl"]
}
}
}
Restart Cursor after adding the configuration.
Setup with Cline (VS Code)
Cline reads MCP servers from cline_mcp_settings.json. Open the Cline panel → MCP Servers → Edit MCP Settings:
{
"mcpServers": {
"forgecrawl": {
"command": "npx",
"args": ["-y", "@icjia/forgecrawl"]
}
}
}
Setup with Continue.dev
Continue uses ~/.continue/config.json. Add forgecrawl to the mcpServers block:
{
"mcpServers": [
{
"name": "forgecrawl",
"command": "npx",
"args": ["-y", "@icjia/forgecrawl"]
}
]
}
Setup with Windsurf
Windsurf supports MCP servers via Settings → Cascade → MCP. Add an entry:
{
"mcpServers": {
"forgecrawl": {
"command": "npx",
"args": ["-y", "@icjia/forgecrawl"]
}
}
}
Setup with Zed
Zed reads MCP configuration from ~/.config/zed/settings.json:
{
"context_servers": {
"forgecrawl": {
"command": {
"path": "npx",
"args": ["-y", "@icjia/forgecrawl"]
}
}
}
}
Setup with any other MCP client
Forgecrawl works with any MCP client that supports stdio transport. Configure your client to spawn:
npx -y @icjia/forgecrawl
No HTTP ports, no environment variables required.
Troubleshooting
Failed to connect or sh: forgecrawl: command not found
If your MCP client reports "Failed to connect", or running npx -y @icjia/forgecrawl in a terminal prints sh: forgecrawl: command not found, your local npx run cache likely has a stale entry (often seeded by an earlier failed install — e.g. a typo'd package name returning 404). Clear it and retry:
rm -rf ~/.npm/_npx
npx -y @icjia/forgecrawl --help
If that succeeds, restart your MCP client. publish.sh runs the same clear-and-retry as a post-publish smoke test, so a published artifact reaching the registry has already been verified to launch from a clean cache.
Working inside the forgecrawl-mcp repo? Same symptom, different cause: your cwd's package.json has the same name as the requested package, so npx looks for the bin in this repo's node_modules/.bin/ and finds nothing (a package isn't installed as its own dependency). Clearing the npx cache won't help — register the dev MCP via the Local development snippet instead.
First-run timeout (Chromium download)
The postinstall step downloads Chromium (~150 MB) for SPA support. On the very first npx -y @icjia/forgecrawl, this can take longer than the MCP client's startup window and surface as "Failed to connect." Either warm the cache once in a terminal:
npx -y @icjia/forgecrawl --help
…or skip the Chromium download (loses JS-render / SPA support) by setting PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 in the server's environment.
MCP Tools
scrape_url
Fetch a URL and return Markdown. Default mode is summary.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| url | string | (required) | HTTP/HTTPS URL to scrape |
| mode | string | summary | summary \| preview \| markdown |
| maxBodyChars | number | 50,000 | Cap body length (markdown/preview) |
| previewChars | number | 1,000 | Preview length (preview only) |
| selectors.include | string | — | CSS selector to narrow DOM before extraction |
| selectors.exclude | string | — | CSS selector(s) to remove |
| includeLinks | boolean | false | Keep <a> hrefs |
| includeImages | boolean | false | Keep <img> |
| bypassCache | boolean | false | Skip in-memory cache |
| render | string | auto | auto \| static \| js |
| waitUntil | string | domcontentloaded | load \| domcontentloaded \| networkidle (JS render) |
| waitFor | string | — | CSS selector to wait for (JS render) |
| waitMs | number | — | Extra delay after page load (JS render) |
Returns (mode: summary):
{
"url": "https://en.wikipedia.org/wiki/Markdown",
"title": "Markdown - Wikipedia",
"excerpt": "Markdown is a lightweight markup language for creating formatted text…",
"wordCount": 4823,
"metadata": { "language": "en", "site": "Wikipedia", "image": "…" },
"cached": false,
"renderedWith": "static"
}
Returns (mode: markdown): adds a body field with the full Markdown (capped at 50,000 chars; truncated flag set when capped).
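Outside an editor, any stdio MCP client can drive these tools programmatically. A minimal sketch using the @modelcontextprotocol/sdk client (the client name/version strings are arbitrary; confirm the import paths against your SDK version):

```js
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn forgecrawl over stdio, exactly as an MCP-aware editor would.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "@icjia/forgecrawl"],
});
const client = new Client({ name: "forgecrawl-demo", version: "0.0.1" });
await client.connect(transport);

// Cheap triage call first: summary mode returns no body.
const summary = await client.callTool({
  name: "scrape_url",
  arguments: { url: "https://en.wikipedia.org/wiki/Markdown", mode: "summary" },
});
console.log(summary.content);

await client.close();
```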
extract_html
Convert HTML you already have to Markdown — no fetch.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| html | string | (required) | Raw HTML content |
| url | string | — | Optional source URL (metadata + relative-link resolution only) |
| mode | string | summary | Same as scrape_url |
| (other params) | — | — | Same as scrape_url minus the network-only ones |
check_sitemap
Detect /sitemap.xml and report URL count + a sample. Cheap way to scope a crawl before fanning out into scrape_url calls.
| Parameter | Type | Description |
|-----------|------|-------------|
| url | string | Any URL on the target domain — only the origin is used |
Returns:
{
"sitemapUrl": "https://example.com/sitemap.xml",
"urlCount": 142,
"sample": ["https://example.com/...", "..."]
}
get_status
Server + dependency versions and update availability.
forgecrawl status
Server: @icjia/forgecrawl v0.1.0 (latest)
Readability: v0.5.0
Turndown: v7.2.0
Playwright: v1.49.0
linkedom: v0.18.0
cheerio: v1.0.0
Node: v22.22.0
Platform: darwin arm64
Default mode: summary
CLI (standalone usage)
# Summary (no body)
forgecrawl scrape https://example.com
# Full markdown
forgecrawl scrape https://en.wikipedia.org/wiki/Markdown -m markdown
# Preview with selector trim
forgecrawl scrape https://example.com -m preview --selectors-include "main"
# Force JS rendering for an SPA
forgecrawl scrape https://react.dev -r js -m summary
# Auto mode (will fall back to JS if static is too thin)
forgecrawl scrape https://vuejs.org -r auto -m preview
# Convert HTML from stdin
cat page.html | forgecrawl extract-html -m markdown
# Sitemap discovery
forgecrawl sitemap https://example.com
# Versions + update check
forgecrawl status
# Verbose logging
forgecrawl --verbose scrape https://example.com
When run without a subcommand, forgecrawl starts in MCP server mode (stdio transport).
What you'd actually use this for
The point isn't "Claude can read a webpage." Claude already can. The point is Claude can read 30 webpages cheaply, decide which 3 matter, and pull those in full — because forgecrawl gives the model a summary lever, an SPA-capable renderer, and pre-fetch DOM trimming. Concrete workflows:
For developers
Compare framework / library docs without burning context
You: "Compare React Server Components, Solid, Qwik, Astro, and Svelte 5
for an SSR-heavy app. Skim each project's docs site first."
Claude: [scrape_url × 5 with mode='summary'] ← ~2.5k tokens total
Claude: "Astro and Svelte 5 look most aligned with your constraints.
Pulling those in full."
Claude: [scrape_url × 2 with mode='markdown'] ← only pays for what matters
Without forgecrawl: 5 raw HTML pages × ~25k tokens each = a context window already half-spent before the comparison starts.
Find the migration / breaking-changes section in a long changelog
You: "What's the migration path from v17 to v18 of @opentelemetry/api?"
Claude: [scrape_url … mode='preview' previewChars=2000]
Claude: "The migration section is at #migration-from-v17. Re-fetching just that."
Claude: [scrape_url … selectors.include='#migration-from-v17' mode='markdown']
selectors.include runs in Cheerio before Readability — the rest of the changelog never enters the model's context.
Read a GitHub README without git clone
You: "What does the @icjia/lightcap MCP server do? Look at its README."
Claude: [scrape_url https://github.com/ICJIA/lightcap-mcp mode='markdown']
GitHub's rendered README hits Readability cleanly. Static path, no auth, no rate-limit, no clone.
Scrape an SPA admin dashboard or doc site
You: "Read the React docs on Suspense and explain the async render fence."
Claude: [scrape_url https://react.dev/reference/react/Suspense
render='auto' mode='markdown']
Static fetch returns near-empty HTML (React mounts client-side). Auto-fallback fires Playwright, the page mounts, content is extracted, and the response includes renderedWith: 'js' so Claude knows it paid the browser-launch cost.
Convert HTML you already have on disk
cat saved-page.html | forgecrawl extract-html -m markdown
Or via MCP, extract_html lets Claude convert HTML it already pulled (from a logged-in session, a fixture, a CMS export) without paying for the network round-trip. No data leaves your machine.
For designers / content people
Pull copy and structure from a competitor's pricing page
You: "I'm redesigning our pricing page. Pull the structure and copy
from Stripe, Vercel, and Linear's pricing pages."
Claude: [scrape_url × 3 mode='markdown' selectors.include='main']
The selectors.include='main' skips nav and footer. Markdown output preserves the hierarchy of headings, tier names, feature lists, and CTA copy — it's effectively a content audit in a structured form.
Reference content for a redesign — without the design
You: "What's actually IN the IRS Form 1040 instructions PDF page on
irs.gov? Just the content, no nav."
Claude: [scrape_url … selectors.exclude='nav, .breadcrumbs, footer'
mode='markdown' includeLinks=true]
Designers often need the content model of a page — the headings, the hierarchy, the cross-references. forgecrawl with includeLinks: true gives you exactly that as Markdown.
For technical writers / researchers
Build a literature review across many sources
You: "I'm writing a brief on procedural justice in pretrial release.
Summarize each of these 12 academic and policy sources."
Claude: [scrape_url × 12 mode='summary'] ← ~6k tokens
Claude: [composes the brief, citing each by URL + author + date from the
structured metadata field]
The metadata block on every response carries author, published, site, language — Claude has citation data without you handing it over manually.
Track policy changes across jurisdictions
You: "Pull the current bail-reform statute language from the AOICs of
Illinois, New Jersey, and California."
Claude: [scrape_url × 3 mode='markdown' selectors.include='article, main']
Claude: [diffs the statutes, flags the structural differences]
For accessibility auditors
Pair forgecrawl with axecap for content + a11y audits
You: "Audit the about page on icjia.illinois.gov for accessibility AND
check whether the page actually says what it claims to."
Claude: [axecap.audit_url … ] ← what's broken
Claude: [scrape_url …] ← what the page actually says
Claude: [composes report combining a11y findings with content critique]
Most a11y tools tell you that a heading is wrong, not what the heading says. forgecrawl supplies the content half of that picture.
Why an MCP server, not "just have Claude fetch the URL"
| Concern | Without forgecrawl | With forgecrawl |
|---------|-------------------|-----------------|
| 30 URLs to triage | 30 × ~25k tokens raw HTML | 30 × ~400 tokens summary, then drill into 2-3 |
| SPA / JS-rendered page | Returns ~0 content | Auto-fallback to headless Chromium |
| Internal corp page | Fetch leaves through whatever the model's network is | Local fetch, never leaves your host |
| AWS / GCP / Azure metadata endpoints | No SSRF guarantees | Hard-blocked at every redirect hop |
| Bidi / Unicode-tag prompt-injection | Page content reaches model verbatim | Stripped before output |
Token economy
The point of the design is to never load more than the model needs.
| Mode | Typical output | Use when |
|------|---------------|----------|
| summary | 200–500 tokens | "Is this page about X?" / triaging multiple URLs |
| preview | ~1,500 tokens | Need a snippet but not the full text |
| markdown | up to ~13,000 tokens | Doing real work with the page content |
Compression strategies in play:
- Mode tiers — caller picks scope; default is `summary` (sketched below)
- Pre-fetch trimming — `selectors.include`/`exclude` runs before extraction, shrinking the DOM that ever sees Readability
- Frontmatter as structured field — page metadata is returned as a JSON object, not embedded in the Markdown body, so the caller never pays tokens for both
- Default link/image stripping — Turndown rules drop `<a>` hrefs and `<img>` URLs unless explicitly requested
- Body cap at 50k chars — matches axecap/lightcap; truncation is reported in the response
- In-memory cache — repeat calls (same URL + same options) return instantly with `cached: true`
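A sketch of what that tiering looks like in code. The response fields match the shapes documented above, but the function itself is illustrative, not the package's actual compress.js:

```js
// Illustrative mode shaper (hypothetical; mirrors the documented tiers).
function applyMode(page, { mode = "summary", previewChars = 1000, maxBodyChars = 50000 } = {}) {
  const base = {
    url: page.url,
    title: page.title,
    excerpt: page.excerpt,
    wordCount: page.wordCount,
    metadata: page.metadata, // structured; never duplicated in the body
  };
  if (mode === "summary") return base; // no body at all
  if (mode === "preview") {
    return { ...base, body: page.markdown.slice(0, previewChars) };
  }
  const truncated = page.markdown.length > maxBodyChars; // report the cap
  return { ...base, body: page.markdown.slice(0, maxBodyChars), truncated };
}
```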
Configuration (env vars, all optional)
| Variable | Default | What it does |
|----------|---------|--------------|
| FORGECRAWL_DEFAULT_MODE | summary | Default response tier |
| FORGECRAWL_PREVIEW_CHARS | 1000 | Default previewChars |
| FORGECRAWL_MAX_BODY_CHARS | 200000 | Hard ceiling on maxBodyChars |
| FORGECRAWL_TIMEOUT | 30000 | HTTP fetch timeout (ms) |
| FORGECRAWL_MAX_REDIRECTS | 10 | Redirect cap |
| FORGECRAWL_USER_AGENT | forgecrawl-mcp/<version> | UA string |
| FORGECRAWL_MAX_HTML_BYTES | 5242880 | Body byte cap (5 MB) |
| FORGECRAWL_CACHE_TTL | 300000 | Cache entry TTL (ms) |
| FORGECRAWL_CACHE_MAX | 50 | Cache size cap |
| FORGECRAWL_AUTO_MIN_WORDS | 50 | Auto-fallback threshold (words) |
| FORGECRAWL_AUTO_MIN_CHARS | 200 | Auto-fallback threshold (body chars) |
| FORGECRAWL_NAV_TIMEOUT | 30000 | Playwright navigation timeout |
| FORGECRAWL_WAITFOR_MAX | 15000 | Max delay for waitFor/waitMs |
| FORGECRAWL_BROWSER_IDLE | 60000 | Close Chromium after this idle period (ms) |
| FORGECRAWL_MAX_CONCURRENT_PAGES | 4 | Cap on inflight Chromium contexts |
| FORGECRAWL_VIEWPORT_WIDTH | 1280 | Browser viewport width |
| FORGECRAWL_VIEWPORT_HEIGHT | 800 | Browser viewport height |
| FORGECRAWL_EXTRACT_TIMEOUT | 10000 | Reserved for the planned worker-thread extraction timeout (not yet enforced) |
| FORGECRAWL_BLOCK_PRIVATE | (off) | Set to 1 to block RFC1918 / loopback / CGNAT. Recommended for any non-localhost-dev deployment. |
| PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD | (off) | Set to 1 to skip Chromium install |
| PLAYWRIGHT_DOWNLOAD_HOST | (default mirror) | Use a private Chromium mirror |
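To set these for an MCP-managed server, most of the JSON configs shown above accept an env block alongside command/args. The values below are illustrative; confirm env support in your client:

```json
{
  "mcpServers": {
    "forgecrawl": {
      "command": "npx",
      "args": ["-y", "@icjia/forgecrawl"],
      "env": {
        "FORGECRAWL_BLOCK_PRIVATE": "1",
        "FORGECRAWL_DEFAULT_MODE": "summary"
      }
    }
  }
}
```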
Testing
npm test
Covers urlValidate (CIDR matrix, IPv4-mapped IPv6, link-local, multicast), sanitizeError, cache (TTL + LRU), converter (link/image strip, GFM tables, fenced code), extractor (Readability + fallback + selector trim), compress.applyMode (all three tiers + truncation), and fetcher (loopback HTTP server, redirects, content-type rejection, sitemap fetch).
Local development
No build step. Plain JavaScript with ES modules. Edit source, restart Claude Code, repeat.
git clone https://github.com/ICJIA/forgecrawl-mcp.git
cd forgecrawl-mcp
npm install
claude mcp add forgecrawl -s user -- node $(pwd)/src/server.js
# restart Claude Code, then ask it to scrape something
Don't register the MCP as npx -y @icjia/forgecrawl while working inside this repo. With cwd in the project, npx sees the matching local package.json, tries to spawn forgecrawl from node_modules/.bin/ — which doesn't exist (a package isn't installed as its own dep) — and exits 127 with sh: forgecrawl: command not found. The node $(pwd)/src/server.js form above sidesteps the collision and runs your working tree, so edits take effect on the next MCP restart.
Publishing to npm
This package follows the ICJIA publish convention:
# First-time publish (uses current version in package.json):
./publish.sh
# Subsequent releases — bump and publish:
./publish.sh patch # 0.1.0 → 0.1.1
./publish.sh minor # 0.1.1 → 0.2.0
./publish.sh major # 0.2.0 → 1.0.0
# Dry run only:
./publish.sh --dry-run
The script:
- Verifies you're logged in to npm (`npm whoami`); prompts `npm login` if not
- Verifies a clean git working tree
- Detects whether `@icjia/forgecrawl` exists on npm — first-time publish uses `--access public`
- Bumps the version with `npm version <bump>`
- Runs `npm publish --dry-run` and prompts for confirmation
- Publishes
- Commits the version bump, tags `vX.Y.Z`, and pushes (with tags)
Aborts cleanly if anything fails or you say no at the prompt.
Architecture
src/
├── server.js ........ MCP server: registers 4 tools, stdio transport
├── cli.js ........... Dual-mode entry: subcommand → CLI, otherwise → server
├── config.js ........ CONFIG (env-driven) + log()/setVerbosity()
├── scraper.js ....... Pipeline orchestration: validate → cache → fetch (static or JS) → extract → convert → finalize
├── fetcher.js ....... fetch() with manual redirect handling + per-hop SSRF re-validation; sitemap helper
├── browser.js ....... Lazy Playwright Chromium lifecycle (single instance, idle close, withPage helper)
├── extractor.js ..... Readability primary, Cheerio-based fallback, selector pre-filter, metadata
├── converter.js ..... Turndown + GFM, link/image strip
├── compress.js ...... applyMode (summary/preview/markdown), truncate, buildMetadata, sanitize
├── cache.js ......... In-memory LRU with TTL
├── urlValidate.js ... SSRF: CIDR classification, DNS resolution, scheme allowlist
└── sanitizeError.js . KNOWN_ERRORS allowlist + connection/timeout/DNS pattern matching

| File | Role |
|------|------|
| server.js | MCP init, Zod schemas for 4 tools, request routing, error handling |
| scraper.js | The core pipeline. scrape(), extractFromHtml(), checkSitemap() |
| fetcher.js | HTTP fetch with manual redirect handling. Each hop re-validates. Body byte cap. |
| browser.js | Single shared Chromium, lazy-launched, closed on idle. withPage(fn) helper. |
| extractor.js | Readability primary; sparse-content fallback at 40% capture ratio. |
| converter.js | Turndown + GFM. Strips links/images by default. |
| compress.js | Mode tiers + char ceilings + structured metadata. |
| urlValidate.js | CIDR-based IP classification. Always blocks link-local/unspecified/multicast/reserved. Loopback/private/CGNAT opt-in via FORGECRAWL_BLOCK_PRIVATE=1. |
| sanitizeError.js | Known-safe messages pass through; everything else maps to a generic message. Stack traces never leak. |
| cache.js | Map-based LRU keyed on a stable JSON of the request shape. |
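For a concrete picture of cache.js's contract, here is a minimal sketch of a Map-based LRU with TTL keyed on a stable JSON of the request shape. The 5-minute TTL and 50-entry cap come from the docs above; the implementation is illustrative:

```js
const TTL_MS = 5 * 60 * 1000; // per FORGECRAWL_CACHE_TTL default
const MAX_ENTRIES = 50;       // per FORGECRAWL_CACHE_MAX default
const cache = new Map();

// Stable stringify: sorts object keys recursively so the same request
// shape always produces the same cache key regardless of key order.
const stableKey = (v) =>
  v && typeof v === "object" && !Array.isArray(v)
    ? "{" + Object.keys(v).sort()
        .map((k) => JSON.stringify(k) + ":" + stableKey(v[k])).join(",") + "}"
    : JSON.stringify(v);

function get(request) {
  const key = stableKey(request);
  const hit = cache.get(key);
  if (!hit) return undefined;
  if (Date.now() - hit.at > TTL_MS) { cache.delete(key); return undefined; }
  cache.delete(key); cache.set(key, hit); // refresh LRU position
  return hit.value;
}

function set(request, value) {
  const key = stableKey(request);
  if (!cache.has(key) && cache.size >= MAX_ENTRIES) {
    cache.delete(cache.keys().next().value); // evict least recently used
  }
  cache.set(key, { value, at: Date.now() });
}
```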
Dependencies
| Package | Purpose |
|---------|---------|
| @modelcontextprotocol/sdk | MCP SDK (stdio transport, tool registration) |
| @mozilla/readability | Reader-mode extraction (the same engine Firefox uses) |
| linkedom | Lightweight DOM that pairs cleanly with Readability on the server |
| cheerio | Used for selector trimming and the fallback extraction path |
| turndown + turndown-plugin-gfm | HTML → Markdown |
| playwright | Headless Chromium for SPA rendering |
| commander | CLI parsing |
| zod | Schema validation for MCP tool parameters |
Security
Forgecrawl runs locally over stdio — no network listener, no ports, no remote attack surface. The hardening below addresses misuse via prompt injection and SSRF.
A pre-publish red/blue team audit (see CHANGELOG.md → 0.1.0 → Security) identified 0 critical, 3 high, 4 medium, and 5 low findings. Highs and mediums are fixed; relevant lows are fixed; remaining items are documented in the changelog. npm audit: 0 vulnerabilities.
Threat model
The MCP server is local-only (stdio), so the threats are:
- A page the model is asked to scrape is treated as untrusted: it can redirect to internal addresses, embed `<img>` pointing at metadata endpoints (when JS-rendered), or stuff invisible Unicode into its content trying to influence the model.
- A URL the model is given via tool-call may be attacker-influenced (prompt-injected or sourced from an untrusted document).
- Local users with `$PATH` write access are out of scope (they already own the host).
SSRF prevention
- Scheme allowlist. Only `http:` and `https:` URLs are accepted, in `scrape_url`, `extract_html`'s optional `url`, and `check_sitemap`.
- Hostname blocklist. AWS metadata (`169.254.169.254`), GCP (`metadata.google.internal`), Azure (`metadata.azure.com`), and `0.0.0.0` are blocked by name before DNS even runs.
- CIDR-based IP classification. Hostnames are resolved via DNS, every returned address is normalized (IPv4-mapped IPv6 unwrapped, including the compressed hex form `::ffff:7f00:1`) and classified. Always-blocked categories: link-local (`169.254.0.0/16` IPv4 + `fe80::/10` IPv6, catches all AWS IMDS addresses), unspecified (`0.0.0.0`, `::`), multicast (`224.0.0.0/4`, `ff00::/8`), reserved. See the sketch after this list.
- All-records lookup. The validator examines every A/AAAA record returned by `dns.lookup` and rejects if any is non-public — closes a multi-A rebinding-style bypass where a single-record check could miss the private one.
- Opt-in private/loopback blocking. `FORGECRAWL_BLOCK_PRIVATE=1` blocks RFC1918, CGNAT (`100.64.0.0/10`), loopback (`127/8`, `::1`), and IPv6 ULA (`fc00::/7`). Off by default so the dev workflow keeps working; recommended on for any non-localhost-dev deployment.
- Per-hop redirect re-validation, both code paths. `fetchPage` (static) and `fetchText` (sitemap) both use `redirect: 'manual'` and run every `Location:` target through the full validator (with fresh DNS) before the next request.
- Subresource and redirect re-validation in JS render. When Playwright is used, a `page.route('**/*')` handler runs `validateUrl` on every request URL the page makes — documents, redirects, and any subresource. Non-essential resource types (images, fonts, media, stylesheets) are blocked outright; the scraper only consumes the document tree.
- Fail-closed DNS. Resolution failures block the request rather than allow.
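A compressed sketch of that validation flow, using Node's built-in net.BlockList and an all-records DNS lookup. The CIDR ranges mirror the always-blocked list above; the real urlValidate.js also unwraps IPv4-mapped IPv6 and handles the opt-in private ranges, which this sketch omits:

```js
import { BlockList } from "node:net";
import { lookup } from "node:dns/promises";

const blocked = new BlockList();
blocked.addSubnet("169.254.0.0", 16, "ipv4"); // link-local (AWS IMDS lives here)
blocked.addSubnet("224.0.0.0", 4, "ipv4");    // multicast
blocked.addAddress("0.0.0.0", "ipv4");        // unspecified
blocked.addSubnet("fe80::", 10, "ipv6");      // IPv6 link-local
blocked.addSubnet("ff00::", 8, "ipv6");       // IPv6 multicast
blocked.addAddress("::", "ipv6");             // IPv6 unspecified

async function assertPublicTarget(url) {
  const { protocol, hostname } = new URL(url);
  if (protocol !== "http:" && protocol !== "https:") {
    throw new Error("scheme not allowed");
  }
  let records;
  try {
    records = await lookup(hostname, { all: true }); // every A/AAAA record
  } catch {
    throw new Error("DNS resolution failed; blocking (fail closed)");
  }
  for (const { address, family } of records) {
    if (blocked.check(address, family === 6 ? "ipv6" : "ipv4")) {
      throw new Error(`resolves to blocked address ${address}`);
    }
  }
}
```

Run a check like this on the initial URL and again on every redirect Location: target, which is the per-hop re-validation described above.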
Prompt-injection prevention
- Sanitizer covers the smuggling channels. Output strings are stripped of: C0/C1 controls, soft hyphen, combining grapheme joiner, Arabic letter mark, zero-width chars and LTR/RTL marks (U+200B–U+200F), line/paragraph separators, bidi embedding/override (U+202A–U+202E, the "Trojan Source" range), bidi isolate / word joiner / function-application (U+2060–U+206F), variation selectors (U+FE00–U+FE0F), BOM, and Unicode tag characters (U+E0000–U+E007F) — the active LLM-jailbreak smuggling channel. See the sketch after this list.
- Body sanitization. A second variant of the sanitizer is applied to the body markdown in `preview` and `markdown` mode responses (preserving `\n`, `\t`, `\r` so structure survives) — page body content cannot smuggle invisible instructions to the model.
- Char ceiling. Total output is capped at 50,000 chars (with truncation note) so a hostile page cannot inflate token usage.
- Safe URL emission. Even with `includeLinks: true` / `includeImages: true`, links and images whose href/src is not `http:` or `https:` are stripped from Markdown output. `javascript:`, `data:`, `file:`, and any other scheme cannot reach the response body.
- No raw HTML to the model. Even in `mode: "markdown"`, the response is post-extraction Markdown — not the raw HTML the page served. Nothing unmediated.
Postinstall hardening
scripts/postinstall.mjs resolves Playwright's CLI through Node's module resolver (require.resolve('playwright/cli.js')) rather than $PATH — this closes the shadowed-binary hijack vector. Spawns via process.execPath. Honors PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1. Emits actionable error messages on failure rather than silent broken installs.
Resource limits
| Resource | Limit | Enforced by |
|----------|-------|-------------|
| URL length | 2,048 chars | Zod schema |
| HTML body | 5 MB | fetcher.js (streaming with byte counter) |
| HTTP timeout | 30 s | fetcher.js (AbortController) |
| Redirects | 10 | fetcher.js (manual loop) |
| Browser navigation timeout | 30 s | browser.js / scraper.js |
| Inflight Chromium contexts | 4 | browser.js (semaphore) |
| Browser idle close | 60 s | browser.js (bumpIdleTimer) |
| waitFor / waitMs | 15 s | Zod schema + scraper enforcement |
| Output chars | 50,000 | compress.js |
| Cache entries | 50 | cache.js (LRU eviction) |
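The byte cap and timeout rows compose naturally in one streaming fetch. A sketch under the limits above (illustrative; not the actual fetcher.js):

```js
const MAX_BYTES = 5 * 1024 * 1024; // FORGECRAWL_MAX_HTML_BYTES default
const TIMEOUT_MS = 30_000;         // FORGECRAWL_TIMEOUT default

async function fetchCapped(url) {
  const res = await fetch(url, {
    redirect: "manual",                      // caller re-validates each hop
    signal: AbortSignal.timeout(TIMEOUT_MS), // hard 30 s ceiling
  });
  const reader = res.body.getReader();
  const chunks = [];
  let received = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    received += value.byteLength;
    if (received > MAX_BYTES) {
      await reader.cancel();                 // stop mid-stream at the cap
      throw new Error("response body exceeds 5 MB cap");
    }
    chunks.push(value);
  }
  return Buffer.concat(chunks).toString("utf8");
}
```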
License
MIT. See LICENSE.
