pi-webaio
All-in-one web access tools for pi with search, fetch, crawl, extraction, anti-bot TLS fingerprinting, and intelligent resilience.
Installation
pi install npm:pi-webaio

Or from git:

pi install git:github.com/apmantza/pi-webaio

Tools
| Tool | Description |
| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| aio-websearch | Search the web using DuckDuckGo, Brave, and Google in parallel (no API keys required). Returns compact results with title, URL, and snippet. 7s cap — returns whatever is ready. Google runs via headless Chrome CDP (auto-launched). 10-minute cache. |
| aio-webfetch | Fetch a single URL (or batch of URLs) and convert to markdown with anti-bot TLS fingerprinting. Long content is AI-summarized via Google AI Mode; full file always saved. Detects PDFs, GitHub repos, and Next.js RSC. Supports auto escalation. |
| aio-webcontent | Retrieve previously fetched content from session storage by URL. Returns full untruncated content — no data loss. |
| aio-webmap | Discovery-only tool — finds pages via robots.txt, sitemaps, navigation links, and llms.txt without fetching content. Returns structured URL list. |
| aio-webresult | Retrieve a previously fetched result by persistent response ID. Survives restarts. Shows recent results if ID not found. |
| aio-webpull | Pull any public website or docs site into local markdown files with anti-bot TLS fingerprinting. Discovers pages via sitemap, navigation links, or crawling. Rewrites internal links to relative .md paths. Supports auto escalation and context package compilation. |
Tool Parameters
aio-websearch
| Parameter | Type | Default | Description |
| --------- | --------- | ------- | --------------------------------------------------------------------------------- |
| query | string | — | Search query (e.g. 'React Server Components RFC') |
| max | number | 10 | Max results per engine. Up to 25 total after dedup across all engines. |
| google | boolean | true | Also search Google via headless Chrome CDP. Set to false to use only DDG/Brave. |
aio-webfetch
| Parameter | Type | Default | Description |
| ----------------- | ---------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------- |
| url | string | — | Single URL to fetch. Use either url or urls, not both. |
| urls | string[] | — | Multiple URLs to fetch in parallel. Use either url or urls, not both. |
| out | string | auto-derived | Output file path under temp (for single url only) |
| mode | string | auto | Scrape mode: auto (escalates), fast, fingerprint, or browser |
| browser | string | latest | Browser profile for TLS fingerprinting. Auto-selects latest Chrome. Options: chrome_145, firefox_147, safari_26, edge_145 |
| os | string | windows | OS profile for fingerprinting. Options: windows, macos, linux, android, ios |
| proxy | string | — | Proxy URL (http://user:pass@host:port or socks5://host:port). Supports HTTP, HTTPS, SOCKS5. |
| cacheTtlSeconds | number | — | Opt-in cache TTL in seconds. Omit for fresh fetches. |
| compile | boolean | false | Compile batch results into a single context package |
| prune | number | — | Prune markdown to token budget (e.g. 3000) |
| interactive | boolean | false | Extract interactive elements as numbered refs |
| start_index | number | 0 | Return content starting from this character index (0-based). Use with max_length for pagination. |
| max_length | number | unlimited | Maximum characters to return. Use with start_index for pagination. |
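For long pages, `start_index` and `max_length` can be combined to page through content. A rough sketch of two consecutive argument sets (only the parameter names come from the table above; the surrounding object shape is illustrative, not the exact tool-call format):

```typescript
// Illustrative only: paging a long fetch in 8,000-character windows.
const page1 = {
  url: "https://example.com/very-long-article",
  max_length: 8000, // return at most 8,000 characters
  start_index: 0,   // first window
};

const page2 = {
  url: "https://example.com/very-long-article",
  max_length: 8000,
  start_index: 8000, // continue where page1 ended
};
```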
aio-webcontent
| Parameter | Type | Default | Description |
| --------- | -------- | ------- | --------------------------------- |
| url | string | — | URL of previously fetched content |
aio-webmap
| Parameter | Type | Default | Description |
| --------- | -------- | --------- | -------------------------------------- |
| url | string | — | URL to discover pages for |
| max | number | 100 | Max URLs to discover |
| browser | string | latest | Browser profile for TLS fingerprinting |
| os | string | windows | OS profile for fingerprinting |
aio-webresult
| Parameter | Type | Default | Description |
| --------- | -------- | ------- | ----------------------------------------- |
| id | string | — | Response ID from a previous webfetch call |
aio-webpull
| Parameter | Type | Default | Description |
| --------- | --------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------- |
| url | string | — | URL to pull (e.g. https://docs.example.com) |
| out | string | <hostname> | Output directory under temp |
| max | number | 100 | Max pages to pull |
| mode | string | auto | Scrape mode: auto (escalates), fast, fingerprint, or browser |
| browser | string | latest | Browser profile for TLS fingerprinting. Auto-selects latest Chrome. Options: chrome_145, firefox_147, safari_26, edge_145 |
| os | string | windows | OS profile for fingerprinting. Options: windows, macos, linux, android, ios |
| proxy | string | — | Proxy URL (http://user:pass@host:port or socks5://host:port). Supports HTTP, HTTPS, SOCKS5. |
| compile | boolean | false | Compile pulled pages into a single context package |
Features
Fetching & Extraction
- Anti-bot TLS fingerprinting — `wreq-js` with dynamic browser profiles (auto-selects latest Chrome, with fallbacks to `firefox_147`, `safari_26`, `edge_145`)
- Auto escalation pipeline — `mode: "auto"` escalates from fast fetch → fingerprint rotation → Cloudflare UA bypass → Playwright rendering when bot protection is detected
- Bot-block detection — Structured detection of Cloudflare, Anubis, PerimeterX, DataDome, Incapsula, Akamai with confidence scoring and retry advice
- Cloudflare challenge bypass — Detects CF challenges via header + body markers, retries with an alternate UA before falling through to fingerprint rotation
- Playwright fallback — If `wreq-js` fails, dynamically imports Playwright to render JS-heavy pages (zero-config, optional dependency)
- Smart retry logic — Exponential backoff (1s → 2s) for `429/500/502/503/504` and transient network errors; non-retryable statuses (400/401/403/404) fail fast (see the sketch after this list)
- Provider cooldown system — Search engines (DDG, Brave, Google) track failures with TTL cooldowns (10 min quota / 2 min network), so dead engines are skipped instead of wasting time
- HTTP→HTTPS auto-upgrade — Normalizes `http://` requests and responses
- Cross-host redirect detection — Surfaces a warning notice when a fetch redirects to a different domain
- GitHub-aware fetch — Detects repos, trees, and blobs; clones repos or uses the API. Special handling for GitHub Actions run URLs — fetches job details, step-by-step status, and failed job log excerpts
- Architecture detection — Analyzes cloned repos for Docker, CI/CD platforms, test frameworks, monorepo tooling, package managers, and security signals
- PDF extraction — Extracts text from PDFs (`pdf-parse`)
- RSC extraction — Extracts Next.js React Server Components flight data
- JSON auto-detection — Detects `application/json` content-type or a body starting with `{`/`[`, returns pretty-printed JSON
- Plain text handling — Detects `text/plain`, wraps it in a code block (unless already markdown)
- Binary download detection — Detects null bytes or >30% non-ASCII content, streams to temp with a filename from Content-Disposition
- Vertical extractors — 7 API-first extractors for npm, PyPI, Hacker News, Reddit, arXiv, YouTube, and docs sites — hit structured APIs instead of scraping HTML. The YouTube extractor fetches video transcripts + metadata via `youtube-transcript-plus` (Innertube API, no API key required); supports standard URLs, `youtu.be`, Shorts, and embeds; returns title, channel, duration, views, tags, description, and full transcript with configurable format (`text`, `vtt`, or `segments` with timestamps) and language selection with auto-fallback
- SPA data-island recovery — Extracts JSON hydration data from `<script>` tags and framework globals for JS-rendered pages
- Client-side meta redirect — Follows `<meta http-equiv="refresh">` up to 5 hops recursively
- Proxy support — Routes all requests through an HTTP, HTTPS, or SOCKS5 proxy
- Structured error info — Failed fetches include `errorCode`, `phase`, a `retryable` flag, and `statusCode`
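To make the retry policy above concrete, here is a minimal sketch of backoff-with-fail-fast. It illustrates the described behavior only; it is not pi-webaio's actual code, and the helper name is hypothetical:

```typescript
// Minimal sketch: retry retryable HTTP statuses with exponential backoff,
// fail fast on everything else. Node 18+ (global fetch).
const RETRYABLE = new Set([429, 500, 502, 503, 504]);

async function fetchWithBackoff(url: string, attempts = 3): Promise<Response> {
  let delayMs = 1000; // 1s, then 2s, matching the backoff described above
  for (let attempt = 1; ; attempt++) {
    const res = await fetch(url);
    if (res.ok || !RETRYABLE.has(res.status) || attempt >= attempts) {
      return res; // success, non-retryable status, or retries exhausted
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    delayMs *= 2;
  }
}
```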
Cookie Consent Banner Stripping
Cookie consent / CMP banners are stripped server-side during HTML pre-cleaning (`preCleanHtml()`). More than 80 selectors cover 17+ named CMPs (OneTrust, Cookiebot, Didomi, Quantcast, Usercentrics, TrustArc, Klaro, Sourcepoint, CookieYes, Osano, CookieFirst, Adobe PMC, SmartConsent, CookieHub, TermsFeed, Google, YouTube, BBC, Amazon), plus generic class/id patterns (`[class*="cookie-banner"]`, `[class*="consent-modal"]`, `[class*="gdpr-banner"]`, `[class*="privacy-notice"]`) and ARIA/data-attribute patterns. Stripping runs before Readability/Defuddle extraction, so banner noise never reaches the content heuristics.
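As a rough illustration of the same idea — removing consent nodes by selector before extraction — here is a minimal sketch using cheerio. The selector subset and the library choice are assumptions for the example; the package's own `preCleanHtml()` covers 80+ selectors and may be implemented differently:

```typescript
import * as cheerio from "cheerio";

// Illustrative subset of consent-banner selectors; the real list is 80+.
const CONSENT_SELECTORS = [
  "#onetrust-banner-sdk",    // OneTrust (assumed id, for illustration)
  "#CybotCookiebotDialog",   // Cookiebot (assumed id, for illustration)
  '[class*="cookie-banner"]',
  '[class*="consent-modal"]',
  '[class*="gdpr-banner"]',
  '[class*="privacy-notice"]',
];

function stripConsentBanners(html: string): string {
  const $ = cheerio.load(html);
  for (const selector of CONSENT_SELECTORS) {
    $(selector).remove(); // drop banner nodes before readability/markdown conversion
  }
  return $.html();
}
```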
Content Extraction Pipeline
When fetching a page, pi-webaio tries the following backends in order, falling through until one returns clean content. HTML is pre-cleaned (nav/footer/header/svg/cookie consent banners removed via DOM) before entering the extraction pipeline. At every stage, if extracted content is <30 words or <1% of original HTML, the pipeline falls through to the next backend.
- Vertical extractors — API-first: npm registry, PyPI JSON, Hacker News Firebase, Reddit .json, arXiv Atom, platform docs-sites
- GitHub special-case — Clones repos or fetches via GitHub API
- Binary download — Detects non-text content before attempting text fetch
- PDF — Extracts text from PDF files (by URL or content-type)
- JSON — Detects `application/json` content-type, pretty-prints in a code block
- Plain text — Wraps `.txt`, configs, and logs in a code block (unless already markdown)
- Client-side meta redirect — Follows `<meta http-equiv="refresh">` up to 5 hops
- Cloudflare challenge bypass — Detects CF 403, retries with the OpenCode UA
- Jina AI Reader (`r.jina.ai`) — Re-fetches via Jina's proxy
- Mozilla Readability — Local article extraction; if <30 words or <1% of the original HTML, skip to the next backend
- Next.js RSC — Extracts React Server Components flight data
- SPA data-island recovery — Extracts hydration JSON from `<script>` tags
- Defuddle — Local HTML→markdown conversion (extractor comments stripped, whitespace normalized)
- Fallback — Bare-minimum title + text extraction
Alternate link fallback: at every stage, if content is thin, `<link rel="alternate" type="application/json">` is tried. A sketch of the fall-through logic follows.
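A minimal sketch of that fall-through loop, including the thin-content check (<30 words or <1% of the original HTML). All names here are hypothetical, not pi-webaio's internals:

```typescript
// Hypothetical shape of an extraction backend: returns markdown or null.
type Extractor = (html: string, url: string) => Promise<string | null>;

// Thin-content check from the section above: reject results under 30 words
// or shorter than 1% of the original HTML.
function isThin(markdown: string, originalHtml: string): boolean {
  const words = markdown.trim().split(/\s+/).filter(Boolean).length;
  return words < 30 || markdown.length < originalHtml.length * 0.01;
}

async function extractWithFallthrough(
  html: string,
  url: string,
  backends: Extractor[],
): Promise<string | null> {
  for (const backend of backends) {
    const result = await backend(html, url);
    if (result && !isThin(result, html)) return result; // good enough, stop here
    // otherwise fall through to the next backend in the pipeline
  }
  return null; // nothing produced usable content
}
```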
Security & Safety
- DNS-based SSRF protection — Resolves hostnames and validates all returned IPs against the full RFC 1918/RFC 6598/RFC 3927 private ranges; blocks cloud metadata endpoints (`169.254.169.254`, `metadata.google.internal`) (see the sketch after this list)
- Redirect-hop SSRF re-validation — Validates every redirect target (up to 5 hops), preventing `302 → internal IP` bypass attacks
- IPv6 tunnel detection — Blocks tunneled private IPv4 inside IPv6 addresses (`::ffff`, IPv4-compatible, 6to4, Teredo)
- Content trust boundaries — All fetched content is wrapped in `[UNTRUSTED WEB CONTENT]` markers
- Secret scanning — Blocks requests containing API keys, tokens, or passwords in URLs
- Prompt injection detection — Categorizes and warns/redacts/tags suspicious content
- Provider cooldown system — Search engines track failures with TTL cooldowns to skip dead providers
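A simplified sketch of the DNS-based SSRF validation described above — resolve the hostname and reject if any returned address falls in a private range. The range list is an illustrative subset and the helper name is hypothetical:

```typescript
import { lookup } from "node:dns/promises";

// Illustrative subset of blocked ranges; pi-webaio's checks are broader
// (IPv6 tunnels, more metadata hosts, etc.).
const BLOCKED_HOSTS = new Set(["metadata.google.internal"]);
const PRIVATE_RANGES: Array<[RegExp, string]> = [
  [/^10\./, "RFC 1918"],
  [/^172\.(1[6-9]|2\d|3[01])\./, "RFC 1918"],
  [/^192\.168\./, "RFC 1918"],
  [/^100\.(6[4-9]|[7-9]\d|1[01]\d|12[0-7])\./, "RFC 6598"],
  [/^169\.254\./, "RFC 3927 / cloud metadata"],
  [/^127\./, "loopback"],
];

// Resolve the hostname and reject if *any* returned address is private.
async function assertPublicHost(hostname: string): Promise<void> {
  if (BLOCKED_HOSTS.has(hostname)) throw new Error(`Blocked host: ${hostname}`);
  const addresses = await lookup(hostname, { all: true });
  for (const { address } of addresses) {
    for (const [range, label] of PRIVATE_RANGES) {
      if (range.test(address)) {
        throw new Error(`SSRF blocked: ${hostname} resolves to ${address} (${label})`);
      }
    }
  }
}
```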
Metadata & Frontmatter
- Rich YAML frontmatter — Saved markdown files include `title`, `url`, `author`, `published`, `site`, `language`, and `word_count` in the frontmatter when available from extraction (Defuddle) (see the sketch below)
- Stored in session cache — Metadata is captured alongside content in the session store for retrieval via `aio-webcontent`
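As a rough illustration, the metadata captured in frontmatter can be pictured as the following shape (a hypothetical type written for this README, not an exported API):

```typescript
// Hypothetical shape of the frontmatter metadata described above.
interface PageMetadata {
  title?: string;
  url: string;
  author?: string;
  published?: string; // e.g. an ISO 8601 date, when the extractor finds one
  site?: string;
  language?: string;
  word_count?: number;
}
```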
Caching & Performance
- Session cache — 30-minute TTL, LRU eviction (max 100 entries). Keys are normalized for consistency (`http://` → `https://`, root trailing slashes deduplicated).
- Persistent disk cache — On startup, all previously fetched `.md` files under `BASE_TEMP` are scanned and registered in the session store. Content is lazy-loaded from disk on first access, so the cache survives restarts.
- Search cache — 10-minute TTL, persisted to disk for cross-session reuse
- Preview truncation — `aio-webfetch` tool results show ~500 tokens in-context; the full file is always written to disk for inspection via the `read` tool
- Rate limiter — Token-bucket per domain (5 req/s, burst 10) in `smartFetch`. All tools are throttled politely (see the sketch after this list).
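A minimal sketch of a per-domain token bucket with the stated limits (5 req/s, burst 10). This illustrates the throttling idea only; it is not `smartFetch` itself, and the names are hypothetical:

```typescript
// Per-domain token bucket: 5 tokens/second refill, burst of 10.
const RATE = 5;   // tokens refilled per second
const BURST = 10; // maximum bucket size

interface Bucket { tokens: number; lastRefill: number }
const buckets = new Map<string, Bucket>();

async function throttle(url: string): Promise<void> {
  const domain = new URL(url).hostname;
  const now = Date.now();
  const bucket = buckets.get(domain) ?? { tokens: BURST, lastRefill: now };
  // Refill proportionally to elapsed time, capped at the burst size.
  bucket.tokens = Math.min(BURST, bucket.tokens + ((now - bucket.lastRefill) / 1000) * RATE);
  bucket.lastRefill = now;
  if (bucket.tokens < 1) {
    // Not enough tokens: wait until one becomes available.
    const waitMs = ((1 - bucket.tokens) / RATE) * 1000;
    await new Promise((resolve) => setTimeout(resolve, waitMs));
    bucket.tokens = 1;
  }
  bucket.tokens -= 1;
  buckets.set(domain, bucket);
}
```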
AI-Powered Summarization
- Google AI Mode (udm=50) — Long fetched content is auto-summarized by Google AI via headless Chrome CDP (15s timeout). The AI reads the URL directly and returns a concise bullet-point summary.
- Search context bridging — When `aio-webfetch` follows a recent `aio-websearch` (within 5 min), the original query is injected into the summarization prompt for more focused summaries (see the sketch after this list)
- Graceful fallback — If Google AI is unavailable (Chrome not installed, CDP files missing), falls back to truncation
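A small sketch of the context-bridging idea — if a search ran within the last 5 minutes, fold its query into the summarization prompt. Names and prompt wording are illustrative, not the package's actual prompt:

```typescript
// Hypothetical helper: bias the summary toward a recent search query.
interface RecentSearch { query: string; at: number }

const SEARCH_CONTEXT_WINDOW_MS = 5 * 60 * 1000; // 5 minutes

function buildSummaryPrompt(url: string, lastSearch?: RecentSearch): string {
  const base = `Summarize the key points of ${url} as concise bullets.`;
  if (lastSearch && Date.now() - lastSearch.at < SEARCH_CONTEXT_WINDOW_MS) {
    return `${base} Focus on aspects relevant to the query: "${lastSearch.query}".`;
  }
  return base;
}
```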
Google CDP Search
- Parallel search — `aio-websearch` runs DuckDuckGo, Brave, and Google in parallel. Google uses a headless Chrome instance (auto-launched) with locale-agnostic `textarea[name="q"]` selectors.
- 7-second cap — Returns whatever results are ready by the deadline. No waiting for slow engines.
- Result deduplication — Merges and deduplicates results across all engines by URL (see the sketch below).
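A minimal sketch of merge-and-dedupe by URL across engines. The normalization details and field names here are assumptions for the example:

```typescript
interface SearchResult { title: string; url: string; snippet: string }

// Merge per-engine result lists, keeping the first result seen for each URL.
function dedupeResults(perEngine: SearchResult[][]): SearchResult[] {
  const seen = new Map<string, SearchResult>();
  for (const results of perEngine) {
    for (const result of results) {
      // Light normalization so http/https and trailing-slash duplicates collapse.
      const key = result.url.replace(/^http:/, "https:").replace(/\/$/, "");
      if (!seen.has(key)) seen.set(key, result);
    }
  }
  return [...seen.values()];
}
```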
Usage Examples
Search the web
Use aio-websearch to find the latest React documentation

Google search is on by default (via headless Chrome). To skip it:

Use aio-websearch to search for "Rust serde" (google: false)

Fetch a single URL

Use aio-webfetch to download https://example.com/article

After fetching, use the built-in read tool to inspect the full saved file.
Fetch multiple URLs in batch
Use aio-webfetch to download these URLs:
- https://example.com/page1
- https://example.com/page2
- https://example.com/page3

Fetch with a specific browser fingerprint

Use aio-webfetch to download https://example.com (browser: "firefox_147", os: "linux")

Retrieve stored content (no re-download)

Use aio-webcontent to get the full content from https://example.com/article

Pull an entire site

Use aio-webpull to download https://docs.example.com (max: 50 pages)

Pull a site with custom fingerprint

Use aio-webpull to download https://docs.example.com (max: 50, browser: "edge_145", os: "macos")

License
MIT
