@johnnywu/pi-webfetch
v1.1.0
Fetch web pages and URLs from pi with readable text, Markdown, HTML, or JSON output.
pi-webfetch
A pi package that adds a webfetch tool for fetching and cleaning URL content with Scrapling, Defuddle, and gh for GitHub URLs.
Given a user-provided URL, webfetch routes GitHub URLs through GitHub CLI, otherwise chooses a Scrapling fetcher strategy, runs Scrapling through its CLI shell, and returns cleaned Markdown/HTML/text content to pi.
Install
pi install npm:@johnnywu/pi-webfetch
Or via local path in ~/.pi/agent/settings.json while developing:
{
"packages": ["~/dev/jwu/pi-webfetch"]
}
Requirements
webfetch calls Scrapling through:
scrapling shell -L warning -c "..."
Make sure the scrapling executable is available in the environment where pi runs.
Defuddle conversion is bundled as an npm dependency and is used by default for non-GitHub Markdown output. It can be disabled in settings.
Configuration
Add webfetch settings to .pi/settings.json (project) or ~/.pi/agent/settings.json (global) to override defaults:
{
"webfetch": {
"useDefuddle": true,
"qualityJudge": false,
"qualityJudgeModel": "google/gemini-2.5-flash",
"qualityJudgeThinkLevel": "off"
}
}
Defuddle behavior:
| webfetch.useDefuddle | Markdown behavior |
|---|---|
| omitted | Scrapling fetches cleaned HTML, then Defuddle converts that HTML to Markdown |
| true | Same as omitted: use Scrapling HTML plus Defuddle Markdown conversion |
| false | Scrapling fetches and extracts Markdown directly |
Quality judge behavior:
| Setting | Default | Description |
|---|---:|---|
| webfetch.qualityJudge | false | When enabled, ask an LLM whether the fetched Markdown is usable before accepting a Scrapling strategy. If the judge returns unusable, webfetch records that strategy as failed and tries the next one. |
| webfetch.qualityJudgeModel | current pi model | Optional judge model in provider/model form, for example google/gemini-2.5-flash. |
| webfetch.qualityJudgeThinkLevel | off | Optional judge thinking level: off, minimal, low, medium, high, or xhigh. Unsupported levels are clamped for the selected model. |
Project settings override global settings. For compatibility, the dotted key form also works:
{
"webfetch.useDefuddle": true,
"webfetch.qualityJudge": true,
"webfetch.qualityJudgeModel": "google/gemini-2.5-flash",
"webfetch.qualityJudgeThinkLevel": "off"
}
The webfetch.useDefuddle switch affects only non-GitHub Markdown output. Explicit mode: "html" or mode: "text" still uses direct extraction. GitHub URLs are handled by gh and do not use Defuddle.
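The precedence rules above can be sketched as a small merge function. This is an illustration only, not the package's code: the names `WebfetchSettings`, `readWebfetch`, and `resolveSettings` are hypothetical, and the rule that the nested form wins over the dotted form inside a single file is an assumption, since the README only states that project settings override global ones.

```typescript
// Hypothetical types and helpers; only the setting keys and defaults come from the README.
type ThinkLevel = "off" | "minimal" | "low" | "medium" | "high" | "xhigh";

interface WebfetchSettings {
  useDefuddle: boolean;
  qualityJudge: boolean;
  qualityJudgeModel?: string; // left undefined to mean "use the current pi model"
  qualityJudgeThinkLevel: ThinkLevel;
}

type SettingsFile = Record<string, unknown>;

const DEFAULTS: WebfetchSettings = {
  useDefuddle: true,
  qualityJudge: false,
  qualityJudgeThinkLevel: "off",
};

// Collect webfetch keys from one settings file, in dotted and nested form.
// Assumption: if both forms set the same key in one file, the nested form wins.
function readWebfetch(file: SettingsFile): Partial<WebfetchSettings> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(file)) {
    if (key.startsWith("webfetch.")) out[key.slice("webfetch.".length)] = value;
  }
  const nested = file["webfetch"];
  if (nested !== null && typeof nested === "object") Object.assign(out, nested);
  return out as Partial<WebfetchSettings>; // no value validation in this sketch
}

// Later spreads win: defaults, then global, then project.
function resolveSettings(globalFile: SettingsFile, projectFile: SettingsFile): WebfetchSettings {
  return { ...DEFAULTS, ...readWebfetch(globalFile), ...readWebfetch(projectFile) };
}
```

For example, a global file that sets `useDefuddle` to false in nested form and a project file that sets `"webfetch.useDefuddle": true` in dotted form would resolve to `useDefuddle: true` under this sketch.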
Tool
webfetch
Fetch and clean an HTTP(S) URL with gh for GitHub URLs or Scrapling for other sites.
| Parameter | Type | Default | Description |
|---|---:|---:|---|
| url | string | required | HTTP(S) URL to inspect and fetch |
| mode | markdown \| html \| text | markdown | Output mode. Markdown may be converted by Scrapling or Defuddle depending on settings. |
Fetch strategy
For non-GitHub URLs, webfetch uses an explicit built-in site-to-strategy mapping first.
Current mapping:
| Site | Strategy | Reason |
|---|---|---|
| shadertoy.com and subdomains | StealthyFetcher | Cloudflare protection; static/dynamic fetchers often return 403 or challenge HTML |
| x.com, twitter.com and subdomains | StealthyFetcher | SPA and anti-bot behavior; future login-state support can build on this |
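A rough sketch of how a "site and subdomains" lookup like the table above might work. The `SITE_STRATEGIES` table and the `mappedStrategy` function are hypothetical names, and the suffix-matching rule is an assumption about what "and subdomains" means here.

```typescript
type Strategy = "Fetcher" | "DynamicFetcher" | "StealthyFetcher";

// Mirrors the mapping table in this README; the data structure itself is hypothetical.
const SITE_STRATEGIES: Record<string, Strategy> = {
  "shadertoy.com": "StealthyFetcher",
  "x.com": "StealthyFetcher",
  "twitter.com": "StealthyFetcher",
};

// Returns the mapped strategy for the URL's host, or undefined when the
// site is not in the mapping and sequential escalation should be used.
function mappedStrategy(url: string): Strategy | undefined {
  const host = new URL(url).hostname.toLowerCase();
  for (const [site, strategy] of Object.entries(SITE_STRATEGIES)) {
    // Match the site itself or any subdomain, but not lookalikes such as "notx.com".
    if (host === site || host.endsWith("." + site)) return strategy;
  }
  return undefined;
}
```

Matching on a leading dot keeps unrelated hosts that merely end in the same characters out of the mapping.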
For sites that are not in the mapping, webfetch uses sequential escalation from the Scrapling guide:
1. Fetcher.get(url): fastest static fetcher
2. If it fails, returns HTTP >= 400, or extracts empty content, try DynamicFetcher.fetch(url, network_idle=True, wait=3000)
3. If that also fails or extracts empty content, try StealthyFetcher.fetch(url, network_idle=True, wait=3000)
Each failed attempt is recorded in errors, so the result explains why webfetch moved on to the next strategy.
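The escalation loop can be sketched roughly as follows. This is not the package's implementation: `Attempt`, `RunStrategy`, and `fetchWithEscalation` are hypothetical names, the `run` callback stands in for however webfetch actually invokes Scrapling, and applying the HTTP status check to every step is a simplification.

```typescript
type Strategy = "Fetcher" | "DynamicFetcher" | "StealthyFetcher";

// Hypothetical result shape for one Scrapling attempt.
interface Attempt {
  ok: boolean;
  status?: number;
  content?: string;
  error?: string;
}

// Stand-in for running one Scrapling strategy against a URL.
type RunStrategy = (strategy: Strategy, url: string) => Promise<Attempt>;

const ESCALATION: Strategy[] = ["Fetcher", "DynamicFetcher", "StealthyFetcher"];

async function fetchWithEscalation(
  url: string,
  run: RunStrategy,
): Promise<{ strategy?: Strategy; content?: string; errors: string[] }> {
  const errors: string[] = [];
  for (const strategy of ESCALATION) {
    try {
      const attempt = await run(strategy, url);
      if (!attempt.ok) {
        errors.push(`${strategy}: ${attempt.error ?? "fetch failed"}`);
      } else if ((attempt.status ?? 200) >= 400) {
        errors.push(`${strategy}: HTTP ${attempt.status}`);
      } else if (!attempt.content || attempt.content.trim() === "") {
        errors.push(`${strategy}: empty content`);
      } else {
        // First usable result wins; earlier failures stay in errors.
        return { strategy, content: attempt.content, errors };
      }
    } catch (e) {
      errors.push(`${strategy}: ${e instanceof Error ? e.message : String(e)}`);
    }
  }
  // Every strategy failed; the caller only gets the recorded reasons.
  return { errors };
}
```

Keeping the error list alongside a successful result is what lets the tool report which cheaper strategies were tried first.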
When Defuddle is enabled for Markdown output, each Scrapling strategy is considered successful only after both steps succeed: Scrapling extracts cleaned HTML, then Defuddle returns non-empty Markdown. If Defuddle fails or returns empty Markdown for a strategy, that strategy is recorded as failed and webfetch continues to the next Scrapling strategy.
When webfetch.qualityJudge is enabled, the selected judge model receives a sample of the fetched Markdown and returns a JSON usability decision. Unusable content, such as boilerplate, captcha/challenge pages, error pages, or unrelated content, is treated as a failed strategy so webfetch can continue to the next Scrapling strategy. If the judge cannot run, webfetch fails open and uses the fetched content rather than making the tool unusable.
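The fail-open behavior could look something like the sketch below. The JSON shape `{ usable, reason }`, the 4000-character sample size, and the names `Judge`, `JudgeVerdict`, and `isUsable` are all assumptions made for illustration; the README only says the judge returns a JSON usability decision and that webfetch fails open when the judge cannot run. Treating an unparseable or unexpected response as "cannot run" is also an assumption.

```typescript
// Hypothetical verdict shape; the real JSON schema is not documented here.
interface JudgeVerdict {
  usable: boolean;
  reason?: string;
}

// Stand-in for the LLM call: takes a Markdown sample, returns the raw model reply.
type Judge = (sample: string) => Promise<string>;

async function isUsable(markdown: string, judge: Judge, sampleChars = 4000): Promise<JudgeVerdict> {
  try {
    const raw = await judge(markdown.slice(0, sampleChars));
    const parsed: unknown = JSON.parse(raw);
    if (
      typeof parsed === "object" &&
      parsed !== null &&
      typeof (parsed as { usable?: unknown }).usable === "boolean"
    ) {
      const verdict = parsed as { usable: boolean; reason?: unknown };
      return typeof verdict.reason === "string"
        ? { usable: verdict.usable, reason: verdict.reason }
        : { usable: verdict.usable };
    }
    // Unexpected shape: accept the content rather than block the tool.
    return { usable: true, reason: "judge returned an unexpected shape; failing open" };
  } catch {
    // Model error or invalid JSON: accept the content rather than block the tool.
    return { usable: true, reason: "judge could not run; failing open" };
  }
}
```

Only an explicit `usable: false` verdict causes the strategy to be treated as failed in this sketch.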
Content extraction uses:
Convertor._extract_content(page, extraction_type=mode, main_content_only=True)
When webfetch.useDefuddle is not false and Markdown output is requested, mode sent to Scrapling is html; the returned cleaned HTML is then parsed with Defuddle using markdown: true.
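The mode rule in the previous paragraph reduces to a single condition. A minimal sketch, where `scraplingExtractionType` is a hypothetical helper name:

```typescript
type Mode = "markdown" | "html" | "text";

// Decide which extraction type to request from Scrapling.
// useDefuddle is undefined when the setting is omitted, which counts as enabled.
function scraplingExtractionType(mode: Mode, useDefuddle: boolean | undefined): Mode {
  const defuddleActive = mode === "markdown" && useDefuddle !== false;
  // With Defuddle active, Scrapling returns cleaned HTML for Defuddle to convert.
  return defuddleActive ? "html" : mode;
}
```

Explicit `html` and `text` requests pass through unchanged regardless of the setting.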
Output behavior
- Only http:// and https:// URLs are accepted.
- Failed Scrapling strategies are included in tool details.
- Tool output is truncated with pi's standard limits: 2000 lines or 50 KiB, whichever is hit first.
- If output is truncated, the full extracted content is saved to a temp file and the path is included in the result.
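The URL check and the truncation limits above can be sketched as follows. This is an illustration under stated assumptions rather than pi's actual truncation code: `isFetchableUrl` and `truncateOutput` are hypothetical names, truncation here happens on whole lines, and saving the full content to a temp file is left out.

```typescript
const MAX_LINES = 2000;
const MAX_BYTES = 50 * 1024; // 50 KiB

// Accept only http:// and https:// URLs; anything unparseable is rejected.
function isFetchableUrl(input: string): boolean {
  try {
    const { protocol } = new URL(input);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}

// Keep whole lines until either limit would be exceeded.
// Simplification: a first line larger than 50 KiB yields empty output.
function truncateOutput(text: string): { text: string; truncated: boolean } {
  const encoder = new TextEncoder();
  const kept: string[] = [];
  let bytes = 0;
  for (const line of text.split("\n")) {
    // Count the UTF-8 size of the line plus the newline that joins it to the previous one.
    const size = encoder.encode(line).length + (kept.length > 0 ? 1 : 0);
    if (kept.length >= MAX_LINES || bytes + size > MAX_BYTES) {
      return { text: kept.join("\n"), truncated: true };
    }
    kept.push(line);
    bytes += size;
  }
  return { text, truncated: false };
}
```

Measuring encoded bytes rather than string length matters for non-ASCII pages, where one character can take several bytes.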
Development
# Install dependencies
bun install
# Run tests
bun test
# Type check
bun run typecheck
# Format
bun run format
# Release (local, requires GH_TOKEN and NPM_TOKEN)
bun run release
This project uses semantic-release with conventional commits.
License
MIT
