@fredrikpaulin/webscan
v0.3.1
Bun-native metasearch + content-aware fetch/crawl/extract. Static-HTML parsing engine, born-digital PDF text extractor, multi-format content dispatcher (HTML / PDF / JSON / XML / feeds / CSV / text / Markdown / binary), optional headless rendering over Ch
WEBSCAN
A Bun-native toolkit for getting useful text out of the web. Fetches a URL and returns readable text, metadata, and outbound links — for HTML, PDF, JSON, RSS/Atom, CSV, Markdown, and plain text. Extracts structured data against a JSON Schema. Crawls a small surface with robots.txt support. Renders JavaScript-heavy pages through any already-installed Chromium-family browser over the Chrome DevTools Protocol. Aggregates web search and serves infoboxes when you want a search UI too.
Zero runtime dependencies. No Puppeteer, no Playwright, no PDF library, no browser binary shipped.
Runs in three modes: standalone server with a browser UI, MCP tool
module for tiny-mcp-server, or library for direct use.
Install
bun add @fredrikpaulin/webscan # as a library
bun install -g @fredrikpaulin/webscan # as the `webscan` standalone server

import { fetchUrl, extract, crawl } from '@fredrikpaulin/webscan'
import { parsePdf } from '@fredrikpaulin/webscan/content/pdf'
import webscan from '@fredrikpaulin/webscan/mcp'

Quick start
Fetch any URL — HTML, PDF, JSON, feed, CSV, text — and get back the same shape:
import { fetchUrl } from '@fredrikpaulin/webscan'
const article = await fetchUrl('https://example.com/article')
article.kind // 'html'
article.title // 'Fixture Article'
article.text // 'Headline\nBody paragraph one. …'
article.links // [{ href, text, rel, internal, … }, …]
const report = await fetchUrl('https://example.com/q4-report.pdf')
report.kind // 'pdf'
report.title // 'Quarterly report' (from /Info)
report.text // 'Page 1 text…\n\n--- Page 2 ---\n\n…'
report.metadata // { pageCount: 12, author, encrypted: false, … }
const feed = await fetchUrl('https://example.com/feed.rss')
feed.kind // 'feed'
feed.parts // [{ type:'feed-item', title, url, publishedAt, … }, …]

For schema-driven extraction:
import { extract } from '@fredrikpaulin/webscan'
const { data } = await extract(article, {
  type: 'object',
  properties: {
    title: { type: 'string', 'x-extract': { selector: 'h1' } },
    price: { type: 'number', 'x-extract': { selector: '.price', transform: 'number' } },
    sku: { type: 'string', 'x-extract': { regex: { pattern: 'SKU:\\s*([A-Z0-9-]+)', group: 1 } } },
    author: { type: 'string' } // un-hinted → optional local LLM fallback
  }
})

For JavaScript-rendered pages:
const spa = await fetchUrl('https://app.example.com/', {
  render: {
    waitForSelector: '.product-grid',
    actions: [
      { type: 'click', selector: '.load-more' },
      { type: 'waitForSelector', selector: '.page-2' }
    ]
  }
})
spa.rendered // true — DOM is the post-interaction state

For a bounded crawl:
import { crawl } from '@fredrikpaulin/webscan'
for await (const r of crawl({
  seeds: ['https://example.com/'],
  depth: 2,
  maxPages: 50,
  respectRobots: true
})) {
  if (r.error) console.warn(r.url, r.error.code)
  else console.log(r.url, r.page.title)
}

For the standalone search server:
bun start # http://127.0.0.1:5000
bun run dev # with hot reload

The package root (src/index.js) is side-effect-free. Importing it starts
no server.
What it does
Content-aware fetch. fetchUrl(url) reads bytes once and dispatches
through a content layer: HTML through a zero-dependency tree parser and
metadata extractor, PDF through a born-digital text extractor, JSON, XML,
RSS/Atom feeds, CSV/TSV, Markdown, plain text. Every kind returns the same
ParsedContent shape so downstream code doesn't branch. See
docs/fetch.md, docs/content.md,
docs/pdf.md.
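The dispatch idea described above can be illustrated with a minimal sketch. This is not WEBSCAN's implementation: `detectKind`, `dispatch`, and the `parsers` table are hypothetical names, only three formats are covered, and the returned object only approximates the ParsedContent shape shown in the quick start.

```javascript
// Hypothetical sketch of content-type dispatch with a uniform return shape.
// None of these names come from WEBSCAN; they only illustrate the pattern.
const parsers = {
  json: (text) => ({
    kind: 'json',
    title: null,
    text: JSON.stringify(JSON.parse(text), null, 2), // pretty-print for readability
    links: []
  }),
  csv: (text) => ({ kind: 'csv', title: null, text, links: [] }),
  text: (text) => ({ kind: 'text', title: null, text, links: [] })
}

function detectKind(contentType) {
  // Drop parameters such as "; charset=utf-8" before matching.
  const mime = String(contentType || '').split(';')[0].trim().toLowerCase()
  if (mime === 'application/json' || mime.endsWith('+json')) return 'json'
  if (mime === 'text/csv') return 'csv'
  return 'text' // fallback for anything this sketch does not recognise
}

function dispatch(contentType, bytes) {
  // Decode once, then hand the text to the parser chosen by content type.
  const text = new TextDecoder().decode(bytes)
  return parsers[detectKind(contentType)](text)
}
```

Because every parser returns the same keys, a caller can read `kind`, `text`, and `links` without checking which parser ran.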
Schema-driven extract. extract(input, schema) resolves
x-extract hints — CSS selectors, regex against text, metadata path
lookups — per field. Un-hinted fields fall through to an optional local
Ollama call, with Ollama's format: schema
constraint giving you valid JSON without an API key or a third-party
round-trip. The LLM seam is provider-agnostic: pass your own
(prompt, schema) => object function to use Anthropic, OpenAI, or
anything else. See docs/extract.md.
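As a rough illustration of the hint-then-fallback flow, the sketch below resolves only regex hints against plain text and passes un-hinted fields to a caller-supplied `(prompt, schema) => object` function. `resolveHints` and the prompt wording are assumptions made for this example; selector and metadata hints are deliberately left out, and the real `extract` may work differently.

```javascript
// Sketch only: regex-style x-extract hints plus a pluggable LLM fallback.
// resolveHints is a hypothetical helper, not part of WEBSCAN's API.
async function resolveHints(text, schema, llm) {
  const data = {}
  const unhinted = {}
  for (const [name, prop] of Object.entries(schema.properties)) {
    const hint = prop['x-extract']
    if (!hint) { unhinted[name] = prop; continue }
    if (!hint.regex) continue // selector and metadata hints are out of scope here
    const match = new RegExp(hint.regex.pattern).exec(text)
    const group = hint.regex.group ?? 0
    if (!match || match[group] === undefined) continue
    data[name] = prop.type === 'number' ? Number(match[group]) : match[group]
  }
  // Only fields without any hint are sent to the fallback function.
  if (llm && Object.keys(unhinted).length > 0) {
    const subSchema = { type: 'object', properties: unhinted }
    const filled = await llm(`Extract these fields from:\n${text}`, subSchema)
    Object.assign(data, filled)
  }
  return { data }
}
```

Passing the fallback as a plain function keeps the sketch provider-agnostic in the same spirit as the seam described above: any function that accepts a prompt and a schema and returns an object can be dropped in.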
Bounded BFS crawl. crawl(opts) is an async iterable over outbound
links. Respects depth, page caps, same-domain rules, include/exclude
regexes, concurrency, per-host politeness, and robots.txt. Non-HTML
content (PDFs, CSVs, feeds) follows the same parse path. Iterator
teardown (for await … of early break) closes everything it owns. See
docs/crawl.md, docs/robots.md.
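A bounded breadth-first crawl can be sketched as an async generator. The version below is a simplified illustration under stated assumptions: `boundedCrawl` and the `fetchLinks` callback (a function from a URL to a promise of outbound links) are hypothetical, and robots.txt, politeness delays, concurrency, and include/exclude rules are omitted.

```javascript
// Sketch of a depth- and page-capped BFS as an async generator.
// fetchLinks is a hypothetical callback: (url) => Promise<string[]>.
async function* boundedCrawl({ seeds, depth, maxPages, fetchLinks }) {
  const seen = new Set(seeds)
  let frontier = [...seen]
  let visited = 0
  for (let level = 0; level <= depth && frontier.length > 0; level++) {
    const next = []
    for (const url of frontier) {
      if (visited >= maxPages) return // page cap reached
      let result
      try {
        result = { url, depth: level, links: await fetchLinks(url) }
      } catch (error) {
        // Report failures as results instead of aborting the whole crawl.
        result = { url, depth: level, links: [], error }
      }
      visited++
      yield result
      for (const link of result.links) {
        if (!seen.has(link)) { seen.add(link); next.push(link) }
      }
    }
    frontier = next // move one level deeper
  }
}
```

Since this is a generator, breaking out of the consuming `for await` loop stops the crawl at the next `yield`, so no further pages are fetched after the consumer loses interest.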
Optional headless rendering. fetchUrl(url, { render: true }) drives
Chrome, Chromium, Edge, or Brave — whichever is already installed —
over the Chrome DevTools Protocol. Click, type, scroll, screenshot.
Real coordinate-based input events through Input.dispatchMouseEvent so
SPA event handlers fire as if a real user touched them. URL policy
blocks loopback, private networks, and cloud metadata at both the
navigation layer and request interception. The MCP render_page tool
hard-codes the safety boundaries — no evaluate, no password fills, no
inline screenshot bytes. See docs/render.md,
docs/browser-setup.md.
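The URL policy mentioned above can be approximated with a small check on literal hostnames. This is a sketch, not WEBSCAN's policy code: `isBlockedUrl` is a hypothetical name, the ranges listed are the common IPv4 private and link-local blocks, and a production policy would also need to consider DNS resolution and the wider IPv6 ranges.

```javascript
// Sketch of a URL policy: block loopback, private IPv4 ranges, and the
// link-local block that contains the 169.254.169.254 metadata address.
// isBlockedUrl is hypothetical and only inspects the literal hostname.
function isBlockedUrl(input) {
  let url
  try { url = new URL(input) } catch { return true } // unparseable: refuse
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return true
  const host = url.hostname.toLowerCase()
  if (host === 'localhost' || host.endsWith('.localhost')) return true
  if (host === '[::1]') return true // IPv6 loopback
  const m = /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/.exec(host)
  if (!m) return false // a DNS name; not resolved in this sketch
  const a = Number(m[1])
  const b = Number(m[2])
  if (a === 0 || a === 10 || a === 127) return true
  if (a === 172 && b >= 16 && b <= 31) return true
  if (a === 192 && b === 168) return true
  if (a === 169 && b === 254) return true
  return false
}
```

Running a check like this both before navigation and on each intercepted request matters because a page that passes the first check can still redirect or issue subrequests to an internal address.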
Search + infobox UI. The original v0.1 features stay shipped: parallel scrape against DuckDuckGo Lite, Startpage, Yahoo and Bing Images; calculator, unit conversion (44 units), currency conversion (70+ fiat + 30+ crypto), dictionary, Wikipedia summary; SQLite cache; proxy rotation with failure tracking; OpenSearch XML for browser integration. See docs/usage.md, docs/search-engines.md, docs/infobox.md.
MCP tools. fetch_url, extract, crawl, render_page,
web_search, image_search, calculate, convert_units,
convert_currency, define_word, wikipedia_summary. The crawl tool
caps depth at 3, pages at 200, concurrency at 8 — the LLM-friendly
bounded version. The render tool hard-codes the §8 safety boundaries:
no allowPrivateNetwork in the schema, no evaluate action, no inline
screenshot bytes. See docs/mcp-integration.md.
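One way to enforce caps like these is to clamp tool arguments before they reach the crawler. The sketch below uses the limits stated above (depth 3, pages 200, concurrency 8); `clampCrawlArgs`, the lower bounds, and the `concurrency` key name are assumptions for illustration, not the MCP module's actual code.

```javascript
// Sketch: clamp crawl arguments to fixed upper bounds before use.
// Upper bounds come from the text above; everything else is hypothetical.
const LIMITS = {
  depth: { min: 0, max: 3 },
  maxPages: { min: 1, max: 200 },
  concurrency: { min: 1, max: 8 }
}

function clampCrawlArgs(args = {}) {
  const out = { ...args }
  for (const [key, { min, max }] of Object.entries(LIMITS)) {
    const n = Number(args[key])
    // Missing or non-numeric values fall back to the cap itself.
    out[key] = Number.isFinite(n) ? Math.max(min, Math.min(Math.floor(n), max)) : max
  }
  return out
}
```

Clamping at the tool boundary means a model that asks for an oversized crawl still gets a bounded one instead of an error.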
Why this exists
Most web-fetch libraries handle HTML and stop. The interesting research
material is mixed: PDFs from regulators, RSS from blogs, CSVs from open
data portals, JSON from APIs, occasionally an SPA whose useful content
only appears after JavaScript runs. WEBSCAN treats those as the same
problem with the same return shape — fetch the bytes, route them to the
right parser, return readable text. A crawler that follows HTML links
but downloads PDFs lossily (or worse, silently) is a tool you have to
work around. One that returns the same ParsedContent for both is one
you can build on.
The browser path exists for the cases where a full DOM is required and nothing else works. It uses whatever Chromium-family browser is already on the machine, drives it over CDP from ~140 lines of JSON-over-WebSocket code, and doesn't pull a thousand-package dependency tree to do so. It runs on a developer laptop and on a Raspberry Pi 5 with the same code path.
The LLM path defaults to local Ollama because the contract for "give me JSON matching this schema" finally landed on the local side in late 2025. Cloud LLMs are still better at messy extraction; they're no longer the only way to get structured output. The library defaults to the option that doesn't require an API key, and the seam is open for callers who want to use Anthropic, OpenAI, or anything else.
Configuration
WEBSCAN_PORT=8080 WEBSCAN_HOST=0.0.0.0 bun start
WEBSCAN_FETCH_HTTP_TIMEOUT=10000 \
WEBSCAN_CRAWL_CONCURRENCY=4 \
WEBSCAN_EXTRACT_LLM_ENABLED=true \
WEBSCAN_EXTRACT_LLM_MODEL=llama3.2 \
bun start

Or use a webscan.config.json file. The schema covers 31 keys across three eras
(v0.1 search/server, v0.2 fetch/crawl/extract, v0.3 content/browser).
See docs/configuration.md.
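The environment variables above resemble the camelCase option names used by the MCP module (for example WEBSCAN_EXTRACT_LLM_ENABLED and extractLlmEnabled). The sketch below shows one way such a mapping could be written. Whether WEBSCAN derives its keys this way is an assumption, and `envToOptions` is a hypothetical helper; docs/configuration.md remains the authority on the actual key names.

```javascript
// Sketch: map WEBSCAN_* environment variables to camelCase option keys.
// The naming rule is inferred from two examples in this README and may
// not match the library's real configuration loader.
function envToOptions(env) {
  const options = {}
  for (const [name, value] of Object.entries(env)) {
    if (!name.startsWith('WEBSCAN_')) continue
    const key = name
      .slice('WEBSCAN_'.length)
      .toLowerCase()
      .replace(/_([a-z0-9])/g, (_, c) => c.toUpperCase())
    // Coerce booleans and plain integers; leave everything else as a string.
    if (value === 'true' || value === 'false') options[key] = value === 'true'
    else if (/^\d+$/.test(value)) options[key] = Number(value)
    else options[key] = value
  }
  return options
}
```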
MCP integration
import webscan from '@fredrikpaulin/webscan/mcp'
const module = webscan({
  cacheDir: ':memory:',
  extractLlmEnabled: true,
  extractLlmModel: 'llama3.2'
})

The MCP module registers eleven tools, with safety boundaries baked
into the schemas where it matters (no evaluate in render_page, no
render in crawl, no inline binary in either). See
docs/mcp-integration.md.
Tests
bun test tests/

427 tests across 59 files. Real-browser integration tests are gated
behind WEBSCAN_TEST_BROWSER=1 so CI without a browser stays green:

WEBSCAN_TEST_BROWSER=1 bun test tests/browser-render.test.js

A small SPA fixture in tests/fixtures/spa/ drives the gated tests
deterministically.
Documentation
- Usage — endpoints, browser integration, server config
- Configuration — all keys, env vars, JSON config
- MCP Integration — tool schemas, programmatic API
- Fetch — fetchUrl, return shape, conditional fetch
- Content dispatcher — parseContent, format support, ParsedContent
- PDF — born-digital text extractor, scope, diagnostics
- Crawl — BFS, robots, content-type filters, render lifecycle
- Robots — checkRobots, parsing rules, permissive-on-error
- Extract — schema, x-extract hints, Ollama backing
- Render — render options, actions, screenshots, MCP boundaries
- Browser setup — detection, override paths, troubleshooting
- Parsing engine — parseWebsite, selectors, the tree parser
- Search engines — built-in engines, parsers
- Infobox handlers — calculator, units, currency, dictionary, Wikipedia
- Template engine — {{ var }}, loops, conditionals, includes
- Roadmap — planned features, version targets
- Changelog — release history
License
MIT — see LICENSE.md.
