@fredrikpaulin/webscan

v0.3.1

Published

17 days ago

WEBSCAN

A Bun-native toolkit for getting useful text out of the web. Fetches a URL and returns readable text, metadata, and outbound links — for HTML, PDF, JSON, RSS/Atom, CSV, Markdown, and plain text. Extracts structured data against a JSON Schema. Crawls a small surface with robots.txt support. Renders JavaScript-heavy pages through any already-installed Chromium-family browser over the Chrome DevTools Protocol. Aggregates web search and serves infoboxes when you want a search UI too.

Zero runtime dependencies. No Puppeteer, no Playwright, no PDF library, no browser binary shipped.

Runs in three modes: standalone server with a browser UI, MCP tool module for tiny-mcp-server, or library for direct use.

Install

bun add @fredrikpaulin/webscan          # as a library
bun install -g @fredrikpaulin/webscan   # as the `webscan` standalone server

import { fetchUrl, extract, crawl } from '@fredrikpaulin/webscan'
import { parsePdf } from '@fredrikpaulin/webscan/content/pdf'
import webscan from '@fredrikpaulin/webscan/mcp'

Quick start

Fetch any URL — HTML, PDF, JSON, feed, CSV, text — and get back the same shape:

import { fetchUrl } from '@fredrikpaulin/webscan'

const article = await fetchUrl('https://example.com/article')
article.kind        // 'html'
article.title       // 'Fixture Article'
article.text        // 'Headline\nBody paragraph one. …'
article.links       // [{ href, text, rel, internal, … }, …]

const report = await fetchUrl('https://example.com/q4-report.pdf')
report.kind         // 'pdf'
report.title        // 'Quarterly report'  (from /Info)
report.text         // 'Page 1 text…\n\n--- Page 2 ---\n\n…'
report.metadata     // { pageCount: 12, author, encrypted: false, … }

const feed = await fetchUrl('https://example.com/feed.rss')
feed.kind           // 'feed'
feed.parts          // [{ type:'feed-item', title, url, publishedAt, … }, …]
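
The page separators shown in report.text above suggest that PDF text can be split back into per-page strings. The sketch below assumes the separator always looks like the one in that single example; splitPdfPages is a hypothetical helper, not part of the package.

```javascript
// Sketch: split PDF text on the "--- Page N ---" separators shown in the example above.
// `splitPdfPages` is a hypothetical helper; the exact separator format is assumed
// from that single example.
function splitPdfPages(text) {
  return text.split(/\n\n--- Page \d+ ---\n\n/)
}

const pages = splitPdfPages(
  'First page.\n\n--- Page 2 ---\n\nSecond page.\n\n--- Page 3 ---\n\nThird page.'
)
console.log(pages.length) // 3
console.log(pages[1])     // Second page.
```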

For schema-driven extraction:

import { fetchUrl, extract } from '@fredrikpaulin/webscan'

const article = await fetchUrl('https://example.com/article')

const { data } = await extract(article, {
  type: 'object',
  properties: {
    title: { type: 'string', 'x-extract': { selector: 'h1' } },
    price: { type: 'number', 'x-extract': { selector: '.price', transform: 'number' } },
    sku:   { type: 'string', 'x-extract': { regex: { pattern: 'SKU:\\s*([A-Z0-9-]+)', group: 1 } } },
    author: { type: 'string' }   // un-hinted → optional local LLM fallback
  }
})

For JavaScript-rendered pages:

import { fetchUrl } from '@fredrikpaulin/webscan'

const spa = await fetchUrl('https://app.example.com/', {
  render: { waitForSelector: '.product-grid', actions: [
    { type: 'click', selector: '.load-more' },
    { type: 'waitForSelector', selector: '.page-2' }
  ] }
})
spa.rendered    // true — DOM is the post-interaction state

For a bounded crawl:

import { crawl } from '@fredrikpaulin/webscan'

for await (const r of crawl({
  seeds: ['https://example.com/'],
  depth: 2,
  maxPages: 50,
  respectRobots: true
})) {
  if (r.error) console.warn(r.url, r.error.code)
  else console.log(r.url, r.page.title)
}

For the standalone search server:

bun start                # http://127.0.0.1:5000
bun run dev              # with hot reload

The package root (src/index.js) is side-effect-free. Importing it starts no server.

What it does

Content-aware fetch. fetchUrl(url) reads the response bytes once and dispatches them through a content layer: HTML goes through a zero-dependency tree parser and metadata extractor, PDF through a born-digital text extractor, with JSON, XML, RSS/Atom feeds, CSV/TSV, Markdown, and plain text handled alongside them. Every kind returns the same ParsedContent shape, so downstream code doesn't branch. See docs/fetch.md, docs/content.md, docs/pdf.md.
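
One way to picture the dispatch step is a small classifier that maps a Content-Type header, plus a peek at the first bytes, to a kind. This is an illustrative sketch only: detectKind, its rule order, and every kind name other than 'html', 'pdf', and 'feed' are assumptions, not WEBSCAN's actual dispatcher.

```javascript
// Hypothetical sketch of content-kind detection; not WEBSCAN's real dispatcher.
// `detectKind`, its rule table, and most kind names are assumptions for illustration.
function detectKind(contentType, bytes) {
  // Normalise "text/html; charset=utf-8" to "text/html".
  const mime = (contentType || '').split(';')[0].trim().toLowerCase()

  // PDF files start with the "%PDF-" magic bytes, whatever the header says.
  const head = String.fromCharCode(...bytes.slice(0, 5))
  if (mime === 'application/pdf' || head === '%PDF-') return 'pdf'

  if (mime === 'text/html' || mime === 'application/xhtml+xml') return 'html'
  if (mime === 'application/rss+xml' || mime === 'application/atom+xml') return 'feed'
  if (mime === 'application/json' || mime.endsWith('+json')) return 'json'
  if (mime === 'application/xml' || mime === 'text/xml' || mime.endsWith('+xml')) return 'xml'
  if (mime === 'text/csv' || mime === 'text/tab-separated-values') return 'csv'
  if (mime === 'text/markdown') return 'markdown'
  if (mime.startsWith('text/')) return 'text'
  return 'binary'
}

const pdfBytes = new TextEncoder().encode('%PDF-1.7')
console.log(detectKind('application/octet-stream', pdfBytes))         // pdf
console.log(detectKind('text/html; charset=utf-8', new Uint8Array())) // html
```

Sniffing the magic bytes before trusting the header is a design choice in this sketch: servers often label PDFs as application/octet-stream.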

Schema-driven extract. extract(input, schema) resolves x-extract hints — CSS selectors, regex against text, metadata path lookups — per field. Un-hinted fields fall through to an optional local Ollama call, with Ollama's format: schema constraint giving you valid JSON without an API key or a third-party round-trip. The LLM seam is provider-agnostic: pass your own (prompt, schema) => object function to use Anthropic, OpenAI, or anything else. See docs/extract.md.
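
To make the hint idea concrete, here is a minimal sketch of how regex-style x-extract hints could be resolved against plain text. resolveRegexHints is hypothetical and covers only the regex hint; it is not the library's resolver, and the coercion rule for number fields is an assumption.

```javascript
// Hypothetical resolver for regex-style x-extract hints; a sketch, not the library's code.
function resolveRegexHints(text, schema) {
  const data = {}
  for (const [field, spec] of Object.entries(schema.properties || {})) {
    const hint = spec['x-extract']
    if (!hint || !hint.regex) continue // un-hinted fields are left for another resolver
    const match = new RegExp(hint.regex.pattern).exec(text)
    if (!match) continue
    const raw = match[hint.regex.group ?? 0]
    // Assumed rule: coerce to a number when the schema declares a numeric type.
    data[field] = spec.type === 'number' ? Number(raw) : raw
  }
  return data
}

const schema = {
  type: 'object',
  properties: {
    sku:    { type: 'string', 'x-extract': { regex: { pattern: 'SKU:\\s*([A-Z0-9-]+)', group: 1 } } },
    price:  { type: 'number', 'x-extract': { regex: { pattern: 'Price:\\s*([0-9.]+)', group: 1 } } },
    author: { type: 'string' } // un-hinted, so this sketch skips it
  }
}

const result = resolveRegexHints('SKU: AB-1234 Price: 19.50', schema)
console.log(result.sku, result.price) // AB-1234 19.5
```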

Bounded BFS crawl. crawl(opts) is an async iterable over outbound links. It respects depth, page caps, same-domain rules, include/exclude regexes, concurrency, per-host politeness, and robots.txt. Non-HTML content (PDFs, CSVs, feeds) follows the same parse path. Breaking out of a for await … of loop early tears the iterator down and closes everything it owns. See docs/crawl.md, docs/robots.md.
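
The bounding logic can be sketched without any network access by walking an in-memory link graph. boundedBfs and linkGraph are hypothetical stand-ins, and the sketch assumes that depth 0 means "seeds only"; the real crawl may count depth differently.

```javascript
// Sketch of a bounded breadth-first crawl over an in-memory link graph.
// `boundedBfs` and `linkGraph` are hypothetical; a real crawl would fetch each URL.
function* boundedBfs(linkGraph, { seeds, depth, maxPages }) {
  const seen = new Set(seeds)
  const queue = seeds.map((url) => ({ url, level: 0 }))
  let emitted = 0

  while (queue.length > 0 && emitted < maxPages) {
    const { url, level } = queue.shift()
    emitted += 1
    yield { url, level }

    if (level >= depth) continue // do not expand links past the depth limit
    for (const next of linkGraph[url] || []) {
      if (seen.has(next)) continue // visit each URL at most once
      seen.add(next)
      queue.push({ url: next, level: level + 1 })
    }
  }
}

const linkGraph = {
  'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a': ['https://example.com/a/deep', 'https://example.com/'],
  'https://example.com/b': [],
  'https://example.com/a/deep': ['https://example.com/a/deeper']
}

const visited = [...boundedBfs(linkGraph, { seeds: ['https://example.com/'], depth: 1, maxPages: 50 })]
console.log(visited.map((v) => v.url).join(' '))
// https://example.com/ https://example.com/a https://example.com/b
```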

Optional headless rendering. fetchUrl(url, { render: true }) drives Chrome, Chromium, Edge, or Brave — whichever is already installed — over the Chrome DevTools Protocol. Click, type, scroll, screenshot. Real coordinate-based input events through Input.dispatchMouseEvent so SPA event handlers fire as if a real user touched them. URL policy blocks loopback, private networks, and cloud metadata at both the navigation layer and request interception. The MCP render_page tool hard-codes the safety boundaries — no evaluate, no password fills, no inline screenshot bytes. See docs/render.md, docs/browser-setup.md.
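
A URL policy of this kind can be approximated with a host check like the one below. It is a sketch under stated assumptions: isBlockedUrl is a hypothetical name, it only inspects literal hostnames and IPv4 addresses, and a production policy would also need DNS resolution checks and fuller IPv6 handling.

```javascript
// Hypothetical URL policy check; a sketch of the idea, not WEBSCAN's implementation.
// Covers hostnames and IPv4 literals only.
function isBlockedUrl(rawUrl) {
  let url
  try {
    url = new URL(rawUrl)
  } catch {
    return true // unparseable URLs are rejected
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return true

  const host = url.hostname.toLowerCase()
  if (host === 'localhost' || host.endsWith('.localhost')) return true
  if (host === '[::1]') return true // IPv6 loopback

  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host)
  if (!m) return false // not an IPv4 literal; treated as public in this sketch
  const a = Number(m[1])
  const b = Number(m[2])
  if (a === 127) return true                        // loopback 127.0.0.0/8
  if (a === 10) return true                         // private 10.0.0.0/8
  if (a === 172 && b >= 16 && b <= 31) return true  // private 172.16.0.0/12
  if (a === 192 && b === 168) return true           // private 192.168.0.0/16
  if (a === 169 && b === 254) return true           // link-local, incl. 169.254.169.254 metadata
  return false
}

console.log(isBlockedUrl('http://169.254.169.254/latest/meta-data/')) // true
console.log(isBlockedUrl('https://example.com/'))                     // false
```

Running the same check at both navigation time and request interception, as the paragraph above describes, matters because a public page can still redirect or issue subrequests to a private address.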

Search + infobox UI. The original v0.1 features still ship: parallel scraping of DuckDuckGo Lite, Startpage, Yahoo, and Bing Images; calculator, unit conversion (44 units), currency conversion (70+ fiat and 30+ crypto currencies), dictionary, and Wikipedia summary infoboxes; a SQLite cache; proxy rotation with failure tracking; and OpenSearch XML for browser integration. See docs/usage.md, docs/search-engines.md, docs/infobox.md.

MCP tools. fetch_url, extract, crawl, render_page, web_search, image_search, calculate, convert_units, convert_currency, define_word, wikipedia_summary. The crawl tool caps depth at 3, pages at 200, concurrency at 8 — the LLM-friendly bounded version. The render tool hard-codes the §8 safety boundaries: no allowPrivateNetwork in the schema, no evaluate action, no inline screenshot bytes. See docs/mcp-integration.md.
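
The crawl caps can be sketched as a clamp applied to incoming tool arguments. The limits are the ones stated above; clampCrawlArgs and the concurrency argument name are assumptions, and the real tool may reject out-of-range values instead of clamping them.

```javascript
// Hypothetical clamp for MCP crawl arguments. The limits come from the text above;
// the function name and the `concurrency` key are assumptions.
const CRAWL_LIMITS = { depth: 3, maxPages: 200, concurrency: 8 }

function clampCrawlArgs(args) {
  const clamped = { ...args }
  for (const [key, max] of Object.entries(CRAWL_LIMITS)) {
    if (typeof clamped[key] === 'number') clamped[key] = Math.min(clamped[key], max)
  }
  return clamped
}

const safe = clampCrawlArgs({ seeds: ['https://example.com/'], depth: 10, maxPages: 50 })
console.log(safe.depth, safe.maxPages) // 3 50
```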

Why this exists

Most web-fetch libraries handle HTML and stop. The interesting research material is mixed: PDFs from regulators, RSS from blogs, CSVs from open data portals, JSON from APIs, occasionally an SPA whose useful content only appears after JavaScript runs. WEBSCAN treats those as the same problem with the same return shape — fetch the bytes, route them to the right parser, return readable text. A crawler that follows HTML links but downloads PDFs lossily (or worse, silently) is a tool you have to work around. One that returns the same ParsedContent for both is one you can build on.

The browser path exists for the cases where a full DOM is required and nothing else works. It uses whatever Chromium-family browser is already on the machine, drives it over CDP from ~140 lines of JSON-over-WebSocket code, and doesn't pull a thousand-package dependency tree to do so. It runs on a developer laptop and on a Raspberry Pi 5 with the same code path.

The LLM path defaults to local Ollama because the contract for "give me JSON matching this schema" finally landed on the local side in late 2025. Cloud LLMs are still better at messy extraction; they're no longer the only way to get structured output. The library defaults to the option that doesn't require an API key, and the seam is open for callers who want to use Anthropic, OpenAI, or anything else.

Configuration

WEBSCAN_PORT=8080 WEBSCAN_HOST=0.0.0.0 bun start
WEBSCAN_FETCH_HTTP_TIMEOUT=10000 \
WEBSCAN_CRAWL_CONCURRENCY=4 \
WEBSCAN_EXTRACT_LLM_ENABLED=true \
WEBSCAN_EXTRACT_LLM_MODEL=llama3.2 \
  bun start

Settings can also live in a webscan.config.json file. The schema covers 31 keys across three eras (v0.1 search/server, v0.2 fetch/crawl/extract, v0.3 content/browser). See docs/configuration.md.
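
The environment variables above appear to pair with camelCase option names (for example WEBSCAN_EXTRACT_LLM_MODEL and extractLlmModel in the MCP options). The sketch below shows that naming rule as a function. The rule is inferred from those pairs and is an assumption; envToOptions is hypothetical, and docs/configuration.md remains the authority on the actual keys.

```javascript
// Hypothetical mapping from WEBSCAN_* environment variables to camelCase option keys.
// The naming rule is inferred from pairs like WEBSCAN_EXTRACT_LLM_MODEL / extractLlmModel
// and is an assumption, not documented behaviour.
function envToOptions(env) {
  const options = {}
  for (const [name, value] of Object.entries(env)) {
    if (!name.startsWith('WEBSCAN_')) continue
    const key = name
      .slice('WEBSCAN_'.length)
      .toLowerCase()
      .replace(/_([a-z0-9])/g, (_, ch) => ch.toUpperCase())
    // Coerce obvious booleans and integers; leave everything else as a string.
    if (value === 'true' || value === 'false') options[key] = value === 'true'
    else if (/^\d+$/.test(value)) options[key] = Number(value)
    else options[key] = value
  }
  return options
}

const options = envToOptions({
  WEBSCAN_FETCH_HTTP_TIMEOUT: '10000',
  WEBSCAN_EXTRACT_LLM_ENABLED: 'true',
  WEBSCAN_EXTRACT_LLM_MODEL: 'llama3.2',
  PATH: '/usr/bin'
})
console.log(options.fetchHttpTimeout, options.extractLlmEnabled, options.extractLlmModel)
// 10000 true llama3.2
```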

MCP integration

import webscan from '@fredrikpaulin/webscan/mcp'

const module = webscan({
  cacheDir: ':memory:',
  extractLlmEnabled: true,
  extractLlmModel: 'llama3.2'
})

The MCP module registers eleven tools, with safety boundaries baked into the schemas where it matters (no evaluate in render_page, no render in crawl, no inline binary in either). See docs/mcp-integration.md.

Tests

bun test tests/

427 tests across 59 files. Real-browser integration tests are gated behind WEBSCAN_TEST_BROWSER=1 so CI without a browser stays green:

WEBSCAN_TEST_BROWSER=1 bun test tests/browser-render.test.js

A small SPA fixture in tests/fixtures/spa/ drives the gated tests deterministically.

Documentation

Usage — endpoints, browser integration, server config
Configuration — all keys, env vars, JSON config
MCP Integration — tool schemas, programmatic API
Fetch — fetchUrl, return shape, conditional fetch
Content dispatcher — parseContent, format support, ParsedContent
PDF — born-digital text extractor, scope, diagnostics
Crawl — BFS, robots, content-type filters, render lifecycle
Robots — checkRobots, parsing rules, permissive-on-error
Extract — schema, x-extract hints, Ollama backing
Render — render options, actions, screenshots, MCP boundaries
Browser setup — detection, override paths, troubleshooting
Parsing engine — parseWebsite, selectors, the tree parser
Search engines — built-in engines, parsers
Infobox handlers — calculator, units, currency, dictionary, Wikipedia
Template engine — {{ var }}, loops, conditionals, includes
Roadmap — planned features, version targets
Changelog — release history

License

MIT — see LICENSE.md.
