@fredrikpaulin/webscan

v0.3.1

Bun-native metasearch + content-aware fetch/crawl/extract. Static-HTML parsing engine, born-digital PDF text extractor, multi-format content dispatcher (HTML / PDF / JSON / XML / feeds / CSV / text / Markdown / binary), optional headless rendering over Ch

WEBSCAN

A Bun-native toolkit for getting useful text out of the web. Fetches a URL and returns readable text, metadata, and outbound links — for HTML, PDF, JSON, RSS/Atom, CSV, Markdown, and plain text. Extracts structured data against a JSON Schema. Crawls a small surface with robots.txt support. Renders JavaScript-heavy pages through any already-installed Chromium-family browser over the Chrome DevTools Protocol. Aggregates web search and serves infoboxes when you want a search UI too.

Zero runtime dependencies. No Puppeteer, no Playwright, no PDF library, no browser binary shipped.

Runs in three modes: standalone server with a browser UI, MCP tool module for tiny-mcp-server, or library for direct use.

Install

bun add @fredrikpaulin/webscan          # as a library
bun install -g @fredrikpaulin/webscan   # as the `webscan` standalone server

import { fetchUrl, extract, crawl } from '@fredrikpaulin/webscan'
import { parsePdf } from '@fredrikpaulin/webscan/content/pdf'
import webscan from '@fredrikpaulin/webscan/mcp'

Quick start

Fetch any URL — HTML, PDF, JSON, feed, CSV, text — and get back the same shape:

import { fetchUrl } from '@fredrikpaulin/webscan'

const article = await fetchUrl('https://example.com/article')
article.kind        // 'html'
article.title       // 'Fixture Article'
article.text        // 'Headline\nBody paragraph one. …'
article.links       // [{ href, text, rel, internal, … }, …]

const report = await fetchUrl('https://example.com/q4-report.pdf')
report.kind         // 'pdf'
report.title        // 'Quarterly report'  (from /Info)
report.text         // 'Page 1 text…\n\n--- Page 2 ---\n\n…'
report.metadata     // { pageCount: 12, author, encrypted: false, … }

const feed = await fetchUrl('https://example.com/feed.rss')
feed.kind           // 'feed'
feed.parts          // [{ type:'feed-item', title, url, publishedAt, … }, …]

For schema-driven extraction:

import { extract } from '@fredrikpaulin/webscan'

const { data } = await extract(article, {
  type: 'object',
  properties: {
    title: { type: 'string', 'x-extract': { selector: 'h1' } },
    price: { type: 'number', 'x-extract': { selector: '.price', transform: 'number' } },
    sku:   { type: 'string', 'x-extract': { regex: { pattern: 'SKU:\\s*([A-Z0-9-]+)', group: 1 } } },
    author: { type: 'string' }   // un-hinted → optional local LLM fallback
  }
})

For JavaScript-rendered pages:

import { fetchUrl } from '@fredrikpaulin/webscan'

const spa = await fetchUrl('https://app.example.com/', {
  render: { waitForSelector: '.product-grid', actions: [
    { type: 'click', selector: '.load-more' },
    { type: 'waitForSelector', selector: '.page-2' }
  ] }
})
spa.rendered    // true — DOM is the post-interaction state

For a bounded crawl:

import { crawl } from '@fredrikpaulin/webscan'

for await (const r of crawl({
  seeds: ['https://example.com/'],
  depth: 2,
  maxPages: 50,
  respectRobots: true
})) {
  if (r.error) console.warn(r.url, r.error.code)
  else console.log(r.url, r.page.title)
}

For the standalone search server:

bun start                # http://127.0.0.1:5000
bun run dev              # with hot reload

The package root (src/index.js) is side-effect-free. Importing it starts no server.

What it does

Content-aware fetch. fetchUrl(url) reads bytes once and dispatches through a content layer: HTML through a zero-dependency tree parser and metadata extractor, PDF through a born-digital text extractor, JSON, XML, RSS/Atom feeds, CSV/TSV, Markdown, plain text. Every kind returns the same ParsedContent shape so downstream code doesn't branch. See docs/fetch.md, docs/content.md, docs/pdf.md.
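
The dispatch step described above can be pictured roughly as follows. This is a minimal sketch, not WEBSCAN's actual implementation: the names `detectKind` and `parseBytes`, the MIME table, and the reduced result fields are assumptions made for illustration. Only the `kind` values 'html', 'pdf' and 'feed' are confirmed by the quick-start examples; the other kind strings are guesses.

```javascript
// Hypothetical MIME-to-kind table; the library's real table may differ.
const KIND_BY_MIME = {
  'text/html': 'html',
  'application/pdf': 'pdf',
  'application/json': 'json',
  'application/rss+xml': 'feed',
  'application/atom+xml': 'feed',
  'text/csv': 'csv',
  'text/markdown': 'markdown',
  'text/plain': 'text'
}

// Pick a kind from the Content-Type header, falling back to magic bytes.
function detectKind(contentType, bytes) {
  const mime = (contentType || '').split(';')[0].trim().toLowerCase()
  if (KIND_BY_MIME[mime]) return KIND_BY_MIME[mime]
  // PDF files start with the ASCII signature "%PDF-".
  const head = new TextDecoder().decode(bytes.slice(0, 5))
  if (head === '%PDF-') return 'pdf'
  return 'binary'
}

// One parser per kind; every parser returns the same fields.
// Only three trivial parsers are sketched here.
const PARSERS = {
  json: (text) => ({ title: null, text: JSON.stringify(JSON.parse(text), null, 2) }),
  csv: (text) => ({ title: null, text: text.trim() }),
  text: (text) => ({ title: null, text })
}

// Read the bytes once, route them to a parser, return one common shape.
function parseBytes(url, contentType, bytes) {
  const kind = detectKind(contentType, bytes)
  const parser = PARSERS[kind]
  const decoded = new TextDecoder().decode(bytes)
  const parsed = parser ? parser(decoded) : { title: null, text: '' }
  return { url, kind, links: [], metadata: {}, ...parsed }
}
```

Because every branch returns the same fields, a caller can read `result.text` without checking `result.kind` first, which is the property the paragraph above describes.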

Schema-driven extract. extract(input, schema) resolves x-extract hints — CSS selectors, regex against text, metadata path lookups — per field. Un-hinted fields fall through to an optional local Ollama call, with Ollama's format: schema constraint giving you valid JSON without an API key or a third-party round-trip. The LLM seam is provider-agnostic: pass your own (prompt, schema) => object function to use Anthropic, OpenAI, or anything else. See docs/extract.md.
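
To make the hint resolution concrete, here is a minimal sketch that handles only the regex form of `x-extract` shown in the quick start. It is an illustration under stated assumptions, not the library's code: `resolveRegexHints` is a hypothetical name, applying `transform: 'number'` to a regex hit is assumed (the quick start pairs it with a selector), and selector and metadata hints are left out because they need a parsed document.

```javascript
// Resolve regex-based 'x-extract' hints against plain text.
// Fields without a usable hint are reported back as unresolved, which is
// where a fallback such as an LLM call could take over.
function resolveRegexHints(text, schema) {
  const data = {}
  const unresolved = []
  for (const [name, field] of Object.entries(schema.properties || {})) {
    const hint = field['x-extract']
    if (!hint || !hint.regex) { unresolved.push(name); continue }
    const match = new RegExp(hint.regex.pattern).exec(text)
    if (!match) { unresolved.push(name); continue }
    let value = match[hint.regex.group ?? 0]
    // Assumed behaviour of transform 'number': keep digits, sign and dot only.
    if (hint.transform === 'number') {
      value = Number(String(value).replace(/[^0-9.-]/g, ''))
    }
    data[name] = value
  }
  return { data, unresolved }
}
```

Returning the unresolved field names separately keeps the deterministic pass and the optional model call apart, so the model is only asked about fields that the hints could not fill.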

Bounded BFS crawl. crawl(opts) is an async iterable over outbound links. Respects depth, page caps, same-domain rules, include/exclude regexes, concurrency, per-host politeness, and robots.txt. Non-HTML content (PDFs, CSVs, feeds) follows the same parse path. Iterator teardown (for await … of early break) closes everything it owns. See docs/crawl.md, docs/robots.md.
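
The core of a bounded breadth-first crawl fits in a short async generator. The sketch below is an assumption-laden illustration rather than WEBSCAN's crawler: `boundedCrawl` and the injected `fetchPage` function are hypothetical, `page.links` is assumed to be a plain array of URLs, and same-domain rules, include/exclude filters, concurrency, politeness delays and robots.txt are all left out.

```javascript
// Breadth-first crawl with a depth limit and a page cap.
// fetchPage(url) is a caller-supplied stand-in for the real fetch + parse step;
// it must resolve to an object with a `links` array of URLs.
async function* boundedCrawl({ seeds, depth = 2, maxPages = 50, fetchPage }) {
  const seen = new Set(seeds)
  const queue = seeds.map((url) => ({ url, level: 0 }))
  let visited = 0
  while (queue.length > 0 && visited < maxPages) {
    const { url, level } = queue.shift()
    visited += 1
    let page
    try {
      page = await fetchPage(url)
    } catch (error) {
      yield { url, error } // report the failure and keep crawling
      continue
    }
    yield { url, page }
    if (level >= depth) continue // do not expand links beyond the depth limit
    for (const href of page.links) {
      if (!seen.has(href)) {
        seen.add(href)
        queue.push({ url: href, level: level + 1 })
      }
    }
  }
}
```

Because this is a generator, breaking out of a `for await` loop stops the crawl at the current page, and no further fetches are started.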

Optional headless rendering. fetchUrl(url, { render: true }) drives Chrome, Chromium, Edge, or Brave — whichever is already installed — over the Chrome DevTools Protocol. Click, type, scroll, screenshot. Real coordinate-based input events through Input.dispatchMouseEvent so SPA event handlers fire as if a real user touched them. URL policy blocks loopback, private networks, and cloud metadata at both the navigation layer and request interception. The MCP render_page tool hard-codes the safety boundaries — no evaluate, no password fills, no inline screenshot bytes. See docs/render.md, docs/browser-setup.md.
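
A URL policy of the kind described above might look like the following sketch. It is not the library's policy code: `isBlockedUrl` is a hypothetical name, only literal IPv4 addresses, `localhost` and the IPv6 loopback are checked, and a real policy would also need to check the addresses a hostname resolves to.

```javascript
// Returns true when the URL targets loopback, a private IPv4 range,
// or the link-local range that hosts cloud metadata (169.254.169.254).
function isBlockedUrl(rawUrl) {
  const { protocol, hostname } = new URL(rawUrl)
  if (protocol !== 'http:' && protocol !== 'https:') return true
  const host = hostname.toLowerCase()
  if (host === 'localhost' || host.endsWith('.localhost')) return true
  if (host === '[::1]') return true // IPv6 loopback
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host)
  if (!m) return false // a hostname: DNS resolution is not checked in this sketch
  const a = Number(m[1])
  const b = Number(m[2])
  if (a === 0 || a === 10 || a === 127) return true // 0/8, 10/8, loopback
  if (a === 172 && b >= 16 && b <= 31) return true  // 172.16/12
  if (a === 192 && b === 168) return true           // 192.168/16
  if (a === 169 && b === 254) return true           // link-local, incl. metadata
  return false
}
```

Running the same check on the initial navigation and again on every intercepted request, as the paragraph above describes, matters because a public page can still redirect to or embed a private address.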

Search + infobox UI. The original v0.1 features stay shipped: parallel scrape against DuckDuckGo Lite, Startpage, Yahoo and Bing Images; calculator, unit conversion (44 units), currency conversion (70+ fiat + 30+ crypto), dictionary, Wikipedia summary; SQLite cache; proxy rotation with failure tracking; OpenSearch XML for browser integration. See docs/usage.md, docs/search-engines.md, docs/infobox.md.

MCP tools. fetch_url, extract, crawl, render_page, web_search, image_search, calculate, convert_units, convert_currency, define_word, wikipedia_summary. The crawl tool is the LLM-friendly bounded version: it caps depth at 3, pages at 200, and concurrency at 8. The render tool hard-codes its safety boundaries: no allowPrivateNetwork in the schema, no evaluate action, no inline screenshot bytes. See docs/mcp-integration.md.
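
The crawl caps can be read as a simple clamp applied to whatever the caller requests. The sketch below uses the three limits stated above; the function name `clampCrawlOptions`, the option key `concurrency`, and the choice to fall back to the cap when a value is missing or invalid are assumptions, not documented behaviour.

```javascript
// Limits taken from the text above: depth 3, pages 200, concurrency 8.
const MCP_CRAWL_LIMITS = { depth: 3, maxPages: 200, concurrency: 8 }

// Return a copy of the requested options with each limited key bounded.
function clampCrawlOptions(requested) {
  const clamped = { ...requested }
  for (const [key, max] of Object.entries(MCP_CRAWL_LIMITS)) {
    const value = Number(requested[key])
    // Missing or invalid values fall back to the cap (an assumed default).
    clamped[key] = Number.isFinite(value) && value >= 0 ? Math.min(value, max) : max
  }
  return clamped
}
```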

Why this exists

Most web-fetch libraries handle HTML and stop. The interesting research material is mixed: PDFs from regulators, RSS from blogs, CSVs from open data portals, JSON from APIs, occasionally an SPA whose useful content only appears after JavaScript runs. WEBSCAN treats those as the same problem with the same return shape — fetch the bytes, route them to the right parser, return readable text. A crawler that follows HTML links but downloads PDFs lossily (or worse, silently) is a tool you have to work around. One that returns the same ParsedContent for both is one you can build on.

The browser path exists for the cases where a full DOM is required and nothing else works. It uses whatever Chromium-family browser is already on the machine, drives it over CDP from ~140 lines of JSON-over-WebSocket code, and doesn't pull a thousand-package dependency tree to do so. It runs on a developer laptop and on a Raspberry Pi 5 with the same code path.
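
The JSON-over-WebSocket part is small because the Chrome DevTools Protocol framing is simple: each request carries a numeric `id`, the matching response echoes that `id` with either `result` or `error`, and events arrive with a `method` but no `id`. The class below is a minimal sketch of that correlation, not the library's client; `CdpClient` and its `onEvent` hook are hypothetical names, and the socket is injected so the logic can run without a browser.

```javascript
// Minimal CDP-style client: correlates responses to requests by numeric id.
// `socket` is anything with send(string) and an assignable onmessage handler,
// such as a WebSocket connected to the browser's debugging endpoint.
class CdpClient {
  constructor(socket) {
    this.socket = socket
    this.nextId = 1
    this.pending = new Map() // id -> { resolve, reject }
    this.onEvent = () => {}   // called for messages that carry no id
    socket.onmessage = (event) => this.handle(JSON.parse(event.data))
  }

  // Send one command and resolve with its result once the reply arrives.
  send(method, params = {}) {
    const id = this.nextId++
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject })
      this.socket.send(JSON.stringify({ id, method, params }))
    })
  }

  // Route an incoming message to its waiting caller or to the event hook.
  handle(message) {
    if (message.id === undefined) return this.onEvent(message)
    const waiter = this.pending.get(message.id)
    if (!waiter) return
    this.pending.delete(message.id)
    if (message.error) waiter.reject(new Error(message.error.message))
    else waiter.resolve(message.result)
  }
}
```

With a real socket, a call such as `await client.send('Page.navigate', { url })` would resolve once the browser answers the request with the same id.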

The LLM path defaults to local Ollama because the contract for "give me JSON matching this schema" finally landed on the local side in late 2024, when Ollama added schema-constrained output through its format parameter. Cloud LLMs are still better at messy extraction; they're no longer the only way to get structured output. The library defaults to the option that doesn't require an API key, and the seam is open for callers who want to use Anthropic, OpenAI, or anything else.
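
A provider function with the `(prompt, schema) => object` shape mentioned earlier could be written against Ollama's chat endpoint roughly like this. It is a sketch under assumptions: `makeOllamaProvider` and the injectable `fetchImpl` are hypothetical, the default base URL is Ollama's usual local address, and how such a function is handed to `extract` is not shown here (see docs/extract.md for the actual option).

```javascript
// Build a (prompt, schema) => object function backed by Ollama's /api/chat.
// fetchImpl is injectable so the function can be exercised without a server.
function makeOllamaProvider({
  model = 'llama3.2',
  baseUrl = 'http://localhost:11434',
  fetchImpl = fetch
} = {}) {
  return async function provider(prompt, schema) {
    const response = await fetchImpl(`${baseUrl}/api/chat`, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({
        model,
        stream: false,
        format: schema, // constrains the reply to JSON matching the schema
        messages: [{ role: 'user', content: prompt }]
      })
    })
    if (!response.ok) throw new Error(`Ollama returned HTTP ${response.status}`)
    const payload = await response.json()
    // With stream: false the reply text arrives in message.content as a JSON string.
    return JSON.parse(payload.message.content)
  }
}
```

Swapping in a cloud provider would mean writing another function with the same two parameters and the same return type, which is all the seam requires.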

Configuration

WEBSCAN_PORT=8080 WEBSCAN_HOST=0.0.0.0 bun start
WEBSCAN_FETCH_HTTP_TIMEOUT=10000 \
WEBSCAN_CRAWL_CONCURRENCY=4 \
WEBSCAN_EXTRACT_LLM_ENABLED=true \
WEBSCAN_EXTRACT_LLM_MODEL=llama3.2 \
  bun start

The same settings can also live in a webscan.config.json file. The schema covers 31 keys across three eras (v0.1 search/server, v0.2 fetch/crawl/extract, v0.3 content/browser). See docs/configuration.md.
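
One plausible way to relate the environment variables above to config keys is a prefix strip plus camelCase conversion. This mapping is an assumption inferred from the pairing of WEBSCAN_EXTRACT_LLM_ENABLED with the `extractLlmEnabled` option in the MCP example; `configFromEnv` is a hypothetical helper, and docs/configuration.md remains the reference for the real key names and coercion rules.

```javascript
// Turn WEBSCAN_* environment variables into camelCase config keys.
// Booleans and plain integers are coerced; everything else stays a string.
function configFromEnv(env, prefix = 'WEBSCAN_') {
  const config = {}
  for (const [name, raw] of Object.entries(env)) {
    if (!name.startsWith(prefix)) continue
    const key = name
      .slice(prefix.length)
      .toLowerCase()
      .replace(/_([a-z0-9])/g, (_, c) => c.toUpperCase())
    if (raw === 'true' || raw === 'false') config[key] = raw === 'true'
    else if (/^\d+$/.test(raw)) config[key] = Number(raw)
    else config[key] = raw
  }
  return config
}
```

Called as `configFromEnv(process.env)`, it ignores every variable that does not start with the prefix.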

MCP integration

import webscan from '@fredrikpaulin/webscan/mcp'

const module = webscan({
  cacheDir: ':memory:',
  extractLlmEnabled: true,
  extractLlmModel: 'llama3.2'
})

The MCP module registers eleven tools, with safety boundaries baked into the schemas where it matters (no evaluate in render_page, no render in crawl, no inline binary in either). See docs/mcp-integration.md.

Tests

bun test tests/

427 tests across 59 files. Real-browser integration tests are gated behind WEBSCAN_TEST_BROWSER=1 so CI without a browser stays green:

WEBSCAN_TEST_BROWSER=1 bun test tests/browser-render.test.js

A small SPA fixture in tests/fixtures/spa/ drives the gated tests deterministically.

Documentation

  • Usage — endpoints, browser integration, server config
  • Configuration — all keys, env vars, JSON config
  • MCP Integration — tool schemas, programmatic API
  • Fetch — fetchUrl, return shape, conditional fetch
  • Content dispatcher — parseContent, format support, ParsedContent
  • PDF — born-digital text extractor, scope, diagnostics
  • Crawl — BFS, robots, content-type filters, render lifecycle
  • Robots — checkRobots, parsing rules, permissive-on-error
  • Extract — schema, x-extract hints, Ollama backing
  • Render — render options, actions, screenshots, MCP boundaries
  • Browser setup — detection, override paths, troubleshooting
  • Parsing engine — parseWebsite, selectors, the tree parser
  • Search engines — built-in engines, parsers
  • Infobox handlers — calculator, units, currency, dictionary, Wikipedia
  • Template engine — {{ var }}, loops, conditionals, includes
  • Roadmap — planned features, version targets
  • Changelog — release history

License

MIT — see LICENSE.md.