feedstock

v0.5.0

High-performance web crawler and scraper for TypeScript, powered by Bun and Playwright

Readme

Features

  • Single & multi-page crawling with concurrent execution
  • Deep crawling — BFS, DFS, BestFirst, Bandit (UCB1 online learning), and Focused (RL/Q-learning) strategies
  • Content extraction — CSS selectors, regex, XPath, table, accessibility tree, and composite multi-strategy extraction
  • Markdown generation with citation support
  • Smart caching with ETag/Last-Modified validation via bun:sqlite and multi-signal freshness evaluation (sitemap, HTTP headers, content hash, time decay)
  • URL filtering — pattern, domain, and content-type filters
  • URL scoring — keyword relevance, path depth, freshness, domain authority, neural quality estimation (online-learning), and UCB1 bandit scoring
  • Rate limiting — per-domain with exponential backoff
  • Robots.txt parsing and compliance
  • Built-in stealth mode — one flag enables random user-agents, navigator.webdriver override, plugin/language spoofing, human-like mouse/scroll simulation. Enhanced mode generates spatially-consistent fingerprint profiles (UA, platform, WebGL, screen, canvas all agree)
  • Anti-bot detection with auto-retry on blocked pages
  • Multiple browser backends — Playwright (Chromium/Firefox/WebKit), generic CDP (Browserbase, Browserless, etc.), or Lightpanda (local/cloud)
  • Proxy rotation — round-robin strategy with health tracking
  • URL seeding — discover URLs from sitemaps
  • Accessibility snapshots — compact semantic page representation with @e refs for AI consumption
  • Fetch-first engine system — tries lightweight HTTP before launching browser, auto-escalates for SPAs
  • Rich metadata — 50+ fields: Open Graph, Twitter Cards, Dublin Core, JSON-LD, favicons, feeds
  • Content processing — chunking (regex, sliding window, fixed-size) and filtering (pruning, BM25)
  • Interactive element detection — finds all clickable elements including cursor:pointer and onclick handlers
  • Storage state persistence — save/load cookies and localStorage between sessions
  • AI-friendly errors — converts 20+ error patterns into actionable messages
  • Hooks — inject custom behavior at 5 lifecycle points (page created, before/after navigation, etc.)
  • Resource blocking — named profiles (fast, minimal, media-only) or custom patterns for faster crawls
  • Navigation strategies — configurable waitUntil: commit (fastest), domcontentloaded, load, networkidle. Hydration-aware readiness detection auto-detects SPA frameworks and waits for content stability instead of fixed timeouts
  • DOM downsampling — preprocess HTML to reduce DOM size before extraction (strips boilerplate, collapses containers, filters attributes). 30-85% size reduction depending on page complexity
  • In-page extraction — extract links/media/metadata directly in the browser via page.evaluate(), skipping HTML serialization
  • Change tracking — detect new/changed/unchanged/removed pages between crawl runs with text diffs
  • User-agent rotation — pool of 9 realistic browser user-agents with round-robin rotation
  • Graceful shutdown — SIGINT/SIGTERM handlers auto-close browser processes
  • Session management — LRU eviction at 20 concurrent sessions, cache TTL pruning
  • Input validation — friendly error messages for invalid URLs, automatic retry on transient network errors
  • Layered config via a feedstock.json project file + FEEDSTOCK_* environment variables, with programmatic overrides
  • Incremental crawling — content hashing in cache detects unchanged pages via cache.hasChanged()
  • Benchmarking — scenario-based benchmark suite with warmup, p50/stddev stats, and JSON output
  • Crawler monitoring — real-time stats tracking (pages/sec, success rates, data volume) with live Bun.serve dashboard
  • Configurable logging — pluggable Logger interface with ConsoleLogger and SilentLogger
  • Agent-first CLI — JSON output by default, runtime schema introspection (feedstock schema), --fields for context window discipline, --dry-run, structured errors
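
For readers unfamiliar with the UCB1 scoring named in the feature list, here is a minimal standalone sketch of the textbook formula. It is not feedstock's implementation: the `ArmStats` shape and both function names are assumptions made for illustration, and an "arm" stands in for whatever the crawler groups URLs by.

```typescript
// Hypothetical per-arm statistics (not a feedstock type).
interface ArmStats {
  name: string;
  pulls: number; // how many times this arm has been tried
  totalReward: number; // sum of observed rewards, each in [0, 1]
}

// Textbook UCB1: mean reward plus an exploration bonus that shrinks with more pulls.
function ucb1Score(arm: ArmStats, totalPulls: number): number {
  if (arm.pulls === 0) return Number.POSITIVE_INFINITY; // untried arms go first
  const mean = arm.totalReward / arm.pulls;
  const bonus = Math.sqrt((2 * Math.log(totalPulls)) / arm.pulls);
  return mean + bonus;
}

// Pick the arm with the highest UCB1 score; ties keep the earliest arm.
function pickArm(arms: ArmStats[]): ArmStats | undefined {
  const totalPulls = arms.reduce((sum, arm) => sum + arm.pulls, 0);
  let best: ArmStats | undefined;
  let bestScore = Number.NEGATIVE_INFINITY;
  for (const arm of arms) {
    const score = ucb1Score(arm, totalPulls);
    if (score > bestScore) {
      bestScore = score;
      best = arm;
    }
  }
  return best;
}
```

The bonus term is what makes the strategy "online": an arm with few pulls keeps a high score until it has been sampled enough for its mean to be trusted.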

CLI

Agent-first CLI with JSON output by default when piped, runtime schema introspection, and structured errors.

# Install globally
bun add -g feedstock
bunx playwright install chromium

# Single page
feedstock crawl https://example.com
feedstock crawl https://example.com --fields url,markdown --output json

# Batch crawl
printf 'https://a.com\nhttps://b.com\n' | feedstock crawl-many --stdin --concurrency 10

# Deep crawl a docs site
feedstock deep-crawl https://docs.example.com --max-depth 3 --max-pages 50 --domain-filter docs.example.com

# Process raw HTML
echo '<h1>Hello</h1>' | feedstock process --fields markdown

# Agent introspection — discover all parameters
feedstock schema crawl

# Cache management
feedstock cache stats
feedstock cache prune --older-than 86400000

All commands support --output json|ndjson|text, --fields for context window discipline, and --json for raw config passthrough. See feedstock --help or feedstock schema <command> for full details.

Quick Start

bun add feedstock
bunx playwright install chromium

import { WebCrawler, CacheMode } from "feedstock";

const crawler = new WebCrawler();

const result = await crawler.crawl("https://example.com", {
  cacheMode: CacheMode.Bypass,
});

console.log(result.markdown?.rawMarkdown);
console.log(result.links.internal);
console.log(result.media.images);

await crawler.close();

Deep Crawling

Recursively crawl entire sites with filters, rate limiting, and robots.txt compliance:

import {
  WebCrawler, CacheMode,
  FilterChain, DomainFilter, ContentTypeFilter, URLPatternFilter,
  RateLimiter, RobotsParser,
  CompositeScorer, KeywordRelevanceScorer, PathDepthScorer,
} from "feedstock";

const crawler = new WebCrawler();

const results = await crawler.deepCrawl(
  "https://example.com",
  { cacheMode: CacheMode.Bypass },
  {
    maxDepth: 3,
    maxPages: 100,
    concurrency: 5,
    filterChain: new FilterChain()
      .add(new DomainFilter({ allowed: ["example.com"] }))
      .add(new ContentTypeFilter())
      .add(new URLPatternFilter({ exclude: [/\/admin/, /\/login/] })),
    rateLimiter: new RateLimiter({ baseDelay: 500 }),
    robotsParser: new RobotsParser(),
    scorer: new CompositeScorer()
      .add(new KeywordRelevanceScorer(["docs", "api"], 2.0))
      .add(new PathDepthScorer()),
  },
);
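
The `RateLimiter` in the example above is configured with a `baseDelay`, and the feature list describes it as per-domain with exponential backoff. As a rough illustration of that idea (not feedstock's code), the sketch below doubles the delay for a domain after each consecutive failure and resets it on success. The class name, its methods, and the `maxDelay` cap are all assumptions.

```typescript
// Hypothetical per-domain backoff tracker; names and the maxDelay cap are assumptions.
class DomainBackoff {
  private readonly failures = new Map<string, number>();
  private readonly baseDelay: number;
  private readonly maxDelay: number;

  constructor(baseDelay: number, maxDelay: number) {
    this.baseDelay = baseDelay;
    this.maxDelay = maxDelay;
  }

  // Delay in milliseconds to wait before the next request to this domain.
  delayFor(domain: string): number {
    const failures = this.failures.get(domain) ?? 0;
    return Math.min(this.maxDelay, this.baseDelay * Math.pow(2, failures));
  }

  // Call after a failed or throttled response.
  recordFailure(domain: string): void {
    this.failures.set(domain, (this.failures.get(domain) ?? 0) + 1);
  }

  // Call after a successful response; the delay drops back to baseDelay.
  recordSuccess(domain: string): void {
    this.failures.delete(domain);
  }
}
```

Keeping the counter per domain means one slow host does not throttle requests to every other host in the crawl.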

Stream results as they arrive for large crawls:

for await (const result of crawler.deepCrawlStream(startUrl, {}, config)) {
  console.log(`Crawled: ${result.url}`);
}
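
To make the roles of `maxDepth`, `maxPages`, and URL filters concrete, here is a small breadth-first traversal over an in-memory link graph. It is only a sketch of the BFS idea under simplifying assumptions (no network, no scoring, no concurrency); `LinkGraph`, `bfsCrawlOrder`, and the `allow` callback are hypothetical names, not part of feedstock.

```typescript
// Hypothetical stand-in for "page -> links found on that page".
type LinkGraph = Record<string, string[]>;

// Return the order in which a breadth-first crawl would visit pages.
function bfsCrawlOrder(
  graph: LinkGraph,
  start: string,
  maxDepth: number,
  maxPages: number,
  allow: (url: string) => boolean,
): string[] {
  const seen = new Set<string>([start]);
  const order: string[] = [];
  let frontier: string[] = [start];

  // Each loop iteration processes one depth level.
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (order.length >= maxPages) return order; // page budget exhausted
      order.push(url);
      for (const link of graph[url] ?? []) {
        // Skip links already queued and links rejected by the filter.
        if (!seen.has(link) && allow(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return order;
}
```

With the graph `{ "/": ["/a", "/b", "/admin"], "/a": ["/c"], "/b": ["/c", "/"], "/c": ["/d"], "/d": [] }`, a filter that rejects `/admin`, `maxDepth` 2, and `maxPages` 10, the visit order is `/`, `/a`, `/b`, `/c`.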

Structured Extraction

CSS Selectors

const result = await crawler.crawl("https://example.com/products", {
  extractionStrategy: {
    type: "css",
    params: {
      name: "products",
      baseSelector: ".product",
      fields: [
        { name: "title", selector: "h2", type: "text" },
        { name: "price", selector: ".price", type: "text" },
        { name: "image", selector: "img", type: "attribute", attribute: "src" },
        { name: "tags", selector: ".tag", type: "list" },
      ],
    },
  },
});

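// Each extracted item carries its fields as a JSON string in `content`.
const products = (JSON.parse(result.extractedContent!) as { content: string }[])
  .map((item) => JSON.parse(item.content));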

Tables

import { TableExtractionStrategy } from "feedstock";

const strategy = new TableExtractionStrategy({ minRows: 2 });
const tables = await strategy.extract(url, html);
// [{ headers: ["Name", "Age"], rows: [["Alice", "30"], ...], caption: "Users" }]

Regex

import { RegexExtractionStrategy } from "feedstock";

const strategy = new RegexExtractionStrategy([
  /\$(?<dollars>\d+)\.(?<cents>\d{2})/g,
]);
const prices = await strategy.extract(url, html);
// prices[0].metadata.groups = { dollars: "9", cents: "99" }

Browser Backends

Playwright (Default)

const crawler = new WebCrawler({
  config: {
    browserType: "chromium", // or "firefox", "webkit"
    headless: true,
  },
});

Generic CDP (any cloud provider)

const crawler = new WebCrawler({
  config: {
    backend: { kind: "cdp", wsUrl: "wss://cloud.browserbase.com/v1/sessions/..." },
  },
});

Lightpanda

// Local (requires: bun add @lightpanda/browser)
const localCrawler = new WebCrawler({
  config: {
    backend: { kind: "lightpanda", mode: "local" },
  },
});

// Cloud
const cloudCrawler = new WebCrawler({
  config: {
    backend: {
      kind: "lightpanda",
      mode: "cloud",
      token: process.env.LIGHTPANDA_TOKEN!,
    },
  },
});

Stealth Mode

One flag — random user-agents, navigator overrides, and human simulation:

// Enable stealth at browser level
const crawler = new WebCrawler({
  config: { stealth: true },
});

// Human simulation per-crawl
const result = await crawler.crawl(url, {
  simulateUser: true, // random mouse movements + scrolling
});

// Auto-retry on blocks
import { withRetry, isBlocked } from "feedstock";

const { result, retries } = await withRetry(
  () => crawler.crawl(url),
  (r) => isBlocked(r.html, r.statusCode ?? 200),
  { maxRetries: 3, retryDelay: 2000 },
);
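
The retry pattern used above (run a task, test the result with a predicate, try again after a delay) can be sketched in a few lines. This is an illustrative stand-in, not the source of `withRetry`: the name `retryWhile` is hypothetical, and only the option names `maxRetries` and `retryDelay` and the `{ result, retries }` return shape are taken from the example.

```typescript
interface RetryOptions {
  maxRetries: number; // extra attempts allowed after the first one
  retryDelay: number; // milliseconds to wait between attempts
}

// Hypothetical helper: re-run `task` while `shouldRetry` rejects its result.
async function retryWhile<T>(
  task: () => Promise<T>,
  shouldRetry: (value: T) => boolean,
  options: RetryOptions,
): Promise<{ result: T; retries: number }> {
  let result = await task();
  let retries = 0;
  while (shouldRetry(result) && retries < options.maxRetries) {
    await new Promise<void>((resolve) => setTimeout(resolve, options.retryDelay));
    result = await task();
    retries++;
  }
  // The last result is returned even if it still fails the predicate.
  return { result, retries };
}
```

Returning the last result instead of throwing leaves it to the caller to decide whether a still-blocked page is an error.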

Proxy Rotation

import { ProxyRotationStrategy } from "feedstock";

const rotation = new ProxyRotationStrategy([
  { server: "http://proxy1:8080" },
  { server: "http://proxy2:8080" },
  { server: "http://proxy3:8080" },
]);

const proxy = rotation.getProxy(); // round-robin, skips unhealthy
rotation.reportResult(proxy, true); // track health
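
As a rough picture of what "round-robin, skips unhealthy" can mean, the sketch below cycles through a list of servers and skips any that has failed too many times in a row. It is an assumption-laden illustration rather than `ProxyRotationStrategy` itself: the class name, the `maxFailures` threshold, and the rule that one success restores a proxy are all made up for the example.

```typescript
interface ProxyEntry {
  server: string;
  failures: number; // consecutive failures since the last success
}

// Hypothetical round-robin pool with simple health tracking.
class RoundRobinPool {
  private readonly entries: ProxyEntry[];
  private readonly maxFailures: number;
  private cursor = 0;

  constructor(servers: string[], maxFailures: number) {
    this.entries = servers.map((server) => ({ server, failures: 0 }));
    this.maxFailures = maxFailures;
  }

  // Return the next healthy server, or undefined if none is healthy.
  next(): string | undefined {
    for (let i = 0; i < this.entries.length; i++) {
      const index = (this.cursor + i) % this.entries.length;
      const entry = this.entries[index];
      if (entry && entry.failures < this.maxFailures) {
        this.cursor = (index + 1) % this.entries.length;
        return entry.server;
      }
    }
    return undefined;
  }

  // Record the outcome of a request made through `server`.
  report(server: string, ok: boolean): void {
    const entry = this.entries.find((e) => e.server === server);
    if (!entry) return;
    entry.failures = ok ? 0 : entry.failures + 1;
  }
}
```

A real pool would probably also re-test unhealthy proxies after a cooldown; that is left out here to keep the sketch short.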

Content Processing

Chunking

import { SlidingWindowChunking, RegexChunking } from "feedstock";

// Split by paragraphs
new RegexChunking().chunk(text);

// Sliding window with overlap
new SlidingWindowChunking(500, 50).chunk(text);
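
For intuition about sliding-window chunking with overlap, here is a minimal word-based sketch. It assumes the two numbers mean "window size in words" and "overlap in words", which may not match how `SlidingWindowChunking` interprets its arguments; `slidingWindowChunks` is a hypothetical function written only for this illustration.

```typescript
// Split text into word windows where each window repeats the last `overlap`
// words of the previous one.
function slidingWindowChunks(text: string, windowSize: number, overlap: number): string[] {
  if (windowSize <= 0 || overlap < 0 || overlap >= windowSize) {
    throw new RangeError("need windowSize > 0 and 0 <= overlap < windowSize");
  }
  const words = text.split(/\s+/).filter((word) => word.length > 0);
  const step = windowSize - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + windowSize).join(" "));
    // Stop once a window has reached the end of the text.
    if (start + windowSize >= words.length) break;
  }
  return chunks;
}
```

With the text `a b c d e f g`, a window of 3, and an overlap of 1, this yields `a b c`, `c d e`, and `e f g`. The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks.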

Content Filtering

import { PruningContentFilter, BM25ContentFilter } from "feedstock";

// Remove boilerplate
new PruningContentFilter({ minWords: 5 }).filter(content);

// Keep only relevant blocks
new BM25ContentFilter({ threshold: 0.1 }).filter(content, "TypeScript crawler");
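
The BM25 filter keeps blocks whose relevance to the query clears a threshold. For reference, this is a compact sketch of the standard Okapi BM25 score over a handful of text blocks. It is not taken from feedstock: the tokenizer, the `k1` and `b` defaults, and the function names are assumptions, and the real filter may normalize scores differently before comparing them to `threshold`.

```typescript
// Lowercase and split on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((token) => token.length > 0);
}

// Okapi BM25 score of each block against the query (higher = more relevant).
function bm25Scores(blocks: string[], query: string, k1 = 1.5, b = 0.75): number[] {
  const docs = blocks.map(tokenize);
  const terms = tokenize(query);
  const n = docs.length;
  if (n === 0) return [];
  const avgLen = docs.reduce((sum, doc) => sum + doc.length, 0) / n;

  return docs.map((doc) => {
    let score = 0;
    for (const term of terms) {
      const tf = doc.filter((token) => token === term).length;
      if (tf === 0) continue;
      // Number of blocks containing the term at least once.
      const df = docs.filter((d) => d.indexOf(term) !== -1).length;
      const idf = Math.log((n - df + 0.5) / (df + 0.5) + 1);
      const lengthNorm = 1 - b + (b * doc.length) / avgLen;
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    }
    return score;
  });
}
```

Blocks that share no term with the query score exactly 0, which is why even a small threshold such as 0.1 removes unrelated boilerplate.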

Resource Blocking & Fast Navigation

// Named profile — block images, fonts, and media (keeps CSS/JS)
await crawler.crawl(url, { blockResources: "fast" });

// Block everything except HTML and JS
await crawler.crawl(url, { blockResources: "minimal" });

// Block only heavy media (images, video, audio)
await crawler.crawl(url, { blockResources: "media-only" });

// Custom — block specific patterns and resource types
await crawler.crawl(url, {
  blockResources: { patterns: ["**/*.woff2"], resourceTypes: ["font"] },
});

// Boolean still works (true = "fast" profile)
await crawler.crawl(url, {
  blockResources: true,
  navigationWaitUntil: "commit", // fastest — returns as soon as server responds
});

In-Page Extraction

Extract data directly in the browser — skips HTML serialization round-trip:

import { extractInPage } from "feedstock";

crawler.setHook("beforeReturnHtml", async (page) => {
  const data = await extractInPage(page);
  // data.links, data.media, data.metadata — extracted inside browser context
});

URL Discovery

import { URLSeeder } from "feedstock";

const seeder = new URLSeeder();
const { urls, sitemaps } = await seeder.seed("example.com");
// Discovers URLs from robots.txt -> sitemap.xml chain
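
The robots.txt to sitemap.xml chain boils down to two parsing steps: read the `Sitemap:` lines from robots.txt, then read the `<loc>` entries from each sitemap. The sketch below shows both steps on plain strings. It is a simplified illustration with hypothetical function names, not `URLSeeder` internals: it does no fetching, does not decode XML entities, and does not follow sitemap index files.

```typescript
// Collect sitemap URLs from the "Sitemap:" lines of a robots.txt body.
function sitemapsFromRobots(robotsTxt: string): string[] {
  const found: string[] = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
    if (match && match[1]) found.push(match[1]);
  }
  return found;
}

// Collect the <loc> values from a sitemap document.
function locsFromSitemap(xml: string): string[] {
  const found: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null = pattern.exec(xml);
  while (match !== null) {
    if (match[1]) found.push(match[1]);
    match = pattern.exec(xml);
  }
  return found;
}
```

A regex is enough for a sketch, but a production seeder would likely use a real XML parser so that CDATA sections and escaped ampersands in URLs are handled correctly.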

Monitoring

import { CrawlerMonitor } from "feedstock";

const monitor = new CrawlerMonitor();
monitor.start();

// Track each page
monitor.recordPageComplete({
  success: true,
  fromCache: false,
  responseTimeMs: 150,
  bytesDownloaded: 45_000,
});

console.log(monitor.formatStats());
// Pages: 1 (1 ok, 0 failed, 0 cached)
// Time: 0.2s | 6.7 pages/s | avg 150ms/page
// Downloaded: 0.04 MB

Accessibility Snapshots

Compact semantic page representation — 3-10x smaller than HTML:

const result = await crawler.crawl("https://example.com", {
  snapshot: true,
});

console.log(result.snapshot);
// @e1 [heading] "Example Domain" [level=1]
// @e2 [link] "More information..." [-> https://www.iana.org/domains/example]
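
To show how a line-per-element snapshot like the one above might be produced, here is a small formatter that numbers nodes in order and prints `@eN [role] "name"` followed by any `key=value` attributes. The `A11yNode` shape and `formatSnapshot` are assumptions for illustration only; the sketch does not reproduce the `[-> url]` form used for links, and feedstock's real snapshot builder may differ.

```typescript
// Hypothetical flattened accessibility node.
interface A11yNode {
  role: string;
  name: string;
  attrs?: Record<string, string | number>;
}

// Render nodes as one line each, with sequential @e refs starting at 1.
function formatSnapshot(nodes: A11yNode[]): string {
  return nodes
    .map((node, index) => {
      const attrs = node.attrs ?? {};
      const suffix = Object.keys(attrs)
        .map((key) => ` [${key}=${attrs[key]}]`)
        .join("");
      return `@e${index + 1} [${node.role}] ${JSON.stringify(node.name)}${suffix}`;
    })
    .join("\n");
}
```

Because each element gets a short stable ref, a consumer can point at `@e2` instead of carrying a full CSS selector, which is part of why this form is so much smaller than raw HTML.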

Hooks

Inject custom behavior at key lifecycle points:

crawler.setHook("afterGoto", async (page) => {
  // Dismiss cookie banners, expand sections, etc.
  const banner = page.locator('[class*="cookie"]');
  if (await banner.isVisible()) await banner.locator("button").first().click();
});

Available hooks: onPageCreated, beforeGoto, afterGoto, onExecutionStarted, beforeReturnHtml.

Interactive Element Detection

Find all clickable elements including those without ARIA roles:

import { detectInteractiveElements } from "feedstock";

crawler.setHook("beforeReturnHtml", async (page) => {
  const elements = await detectInteractiveElements(page);
  console.log(`Found ${elements.length} interactive elements`);
  // Each has: tag, text, href, role, type, selector
});

Storage State

Persist cookies/localStorage between sessions:

import { saveStorageState, loadStorageState } from "feedstock";

// Save after login
await saveStorageState(page.context());

// Load in next session
const state = loadStorageState();

Layered Configuration

Config is loaded from multiple sources with clear precedence: programmatic > env vars > project file > defaults.

import { loadConfig, createBrowserConfig, createCrawlerRunConfig } from "feedstock";

// Automatically finds feedstock.json in cwd or parent directories
const layered = loadConfig();
const browserConfig = createBrowserConfig({ ...layered.browser, headless: false });
const crawlConfig = createCrawlerRunConfig({ ...layered.crawl });

feedstock.json:

{
  "browser": { "headless": true, "stealth": true },
  "crawl": { "blockResources": "fast", "pageTimeout": 30000 }
}

Environment variables: FEEDSTOCK_CDP_URL, FEEDSTOCK_HEADLESS, FEEDSTOCK_PROXY, FEEDSTOCK_BLOCK_RESOURCES, FEEDSTOCK_PAGE_TIMEOUT, FEEDSTOCK_SCREENSHOT, FEEDSTOCK_GENERATE_MARKDOWN, etc.
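
The precedence rule (programmatic > env vars > project file > defaults) amounts to merging layers from lowest to highest priority, where a later layer only wins for keys it actually sets. Here is a minimal sketch of that merge. `mergeLayers` and the sample values are hypothetical and do not reflect feedstock's real defaults or its `loadConfig` internals.

```typescript
type Settings = Record<string, string | number | boolean>;
type Layer = Record<string, string | number | boolean | undefined>;

// Merge layers in order; later layers override earlier ones, undefined is ignored.
function mergeLayers(defaults: Settings, ...layers: Layer[]): Settings {
  const merged: Settings = { ...defaults };
  for (const layer of layers) {
    for (const key of Object.keys(layer)) {
      const value = layer[key];
      if (value !== undefined) merged[key] = value;
    }
  }
  return merged;
}

// Made-up values for illustration, listed from lowest to highest precedence.
const defaults: Settings = { headless: true, pageTimeout: 60000, blockResources: "none" };
const projectFile: Layer = { blockResources: "fast", pageTimeout: 30000 };
const envVars: Layer = { pageTimeout: 10000 };
const programmatic: Layer = { headless: false, blockResources: undefined };

const effective = mergeLayers(defaults, projectFile, envVars, programmatic);
// effective: { headless: false, pageTimeout: 10000, blockResources: "fast" }
```

Skipping `undefined` matters: an option left unset in a higher layer should fall through to the layer below it and not erase it.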

Accessibility Extraction

Extract semantic content using the accessibility tree — headings, links, buttons, inputs:

const result = await crawler.crawl("https://example.com", {
  extractionStrategy: {
    type: "accessibility",
    params: { roles: ["heading", "link"] }, // optional filter
  },
});

const items = JSON.parse(result.extractedContent!);
// [{ content: "Page Title", metadata: { role: "heading", ref: "e1", level: 1 } }, ...]

Process HTML Without Browser

const html = "<html><body><h1>Hello</h1><p>World</p></body></html>";
const result = await crawler.processHtml(html, { snapshot: true });
console.log(result.markdown?.rawMarkdown);
console.log(result.snapshot);

Documentation

Full documentation at feedstockai.com.

Development

bun install
bun test              # 325 tests
bun test tests/unit   # unit tests only
bun run typecheck     # tsc --noEmit
bun run lint          # biome check
bun run check         # lint + typecheck
bun run dogfood.ts    # 148 checks against real sites

License

Apache-2.0