feedstock
v0.5.0
Published
High-performance web crawler and scraper for TypeScript, powered by Bun and Playwright
Maintainers
Readme
Features
- Single & multi-page crawling with concurrent execution
- Deep crawling — BFS, DFS, BestFirst, Bandit (UCB1 online learning), and Focused (RL/Q-learning) strategies
- Content extraction — CSS selectors, regex, XPath, table, accessibility tree, and composite multi-strategy extraction
- Markdown generation with citation support
- Smart caching with ETag/Last-Modified validation via
bun:sqliteand multi-signal freshness evaluation (sitemap, HTTP headers, content hash, time decay) - URL filtering — pattern, domain, and content-type filters
- URL scoring — keyword relevance, path depth, freshness, domain authority, neural quality estimation (online-learning), and UCB1 bandit scoring
- Rate limiting — per-domain with exponential backoff
- Robots.txt parsing and compliance
- Built-in stealth mode — one flag enables random user-agents, navigator.webdriver override, plugin/language spoofing, human-like mouse/scroll simulation. Enhanced mode generates spatially-consistent fingerprint profiles (UA, platform, WebGL, screen, canvas all agree)
- Anti-bot detection with auto-retry on blocked pages
- Multiple browser backends — Playwright (Chromium/Firefox/WebKit), generic CDP (Browserbase, Browserless, etc.), or Lightpanda (local/cloud)
- Proxy rotation — round-robin strategy with health tracking
- URL seeding — discover URLs from sitemaps
- Accessibility snapshots — compact semantic page representation with
@erefs for AI consumption - Fetch-first engine system — tries lightweight HTTP before launching browser, auto-escalates for SPAs
- Rich metadata — 50+ fields: Open Graph, Twitter Cards, Dublin Core, JSON-LD, favicons, feeds
- Content processing — chunking (regex, sliding window, fixed-size) and filtering (pruning, BM25)
- Interactive element detection — finds all clickable elements including cursor:pointer and onclick handlers
- Storage state persistence — save/load cookies and localStorage between sessions
- AI-friendly errors — converts 20+ error patterns into actionable messages
- Hooks — inject custom behavior at 5 lifecycle points (page created, before/after navigation, etc.)
- Resource blocking — named profiles (
fast,minimal,media-only) or custom patterns for faster crawls - Navigation strategies — configurable
waitUntil:commit(fastest),domcontentloaded,load,networkidle. Hydration-aware readiness detection auto-detects SPA frameworks and waits for content stability instead of fixed timeouts - DOM downsampling — preprocess HTML to reduce DOM size before extraction (strips boilerplate, collapses containers, filters attributes). 30-85% size reduction depending on page complexity
- In-page extraction — extract links/media/metadata directly in the browser via
page.evaluate(), skipping HTML serialization - Change tracking — detect new/changed/unchanged/removed pages between crawl runs with text diffs
- User-agent rotation — pool of 9 realistic browser user-agents with round-robin rotation
- Graceful shutdown — SIGINT/SIGTERM handlers auto-close browser processes
- Session management — LRU eviction at 20 concurrent sessions, cache TTL pruning
- Input validation — friendly error messages for invalid URLs, automatic retry on transient network errors
- Layered config —
feedstock.jsonproject file +FEEDSTOCK_*environment variables with programmatic overrides - Incremental crawling — content hashing in cache detects unchanged pages via
cache.hasChanged() - Benchmarking — scenario-based benchmark suite with warmup, p50/stddev stats, and JSON output
- Crawler monitoring — real-time stats tracking (pages/sec, success rates, data volume) with live Bun.serve dashboard
- Configurable logging — pluggable Logger interface with ConsoleLogger and SilentLogger
- Agent-first CLI — JSON output by default, runtime schema introspection (
feedstock schema),--fieldsfor context window discipline,--dry-run, structured errors
CLI
Agent-first CLI with JSON output by default when piped, runtime schema introspection, and structured errors.
# Install globally
bun add -g feedstock
bunx playwright install chromium
# Single page
feedstock crawl https://example.com
feedstock crawl https://example.com --fields url,markdown --output json
# Batch crawl
echo "https://a.com\nhttps://b.com" | feedstock crawl-many --stdin --concurrency 10
# Deep crawl a docs site
feedstock deep-crawl https://docs.example.com --max-depth 3 --max-pages 50 --domain-filter docs.example.com
# Process raw HTML
echo '<h1>Hello</h1>' | feedstock process --fields markdown
# Agent introspection — discover all parameters
feedstock schema crawl
# Cache management
feedstock cache stats
feedstock cache prune --older-than 86400000All commands support --output json|ndjson|text, --fields for context window discipline, and --json for raw config passthrough. See feedstock --help or feedstock schema <command> for full details.
Quick Start
bun add feedstock
bunx playwright install chromiumimport { WebCrawler, CacheMode } from "feedstock";
const crawler = new WebCrawler();
const result = await crawler.crawl("https://example.com", {
cacheMode: CacheMode.Bypass,
});
console.log(result.markdown?.rawMarkdown);
console.log(result.links.internal);
console.log(result.media.images);
await crawler.close();Deep Crawling
Recursively crawl entire sites with filters, rate limiting, and robots.txt compliance:
import {
WebCrawler, CacheMode,
FilterChain, DomainFilter, ContentTypeFilter, URLPatternFilter,
RateLimiter, RobotsParser,
CompositeScorer, KeywordRelevanceScorer, PathDepthScorer,
} from "feedstock";
const crawler = new WebCrawler();
const results = await crawler.deepCrawl(
"https://example.com",
{ cacheMode: CacheMode.Bypass },
{
maxDepth: 3,
maxPages: 100,
concurrency: 5,
filterChain: new FilterChain()
.add(new DomainFilter({ allowed: ["example.com"] }))
.add(new ContentTypeFilter())
.add(new URLPatternFilter({ exclude: [/\/admin/, /\/login/] })),
rateLimiter: new RateLimiter({ baseDelay: 500 }),
robotsParser: new RobotsParser(),
scorer: new CompositeScorer()
.add(new KeywordRelevanceScorer(["docs", "api"], 2.0))
.add(new PathDepthScorer()),
},
);Stream results as they arrive for large crawls:
for await (const result of crawler.deepCrawlStream(startUrl, {}, config)) {
console.log(`Crawled: ${result.url}`);
}Structured Extraction
CSS Selectors
const result = await crawler.crawl("https://example.com/products", {
extractionStrategy: {
type: "css",
params: {
name: "products",
baseSelector: ".product",
fields: [
{ name: "title", selector: "h2", type: "text" },
{ name: "price", selector: ".price", type: "text" },
{ name: "image", selector: "img", type: "attribute", attribute: "src" },
{ name: "tags", selector: ".tag", type: "list" },
],
},
},
});
const products = JSON.parse(result.extractedContent!)
.map((item) => JSON.parse(item.content));Tables
import { TableExtractionStrategy } from "feedstock";
const strategy = new TableExtractionStrategy({ minRows: 2 });
const tables = await strategy.extract(url, html);
// [{ headers: ["Name", "Age"], rows: [["Alice", "30"], ...], caption: "Users" }]Regex
import { RegexExtractionStrategy } from "feedstock";
const strategy = new RegexExtractionStrategy([
/\$(?<dollars>\d+)\.(?<cents>\d{2})/g,
]);
const prices = await strategy.extract(url, html);
// prices[0].metadata.groups = { dollars: "9", cents: "99" }Browser Backends
Playwright (Default)
const crawler = new WebCrawler({
config: {
browserType: "chromium", // or "firefox", "webkit"
headless: true,
},
});Generic CDP (any cloud provider)
const crawler = new WebCrawler({
config: {
backend: { kind: "cdp", wsUrl: "wss://cloud.browserbase.com/v1/sessions/..." },
},
});Lightpanda
// Local (requires: bun add @lightpanda/browser)
const crawler = new WebCrawler({
config: {
backend: { kind: "lightpanda", mode: "local" },
},
});
// Cloud
const crawler = new WebCrawler({
config: {
backend: {
kind: "lightpanda",
mode: "cloud",
token: process.env.LIGHTPANDA_TOKEN!,
},
},
});Stealth Mode
One flag — random user-agents, navigator overrides, and human simulation:
// Enable stealth at browser level
const crawler = new WebCrawler({
config: { stealth: true },
});
// Human simulation per-crawl
const result = await crawler.crawl(url, {
simulateUser: true, // random mouse movements + scrolling
});// Auto-retry on blocks
import { withRetry, isBlocked } from "feedstock";
const { result, retries } = await withRetry(
() => crawler.crawl(url),
(r) => isBlocked(r.html, r.statusCode ?? 200),
{ maxRetries: 3, retryDelay: 2000 },
);Proxy Rotation
import { ProxyRotationStrategy } from "feedstock";
const rotation = new ProxyRotationStrategy([
{ server: "http://proxy1:8080" },
{ server: "http://proxy2:8080" },
{ server: "http://proxy3:8080" },
]);
const proxy = rotation.getProxy(); // round-robin, skips unhealthy
rotation.reportResult(proxy, true); // track healthContent Processing
Chunking
import { SlidingWindowChunking, RegexChunking } from "feedstock";
// Split by paragraphs
new RegexChunking().chunk(text);
// Sliding window with overlap
new SlidingWindowChunking(500, 50).chunk(text);Content Filtering
import { PruningContentFilter, BM25ContentFilter } from "feedstock";
// Remove boilerplate
new PruningContentFilter({ minWords: 5 }).filter(content);
// Keep only relevant blocks
new BM25ContentFilter({ threshold: 0.1 }).filter(content, "TypeScript crawler");Resource Blocking & Fast Navigation
// Named profile — block images, fonts, and media (keeps CSS/JS)
await crawler.crawl(url, { blockResources: "fast" });
// Block everything except HTML and JS
await crawler.crawl(url, { blockResources: "minimal" });
// Block only heavy media (images, video, audio)
await crawler.crawl(url, { blockResources: "media-only" });
// Custom — block specific patterns and resource types
await crawler.crawl(url, {
blockResources: { patterns: ["**/*.woff2"], resourceTypes: ["font"] },
});
// Boolean still works (true = "fast" profile)
await crawler.crawl(url, {
blockResources: true,
navigationWaitUntil: "commit", // fastest — returns as soon as server responds
});In-Page Extraction
Extract data directly in the browser — skips HTML serialization round-trip:
import { extractInPage } from "feedstock";
crawler.setHook("beforeReturnHtml", async (page) => {
const data = await extractInPage(page);
// data.links, data.media, data.metadata — extracted inside browser context
});URL Discovery
import { URLSeeder } from "feedstock";
const seeder = new URLSeeder();
const { urls, sitemaps } = await seeder.seed("example.com");
// Discovers URLs from robots.txt -> sitemap.xml chainMonitoring
import { CrawlerMonitor } from "feedstock";
const monitor = new CrawlerMonitor();
monitor.start();
// Track each page
monitor.recordPageComplete({
success: true,
fromCache: false,
responseTimeMs: 150,
bytesDownloaded: 45_000,
});
console.log(monitor.formatStats());
// Pages: 1 (1 ok, 0 failed, 0 cached)
// Time: 0.2s | 6.7 pages/s | avg 150ms/page
// Downloaded: 0.04 MBAccessibility Snapshots
Compact semantic page representation — 3-10x smaller than HTML:
const result = await crawler.crawl("https://example.com", {
snapshot: true,
});
console.log(result.snapshot);
// @e1 [heading] "Example Domain" [level=1]
// @e2 [link] "More information..." [-> https://www.iana.org/domains/example]Hooks
Inject custom behavior at key lifecycle points:
crawler.setHook("afterGoto", async (page) => {
// Dismiss cookie banners, expand sections, etc.
const banner = page.locator('[class*="cookie"]');
if (await banner.isVisible()) await banner.locator("button").first().click();
});Available hooks: onPageCreated, beforeGoto, afterGoto, onExecutionStarted, beforeReturnHtml.
Interactive Element Detection
Find all clickable elements including those without ARIA roles:
import { detectInteractiveElements } from "feedstock";
crawler.setHook("beforeReturnHtml", async (page) => {
const elements = await detectInteractiveElements(page);
console.log(`Found ${elements.length} interactive elements`);
// Each has: tag, text, href, role, type, selector
});Storage State
Persist cookies/localStorage between sessions:
import { saveStorageState, loadStorageState } from "feedstock";
// Save after login
await saveStorageState(page.context());
// Load in next session
const state = loadStorageState();Layered Configuration
Config is loaded from multiple sources with clear precedence: programmatic > env vars > project file > defaults.
import { loadConfig, createBrowserConfig, createCrawlerRunConfig } from "feedstock";
// Automatically finds feedstock.json in cwd or parent directories
const layered = loadConfig();
const browserConfig = createBrowserConfig({ ...layered.browser, headless: false });
const crawlConfig = createCrawlerRunConfig({ ...layered.crawl });feedstock.json:
{
"browser": { "headless": true, "stealth": true },
"crawl": { "blockResources": "fast", "pageTimeout": 30000 }
}Environment variables: FEEDSTOCK_CDP_URL, FEEDSTOCK_HEADLESS, FEEDSTOCK_PROXY, FEEDSTOCK_BLOCK_RESOURCES, FEEDSTOCK_PAGE_TIMEOUT, FEEDSTOCK_SCREENSHOT, FEEDSTOCK_GENERATE_MARKDOWN, etc.
Accessibility Extraction
Extract semantic content using the accessibility tree — headings, links, buttons, inputs:
const result = await crawler.crawl("https://example.com", {
extractionStrategy: {
type: "accessibility",
params: { roles: ["heading", "link"] }, // optional filter
},
});
const items = JSON.parse(result.extractedContent!);
// [{ content: "Page Title", metadata: { role: "heading", ref: "e1", level: 1 } }, ...]Process HTML Without Browser
const html = "<html><body><h1>Hello</h1><p>World</p></body></html>";
const result = await crawler.processHtml(html, { snapshot: true });
console.log(result.markdown?.rawMarkdown);
console.log(result.snapshot);Documentation
Full documentation at feedstockai.com.
Development
bun install
bun test # 325 tests
bun test tests/unit # unit tests only
bun run typecheck # tsc --noEmit
bun run lint # biome check
bun run check # lint + typecheck
bun run dogfood.ts # 148 checks against real sitesLicense
Apache-2.0
