@vinxi/scraper
v0.1.1
Agent-native scraping engine. An AI agent (or you) writes a small per-site scraper with defineScraper(); the runtime provides stealth browsing, HTTP transport, checkpointing, task queues, rate-limit-aware pacing, and streaming JSONL output. A companion set of reconnaissance tools helps the agent understand a site first — hydration payload walking, XHR capture, visual audits, and validation scorecards.
Designed to be driven by the site-reconnaissance agent skill:
npx skills add nksaraf/skills --skill site-reconnaissance
Install
Runs on Bun.
bun add @vinxi/scraper
bun add -d playwright # only needed for browser-tier scrapers
bunx playwright install chromium
Write a scraper
Scrapers live in scrapers/<id>.ts in your project:
// scrapers/example-quotes.ts
import { defineScraper } from "@vinxi/scraper"
export default defineScraper({
  id: "example-quotes",
  source: "quotes.toscrape.com",
  async run(ctx) {
    const res = await ctx.fetch("https://quotes.toscrape.com/")
    const html = await res.text()
    // ...extract with cheerio, or use ctx.browser for JS-rendered sites
    ctx.pushData({ quote: "...", author: "..." })
  },
})
Run it from your project root:
bunx vinxi-scraper run example-quotes --city=delhi # --key=value args reach ctx.args
bunx vinxi-scraper run example-quotes --rpm=60 --min-delay=500 --concurrency=8
bunx vinxi-scraper list
bunx vinxi-scraper export example-quotes --format json # or csv | jsonl
Each run streams to data/results/<id>/<timestamp>.jsonl, and data/results/<id>/latest.jsonl points at the most recent successful run. Checkpoints and task queues persist under data/ so interrupted runs resume. Set SCRAPER_HOME to point the CLI at a different project root.
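Because each run is written as JSONL, a results file can be consumed one line at a time. The following is a minimal sketch of parsing such text, assuming one JSON object per line; `parseJsonl` and the sample rows are hypothetical and not part of the package.

```typescript
// Hypothetical helper, not part of @vinxi/scraper: parse JSONL text into rows.
type Row = Record<string, unknown>

function parseJsonl(text: string): Row[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines, e.g. a trailing newline
    .map((line) => JSON.parse(line) as Row)
}

// Sample text standing in for the contents of data/results/<id>/latest.jsonl
const sample = '{"quote":"a","author":"x"}\n{"quote":"b","author":"y"}\n'
const rows = parseJsonl(sample)
console.log(rows.length)
```

In a real project the text would come from reading the results file rather than from an inline string.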
Pacing defaults to a polite 30 requests/min with a 2s floor per domain — tune it per run with --rpm, --min-delay (ms), and --concurrency. CSV export JSON-encodes nested columns (e.g. a variants array), so commerce rows round-trip cleanly.
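To see how the rate and the floor might interact, here is a small sketch of the arithmetic. It assumes the gap between requests to one domain is the larger of the rate-derived interval and the minimum delay; that rule and the `effectiveDelayMs` name are assumptions for illustration, and the runtime's actual scheduler may behave differently.

```typescript
// Assumption: the gap between requests is max(60_000 / rpm, minDelayMs).
function effectiveDelayMs(rpm: number, minDelayMs: number): number {
  if (rpm <= 0) throw new RangeError("rpm must be positive")
  const rateInterval = 60_000 / rpm // ms between requests implied by the rate
  return Math.max(rateInterval, minDelayMs)
}

// Defaults from the text: 30 requests/min and a 2s floor give the same 2000 ms gap.
const defaults = effectiveDelayMs(30, 2000)
// --rpm=60 --min-delay=500: the rate (1000 ms) outweighs the 500 ms floor.
const tuned = effectiveDelayMs(60, 500)
console.log({ defaults, tuned })
```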
The scraper context
run(ctx) receives:
| Property | Purpose |
|---|---|
| ctx.fetch | HTTP client with realistic headers, proxy-rotation-ready |
| ctx.browser | Lazy stealth Playwright facade — newPage() survives Akamai / Cloudflare / DataDome checks |
| ctx.pushData | Stream a result row out (persisted immediately) |
| ctx.checkpoint | Key-value store persisted to disk |
| ctx.tasks | Deduplicating, persistent task queue |
| ctx.human | Human-like pauses, mouse movement, warmup navigation, smooth scrolling |
| ctx.log / ctx.sleep / ctx.args | Structured logging, pacing, CLI args (--key=value) |
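To illustrate what "deduplicating" means for a task queue, here is a minimal in-memory sketch. It is not the implementation behind ctx.tasks, whose method names are not shown above; `DedupQueue`, `add`, and `next` are hypothetical, and persistence is left out.

```typescript
// Hypothetical in-memory stand-in; the real ctx.tasks also persists to disk.
class DedupQueue<T> {
  private seen = new Set<string>()
  private pending: T[] = []
  private keyOf: (task: T) => string

  constructor(keyOf: (task: T) => string) {
    this.keyOf = keyOf
  }

  // Returns true if the task was new and enqueued, false if its key was seen before.
  add(task: T): boolean {
    const key = this.keyOf(task)
    if (this.seen.has(key)) return false
    this.seen.add(key)
    this.pending.push(task)
    return true
  }

  // Removes and returns the oldest pending task, or undefined when empty.
  next(): T | undefined {
    return this.pending.shift()
  }

  get size(): number {
    return this.pending.length
  }
}

const queue = new DedupQueue<{ url: string }>((t) => t.url)
queue.add({ url: "https://example.com/a" })
queue.add({ url: "https://example.com/b" })
const addedAgain = queue.add({ url: "https://example.com/a" }) // duplicate key
console.log(queue.size, addedAgain)
```

Keying on a string such as the URL keeps the duplicate check cheap; a persistent queue would additionally need to save both the seen keys and the pending list.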
Reconnaissance tools
import {
  extractAllHydrationPayloads, walkSchema, filterLeaves, // hydration payloads (__NEXT_DATA__, JSON-LD, window.*)
  startXhrCapture, capturedToPayloads, // intercept JSON/XHR traffic before navigation
  auditPage, findMissingFields, // visual ground truth: screenshots + label/value pairs
  scoreCard, formatScoreCard, // completeness / quality / authenticity / freshness scoring
} from "@vinxi/scraper/recon"
These tools are deliberately noisy scaffolding: they help an agent see a page fast, then the agent writes a small per-site extractor. The methodology (six recon phases with quality gates, the recon–extract loop, and validation scorecards) is documented in the site-reconnaissance skill.
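As an illustration of what hydration-payload walking involves, the sketch below pulls the `__NEXT_DATA__` JSON out of an HTML string and lists its leaf paths. It does not call the package's extractAllHydrationPayloads or walkSchema, whose signatures are not shown here; `extractNextData` and `listLeaves` are hypothetical stand-ins, and the sample HTML is made up.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

// Hypothetical helper: find the <script id="__NEXT_DATA__"> tag and parse its body.
function extractNextData(html: string): Json | undefined {
  const match = html.match(/<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/)
  return match ? (JSON.parse(match[1]) as Json) : undefined
}

// Hypothetical helper: flatten a JSON value into [path, leaf value] pairs.
function listLeaves(value: Json, path = "$"): Array<[string, Json]> {
  if (Array.isArray(value)) {
    return value.flatMap((item, i) => listLeaves(item, `${path}[${i}]`))
  }
  if (value !== null && typeof value === "object") {
    return Object.entries(value).flatMap(([k, v]) => listLeaves(v, `${path}.${k}`))
  }
  return [[path, value]]
}

// Made-up page fragment for the example.
const html =
  '<script id="__NEXT_DATA__" type="application/json">' +
  '{"props":{"stores":[{"name":"A","lat":28.6}]}}</script>'
const payload = extractNextData(html)
const leaves = payload === undefined ? [] : listLeaves(payload)
console.log(leaves)
```

A flat list of paths like this is the sort of noisy overview an agent can scan before deciding which fields a per-site extractor should read.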
Canonical retail schema
@vinxi/scraper/retail exports RetailStore — a cross-brand store/dealer/branch/ATM location schema — plus helpers: makeRetailStore() (build a record from a partial, filling safe defaults), extraOf(), normaliseHours(), normaliseCountry().
Validation scorecard
scoreCard() scores a cohort of store-locator / geo records (0–100 across completeness, data quality, source authenticity, freshness). It expects RetailStore shapes — build them with makeRetailStore(). It defaults to India geography; pass geoBounds / expectedMetros for other regions.
import { scoreCard, formatScoreCard } from "@vinxi/scraper/recon"
import { makeRetailStore } from "@vinxi/scraper/retail"
// rows: your scraped records (not defined in this snippet)
const stores = rows.map((r) =>
  makeRetailStore({ brand: "Acme", name: r.name, lat: r.lat, lng: r.lng, address: r.address }),
)
const card = scoreCard(stores, {
  brand: "Acme",
  sourceDomain: "acme.com",
  officialDomain: "acme.com",
  authoritativeCount: 366,
  geoBounds: { minLat: 24, maxLat: 50, minLng: -125, maxLng: -66 }, // US (omit for India default)
  expectedMetros: ["New York", "Los Angeles", "Chicago"],
})
console.log(formatScoreCard(card)) // → band: high/medium/low + evidence
The scorecard is for physical-location data. A product catalog (SKUs, prices, variants) has no geographic scorecard; validate it by sampling fields and reconciling counts instead.
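To make the geoBounds option concrete, here is a sketch of a bounding-box check of the kind a geographic score could rely on. How scoreCard() actually uses the bounds is not documented above, so `inBounds` and `shareInBounds` are hypothetical, and the sample coordinates are illustrative.

```typescript
interface GeoBounds { minLat: number; maxLat: number; minLng: number; maxLng: number }
interface Point { lat: number; lng: number }

// True when the point lies inside the box (edges included).
function inBounds(p: Point, b: GeoBounds): boolean {
  return p.lat >= b.minLat && p.lat <= b.maxLat && p.lng >= b.minLng && p.lng <= b.maxLng
}

// Fraction of points inside the box; 0 for an empty list.
function shareInBounds(points: Point[], b: GeoBounds): number {
  if (points.length === 0) return 0
  return points.filter((p) => inBounds(p, b)).length / points.length
}

// Same box as the US example above.
const usBounds: GeoBounds = { minLat: 24, maxLat: 50, minLng: -125, maxLng: -66 }
const points: Point[] = [
  { lat: 40.71, lng: -74.01 },  // New York
  { lat: 34.05, lng: -118.24 }, // Los Angeles
  { lat: 28.61, lng: 77.21 },   // Delhi, outside the US box
  { lat: 0, lng: 0 },           // placeholder coordinates, also outside
]
const share = shareInBounds(points, usBounds)
console.log(share)
```

A low share usually points at geocoding errors or placeholder coordinates, which is why passing the right bounds matters when scoring data from outside the default region.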
License
MIT
