@vinxi/scraper

v0.1.1

Published

13 days ago


Vulnerabilities: 0 high, 0 medium, 0 low

nksaraf

scraper scraping playwright agent agent-skills reconnaissance bun

@vinxi/scraper

Agent-native scraping engine. An AI agent (or you) writes a small per-site scraper with defineScraper(); the runtime provides stealth browsing, HTTP transport, checkpointing, task queues, rate-limit-aware pacing, and streaming JSONL output. A companion set of reconnaissance tools helps the agent understand a site first — hydration payload walking, XHR capture, visual audits, and validation scorecards.

Designed to be driven by the site-reconnaissance agent skill:

npx skills add nksaraf/skills --skill site-reconnaissance

Install

Runs on Bun.

bun add @vinxi/scraper
bun add -d playwright   # only needed for browser-tier scrapers
bunx playwright install chromium

Write a scraper

Scrapers live in scrapers/<id>.ts in your project:

// scrapers/example-quotes.ts
import { defineScraper } from "@vinxi/scraper"

export default defineScraper({
  id: "example-quotes",
  source: "quotes.toscrape.com",
  async run(ctx) {
    const res = await ctx.fetch("https://quotes.toscrape.com/")
    const html = await res.text()
    // ...extract with cheerio, or use ctx.browser for JS-rendered sites
    ctx.pushData({ quote: "...", author: "..." })
  },
})

Run it from your project root:

bunx vinxi-scraper run example-quotes --city=delhi      # --key=value args reach ctx.args
bunx vinxi-scraper run example-quotes --rpm=60 --min-delay=500 --concurrency=8
bunx vinxi-scraper list
bunx vinxi-scraper export example-quotes --format json   # or csv | jsonl

Each run streams to data/results/<id>/<timestamp>.jsonl, and data/results/<id>/latest.jsonl points at the most recent successful run. Checkpoints and task queues persist under data/ so interrupted runs resume. Set SCRAPER_HOME to point the CLI at a different project root.

Pacing defaults to a polite 30 requests/min with a 2s floor per domain — tune it per run with --rpm, --min-delay (ms), and --concurrency. CSV export JSON-encodes nested columns (e.g. a variants array), so commerce rows round-trip cleanly.
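To make the pacing numbers concrete, here is a minimal sketch of how a requests-per-minute budget and a per-domain floor could combine. It is not the package's implementation: `requestGapMs` and `DomainPacer` are hypothetical names, and taking the larger of the two delays is an assumption about how `--rpm` and `--min-delay` interact.

```typescript
// Hypothetical helper, not part of @vinxi/scraper.
// Assumption: the gap between two requests to one domain is the larger of
// the rpm-derived interval and the minimum-delay floor.
function requestGapMs(rpm: number, minDelayMs: number): number {
  if (rpm <= 0) throw new RangeError("rpm must be positive")
  const rpmIntervalMs = 60_000 / rpm
  return Math.max(rpmIntervalMs, minDelayMs)
}

// Tracks, per domain, the earliest time the next request may start.
class DomainPacer {
  private nextAllowed = new Map<string, number>()
  private readonly rpm: number
  private readonly minDelayMs: number

  constructor(rpm: number = 30, minDelayMs: number = 2000) {
    this.rpm = rpm
    this.minDelayMs = minDelayMs
  }

  // Returns how long to wait (ms) before a request issued at `now`,
  // and reserves the following slot for that domain.
  reserve(domain: string, now: number): number {
    const earliest = this.nextAllowed.get(domain) ?? now
    const startAt = Math.max(earliest, now)
    this.nextAllowed.set(domain, startAt + requestGapMs(this.rpm, this.minDelayMs))
    return startAt - now
  }
}

const pacer = new DomainPacer() // defaults mirror the README: 30 rpm, 2s floor
const firstWait = pacer.reserve("quotes.toscrape.com", 0)
const secondWait = pacer.reserve("quotes.toscrape.com", 0)
const otherDomainWait = pacer.reserve("example.com", 0)
```

Under these assumptions the defaults give a 2000 ms gap per domain, and `--rpm=60 --min-delay=500` would give 1000 ms, because the rpm interval is then the binding limit. Requests to a different domain are not delayed by each other.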

The scraper context

run(ctx) receives:

| | |
|---|---|
| ctx.fetch | HTTP client with realistic headers, proxy-rotation-ready |
| ctx.browser | Lazy stealth Playwright facade; newPage() survives Akamai / Cloudflare / DataDome checks |
| ctx.pushData | Stream a result row out (persisted immediately) |
| ctx.checkpoint | Key-value store persisted to disk |
| ctx.tasks | Deduplicating, persistent task queue |
| ctx.human | Human-like pauses, mouse movement, warmup navigation, smooth scrolling |
| ctx.log / ctx.sleep / ctx.args | Structured logging, pacing, CLI args (--key=value) |
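The README does not show the methods on ctx.tasks, so the following is only an illustration of what "deduplicating, persistent" can mean in practice. `DedupQueue`, `Task`, `snapshot()`, and `restore()` are hypothetical names for an in-memory stand-in; the real queue persists under data/ and may work differently.

```typescript
// Hypothetical in-memory stand-in, not the ctx.tasks API.
type Task = { key: string; url: string }

class DedupQueue {
  private seen = new Set<string>()
  private pending: Task[] = []

  // Adds the task unless its key was ever enqueued; returns whether it was added.
  add(task: Task): boolean {
    if (this.seen.has(task.key)) return false
    this.seen.add(task.key)
    this.pending.push(task)
    return true
  }

  // Removes and returns the oldest pending task, if any.
  next(): Task | undefined {
    return this.pending.shift()
  }

  // Serialises the state so an interrupted run could reload it later.
  snapshot(): string {
    return JSON.stringify({ seen: Array.from(this.seen), pending: this.pending })
  }

  // Rebuilds a queue from a snapshot, keeping both pending work and seen keys.
  static restore(json: string): DedupQueue {
    const state = JSON.parse(json) as { seen: string[]; pending: Task[] }
    const queue = new DedupQueue()
    queue.seen = new Set(state.seen)
    queue.pending = state.pending
    return queue
  }
}

const queue = new DedupQueue()
const page1: Task = { key: "page-1", url: "https://quotes.toscrape.com/page/1/" }
const page2: Task = { key: "page-2", url: "https://quotes.toscrape.com/page/2/" }

const addedFirst = queue.add(page1)
const addedDuplicate = queue.add(page1) // same key, so it is skipped
queue.add(page2)
const started = queue.next() // page-1 is taken before the simulated interruption

const resumed = DedupQueue.restore(queue.snapshot())
const reAddedAfterResume = resumed.add(page1) // still remembered as seen
const nextAfterResume = resumed.next()
```

Keeping the set of seen keys separate from the pending list is what lets a resumed run skip pages it already handled, even though they are no longer in the queue.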

Reconnaissance tools

import {
  extractAllHydrationPayloads, walkSchema, filterLeaves, // hydration payloads (__NEXT_DATA__, JSON-LD, window.*)
  startXhrCapture, capturedToPayloads,                   // intercept JSON/XHR traffic before navigation
  auditPage, findMissingFields,                          // visual ground truth: screenshots + label/value pairs
  scoreCard, formatScoreCard,                            // completeness / quality / authenticity / freshness scoring
} from "@vinxi/scraper/recon"

These are deliberately noisy scaffolding: they help an agent see a page fast, then the agent writes a small per-site extractor. The methodology — six recon phases with quality gates, the recon–extract loop, validation scorecards — is documented in the site-reconnaissance skill.

Canonical retail schema

@vinxi/scraper/retail exports RetailStore — a cross-brand store/dealer/branch/ATM location schema — plus helpers: makeRetailStore() (build a record from a partial, filling safe defaults), extraOf(), normaliseHours(), normaliseCountry().

Validation scorecard

scoreCard() scores a cohort of store-locator / geo records (0–100 across completeness, data quality, source authenticity, freshness). It expects RetailStore shapes — build them with makeRetailStore(). It defaults to India geography; pass geoBounds / expectedMetros for other regions.

import { scoreCard, formatScoreCard } from "@vinxi/scraper/recon"
import { makeRetailStore } from "@vinxi/scraper/retail"

const stores = rows.map((r) => makeRetailStore({ brand: "Acme", name: r.name, lat: r.lat, lng: r.lng, address: r.address }))
const card = scoreCard(stores, {
  brand: "Acme",
  sourceDomain: "acme.com",
  officialDomain: "acme.com",
  authoritativeCount: 366,
  geoBounds: { minLat: 24, maxLat: 50, minLng: -125, maxLng: -66 }, // US (omit for India default)
  expectedMetros: ["New York", "Los Angeles", "Chicago"],
})
console.log(formatScoreCard(card)) // → band: high/medium/low + evidence
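For intuition about the geoBounds option, here is a rough sketch of a bounding-box check over coordinates. It reuses the geoBounds shape from the call above, but `inBounds` and `shareInBounds` are hypothetical helpers, and how scoreCard() actually weighs out-of-bounds records is not documented here.

```typescript
// Hypothetical helpers, not exports of @vinxi/scraper.
type GeoBounds = { minLat: number; maxLat: number; minLng: number; maxLng: number }
type Point = { lat: number; lng: number }

// True when the point lies inside the box (edges included).
function inBounds(p: Point, b: GeoBounds): boolean {
  return p.lat >= b.minLat && p.lat <= b.maxLat && p.lng >= b.minLng && p.lng <= b.maxLng
}

// Fraction of points inside the box; 0 for an empty list.
function shareInBounds(points: Point[], b: GeoBounds): number {
  if (points.length === 0) return 0
  return points.filter((p) => inBounds(p, b)).length / points.length
}

const usBounds: GeoBounds = { minLat: 24, maxLat: 50, minLng: -125, maxLng: -66 }
const sample: Point[] = [
  { lat: 40.71, lng: -74.01 },  // New York
  { lat: 34.05, lng: -118.24 }, // Los Angeles
  { lat: 28.61, lng: 77.21 },   // Delhi, outside the US box
  { lat: 0, lng: 0 },           // placeholder coordinates, a common data-quality smell
]
const share = shareInBounds(sample, usBounds)
```

Here half of the sample falls inside the box. A low share for a brand that operates in one country usually points at swapped lat/lng values or placeholder coordinates.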

The scorecard is for physical-location data. A product catalog (SKUs, prices, variants) has no geographic scorecard — validate those by sampling fields and reconciling counts instead.
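One possible way to do that kind of catalog check is sketched below: measure how often each field is filled and compare the scraped row count with a count the site reports. `fillRate`, `reconcileCount`, the 2% tolerance, and the sample rows are all illustrative assumptions, not part of the package.

```typescript
// Hypothetical helpers for catalog rows, not exports of @vinxi/scraper.
type Row = Record<string, unknown>

// Share of rows where `field` is present and non-empty.
function fillRate(rows: Row[], field: string): number {
  if (rows.length === 0) return 0
  const filled = rows.filter((r) => {
    const value = r[field]
    return value !== undefined && value !== null && value !== ""
  })
  return filled.length / rows.length
}

// Relative difference between the scraped count and the count the site reports.
function reconcileCount(scraped: number, reported: number, tolerance: number = 0.02) {
  const drift =
    reported === 0 ? (scraped === 0 ? 0 : 1) : Math.abs(scraped - reported) / reported
  return { drift, ok: drift <= tolerance }
}

// Made-up rows for illustration only.
const catalog: Row[] = [
  { sku: "A-1", price: 499, variants: ["S", "M"] },
  { sku: "A-2", price: null, variants: [] },
  { sku: "A-3", price: 899, variants: ["L"] },
  { sku: "", price: 120, variants: ["M"] },
]

const priceFill = fillRate(catalog, "price")
const skuFill = fillRate(catalog, "sku")
const countCheck = reconcileCount(catalog.length, 5) // suppose the listing page says 5 products
```

In this sample both fields are filled in three of four rows, and the count check fails because one of five reported products is missing. Either signal is a reason to revisit pagination or selectors before trusting the export.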

License

MIT
