
edgecrawl

v0.3.4

Local AI-powered web scraper. Extract structured JSON from any website using on-device ONNX LLMs. No API keys, no cloud, no Python.


Features

  • 100% Local AI — Runs Qwen3 ONNX models on your machine via Transformers.js v4 (WebGPU/WASM)
  • Zero API Keys — No OpenAI, no Anthropic, no cloud bills. Everything runs on-device
  • Structured JSON Output — Define a schema, get clean JSON back
  • CLI + Library — Use from the command line or import into your Node.js app

Architecture

Playwright (headless browser)
  |
  v
JSDOM + node-html-markdown -> Clean Markdown
  |
  v
Qwen3 ONNX (Transformers.js v4) -> Structured JSON

Quick Start

Install from npm

npm install edgecrawl
npx playwright install chromium

Install from source

git clone https://github.com/couzip/edgecrawl.git
cd edgecrawl
npm install
npx playwright install chromium

Models are downloaded automatically on first run:

  • LLM: Qwen3 ONNX (0.4-2.5 GB depending on preset)

CLI

# Extract structured data from a URL
edgecrawl extract https://example.com

# With custom schema
edgecrawl extract https://example.com -s schemas/product.json -o result.json

# Light model on WASM
edgecrawl extract https://example.com -p light -d wasm

# Convert to Markdown only (no LLM)
edgecrawl md https://example.com

Library

import { scrapeAndExtract, cleanup } from "edgecrawl";

const result = await scrapeAndExtract("https://example.com", {
  preset: "balanced",
});

console.log(result.extracted);
await cleanup();

CLI Commands

extract <url> — Structured extraction

# Default (balanced model, WebGPU)
edgecrawl extract https://example.com

# Light model on WASM
edgecrawl extract https://example.com -p light -d wasm

# Custom schema + output file
edgecrawl extract https://example.com -s schemas/product.json -o result.json

# Target a specific section
edgecrawl extract https://example.com --selector "main article"

batch <file> — Batch processing

# Process URL list (one URL per line)
edgecrawl batch urls.txt -o results.json

# With concurrency control
edgecrawl batch urls.txt -c 5
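Before a large batch run, it can help to sanity-check the URL list first. A small standalone helper for that (not part of edgecrawl; `parseUrlList` is an illustrative name) that trims whitespace, skips blank lines and `#` comments, and drops duplicates:

```javascript
// Normalize a "one URL per line" list before passing it to `edgecrawl batch`:
// trims whitespace, skips blank lines and `#` comments, and removes duplicates
// while preserving the original order.
function parseUrlList(text) {
  const seen = new Set();
  const urls = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (!line || line.startsWith("#")) continue;
    if (!seen.has(line)) {
      seen.add(line);
      urls.push(line);
    }
  }
  return urls;
}
```

Write the cleaned list back to `urls.txt` and batch-process it as shown above.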

query <url> <prompt> — Custom question

# Ask a question about page content
edgecrawl query https://example.com "What are the main products?"

md <url> — Markdown conversion only (no LLM)

edgecrawl md https://example.com
edgecrawl md https://example.com -o page.md --scroll

CLI Options

Common Options (extract, batch, query)

| Option | Description | Default |
|--------|-------------|---------|
| -p, --preset <preset> | Model preset: light / balanced / quality | balanced |
| -d, --device <device> | Inference device: webgpu / wasm | webgpu |
| -s, --schema <file> | Custom schema JSON file | built-in default |
| -o, --output <file> | Output file path | stdout |
| -t, --max-tokens <n> | Max input tokens for LLM | 2048 |
| --selector <selector> | CSS selector to narrow target content | - |

Batch Options

| Option | Description | Default |
|--------|-------------|---------|
| -c, --concurrency <n> | Concurrent scraping limit | 3 |

Browser Options (all commands)

| Option | Description | Default |
|--------|-------------|---------|
| --headful | Show browser window (for debugging) | false |
| --user-agent <ua> | Custom User-Agent string | - |
| --timeout <ms> | Page load timeout in milliseconds | 30000 |
| --proxy <url> | Proxy server URL | - |
| --cookie <cookie> | Cookie in name=value format (repeatable) | - |
| --extra-header <header> | HTTP header in Key:Value format (repeatable) | - |
| --viewport <WxH> | Viewport size | 1280x800 |
| --wait-until <event> | Navigation wait condition: load / domcontentloaded / networkidle | load |
| --no-block-media | Disable blocking of images/fonts/media | false |
| --scroll | Scroll to bottom (for lazy-loaded content) | false |
| --wait <selector> | Wait for CSS selector to appear | - |

Library Usage

High-level Pipeline

import {
  scrapeAndExtract,
  batchScrapeAndExtract,
  scrapeAndQuery,
  cleanup,
} from "edgecrawl";

// Basic extraction
const result = await scrapeAndExtract("https://example.com");

// Custom schema
const product = await scrapeAndExtract("https://shop.example.com/item", {
  schema: {
    type: "object",
    properties: {
      name: { type: "string", description: "Product name" },
      price: { type: "number", description: "Price (numeric)" },
      features: {
        type: "array",
        items: { type: "string" },
        description: "Key features or specs",
      },
    },
    required: ["name", "price"],
  },
});

// Batch processing
const results = await batchScrapeAndExtract(
  ["https://example.com/1", "https://example.com/2"],
  { concurrency: 3 }
);

// Custom query
const answer = await scrapeAndQuery(
  "https://example.com",
  "What are the main products?",
  { preset: "quality" }
);

await cleanup();

Low-level APIs

// Use individual modules
import { htmlToMarkdown, cleanMarkdown } from "edgecrawl/html2md";
import { launchBrowser, fetchPage, closeBrowser } from "edgecrawl/scraper";
import { initLLM, extractStructured } from "edgecrawl/llm";

// HTML to Markdown only
await launchBrowser();
const { html } = await fetchPage("https://example.com");
const { markdown, title } = htmlToMarkdown(html, "https://example.com");
const cleaned = cleanMarkdown(markdown);
await closeBrowser();

// Or use the root export
import { htmlToMarkdown, cleanMarkdown, fetchPage } from "edgecrawl";

Custom Schemas

Define what data to extract by providing a JSON schema file:

{
  "type": "object",
  "properties": {
    "name": { "type": "string", "description": "Product name" },
    "price": { "type": "number", "description": "Price (numeric)" },
    "currency": { "type": "string", "description": "Currency code (e.g. USD, EUR, JPY)" },
    "description": { "type": "string", "description": "Product description (1-3 sentences)" },
    "features": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Key features or specs"
    },
    "availability": { "type": "string", "description": "Stock status (in stock, out of stock, etc.)" }
  },
  "required": ["name", "price", "currency"]
}

edgecrawl extract https://shop.example.com/product -s schema.json

See the schemas/ directory for more examples.
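Because on-device models can occasionally omit fields, it is worth verifying the extracted object against the schema's required list before trusting it downstream. A minimal standalone check (a hypothetical helper, not an edgecrawl API):

```javascript
// Return the names of `required` schema properties that are missing from
// (or null/undefined in) an extraction result object.
function missingRequired(schema, result) {
  return (schema.required ?? []).filter(
    (key) => result?.[key] === undefined || result?.[key] === null
  );
}

// Example with the product schema above:
const productSchema = { required: ["name", "price", "currency"] };
const extracted = { name: "Widget", price: 9.99 };
console.log(missingRequired(productSchema, extracted)); // → ["currency"]
```

If the returned array is non-empty, you could retry with the quality preset or flag the page for manual review.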

Model Presets

| Preset | Model | Size | Speed | Quality |
|--------|-------|------|-------|---------|
| light | Qwen3-0.6B | ~0.4 GB | Fast | Good for simple pages |
| balanced | Qwen3-1.7B | ~1.2 GB | Medium | Best balance (default) |
| quality | Qwen3-4B | ~2.5 GB | Slower | Best accuracy |

All models run locally via ONNX Runtime. First run downloads the model to .model-cache/.

Tech Stack

| Component | Library | Role |
|-----------|---------|------|
| Browser | Playwright | Headless scraping |
| HTML -> Markdown | JSDOM + node-html-markdown | Content cleaning + Markdown conversion |
| LLM | Transformers.js v4 + Qwen3 ONNX | Local structured extraction |
| CLI | Commander.js | Command-line interface |

AI Agent Skill

A skill file is included for AI coding agents. Install it to let your agent use edgecrawl directly:

npx skills add couzip/edgecrawl

Once installed, your AI agent can scrape websites and extract structured data using edgecrawl.

Requirements

  • Node.js >= 20.0.0
  • Chromium (installed via npx playwright install chromium)
  • ~1-3 GB disk space for models (downloaded on first run)
  • GPU recommended for WebGPU mode (falls back to WASM/CPU)
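The Node.js floor can be enforced up front with a preflight check (an illustrative sketch, not something edgecrawl ships):

```javascript
// Check a semver-style version string against the Node.js >= 20.0.0 requirement.
function meetsNodeRequirement(version, minMajor = 20) {
  const major = Number.parseInt(version.split(".")[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

// Usage at the top of a script:
//   if (!meetsNodeRequirement(process.versions.node)) {
//     console.error(`Node.js >= 20 required, found ${process.versions.node}`);
//     process.exit(1);
//   }
```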

License

MIT