# edgecrawl

v0.3.4

Local AI-powered web scraper. Extract structured JSON from any website using on-device ONNX LLMs. No API keys, no cloud, no Python.
## Features
- 100% Local AI — Runs Qwen3 ONNX models on your machine via Transformers.js v4 (WebGPU/WASM)
- Zero API Keys — No OpenAI, no Anthropic, no cloud bills. Everything runs on-device
- Structured JSON Output — Define a schema, get clean JSON back
- CLI + Library — Use from the command line or import into your Node.js app
## Architecture

```
Playwright (headless browser)
        |
        v
JSDOM + node-html-markdown -> Clean Markdown
        |
        v
Qwen3 ONNX (Transformers.js v4) -> Structured JSON
```

## Quick Start
### Install from npm

```shell
npm install edgecrawl
npx playwright install chromium
```

### Install from source

```shell
git clone https://github.com/couzip/edgecrawl.git
cd edgecrawl
npm install
npx playwright install chromium
```

Models are downloaded automatically on first run:

- LLM: Qwen3 ONNX (0.4-2.5 GB depending on preset)
## CLI

```shell
# Extract structured data from a URL
edgecrawl extract https://example.com

# With custom schema
edgecrawl extract https://example.com -s schemas/product.json -o result.json

# Light model on WASM
edgecrawl extract https://example.com -p light -d wasm

# Convert to Markdown only (no LLM)
edgecrawl md https://example.com
```

## Library
```javascript
import { scrapeAndExtract, cleanup } from "edgecrawl";

const result = await scrapeAndExtract("https://example.com", {
  preset: "balanced",
});
console.log(result.extracted);

await cleanup();
```

## CLI Commands
### `extract <url>` — Structured extraction

```shell
# Default (balanced model, WebGPU)
edgecrawl extract https://example.com

# Light model on WASM
edgecrawl extract https://example.com -p light -d wasm

# Custom schema + output file
edgecrawl extract https://example.com -s schemas/product.json -o result.json

# Target a specific section
edgecrawl extract https://example.com --selector "main article"
```

### `batch <file>` — Batch processing
```shell
# Process URL list (one URL per line)
edgecrawl batch urls.txt -o results.json

# With concurrency control
edgecrawl batch urls.txt -c 5
```

### `query <url> <prompt>` — Custom question
```shell
# Ask a question about page content
edgecrawl query https://example.com "What are the main products?"
```

### `md <url>` — Markdown conversion only (no LLM)

```shell
edgecrawl md https://example.com
edgecrawl md https://example.com -o page.md --scroll
```

## CLI Options
### Common Options (`extract`, `batch`, `query`)
| Option | Description | Default |
|--------|-------------|---------|
| -p, --preset <preset> | Model preset: light / balanced / quality | balanced |
| -d, --device <device> | Inference device: webgpu / wasm | webgpu |
| -s, --schema <file> | Custom schema JSON file | built-in default |
| -o, --output <file> | Output file path | stdout |
| -t, --max-tokens <n> | Max input tokens for LLM | 2048 |
| --selector <selector> | CSS selector to narrow target content | - |
### Batch Options
| Option | Description | Default |
|--------|-------------|---------|
| -c, --concurrency <n> | Concurrent scraping limit | 3 |
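The batch input is a plain text file with one URL per line. A minimal sketch of preparing such a file (the URLs and filenames here are placeholders):

```shell
# Create a URL list, one URL per line
cat > urls.txt <<'EOF'
https://example.com/page-1
https://example.com/page-2
https://example.com/page-3
EOF

# Then process it with bounded concurrency (requires edgecrawl installed):
# edgecrawl batch urls.txt -c 2 -o results.json
```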
### Browser Options (all commands)
| Option | Description | Default |
|--------|-------------|---------|
| --headful | Show browser window (for debugging) | false |
| --user-agent <ua> | Custom User-Agent string | - |
| --timeout <ms> | Page load timeout in milliseconds | 30000 |
| --proxy <url> | Proxy server URL | - |
| --cookie <cookie> | Cookie in name=value format (repeatable) | - |
| --extra-header <header> | HTTP header in Key:Value format (repeatable) | - |
| --viewport <WxH> | Viewport size | 1280x800 |
| --wait-until <event> | Navigation wait condition: load / domcontentloaded / networkidle | load |
| --no-block-media | Disable blocking of images/fonts/media | false |
| --scroll | Scroll to bottom (for lazy-loaded content) | false |
| --wait <selector> | Wait for CSS selector to appear | - |
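Cookies are passed in `name=value` form and must become browser cookie objects internally. The helper below is a hypothetical sketch of that conversion, not edgecrawl's actual implementation; note that only the first `=` separates name from value, so values containing `=` survive intact:

```javascript
// Hypothetical sketch: turn repeatable --cookie name=value flags
// into Playwright-style cookie objects for a target URL.
function parseCookieFlags(flags, url) {
  return flags.map((flag) => {
    const eq = flag.indexOf("=");
    if (eq <= 0) throw new Error(`Invalid cookie flag: ${flag}`);
    return {
      name: flag.slice(0, eq),
      value: flag.slice(eq + 1),
      url, // lets the browser derive domain and path
    };
  });
}

console.log(parseCookieFlags(["session=abc=123", "theme=dark"], "https://example.com"));
```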
## Library Usage

### High-level Pipeline
```javascript
import {
  scrapeAndExtract,
  batchScrapeAndExtract,
  scrapeAndQuery,
  cleanup,
} from "edgecrawl";

// Basic extraction
const result = await scrapeAndExtract("https://example.com");

// Custom schema
const product = await scrapeAndExtract("https://shop.example.com/item", {
  schema: {
    type: "object",
    properties: {
      name: { type: "string", description: "Product name" },
      price: { type: "number", description: "Price (numeric)" },
      features: {
        type: "array",
        items: { type: "string" },
        description: "Key features or specs",
      },
    },
    required: ["name", "price"],
  },
});

// Batch processing
const results = await batchScrapeAndExtract(
  ["https://example.com/1", "https://example.com/2"],
  { concurrency: 3 }
);

// Custom query
const answer = await scrapeAndQuery(
  "https://example.com",
  "What are the main products?",
  { preset: "quality" }
);

await cleanup();
```

### Low-level APIs
```javascript
// Use individual modules
import { htmlToMarkdown, cleanMarkdown } from "edgecrawl/html2md";
import { launchBrowser, fetchPage, closeBrowser } from "edgecrawl/scraper";
import { initLLM, extractStructured } from "edgecrawl/llm";

// HTML to Markdown only
await launchBrowser();
const { html } = await fetchPage("https://example.com");
const { markdown, title } = htmlToMarkdown(html, "https://example.com");
const cleaned = cleanMarkdown(markdown);
await closeBrowser();
```

The same functions are also available from the root export:

```javascript
import { htmlToMarkdown, cleanMarkdown, fetchPage } from "edgecrawl";
```

## Custom Schemas
Define what data to extract by providing a JSON schema file:
```json
{
  "type": "object",
  "properties": {
    "name": { "type": "string", "description": "Product name" },
    "price": { "type": "number", "description": "Price (numeric)" },
    "currency": { "type": "string", "description": "Currency code (e.g. USD, EUR, JPY)" },
    "description": { "type": "string", "description": "Product description (1-3 sentences)" },
    "features": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Key features or specs"
    },
    "availability": { "type": "string", "description": "Stock status (in stock, out of stock, etc.)" }
  },
  "required": ["name", "price", "currency"]
}
```

```shell
edgecrawl extract https://shop.example.com/product -s schema.json
```

See the `schemas/` directory for more examples.
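The model fills the schema on a best-effort basis, so it can help to sanity-check the result before trusting it. The `checkRequired` helper below is illustrative and not part of edgecrawl:

```javascript
// Illustrative helper: confirm an extracted object actually
// contains the schema's required fields with non-null values.
function checkRequired(extracted, schema) {
  const missing = (schema.required ?? []).filter(
    (key) => extracted[key] === undefined || extracted[key] === null
  );
  return { ok: missing.length === 0, missing };
}

const schema = { required: ["name", "price", "currency"] };
const extracted = { name: "Widget", price: 9.99 };
console.log(checkRequired(extracted, schema));
// { ok: false, missing: [ 'currency' ] }
```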
## Model Presets
| Preset | Model | Size | Speed | Quality |
|--------|-------|------|-------|---------|
| light | Qwen3-0.6B | ~0.4 GB | Fast | Good for simple pages |
| balanced | Qwen3-1.7B | ~1.2 GB | Medium | Best balance (default) |
| quality | Qwen3-4B | ~2.5 GB | Slower | Best accuracy |
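Preset choice is mostly a disk and latency trade-off. A hypothetical helper (not part of edgecrawl) that picks the largest preset fitting a disk budget, using the approximate sizes from the table above:

```javascript
// Approximate model download sizes from the preset table (GB).
const PRESET_SIZES = { quality: 2.5, balanced: 1.2, light: 0.4 };

// Pick the largest preset whose model fits the given disk budget,
// falling back to "light" if nothing fits.
function pickPreset(maxDiskGB) {
  const byQuality = ["quality", "balanced", "light"];
  return byQuality.find((p) => PRESET_SIZES[p] <= maxDiskGB) ?? "light";
}

console.log(pickPreset(1.5)); // balanced
console.log(pickPreset(0.3)); // light
```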
All models run locally via ONNX Runtime. The first run downloads the model to `.model-cache/`.
## Tech Stack

| Component | Library | Role |
|-----------|---------|------|
| Browser | Playwright | Headless scraping |
| HTML -> Markdown | JSDOM + node-html-markdown | Content cleaning + Markdown conversion |
| LLM | Transformers.js v4 + Qwen3 ONNX | Local structured extraction |
| CLI | Commander.js | Command-line interface |
## AI Agent Skill
A skill file is included for AI coding agents. Install it to let your agent use edgecrawl directly:
```shell
npx skills add couzip/edgecrawl
```

Once installed, your AI agent can scrape websites and extract structured data using edgecrawl.
## Requirements

- Node.js >= 20.0.0
- Chromium (installed via `npx playwright install chromium`)
- ~1-3 GB disk space for models (downloaded on first run)
- GPU recommended for WebGPU mode (falls back to WASM/CPU)
## License
MIT
