# OpenScrape
OpenScrape is a fully open-source web scraping library that mimics the core features of commercial scraping APIs. Built with TypeScript for Node.js 18+, it provides headless browser rendering, automatic pagination detection, clean data extraction, and both CLI and REST API interfaces.
## Features
- 🚀 Headless Browser Rendering - Full JavaScript rendering using Playwright
- 📄 Pagination & Navigation - Automatic detection of "next" links and "load more" buttons
- 🧹 Data Extraction & Normalization - Clean markdown or JSON output with noise removal
- ⚡ Rate Limiting & Concurrency - Safe request throttling with exponential backoff
- 🖥️ CLI Interface - Easy-to-use command-line tools
- 🌐 REST API - HTTP endpoints for programmatic access
- 📡 WebSocket - Real-time job status updates over WebSocket
- 📁 Media handling - Download images to an organized folder; optional base64-embed small images in JSON
- 🔧 Extensible - Custom extraction schemas and pagination callbacks
## Installation

```bash
npm install openscrape
```

Or install globally for CLI usage:

```bash
npm install -g openscrape
```

**Important:** After installation, you need to install the Playwright browsers:

```bash
npx playwright install chromium
```

This downloads the Chromium browser required for headless rendering.
## Docker
You can run OpenScrape in a container with no local Node or Playwright install.
Build the image:
```bash
docker build -t openscrape .
```

Run the API server (default; port 3000):

```bash
docker run -p 3000:3000 --init openscrape
```

Or with Docker Compose (a sketch of a compose file appears at the end of this section):

```bash
docker compose up --build
```

Then scrape via the API:

```bash
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'
```

Use the CLI inside the container (crawl a URL, write output to a mounted volume):

```bash
docker run --rm -v "$(pwd)/out:/out" openscrape crawl https://example.com/article -o /out/article.json
```

Batch scrape (mount a file with URLs and an output directory):

```bash
docker run --rm -v "$(pwd)/urls.txt:/app/urls.txt" -v "$(pwd)/scraped:/out" openscrape batch /app/urls.txt --output-dir /out --format markdown
```

Custom command (override the default serve):

```bash
docker run --rm openscrape crawl https://example.com -o /tmp/out.json --format json
```

The image includes Chromium and its dependencies; the default command is `serve --port 3000 --host 0.0.0.0`. Use `--init` to avoid zombie processes. For large workloads, you may need to raise the container's memory limit.
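For reference, a minimal compose file that mirrors the `docker run` flags above might look like this. This is a sketch, not the repository's actual compose file, which may use different service names or settings:

```yaml
# Hypothetical compose file mirroring the docker run flags shown above.
services:
  openscrape:
    build: .
    init: true        # same effect as docker run --init
    ports:
      - "3000:3000"   # API server default port
```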
## Quick Start

### CLI Usage
Scrape a single URL:
```bash
openscrape crawl https://example.com/article --output article.json
```

Scrape multiple URLs from a file:

```bash
openscrape batch urls.txt --output-dir ./scraped --format markdown
```

Start the API server:

```bash
openscrape serve --port 3000
```

### Programmatic Usage
```typescript
import { OpenScrape } from 'openscrape';

const scraper = new OpenScrape();

// Scrape a single URL
const data = await scraper.scrape({
  url: 'https://example.com/article',
  render: true,
  format: 'json',
  extractImages: true,
});

console.log(data.title);
console.log(data.content);
console.log(data.markdown);

await scraper.close();
```

### REST API
Start the server:
```bash
openscrape serve
```

Scrape a URL:

```bash
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'
```

Check job status:

```bash
curl http://localhost:3000/status/{jobId}
```

### WebSocket (real-time updates)

When you run `openscrape serve`, the server also exposes a WebSocket endpoint at `/ws`. Connect to receive real-time events for crawl jobs.

Endpoint: `ws://localhost:3000/ws` (or `wss://` in production with TLS)
Subscribe to a job by sending a JSON message:

```json
{ "type": "subscribe", "jobId": "<jobId>" }
```

Unsubscribe:

```json
{ "type": "unsubscribe", "jobId": "<jobId>" }
```

Server events (you receive JSON):
| Event | When |
|------------------|-------------------------|
| `job:created` | A new crawl job was created |
| `job:processing` | Scraping has started |
| `job:completed` | Scraping finished; `job.result` has the data |
| `job:failed` | Scraping failed; `job.error` has the message |
Example message:
```json
{
  "event": "job:completed",
  "jobId": "abc-123",
  "job": {
    "id": "abc-123",
    "url": "https://example.com/article",
    "status": "completed",
    "result": { "url": "...", "title": "...", "content": "...", "markdown": "..." },
    "createdAt": "2025-01-15T12:00:00.000Z",
    "completedAt": "2025-01-15T12:00:05.000Z"
  },
  "timestamp": "2025-01-15T12:00:05.000Z"
}
```

Minimal client example (Node):
```javascript
const WebSocket = require('ws');

const ws = new WebSocket('ws://localhost:3000/ws');

ws.on('open', async () => {
  // Create a crawl job first, then subscribe to its updates.
  const res = await fetch('http://localhost:3000/crawl', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: 'https://example.com/article' }),
  });
  const { jobId } = await res.json();
  ws.send(JSON.stringify({ type: 'subscribe', jobId }));
});

ws.on('message', (data) => {
  const msg = JSON.parse(data);
  console.log(msg.event, msg.job?.status, msg.job?.result?.title);
});
```

## Configuration
### Scrape Options
```typescript
interface ScrapeOptions {
  url: string;                   // URL to scrape (required)
  render?: boolean;              // Enable JS rendering (default: true)
  waitTime?: number;             // Wait time after load in ms (default: 2000)
  maxDepth?: number;             // Max pagination depth (default: 10)
  nextSelector?: string;         // Custom CSS selector for next link
  paginationCallback?: Function; // Custom pagination detection
  format?: 'json' | 'markdown' | 'html' | 'text' | 'csv' | 'yaml'; // Output format (default: 'json')
  extractionSchema?: object;     // Custom extraction schema
  autoDetectSchema?: boolean;    // Auto-detect schema from page (opt-in)
  schemaSamples?: string[];      // Sample URLs for schema detection (optional)
  llmExtract?: boolean;          // Use local LLM to extract structured JSON
  llmEndpoint?: string;          // Ollama or LM Studio endpoint URL
  llmModel?: string;             // Model name (default: 'llama2')
  userAgent?: string;            // Custom user agent
  proxy?: string | string[];     // Override proxy for this request (single URL or list for rotation)
  timeout?: number;              // Request timeout in ms (default: 30000)
  extractImages?: boolean;       // Extract images (default: true)
  extractMedia?: boolean;        // Extract embedded media (default: false)
  downloadMedia?: boolean;       // Download images to a local folder (default: false)
  mediaOutputDir?: string;       // Folder for downloads (default: ./media)
  base64EmbedImages?: boolean;   // Embed small images as base64 in JSON (default: false)
  base64EmbedMaxBytes?: number;  // Max size for embedding in bytes (default: 51200)
}
```
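For example, pagination depth and the next-link selector can be combined to walk a multi-page listing. A minimal sketch using only the options above (the URL and selector are hypothetical, site-specific values):

```typescript
import { OpenScrape } from 'openscrape';

const scraper = new OpenScrape();

const data = await scraper.scrape({
  url: 'https://example.com/blog',   // hypothetical multi-page listing
  maxDepth: 5,                       // follow at most 5 "next" hops
  nextSelector: 'a.pagination-next', // assumed site-specific next-page link
});

console.log(data.markdown);
await scraper.close();
```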
## Media & asset handling

You can save images (and other assets) locally and optionally embed small images as base64 in JSON.
Download media to a folder (organized by site and path):
```bash
openscrape crawl https://example.com/article --output article.json --download-media --media-dir ./media
```

Folder structure: `mediaOutputDir / hostname / path_slug / image_0.jpg`, e.g. `./media/example.com/article/image_0.jpg`.

Base64-embed small images in JSON (for self-contained output or small thumbnails):

```bash
openscrape crawl https://example.com/article --output article.json --embed-images --embed-images-max-size 51200
```

- Only images under the size limit (default 50KB) are embedded.
- The result includes `mediaEmbedded: [{ url, dataUrl, mimeType }]` with `data:image/...;base64,...` URLs.
Programmatic usage:
```typescript
const data = await scraper.scrape({
  url: 'https://example.com/article',
  downloadMedia: true,
  mediaOutputDir: './media',
  base64EmbedImages: true,
  base64EmbedMaxBytes: 51200,
});

// data.images → original URLs
// data.mediaDownloads → [{ url, localPath, mimeType }]
// data.mediaEmbedded → [{ url, dataUrl, mimeType }]
```

## Output formats
Besides `json` and `markdown`, OpenScrape can output:
| Format | Use case |
|--------|----------|
| `html` | Cleaned HTML (no scripts/nav); good for archiving or re-rendering. |
| `text` | Plain text only; good for search indexes or NLP. |
| `csv`  | List/table-like pages: first `<table>` as rows; otherwise one row with url, title, author, content. |
| `yaml` | Full structured data (url, title, author, content, images, etc.) in YAML. |
Examples:
```bash
openscrape crawl https://example.com/article -o page.html --format html
openscrape crawl https://example.com/table -o data.csv --format csv
openscrape crawl https://example.com/article -o meta.yaml --format yaml
```

Programmatic usage: the scraper always returns the full `ScrapedData`; use the formatters for string output:
```typescript
import { OpenScrape, toHtml, toText, toCsv, toYaml } from 'openscrape';

const scraper = new OpenScrape();
const data = await scraper.scrape({ url: 'https://example.com/article' });
await scraper.close();

const htmlString = toHtml(data);
const textString = toText(data);
const csvString = toCsv(data);
const yamlString = toYaml(data);
```

## LLM-based extraction (Ollama / LM Studio)
You can send the cleaned HTML or Markdown to a local LLM and get structured JSON (title, author, publishDate, content, metadata). Useful when pages have irregular structure.
Requirements: a local LLM server such as Ollama or LM Studio.
CLI:
```bash
# Ollama (default endpoint http://localhost:11434)
openscrape crawl https://example.com/article -o out.json --llm-extract --llm-model llama2

# Custom Ollama or LM Studio endpoint
openscrape crawl https://example.com/article -o out.json --llm-extract \
  --llm-endpoint http://localhost:1234/v1 --llm-model my-model
```

Programmatic:
```typescript
const data = await scraper.scrape({
  url: 'https://example.com/article',
  llmExtract: true,
  llmEndpoint: 'http://localhost:11434', // Ollama
  llmModel: 'llama2',
});

// data is merged with LLM-extracted fields; on error, data.metadata.llmError is set
```

- Ollama: use the base URL (e.g. `http://localhost:11434`); the client calls `/api/generate`.
- LM Studio: use the chat completions URL (e.g. `http://localhost:1234/v1`); the client calls `/v1/chat/completions`.
## Auto-detect schema (opt-in)

With `autoDetectSchema: true`, OpenScrape infers an extraction schema from the page (e.g. title from `<title>` or `og:title`, content from `article` or `.content`). Use it when you don't have a custom schema.
CLI:
```bash
openscrape crawl https://example.com/article -o out.json --auto-detect-schema
```

Programmatic:

```typescript
const data = await scraper.scrape({
  url: 'https://example.com/article',
  autoDetectSchema: true,
});
```

You can also use the schema detector directly:
```typescript
import { detectSchemaFromHtml } from 'openscrape';

const { schema, confidence, suggestions } = detectSchemaFromHtml(htmlString);
```

## Custom Extraction Schema
```typescript
const schema = {
  title: '.article-title',
  author: '.author-name',
  publishDate: '.publish-date',
  content: '.article-body',
  custom: [
    {
      name: 'category',
      selector: '.category',
    },
    {
      name: 'views',
      selector: '.views',
      transform: (value: string) => parseInt(value, 10),
    },
  ],
};

const data = await scraper.scrape({
  url: 'https://example.com/article',
  extractionSchema: schema,
});
```

## Rate Limiting
```typescript
const scraper = new OpenScrape({
  maxRequestsPerSecond: 5,
  maxConcurrency: 3,
});
```

## Proxy support (rotating & residential)
Use a single proxy or a list for round-robin rotation. Supports auth (`http://user:pass@host:port`), SOCKS5 (`socks5://host:port`), and residential proxy lists.
- Constructor: set a default proxy for all scrapes (single URL or array).
- Per-scrape: override with `options.proxy` for that request.
- Retries: on 403, 429, or timeout, the next proxy in the list is tried automatically.
Formats:
- `http://host:port` or `https://host:port`
- `http://user:pass@host:port` (auth)
- `socks5://host:port` or `socks5://user:pass@host:port`
CLI:
```bash
# Single proxy
openscrape crawl https://example.com --proxy http://user:pass@proxy.example.com:8080 -o out.json

# Rotating list (comma-separated)
openscrape crawl https://example.com --proxy "http://p1:8080,http://p2:8080,socks5://p3:1080" -o out.json

# Batch with proxy list
openscrape batch urls.txt --proxy "http://user:pass@proxy.example.com:8080" --output-dir ./out
```

Programmatic:
```typescript
// Single proxy or rotating list at construction
const scraper = new OpenScrape({
  proxy: 'http://user:pass@proxy.example.com:8080',
  maxConcurrency: 3,
});

// Or pass an array for rotation
const rotatingScraper = new OpenScrape({
  proxy: ['http://p1:8080', 'socks5://p2:1080', 'http://user:pass@p3:8080'],
});

// Per-scrape override
const data = await scraper.scrape({
  url: 'https://example.com',
  proxy: 'socks5://localhost:1080',
});
```

Low-level: use `parseProxyString()`, `normalizeProxyInput()`, and `ProxyPool` from the package for custom rotation logic.
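As a sketch of what custom rotation might look like: the helper names come from the line above, but their exact signatures are not documented here, so the constructor argument and `next()` method below are assumptions to verify against the package's type definitions.

```typescript
import { OpenScrape, normalizeProxyInput, ProxyPool } from 'openscrape';

// Assumed API: ProxyPool wraps a normalized proxy list and hands out
// proxies in round-robin order via next(). Check the actual typings.
const pool = new ProxyPool(
  normalizeProxyInput('http://p1:8080,socks5://p2:1080'),
);

const scraper = new OpenScrape();
const data = await scraper.scrape({
  url: 'https://example.com',
  proxy: pool.next(), // pick the next proxy for this request
});
await scraper.close();
```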
## CLI Commands

### `crawl <URL>`
Scrape a single URL and save to file.
Options:
- `-o, --output <path>` - Output file path (default: `output.json`)
- `--no-render` - Disable JavaScript rendering
- `--format <format>` - Output format: `json`, `markdown`, `html`, `text`, `csv`, or `yaml` (default: `json`)
- `--wait-time <ms>` - Wait time after page load (default: `2000`)
- `--max-depth <number>` - Maximum pagination depth (default: `10`)
- `--next-selector <selector>` - CSS selector for next link
- `--timeout <ms>` - Request timeout (default: `30000`)
- `--user-agent <ua>` - Custom user agent string
- `--llm-extract` - Use local LLM (Ollama/LM Studio) to extract structured data
- `--llm-endpoint <url>` - LLM endpoint (e.g. `http://localhost:11434` for Ollama)
- `--llm-model <name>` - Model name for LLM extraction (default: `llama2`)
- `--auto-detect-schema` - Auto-detect extraction schema from the page
- `--proxy <url>` - Proxy URL or comma-separated list for rotation (`http://user:pass@host:port`, `socks5://host:port`)
Example:
```bash
openscrape crawl https://example.com/article \
  --output article.md \
  --format markdown \
  --max-depth 5
```

### `batch <file>`
Scrape multiple URLs from a file (one URL per line).
Options:
- `-o, --output-dir <path>` - Output directory (default: `./output`)
- `--no-render` - Disable JavaScript rendering
- `--format <format>` - Output format: `json`, `markdown`, `html`, `text`, `csv`, or `yaml` (default: `json`)
- `--wait-time <ms>` - Wait time after page load (default: `2000`)
- `--max-depth <number>` - Maximum pagination depth (default: `10`)
- `--timeout <ms>` - Request timeout (default: `30000`)
- `--max-concurrency <number>` - Maximum concurrent requests (default: `3`)
- `--llm-extract` - Use local LLM to extract structured data per URL
- `--llm-endpoint <url>` - LLM endpoint URL
- `--llm-model <name>` - Model name (default: `llama2`)
- `--auto-detect-schema` - Auto-detect extraction schema from each page
- `--proxy <url>` - Proxy URL or comma-separated list for rotation
Example:
```bash
openscrape batch urls.txt \
  --output-dir ./scraped \
  --format markdown \
  --max-concurrency 5
```

### `serve`
Start the REST API server.
Options:
- `-p, --port <number>` - Port number (default: `3000`)
- `--host <host>` - Host address (default: `0.0.0.0`)
Example:
```bash
openscrape serve --port 8080
```

## REST API Endpoints

### POST /crawl
Scrape a URL asynchronously.
Request:
```json
{
  "url": "https://example.com/article",
  "options": {
    "render": true,
    "format": "json",
    "maxDepth": 5
  }
}
```

Response:
```json
{
  "jobId": "uuid-here",
  "status": "pending",
  "url": "https://example.com/article"
}
```

### GET /status/:jobId
Get the status and result of a crawl job.
Response:
```json
{
  "id": "uuid-here",
  "status": "completed",
  "url": "https://example.com/article",
  "createdAt": "2024-01-01T00:00:00.000Z",
  "completedAt": "2024-01-01T00:00:05.000Z",
  "result": {
    "url": "https://example.com/article",
    "title": "Article Title",
    "content": "...",
    "markdown": "...",
    "timestamp": "2024-01-01T00:00:05.000Z"
  }
}
```

### GET /jobs
List all crawl jobs.
### GET /health
Health check endpoint.
### GET /about

Credits and repository info. Returns `{ name, version, by, repository }` (e.g. `by: John F. Gonzales`, `repository: https://github.com/RantsRoamer/OpenScrape`).
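Putting the endpoints above together, a client can submit a crawl and poll until the job settles. A minimal sketch using the global `fetch` available in Node.js 18+ (for push-style updates, use the WebSocket endpoint instead):

```typescript
// Submit a crawl job, then poll /status/:jobId until it completes or fails.
const base = 'http://localhost:3000';

const { jobId } = await (await fetch(`${base}/crawl`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com/article' }),
})).json();

let job: any;
do {
  await new Promise((r) => setTimeout(r, 1000)); // poll once per second
  job = await (await fetch(`${base}/status/${jobId}`)).json();
} while (job.status !== 'completed' && job.status !== 'failed');

console.log(job.status, job.result?.title);
```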
## Development

### Prerequisites
- Node.js 18+
- npm or yarn
### Setup

```bash
git clone https://github.com/yourusername/openscrape.git
cd openscrape
npm install
```

### Build
```bash
npm run build
```

### Test

```bash
npm test
```

### Lint

```bash
npm run lint
```

## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Inspired by Firecrawl API
- Built with Playwright
- Uses Turndown for HTML to Markdown conversion
