

OpenScrape


OpenScrape is a fully open-source web scraping library that mimics the core features of commercial scraping APIs. Built with TypeScript for Node.js 18+, it provides headless browser rendering, automatic pagination detection, clean data extraction, and both CLI and REST API interfaces.

Features

  • 🚀 Headless Browser Rendering - Full JavaScript rendering using Playwright
  • 📄 Pagination & Navigation - Automatic detection of "next" links and "load more" buttons
  • 🧹 Data Extraction & Normalization - Clean markdown or JSON output with noise removal
  • Rate Limiting & Concurrency - Safe request throttling with exponential backoff
  • 🖥️ CLI Interface - Easy-to-use command-line tools
  • 🌐 REST API - HTTP endpoints for programmatic access
  • 📡 WebSocket - Real-time job status updates over WebSocket
  • 📁 Media handling - Download images to an organized folder; optional base64-embed small images in JSON
  • 🔧 Extensible - Custom extraction schemas and pagination callbacks

Installation

npm install openscrape

Or install globally for CLI usage:

npm install -g openscrape

Important: After installation, you need to install Playwright browsers:

npx playwright install chromium

This downloads the Chromium browser required for headless rendering.

Docker

You can run OpenScrape in a container with no local Node or Playwright install.

Build the image:

docker build -t openscrape .

Run the API server (default; port 3000):

docker run -p 3000:3000 --init openscrape

Or with Docker Compose:

docker compose up --build

Then scrape via the API:

curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

Use the CLI inside the container (crawl a URL, write output to a mounted volume):

docker run --rm -v "$(pwd)/out:/out" openscrape crawl https://example.com/article -o /out/article.json

Batch scrape (mount a file with URLs and an output directory):

docker run --rm -v "$(pwd)/urls.txt:/app/urls.txt" -v "$(pwd)/scraped:/out" openscrape batch /app/urls.txt --output-dir /out --format markdown

Custom command (override the default serve):

docker run --rm openscrape crawl https://example.com -o /tmp/out.json --format json

The image includes Chromium and its dependencies; the default command is serve --port 3000 --host 0.0.0.0. Use --init to avoid zombie processes. For large workloads, you may need to increase memory for the container.

Quick Start

CLI Usage

Scrape a single URL:

openscrape crawl https://example.com/article --output article.json

Scrape multiple URLs from a file:

openscrape batch urls.txt --output-dir ./scraped --format markdown

Start the API server:

openscrape serve --port 3000

Programmatic Usage

import { OpenScrape } from 'openscrape';

const scraper = new OpenScrape();

// Scrape a single URL
const data = await scraper.scrape({
  url: 'https://example.com/article',
  render: true,
  format: 'json',
  extractImages: true,
});

console.log(data.title);
console.log(data.content);
console.log(data.markdown);

await scraper.close();

REST API

Start the server:

openscrape serve

Scrape a URL:

curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

Check job status:

curl http://localhost:3000/status/{jobId}
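
Since /crawl is asynchronous, a typical client submits the URL, keeps the returned jobId, and polls /status/:jobId until the job finishes. A minimal TypeScript sketch of that flow (Node 18+ global fetch); field names follow the request and response examples in this README, and the 1-second poll interval is an arbitrary choice:

// Submit a crawl job and poll its status until it completes or fails.
const BASE = 'http://localhost:3000';

async function crawlAndWait(url: string) {
  const res = await fetch(`${BASE}/crawl`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url }),
  });
  const { jobId } = await res.json();

  while (true) {
    const job = await (await fetch(`${BASE}/status/${jobId}`)).json();
    if (job.status === 'completed') return job.result;
    if (job.status === 'failed') throw new Error(job.error ?? 'crawl failed');
    await new Promise((resolve) => setTimeout(resolve, 1000)); // wait 1s between polls
  }
}

const result = await crawlAndWait('https://example.com/article');
console.log(result.title);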

WebSocket (real-time updates)

When you run openscrape serve, the server also exposes a WebSocket endpoint at path /ws. Connect to receive real-time events for crawl jobs.

Endpoint: ws://localhost:3000/ws (or wss:// in production with TLS)

Subscribe to a job: Send a JSON message:

{ "type": "subscribe", "jobId": "<jobId>" }

Unsubscribe:

{ "type": "unsubscribe", "jobId": "<jobId>" }

Server events (you receive JSON):

| Event | When |
|-----------------|-------------------------|
| job:created | A new crawl job was created |
| job:processing | Scraping has started |
| job:completed | Scraping finished; job.result has the data |
| job:failed | Scraping failed; job.error has the message |

Example message:

{
  "event": "job:completed",
  "jobId": "abc-123",
  "job": {
    "id": "abc-123",
    "url": "https://example.com/article",
    "status": "completed",
    "result": { "url": "...", "title": "...", "content": "...", "markdown": "..." },
    "createdAt": "2025-01-15T12:00:00.000Z",
    "completedAt": "2025-01-15T12:00:05.000Z"
  },
  "timestamp": "2025-01-15T12:00:05.000Z"
}

Minimal client example (Node):

const WebSocket = require('ws');
const ws = new WebSocket('ws://localhost:3000/ws');

ws.on('open', () => {
  // First POST /crawl to get jobId, then:
  ws.send(JSON.stringify({ type: 'subscribe', jobId: 'YOUR_JOB_ID' }));
});
ws.on('message', (data) => {
  const msg = JSON.parse(data);
  console.log(msg.event, msg.job?.status, msg.job?.result?.title);
});

Configuration

Scrape Options

interface ScrapeOptions {
  url: string;                    // URL to scrape (required)
  render?: boolean;                // Enable JS rendering (default: true)
  waitTime?: number;              // Wait time after load in ms (default: 2000)
  maxDepth?: number;              // Max pagination depth (default: 10)
  nextSelector?: string;          // Custom CSS selector for next link
  paginationCallback?: Function;  // Custom pagination detection
  format?: 'json' | 'markdown' | 'html' | 'text' | 'csv' | 'yaml';  // Output format (default: 'json')
  extractionSchema?: object;      // Custom extraction schema
  autoDetectSchema?: boolean;     // Auto-detect schema from page (opt-in)
  schemaSamples?: string[];       // Sample URLs for schema detection (optional)
  llmExtract?: boolean;           // Use local LLM to extract structured JSON
  llmEndpoint?: string;           // Ollama or LM Studio endpoint URL
  llmModel?: string;              // Model name (default: 'llama2')
  userAgent?: string;             // Custom user agent
  proxy?: string | string[];      // Override proxy for this request (single URL or list for rotation)
  timeout?: number;               // Request timeout in ms (default: 30000)
  extractImages?: boolean;        // Extract images (default: true)
  extractMedia?: boolean;         // Extract embedded media (default: false)
  downloadMedia?: boolean;       // Download images to a local folder (default: false)
  mediaOutputDir?: string;       // Folder for downloads (default: ./media)
  base64EmbedImages?: boolean;   // Embed small images as base64 in JSON (default: false)
  base64EmbedMaxBytes?: number;  // Max size for embedding in bytes (default: 51200)
}
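
As a concrete example, the pagination options can be combined in a single call. A short sketch reusing the scraper instance from the Quick Start; the selector, depth, and wait time here are illustrative values, not defaults:

// Follow up to 5 paginated pages using a custom "next" selector,
// waiting 1 second after each page load. Values are illustrative.
const paged = await scraper.scrape({
  url: 'https://example.com/blog',
  render: true,
  waitTime: 1000,
  maxDepth: 5,
  nextSelector: 'a.pagination-next',
  format: 'markdown',
});
console.log(paged.markdown);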

Media & asset handling

You can save images (and other assets) locally and optionally embed small images as base64 in JSON.

Download media to a folder (organized by site and path):

openscrape crawl https://example.com/article --output article.json --download-media --media-dir ./media

Folder structure: mediaOutputDir / hostname / path_slug / image_0.jpg, e.g. ./media/example.com/article/image_0.jpg.

Base64-embed small images in JSON (for self-contained output or small thumbnails):

openscrape crawl https://example.com/article --output article.json --embed-images --embed-images-max-size 51200

  • Only images under the size limit (default 50KB) are embedded.
  • Result includes mediaEmbedded: [{ url, dataUrl, mimeType }] with data:image/...;base64,... URLs.

Programmatic usage:

const data = await scraper.scrape({
  url: 'https://example.com/article',
  downloadMedia: true,
  mediaOutputDir: './media',
  base64EmbedImages: true,
  base64EmbedMaxBytes: 51200,
});
// data.images       → original URLs
// data.mediaDownloads → [{ url, localPath, mimeType }]
// data.mediaEmbedded  → [{ url, dataUrl, mimeType }]

Output formats

Besides json and markdown, OpenScrape can output:

| Format | Use case |
|--------|----------|
| html | Cleaned HTML (no scripts/nav); good for archiving or re-rendering. |
| text | Plain text only; good for search indexes or NLP. |
| csv | List/table-like pages: first <table> as rows; otherwise one row with url, title, author, content. |
| yaml | Full structured data (url, title, author, content, images, etc.) in YAML. |

Examples:

openscrape crawl https://example.com/article -o page.html --format html
openscrape crawl https://example.com/table -o data.csv --format csv
openscrape crawl https://example.com/article -o meta.yaml --format yaml

Programmatic usage: the scraper always returns full ScrapedData; use the formatters for string output:

import { OpenScrape, toHtml, toText, toCsv, toYaml } from 'openscrape';

const scraper = new OpenScrape();
const data = await scraper.scrape({ url: 'https://example.com/article' });
await scraper.close();

const htmlString = toHtml(data);
const textString = toText(data);
const csvString = toCsv(data);
const yamlString = toYaml(data);

LLM-based extraction (Ollama / LM Studio)

You can send the cleaned HTML or Markdown to a local LLM and get structured JSON (title, author, publishDate, content, metadata). Useful when pages have irregular structure.

Requirements: A local endpoint such as Ollama or LM Studio.

CLI:

# Ollama (default endpoint http://localhost:11434)
openscrape crawl https://example.com/article -o out.json --llm-extract --llm-model llama2

# Custom Ollama or LM Studio endpoint
openscrape crawl https://example.com/article -o out.json --llm-extract \
  --llm-endpoint http://localhost:1234/v1 --llm-model my-model

Programmatic:

const data = await scraper.scrape({
  url: 'https://example.com/article',
  llmExtract: true,
  llmEndpoint: 'http://localhost:11434',  // Ollama
  llmModel: 'llama2',
});
// data is merged with LLM-extracted fields; on error, data.metadata.llmError is set

  • Ollama: use the base URL (e.g. http://localhost:11434); the client calls /api/generate.
  • LM Studio: use the chat completions URL (e.g. http://localhost:1234/v1); the client calls /v1/chat/completions.

Auto-detect schema (opt-in)

With autoDetectSchema: true, OpenScrape infers an extraction schema from the page (e.g. title from <title> or og:title, content from article or .content). Use it when you don’t have a custom schema.

CLI:

openscrape crawl https://example.com/article -o out.json --auto-detect-schema

Programmatic:

const data = await scraper.scrape({
  url: 'https://example.com/article',
  autoDetectSchema: true,
});

You can also use the schema detector directly:

import { detectSchemaFromHtml } from 'openscrape';

const { schema, confidence, suggestions } = detectSchemaFromHtml(htmlString);
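
A detected schema can also be reused as an extractionSchema for other pages with the same layout. A hedged sketch: fetching the sample HTML directly with fetch() and the 0.5 confidence threshold are illustrative assumptions, not part of the documented API:

// Detect a schema from one representative page, then reuse it for a similar page.
// Fetching the HTML this way and the confidence threshold are illustrative.
const html = await (await fetch('https://example.com/article-1')).text();
const { schema, confidence } = detectSchemaFromHtml(html);

if (confidence > 0.5) {
  const data = await scraper.scrape({
    url: 'https://example.com/article-2',
    extractionSchema: schema,
  });
  console.log(data.title);
}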

Custom Extraction Schema

const schema = {
  title: '.article-title',
  author: '.author-name',
  publishDate: '.publish-date',
  content: '.article-body',
  custom: [
    {
      name: 'category',
      selector: '.category',
    },
    {
      name: 'views',
      selector: '.views',
      transform: (value: string) => parseInt(value, 10),
    },
  ],
};

const data = await scraper.scrape({
  url: 'https://example.com/article',
  extractionSchema: schema,
});

Rate Limiting

const scraper = new OpenScrape({
  maxRequestsPerSecond: 5,
  maxConcurrency: 3,
});
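
With those limits set, you can queue many scrapes at once and let the library throttle them. A minimal sketch, assuming the limits above apply to concurrent scrape() calls on the same instance:

// Queue several scrapes; requests are throttled per the constructor limits.
// The URL list is illustrative.
const urls = [
  'https://example.com/a',
  'https://example.com/b',
  'https://example.com/c',
];
const results = await Promise.all(urls.map((url) => scraper.scrape({ url })));
results.forEach((r) => console.log(r.title));
await scraper.close();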

Proxy support (rotating & residential)

Use a single proxy or a list for round-robin rotation. Supports auth (http://user:pass@host:port), SOCKS5 (socks5://host:port), and residential proxy lists.

  • Constructor: set a default proxy for all scrapes (single URL or array).
  • Per-scrape: override with options.proxy for that request.
  • Retries: on 403, 429, or timeout, the next proxy in the list is tried automatically.

Formats:

  • http://host:port or https://host:port
  • http://user:pass@host:port (auth)
  • socks5://host:port or socks5://user:pass@host:port

CLI:

# Single proxy
openscrape crawl https://example.com --proxy http://user:pass@proxy.example.com:8080 -o out.json

# Rotating list (comma-separated)
openscrape crawl https://example.com --proxy "http://p1:8080,http://p2:8080,socks5://p3:1080" -o out.json

# Batch with proxy list
openscrape batch urls.txt --proxy "http://user:pass@proxy.example.com:8080" --output-dir ./out

Programmatic:

// Single proxy or rotating list at construction
const scraper = new OpenScrape({
  proxy: 'http://user:pass@proxy.example.com:8080',
  maxConcurrency: 3,
});

// Or pass an array for rotation
const scraper = new OpenScrape({
  proxy: ['http://p1:8080', 'socks5://p2:1080', 'http://user:pass@p3:8080'],
});

// Per-scrape override
const data = await scraper.scrape({
  url: 'https://example.com',
  proxy: 'socks5://localhost:1080',
});

Low-level: use parseProxyString(), normalizeProxyInput(), and ProxyPool from the package for custom rotation logic.
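
A hypothetical sketch of custom rotation with these helpers; their exact signatures are not documented here, so the ProxyPool constructor arguments and the next() method below are assumptions:

// Hypothetical: build a pool from proxy strings and pick one per request.
// ProxyPool's constructor and next() are assumed, not documented here.
import { OpenScrape, ProxyPool } from 'openscrape';

const pool = new ProxyPool(['http://p1:8080', 'socks5://p2:1080']);
const scraper = new OpenScrape();

const data = await scraper.scrape({
  url: 'https://example.com',
  proxy: pool.next(), // assumed to return the next proxy URL in round-robin order
});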

CLI Commands

crawl <URL>

Scrape a single URL and save to file.

Options:

  • -o, --output <path> - Output file path (default: output.json)
  • --no-render - Disable JavaScript rendering
  • --format <format> - Output format: json, markdown, html, text, csv, or yaml (default: json)
  • --wait-time <ms> - Wait time after page load (default: 2000)
  • --max-depth <number> - Maximum pagination depth (default: 10)
  • --next-selector <selector> - CSS selector for next link
  • --timeout <ms> - Request timeout (default: 30000)
  • --user-agent <ua> - Custom user agent string
  • --llm-extract - Use local LLM (Ollama/LM Studio) to extract structured data
  • --llm-endpoint <url> - LLM endpoint (e.g. http://localhost:11434 for Ollama)
  • --llm-model <name> - Model name for LLM extraction (default: llama2)
  • --auto-detect-schema - Auto-detect extraction schema from the page
  • --proxy <url> - Proxy URL or comma-separated list for rotation (http://user:pass@host:port, socks5://host:port)

Example:

openscrape crawl https://example.com/article \
  --output article.md \
  --format markdown \
  --max-depth 5

batch <file>

Scrape multiple URLs from a file (one URL per line).

Options:

  • -o, --output-dir <path> - Output directory (default: ./output)
  • --no-render - Disable JavaScript rendering
  • --format <format> - Output format: json, markdown, html, text, csv, or yaml (default: json)
  • --wait-time <ms> - Wait time after page load (default: 2000)
  • --max-depth <number> - Maximum pagination depth (default: 10)
  • --timeout <ms> - Request timeout (default: 30000)
  • --max-concurrency <number> - Maximum concurrent requests (default: 3)
  • --llm-extract - Use local LLM to extract structured data per URL
  • --llm-endpoint <url> - LLM endpoint URL
  • --llm-model <name> - Model name (default: llama2)
  • --auto-detect-schema - Auto-detect extraction schema from each page
  • --proxy <url> - Proxy URL or comma-separated list for rotation

Example:

openscrape batch urls.txt \
  --output-dir ./scraped \
  --format markdown \
  --max-concurrency 5

serve

Start the REST API server.

Options:

  • -p, --port <number> - Port number (default: 3000)
  • --host <host> - Host address (default: 0.0.0.0)

Example:

openscrape serve --port 8080

REST API Endpoints

POST /crawl

Scrape a URL asynchronously.

Request:

{
  "url": "https://example.com/article",
  "options": {
    "render": true,
    "format": "json",
    "maxDepth": 5
  }
}

Response:

{
  "jobId": "uuid-here",
  "status": "pending",
  "url": "https://example.com/article"
}

GET /status/:jobId

Get the status and result of a crawl job.

Response:

{
  "id": "uuid-here",
  "status": "completed",
  "url": "https://example.com/article",
  "createdAt": "2024-01-01T00:00:00.000Z",
  "completedAt": "2024-01-01T00:00:05.000Z",
  "result": {
    "url": "https://example.com/article",
    "title": "Article Title",
    "content": "...",
    "markdown": "...",
    "timestamp": "2024-01-01T00:00:05.000Z"
  }
}

GET /jobs

List all crawl jobs.

GET /health

Health check endpoint.

GET /about

Credits and repository info. Returns: { name, version, by, repository } (e.g. by: John F. Gonzales, repository: https://github.com/RantsRoamer/OpenScrape).
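
The read-only endpoints are handy for quick scripting and monitoring. A small sketch using Node 18+ global fetch; it assumes GET /jobs returns an array of job objects:

// List known jobs and print server credits. Assumes /jobs returns an array.
const base = 'http://localhost:3000';

const jobs = await (await fetch(`${base}/jobs`)).json();
console.log(`known jobs: ${jobs.length}`);

const about = await (await fetch(`${base}/about`)).json();
console.log(`${about.name} v${about.version} by ${about.by}`);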

Development

Prerequisites

  • Node.js 18+
  • npm or yarn

Setup

git clone https://github.com/yourusername/openscrape.git
cd openscrape
npm install

Build

npm run build

Test

npm test

Lint

npm run lint

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Inspired by Firecrawl API
  • Built with Playwright
  • Uses Turndown for HTML to Markdown conversion

Roadmap