# OpenScrape
OpenScrape is a fully open-source web scraping library that mimics the core features of commercial scraping APIs. Built with TypeScript for Node.js 18+, it provides headless browser rendering, automatic pagination detection, clean data extraction, and both CLI and REST API interfaces.
## Features
- 🚀 Headless Browser Rendering - Full JavaScript rendering using Playwright
- 📄 Pagination & Navigation - Automatic detection of "next" links and "load more" buttons
- 🧹 Data Extraction & Normalization - Clean markdown or JSON output with noise removal
- ⚡ Rate Limiting & Concurrency - Safe request throttling with exponential backoff
- 🖥️ CLI Interface - Easy-to-use command-line tools
- 🌐 REST API - HTTP endpoints for programmatic access
- 📡 WebSocket - Real-time job status updates over WebSocket
- 📁 Media handling - Download images to an organized folder; optional base64-embed small images in JSON
- 🔧 Extensible - Custom extraction schemas and pagination callbacks
## Installation

```bash
npm install openscrape
```

Or install globally for CLI usage:

```bash
npm install -g openscrape
```

**Important:** After installation, you need to install the Playwright browsers:

```bash
npx playwright install chromium
```

This downloads the Chromium browser required for headless rendering.
## Docker
You can run OpenScrape in a container with no local Node or Playwright install.
Build the image:
```bash
docker build -t openscrape .
```

Run the API server (default; port 3000):

```bash
docker run -p 3000:3000 --init openscrape
```

Or with Docker Compose (a sketch of a compose file appears at the end of this section):

```bash
docker compose up --build
```

Then scrape via the API:

```bash
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'
```

Use the CLI inside the container (crawl a URL, write output to a mounted volume):

```bash
docker run --rm -v "$(pwd)/out:/out" openscrape crawl https://example.com/article -o /out/article.json
```

Batch scrape (mount a file with URLs and an output directory):

```bash
docker run --rm -v "$(pwd)/urls.txt:/app/urls.txt" -v "$(pwd)/scraped:/out" openscrape batch /app/urls.txt --output-dir /out --format markdown
```

Custom command (override the default serve):

```bash
docker run --rm openscrape crawl https://example.com -o /tmp/out.json --format json
```

The image includes Chromium and its dependencies; the default command is `serve --port 3000 --host 0.0.0.0`. Use `--init` to avoid zombie processes. For large workloads, you may need to raise the container's memory limit.
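For reference, a minimal compose file that mirrors the `docker run` flags above might look like this. This is a sketch, not the repository's actual compose file, which may use different service names or settings:

```yaml
# Hypothetical compose file mirroring the docker run flags shown above.
services:
  openscrape:
    build: .
    init: true        # same effect as docker run --init
    ports:
      - "3000:3000"   # API server default port
```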
## Quick Start

### CLI Usage
Scrape a single URL:
```bash
openscrape crawl https://example.com/article --output article.json
```

Scrape multiple URLs from a file:

```bash
openscrape batch urls.txt --output-dir ./scraped --format markdown
```

Start the API server:

```bash
openscrape serve --port 3000
```

### Programmatic Usage
```typescript
import { OpenScrape } from 'openscrape';

const scraper = new OpenScrape();

// Scrape a single URL
const data = await scraper.scrape({
  url: 'https://example.com/article',
  render: true,
  format: 'json',
  extractImages: true,
});

console.log(data.title);
console.log(data.content);
console.log(data.markdown);

await scraper.close();
```

### REST API
Start the server:
```bash
openscrape serve
```

Scrape a URL:

```bash
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'
```

Check job status:

```bash
curl http://localhost:3000/status/{jobId}
```

### WebSocket (real-time updates)

When you run `openscrape serve`, the server also exposes a WebSocket endpoint at `/ws`. Connect to receive real-time events for crawl jobs.

Endpoint: `ws://localhost:3000/ws` (or `wss://` in production with TLS)
Subscribe to a job by sending a JSON message:

```json
{ "type": "subscribe", "jobId": "<jobId>" }
```

Unsubscribe:

```json
{ "type": "unsubscribe", "jobId": "<jobId>" }
```

Server events (you receive JSON):
| Event | When |
|------------------|-------------------------|
| `job:created` | A new crawl job was created |
| `job:processing` | Scraping has started |
| `job:completed` | Scraping finished; `job.result` has the data |
| `job:failed` | Scraping failed; `job.error` has the message |
Example message:
```json
{
  "event": "job:completed",
  "jobId": "abc-123",
  "job": {
    "id": "abc-123",
    "url": "https://example.com/article",
    "status": "completed",
    "result": { "url": "...", "title": "...", "content": "...", "markdown": "..." },
    "createdAt": "2025-01-15T12:00:00.000Z",
    "completedAt": "2025-01-15T12:00:05.000Z"
  },
  "timestamp": "2025-01-15T12:00:05.000Z"
}
```

Minimal client example (Node):
```javascript
const WebSocket = require('ws');

const ws = new WebSocket('ws://localhost:3000/ws');

ws.on('open', async () => {
  // Create a crawl job first, then subscribe to its updates.
  const res = await fetch('http://localhost:3000/crawl', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: 'https://example.com/article' }),
  });
  const { jobId } = await res.json();
  ws.send(JSON.stringify({ type: 'subscribe', jobId }));
});

ws.on('message', (data) => {
  const msg = JSON.parse(data);
  console.log(msg.event, msg.job?.status, msg.job?.result?.title);
});
```

## Configuration
### Scrape Options
```typescript
interface ScrapeOptions {
  url: string;                   // URL to scrape (required)
  render?: boolean;              // Enable JS rendering (default: true)
  waitTime?: number;             // Wait time after load in ms (default: 2000)
  maxDepth?: number;             // Max pagination depth (default: 10)
  nextSelector?: string;         // Custom CSS selector for next link
  paginationCallback?: Function; // Custom pagination detection
  format?: 'json' | 'markdown' | 'html' | 'text' | 'csv' | 'yaml'; // Output format (default: 'json')
  extractionSchema?: object;     // Custom extraction schema
  autoDetectSchema?: boolean;    // Auto-detect schema from page (opt-in)
  schemaSamples?: string[];      // Sample URLs for schema detection (optional)
  llmExtract?: boolean;          // Use local LLM to extract structured JSON
  llmEndpoint?: string;          // Ollama or LM Studio endpoint URL
  llmModel?: string;             // Model name (default: 'llama2')
  userAgent?: string;            // Custom user agent
  proxy?: string | string[];     // Override proxy for this request (single URL or list for rotation)
  timeout?: number;              // Request timeout in ms (default: 30000)
  extractImages?: boolean;       // Extract images (default: true)
  extractMedia?: boolean;        // Extract embedded media (default: false)
  downloadMedia?: boolean;       // Download images to a local folder (default: false)
  mediaOutputDir?: string;       // Folder for downloads (default: ./media)
  base64EmbedImages?: boolean;   // Embed small images as base64 in JSON (default: false)
  base64EmbedMaxBytes?: number;  // Max size for embedding in bytes (default: 51200)
}
```
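For example, pagination depth and the next-link selector can be combined to walk a multi-page listing. A minimal sketch using only the options above (the URL and selector are hypothetical, site-specific values):

```typescript
import { OpenScrape } from 'openscrape';

const scraper = new OpenScrape();

const data = await scraper.scrape({
  url: 'https://example.com/blog',   // hypothetical multi-page listing
  maxDepth: 5,                       // follow at most 5 "next" hops
  nextSelector: 'a.pagination-next', // assumed site-specific next-page link
});

console.log(data.markdown);
await scraper.close();
```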
## Media & asset handling

You can save images (and other assets) locally and optionally embed small images as base64 in JSON.
Download media to a folder (organized by site and path):
```bash
openscrape crawl https://example.com/article --output article.json --download-media --media-dir ./media
```

Folder structure: `mediaOutputDir / hostname / path_slug / image_0.jpg`, e.g. `./media/example.com/article/image_0.jpg`.

Base64-embed small images in JSON (for self-contained output or small thumbnails):

```bash
openscrape crawl https://example.com/article --output article.json --embed-images --embed-images-max-size 51200
```

- Only images under the size limit (default 50KB) are embedded.
- The result includes `mediaEmbedded: [{ url, dataUrl, mimeType }]` with `data:image/...;base64,...` URLs.
Programmatic usage:
```typescript
const data = await scraper.scrape({
  url: 'https://example.com/article',
  downloadMedia: true,
  mediaOutputDir: './media',
  base64EmbedImages: true,
  base64EmbedMaxBytes: 51200,
});

// data.images → original URLs
// data.mediaDownloads → [{ url, localPath, mimeType }]
// data.mediaEmbedded → [{ url, dataUrl, mimeType }]
```

## Output formats
Besides `json` and `markdown`, OpenScrape can output:
| Format | Use case |
|--------|----------|
| `html` | Cleaned HTML (no scripts/nav); good for archiving or re-rendering. |
| `text` | Plain text only; good for search indexes or NLP. |
| `csv`  | List/table-like pages: first `<table>` as rows; otherwise one row with url, title, author, content. |
| `yaml` | Full structured data (url, title, author, content, images, etc.) in YAML. |
Examples:
```bash
openscrape crawl https://example.com/article -o page.html --format html
openscrape crawl https://example.com/table -o data.csv --format csv
openscrape crawl https://example.com/article -o meta.yaml --format yaml
```

Programmatic usage: the scraper always returns the full `ScrapedData`; use the formatters for string output:
```typescript
import { OpenScrape, toHtml, toText, toCsv, toYaml } from 'openscrape';

const scraper = new OpenScrape();
const data = await scraper.scrape({ url: 'https://example.com/article' });
await scraper.close();

const htmlString = toHtml(data);
const textString = toText(data);
const csvString = toCsv(data);
const yamlString = toYaml(data);
```

## LLM-based extraction (Ollama / LM Studio)
You can send the cleaned HTML or Markdown to a local LLM and get structured JSON (title, author, publishDate, content, metadata). Useful when pages have irregular structure.
Requirements: a local LLM server such as Ollama or LM Studio.
CLI:
```bash
# Ollama (default endpoint http://localhost:11434)
openscrape crawl https://example.com/article -o out.json --llm-extract --llm-model llama2

# Custom Ollama or LM Studio endpoint
openscrape crawl https://example.com/article -o out.json --llm-extract \
  --llm-endpoint http://localhost:1234/v1 --llm-model my-model
```

Programmatic:
```typescript
const data = await scraper.scrape({
  url: 'https://example.com/article',
  llmExtract: true,
  llmEndpoint: 'http://localhost:11434', // Ollama
  llmModel: 'llama2',
});

// data is merged with LLM-extracted fields; on error, data.metadata.llmError is set
```

- Ollama: use the base URL (e.g. `http://localhost:11434`); the client calls `/api/generate`.
- LM Studio: use the chat completions URL (e.g. `http://localhost:1234/v1`); the client calls `/v1/chat/completions`.
## Auto-detect schema (opt-in)

With `autoDetectSchema: true`, OpenScrape infers an extraction schema from the page (e.g. title from `<title>` or `og:title`, content from `article` or `.content`). Use it when you don't have a custom schema.
CLI:
```bash
openscrape crawl https://example.com/article -o out.json --auto-detect-schema
```

Programmatic:

```typescript
const data = await scraper.scrape({
  url: 'https://example.com/article',
  autoDetectSchema: true,
});
```

You can also use the schema detector directly:
```typescript
import { detectSchemaFromHtml } from 'openscrape';

const { schema, confidence, suggestions } = detectSchemaFromHtml(htmlString);
```

## Custom Extraction Schema
```typescript
const schema = {
  title: '.article-title',
  author: '.author-name',
  publishDate: '.publish-date',
  content: '.article-body',
  custom: [
    {
      name: 'category',
      selector: '.category',
    },
    {
      name: 'views',
      selector: '.views',
      transform: (value: string) => parseInt(value, 10),
    },
  ],
};

const data = await scraper.scrape({
  url: 'https://example.com/article',
  extractionSchema: schema,
});
```

## Rate Limiting
```typescript
const scraper = new OpenScrape({
  maxRequestsPerSecond: 5,
  maxConcurrency: 3,
});
```

## Proxy support (rotating & residential)
Use a single proxy or a list for round-robin rotation. Supports auth (`http://user:pass@host:port`), SOCKS5 (`socks5://host:port`), and residential proxy lists.
- Constructor: set a default proxy for all scrapes (single URL or array).
- Per-scrape: override with `options.proxy` for that request.
- Retries: on 403, 429, or timeout, the next proxy in the list is tried automatically.
Formats:
- `http://host:port` or `https://host:port`
- `http://user:pass@host:port` (auth)
- `socks5://host:port` or `socks5://user:pass@host:port`
CLI:
```bash
# Single proxy
openscrape crawl https://example.com --proxy http://user:pass@proxy.example.com:8080 -o out.json

# Rotating list (comma-separated)
openscrape crawl https://example.com --proxy "http://p1:8080,http://p2:8080,socks5://p3:1080" -o out.json

# Batch with proxy list
openscrape batch urls.txt --proxy "http://user:pass@proxy.example.com:8080" --output-dir ./out
```

Programmatic:
```typescript
// Single proxy or rotating list at construction
const scraper = new OpenScrape({
  proxy: 'http://user:pass@proxy.example.com:8080',
  maxConcurrency: 3,
});

// Or pass an array for rotation
const rotatingScraper = new OpenScrape({
  proxy: ['http://p1:8080', 'socks5://p2:1080', 'http://user:pass@p3:8080'],
});

// Per-scrape override
const data = await scraper.scrape({
  url: 'https://example.com',
  proxy: 'socks5://localhost:1080',
});
```

Low-level: use `parseProxyString()`, `normalizeProxyInput()`, and `ProxyPool` from the package for custom rotation logic.
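As a sketch of what custom rotation might look like: the helper names come from the line above, but their exact signatures are not documented here, so the constructor argument and `next()` method below are assumptions to verify against the package's type definitions.

```typescript
import { OpenScrape, normalizeProxyInput, ProxyPool } from 'openscrape';

// Assumed API: ProxyPool wraps a normalized proxy list and hands out
// proxies in round-robin order via next(). Check the actual typings.
const pool = new ProxyPool(
  normalizeProxyInput('http://p1:8080,socks5://p2:1080'),
);

const scraper = new OpenScrape();
const data = await scraper.scrape({
  url: 'https://example.com',
  proxy: pool.next(), // pick the next proxy for this request
});
await scraper.close();
```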
## CLI Commands

### `crawl <URL>`
Scrape a single URL and save to file.
Options:
- `-o, --output <path>` - Output file path (default: `output.json`)
- `--no-render` - Disable JavaScript rendering
- `--format <format>` - Output format: `json`, `markdown`, `html`, `text`, `csv`, or `yaml` (default: `json`)
- `--wait-time <ms>` - Wait time after page load (default: `2000`)
- `--max-depth <number>` - Maximum pagination depth (default: `10`)
- `--next-selector <selector>` - CSS selector for next link
- `--timeout <ms>` - Request timeout (default: `30000`)
- `--user-agent <ua>` - Custom user agent string
- `--llm-extract` - Use local LLM (Ollama/LM Studio) to extract structured data
- `--llm-endpoint <url>` - LLM endpoint (e.g. `http://localhost:11434` for Ollama)
- `--llm-model <name>` - Model name for LLM extraction (default: `llama2`)
- `--auto-detect-schema` - Auto-detect extraction schema from the page
- `--proxy <url>` - Proxy URL or comma-separated list for rotation (`http://user:pass@host:port`, `socks5://host:port`)
Example:
```bash
openscrape crawl https://example.com/article \
  --output article.md \
  --format markdown \
  --max-depth 5
```

### `batch <file>`
Scrape multiple URLs from a file (one URL per line).
Options:
- `-o, --output-dir <path>` - Output directory (default: `./output`)
- `--no-render` - Disable JavaScript rendering
- `--format <format>` - Output format: `json`, `markdown`, `html`, `text`, `csv`, or `yaml` (default: `json`)
- `--wait-time <ms>` - Wait time after page load (default: `2000`)
- `--max-depth <number>` - Maximum pagination depth (default: `10`)
- `--timeout <ms>` - Request timeout (default: `30000`)
- `--max-concurrency <number>` - Maximum concurrent requests (default: `3`)
- `--llm-extract` - Use local LLM to extract structured data per URL
- `--llm-endpoint <url>` - LLM endpoint URL
- `--llm-model <name>` - Model name (default: `llama2`)
- `--auto-detect-schema` - Auto-detect extraction schema from each page
- `--proxy <url>` - Proxy URL or comma-separated list for rotation
Example:
```bash
openscrape batch urls.txt \
  --output-dir ./scraped \
  --format markdown \
  --max-concurrency 5
```

### `serve`
Start the REST API server.
Options:
- `-p, --port <number>` - Port number (default: `3000`)
- `--host <host>` - Host address (default: `0.0.0.0`)
Example:
```bash
openscrape serve --port 8080
```

## REST API Endpoints

### POST /crawl
Scrape a URL asynchronously.
Request:
```json
{
  "url": "https://example.com/article",
  "options": {
    "render": true,
    "format": "json",
    "maxDepth": 5
  }
}
```

Response:
```json
{
  "jobId": "uuid-here",
  "status": "pending",
  "url": "https://example.com/article"
}
```

### GET /status/:jobId
Get the status and result of a crawl job.
Response:
```json
{
  "id": "uuid-here",
  "status": "completed",
  "url": "https://example.com/article",
  "createdAt": "2024-01-01T00:00:00.000Z",
  "completedAt": "2024-01-01T00:00:05.000Z",
  "result": {
    "url": "https://example.com/article",
    "title": "Article Title",
    "content": "...",
    "markdown": "...",
    "timestamp": "2024-01-01T00:00:05.000Z"
  }
}
```

### GET /jobs
List all crawl jobs.
### GET /health
Health check endpoint.
### GET /about

Credits and repository info. Returns `{ name, version, by, repository }` (e.g. `by: John F. Gonzales`, `repository: https://github.com/RantsRoamer/OpenScrape`).
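Putting the endpoints above together, a client can submit a crawl and poll until the job settles. A minimal sketch using the global `fetch` available in Node.js 18+ (for push-style updates, use the WebSocket endpoint instead):

```typescript
// Submit a crawl job, then poll /status/:jobId until it completes or fails.
const base = 'http://localhost:3000';

const { jobId } = await (await fetch(`${base}/crawl`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com/article' }),
})).json();

let job: any;
do {
  await new Promise((r) => setTimeout(r, 1000)); // poll once per second
  job = await (await fetch(`${base}/status/${jobId}`)).json();
} while (job.status !== 'completed' && job.status !== 'failed');

console.log(job.status, job.result?.title);
```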
## Development

### Prerequisites
- Node.js 18+
- npm or yarn
### Setup

```bash
git clone https://github.com/yourusername/openscrape.git
cd openscrape
npm install
```

### Build
```bash
npm run build
```

### Test

```bash
npm test
```

### Lint

```bash
npm run lint
```

## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Inspired by Firecrawl API
- Built with Playwright
- Uses Turndown for HTML to Markdown conversion
