# web-scrapy
A powerful command-line web scraper with schema support for structured data extraction. Extract data from HTML using CSS selectors or comprehensive JSON schemas with built-in type conversion and error handling.
## Features
- 🔍 CSS Selector Extraction - Simple text/HTML extraction using CSS selectors
- 📋 Schema-Based Extraction - Complex data extraction using JSON schemas
- 🎯 Multiple Data Types - Support for text, attributes, HTML, numbers, booleans, and arrays
- 📄 Multiple Output Formats - JSON, pretty JSON, and plain text output
- 🔄 Batch Processing - Extract multiple records from lists and tables
- 🛡️ Error Handling - Robust error handling with detailed feedback
- ⚡ Fast & Lightweight - Uses node-html-parser for efficient HTML parsing
- 🎨 Google Style TypeScript - Clean, maintainable codebase
## Installation
```bash
npm install -g web-scrapy
```

Or use it directly with npx:
```bash
npx web-scrapy --help
```

## Quick Start
### Simple CSS Selector Extraction
```bash
# Extract page title
curl https://example.com | web-scrapy -s "title" -t

# Extract all links
echo '<a href="/page1">Link 1</a><a href="/page2">Link 2</a>' | web-scrapy -s "a" --pretty

# Extract specific content
cat article.html | web-scrapy -s ".content p" -t --format text
```

### Schema-Based Extraction
```bash
# Inline schema for article extraction
curl https://news.site.com | web-scrapy --schema '{
  "fields": {
    "title": {"selector": "h1", "type": "text"},
    "author": {"selector": ".author", "type": "text"},
    "date": {"selector": "time", "type": "attribute", "attribute": "datetime"}
  }
}' --pretty

# Use schema file for complex extraction
curl https://shop.com/product/123 | web-scrapy -f product-schema.json -o results.json
```

## Usage
```text
web-scrapy - Advanced command line web scraper with schema support

Usage:
  echo "<html>...</html>" | web-scrapy [options]
  cat file.html | web-scrapy [options]
  curl https://example.com | web-scrapy [options]

Input Options (choose one):
  -s, --selector <selector>    Simple CSS selector extraction
  --schema <json>              Inline JSON schema for complex extraction
  -f, --schema-file <path>     JSON schema file for complex extraction

Output Options:
  -o, --output <path>          Save output to file (default: stdout)
  --format <format>            Output format: json, pretty, text (default: json)
  -p, --pretty                 Pretty-print JSON output

Extraction Options:
  -m, --mode <mode>            Extraction mode: single, multiple (default: single)
  -c, --container <selector>   Container selector for multiple mode
  -t, --text                   Extract text content only (selector mode)
  -l, --limit <number>         Limit number of results (multiple mode)
  --ignore-errors              Continue extraction despite errors

Utility Options:
  -h, --help                   Show this help message
  -e, --examples               Show example schemas and usage
  -v, --version                Show version information
```

## Schema Format
Schemas are JSON objects that define how to extract structured data from HTML:
```json
{
  "name": "Schema name (optional)",
  "description": "Schema description (optional)",
  "fields": {
    "fieldName": {
      "selector": "CSS selector",
      "type": "text|attribute|html|number|boolean|array",
      "required": true|false,
      "default": "default value"
    }
  },
  "config": {
    "ignoreErrors": true|false,
    "limit": number
  }
}
```
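The optional `config` block applies schema-wide settings. Below is a minimal sketch of how it might be used; the file name, selectors, and HTML structure are hypothetical, and only the documented `ignoreErrors` and `limit` keys are assumed:

```bash
# Hypothetical listing page: keep at most 5 records, skip failing fields
cat listing.html | web-scrapy --schema '{
  "fields": {
    "title": {"selector": "h2", "type": "text"},
    "price": {"selector": ".price", "type": "number"}
  },
  "config": {"ignoreErrors": true, "limit": 5}
}' -m multiple -c ".item" --pretty
```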
## Field Types

- `text` - Extract text content (supports the `trim` option)
- `attribute` - Extract an attribute value (requires the `attribute` property)
- `html` - Extract HTML content (supports the `inner` option for innerHTML)
- `number` - Parse as a number (supports the `integer` option)
- `boolean` - Convert to true/false (supports the `trueValue` option)
- `array` - Extract multiple items (requires the `itemSchema` property)
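Several of these per-field options can be combined in a single schema. A minimal sketch, assuming the boolean-valued option style used elsewhere in this README (the input HTML is invented for illustration):

```bash
echo '<article>
  <h1>  Hello World  </h1>
  <a class="src" href="/story/42">source</a>
  <span class="views">1204</span>
  <span class="flag">yes</span>
</article>' | web-scrapy --schema '{
  "fields": {
    "title": {"selector": "h1", "type": "text", "trim": true},
    "link": {"selector": ".src", "type": "attribute", "attribute": "href"},
    "views": {"selector": ".views", "type": "number", "integer": true},
    "flagged": {"selector": ".flag", "type": "boolean", "trueValue": "yes"}
  }
}' --pretty
```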
## Examples
### 1. Article Extraction
Schema file: `article-schema.json`
```json
{
  "name": "News Article",
  "fields": {
    "title": {
      "selector": "h1, .headline, .title",
      "type": "text",
      "trim": true
    },
    "author": {
      "selector": ".author, [rel='author']",
      "type": "text",
      "required": false
    },
    "publishDate": {
      "selector": "time",
      "type": "attribute",
      "attribute": "datetime"
    },
    "content": {
      "selector": ".content p",
      "type": "array",
      "itemSchema": {
        "selector": "",
        "type": "text"
      }
    },
    "tags": {
      "selector": ".tag",
      "type": "array",
      "itemSchema": {
        "selector": "",
        "type": "text"
      }
    }
  }
}
```

Usage:
```bash
curl https://news.com/article | web-scrapy -f article-schema.json --pretty
```

### 2. E-commerce Product
Inline schema:
```bash
echo '<div class="product">
  <h1>Awesome Product</h1>
  <span class="price">$29.99</span>
  <span class="original-price">$39.99</span>
  <div class="in-stock">In Stock</div>
</div>' | web-scrapy --schema '{
  "fields": {
    "name": {"selector": "h1", "type": "text"},
    "price": {"selector": ".price", "type": "number"},
    "originalPrice": {"selector": ".original-price", "type": "number"},
    "inStock": {"selector": ".in-stock", "type": "boolean", "trueValue": "In Stock"}
  }
}' --pretty
```

Output:
```json
{
  "data": {
    "name": "Awesome Product",
    "price": 29.99,
    "originalPrice": 39.99,
    "inStock": true
  },
  "errors": [],
  "extractedAt": "2024-01-15T10:30:00.000Z",
  "schema": "Unnamed schema"
}
```

### 3. Multiple Records Extraction
Extract multiple products from a catalog page:
```bash
curl https://shop.com/catalog | web-scrapy --schema '{
  "fields": {
    "name": {"selector": "h3", "type": "text"},
    "price": {"selector": ".price", "type": "number"},
    "rating": {"selector": ".rating", "type": "number"}
  }
}' -m multiple -c ".product-item" -l 10 --pretty
```

### 4. Social Media Posts
```bash
cat social-feed.html | web-scrapy --schema '{
  "fields": {
    "username": {"selector": ".username", "type": "text"},
    "content": {"selector": ".post-text", "type": "text"},
    "likes": {"selector": ".likes", "type": "number", "default": 0},
    "hashtags": {
      "selector": ".hashtag",
      "type": "array",
      "itemSchema": {"selector": "", "type": "text"}
    }
  }
}' -m multiple -c ".post" --ignore-errors -o posts.json
```

## Advanced Features
### Error Handling
The scraper provides detailed error reporting:
```bash
# Ignore errors and continue extraction
web-scrapy -s ".missing-selector" --ignore-errors

# Get detailed error information in JSON output
web-scrapy -f schema.json --pretty  # Errors included in output
```

### Custom Default Values
```json
{
  "fields": {
    "price": {
      "selector": ".price",
      "type": "number",
      "default": 0,
      "required": false
    },
    "description": {
      "selector": ".desc",
      "type": "text",
      "default": "No description available"
    }
  }
}
```
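When a selector matches nothing, the field falls back to its `default` instead of failing. A quick sketch with the schema above (input HTML invented for illustration):

```bash
# No .desc element in the input, so description falls back to its default
echo '<div><span class="price">$12.50</span></div>' | web-scrapy --schema '{
  "fields": {
    "price": {"selector": ".price", "type": "number", "default": 0},
    "description": {"selector": ".desc", "type": "text", "default": "No description available"}
  }
}' --pretty
```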
### Complex Nested Arrays

```json
{
  "fields": {
    "specifications": {
      "selector": ".spec-row",
      "type": "array",
      "itemSchema": {
        "selector": "",
        "type": "object",
        "fields": {
          "name": {"selector": ".spec-name", "type": "text"},
          "value": {"selector": ".spec-value", "type": "text"}
        }
      }
    }
  }
}
```
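Each element matched by `.spec-row` is extracted as an object with its own nested `fields`. A sketch of how this schema might be driven, assuming it is saved as `specs-schema.json` (a hypothetical file name; the sample HTML is invented for illustration):

```bash
echo '<table>
  <tr class="spec-row"><td class="spec-name">Weight</td><td class="spec-value">1.2 kg</td></tr>
  <tr class="spec-row"><td class="spec-name">Color</td><td class="spec-value">Black</td></tr>
</table>' | web-scrapy -f specs-schema.json --pretty
```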
## Output Formats

### JSON (default)
web-scrapy -s "title" --format json
# {"content": "Page Title"}Pretty JSON
web-scrapy -s "title" --format pretty
# {
# "content": "Page Title"
# }Plain Text
web-scrapy -s "title" --format text
# Page TitleIntegration Examples
### With jq for JSON processing
```bash
# Extract and filter data
curl https://api.example.com | web-scrapy -f schema.json | jq '.data.title'

# Count extracted items
curl https://news.com | web-scrapy -f news.json -m multiple -c "article" | jq '.data | length'
```

### With shell scripts
```bash
#!/usr/bin/env bash
# Monitor product prices
curl -s "https://shop.com/product/123" | \
  web-scrapy --schema '{"fields":{"price":{"selector":".price","type":"number"}}}' | \
  jq -r '.data.price' > current-price.txt
```

### With Node.js
```javascript
import { spawn } from 'child_process';
import { readFileSync } from 'fs';

const html = readFileSync('page.html', 'utf8');
const schema = JSON.stringify({
  fields: {
    title: { selector: 'h1', type: 'text' },
    price: { selector: '.price', type: 'number' }
  }
});

const scraper = spawn('web-scrapy', ['--schema', schema, '--format', 'json']);
scraper.stdin.write(html);
scraper.stdin.end();

// Accumulate stdout before parsing: large results can arrive in several
// 'data' chunks, and JSON.parse on a partial chunk would throw.
let output = '';
scraper.stdout.on('data', (chunk) => {
  output += chunk;
});
scraper.stdout.on('end', () => {
  const result = JSON.parse(output);
  console.log('Extracted:', result.data);
});
```

## Error Codes
- `0` - Success
- `1` - Argument parsing error
- `2` - Input/output error
- `3` - Schema validation error
- `4` - Extraction error
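These exit codes make the tool straightforward to script against. A small sketch, assuming the codes behave as listed above:

```bash
curl -s https://example.com | web-scrapy -f schema.json -o out.json
status=$?
case $status in
  0) echo "extraction succeeded" ;;
  3) echo "schema failed validation" >&2 ;;
  *) echo "scrape failed with code $status" >&2 ;;
esac
```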
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
MIT License - see LICENSE file for details.
## Related Projects
- `node-html-parser` - Fast HTML parser
- `cheerio` - jQuery-like server-side HTML manipulation
- `puppeteer` - Headless browser automation
For more examples and detailed documentation, run:
```bash
web-scrapy --examples
```