easy-scrape
v1.0.4
Scrape websites more easily than before
Easy Scrape
A powerful and flexible HTML scraping library built on top of Cheerio. Easy Scrape provides a declarative way to extract data from HTML with support for nested structures, transformations, advanced filtering, and much more.
Features
✨ Simple & Declarative - Define your scraping schema in plain JavaScript objects
🎯 Flexible Selectors - Use CSS selectors to target any element
🔄 Data Transformation - Built-in conversion and transformation pipeline
📋 List Handling - Easy extraction of arrays and nested lists
🎨 Multiple Extraction Modes - Text, HTML, attributes, or custom functions
🔍 Advanced Filtering - Filter elements before extraction
🧭 Navigation - Parent, siblings, and ancestor traversal
🌐 URL Resolution - Automatically resolve relative URLs
📊 Table Parsing - Built-in support for HTML tables
✅ Validation - Validate extracted data with custom functions
🎭 Conditional Extraction - Extract based on conditions
🛠️ Helper Functions - Common transformations included (toNumber, toDate, etc.)
📦 Presets - Ready-to-use patterns for common tasks
🛡️ Error Handling - Strict mode for validation or graceful fallbacks
📘 TypeScript Support - Full TypeScript definitions included
Installation
npm install easy-scrape
Quick Start
import { easyScrape } from 'easy-scrape';
const html = `
<div class="product">
<h2>Laptop</h2>
<span class="price">$999</span>
</div>
`;
const result = easyScrape(html, {
title: 'h2',
price: '.price'
});
console.log(result);
// { title: 'Laptop', price: '$999' }
API Reference
easyScrape(input, schema, options?)
Parameters:
- input - HTML string or Cheerio instance
- schema - Scraping schema defining what to extract
- options (optional) - Parsing options
Returns: Object with extracted data
Parsing Options
baseUrl (string)
Base URL for resolving relative URLs when resolveUrl is used.
const result = easyScrape(html, schema, {
baseUrl: 'https://example.com'
});
xmlMode (boolean)
Parse as XML instead of HTML.
const xml = `<?xml version="1.0"?><root><item>Value</item></root>`;
const result = easyScrape(xml, {
value: 'item'
}, {
xmlMode: true
});
// { value: 'Value' }
decodeEntities (boolean)
Decode HTML entities. Default: true.
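A quick sketch (with decodeEntities: false, the entity should be left encoded in the output):
const html = `<div class="note">Fish &amp; Chips</div>`;
const result = easyScrape(html, {
note: '.note'
}, {
decodeEntities: true
});
// expected: { note: 'Fish & Chips' }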
cheerioOptions (object)
Additional Cheerio load options.
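These are passed through to cheerio.load. A sketch using lowerCaseTags, a standard htmlparser2/Cheerio load option (other options depend on your Cheerio version):
const result = easyScrape(html, schema, {
cheerioOptions: {
lowerCaseTags: true
}
});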
Schema Options
Basic Options
selector (string)
CSS selector to find the element(s).
const result = easyScrape(html, {
title: {
selector: '.title'
}
});
Or use shorthand:
const result = easyScrape(html, {
title: '.title' // String shorthand
});
attr (string)
Extract a specific attribute value.
const html = `<a href="https://example.com" class="link">Click</a>`;
const result = easyScrape(html, {
url: {
selector: '.link',
attr: 'href'
}
});
// { url: 'https://example.com' }
attrs (string[])
Extract multiple attributes as an object.
const html = `<a href="/page" class="nav-link" title="Go to page">Link</a>`;
const result = easyScrape(html, {
linkData: {
selector: '.nav-link',
attrs: ['href', 'class', 'title']
}
});
// { linkData: { href: '/page', class: 'nav-link', title: 'Go to page' } }
html (boolean)
Extract inner HTML instead of text.
const html = `<div class="box"><strong>Bold</strong> text</div>`;
const result = easyScrape(html, {
content: {
selector: '.box',
html: true
}
});
// { content: '<strong>Bold</strong> text' }
outerHtml (boolean)
Extract outer HTML including the element itself.
const html = `<div class="container"><p>Text</p></div>`;
const result = easyScrape(html, {
fullHtml: {
selector: 'p',
outerHtml: true
}
});
// { fullHtml: '<p>Text</p>' }
textMode (string)
Control how text is extracted. Options: 'text' (default), 'ownText', 'deepText'.
const html = `<div class="wrapper">Direct text<span>Nested text</span>More direct</div>`;
const result = easyScrape(html, {
allText: {
selector: '.wrapper',
textMode: 'text' // All text including descendants
},
ownText: {
selector: '.wrapper',
textMode: 'ownText' // Only direct text nodes
}
});
// { allText: 'Direct textNested textMore direct', ownText: 'Direct textMore direct' }
separator (string)
Join multiple text nodes with a separator.
const html = `<div class="list">Apple<br>Banana<br>Cherry</div>`;
const result = easyScrape(html, {
joined: {
selector: '.list',
separator: ', '
}
});
// { joined: 'Apple, Banana, Cherry' }
trimValue (boolean)
Whether to trim whitespace from extracted values. Default: true.
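A sketch comparing the default behavior with trimValue: false:
const html = `<span class="code">  AB-12  </span>`;
const result = easyScrape(html, {
trimmed: '.code', // default: whitespace trimmed
raw: {
selector: '.code',
trimValue: false
}
});
// expected: { trimmed: 'AB-12', raw: '  AB-12  ' }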
resolveUrl (boolean)
Resolve relative URLs to absolute using baseUrl option.
const html = `<a href="/about">About</a>`;
const result = easyScrape(html, {
url: {
selector: 'a',
attr: 'href',
resolveUrl: true
}
}, {
baseUrl: 'https://example.com'
});
// { url: 'https://example.com/about' }
Data Transformation
convert (function)
Transform the extracted value.
const html = `<span class="price">$99.99</span>`;
const result = easyScrape(html, {
price: {
selector: '.price',
convert: (value) => parseFloat(value.replace('$', ''))
}
});
// { price: 99.99 }
transform (function | function[])
Apply transformation pipeline after conversion.
const html = `<span class="amount"> 100 </span>`;
const result = easyScrape(html, {
amount: {
selector: '.amount',
transform: [
(val) => val.trim(),
(val) => parseInt(val),
(val) => val * 2
]
}
});
// { amount: 200 }
how (string | function)
Custom extraction method.
const html = `<div class="item" data-id="123">Item</div>`;
const result = easyScrape(html, {
itemId: {
selector: '.item',
how: ($el) => $el.attr('data-id')
}
});
// { itemId: '123' }
Element Selection & Navigation
eq (number)
Select a specific element by index (0-based).
const html = `
<ul>
<li>First</li>
<li>Second</li>
<li>Third</li>
</ul>
`;
const result = easyScrape(html, {
secondItem: {
selector: 'li',
eq: 1
}
});
// { secondItem: 'Second' }
texteq (number)
Select a specific text node by index.
const html = `<div>Text1<span>Span</span>Text2</div>`;
const result = easyScrape(html, {
firstText: { selector: 'div', texteq: 0 },
secondText: { selector: 'div', texteq: 1 }
});
// { firstText: 'Text1', secondText: 'Text2' }
closest (string)
Find the closest ancestor matching the selector.
const html = `
<div class="container">
<div class="item">
<span class="text">Click</span>
</div>
</div>
`;
const result = easyScrape(html, {
containerClass: {
selector: '.text',
closest: '.container',
how: ($el) => $el.attr('class')
}
});
// { containerClass: 'container' }
parent (number | string)
Navigate to parent element(s).
- Number: Move up N levels
- String: Find parent matching selector
const html = `
<div class="grandparent">
<div class="parent">
<span class="child">Text</span>
</div>
</div>
`;
const result = easyScrape(html, {
parentText: {
selector: '.child',
parent: 1 // Go up 1 level
},
grandparentText: {
selector: '.child',
parent: 2 // Go up 2 levels
}
});
parents (string)
Find ancestor element matching selector.
const result = easyScrape(html, {
outerDiv: {
selector: '.child',
parents: '.grandparent'
}
});
siblings (string)
Navigate to sibling elements. Options: 'next', 'prev', 'nextAll', 'prevAll'.
const html = `
<div>
<span class="first">First</span>
<span class="target">Target</span>
<span class="last">Last</span>
</div>
`;
const result = easyScrape(html, {
nextSibling: {
selector: '.target',
siblings: 'next'
},
prevSibling: {
selector: '.target',
siblings: 'prev'
}
});
// { nextSibling: 'Last', prevSibling: 'First' }
siblingSelector (string)
Filter siblings by selector.
const result = easyScrape(html, {
nextItems: {
selector: '.marker',
siblings: 'nextAll',
siblingSelector: '.item',
multiple: true
}
});
Lists and Arrays
listItem (string)
Extract an array of items with nested data.
const html = `
<ul>
<li class="item">
<span class="name">Item 1</span>
<span class="value">10</span>
</li>
<li class="item">
<span class="name">Item 2</span>
<span class="value">20</span>
</li>
</ul>
`;
const result = easyScrape(html, {
items: {
listItem: '.item',
data: {
name: '.name',
value: {
selector: '.value',
convert: (v) => parseInt(v)
}
}
}
});
// { items: [{ name: 'Item 1', value: 10 }, { name: 'Item 2', value: 20 }] }
multiple (boolean)
Extract all matching elements as an array.
const html = `
<span class="tag">JS</span>
<span class="tag">CSS</span>
<span class="tag">HTML</span>
`;
const result = easyScrape(html, {
tags: {
selector: '.tag',
multiple: true
}
});
// { tags: ['JS', 'CSS', 'HTML'] }
includeIndex (boolean)
Add _index property to list items.
const result = easyScrape(html, {
fruits: {
listItem: 'li',
includeIndex: true
}
});
// { fruits: [{ text: 'Apple', _index: 0 }, { text: 'Banana', _index: 1 }, ...] }
Advanced Features
map (function)
Map over elements with custom transformation.
const html = `
<div class="product">Product 1</div>
<div class="product">Product 2</div>
<div class="product">Product 3</div>
`;
const result = easyScrape(html, {
products: {
selector: '.product',
map: ($el, $, index) => ({
id: index + 1,
name: $el.text(),
upper: $el.text().toUpperCase()
})
}
});
// { products: [{ id: 1, name: 'Product 1', upper: 'PRODUCT 1' }, ...] }
Filtering within map: Return null or undefined to exclude items.
const result = easyScrape(html, {
expensive: {
selector: '.item',
map: ($el) => {
const price = parseInt($el.attr('data-price'));
if (price < 50) return null; // Filter out
return { name: $el.text(), price };
}
}
});
filter (function)
Filter elements before extraction.
const html = `
<div class="item active">Active 1</div>
<div class="item">Inactive</div>
<div class="item active">Active 2</div>
`;
const result = easyScrape(html, {
activeItems: {
selector: '.item',
filter: ($el) => $el.hasClass('active'),
multiple: true
}
});
// { activeItems: ['Active 1', 'Active 2'] }
regex (RegExp) & regexGroup (number)
Extract data using regular expressions.
const html = `<div class="price">Price: $99.99 USD</div>`;
const result = easyScrape(html, {
amount: {
selector: '.price',
regex: /\$(\d+\.\d+)/,
regexGroup: 1, // Capture group (default: 0)
convert: (val) => parseFloat(val)
}
});
// { amount: 99.99 }
Conditional Extraction
if (function)
Only extract if condition function returns true.
const html = `<div class="product" data-available="true"><span class="price">$50</span></div>`;
const result = easyScrape(html, {
price: {
selector: '.price',
if: ($) => $('.product').attr('data-available') === 'true'
}
});
// { price: '$50' }
ifExists (string)
Only extract if selector exists in context.
const html = `
<div class="container">
<span class="badge">New</span>
<span class="price">$99</span>
</div>
`;
const result = easyScrape(html, {
price: {
selector: '.price',
ifExists: '.badge' // Only extract if badge exists
}
});
// { price: '$99' }
ifNotExists (string)
Only extract if selector does NOT exist.
const result = easyScrape(html, {
regularPrice: {
selector: '.regular-price',
ifNotExists: '.sale-price'
}
});
Array Operations
Array operations are applied in this order: unique → slice → limit → flatten
unique (boolean)
Remove duplicate values from array.
const html = `
<div class="tag">JavaScript</div>
<div class="tag">Python</div>
<div class="tag">JavaScript</div>
`;
const result = easyScrape(html, {
uniqueTags: {
selector: '.tag',
multiple: true,
unique: true
}
});
// { uniqueTags: ['JavaScript', 'Python'] }
slice (array)
Array slice operation [start, end].
const result = easyScrape(html, {
middleItems: {
selector: '.item',
multiple: true,
slice: [1, 4] // Get items at index 1, 2, 3
}
});
limit (number)
Limit number of items in array.
const result = easyScrape(html, {
topItems: {
selector: '.item',
multiple: true,
limit: 3 // Only get first 3 items
}
});
flatten (boolean | number)
Flatten nested arrays.
const result = easyScrape(html, {
allTags: {
listItem: '.category',
data: {
tags: {
selector: '.tag',
multiple: true
}
},
flatten: true // Flatten one level
}
});
Combining array operations:
const result = easyScrape(html, {
topUnique: {
selector: '.item',
multiple: true,
unique: true, // Remove duplicates first
limit: 3 // Then take first 3
}
});
Validation
validate (function)
Validate extracted value with custom function.
const html = `<div class="email">[email protected]</div>`;
const result = easyScrape(html, {
email: {
selector: '.email',
validate: (value) => value.includes('@')
}
});
// { email: '[email protected]' }
required (boolean)
Field is required - throws error if missing or empty.
try {
const result = easyScrape(html, {
title: {
selector: '.missing-title',
required: true // Will throw error
}
});
} catch (error) {
console.error('Required field missing:', error.message);
}
Table Parsing
Extract data from HTML tables.
const html = `
<table class="data-table">
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
</tr>
<tr>
<td>John</td>
<td>30</td>
<td>NYC</td>
</tr>
<tr>
<td>Jane</td>
<td>25</td>
<td>LA</td>
</tr>
</table>
`;
const result = easyScrape(html, {
users: {
selector: '.data-table',
table: {
headers: true,
selector: 'tr'
}
}
});
/*
{
users: [
{ "Name": "John", "Age": "30", "City": "NYC" },
{ "Name": "Jane", "Age": "25", "City": "LA" }
]
}
*/
Without headers:
const result = easyScrape(html, {
data: {
selector: 'table',
table: {
headers: false
}
}
});
// { data: [['Item 1', 'Value 1'], ['Item 2', 'Value 2']] }
Custom table conversion:
const result = easyScrape(html, {
specs: {
selector: '.spec-table',
table: {
headers: false
},
convert: (rows) => {
const specs = {};
rows.forEach(row => {
if (row.length >= 2) {
specs[row[0]] = row[1];
}
});
return specs;
}
}
});
// { specs: { "CPU": "Intel i7", "RAM": "16GB" } }
Error Handling
default (any)
Default value when element is not found.
const result = easyScrape(html, {
missing: {
selector: '.not-exist',
default: 'Not Found'
}
});
strict (boolean)
Throw errors instead of returning null. Default: false.
try {
const result = easyScrape(html, {
required: {
selector: '.not-exist',
strict: true // Will throw error
}
});
} catch (error) {
console.error('Missing required field:', error.message);
}
Nested Data
Extract nested objects using the data property.
const html = `
<div class="card">
<h2 class="title">Product</h2>
<div class="meta">
<span class="price">$50</span>
<span class="stock">In Stock</span>
</div>
</div>
`;
const result = easyScrape(html, {
product: {
selector: '.card',
data: {
title: '.title',
price: '.price',
stock: '.stock'
}
}
});
// { product: { title: 'Product', price: '$50', stock: 'In Stock' } }
Helper Functions
Easy Scrape includes common transformation helpers:
import { easyScrape, helpers } from 'easy-scrape';
const result = easyScrape(html, {
price: {
selector: '.price',
convert: helpers.toNumber // Parse number from "$1,234.56"
},
isAvailable: {
selector: '.status',
convert: helpers.toBoolean // Convert "yes" to true
},
publishDate: {
selector: 'time',
attr: 'datetime',
convert: helpers.toDate // Convert to Date object
}
});
Available Helpers
- helpers.toNumber(val) - Parse a number from a string (removes non-numeric characters)
- helpers.toInt(val) - Parse an integer from a string
- helpers.toBoolean(val) - Convert to a boolean (accepts: true, yes, 1, on)
- helpers.toDate(val) - Convert to a Date object
- helpers.extractUrl(val) - Extract the first URL from text
- helpers.extractEmail(val) - Extract the first email address from text
- helpers.stripHtml(html) - Remove HTML tags
- helpers.parseJson(val) - Parse a JSON string
- helpers.capitalize(val) - Capitalize the first letter
- helpers.slug(val) - Convert to a URL-friendly slug
Example:
const html = `
<div class="data">
<span class="price">$1,234.56</span>
<span class="status">yes</span>
<span class="email">Contact: [email protected]</span>
</div>
`;
const result = easyScrape(html, {
price: { selector: '.price', convert: helpers.toNumber },
isActive: { selector: '.status', convert: helpers.toBoolean },
email: { selector: '.email', convert: helpers.extractEmail }
});
// { price: 1234.56, isActive: true, email: '[email protected]' }
Presets
Ready-to-use patterns for common scraping tasks:
import { easyScrape, presets } from 'easy-scrape';
const result = easyScrape(html, {
logo: presets.image('.logo'),
aboutLink: presets.link('.nav a'),
description: presets.meta('description'),
ogImage: presets.ogMeta('image'),
structuredData: presets.jsonLd()
});
Available Presets
- presets.link(selector) - Extract href from a link
- presets.image(selector) - Extract src and alt from an image
- presets.meta(name) - Extract meta tag content by name
- presets.ogMeta(property) - Extract an Open Graph meta tag
- presets.twitterMeta(name) - Extract a Twitter Card meta tag
- presets.jsonLd(selector) - Extract and parse JSON-LD structured data
Example:
const html = `
<head>
<meta name="description" content="Page description">
<meta property="og:title" content="Page Title">
<script type="application/ld+json">
{"@type": "Product", "name": "Widget"}
</script>
</head>
<body>
<img src="/logo.png" alt="Company Logo">
<a href="/about">About</a>
</body>
`;
const result = easyScrape(html, {
logo: presets.image('img'),
aboutLink: presets.link('a'),
description: presets.meta('description'),
ogTitle: presets.ogMeta('title'),
productData: presets.jsonLd()
});
/*
{
logo: { src: '/logo.png', alt: 'Company Logo' },
aboutLink: '/about',
description: 'Page description',
ogTitle: 'Page Title',
productData: { '@type': 'Product', name: 'Widget' }
}
*/
Complex Examples
E-commerce Product Scraping
const result = easyScrape(html, {
products: {
listItem: '.product',
data: {
id: {
selector: '',
how: ($el) => $el.attr('data-id'),
convert: helpers.toInt
},
title: '.title',
price: {
selector: '.price',
convert: helpers.toNumber
},
originalPrice: {
selector: '.original-price',
convert: helpers.toNumber,
default: null
},
discount: {
selector: '.discount',
regex: /(\d+)%/,
regexGroup: 1,
convert: helpers.toInt,
ifExists: '.discount'
},
rating: {
selector: '.rating',
attr: 'data-rating',
convert: helpers.toNumber
},
inStock: {
selector: '.stock-status',
convert: helpers.toBoolean
},
images: {
selector: '.gallery img',
multiple: true,
attr: 'src',
resolveUrl: true
},
features: {
selector: '.features li',
multiple: true
}
}
}
}, {
baseUrl: 'https://example.com'
});
Blog Article Extraction
const result = easyScrape(html, {
article: {
selector: 'article',
data: {
title: 'h1',
author: {
selector: '.author',
regex: /By (.+)/,
regexGroup: 1
},
publishDate: {
selector: 'time',
attr: 'datetime',
convert: helpers.toDate
},
readTime: {
selector: '.read-time',
regex: /(\d+)/,
regexGroup: 1,
convert: helpers.toInt
},
tags: {
selector: '.tag',
multiple: true
},
content: {
selector: '.article-body',
html: true
},
headings: {
selector: 'h2, h3',
multiple: true
},
relatedPosts: {
selector: '.related a',
map: ($el) => ({
title: $el.text(),
url: $el.attr('href')
}),
limit: 5
}
}
}
});
Complete E-commerce Page
const result = easyScrape(html, {
// Meta data
meta: {
selector: 'head',
data: {
description: presets.meta('description'),
ogImage: {
...presets.ogMeta('image'),
resolveUrl: true
},
structuredData: presets.jsonLd()
}
},
// Navigation
breadcrumb: {
selector: '.breadcrumb a',
map: ($el) => ({
text: $el.text(),
url: $el.attr('href')
})
},
// Product details
title: '.product-title',
images: {
selector: '.gallery img',
multiple: true,
attr: 'src',
resolveUrl: true
},
pricing: {
selector: '.pricing',
data: {
current: {
selector: '.sale-price',
convert: helpers.toNumber,
required: true
},
original: {
selector: '.original-price',
convert: helpers.toNumber,
ifExists: '.sale-price'
},
savings: {
selector: '.discount',
regex: /\$(\d+)/,
regexGroup: 1,
convert: helpers.toNumber
}
}
},
availability: {
selector: '.stock-info',
data: {
inStock: {
selector: '',
how: ($el) => $el.attr('data-available'),
convert: helpers.toBoolean
},
quantity: {
selector: '.quantity',
regex: /(\d+)/,
regexGroup: 1,
convert: helpers.toInt
}
}
},
specifications: {
selector: '.spec-table',
table: {
headers: false
},
convert: (rows) => {
const specs = {};
rows.forEach(([key, value]) => {
specs[key] = value;
});
return specs;
}
},
reviews: {
listItem: '.review',
data: {
author: '.author',
rating: {
selector: '.rating',
attr: 'data-rating',
convert: helpers.toInt
},
comment: '.comment',
date: {
selector: 'time',
attr: 'datetime',
convert: helpers.toDate
},
helpful: {
selector: '.helpful-count',
regex: /(\d+)/,
regexGroup: 1,
convert: helpers.toInt,
default: 0
}
},
// Only show reviews with 4+ stars
filter: ($el) => parseInt($el.find('.rating').attr('data-rating')) >= 4,
limit: 10
}
}, {
baseUrl: 'https://shop.example.com'
});
Social Media Profile Scraping
const result = easyScrape(html, {
profile: {
selector: '.profile-card',
data: {
name: '.profile-name',
username: {
selector: '.username',
regex: /@(.+)/,
regexGroup: 1
},
bio: '.bio',
avatar: {
selector: '.avatar',
attr: 'src',
resolveUrl: true
},
stats: {
selector: '.stats',
data: {
followers: {
selector: '.followers-count',
convert: helpers.toNumber
},
following: {
selector: '.following-count',
convert: helpers.toNumber
},
posts: {
selector: '.posts-count',
convert: helpers.toNumber
}
}
},
verified: {
selector: '.verified-badge',
how: ($el) => $el.length > 0,
default: false
},
links: {
selector: '.profile-links a',
map: ($el) => ({
text: $el.text(),
url: $el.attr('href')
})
}
}
},
posts: {
listItem: '.post',
data: {
id: {
selector: '',
how: ($el) => $el.attr('data-post-id')
},
content: '.post-content',
timestamp: {
selector: 'time',
attr: 'datetime',
convert: helpers.toDate
},
likes: {
selector: '.likes-count',
convert: helpers.toNumber
},
comments: {
selector: '.comments-count',
convert: helpers.toNumber
},
media: {
selector: '.post-media img',
multiple: true,
attr: 'src'
},
hashtags: {
selector: '.hashtag',
multiple: true,
unique: true
}
},
limit: 20
}
});
News Article Aggregation
const result = easyScrape(html, {
articles: {
listItem: 'article.news-item',
data: {
headline: 'h2',
summary: '.summary',
category: {
selector: '.category',
convert: (val) => val.trim().toUpperCase()
},
author: {
selector: '.author',
ifExists: '.author'
},
publishDate: {
selector: 'time',
attr: 'datetime',
convert: helpers.toDate
},
url: {
selector: 'a',
attr: 'href',
resolveUrl: true
},
thumbnail: {
selector: 'img',
attr: 'src',
resolveUrl: true
},
readTime: {
selector: '.read-time',
regex: /(\d+)/,
regexGroup: 1,
convert: helpers.toInt,
default: null
},
isPremium: {
selector: '.premium-badge',
how: ($el) => $el.length > 0,
default: false
}
},
// Filter out old articles (older than 7 days)
filter: ($el) => {
const dateStr = $el.find('time').attr('datetime');
const date = new Date(dateStr);
const daysDiff = (Date.now() - date.getTime()) / (1000 * 60 * 60 * 24);
return daysDiff <= 7;
}
}
});
Restaurant Menu Scraping
const result = easyScrape(html, {
restaurant: {
selector: '.restaurant-info',
data: {
name: 'h1',
cuisine: '.cuisine-type',
rating: {
selector: '.rating',
attr: 'data-rating',
convert: helpers.toNumber
},
priceRange: '.price-range',
address: '.address',
phone: {
selector: '.phone',
convert: (val) => val.replace(/\D/g, '')
}
}
},
menu: {
listItem: '.menu-category',
data: {
category: '.category-name',
items: {
listItem: '.menu-item',
data: {
name: '.item-name',
description: '.item-description',
price: {
selector: '.price',
convert: helpers.toNumber
},
calories: {
selector: '.calories',
regex: /(\d+)/,
regexGroup: 1,
convert: helpers.toInt,
default: null
},
isVegetarian: {
selector: '.veg-icon',
how: ($el) => $el.length > 0,
default: false
},
isSpicy: {
selector: '.spicy-icon',
how: ($el) => $el.length > 0,
default: false
},
allergens: {
selector: '.allergen',
multiple: true,
default: []
}
}
}
}
}
});
Use Cases
Scraping E-commerce Sites
const productData = easyScrape(html, {
products: {
listItem: '.product-card',
data: {
name: '.product-name',
price: {
selector: '.price',
convert: helpers.toNumber
},
rating: {
selector: '.rating',
attr: 'data-rating',
convert: helpers.toNumber
},
inStock: {
selector: '.stock-status',
convert: helpers.toBoolean
}
}
}
});
Extracting Article Metadata
const article = easyScrape(html, {
title: 'h1',
author: '.author-name',
publishDate: {
selector: 'time',
attr: 'datetime',
convert: helpers.toDate
},
tags: {
selector: '.tag',
multiple: true
},
content: {
selector: '.article-body',
html: true
}
});
Job Listings Scraper
const jobs = easyScrape(html, {
listings: {
listItem: '.job-listing',
data: {
title: '.job-title',
company: '.company-name',
location: '.location',
salary: {
selector: '.salary',
regex: /\$([\d,]+)\s*-\s*\$([\d,]+)/,
convert: (val) => {
const match = val.match(/\$([\d,]+)\s*-\s*\$([\d,]+)/);
if (match) {
return {
min: parseInt(match[1].replace(/,/g, '')),
max: parseInt(match[2].replace(/,/g, ''))
};
}
return null;
}
},
type: '.job-type',
remote: {
selector: '.remote-badge',
how: ($el) => $el.length > 0,
default: false
},
postedDate: {
selector: '.posted-date',
attr: 'data-date',
convert: helpers.toDate
},
description: '.job-description',
requirements: {
selector: '.requirements li',
multiple: true
},
benefits: {
selector: '.benefits li',
multiple: true
}
}
}
});
Real Estate Listings
const properties = easyScrape(html, {
listings: {
listItem: '.property-card',
data: {
address: '.address',
price: {
selector: '.price',
convert: helpers.toNumber
},
bedrooms: {
selector: '.bedrooms',
regex: /(\d+)/,
regexGroup: 1,
convert: helpers.toInt
},
bathrooms: {
selector: '.bathrooms',
regex: /(\d+\.?\d*)/,
regexGroup: 1,
convert: helpers.toNumber
},
sqft: {
selector: '.sqft',
convert: helpers.toNumber
},
images: {
selector: '.gallery img',
multiple: true,
attr: 'src',
limit: 10
},
features: {
selector: '.features li',
multiple: true
},
description: '.description',
listingUrl: {
selector: 'a.property-link',
attr: 'href',
resolveUrl: true
}
}
}
}, {
baseUrl: 'https://realestate.example.com'
});
TypeScript Support
Easy Scrape includes full TypeScript definitions:
import { easyScrape, ScrapeSchema, ScrapeOptions, helpers } from 'easy-scrape';
interface Product {
title: string;
price: number;
inStock: boolean;
}
const schema: ScrapeSchema = {
products: {
listItem: '.product',
data: {
title: '.title',
price: {
selector: '.price',
convert: helpers.toNumber
},
inStock: {
selector: '.stock',
convert: helpers.toBoolean
}
}
}
};
const result = easyScrape(html, schema);
const products: Product[] = result.products;
Type-safe Options
import { ScrapeOptions, HowFunction, ConvertFunction } from 'easy-scrape';
const customHow: HowFunction = ($el) => $el.attr('data-id');
const customConvert: ConvertFunction = (val) => parseInt(val);
const options: ScrapeOptions = {
selector: '.item',
how: customHow,
convert: customConvert,
multiple: true,
filter: ($el) => $el.hasClass('active'),
validate: (val) => val > 0,
required: true
};
Error Handling Best Practices
1. Use Default Values for Optional Fields
const result = easyScrape(html, {
optionalField: {
selector: '.optional',
default: 'N/A'
},
optionalNumber: {
selector: '.number',
convert: helpers.toNumber,
default: 0
}
});
2. Use Strict Mode for Required Fields
const result = easyScrape(html, {
requiredField: {
selector: '.required',
strict: true // Throws if missing
}
});
3. Use Required Flag with Validation
const result = easyScrape(html, {
email: {
selector: '.email',
required: true,
validate: (val) => val.includes('@')
}
});
4. Wrap in Try-Catch for Production
try {
const result = easyScrape(html, schema, { baseUrl });
// Process result
} catch (error) {
console.error('Scraping failed:', error.message);
// Handle error (log, retry, fallback)
}
5. Use Conditional Extraction
const result = easyScrape(html, {
salePrice: {
selector: '.sale-price',
ifExists: '.on-sale', // Only extract if on sale
convert: helpers.toNumber
},
regularPrice: {
selector: '.regular-price',
ifNotExists: '.sale-price', // Only if no sale
convert: helpers.toNumber
}
});
Performance Tips
1. Use Specific Selectors
More specific selectors are faster:
// Slower
const result = easyScrape(html, { title: 'div span' });
// Faster
const result = easyScrape(html, { title: '.product-title' });
2. Avoid Deep Nesting
Flatten your data structure when possible:
// Less efficient
const result = easyScrape(html, {
level1: {
selector: '.level1',
data: {
level2: {
selector: '.level2',
data: {
level3: '.level3'
}
}
}
}
});
// More efficient
const result = easyScrape(html, {
level3: '.level1 .level2 .level3'
});
3. Use multiple Instead of map for Simple Text
// Less efficient (creates function overhead)
const result = easyScrape(html, {
tags: {
selector: '.tag',
map: ($el) => $el.text()
}
});
// More efficient
const result = easyScrape(html, {
tags: {
selector: '.tag',
multiple: true
}
});
4. Cache Cheerio Instances
Reuse parsed HTML for multiple extractions:
import * as cheerio from 'cheerio';
const $ = cheerio.load(html);
const result1 = easyScrape($, schema1);
const result2 = easyScrape($, schema2);
5. Use Array Operations Wisely
Apply filters early to reduce processing:
// Process less data
const result = easyScrape(html, {
items: {
selector: '.item',
filter: ($el) => $el.hasClass('active'),
multiple: true,
limit: 10
}
});
Migration Guide
If you're migrating from other scraping libraries:
From Cheerio Direct Usage
Before:
const $ = cheerio.load(html);
const products = [];
$('.product').each((i, el) => {
products.push({
title: $(el).find('.title').text(),
price: parseFloat($(el).find('.price').text().replace('$', ''))
});
});
After:
const result = easyScrape(html, {
products: {
listItem: '.product',
data: {
title: '.title',
price: {
selector: '.price',
convert: helpers.toNumber
}
}
}
});
From scrape-it
Easy Scrape is inspired by scrape-it and offers a similar API with enhanced features:
// scrape-it style (still works!)
const result = easyScrape(html, {
title: '.title',
price: {
selector: '.price',
convert: x => parseFloat(x)
}
});
// Enhanced with new features
const result = easyScrape(html, {
title: '.title',
price: {
selector: '.price',
convert: helpers.toNumber,
required: true,
validate: (val) => val > 0
}
});
Common Patterns
Extract Links with Text
const links = easyScrape(html, {
navigation: {
selector: 'nav a',
map: ($el) => ({
text: $el.text(),
href: $el.attr('href')
})
}
});
Extract Meta Tags
const meta = easyScrape(html, {
title: 'title',
description: presets.meta('description'),
keywords: {
...presets.meta('keywords'),
convert: (val) => val.split(',').map(k => k.trim())
},
ogTitle: presets.ogMeta('title'),
ogImage: presets.ogMeta('image')
});
Extract Breadcrumbs
const breadcrumbs = easyScrape(html, {
trail: {
selector: '.breadcrumb a',
map: ($el, $, index) => ({
position: index + 1,
name: $el.text(),
url: $el.attr('href')
})
}
});
Extract Pagination Info
const pagination = easyScrape(html, {
currentPage: {
selector: '.pagination .active',
convert: helpers.toInt
},
totalPages: {
selector: '.pagination a:last',
convert: helpers.toInt
},
nextPage: {
selector: '.pagination .next',
attr: 'href',
default: null
}
});
Debugging Tips
1. Test Selectors Separately
// Test each selector individually
const test = easyScrape(html, {
test1: '.selector1',
test2: '.selector2'
});
console.log(test);
2. Use Default Values During Development
const result = easyScrape(html, {
field: {
selector: '.test',
default: 'DEBUG: Not found' // Makes missing fields obvious
}
});
3. Log Transform Steps
const result = easyScrape(html, {
price: {
selector: '.price',
transform: [
(val) => { console.log('1:', val); return val; },
(val) => val.replace('$', ''),
(val) => { console.log('2:', val); return val; },
(val) => parseFloat(val)
]
}
});
4. Inspect Extracted HTML
const result = easyScrape(html, {
debug: {
selector: '.target',
outerHtml: true // See the actual HTML
}
});
console.log(result.debug);
FAQ
Q: How do I extract data from a specific element without a selector?
Use an empty selector with context:
{
id: {
selector: '', // Use context element
how: ($el) => $el.attr('data-id')
}
}
Q: Can I use custom Cheerio methods?
Yes, via the how function:
{
custom: {
selector: '.item',
how: ($el) => $el.prev().text() // Any Cheerio method
}
}
Q: How do I handle missing nested elements?
Use default or ifExists:
{
optional: {
selector: '.nested .deep',
default: null,
ifExists: '.nested'
}
}
Q: Can I extract data from multiple pages?
Yes, fetch and scrape each page:
const results = [];
for (const url of urls) {
const html = await fetch(url).then(r => r.text());
const data = easyScrape(html, schema);
results.push(data);
}
Q: How do I handle dynamic content (JavaScript-rendered)?
Easy Scrape works with static HTML. For JavaScript-rendered content, use tools like Puppeteer or Playwright to get the HTML first:
import puppeteer from 'puppeteer';
import { easyScrape } from 'easy-scrape';
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const html = await page.content();
await browser.close();
const result = easyScrape(html, schema);
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT
Credits
Built on top of Cheerio - Fast, flexible & lean implementation of core jQuery designed specifically for the server.
Inspired by scrape-it - A Node.js scraper for humans.
