n8n-nodes-scraper-web-new
v1.0.9
n8n node for web scraping using Cheerio and Crawlee
n8n-nodes-scraper-web
This is an n8n community node for web scraping using Cheerio and Crawlee.
n8n is a fair-code licensed workflow automation platform.
Features
- Scrape Single Page: Extract data from a single web page using CSS selectors
- Crawl Website: Crawl multiple pages following internal links
- Flexible Extraction: Extract text, HTML, or specific attributes
- Multiple Selectors: Define multiple CSS selectors to extract different data points
- Crawl Control: Control max pages, depth, and link patterns
- Same Domain Filtering: Option to stay within the same domain while crawling
Installation
Follow the installation guide in the n8n community nodes documentation.
npm
npm install n8n-nodes-scraper-web
n8n
In n8n, go to Settings > Community Nodes and install:
n8n-nodes-scraper-web
Operations
Scrape Single Page
Extract data from a single web page.
Parameters:
- URL: The URL to scrape
- Extraction Mode: Choose between CSS selectors, full HTML, or text content
- Selectors: Define CSS selectors to extract specific data
Example:
URL: https://example.com
Selector: .title -> Extract text
Result: { title: "Example Title", url: "https://example.com" }
Crawl Website
Crawl multiple pages following internal links.
Parameters:
- Start URLs: Starting URLs for crawling (one per line)
- Max Pages: Maximum number of pages to crawl
- Max Depth: Maximum link depth to crawl, counted from the start URLs
- Link Selector: CSS selector for links to follow (default: a[href])
- Pagination Selector: CSS selector specifically for pagination links (e.g., .pagination a, a[aria-label*="next"]). Leave empty to use Link Selector for all links
- Same Domain Only: Only crawl pages on the same domain (default: true)
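How Max Pages, Max Depth, and Same Domain Only interact can be sketched as a breadth-first traversal over an in-memory link graph. This is an illustration only, not the node's actual code (which relies on Crawlee). The names planCrawl, LinkGraph, and CrawlLimits are hypothetical, and the sketch assumes start URLs count as depth 0.

```typescript
// Hypothetical in-memory link graph: normalized page URL -> links found on that page.
type LinkGraph = Record<string, string[]>;

interface CrawlLimits {
  maxPages: number;        // mirrors "Max Pages"
  maxDepth: number;        // mirrors "Max Depth" (assumption: start URLs are depth 0)
  sameDomainOnly: boolean; // mirrors "Same Domain Only"
}

// Breadth-first traversal that stops at the page and depth limits.
function planCrawl(startUrls: string[], graph: LinkGraph, limits: CrawlLimits): string[] {
  const visited = new Set<string>();
  const order: string[] = [];
  const queue: Array<{ url: string; depth: number }> = startUrls.map((url) => ({
    url: new URL(url).href, // normalize, e.g. add the trailing slash
    depth: 0,
  }));

  while (queue.length > 0 && order.length < limits.maxPages) {
    const { url, depth } = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);
    order.push(url);

    // Pages at the depth limit are visited, but their links are not followed.
    if (depth >= limits.maxDepth) continue;

    const host = new URL(url).hostname;
    for (const link of graph[url] ?? []) {
      const target = new URL(link, url); // resolve relative links against the page
      if (limits.sameDomainOnly && target.hostname !== host) continue;
      queue.push({ url: target.href, depth: depth + 1 });
    }
  }
  return order;
}

const graph: LinkGraph = {
  "https://example.com/": ["/a", "/b", "https://other.example.org/x"],
  "https://example.com/a": ["/c"],
  "https://example.com/b": [],
  "https://example.com/c": ["/d"],
};

const pages = planCrawl(["https://example.com"], graph, {
  maxPages: 50,
  maxDepth: 2,
  sameDomainOnly: true,
});
console.log(pages);
// [ "https://example.com/", "https://example.com/a",
//   "https://example.com/b", "https://example.com/c" ]
```

In this sketch the external link is skipped by the same-domain check, and /d is never reached because /c already sits at the depth limit.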
Example:
Start URLs: https://example.com
Max Pages: 50
Max Depth: 2
Link Selector: a[href]
Pagination Selector: .pagination a
CSS Selectors
The node supports standard CSS selectors:
- Element: div, p, a
- Class: .classname
- ID: #idname
- Attribute: [href], [data-id]
- Combined: div.content > p.text
Extraction Options
For each selector, you can extract:
- Text: The text content of the element
- HTML: The HTML content of the element
- Attribute: A specific attribute value (e.g., href, src)
You can also choose to extract:
- Single: Only the first matching element
- Multiple: All matching elements (returns an array)
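The way these options shape a field's value can be sketched as follows. This is a minimal illustration under stated assumptions, not the node's implementation: MatchedElement is a hypothetical stand-in for an element matched by a selector (the node itself works on Cheerio elements), SelectorRule and shapeResult are hypothetical names, and returning null when nothing matches is an assumption.

```typescript
// Hypothetical stand-in for an element matched by a CSS selector.
interface MatchedElement {
  text: string;
  html: string;
  attributes: Record<string, string>;
}

type ExtractType = "text" | "html" | "attribute";

// Hypothetical per-field rule mirroring the options listed above.
interface SelectorRule {
  field: string;
  extract: ExtractType;
  attribute?: string; // only used when extract is "attribute"
  multiple: boolean;
}

// Read one value from one matched element according to the extract type.
function readValue(el: MatchedElement, rule: SelectorRule): string | null {
  switch (rule.extract) {
    case "text":
      return el.text;
    case "html":
      return el.html;
    case "attribute":
      if (rule.attribute === undefined) return null;
      return el.attributes[rule.attribute] ?? null;
  }
}

// Single returns the first match (or null, by assumption); Multiple returns an array.
function shapeResult(
  matches: MatchedElement[],
  rule: SelectorRule,
): string | null | Array<string | null> {
  const values = matches.map((el) => readValue(el, rule));
  if (rule.multiple) return values;
  return values.length > 0 ? values[0] : null;
}

const matches: MatchedElement[] = [
  { text: "First", html: "<b>First</b>", attributes: { href: "/one" } },
  { text: "Second", html: "<b>Second</b>", attributes: { href: "/two" } },
];

const single = shapeResult(matches, { field: "title", extract: "text", multiple: false });
const all = shapeResult(matches, {
  field: "links",
  extract: "attribute",
  attribute: "href",
  multiple: true,
});
console.log(single); // "First"
console.log(all);    // [ "/one", "/two" ]
```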
Advanced Options
- User Agent: Custom user agent string
- Timeout: Request timeout in milliseconds
- Max Retries: Maximum number of retries for failed requests
- Wait For: Wait time before scraping (useful for dynamic content)
Examples
Extract Article Titles and Links
Operation: Scrape Single Page
URL: https://news.example.com
Selectors:
- Field: titles, Selector: .article-title, Extract: text, Multiple: true
- Field: links, Selector: .article-link, Extract: attribute (href), Multiple: true
Crawl Blog Posts
Operation: Crawl Website
Start URLs: https://blog.example.com
Max Pages: 20
Max Depth: 2
Link Selector: a.post-link
Selectors:
- Field: title, Selector: h1.post-title, Extract: text
- Field: content, Selector: .post-content, Extract: text
- Field: author, Selector: .author-name, Extract: text
Example: Crawl a Blog
Operation: Crawl Website
Start URLs: https://blog.example.com
Max Pages: 50
Max Depth: 2
Link Selector: a[href]
Same Domain Only: Yes
Selectors:
- Field: title, Selector: h1, Extract: text
- Field: content, Selector: .post-content, Extract: text
- Field: author, Selector: .author-name, Extract: text
Example: Scrape Paginated Results (e.g., Real Estate Listings)
Operation: Crawl Website
Start URLs: https://www.vivareal.com.br/venda/rj/niteroi/bairros/centro/apartamento_residencial/
Max Pages: 100
Max Depth: 1
Pagination Selector: .olx-core-pagination a, a[aria-label*="página"]
Same Domain Only: Yes
Selectors:
- Field: title, Selector: h2.property-card__title, Extract: text, Multiple: Yes
- Field: price, Selector: .property-card__price, Extract: text, Multiple: Yes
- Field: link, Selector: a.property-card__content-link, Extract: attribute (href), Multiple: Yes
Tip for Pagination:
- Use Pagination Selector to target only pagination links (next page, page numbers)
- Set Max Depth: 1 to avoid following links inside individual listings
- Set Max Pages to the number of result pages you want to scrape
- The crawler will automatically follow pagination links and extract data from each page
Dependencies
Compatibility
- Requires n8n version 1.0.0 or later
- Node.js 18.10 or later
