n8n-nodes-scraper-web-new

v1.0.9

Published

2 days ago

n8n node for web scraping using Cheerio and Crawlee

mattos123

n8n-community-node-package n8n web scraping cheerio crawlee crawler

n8n-nodes-scraper-web

This is an n8n community node for web scraping using Cheerio and Crawlee.

n8n is a fair-code licensed workflow automation platform.

Features

Scrape Single Page: Extract data from a single web page using CSS selectors
Crawl Website: Crawl multiple pages following internal links
Flexible Extraction: Extract text, HTML, or specific attributes
Multiple Selectors: Define multiple CSS selectors to extract different data points
Crawl Control: Control max pages, depth, and link patterns
Same Domain Filtering: Option to stay within the same domain while crawling

Installation

Follow the installation guide in the n8n community nodes documentation.

npm

npm install n8n-nodes-scraper-web-new

n8n

In n8n, go to Settings > Community Nodes and install:

n8n-nodes-scraper-web-new

Operations

Scrape Single Page

Extract data from a single web page.

Parameters:

URL: The URL to scrape
Extraction Mode: Choose between CSS selectors, full HTML, or text content
Selectors: Define CSS selectors to extract specific data

Example:

URL: https://example.com
Selector: .title -> Extract text
Result: { title: "Example Title", url: "https://example.com" }

Crawl Website

Crawl multiple pages following internal links.

Parameters:

Start URLs: Starting URLs for crawling (one per line)
Max Pages: Maximum number of pages to crawl
Max Depth: Maximum number of link hops to follow from the start URLs
Link Selector: CSS selector for links to follow (default: a[href])
Pagination Selector: CSS selector specifically for pagination links (e.g., .pagination a, a[aria-label*="next"]). Leave empty to use Link Selector for all links
Same Domain Only: Only crawl pages on the same domain (default: true)

Example:

Start URLs: https://example.com
Max Pages: 50
Max Depth: 2
Link Selector: a[href]
Pagination Selector: .pagination a
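
How Max Pages, Max Depth, and Same Domain Only interact can be sketched as a breadth-first traversal. This is only an illustration, not the node's or Crawlee's actual code: the `crawl` function, the `CrawlOptions` type, and the in-memory `linkGraph` (standing in for real HTTP requests) are hypothetical, and it assumes that start URLs count as depth 0 and that "same domain" means "same hostname as a start URL".

```typescript
// Hypothetical options mirroring the parameters described above.
interface CrawlOptions {
  startUrls: string[];
  maxPages: number;
  maxDepth: number;
  sameDomainOnly: boolean;
}

// Breadth-first crawl over a pre-built link map (absolute URL -> outgoing links).
// A real crawler would fetch each page and read the links from its HTML instead.
function crawl(
  linkGraph: Map<string, string[]>,
  options: CrawlOptions,
): string[] {
  // Normalize start URLs so "https://example.com" and "https://example.com/" match.
  const starts = options.startUrls.map((u) => new URL(u).href);
  const allowedHosts = new Set(starts.map((u) => new URL(u).hostname));
  const visited: string[] = [];
  const seen = new Set<string>(starts);
  // Assumption: start URLs are depth 0; each followed link adds 1.
  const queue = starts.map((url) => ({ url, depth: 0 }));

  while (queue.length > 0 && visited.length < options.maxPages) {
    const { url, depth } = queue.shift()!;
    visited.push(url);
    if (depth >= options.maxDepth) continue; // do not follow links any deeper

    for (const link of linkGraph.get(url) ?? []) {
      const absolute = new URL(link, url).href; // resolve relative links
      if (seen.has(absolute)) continue; // skip URLs that are already queued
      if (
        options.sameDomainOnly &&
        !allowedHosts.has(new URL(absolute).hostname)
      ) {
        continue; // skip links that leave the start domains
      }
      seen.add(absolute);
      queue.push({ url: absolute, depth: depth + 1 });
    }
  }
  return visited;
}
```

In this model, lowering Max Pages cuts the crawl short in visit order, while lowering Max Depth stops links from being followed beyond a given distance from the start URLs.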

CSS Selectors

The node supports standard CSS selectors:

Element: div, p, a
Class: .classname
ID: #idname
Attribute: [href], [data-id]
Combined: div.content > p.text

Extraction Options

For each selector, you can extract:

Text: The text content of the element
HTML: The HTML content of the element
Attribute: A specific attribute value (e.g., href, src)

You can also choose to extract:

Single: Only the first matching element
Multiple: All matching elements (returns an array)
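
The two choices above (what to extract, and single versus multiple) can be modeled as a small function over elements that a CSS selector has already matched. This is a minimal sketch, not the node's implementation: the `MatchedElement` and `SelectorRule` types, the `extractField` name, and the decision to return `null` when nothing matches are assumptions made for illustration.

```typescript
// Hypothetical model of an element already matched by a CSS selector.
interface MatchedElement {
  text: string;
  html: string;
  attributes: Record<string, string>;
}

// Hypothetical shape of one selector definition (names are illustrative).
interface SelectorRule {
  field: string;
  extract: "text" | "html" | "attribute";
  attribute?: string; // only used when extract === "attribute"
  multiple: boolean;
}

// Read one value from an element according to the rule's extract type.
function readValue(el: MatchedElement, rule: SelectorRule): string | null {
  switch (rule.extract) {
    case "text":
      return el.text.trim();
    case "html":
      return el.html;
    case "attribute":
      return rule.attribute !== undefined
        ? el.attributes[rule.attribute] ?? null
        : null;
  }
}

// Single: value of the first match (or null). Multiple: array of all values.
function extractField(
  matches: MatchedElement[],
  rule: SelectorRule,
): string | null | (string | null)[] {
  if (rule.multiple) {
    return matches.map((el) => readValue(el, rule));
  }
  return matches.length > 0 ? readValue(matches[0], rule) : null;
}
```

Under these assumptions, a rule with `multiple: false` yields a single string for the first match, and the same rule with `multiple: true` yields an array with one entry per matching element.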

Advanced Options

User Agent: Custom user agent string
Timeout: Request timeout in milliseconds
Max Retries: Maximum number of retries for failed requests
Wait For: Wait time before scraping (useful for dynamic content)
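
As an illustration of how a timeout and a retry limit usually combine, the sketch below wraps an arbitrary async task. It is not the node's or Crawlee's code: `withTimeout` and `withRetries` are hypothetical helpers, and it assumes that Max Retries counts extra attempts after the first one and that a timeout counts as a failed attempt.

```typescript
// Reject if the task does not settle within timeoutMs milliseconds.
// Note: this only stops waiting; it does not cancel the underlying work.
function withTimeout<T>(task: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
    task.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Run the task, retrying up to maxRetries extra times on failure or timeout.
async function withRetries<T>(
  task: () => Promise<T>,
  maxRetries: number,
  timeoutMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await withTimeout(task(), timeoutMs);
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  throw lastError; // every attempt failed
}
```

With Max Retries set to 3 in this model, a request is attempted at most four times before the error is reported.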

Examples

Extract Article Titles and Links

Operation: Scrape Single Page
URL: https://news.example.com
Selectors:
  - Field: titles, Selector: .article-title, Extract: text, Multiple: true
  - Field: links, Selector: .article-link, Extract: attribute (href), Multiple: true

Crawl Blog Posts

Operation: Crawl Website
Start URLs: https://blog.example.com
Max Pages: 20
Max Depth: 2
Link Selector: a.post-link
Selectors:
  - Field: title, Selector: h1.post-title, Extract: text
  - Field: content, Selector: .post-content, Extract: text
  - Field: author, Selector: .author-name, Extract: text

Example: Crawl a Blog

Operation: Crawl Website
Start URLs: https://blog.example.com
Max Pages: 50
Max Depth: 2
Link Selector: a[href]
Same Domain Only: Yes

Selectors:
  - Field: title, Selector: h1, Extract: text
  - Field: content, Selector: .post-content, Extract: text
  - Field: author, Selector: .author-name, Extract: text

Example: Scrape Paginated Results (e.g., Real Estate Listings)

Operation: Crawl Website
Start URLs: https://www.vivareal.com.br/venda/rj/niteroi/bairros/centro/apartamento_residencial/
Max Pages: 100
Max Depth: 1
Pagination Selector: .olx-core-pagination a, a[aria-label*="página"]
Same Domain Only: Yes

Selectors:
  - Field: title, Selector: h2.property-card__title, Extract: text, Multiple: Yes
  - Field: price, Selector: .property-card__price, Extract: text, Multiple: Yes
  - Field: link, Selector: a.property-card__content-link, Extract: attribute (href), Multiple: Yes

Tips for Pagination:

Use Pagination Selector to target only pagination links (next page, page numbers)
Set Max Depth: 1 to avoid following links inside individual listings
Set Max Pages to the number of result pages you want to scrape
The crawler will automatically follow pagination links and extract data from each page
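
Because the selectors in the paginated example use Multiple: Yes, each field comes back as an array. If one record per listing is needed, the arrays can be combined afterwards, for example in a later workflow step. The sketch below assumes a page result shaped like `{ title: [...], price: [...], link: [...] }`; that shape and the `zipFields` helper are assumptions for illustration, not documented output of the node.

```typescript
// Turn { title: [...], price: [...] } into [{ title, price }, ...].
function zipFields(
  fields: Record<string, string[]>,
): Record<string, string | null>[] {
  const names = Object.keys(fields);
  // The longest column decides how many rows are produced.
  const length = Math.max(0, ...names.map((name) => fields[name].length));
  const rows: Record<string, string | null>[] = [];
  for (let i = 0; i < length; i++) {
    const row: Record<string, string | null> = {};
    for (const name of names) {
      row[name] = fields[name][i] ?? null; // pad short columns with null
    }
    rows.push(row);
  }
  return rows;
}
```

Combining by position is only reliable when every listing contains every field. If some cards lack a price, for instance, the arrays drift out of alignment and the rows will pair values from different listings.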

Dependencies

Cheerio - Fast, flexible HTML parsing
Crawlee - Web scraping and browser automation library

Compatibility

Requires n8n version 1.0.0 or later
Node.js 18.10 or later

License

MIT