n8n-nodes-smart-web-scraper

v0.2.1, published 7 months ago

Smart web scraper node for n8n with automatic failover and content extraction

Vulnerabilities: 0 High, 0 Medium, 0 Low

Maintainer: jezweb

Keywords: n8n-community-node-package, n8n, web-scraping, scraper, content-extraction, firecrawl, jina, readability

n8n-nodes-smart-web-scraper

Smart Web Scraper node for n8n with automatic failover and intelligent content extraction. This node attempts multiple scraping methods to ensure you get the content you need, even when sites block traditional HTTP requests.

Features

🚀 Automatic Failover: Tries multiple scraping methods until one succeeds
📄 Smart Content Extraction: Automatically extracts main article content, removing ads, navigation, and other clutter
🎯 Multiple Strategies: Choose between cost-effective, speed-first, or quality-first approaches
🔄 Multiple Backends: Supports HTTP GET, Jina AI Reader, and Firecrawl API
🌐 Proxy Support: Route requests through proxy servers when needed
📝 Multiple Output Formats: Markdown, plain text, HTML, or structured JSON
🤖 AI-Ready: Can be used as a tool by AI agents (the usableAsTool flag is set)

Installation

Community Nodes (Recommended)

Go to Settings > Community Nodes
Search for n8n-nodes-smart-web-scraper
Click Install

Manual Installation

npm install n8n-nodes-smart-web-scraper

Scraping Methods

1. HTTP GET with Content Extraction (Free)

Standard HTTP request with intelligent content extraction
Uses Mozilla's Readability algorithm to extract main content
Removes ads, navigation, sidebars automatically
Converts to clean markdown format

2. Jina AI Reader (Free Tier Available)

Specialized reader API that returns clean markdown
No API key required for basic usage
Handles JavaScript-rendered content better than HTTP GET
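As a rough illustration of how a Jina AI Reader call can be put together, the sketch below builds the request for a target page. It assumes the public Reader convention of prefixing the target URL with https://r.jina.ai/ and of sending an optional key as a Bearer token; buildJinaRequest and ReaderRequest are hypothetical names, and this is not necessarily how the node issues the call internally.

```typescript
// Assumption: the Reader is called by prefixing the target URL with this base.
const JINA_READER_BASE = "https://r.jina.ai/";

// Hypothetical shape for illustration only.
interface ReaderRequest {
  url: string;
  headers: Record<string, string>;
}

// Build the request for a target page. The API key is optional,
// matching the "no API key required for basic usage" note above.
function buildJinaRequest(targetUrl: string, apiKey?: string): ReaderRequest {
  // Reject anything that is not an http(s) URL before any network call.
  if (!/^https?:\/\//i.test(targetUrl)) {
    throw new Error(`Unsupported URL: ${targetUrl}`);
  }
  const headers: Record<string, string> = {};
  if (apiKey) {
    // Assumption: the key is passed as a Bearer token.
    headers["Authorization"] = `Bearer ${apiKey}`;
  }
  return { url: JINA_READER_BASE + targetUrl, headers };
}
```

The returned url and headers can be passed to any HTTP client; the response body is the page rendered as markdown.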

3. Firecrawl API (Premium)

Professional web scraping API
Best extraction quality
Handles complex sites and anti-scraping measures

Configuration

Scraping Strategies

Cost Effective: Tries free methods first (HTTP → Jina → Firecrawl)
Speed First: Uses fastest available method
Quality First: Starts with premium APIs for best extraction
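The strategies above amount to choosing an order in which to try the methods and moving on when one fails. The following is a minimal sketch of that idea, not the node's actual source. The cost_effective order comes from the list above; the tail of the quality_first order and the whole speed_first order are assumptions, and methodOrder, scrapeWithFailover, and the Scraper type are hypothetical names.

```typescript
type Method = "http" | "jina" | "firecrawl";
type Strategy = "cost_effective" | "speed_first" | "quality_first";

interface FailoverOptions {
  enableJina: boolean;
  enableFirecrawl: boolean;
}

// Decide which methods to try, and in what order.
function methodOrder(strategy: Strategy, opts: FailoverOptions): Method[] {
  // cost_effective follows the README: HTTP, then Jina, then Firecrawl.
  // Assumption: quality_first reverses that order, and speed_first is
  // treated like cost_effective here because its ordering is not documented.
  const base: Method[] =
    strategy === "quality_first"
      ? ["firecrawl", "jina", "http"]
      : ["http", "jina", "firecrawl"];
  // Drop backends the user has disabled; plain HTTP is always available.
  return base.filter(
    (m) =>
      m === "http" ||
      (m === "jina" && opts.enableJina) ||
      (m === "firecrawl" && opts.enableFirecrawl)
  );
}

// Hypothetical scraper signature: resolves with content or rejects on failure.
type Scraper = (url: string) => Promise<string>;

// Try each method in order and return the first success,
// recording which method produced the content.
async function scrapeWithFailover(
  url: string,
  order: Method[],
  scrapers: Record<Method, Scraper>
): Promise<{ content: string; scrapingMethod: Method }> {
  const errors: string[] = [];
  for (const method of order) {
    try {
      const content = await scrapers[method](url);
      return { content, scrapingMethod: method };
    } catch (err) {
      // Keep a per-method message so the final error explains every failure.
      errors.push(`${method}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All methods failed for ${url}: ${errors.join("; ")}`);
}
```

Separating the ordering from the loop keeps the strategy choice easy to inspect: the order is a plain array that can be logged before any request is made.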

Credentials Setup

Firecrawl API (Optional)

Sign up at Firecrawl.dev
Get your API key
Add to n8n credentials

Jina AI API (Optional)

Visit Jina AI Reader
API key is optional for basic usage
Add to n8n credentials for higher limits

Proxy Server (Optional)

Configure your proxy details
Supports HTTP, HTTPS, and SOCKS5 protocols
Optional authentication support

Usage Examples

Basic Web Scraping

{
  "url": "https://example.com/article",
  "strategy": "cost_effective",
  "outputOptions": {
    "format": "markdown",
    "extractMainContent": true
  }
}

With Failover Options

{
  "url": "https://example.com/article",
  "strategy": "cost_effective",
  "failoverOptions": {
    "enableJina": true,
    "enableFirecrawl": true,
    "enableProxy": false
  },
  "outputOptions": {
    "format": "markdown",
    "maxLength": 5000,
    "includeMetadata": true
  }
}

For AI Processing

{
  "url": "https://example.com/article",
  "strategy": "quality_first",
  "outputOptions": {
    "format": "markdown",
    "extractMainContent": true,
    "maxLength": 3000
  }
}

Output Structure

The node returns:

content: The extracted content in your chosen format
metadata: Title, author, excerpt, site name (when available)
scrapingMethod: Which method successfully retrieved the content
url: The scraped URL
timestamp: When the scraping occurred
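For downstream Code nodes or external consumers, it can help to describe this output as a type. The sketch below is one possible shape: the top-level field names are taken from the list above, while the metadata key names (such as siteName) and the choice of string for timestamp are assumptions, and isScrapeResult is a hypothetical helper.

```typescript
// Assumption: metadata keys mirror the fields listed in the README.
interface ScrapeMetadata {
  title?: string;
  author?: string;
  excerpt?: string;
  siteName?: string;
}

interface ScrapeResult {
  content: string;
  metadata?: ScrapeMetadata; // present only when available
  scrapingMethod: string;    // which method retrieved the content
  url: string;
  timestamp: string;         // assumption: serialized as a string
}

// Runtime check that a value carries the required top-level fields.
function isScrapeResult(value: unknown): value is ScrapeResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["content"] === "string" &&
    typeof v["scrapingMethod"] === "string" &&
    typeof v["url"] === "string" &&
    typeof v["timestamp"] === "string"
  );
}
```

A guard like this lets later steps fail early with a clear message when an item is missing its content, rather than passing an empty string on to a model.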

Use with AI Agents

This node can be used as an AI tool (usableAsTool: true):

Connect it to an AI Agent node
The agent can then call it to fetch web content when it needs to
The clean, extracted content fits well into AI context windows

Error Handling

The node includes comprehensive error handling:

Automatic retry with exponential backoff
Detailed error messages for each failed method
Option to continue workflow on errors
Clear indication of which method succeeded
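To make "retry with exponential backoff" concrete, here is a minimal sketch of the general technique. The base delay, cap, and attempt count are illustrative assumptions rather than the node's actual settings, and backoffDelayMs and withRetry are hypothetical names.

```typescript
// Delay before the next attempt: doubles each time, capped at maxMs.
// Assumption: the 500 ms base and 8000 ms cap are illustrative defaults.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run fn up to maxAttempts times, waiting longer after each failure.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // No need to wait after the final attempt.
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt, baseMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  // Surface the last failure once every attempt has been used.
  throw lastError;
}
```

With these defaults the waits between attempts are 500 ms and then 1000 ms, so a request that fails three times gives up after roughly 1.5 seconds of waiting.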

Tips

Start with Cost Effective strategy - It's free and works for most sites
Enable Jina for JavaScript sites - Better than plain HTTP for SPAs
Use Firecrawl for critical content - When you absolutely need the data
Set max length for AI use - Prevent token limit issues
Extract main content by default - Cleaner data for processing

Development

# Install dependencies
pnpm install

# Build the node
pnpm run build

# Test in development
pnpm run dev

# Lint code
pnpm run lint

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT

Support

For issues and feature requests, please use the GitHub Issues page.

Changelog

v0.1.0

Initial release
HTTP GET with Readability extraction
Jina AI Reader integration
Firecrawl API support
Proxy server support
Multiple output formats
AI tool compatibility
