n8n-nodes-smart-web-scraper

v0.2.1

Smart web scraper node for n8n with automatic failover and content extraction

Downloads

189

n8n-nodes-smart-web-scraper

Smart Web Scraper node for n8n with automatic failover and main-content extraction. The node tries several scraping methods in turn, so a page can still be retrieved when a site blocks plain HTTP requests.

Features

  • 🚀 Automatic Failover: Tries multiple scraping methods until one succeeds
  • 📄 Smart Content Extraction: Automatically extracts main article content, removing ads, navigation, and other clutter
  • 🎯 Multiple Strategies: Choose between cost-effective, speed-first, or quality-first approaches
  • 🔄 Multiple Backends: Supports HTTP GET, Jina AI Reader, and Firecrawl API
  • 🌐 Proxy Support: Route requests through proxy servers when needed
  • 📝 Multiple Output Formats: Markdown, plain text, HTML, or structured JSON
  • 🤖 AI-Ready: Can be used as a tool by AI agents (usableAsTool flag)

Installation

Community Nodes (Recommended)

  1. In n8n, go to Settings > Community Nodes
  2. Search for n8n-nodes-smart-web-scraper
  3. Click Install

Manual Installation

npm install n8n-nodes-smart-web-scraper

Scraping Methods

1. HTTP GET with Content Extraction (Free)

  • Standard HTTP request with intelligent content extraction
  • Uses Mozilla's Readability algorithm to extract main content
  • Removes ads, navigation, sidebars automatically
  • Converts to clean markdown format

2. Jina AI Reader (Free Tier Available)

  • Specialized reader API that returns clean markdown
  • No API key required for basic usage
  • Handles JavaScript-rendered content better than HTTP GET
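
As a rough illustration of what a call to the Jina reader involves, the sketch below builds the request for a target page. It assumes the public reader endpoint, which takes the target URL appended to https://r.jina.ai/, and an optional bearer token. The helper name buildJinaRequest and the ReaderRequest type are hypothetical and are not taken from this package.

```typescript
// Hypothetical helper, for illustration only; not part of this node's API.
interface ReaderRequest {
  url: string;
  headers: Record<string, string>;
}

function buildJinaRequest(targetUrl: string, apiKey?: string): ReaderRequest {
  // Only http(s) pages make sense as reader targets.
  if (!/^https?:\/\//i.test(targetUrl)) {
    throw new Error("Target must be an http(s) URL: " + targetUrl);
  }
  const headers: Record<string, string> = {};
  // The key is optional; when present it is sent as a bearer token.
  if (apiKey !== undefined && apiKey !== "") {
    headers["Authorization"] = "Bearer " + apiKey;
  }
  // The reader endpoint takes the full target URL as its path.
  return { url: "https://r.jina.ai/" + targetUrl, headers };
}
```

The returned url and headers can be passed to any HTTP client. Without a key the headers object stays empty, which matches the "no API key required for basic usage" case above.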

3. Firecrawl API (Premium)

  • Professional web scraping API
  • Best extraction quality
  • Handles complex sites and anti-scraping measures

Configuration

Scraping Strategies

  • Cost Effective: Tries free methods first (HTTP → Jina → Firecrawl)
  • Speed First: Uses the fastest available method
  • Quality First: Starts with premium APIs for best extraction
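
One way to picture the strategies is as an ordered list of methods filtered by what is enabled. The sketch below is a guess at that logic, not the node's actual code. Only the Cost Effective order (HTTP → Jina → Firecrawl) comes from this README. The other two orders, the speed_first value, and the planMethods helper are assumptions made for illustration.

```typescript
type Method = "http" | "jina" | "firecrawl";
type Strategy = "cost_effective" | "speed_first" | "quality_first";

interface FailoverFlags {
  enableJina: boolean;
  enableFirecrawl: boolean;
}

const METHOD_ORDER: Record<Strategy, Method[]> = {
  // Documented above: free methods first.
  cost_effective: ["http", "jina", "firecrawl"],
  // Assumption: plain HTTP is treated as the fastest method.
  speed_first: ["http", "jina", "firecrawl"],
  // Assumption: premium API first, then the free fallbacks.
  quality_first: ["firecrawl", "jina", "http"],
};

// Hypothetical helper: returns the methods to try, in order.
function planMethods(strategy: Strategy, flags: FailoverFlags): Method[] {
  return METHOD_ORDER[strategy].filter((method) => {
    if (method === "jina") return flags.enableJina;
    if (method === "firecrawl") return flags.enableFirecrawl;
    return true; // plain HTTP is always available
  });
}
```

For example, under these assumptions quality_first with Firecrawl disabled falls back to Jina and then plain HTTP.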

Credentials Setup

Firecrawl API (Optional)

  1. Sign up at Firecrawl.dev
  2. Get your API key
  3. Add the key to your n8n credentials

Jina AI API (Optional)

  1. Visit Jina AI Reader
  2. An API key is optional for basic usage
  3. Add a key to your n8n credentials for higher limits

Proxy Server (Optional)

  1. Configure your proxy details
  2. Supports HTTP, HTTPS, and SOCKS5 protocols
  3. Optional authentication support
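
Proxy details like these are commonly combined into a single proxy URL. The sketch below shows one way to do that, assuming fields for protocol, host, port, and optional credentials. The ProxyConfig shape and the toProxyUrl helper are hypothetical, and the node's real credential fields may be named differently.

```typescript
type ProxyProtocol = "http" | "https" | "socks5";

// Hypothetical shape; the node's actual credential fields may differ.
interface ProxyConfig {
  protocol: ProxyProtocol;
  host: string;
  port: number;
  username?: string;
  password?: string;
}

function toProxyUrl(cfg: ProxyConfig): string {
  if (!Number.isInteger(cfg.port) || cfg.port < 1 || cfg.port > 65535) {
    throw new Error("Invalid proxy port: " + cfg.port);
  }
  let auth = "";
  if (cfg.username !== undefined) {
    // Percent-encode credentials so characters like "@" or ":" stay unambiguous.
    auth = encodeURIComponent(cfg.username);
    if (cfg.password !== undefined) {
      auth += ":" + encodeURIComponent(cfg.password);
    }
    auth += "@";
  }
  return cfg.protocol + "://" + auth + cfg.host + ":" + cfg.port;
}
```

Leaving out username and password yields a URL without an authentication part, which corresponds to the optional authentication noted above.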

Usage Examples

Basic Web Scraping

{
  "url": "https://example.com/article",
  "strategy": "cost_effective",
  "outputOptions": {
    "format": "markdown",
    "extractMainContent": true
  }
}

With Failover Options

{
  "url": "https://example.com/article",
  "strategy": "cost_effective",
  "failoverOptions": {
    "enableJina": true,
    "enableFirecrawl": true,
    "enableProxy": false
  },
  "outputOptions": {
    "format": "markdown",
    "maxLength": 5000,
    "includeMetadata": true
  }
}

For AI Processing

{
  "url": "https://example.com/article",
  "strategy": "quality_first",
  "outputOptions": {
    "format": "markdown",
    "extractMainContent": true,
    "maxLength": 3000
  }
}

Output Structure

The node returns:

  • content: The extracted content in your chosen format
  • metadata: Title, author, excerpt, site name (when available)
  • scrapingMethod: Which method successfully retrieved the content
  • url: The scraped URL
  • timestamp: When the scraping occurred
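
For downstream code it can help to describe this output as a type. The sketch below uses the five field names listed above. The metadata key names, the ISO 8601 timestamp, and the character-based maxLength truncation are assumptions, and buildResult is a hypothetical helper rather than the node's implementation.

```typescript
// Key names inside metadata are assumed from "title, author, excerpt, site name".
interface ScrapeMetadata {
  title?: string;
  author?: string;
  excerpt?: string;
  siteName?: string;
}

interface ScrapeResult {
  content: string;
  metadata?: ScrapeMetadata;
  scrapingMethod: string;
  url: string;
  timestamp: string; // assumed to be an ISO 8601 string
}

interface OutputOptions {
  maxLength?: number;
  includeMetadata?: boolean;
}

// Hypothetical helper: assembles a result and applies the output options.
function buildResult(
  content: string,
  metadata: ScrapeMetadata,
  scrapingMethod: string,
  url: string,
  options: OutputOptions,
  now: Date,
): ScrapeResult {
  const limit = options.maxLength;
  // Assumption: maxLength counts characters and truncates from the end.
  const trimmed =
    limit !== undefined && content.length > limit ? content.slice(0, limit) : content;
  const result: ScrapeResult = {
    content: trimmed,
    scrapingMethod,
    url,
    timestamp: now.toISOString(),
  };
  if (options.includeMetadata === true) {
    result.metadata = metadata;
  }
  return result;
}
```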

Use with AI Agents

This node is enabled as an AI tool (usableAsTool: true). To use it:

  1. Connect it to an AI Agent node
  2. The agent can then call it on its own to fetch web content
  3. The extracted content comes back free of page clutter, which helps it fit in AI context windows

Error Handling

The node handles errors in several ways:

  • Automatic retry with exponential backoff
  • Detailed error messages for each failed method
  • Option to continue the workflow on errors
  • Clear indication of which method succeeded (see scrapingMethod in the output)
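
The retry and failover behavior described above can be sketched as two nested loops: retry one method with growing delays, then move on to the next method. This is a minimal sketch under stated assumptions, not the node's source. The doubling delay, the retry count, and all names (Scraper, scrapeWithFailover, backoffDelay) are hypothetical. The sleep function is passed in so the example stays free of runtime-specific timers.

```typescript
type Scraper = (url: string) => Promise<string>;

interface FailedAttempt {
  method: string;
  error: string;
}

interface FailoverResult {
  content: string;
  scrapingMethod: string;
  failures: FailedAttempt[];
}

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelay(attempt: number, baseMs: number): number {
  return baseMs * Math.pow(2, attempt);
}

async function scrapeWithFailover(
  url: string,
  scrapers: Array<[string, Scraper]>,
  retries: number,
  baseMs: number,
  sleep: (ms: number) => Promise<void>,
): Promise<FailoverResult> {
  const failures: FailedAttempt[] = [];
  for (const [method, scrape] of scrapers) {
    for (let attempt = 0; attempt <= retries; attempt++) {
      try {
        const content = await scrape(url);
        return { content, scrapingMethod: method, failures };
      } catch (err) {
        // Record every failure so the final error can name each method.
        const message = err instanceof Error ? err.message : String(err);
        failures.push({ method, error: message });
        // Wait only if this method will be retried.
        if (attempt < retries) {
          await sleep(backoffDelay(attempt, baseMs));
        }
      }
    }
  }
  const summary = failures.map((f) => f.method + ": " + f.error).join("; ");
  throw new Error("All scraping methods failed (" + summary + ")");
}
```

In a real runtime, sleep could be something like (ms) => new Promise((resolve) => setTimeout(resolve, ms)). The failures array on a successful result shows which methods were tried and why they failed before one succeeded.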

Tips

  1. Start with the Cost Effective strategy - It tries the free methods first, and they work for most sites
  2. Enable Jina for JavaScript-heavy sites - It handles SPAs better than plain HTTP
  3. Use Firecrawl for critical content - When you absolutely need the data
  4. Set a max length for AI use - Prevents token limit issues
  5. Extract main content by default - Gives cleaner data for processing

Development

# Install dependencies
pnpm install

# Build the node
pnpm run build

# Test in development
pnpm run dev

# Lint code
pnpm run lint

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT

Support

For issues and feature requests, please use the GitHub Issues page.

Changelog

v0.1.0

  • Initial release
  • HTTP GET with Readability extraction
  • Jina AI Reader integration
  • Firecrawl API support
  • Proxy server support
  • Multiple output formats
  • AI tool compatibility