n8n-nodes-ai-training-scraper
v1.0.2
Scrape and chunk websites into AI-ready training data for RAG, LLM fine-tuning, and vector databases.
This is an n8n community node that lets you scrape websites and convert them into AI-ready training data using the Apify AI Training Data Scraper. It intelligently chunks content for RAG (Retrieval-Augmented Generation), LLM fine-tuning, and vector databases.
n8n is a fair-code licensed workflow automation platform.
Features
- Smart Scraping: Choose between Cheerio (fast, static) or Playwright (headless browser for JS-heavy sites).
- Intelligent Chunking:
  - Semantic: Splits by meaning (recommended for RAG).
  - Fixed Token: Enforces strict token limits.
  - Sentence Based: Preserves sentence boundaries.
  - Markdown Section: Splits by headers.
- AI-Ready Output: Formats data specifically for vector databases (Pinecone, Weaviate, etc.) or fine-tuning datasets.
- Advanced Control:
  - Remove elements by CSS selector (ads, navbars).
  - Respect robots.txt.
  - Recursively follow links with depth control.
  - Extract metadata (author, date, keywords).
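To make the chunking strategies concrete, here is a minimal sketch of the "Markdown Section" idea: split a document on headers so each chunk covers one section. This illustrates the concept only; it is not the node's actual implementation.

```javascript
// Split markdown into one chunk per header-led section.
// Illustrative sketch, not the actor's real chunker.
function splitByMarkdownSections(markdown) {
  const lines = markdown.split('\n');
  const chunks = [];
  let current = [];
  for (const line of lines) {
    // A line starting with 1-6 '#' characters opens a new section.
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      chunks.push(current.join('\n').trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) chunks.push(current.join('\n').trim());
  return chunks.filter((c) => c.length > 0);
}

const sampleDoc = '# Intro\nHello.\n## Setup\nInstall it.\n## Usage\nRun it.';
console.log(splitByMarkdownSections(sampleDoc).length); // → 3, one chunk per header
```

Semantic chunking works similarly but decides split points by topical similarity rather than by markup.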
Installation
Follow the instructions for installing a community node in your n8n instance.
- Go to Settings > Community Nodes.
- Select Install.
- Enter the package name: n8n-nodes-ai-training-scraper.
Alternatively, if running via npm:
npm install n8n-nodes-ai-training-scraper

Configuration
You need an Apify API Token to use this node.
- Log in to your Apify Console.
- Go to Settings > Integrations.
- Copy your API Token.
- In n8n, add a new Credential for Apify API and paste the token.
Usage Examples
1. Basic Documentation Scraping
Scrape a documentation site and prepare it for a vector store.
- Operation: Scrape and Chunk
- Start URLs: https://docs.python.org/3/
- Chunking: Semantic
- Output Format: Vector Ready
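The exact output schema is defined by the Apify actor, but a "Vector Ready" record typically pairs chunk text with source metadata and a stable ID for upserts. The field names below are illustrative assumptions, not the actor's documented schema; inspect a real run's output before wiring up your vector store.

```javascript
// Hypothetical shape of one "Vector Ready" chunk record.
// Field names are assumptions for illustration only.
const exampleChunk = {
  id: 'docs-python-org-0001',              // stable ID so re-runs upsert, not duplicate
  text: 'The Python Tutorial covers ...',  // the chunk content to embed
  metadata: {
    url: 'https://docs.python.org/3/tutorial/',
    title: 'The Python Tutorial',
    chunkIndex: 0,
  },
};
console.log(Object.keys(exampleChunk)); // → [ 'id', 'text', 'metadata' ]
```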
2. RAG Pipeline
Build a chatbot that answers questions based on your website.
- AI Training Scraper: Scrapes your blog.
- OpenAI Embeddings: Converts chunks to vectors.
- Pinecone: Stores the vectors.
- LangChain: Queries Pinecone for context.
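The retrieval step of this pipeline reduces to nearest-neighbor search over embedding vectors. Pinecone does that at scale server-side; as a sketch of what it ranks by (under the cosine metric), not of the Pinecone API:

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored chunks against a query vector, highest similarity first.
function topK(queryVec, records, k) {
  return records
    .map((r) => ({ ...r, score: cosineSimilarity(queryVec, r.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const stored = [
  { id: 'a', vector: [1, 0] },
  { id: 'b', vector: [0, 1] },
];
console.log(topK([1, 0], stored, 1)[0].id); // → 'a'
```

The top-k chunks' `text` fields are what you stuff into the LLM prompt as context.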
3. Multi-site Knowledge Base
Combine multiple sources into one dataset.
- Start URLs: https://docs.example.com, https://blog.example.com
- Max Pages: 500
- Crawler Type: Playwright (to handle dynamic content)
Parameters Guide
Essential
- Start URLs: Where the crawler begins. Can be multiple comma-separated URLs.
- Crawler Type:
  - Cheerio: Much faster and cheaper, but acts like `curl`. Good for static HTML.
  - Playwright: Uses a real browser. Essential for React/Vue/Angular sites, but slower.
Chunking
- Strategy: How to split the text. `Semantic` uses basic NLP to keep related text together.
- Chunk Size: Target size in tokens (approx. 4 characters per token).
- Chunk Overlap: How many tokens to repeat between chunks to preserve context at boundaries.
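The 4-characters-per-token figure is only a rule of thumb, but it lets you sanity-check chunk settings before a run. A sketch of the arithmetic (real tokenizers will differ):

```javascript
// Rough token estimate using the ~4 chars/token rule of thumb.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Each new chunk advances by (chunkSize - chunkOverlap) tokens, so a
// document of totalTokens yields roughly this many chunks.
function estimateChunkCount(totalTokens, chunkSize, chunkOverlap) {
  const stride = chunkSize - chunkOverlap;
  return Math.max(1, Math.ceil((totalTokens - chunkOverlap) / stride));
}

console.log(estimateTokens('a'.repeat(2000)));  // → 500 tokens
console.log(estimateChunkCount(2000, 512, 50)); // → 5 chunks
```

Larger overlaps preserve more context across boundaries but inflate the chunk count (and embedding cost).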
Advanced
- Max Crawl Depth: How many clicks away from the start URL to go.
- Remove Elements: CSS selectors to strip out before processing (e.g., `nav, .footer, .ad-banner`).
- URL Patterns: Only scrape URLs matching these globs (e.g., `**/blog/**`).
- Exclude URL Patterns: Skip URLs matching these globs (e.g., `**/login`).
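To show the glob semantics (`**` crosses path segments, `*` stays within one), here is a rough sketch of how such patterns can be matched. The actor has its own matcher; this only illustrates the behavior:

```javascript
// Convert a simple glob to a RegExp: ** matches anything including
// slashes, * matches within a single path segment. Illustrative only.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const body = escaped
    .replace(/\*\*/g, '\u0000')  // protect ** before handling single *
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '.*');
  return new RegExp('^' + body + '$');
}

const blogOnly = globToRegExp('**/blog/**');
console.log(blogOnly.test('https://docs.example.com/blog/post-1')); // → true
console.log(blogOnly.test('https://docs.example.com/pricing'));     // → false
```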
Troubleshooting
- Rate Limits: If you see 429 errors or timeouts, reduce `Max Concurrency` in Advanced Options.
- Empty Results: Check your `Start URLs` and ensure `Crawler Type` matches the site technology. If the site uses JavaScript to render content, you MUST use Playwright.
- Garbage Content: Use `Remove Elements` to strip out headers, footers, and sidebars that clutter the training data.
Compatibility
Tested with n8n v1.0.0+.
License
MIT
