n8n-nodes-ai-training-scraper
v1.0.2
Scrape and chunk websites into AI-ready training data for RAG, LLM fine-tuning, and vector databases.
This is an n8n community node that lets you scrape websites and convert them into AI-ready training data using the Apify AI Training Data Scraper. It intelligently chunks content for RAG (Retrieval-Augmented Generation), LLM fine-tuning, and vector databases.
n8n is a fair-code licensed workflow automation platform.
Features
- Smart Scraping: Choose between Cheerio (fast, static) or Playwright (headless browser for JS-heavy sites).
- Intelligent Chunking:
  - Semantic: Splits by meaning (recommended for RAG).
  - Fixed Token: Enforces strict token limits.
  - Sentence Based: Preserves sentence boundaries.
  - Markdown Section: Splits by headers.
- AI-Ready Output: Formats data specifically for vector databases (Pinecone, Weaviate, etc.) or fine-tuning datasets.
- Advanced Control:
  - Remove elements by CSS selector (ads, navbars).
  - Respect robots.txt.
  - Recursively follow links with depth control.
  - Extract metadata (author, date, keywords).
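To make the chunking strategies concrete, here is a minimal sketch of the "Markdown Section" idea: split a document on headers so each chunk covers one section. This illustrates the concept only; it is not the node's actual implementation.

```javascript
// Split markdown into one chunk per header-led section.
// Illustrative sketch, not the actor's real chunker.
function splitByMarkdownSections(markdown) {
  const lines = markdown.split('\n');
  const chunks = [];
  let current = [];
  for (const line of lines) {
    // A line starting with 1-6 '#' characters opens a new section.
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      chunks.push(current.join('\n').trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) chunks.push(current.join('\n').trim());
  return chunks.filter((c) => c.length > 0);
}

const sampleDoc = '# Intro\nHello.\n## Setup\nInstall it.\n## Usage\nRun it.';
console.log(splitByMarkdownSections(sampleDoc).length); // → 3, one chunk per header
```

Semantic chunking works similarly but decides split points by topical similarity rather than by markup.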
Installation
Follow the instructions for installing a community node in your n8n instance.
- Go to Settings > Community Nodes.
- Select Install.
- Enter the package name: n8n-nodes-ai-training-scraper.
Alternatively, if running via npm:
npm install n8n-nodes-ai-training-scraper

Configuration
You need an Apify API Token to use this node.
- Log in to your Apify Console.
- Go to Settings > Integrations.
- Copy your API Token.
- In n8n, add a new Credential for Apify API and paste the token.
Usage Examples
1. Basic Documentation Scraping
Scrape a documentation site and prepare it for a vector store.
- Operation: Scrape and Chunk
- Start URLs: https://docs.python.org/3/
- Chunking: Semantic
- Output Format: Vector Ready
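The exact output schema is defined by the Apify actor, but a "Vector Ready" record typically pairs chunk text with source metadata and a stable ID for upserts. The field names below are illustrative assumptions, not the actor's documented schema; inspect a real run's output before wiring up your vector store.

```javascript
// Hypothetical shape of one "Vector Ready" chunk record.
// Field names are assumptions for illustration only.
const exampleChunk = {
  id: 'docs-python-org-0001',              // stable ID so re-runs upsert, not duplicate
  text: 'The Python Tutorial covers ...',  // the chunk content to embed
  metadata: {
    url: 'https://docs.python.org/3/tutorial/',
    title: 'The Python Tutorial',
    chunkIndex: 0,
  },
};
console.log(Object.keys(exampleChunk)); // → [ 'id', 'text', 'metadata' ]
```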
2. RAG Pipeline
Build a chatbot that answers questions based on your website.
- AI Training Scraper: Scrapes your blog.
- OpenAI Embeddings: Converts chunks to vectors.
- Pinecone: Stores the vectors.
- LangChain: Queries Pinecone for context.
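The retrieval step of this pipeline reduces to nearest-neighbor search over embedding vectors. Pinecone does that at scale server-side; as a sketch of what it ranks by (under the cosine metric), not of the Pinecone API:

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored chunks against a query vector, highest similarity first.
function topK(queryVec, records, k) {
  return records
    .map((r) => ({ ...r, score: cosineSimilarity(queryVec, r.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const stored = [
  { id: 'a', vector: [1, 0] },
  { id: 'b', vector: [0, 1] },
];
console.log(topK([1, 0], stored, 1)[0].id); // → 'a'
```

The top-k chunks' `text` fields are what you stuff into the LLM prompt as context.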
3. Multi-site Knowledge Base
Combine multiple sources into one dataset.
- Start URLs: https://docs.example.com, https://blog.example.com
- Max Pages: 500
- Crawler Type: Playwright (to handle dynamic content)
Parameters Guide
Essential
- Start URLs: Where the crawler begins. Can be multiple comma-separated URLs.
- Crawler Type:
  - Cheerio: Much faster and cheaper, but acts like `curl`. Good for static HTML.
  - Playwright: Uses a real browser. Essential for React/Vue/Angular sites, but slower.
Chunking
- Strategy: How to split the text. `Semantic` uses basic NLP to keep related text together.
- Chunk Size: Target size in tokens (approx. 4 characters per token).
- Chunk Overlap: How many tokens to repeat between chunks to preserve context at boundaries.
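The 4-characters-per-token figure is only a rule of thumb, but it lets you sanity-check chunk settings before a run. A sketch of the arithmetic (real tokenizers will differ):

```javascript
// Rough token estimate using the ~4 chars/token rule of thumb.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Each new chunk advances by (chunkSize - chunkOverlap) tokens, so a
// document of totalTokens yields roughly this many chunks.
function estimateChunkCount(totalTokens, chunkSize, chunkOverlap) {
  const stride = chunkSize - chunkOverlap;
  return Math.max(1, Math.ceil((totalTokens - chunkOverlap) / stride));
}

console.log(estimateTokens('a'.repeat(2000)));  // → 500 tokens
console.log(estimateChunkCount(2000, 512, 50)); // → 5 chunks
```

Larger overlaps preserve more context across boundaries but inflate the chunk count (and embedding cost).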
Advanced
- Max Crawl Depth: How many clicks away from the start URL to go.
- Remove Elements: CSS selectors to strip out before processing (e.g., `nav, .footer, .ad-banner`).
- URL Patterns: Only scrape URLs matching these globs (e.g., `**/blog/**`).
- Exclude URL Patterns: Skip URLs matching these globs (e.g., `**/login`).
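To show the glob semantics (`**` crosses path segments, `*` stays within one), here is a rough sketch of how such patterns can be matched. The actor has its own matcher; this only illustrates the behavior:

```javascript
// Convert a simple glob to a RegExp: ** matches anything including
// slashes, * matches within a single path segment. Illustrative only.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const body = escaped
    .replace(/\*\*/g, '\u0000')  // protect ** before handling single *
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '.*');
  return new RegExp('^' + body + '$');
}

const blogOnly = globToRegExp('**/blog/**');
console.log(blogOnly.test('https://docs.example.com/blog/post-1')); // → true
console.log(blogOnly.test('https://docs.example.com/pricing'));     // → false
```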
Troubleshooting
- Rate Limits: If you see 429 errors or timeouts, reduce `Max Concurrency` in Advanced Options.
- Empty Results: Check your `Start URLs` and ensure `Crawler Type` matches the site technology. If the site uses JavaScript to render content, you MUST use Playwright.
- Garbage Content: Use `Remove Elements` to strip out headers, footers, and sidebars that clutter the training data.
Compatibility
Tested with n8n v1.0.0+.
License
MIT
