n8n-nodes-sitemap-parser
v0.1.2
An n8n community node for parsing XML sitemaps with recursive sitemap index traversal.
Features
- Recursive Sitemap Index Parsing: Automatically traverses nested `<sitemapindex>` structures, up to the configured maximum recursion depth
- Domain Auto-Discovery: Given a domain, discovers sitemaps from `robots.txt` and common paths
- Direct Sitemap URL: Parse any sitemap URL directly
- Gzip Support: Handles `.xml.gz` compressed sitemaps
- Concurrency Control: Configurable parallel request limits
- URL Filtering: Include/exclude URLs with regex patterns
- Rich Output: Extracts `lastmod`, `changefreq`, `priority` metadata
- Loop Detection: Prevents infinite recursion by tracking visited sitemap URLs
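To make the Concurrency Control feature concrete, here is a minimal sketch of a worker pool that keeps at most `limit` requests in flight. The helper name `mapWithConcurrency` is hypothetical and is not taken from the node's source; the actual implementation may differ.

```typescript
// Hypothetical helper for illustration; not the node's actual code.
// Runs `task` over `items` with at most `limit` tasks in flight at once
// and returns the results in the same order as the inputs.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next item that has not been claimed yet

  // Each worker repeatedly claims the next unclaimed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await task(items[i] as T);
    }
  }

  // Start at least one worker, and never more workers than items.
  const workerCount = Math.max(1, Math.min(limit, items.length));
  const workers = Array.from({ length: workerCount }, () => worker());
  await Promise.all(workers);
  return results;
}
```

With the default Concurrency of 5, a call such as `mapWithConcurrency(childSitemapUrls, 5, fetchOne)` would fetch child sitemaps five at a time, where `childSitemapUrls` and `fetchOne` are placeholders for whatever the caller supplies.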
Installation
In n8n, go to Settings → Community Nodes and install:

    n8n-nodes-sitemap-parser

Or install via npm:

    npm install n8n-nodes-sitemap-parser

Usage
Mode 1: Sitemap URL (Direct)
Provide a direct sitemap URL:

    https://rothys.com/sitemap.xml

The node will:
- Fetch the sitemap
- If it's a `<sitemapindex>`, recursively fetch all child sitemaps
- Extract all `<url>` entries with metadata
- Output each URL as a separate n8n item
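The steps above can be sketched as a small recursive function. This is an illustration under stated assumptions, not the node's source: the names `crawlSitemap`, `extractLocs`, `FetchText` and `UrlEntry` are hypothetical, a regex stands in for a real XML parser, the fetcher is passed in so the sketch runs without network access, and depth is assumed to start at 0 for the sitemap you provide.

```typescript
// Hypothetical types and names for illustration only.
type FetchText = (url: string) => Promise<string>;

interface UrlEntry {
  url: string;
  depth: number;
  source: string;
}

// Collect the text of every <loc> element in an XML string.
// A regex is enough for a sketch; real code should use an XML parser.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) {
    const loc = m[1];
    if (loc) locs.push(loc);
  }
  return locs;
}

async function crawlSitemap(
  url: string,
  fetchText: FetchText,
  maxDepth = 10,
  depth = 0,
  visited: Set<string> = new Set(),
): Promise<UrlEntry[]> {
  // Loop detection and depth limit: skip anything already seen or too deep.
  if (visited.has(url) || depth > maxDepth) return [];
  visited.add(url);

  const xml = await fetchText(url);
  const locs = extractLocs(xml);

  // A sitemap index lists child sitemaps, so recurse into each one.
  if (/<sitemapindex[\s>]/.test(xml)) {
    const results: UrlEntry[] = [];
    for (const child of locs) {
      results.push(...(await crawlSitemap(child, fetchText, maxDepth, depth + 1, visited)));
    }
    return results;
  }

  // A plain sitemap: every <loc> is a page URL.
  return locs.map((loc) => ({ url: loc, depth, source: url }));
}
```

The shared `visited` set is what stops a sitemap index that references itself from recursing forever. Children are fetched one at a time here for simplicity.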
Mode 2: Domain (Auto-Discovery)
Provide a domain:

    rothys.com

The node will:
- Check `robots.txt` for `Sitemap:` directives
- Try common sitemap paths (`/sitemap.xml`, `/sitemap_index.xml`, etc.)
- Parse all discovered sitemaps recursively
- Output all URLs
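As a rough illustration of the discovery step, the sketch below builds a list of candidate sitemap URLs from the text of a `robots.txt` file plus fallback paths. The function names are hypothetical, the fallback list contains only the two paths named above, and the sketch assumes `https` and always appends the fallbacks after the `robots.txt` entries. The node itself may order or short-circuit these checks differently.

```typescript
// Hypothetical helpers for illustration; not the node's actual code.

// Extract the value of every "Sitemap:" directive (case-insensitive).
function sitemapsFromRobots(robotsTxt: string): string[] {
  const found: string[] = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    const m = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
    const value = m?.[1];
    if (value) found.push(value);
  }
  return found;
}

// Only the two paths named in this README; the real list may be longer.
const COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml"];

function candidateSitemapUrls(domain: string, robotsTxt: string): string[] {
  // Accept "example.com" as well as "https://example.com/some/path".
  const host = domain.replace(/^https?:\/\//i, "").replace(/\/.*$/, "");
  const origin = `https://${host}`; // assumption: the site is served over https

  const fromRobots = sitemapsFromRobots(robotsTxt);
  const fallbacks = COMMON_PATHS.map((p) => origin + p);

  // A Set keeps insertion order and drops duplicates.
  return Array.from(new Set([...fromRobots, ...fallbacks]));
}
```

Each candidate would then be fetched and parsed recursively, and candidates that do not resolve would simply be skipped.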
Options
| Option | Default | Description |
|--------|---------|-------------|
| Max Recursion Depth | 10 | Maximum depth for nested sitemap indexes |
| Concurrency | 5 | Max parallel HTTP requests |
| Request Timeout | 30s | Timeout per request |
| Custom User Agent | n8n-sitemap-parser/1.0 | User-Agent header |
| URL Filter Pattern | — | Regex to include only matching URLs |
| Exclude Pattern | — | Regex to exclude matching URLs |
| Include Metadata | true | Include lastmod, changefreq, priority |
| Flatten Output | true | One item per URL (false = single array) |
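The two pattern options can be pictured as a pair of regex tests applied to each URL. The sketch below is an assumption about how they combine (include first, then exclude, both as unanchored matches). The option names `urlFilterPattern` and `excludePattern` are hypothetical stand-ins for the table rows above.

```typescript
// Hypothetical option names for illustration; not the node's actual parameters.
interface FilterOptions {
  urlFilterPattern?: string; // keep only URLs matching this regex
  excludePattern?: string; // drop URLs matching this regex
}

function filterUrls(urls: string[], opts: FilterOptions): string[] {
  // An empty or missing pattern means "no constraint".
  const include = opts.urlFilterPattern ? new RegExp(opts.urlFilterPattern) : null;
  const exclude = opts.excludePattern ? new RegExp(opts.excludePattern) : null;

  return urls.filter(
    (u) => (!include || include.test(u)) && (!exclude || !exclude.test(u)),
  );
}

// Usage with placeholder URLs: keep product pages, drop static assets.
const productPages = filterUrls(
  [
    "https://store.com/products/widget",
    "https://store.com/about",
    "https://store.com/products/widget.jpg",
  ],
  { urlFilterPattern: ".*\\/products\\/.*", excludePattern: ".*\\.(jpg|png|gif|css|js)$" },
);
```

Because `RegExp.prototype.test` matches anywhere in the string, the leading and trailing `.*` in a pattern are optional in this sketch.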
Output Schema
Each URL item contains:

    {
      "url": "https://example.com/products/widget",
      "lastmod": "2024-01-15",
      "changefreq": "weekly",
      "priority": "0.8",
      "depth": 2,
      "source": "https://example.com/sitemap-products.xml"
    }

Example Workflows
Crawl all product pages from a store

    [Sitemap Parser] → [HTTP Request] → [Extract Content]
    url: store.com
    filter: .*\/products\/.*

Get all blog post URLs

    [Sitemap Parser] → [Filter] → [Next Steps]
    url: https://blog.example.com/sitemap.xml
    exclude: .*\.(jpg|png|gif|css|js)$

Development

    # Install dependencies
    npm install

    # Build
    npm run build

    # Development with hot reload
    npm run dev

    # Lint
    npm run lint

License
MIT
