n8n-nodes-sitemap-parser

v0.1.2

Published a month ago

n8n community node for parsing sitemaps with recursive sitemap index traversal

Vulnerabilities: 0 High, 0 Medium, 0 Low

raevon

n8n-community-node-package sitemap xml crawler scraper

n8n-nodes-sitemap-parser

An n8n community node for parsing XML sitemaps with recursive sitemap index traversal.

Features

Recursive Sitemap Index Parsing: Automatically traverses nested <sitemapindex> structures, up to the configurable Max Recursion Depth (default 10)
Domain Auto-Discovery: Given a domain, discovers sitemaps from robots.txt and common paths
Direct Sitemap URL: Parse any sitemap URL directly
Gzip Support: Handles .xml.gz compressed sitemaps
Concurrency Control: Configurable limit on parallel HTTP requests
URL Filtering: Include or exclude URLs with regex patterns
Rich Output: Extracts lastmod, changefreq, and priority metadata
Loop Detection: Prevents infinite recursion by tracking already-visited sitemap URLs
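
To illustrate the gzip feature, here is a minimal Python sketch of how a compressed sitemap body can be detected and decoded. It is not taken from the node's source: the helper name `decode_sitemap_body` is hypothetical, and detecting gzip by its magic bytes (rather than by the `.xml.gz` extension) and decoding as UTF-8 are assumptions.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def decode_sitemap_body(body: bytes) -> str:
    """Return sitemap XML as text, decompressing gzip payloads first.

    Hypothetical helper: assumes the decompressed payload is UTF-8.
    """
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    return body.decode("utf-8")


plain = b"<urlset></urlset>"
compressed = gzip.compress(plain)

# Both the compressed and the uncompressed body decode to the same text.
print(decode_sitemap_body(compressed) == decode_sitemap_body(plain))
```

Checking the payload instead of the URL suffix also covers servers that deliver gzip data from a URL ending in plain `.xml`.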

Installation

In n8n, go to Settings → Community Nodes and install:

n8n-nodes-sitemap-parser

Or install via npm:

npm install n8n-nodes-sitemap-parser

Usage

Mode 1: Sitemap URL (Direct)

Provide a direct sitemap URL:

https://rothys.com/sitemap.xml

The node will:

1. Fetch the sitemap
2. If it's a <sitemapindex>, recursively fetch all child sitemaps
3. Extract all <url> entries with metadata
4. Output each URL as a separate n8n item
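
The steps above can be sketched in Python as follows. This is an illustrative sketch, not the node's implementation: `parse_sitemap` and the injected `fetch` callable are hypothetical names, `FAKE_SITE` is made-up sample data standing in for HTTP responses, and the meaning of `depth` (0 for the starting sitemap, plus 1 per index level) is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional, Set

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"  # sitemap protocol namespace


def parse_sitemap(
    url: str,
    fetch: Callable[[str], str],
    max_depth: int = 10,
    depth: int = 0,
    visited: Optional[Set[str]] = None,
) -> List[Dict[str, object]]:
    """Return one dict per <url> entry, following <sitemapindex> children."""
    visited = set() if visited is None else visited
    if url in visited or depth > max_depth:
        return []  # loop detection and depth limit
    visited.add(url)

    root = ET.fromstring(fetch(url))
    items: List[Dict[str, object]] = []

    if root.tag == NS + "sitemapindex":
        for child in root.findall(NS + "sitemap"):
            loc = (child.findtext(NS + "loc") or "").strip()
            if loc:
                items.extend(parse_sitemap(loc, fetch, max_depth, depth + 1, visited))
        return items

    for entry in root.findall(NS + "url"):
        item: Dict[str, object] = {
            "url": (entry.findtext(NS + "loc") or "").strip(),
            "depth": depth,
            "source": url,
        }
        for field in ("lastmod", "changefreq", "priority"):
            value = entry.findtext(NS + field)
            if value is not None:
                item[field] = value.strip()
        items.append(item)
    return items


# Made-up sample data; the index deliberately references itself.
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/sitemap.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/sitemap-products.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/products/widget</loc>"
        "<lastmod>2024-01-15</lastmod></url>"
        "</urlset>"
    ),
}

items = parse_sitemap("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
for item in items:
    print(item)
```

Passing `fetch` in as a parameter keeps the traversal logic testable without network access; the self-referencing index entry is skipped because its URL is already in `visited`.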

Mode 2: Domain (Auto-Discovery)

Provide a domain:

rothys.com

The node will:

1. Check robots.txt for Sitemap: directives
2. Try common sitemap paths (/sitemap.xml, /sitemap_index.xml, etc.)
3. Parse all discovered sitemaps recursively
4. Output all URLs
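
The discovery steps above can be sketched like this. The function names are hypothetical, only the two paths named in this README are used as fallbacks, and prefixing the domain with `https://` is an assumption; the node may build its candidate list differently.

```python
import re
from typing import List

# Only the two common paths this README names explicitly.
COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]

# "Sitemap:" field names in robots.txt are matched case-insensitively.
SITEMAP_LINE = re.compile(r"^[ \t]*sitemap:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Extract the URLs declared in 'Sitemap:' lines of a robots.txt body."""
    return SITEMAP_LINE.findall(robots_txt)


def candidate_sitemaps(domain: str, robots_txt: str) -> List[str]:
    """robots.txt entries first, then common paths, without duplicates."""
    declared = sitemaps_from_robots(robots_txt)
    fallbacks = [f"https://{domain}{path}" for path in COMMON_PATHS]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(declared + fallbacks))


robots = (
    "User-agent: *\n"
    "Disallow: /cart\n"
    "Sitemap: https://example.com/sitemap.xml\n"
    "sitemap: https://example.com/news.xml\n"
)

for candidate in candidate_sitemaps("example.com", robots):
    print(candidate)
```

Each candidate would then be fetched and parsed recursively; paths that do not exist are simply skipped.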

Options

| Option | Default | Description |
|--------|---------|-------------|
| Max Recursion Depth | 10 | Maximum depth for nested sitemap indexes |
| Concurrency | 5 | Max parallel HTTP requests |
| Request Timeout | 30s | Timeout per request |
| Custom User Agent | n8n-sitemap-parser/1.0 | User-Agent header |
| URL Filter Pattern | (none) | Regex to include only matching URLs |
| Exclude Pattern | (none) | Regex to exclude matching URLs |
| Include Metadata | true | Include lastmod, changefreq, priority |
| Flatten Output | true | One item per URL (false = single array) |
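
To make the Concurrency and Request Timeout options concrete, here is a small Python sketch of bounded parallel fetching with a semaphore. It shows one common way to cap parallel requests and is not necessarily how the node does it; `fetch_all` and `fake_fetch` are hypothetical names, and the fake fetcher only sleeps instead of performing HTTP.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
    timeout: float = 30.0,
) -> Dict[str, str]:
    """Fetch every URL, keeping at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str):
        async with semaphore:  # waits while `concurrency` fetches are running
            body = await asyncio.wait_for(fetch(url), timeout)
            return url, body

    pairs = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(pairs)


async def demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(url: str) -> str:
        # Stand-in for an HTTP request; records how many run at once.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return f"<body of {url}>"

    urls = [f"https://example.com/sitemap-{i}.xml" for i in range(20)]
    bodies = await fetch_all(urls, fake_fetch, concurrency=5)
    return len(bodies), peak


count, peak = asyncio.run(demo())
print(count, peak)
```

A lower limit is gentler on the target server; a higher one finishes large sitemap indexes sooner.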

Output Schema

Each URL item contains:

{
  "url": "https://example.com/products/widget",
  "lastmod": "2024-01-15",
  "changefreq": "weekly",
  "priority": "0.8",
  "depth": 2,
  "source": "https://example.com/sitemap-products.xml"
}
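
Note that `priority` arrives as a string in the example above. A downstream consumer may want typed values; the sketch below shows one way to convert them. `normalize_item` is a hypothetical helper and not part of the node. It assumes `lastmod` begins with a `YYYY-MM-DD` date, as in the sample, and treats the metadata fields as optional because Include Metadata can be turned off.

```python
import json
from datetime import date

RAW_ITEM = """
{
  "url": "https://example.com/products/widget",
  "lastmod": "2024-01-15",
  "changefreq": "weekly",
  "priority": "0.8",
  "depth": 2,
  "source": "https://example.com/sitemap-products.xml"
}
"""


def normalize_item(item: dict) -> dict:
    """Return a copy with priority as float and lastmod as a date, if present."""
    out = dict(item)
    if "priority" in out:
        out["priority"] = float(out["priority"])
    if "lastmod" in out:
        # Only the leading YYYY-MM-DD part is parsed; any time part is ignored.
        out["lastmod"] = date.fromisoformat(out["lastmod"][:10])
    return out


item = normalize_item(json.loads(RAW_ITEM))
print(item["priority"], item["lastmod"].year)
```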

Example Workflows

Crawl all product pages from a store

[Sitemap Parser] → [HTTP Request] → [Extract Content]
  url: store.com
  filter: .*\/products\/.*

Get all blog post URLs

[Sitemap Parser] → [Filter] → [Next Steps]
  url: https://blog.example.com/sitemap.xml
  exclude: .*\.(jpg|png|gif|css|js)$
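
The two patterns used in these workflows can be tried offline with a short Python sketch. `filter_urls` is a hypothetical helper; it assumes a pattern may match anywhere in the URL and that the exclude pattern is applied after the include pattern. The node presumably evaluates patterns with its own regex engine, but these two patterns behave the same in Python's `re` module.

```python
import re
from typing import Iterable, List, Optional


def filter_urls(
    urls: Iterable[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Keep URLs matching `include` (if given) and not matching `exclude`."""
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None
    kept = []
    for url in urls:
        if include_re and not include_re.search(url):
            continue
        if exclude_re and exclude_re.search(url):
            continue
        kept.append(url)
    return kept


# Made-up sample URLs.
urls = [
    "https://store.com/products/widget",
    "https://store.com/about",
    "https://store.com/assets/logo.png",
]

products_only = filter_urls(urls, include=r".*\/products\/.*")
no_assets = filter_urls(urls, exclude=r".*\.(jpg|png|gif|css|js)$")
print(products_only)
print(no_assets)
```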

Development

# Install dependencies
npm install

# Build
npm run build

# Development with hot reload
npm run dev

# Lint
npm run lint

License

MIT
