sitemap-xml-parser

v1.4.1

Published

2 days ago

Parses sitemap XML files and returns all listed URLs. CLI and library. Supports TSV/JSON output, filtering, sitemap index files, and gzip compression.

0High
0Medium
0Low

shinkawax

sitemap xml parse gzip sitemap-index cli tsv json

sitemap-xml-parser

Parses sitemap XML files and returns all listed URLs. Can be used as a CLI tool or a Node.js library.

Follows sitemap index files recursively and decompresses gzip automatically
Supports custom request headers, concurrency control, and request timeouts
CLI: outputs plain URLs, TSV, or JSON with configurable field selection (--fields)
CLI: filters URLs by substring or regular expression

Installation

npm install sitemap-xml-parser

CLI

Run without installing via npx:

npx sitemap-xml-parser <url> [options]

Or, after installing globally (npm install -g sitemap-xml-parser):

sitemap-xml-parser <url> [options]

Fetched URLs are printed to stdout, one per line. Errors are printed to stderr. See Options for available flags.

Examples

# Print all URLs
npx sitemap-xml-parser https://example.com/sitemap.xml

# Save URLs to a file, errors to a log
npx sitemap-xml-parser https://example.com/sitemap.xml > urls.txt 2> errors.log

# Count URLs
npx sitemap-xml-parser https://example.com/sitemap.xml --count

# Stop after 100 entries
npx sitemap-xml-parser https://example.com/sitemap.xml --cap 100

# Filter and count
npx sitemap-xml-parser https://example.com/sitemap.xml --filter "blog" --count

# Filter by regular expression
npx sitemap-xml-parser https://example.com/sitemap.xml --filter-regex "blog/[0-9]{4}/"

# Output as TSV (loc, lastmod, changefreq, priority)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv

Some sitemaps include extension fields such as image:image or news:news beyond the standard four. If you need to include those fields in your output, use --list-fields to find out what's available first.

# Output as JSON with all fields (all fields present in the source XML are included by default)
npx sitemap-xml-parser https://example.com/sitemap.xml --format json

# Discover all fields present in a sitemap
npx sitemap-xml-parser https://example.com/sitemap.xml --list-fields

# Output as TSV with custom columns (e.g. image sitemap extension)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv --fields loc,image:image

# Output as TSV with all fields (fetches twice: once to discover fields, once to output)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv \
  --fields "$(npx sitemap-xml-parser https://example.com/sitemap.xml --list-fields | paste -sd, -)"

Options

CLI

| Flag | Default | Description | |-------------------------|---------|-----------------------------------------------------------------------------| | --delay <ms> | 1000 | Milliseconds to wait between batches when following a sitemap index. --limit URLs are fetched in parallel per batch; after each batch completes, the process waits --delay ms before starting the next. Set to 0 to disable. | | --limit <n> | 10 | Number of child sitemaps to fetch concurrently per batch. | | --timeout <ms> | 30000 | Milliseconds before a request is aborted. | | --cap <n> | — | Stop collecting after this many URL entries. Useful for sampling large sitemaps. | | --header <Name: Value>| — | Add a request header. Repeatable. Single: --header "User-Agent: MyBot/1.0". Multiple: --header "User-Agent: MyBot/1.0" --header "Authorization: Bearer token" | | --filter <str> | — | Only output URLs whose loc contains the given string (substring match). Can be combined with --count or --format. | | --filter-regex <regex>| — | Only output URLs whose loc matches the given regular expression. Invalid patterns exit non-zero. Can be combined with --count or --format. | | --format <fmt> | — | Output format: tsv prints a header row followed by one tab-separated row per entry; json outputs a JSON array of entry objects including all fields from the source XML. | | --fields <f1,f2,...> | — | Comma-separated list of fields to include in the output. Requires --format. For tsv, defaults to loc,lastmod,changefreq,priority. For json, defaults to all fields. Nested values are serialized as JSON in TSV output. | | --list-fields | — | Print all field names found across every entry, one per line. Scans the entire sitemap and outputs the union of all keys seen. Useful for discovering available fields before using --fields. Compatible with --filter and --filter-regex. Cannot be combined with --format, --fields, --cap, or --count. | | --count | — | Print only the total number of URLs. |

Library

| Option | Type | Default | Description | |-----------|------------|---------|------------------------------------| | delay | number | 1000 | Same as --delay. | | limit | number | 10 | Same as --limit. | | timeout | number | 30000 | Same as --timeout. | | cap | number | — | Same as --cap. | | headers | object | — | Key-value map of request headers. Same as repeated --header. | | onError | function | — | Called as onError(url, error) when a fetch or parse fails. The entry is skipped regardless. | | onEntry | function | — | Called as onEntry(entry) each time a URL entry is parsed. entry has the same shape as the objects returned by fetch(). |

Features

Follows Sitemap Index files recursively, including nested indexes (Index within an Index)
Automatically decompresses gzip: supports both .gz URLs and Content-Encoding: gzip responses
Batch processing: fetches limit child sitemaps in parallel per batch, then waits delay ms after each batch completes
Automatically follows redirects (301/302/303/307/308) up to 5 hops; errors beyond that are reported via onError. Custom request headers are forwarded only when the redirect stays on the same origin (same scheme, host, and port); they are stripped on cross-origin redirects.

Usage

const SitemapXMLParser = require('sitemap-xml-parser');

const parser = new SitemapXMLParser('https://example.com/sitemap.xml');

(async () => {
    const urls = await parser.fetch();
    urls.forEach(entry => {
        console.log(entry.loc);
    });
})();

Custom headers

const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
    headers: {
        'User-Agent': 'MyBot/2.0',
        'Authorization': 'Bearer my-token',
    },
});

Error handling with `onError`

Failed URLs (network errors, non-2xx responses, malformed XML) are skipped by default. Provide an onError callback to inspect them:

const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
    onError: (url, err) => {
        console.error(`Skipped ${url}: ${err.message}`);
    },
});

Return value

fetch() resolves to an array of URL entry objects. Each object contains all fields present in the source XML — no field selection is applied at the library level:

[
  {
    loc:        'https://example.com/page1',
    lastmod:    '2024-01-01',
    changefreq: 'weekly',
    priority:   '0.8',
  },
  // ...
]

loc is always a string. Standard fields (lastmod, changefreq, priority) are strings when present, or undefined when absent from the source XML.

Sitemap extension fields (e.g. image:image, news:news, video:video) are also preserved as-is when present in the source XML. Their values reflect the structure parsed by the underlying XML parser — nested elements become objects.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

sitemap-xml-parser

Installation

CLI

Examples

Options

CLI

Library

Features

Usage

Custom headers

Error handling with onError

Return value

Error handling with `onError`