
sitemap-xml-parser

v1.4.1

Published

Parses sitemap XML files and returns all listed URLs. CLI and library. Supports TSV/JSON output, filtering, sitemap index files, and gzip compression.

Readme

sitemap-xml-parser

Parses sitemap XML files and returns all listed URLs. Can be used as a CLI tool or a Node.js library.

  • Follows sitemap index files recursively and decompresses gzip automatically
  • Supports custom request headers, concurrency control, and request timeouts
  • CLI: outputs plain URLs, TSV, or JSON with configurable field selection (--fields)
  • CLI: filters URLs by substring or regular expression

Installation

npm install sitemap-xml-parser

CLI

Run without installing via npx:

npx sitemap-xml-parser <url> [options]

Or, after installing globally (npm install -g sitemap-xml-parser):

sitemap-xml-parser <url> [options]

Fetched URLs are printed to stdout, one per line. Errors are printed to stderr. See Options for available flags.

Examples

# Print all URLs
npx sitemap-xml-parser https://example.com/sitemap.xml

# Save URLs to a file, errors to a log
npx sitemap-xml-parser https://example.com/sitemap.xml > urls.txt 2> errors.log

# Count URLs
npx sitemap-xml-parser https://example.com/sitemap.xml --count

# Stop after 100 entries
npx sitemap-xml-parser https://example.com/sitemap.xml --cap 100

# Filter and count
npx sitemap-xml-parser https://example.com/sitemap.xml --filter "blog" --count

# Filter by regular expression
npx sitemap-xml-parser https://example.com/sitemap.xml --filter-regex "blog/[0-9]{4}/"

# Output as TSV (loc, lastmod, changefreq, priority)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv

Some sitemaps include extension fields such as image:image or news:news beyond the standard four (loc, lastmod, changefreq, priority). If you need those fields in your output, run --list-fields first to discover what's available, then select them with --fields.

# Output as JSON with all fields (all fields present in the source XML are included by default)
npx sitemap-xml-parser https://example.com/sitemap.xml --format json

# Discover all fields present in a sitemap
npx sitemap-xml-parser https://example.com/sitemap.xml --list-fields

# Output as TSV with custom columns (e.g. image sitemap extension)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv --fields loc,image:image

# Output as TSV with all fields (fetches twice: once to discover fields, once to output)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv \
  --fields "$(npx sitemap-xml-parser https://example.com/sitemap.xml --list-fields | paste -sd, -)"

Options

CLI

| Flag | Default | Description |
|--------------------------|---------|-------------|
| `--delay <ms>` | 1000 | Milliseconds to wait between batches when following a sitemap index. `--limit` URLs are fetched in parallel per batch; after each batch completes, the process waits `--delay` ms before starting the next. Set to 0 to disable. |
| `--limit <n>` | 10 | Number of child sitemaps to fetch concurrently per batch. |
| `--timeout <ms>` | 30000 | Milliseconds before a request is aborted. |
| `--cap <n>` | — | Stop collecting after this many URL entries. Useful for sampling large sitemaps. |
| `--header <Name: Value>` | — | Add a request header. Repeatable. Single: `--header "User-Agent: MyBot/1.0"`. Multiple: `--header "User-Agent: MyBot/1.0" --header "Authorization: Bearer token"` |
| `--filter <str>` | — | Only output URLs whose `loc` contains the given string (substring match). Can be combined with `--count` or `--format`. |
| `--filter-regex <regex>` | — | Only output URLs whose `loc` matches the given regular expression. Invalid patterns exit non-zero. Can be combined with `--count` or `--format`. |
| `--format <fmt>` | — | Output format: `tsv` prints a header row followed by one tab-separated row per entry; `json` outputs a JSON array of entry objects including all fields from the source XML. |
| `--fields <f1,f2,...>` | — | Comma-separated list of fields to include in the output. Requires `--format`. For `tsv`, defaults to `loc,lastmod,changefreq,priority`. For `json`, defaults to all fields. Nested values are serialized as JSON in TSV output. |
| `--list-fields` | — | Print all field names found across every entry, one per line. Scans the entire sitemap and outputs the union of all keys seen. Useful for discovering available fields before using `--fields`. Compatible with `--filter` and `--filter-regex`. Cannot be combined with `--format`, `--fields`, `--cap`, or `--count`. |
| `--count` | — | Print only the total number of URLs. |
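
The `--limit`/`--delay` batching described above can be sketched as follows. This is a simplified model for illustration, not the package's actual implementation; `fetchFn` is a placeholder for whatever fetches and parses one child sitemap:

```javascript
// Simplified model of --limit / --delay batching: process URLs in
// batches of `limit`, fetch each batch in parallel, then pause
// `delay` ms between batches.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchInBatches(urls, fetchFn, { limit = 10, delay = 1000 } = {}) {
  const results = [];
  for (let i = 0; i < urls.length; i += limit) {
    const batch = urls.slice(i, i + limit);
    // All child sitemaps in a batch are fetched concurrently.
    results.push(...(await Promise.all(batch.map(fetchFn))));
    // Wait between batches; skipped after the last batch and when delay is 0.
    if (delay > 0 && i + limit < urls.length) await sleep(delay);
  }
  return results;
}
```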

Library

| Option | Type | Default | Description |
|-----------|----------|---------|-------------|
| `delay` | number | 1000 | Same as `--delay`. |
| `limit` | number | 10 | Same as `--limit`. |
| `timeout` | number | 30000 | Same as `--timeout`. |
| `cap` | number | — | Same as `--cap`. |
| `headers` | object | — | Key-value map of request headers. Same as repeated `--header`. |
| `onError` | function | — | Called as `onError(url, error)` when a fetch or parse fails. The entry is skipped regardless. |
| `onEntry` | function | — | Called as `onEntry(entry)` each time a URL entry is parsed. `entry` has the same shape as the objects returned by `fetch()`. |

Features

  • Follows Sitemap Index files recursively, including nested indexes (Index within an Index)
  • Automatically decompresses gzip: supports both .gz URLs and Content-Encoding: gzip responses
  • Batch processing: fetches limit child sitemaps in parallel per batch, then waits delay ms after each batch completes
  • Automatically follows redirects (301/302/303/307/308) up to 5 hops; errors beyond that are reported via onError. Custom request headers are forwarded only when the redirect stays on the same origin (same scheme, host, and port); they are stripped on cross-origin redirects.
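
The same-origin rule for header forwarding amounts to comparing scheme, host, and port of the two URLs. A minimal sketch of that check (an illustration of the rule, not the package's actual code; note that WHATWG `URL` normalizes away default ports, so `https://example.com` and `https://example.com:443` compare equal):

```javascript
// Decide whether custom headers may be forwarded on a redirect:
// scheme, host, and port must all match between the two URLs.
function sameOrigin(fromUrl, toUrl) {
  const a = new URL(fromUrl);
  const b = new URL(toUrl);
  return a.protocol === b.protocol && a.hostname === b.hostname && a.port === b.port;
}
```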

Usage

const SitemapXMLParser = require('sitemap-xml-parser');

const parser = new SitemapXMLParser('https://example.com/sitemap.xml');

(async () => {
    const urls = await parser.fetch();
    urls.forEach(entry => {
        console.log(entry.loc);
    });
})();

Custom headers

const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
    headers: {
        'User-Agent': 'MyBot/2.0',
        'Authorization': 'Bearer my-token',
    },
});

Error handling with onError

Failed URLs (network errors, non-2xx responses, malformed XML) are skipped by default. Provide an onError callback to inspect them:

const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
    onError: (url, err) => {
        console.error(`Skipped ${url}: ${err.message}`);
    },
});

Return value

fetch() resolves to an array of URL entry objects. Each object contains all fields present in the source XML — no field selection is applied at the library level:

[
  {
    loc:        'https://example.com/page1',
    lastmod:    '2024-01-01',
    changefreq: 'weekly',
    priority:   '0.8',
  },
  // ...
]

loc is always a string. Standard fields (lastmod, changefreq, priority) are strings when present, or undefined when absent from the source XML.
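
Because entries are plain objects, standard array methods work directly on the result of fetch(). A small sketch using a hardcoded array in the shape shown above (the URLs are placeholders):

```javascript
// Entries in the shape fetch() resolves to, hardcoded for illustration.
const entries = [
  { loc: 'https://example.com/page1', lastmod: '2024-01-01', changefreq: 'weekly', priority: '0.8' },
  { loc: 'https://example.com/page2' }, // optional fields may be absent
  { loc: 'https://example.com/blog/post', lastmod: '2024-03-15' },
];

// Keep only entries that carry a lastmod, newest first.
const dated = entries
  .filter((e) => e.lastmod !== undefined)
  .sort((a, b) => b.lastmod.localeCompare(a.lastmod));
```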

Sitemap extension fields (e.g. image:image, news:news, video:video) are also preserved as-is when present in the source XML. Their values reflect the structure parsed by the underlying XML parser — nested elements become objects.
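
One way to picture the "nested elements become objects" rule, and how the CLI's TSV output serializes such values as JSON. The entry below is hypothetical; the exact shape of an extension field depends on the source XML and the underlying parser:

```javascript
// Hypothetical entry carrying an image sitemap extension field:
// nested XML elements surface as plain objects on the entry.
const entry = {
  loc: 'https://example.com/page1',
  'image:image': { 'image:loc': 'https://example.com/hero.jpg' },
};

// Sketch of a TSV cell formatter: nested values become JSON strings,
// scalar fields pass through unchanged.
const cell = (v) => (typeof v === 'object' && v !== null ? JSON.stringify(v) : String(v));
```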