npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

nodefisher

v1.0.0

Published

CLI tool to scan a website's sitemap URLs for an HTML query selector using a headless browser context.

Readme

Sitemap HTML Selector Scanner CLI (Playwright Edition)

A fast Node.js CLI tool that crawls a website's sitemap to gather all URLs, loads each page concurrently in a headless browser (using Playwright), and waits for client-side JavaScript/APIs to render before checking for a specified CSS/HTML query selector.

This allows detection of dynamic elements (e.g. login/newsletter forms, React/Vue/Angular rendered components, and third-party widgets) that are not present in static HTML sources.

🚀 Features

  • Programmatic Sitemap Gathering: Parses robots.txt and recursively loads nested XML sitemaps to discover all page URLs.
  • Dynamic JS Rendering: Uses headless Chromium to run all page scripts, load APIs, and render dynamic contents.
  • Smart Element Waiting: Uses Playwright's waitForSelector to detect when target nodes are rendered, with a 5-second automatic timeout.
  • Controlled Concurrency: Set a custom parallel tab/context limit (default: 5 concurrent pages) to optimize scraping speed without overloading systems or getting rate-limited.
  • Isolated Browser Contexts: Every page is opened in its own isolated browser context, avoiding cookie/session contamination and maintaining clean execution states.
  • Multiple Output Formats: Save results as standard line-delimited text (txt), a structured json array, or print matches directly to standard output (stdout) in real-time.
  • Clean Output Streams: Matches write cleanly to stdout or files, while progress logs, warnings, and stats are sent to stderr for clean terminal piping.

📦 Installation

Ensure you have Node.js installed (version 18+ is recommended).

  1. Install the CLI tool globally:
    npm install -g nodefisher
    (Note: This will automatically download the required Chromium browser binary during the installation process.)

🛠️ Usage

Run the tool using nodefisher:

nodefisher <url> <selector> [options]

Required Arguments

  • <url>: The frontpage/homepage URL of the website to crawl (e.g., https://example.com or simply example.com).
  • <selector>: The CSS/HTML query selector to search for (e.g. ".target-class", "#main-title", or ".newsletter-signup").

Optional Flags

  • -f, --format <format>: Output format. Options: txt, json, or stdout (default: txt).
  • -o, --output <file>: Custom path to save the output file. Defaults to results.txt (for txt format) or results.json (for json format).
  • -c, --concurrency <number>: Number of concurrent browser pages to process in parallel (default: 5).
  • -v, --verbose: Print detailed page processing, info messages (timeouts vs. success), and match detections to stderr.
  • -h, --help: Display the help message.

💡 Examples

1. Basic Dynamic Scan (Default Text Output File)

Crawls https://example.com for dynamic elements matching .newsletter-signup and saves to results.txt:

nodefisher https://example.com ".newsletter-signup"

2. Save Results as JSON File (Verbose Mode)

Crawls and saves matching URLs into matches.json while printing detailed browser loading logs:

nodefisher example.com ".newsletter-signup" --format json --output matches.json --verbose

3. Print directly to Stdout (Real-time Streaming)

Prints matching URLs directly to standard output as they are found.

nodefisher https://example.com ".dynamic-element" --format stdout

4. Custom Concurrency

Scan a site using 10 concurrent browser context workers:

nodefisher example.com ".target-class" --concurrency 10

⚙️ How it Works

  1. URL Discovery: The CLI takes the target URL, parses its robots.txt file, fetches sitemaps recursively, and outputs all unique URLs on the domain.
  2. Headless Browser Setup: Playwright launches a headless Chromium instance.
  3. Parallel Worker Pool: A worker queue pulls URLs. It initializes isolated browser contexts up to the --concurrency limit.
  4. JS Page Load: Each worker navigates to its URL, allowing JavaScript to run and external APIs to load.
  5. Selector Matching: The script invokes Playwright's page.waitForSelector(selector, { timeout: 5000 }). If the element renders within 5 seconds, the page is flagged as a match. If it does not appear (or the page fails to load), the page is skipped.
  6. Output Writing: Matching URLs are saved/printed based on the chosen output format.