nodefisher
v1.0.0
CLI tool to scan a website's sitemap URLs for an HTML query selector using a headless browser context.
Sitemap HTML Selector Scanner CLI (Playwright Edition)
A fast Node.js CLI tool that crawls a website's sitemap to gather all URLs, loads each page concurrently in a headless browser (using Playwright), and waits for client-side JavaScript/APIs to render before checking for a specified CSS/HTML query selector.
This allows detection of dynamic elements (e.g. login/newsletter forms, React/Vue/Angular rendered components, and third-party widgets) that are not present in static HTML sources.
🚀 Features
- Programmatic Sitemap Gathering: Parses `robots.txt` and recursively loads nested XML sitemaps to discover all page URLs.
- Dynamic JS Rendering: Uses headless Chromium to run all page scripts, load APIs, and render dynamic contents.
- Smart Element Waiting: Uses Playwright's `waitForSelector` to detect when target nodes are rendered, with a 5-second automatic timeout.
- Controlled Concurrency: Set a custom parallel tab/context limit (default: `5` concurrent pages) to optimize scraping speed without overloading systems or getting rate-limited.
- Isolated Browser Contexts: Every page is opened in its own isolated browser context, avoiding cookie/session contamination and maintaining clean execution states.
- Multiple Output Formats: Save results as standard line-delimited text (`txt`), a structured `json` array, or print matches directly to standard output (`stdout`) in real-time.
- Clean Output Streams: Matches write cleanly to `stdout` or files, while progress logs, warnings, and stats are sent to `stderr` for clean terminal piping.
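
As an illustration of the sitemap-gathering feature, here is a minimal sketch of how `robots.txt` and nested sitemaps might be walked. It is not nodefisher's actual source: the function names (`sitemapsFromRobots`, `locsFromXml`, `discoverUrls`) and the injected `fetchText` callback are hypothetical, and a regular expression stands in for a real XML parser.

```javascript
// Hypothetical sketch of sitemap discovery; not nodefisher's actual source.

// Collect the URLs named on "Sitemap:" lines of a robots.txt body.
function sitemapsFromRobots(robotsTxt) {
  return robotsTxt
    .split(/\r?\n/)
    .map((line) => line.match(/^\s*sitemap:\s*(\S+)/i))
    .filter(Boolean)
    .map((match) => match[1]);
}

// Pull every <loc> value out of a sitemap or sitemap-index document.
// A regex is enough for a sketch; XML entities are left undecoded.
function locsFromXml(xml) {
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/gi)].map((m) => m[1]);
}

// Walk nested sitemaps with a queue. `fetchText(url)` is an injected
// function resolving to the response body, so this runs offline in tests.
async function discoverUrls(siteUrl, fetchText) {
  const robots = await fetchText(new URL('/robots.txt', siteUrl).href);
  const queue = sitemapsFromRobots(robots);
  const seenSitemaps = new Set();
  const pages = new Set(); // a Set keeps page URLs unique

  while (queue.length > 0) {
    const sitemapUrl = queue.shift();
    if (seenSitemaps.has(sitemapUrl)) continue; // guard against cycles
    seenSitemaps.add(sitemapUrl);

    const xml = await fetchText(sitemapUrl);
    const isIndex = /<sitemapindex[\s>]/i.test(xml);
    for (const loc of locsFromXml(xml)) {
      if (isIndex) queue.push(loc); // nested sitemap: visit it later
      else pages.add(loc); // ordinary page URL
    }
  }
  return [...pages];
}
```

On Node.js 18+, `fetchText` could be as small as `(url) => fetch(url).then((res) => res.text())`, since `fetch` is available globally there.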
📦 Installation
Ensure you have Node.js installed (version 18+ is recommended).
- Install the CLI tool globally:

      npm install -g nodefisher

  (Note: This will automatically download the required Chromium browser binary during the installation process.)
🛠️ Usage
Run the tool using `nodefisher`:

    nodefisher <url> <selector> [options]

Required Arguments
- `<url>`: The frontpage/homepage URL of the website to crawl (e.g., `https://example.com` or simply `example.com`).
- `<selector>`: The CSS/HTML query selector to search for (e.g. `".target-class"`, `"#main-title"`, or `".newsletter-signup"`).

Optional Flags
- `-f, --format <format>`: Output format. Options: `txt`, `json`, or `stdout` (default: `txt`).
- `-o, --output <file>`: Custom path to save the output file. Defaults to `results.txt` (for `txt` format) or `results.json` (for `json` format).
- `-c, --concurrency <number>`: Number of concurrent browser pages to process in parallel (default: `5`).
- `-v, --verbose`: Print detailed page processing, info messages (timeouts vs. success), and match detections to `stderr`.
- `-h, --help`: Display the help message.
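
To make the defaults concrete, the following is a rough sketch of how the arguments and flags described here could be resolved into a settings object. `resolveOptions` and `serialize` are hypothetical helpers, not part of nodefisher, and prefixing bare hostnames with `https://` is an assumption about how `example.com` is handled.

```javascript
// Hypothetical helpers; the defaults mirror the flag descriptions in this
// README, but this is not nodefisher's actual option handling.
const FORMATS = ['txt', 'json', 'stdout'];

function resolveOptions({ url, selector, format = 'txt', output, concurrency = 5 }) {
  if (!FORMATS.includes(format)) {
    throw new Error(`Unsupported format: ${format}`);
  }

  // Accept bare hostnames such as "example.com" (https is an assumption).
  const startUrl = /^https?:\/\//i.test(url) ? url : `https://${url}`;

  // stdout needs no file; txt and json fall back to results.<format>.
  const outputPath = format === 'stdout' ? null : (output ?? `results.${format}`);

  // CLI values arrive as strings, so parse before validating.
  const limit = Number.parseInt(concurrency, 10);
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error(`Invalid concurrency: ${concurrency}`);
  }

  return { startUrl, selector, format, outputPath, concurrency: limit };
}

// Serialize matching URLs for the two file formats.
function serialize(urls, format) {
  return format === 'json'
    ? JSON.stringify(urls, null, 2)
    : urls.map((u) => `${u}\n`).join('');
}
```

Keeping the output path `null` for `stdout` makes it harder to write a results file by accident when the matches are meant to be piped to another program.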
💡 Examples
1. Basic Dynamic Scan (Default Text Output File)

Crawls https://example.com for dynamic elements matching `.newsletter-signup` and saves to `results.txt`:

    nodefisher https://example.com ".newsletter-signup"

2. Save Results as JSON File (Verbose Mode)

Crawls and saves matching URLs into `matches.json` while printing detailed browser loading logs:

    nodefisher example.com ".newsletter-signup" --format json --output matches.json --verbose

3. Print directly to Stdout (Real-time Streaming)

Prints matching URLs directly to standard output as they are found:

    nodefisher https://example.com ".dynamic-element" --format stdout

4. Custom Concurrency

Scan a site using 10 concurrent browser context workers:

    nodefisher example.com ".target-class" --concurrency 10

⚙️ How it Works
- URL Discovery: The CLI takes the target URL, parses its `robots.txt` file, fetches sitemaps recursively, and outputs all unique URLs on the domain.
- Headless Browser Setup: Playwright launches a headless Chromium instance.
- Parallel Worker Pool: A worker queue pulls URLs. It initializes isolated browser contexts up to the `--concurrency` limit.
- JS Page Load: Each worker navigates to its URL, allowing JavaScript to run and external APIs to load.
- Selector Matching: The script invokes Playwright's `page.waitForSelector(selector, { timeout: 5000 })`. If the element renders within 5 seconds, the page is flagged as a match. If it does not appear (or the page fails to load), the page is skipped.
- Output Writing: Matching URLs are saved/printed based on the chosen output format.
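
The worker pool, page load, and selector matching steps above can be sketched roughly as follows. This is an illustrative outline rather than nodefisher's real implementation: `pageMatches`, `runPool`, and `scan` are hypothetical names, and the `browser` argument is assumed to be a Playwright `Browser` object.

```javascript
// Hypothetical sketch of the pool and matching steps; not nodefisher's source.

// Resolve to true if `selector` appears on `url` within `timeout` ms.
async function pageMatches(browser, url, selector, timeout = 5000) {
  const context = await browser.newContext(); // fresh cookies/storage per page
  try {
    const page = await context.newPage();
    await page.goto(url);
    await page.waitForSelector(selector, { timeout });
    return true;
  } catch {
    return false; // timeout or failed navigation: treat the page as skipped
  } finally {
    await context.close(); // always release the context
  }
}

// Run `worker` over `items` with at most `limit` calls in flight.
async function runPool(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function lane() {
    while (next < items.length) {
      const index = next++; // claim an item before awaiting
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Return the URLs whose rendered page contains `selector`.
async function scan(browser, urls, selector, concurrency = 5) {
  const flags = await runPool(urls, concurrency, (url) =>
    pageMatches(browser, url, selector)
  );
  return urls.filter((_, index) => flags[index]);
}
```

With Playwright installed, `browser` would typically come from `chromium.launch()`, which is headless by default, and should be closed with `browser.close()` once `scan` resolves. Passing the browser in as a parameter also lets the pool logic be exercised with a stub object, without launching Chromium.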
