xpcrawl

A versatile Puppeteer-based web crawler designed to extract data using XPath. It supports pagination, piping for multi-step extraction, and includes stealth features to bypass some anti-crawling mechanisms.

Features

  • XPath Extraction: Extract text, attributes, or HTML from elements using XPath.
  • Pagination: Automatically follow "next" links to crawl multiple pages.
  • Piping Support: Chain multiple instances of the crawler to extract links from one page and then crawl those links for content.
  • Stealth Mode: Uses puppeteer-extra-plugin-stealth and randomized delays to avoid detection.
  • Interactive CAPTCHA Handling: If a page blocks the crawler in headless mode, it can optionally switch to visible mode and wait for you to solve the CAPTCHA.
  • Headless Toggle: Run in headless mode (default) or visible mode (--visible).

Installation

Install globally via npm:

npm install -g xpcrawl

Or run directly with npx:

npx xpcrawl <url> <xpath>

Usage

Basic Syntax

xpcrawl <url> <xpath> [--paginate <pagination_xpath>] [--delay <ms>] [--visible]

Or via pipe:

echo <url> | xpcrawl <xpath> [--delay <ms>]

Arguments

  • <url>: The starting URL to crawl.
  • <xpath>: (Optional) The XPath query to extract data. If omitted, xpcrawl enters Smart Mode (see below).
  • --paginate <xpath>: (Optional) XPath to find the "Next Page" link.
  • --delay <ms>: (Optional) Delay in milliseconds between requests (adds random jitter).
  • --visible: (Optional) Run the browser in visible mode (useful for debugging).

Smart Mode

If you don't provide an XPath, xpcrawl automatically attempts to extract the most relevant content from the page, including:

  • Page title
  • Headings (H1, H2, H3)
  • Paragraphs
  • Link targets (hrefs)
  • Image sources (srcs)

CLI-Specific Features

When using xpcrawl as a CLI tool, you get additional interactive features:

Interactive CAPTCHA Handling

If a website presents a CAPTCHA and no results are found:

  1. Automatic Detection: The CLI automatically switches from headless to visible mode
  2. Browser Opens: A browser window appears showing the CAPTCHA
  3. Manual Intervention: Solve the CAPTCHA in the browser
  4. Interactive Prompt: The CLI prompts you to press Enter to retry or type "skip" to continue
  5. Resume Crawling: After solving, press Enter and crawling continues

# Example: If CAPTCHA is detected
xpcrawl "https://protected-site.com" "//h1"

# Output:
# --- No matches found in headless mode. Switching to VISIBLE mode for CAPTCHA resolution... ---
# Browser restarted. Navigating back to target...
# 
# --- No matches found. CAPTCHA detected? ---
# Browser is open. Solve the CAPTCHA or fix the page state.
# Press ENTER to retry extraction, or type "skip" to move on.
# >

CLI Examples

Example 1: Download Book Covers

This example demonstrates a full pipeline:

  1. Crawl a category page to find book links
  2. Extract the cover image URL from each book page
  3. Download the images in parallel using wget

mkdir -p covers
xpcrawl "http://books.toscrape.com/catalogue/category/books/travel_2/index.html" "//h3/a/@href" | \
xpcrawl "//div[@id='product_gallery']//img/@src" | \
xargs -P 4 -n 1 -I {} wget -q -P covers {}

Example 2: Extract Article Titles with Pagination

Crawl a blog or news site with pagination:

xpcrawl "https://example.com/blog" "//article//h2/a/text()" \
  --paginate "//a[@rel='next']/@href" \
  --delay 1000

Example 3: Scrape Product Prices

Extract product information from an e-commerce site:

xpcrawl "https://example.com/products" "//div[@class='product']//span[@class='price']/text()"

Example 4: Get All Links from a Page

Extract all absolute URLs from a webpage:

xpcrawl "https://example.com" "//a/@href"

Programmatic Usage

You can also use xpcrawl as a library in your Node.js projects:

npm install xpcrawl

Basic Example

const { crawl } = require('xpcrawl');

async function main() {
    const results = await crawl({
        url: 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html',
        xpath: '//h3/a/text()'
    });
    
    console.log('Book titles:', results);
}

main();

API Reference

crawl(options)

Crawl a URL or HTML content and extract data using XPath.

Options:

  • url (string): URL to crawl (mutually exclusive with html)
  • html (string): HTML content to parse (mutually exclusive with url)
  • xpath (string, required): XPath query to extract data
  • paginationXpath (string): XPath to find next page link
  • delay (number): Delay in ms between requests (default: 0)
  • headless (boolean): Run browser in headless mode (default: true)
  • autoSwitchVisible (boolean): Auto-switch to visible mode on CAPTCHA (default: true)
  • onResult (function): Callback invoked for each batch of results: (matches) => {}
  • onPage (function): Callback invoked for each page processed: (url) => {}

Returns: Promise<string[]> - Array of extracted values

crawlMultiple(options)

Crawl multiple URLs in sequence.

Options: Same as crawl(), plus:

  • urls (string[], required): Array of URLs to crawl

Returns: Promise<string[]> - Array of all extracted values
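
For example, here is a minimal sketch of crawlMultiple() based on the options documented above; the URLs and XPath are placeholders:

const { crawlMultiple } = require('xpcrawl');

async function main() {
    // Crawl two listing pages in sequence with the same XPath
    const results = await crawlMultiple({
        urls: [
            'https://example.com/page/1',
            'https://example.com/page/2'
        ],
        xpath: '//h2/a/text()',
        delay: 500  // small pause between requests
    });

    console.log('All extracted values:', results);
}

main();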

CAPTCHA Handling in Library Mode

When using xpcrawl as a library (not CLI), CAPTCHA handling works differently:

  • Automatic switch to visible mode: If autoSwitchVisible: true (default), the browser will automatically switch from headless to visible mode when no results are found
  • Browser stays open: The visible browser window remains open, allowing you to solve CAPTCHAs
  • No interactive prompt: Unlike CLI mode, there's no terminal prompt asking you to press Enter
  • ⚠️ Manual handling required: You need to implement your own logic to wait for CAPTCHA resolution

Example with manual CAPTCHA handling:

const { crawl } = require('xpcrawl');

async function crawlWithCaptchaHandling() {
    const results = await crawl({
        url: 'https://protected-site.com',
        xpath: '//h1/text()',
        headless: false,  // Start in visible mode
        autoSwitchVisible: false,  // Disable auto-switch since we're already visible
        onResult: (matches) => {
            if (matches.length === 0) {
                console.log('No results found. Please solve CAPTCHA in the browser window.');
                console.log('The browser will stay open for 60 seconds...');
            }
        }
    });
    
    return results;
}

Recommendation: For production use with CAPTCHAs, consider:

  • Starting with headless: false to see what's happening
  • Using CAPTCHA-solving services
  • Implementing retry logic with delays (see the sketch after this list)
  • Using the CLI mode for interactive debugging
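
As an illustration of the retry point above, here is a minimal sketch built on the documented crawl() options; the URL, XPath, attempt count, and wait time are placeholders, and crawlWithRetry is a hypothetical helper, not part of the xpcrawl API:

const { crawl } = require('xpcrawl');

// Hypothetical helper: retry a crawl a few times, waiting between attempts,
// until it returns at least one result.
async function crawlWithRetry(options, attempts = 3, waitMs = 5000) {
    for (let i = 1; i <= attempts; i++) {
        const results = await crawl(options);
        if (results.length > 0) return results;

        console.log(`Attempt ${i} returned no results, retrying in ${waitMs} ms...`);
        await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
    return [];
}

crawlWithRetry({ url: 'https://example.com', xpath: '//h1/text()', delay: 1000 })
    .then((results) => console.log('Results:', results));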

More Examples

See the examples directory for more usage patterns:

// Smart Mode (automatic extraction)
const results = await crawl({
    url: 'https://example.com'
    // xpath is optional!
});

// With pagination
const results = await crawl({
    url: 'https://example.com/blog',
    xpath: '//article//h2/a/text()',
    paginationXpath: '//a[@rel="next"]/@href',
    delay: 1000
});

// Parse HTML directly
const results = await crawl({
    html: '<html><body><h1>Hello</h1></body></html>',
    xpath: '//h1/text()'
});

// Stream results with callbacks
await crawl({
    url: 'https://example.com',
    xpath: '//a/@href',
    onResult: (matches) => console.log('Batch:', matches),
    onPage: (url) => console.log('Processing:', url)
});

TypeScript Support

xpcrawl includes TypeScript type definitions out of the box:

import { crawl, crawlMultiple, CrawlOptions } from 'xpcrawl';

// Full type safety
const options: CrawlOptions = {
    url: 'https://example.com',
    xpath: '//h1/text()',
    delay: 1000,
    onResult: (matches: string[]) => {
        console.log('Results:', matches);
    }
};

const results: string[] = await crawl(options);

Available Types:

  • CrawlOptions - Options for the crawl() function
  • CrawlMultipleOptions - Options for the crawlMultiple() function

See examples/typescript-usage.ts for more TypeScript examples.

Kotlin/JS Support

xpcrawl can be used from Kotlin/JS projects via npm:

// build.gradle.kts
dependencies {
    implementation(npm("xpcrawl", "1.0.0"))
}

// External declarations
@file:JsModule("xpcrawl")
external fun crawl(options: CrawlOptions): Promise<Array<String>>

// Usage
suspend fun example() {
    val results = crawl(object : CrawlOptions {
        override var url = "https://example.com"
        override var xpath = "//h1/text()"
        override var headless = true
    }).await()
    
    println(results)
}

See examples/kotlin-js/ for a complete Kotlin/JS example with Gradle setup.

Note: The Kotlin/JS example is provided as a reference. It requires xpcrawl to be published to npm and additional Kotlin/JS setup (coroutines, Gradle, JDK). For most use cases, we recommend using TypeScript for type safety or JavaScript for simplicity.


XPath Quick Reference

Here are some common XPath patterns you can use with xpcrawl:

| Pattern | Description |
|---------|-------------|
| //h1 | All <h1> elements |
| //h1/text() | Text content of all <h1> elements |
| //a/@href | All href attributes from links (auto-resolved to absolute URLs) |
| //img/@src | All src attributes from images (auto-resolved to absolute URLs) |
| //div[@class='content'] | All divs with class "content" |
| //div[@id='main']//p | All paragraphs inside the div with id "main" |
| //a[contains(@href, 'product')] | Links containing "product" in href |
| //span[@class='price']/text() | Text content of price spans |

Note: When extracting href or src attributes, xpcrawl automatically converts relative URLs to absolute URLs based on the page's base URI.

Advanced Usage

Piping HTML Content

You can pipe HTML content directly to xpcrawl instead of providing a URL:

curl -s https://example.com | xpcrawl "//h1/text()"

Using with jq for JSON Output

Combine with jq to create structured JSON output:

xpcrawl "https://example.com/products" "//div[@class='product']/@data-id" | \
jq -R -s 'split("\n") | map(select(length > 0)) | {product_ids: .}'

Debugging with Visible Mode

When developing your XPath queries or troubleshooting issues, use --visible to see what the browser is doing:

xpcrawl "https://example.com" "//h1" --visible

Handling Rate Limits

Add delays between requests to avoid overwhelming servers or triggering rate limits:

xpcrawl "https://example.com" "//a/@href" --delay 2000  # 2 second delay (with random jitter)

Troubleshooting

No Output

  • Check your XPath: Use browser DevTools to test your XPath query. Right-click → Inspect, then use the Console to test with $x("your-xpath-here").
  • Wait for content: Some sites load content dynamically. The crawler waits for networkidle2 by default, but complex SPAs might need additional handling.
  • Check for blocking: If the site detects automation, try adding --delay or use --visible to solve CAPTCHAs.

Relative URLs Not Working

xpcrawl automatically resolves relative URLs in href and src attributes to absolute URLs. If you're getting relative URLs, make sure you're extracting the attribute (e.g., //a/@href) rather than the element itself.

Timeout Errors

If you see "Navigation timeout" errors:

  • The site might be slow or blocking automated access
  • Try increasing the timeout (currently hardcoded to 60 seconds)
  • Use --visible to see what's happening in the browser

License

MIT

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.