
404 Crawler 🏊‍♂️

A command line interface to crawl and detect 404 pages from a sitemap.

📊 Usage

Install

Make sure npm is installed on your computer. To learn more, visit https://docs.npmjs.com/downloading-and-installing-node-js-and-npm

In a terminal, run

npm install -g @algolia/404-crawler

After that, you'll be able to use the 404crawler command in your terminal

Examples

  • Crawl and detect every 404 page from the Algolia website's sitemap:

    404crawler crawl -u https://algolia.com/sitemap.xml
  • Use JavaScript rendering to crawl and identify all 404 or 'Not Found' pages on the Algolia website:

    404crawler crawl -u https://algolia.com/sitemap.xml --render-js
  • Crawl and identify all 404 pages on the Algolia website by analyzing its sitemap, including all potential sub-path variations:

    404crawler crawl -u https://algolia.com/sitemap.xml --include-variations
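
Conceptually, a crawl like the ones above boils down to fetching the sitemap, extracting every <loc> URL, and checking each one's HTTP status. The sketch below is not this package's implementation, just a minimal illustration of the idea using Node 18+'s built-in fetch; the function name and the naive regex-based XML parsing are assumptions for brevity.

    async function findBrokenPages(sitemapUrl: string): Promise<string[]> {
      // Download the sitemap and pull every <loc> entry out of the XML.
      const xml = await (await fetch(sitemapUrl)).text();
      const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);

      const broken: string[] = [];
      for (const url of urls) {
        // A plain fetch only catches real 404 status codes; the --render-js
        // option described below exists for the soft-404 case.
        const res = await fetch(url);
        if (res.status === 404) broken.push(url);
      }
      return broken;
    }

    findBrokenPages("https://algolia.com/sitemap.xml").then(console.log);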

Options

  • --sitemap-url or -u: Required. URL of the sitemap.xml file.

  • --render-js or -r: Use JavaScript rendering to crawl and identify a 'Not Found' page even when the status code isn't a 404. This option is useful for websites that return a 200 status code even when the page is not found (for example, Next.js with a custom not-found page). See the first sketch after this list.

  • --output or -o: Output path for the JSON file of results. Example: crawler/results.json. If not set, no file is written after the crawl.

  • --include-variations or -v: Include all sub-path variations of the URLs found in the sitemap.xml. For example, if https://algolia.com/foo/bar/baz is found in the sitemap, the crawler will test https://algolia.com/foo/bar/baz, https://algolia.com/foo/bar, https://algolia.com/foo and https://algolia.com (see the expansion sketch after this list).

  • --exit-on-detection or -e: Exit when a 404 or a 'Not Found' page is detected.

  • --run-in-parallel or -p: Run the crawler with multiple pages in parallel. By default, the number of parallel instances is set to 10; see the --batch-size option to configure this number (a batching sketch follows this list).

  • --batch-size or -s: Number of parallel crawler instances to run: the higher this number, the more resources are consumed. Only available when the --run-in-parallel option is set. Defaults to 10.

  • --browser-type or -b: Type of browser to use to crawl pages. Can be 'firefox', 'chromium' or 'webkit'. Defaults to 'firefox'.
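
To illustrate the --render-js idea: when a site answers 200 for a missing page, the crawler has to actually render the page and look for 'Not Found' signals in the content. The sketch below is not this package's code; it assumes a Playwright-style browser API (suggested by the firefox/chromium/webkit choices above), and the "not found" markers it checks are illustrative heuristics.

    import { firefox } from "playwright"; // assumption: Playwright-style API

    async function isNotFoundPage(url: string): Promise<boolean> {
      const browser = await firefox.launch();
      try {
        const page = await browser.newPage();
        const response = await page.goto(url);
        // A real 404 status code is the easy case.
        if (response !== null && response.status() === 404) return true;
        // Some sites return 200 for missing pages (e.g. Next.js with a
        // custom not-found page), so inspect the rendered content instead.
        const title = (await page.title()).toLowerCase();
        const body = ((await page.textContent("body")) ?? "").toLowerCase();
        return title.includes("404") || body.includes("page not found");
      } finally {
        await browser.close();
      }
    }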
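
The --include-variations option amounts to walking a URL's path upward until the origin is reached. A minimal sketch of that expansion (the function name is illustrative, not from the package):

    function expandVariations(url: string): string[] {
      const { origin, pathname } = new URL(url);
      const segments = pathname.split("/").filter(Boolean);
      const variations: string[] = [];
      // Walk the path upwards: /foo/bar/baz, /foo/bar, /foo, then the origin.
      for (let i = segments.length; i > 0; i--) {
        variations.push(`${origin}/${segments.slice(0, i).join("/")}`);
      }
      variations.push(origin);
      return variations;
    }

    // expandVariations("https://algolia.com/foo/bar/baz") returns:
    // [ "https://algolia.com/foo/bar/baz", "https://algolia.com/foo/bar",
    //   "https://algolia.com/foo", "https://algolia.com" ]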
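
Finally, the --run-in-parallel and --batch-size pair describes checking URLs in concurrent groups rather than one at a time. A hedged sketch of that batching loop, again with illustrative names and Node 18+'s built-in fetch:

    async function crawlInBatches(urls: string[], batchSize = 10): Promise<string[]> {
      const broken: string[] = [];
      // Process the URL list in slices of `batchSize` concurrent requests.
      for (let i = 0; i < urls.length; i += batchSize) {
        const batch = urls.slice(i, i + batchSize);
        const results = await Promise.all(
          batch.map(async (url) => ({ url, status: (await fetch(url)).status }))
        );
        for (const r of results) if (r.status === 404) broken.push(r.url);
      }
      return broken;
    }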

👨‍💻 Get started (maintainers)

This CLI is built with TypeScript and uses ts-node to run the code locally.

Install

Install all dependencies

pnpm i

Run locally

pnpm 404crawler crawl <options>

Deploy

  1. Update the package.json version (one way is shown in the note after this list)

  2. Commit and push changes

  3. Build JS files in dist/ with

    pnpm build
  4. Initialize npm with Algolia org as scope

    npm init --scope=algolia
  5. Follow the instructions

  6. Publish package with

    npm publish
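
As a side note on step 1: npm itself can bump the version if you prefer a command over editing package.json by hand (inside a git repo this also creates a commit and tag, which overlaps with step 2):

    npm version patch   # or: npm version minor / major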

🔗 References

This package uses: