Crawl Links

Crawl Links is a library that allows you to recursively crawl and scrape URLs from a website up to a specified depth. It provides flexibility for both Node.js and web browser environments.

Developed by Wisdom Oparaocha.

For web browsers, build and use dist/index.js.

Compatibility

  • Node.js: Version 12 or above
  • Web Browsers: All modern browsers that support ES6

Installation

You can install Crawl Links using npm:

npm install crawl-links

Configuration

To use Crawl Links, follow these steps:

Import the library into your project:

// For Node.js
const crawlLinks = require('crawl-links');

// For Web Browser (with module bundler)
import crawlLinks from 'crawl-links';

Define the configuration options

The crawl-links package provides configuration options to customize the crawling behavior. These options should be passed as properties of an options object when calling the crawlLinks function.

const options = {
  urls: ['https://example.com'],
  maxDepth: 2,
  maxConcurrentRequests: 5,
  sameDomain: true
};

// crawlLinks returns a promise, so call it from an async context (or use top-level await in an ES module)
const result = await crawlLinks(options);
console.log(result);

  • urls: An array of URLs to crawl. The crawler will start from these URLs and recursively follow links.
  • maxDepth: The maximum depth level to crawl. The crawler will stop at this depth level and not follow further links.
  • maxConcurrentRequests: The maximum number of concurrent requests to make at a time.
  • sameDomain: If sameDomain is set to true, only links from the same domain will be crawled. If sameDomain is set to false or not provided, links from any domain will be crawled.

In the example above, we define an options object with the urls, maxDepth, maxConcurrentRequests, and sameDomain properties, and pass it to the crawlLinks function to initiate the crawling process. The crawler will start from the specified URLs and crawl up to a depth of 2, making at most 5 concurrent requests and following only links on the same domain.

Call the crawlLinks function

crawlLinks(options)
  .then((result) => {
    // Process the result
    console.log('Scraped Links:', result.scrapedLinks);
    console.log('Ignored Links:', result.ignoreLinks);
  })
  .catch((error) => {
    console.error('Error:', error);
  });

Usage

  • To start crawling, provide an array of URLs in the urls option.
  • The crawler will visit each URL and extract all the links found on the page.
  • It will then follow the links to the specified depth level, scraping URLs along the way.
  • Concurrent requests are limited to the specified maxConcurrentRequests value.
  • The result object contains the scrapedLinks and ignoreLinks arrays, which can be processed accordingly.
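
As a small illustration of the last point, the returned arrays can be post-processed however you like. The snippet below is only a sketch (the output file name and the deduplication step are arbitrary, not part of the package); it deduplicates the scraped links and writes them to a JSON file:

const fs = require('fs');
const crawlLinks = require('crawl-links');

crawlLinks({ urls: ['https://example.com'], maxDepth: 1, maxConcurrentRequests: 5 })
  .then((result) => {
    // Remove duplicates before saving
    const uniqueLinks = [...new Set(result.scrapedLinks)];
    fs.writeFileSync('links.json', JSON.stringify(uniqueLinks, null, 2));
    console.log(`Saved ${uniqueLinks.length} links (${result.ignoreLinks.length} ignored)`);
  })
  .catch((error) => {
    console.error('Error:', error);
  });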

Examples

Here are some examples of how to use Crawl Links:

Example 1: Single-Page Scraping

const options = {
  urls: ['https://example.com'],
  maxDepth: 0,
  maxConcurrentRequests: 5
};

crawlLinks(options)
  .then((result) => {
    console.log('Scraped Links:', result.scrapedLinks);
  })
  .catch((error) => {
    console.error('Error:', error);
  });

This example performs scraping on a single page (maxDepth: 0), extracting all the links found on that page.

Example 2: Multi-Page Scraping

const options = {
  urls: ['https://example.com'],
  maxDepth: 2,
  maxConcurrentRequests: 10,
};

crawlLinks(options)
  .then((result) => {
    console.log('Scraped Links:', result.scrapedLinks);
    console.log('Ignored Links:', result.ignoreLinks);
  })
  .catch((error) => {
    console.error('Error:', error);
  });

This example performs scraping on a website up to a depth of 2, with a maximum of 10 concurrent requests.
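
Example 3: Same-Domain Crawling

The sameDomain option is described in the configuration section above but not shown in the examples; the snippet below is an illustrative sketch of how it combines with the other options (the specific values are arbitrary):

const options = {
  urls: ['https://example.com'],
  maxDepth: 3,
  maxConcurrentRequests: 5,
  sameDomain: true
};

crawlLinks(options)
  .then((result) => {
    console.log('Scraped Links:', result.scrapedLinks);
    console.log('Ignored Links:', result.ignoreLinks);
  })
  .catch((error) => {
    console.error('Error:', error);
  });

With sameDomain set to true, only links on the same domain as the starting URLs are followed.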

URL Normalization

The crawl-links script performs URL normalization to ensure consistent and uniform URLs across different types. Here's how it handles normalization for various URL formats:

Absolute URLs

Absolute URLs, such as https://example.com/path/to/page, are already complete URLs that point to a specific web page. The script does not modify absolute URLs during the normalization process. They are used as-is to retrieve the webpage content.

Relative URLs

Relative URLs, such as /path/to/page or ../path/to/page, are URLs that are relative to the current page's URL. The script converts relative URLs to absolute URLs by combining them with the base URL of the current page. For example, if the base URL is https://example.com, a relative URL of /path/to/page will be normalized to https://example.com/path/to/page.

Protocol-less URLs

Protocol-less URLs, such as //example.com/path/to/page, do not specify a protocol (e.g., http or https). The script automatically adds the appropriate protocol based on the page's URL. For example, if the current page URL is https://example.com, a protocol-less URL of //example.com/path/to/page will be normalized to https://example.com/path/to/page.

Fragment URLs

Fragment URLs, such as https://example.com/page#section, include a fragment identifier starting with a # symbol. The script removes the fragment part during normalization to avoid duplicate URLs. For example, https://example.com/page#section will be normalized to https://example.com/page.

Trailing Slashes

The script removes trailing slashes from URLs to ensure consistency. For example, https://example.com/path/ will be normalized to https://example.com/path.

By performing these normalization techniques, the crawl-links script ensures that URLs are consistent, avoids duplicates caused by different URL variations, and facilitates proper navigation through the website's structure.
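
To make the rules above concrete, here is a minimal sketch of this kind of normalization using Node's built-in URL class. This is illustrative only, not the package's internal implementation:

const normalizeUrl = (href, baseUrl) => {
  // Relative and protocol-less URLs are resolved against the current page's URL;
  // absolute URLs pass through unchanged.
  const url = new URL(href, baseUrl);

  // Drop the fragment to avoid duplicate entries for the same page.
  url.hash = '';

  // Remove a trailing slash (except on the root path) for consistency.
  let normalized = url.toString();
  if (normalized.endsWith('/') && url.pathname !== '/') {
    normalized = normalized.slice(0, -1);
  }
  return normalized;
};

console.log(normalizeUrl('/path/to/page', 'https://example.com'));
// https://example.com/path/to/page
console.log(normalizeUrl('https://example.com/page#section', 'https://example.com'));
// https://example.com/page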

Contribution

Contributions, issues, and feature requests are welcome! Feel free to check the GitHub repository and contribute to make Crawl Links even better.

License

This project is licensed under the MIT License.