
@giladbeer/node-spider

v2.0.2

Node.js web spider for site search

Downloads: 48

Readme

node-spider

A Node.js web spider for site search, inspired by the deprecated https://github.com/algolia/docsearch-scraper. NOTE: the project is in a very early stage.

overview

node-spider lets you crawl your website, scrape content that matches HTML selectors you specify in a config file, and then index the scraped records in a search engine (currently only Algolia is supported) to power your site search.

Under the hood, the project uses puppeteer-cluster, which in turn uses puppeteer.

getting started

installation

npm

# puppeteer and puppeteer-cluster are used under the hood and need to be installed alongside
npm install --save puppeteer puppeteer-cluster
npm install --save @giladbeer/node-spider

yarn

yarn add puppeteer puppeteer-cluster @giladbeer/node-spider

usage

import { crawlSite } from '@giladbeer/node-spider';

const letsStartCrawling = async () => {
  // crawl the site described in the config file and index the scraped records in Algolia
  await crawlSite({
    configFilePath: 'path/to/your/config.json',
    searchEngineOpts: {
      algolia: {
        apiKey: '<your algolia API key>',
        appId: '<your algolia app ID>',
        indexName: '<your algolia index name>'
      }
    },
    diagnostics: true,
    logLevel: 'debug',
    maxIndexedRecords: 300
  });
};

letsStartCrawling().then(() => {
  process.exit(0);
});

API docs (WIP)

crawlSite

Instantiates a Spider object, initializes it based on your config file and settings, then invokes its crawl method.

crawlSite options:

| Property | Required | Type | Description |
| --- | --- | --- | --- |
| configFilePath | N | string | the path to your config JSON file |
| config | N | CrawlSiteOptionsCrawlerConfig | alternatively to passing a config file path, you can pass the config file's properties here |
| searchEngineOpts | N | SearchEngineOpts | search engine settings |
| logLevel | N | "debug" / "warn" / "error" | log level |
| diagnostics | N | boolean | whether or not to output diagnostics |
| diagnosticsFilePath | N | string | path to the file where diagnostics will be written |
| timeout | N | number | timeout in ms |
| maxIndexedRecords | N | number | maximum number of records to index. If reached, the crawling jobs will terminate |
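
As a rough sketch of these options, passing the crawler config inline through config instead of configFilePath might look like the following; the property names come from the tables in this section, while the URLs, selectors and values are placeholder assumptions and only a subset of fields is shown:

import { crawlSite } from '@giladbeer/node-spider';

const run = async () => {
  await crawlSite({
    // pass the crawler config inline instead of pointing to a JSON file via configFilePath
    config: {
      startUrls: ['https://docs.example.com'],   // placeholder start URL
      scraperSettings: {
        // placeholder selectors; see ScraperSettings / ScraperPageSettings below
        shared: { urlPattern: '.*', hierarchySelectors: { content: 'p' } },
        default: { urlPattern: '.*', hierarchySelectors: { l0: 'h1', l1: 'h2' } }
      }
    },
    searchEngineOpts: {
      algolia: {
        apiKey: '<your algolia API key>',
        appId: '<your algolia app ID>',
        indexName: '<your algolia index name>'
      }
    },
    logLevel: 'warn',                          // "debug" / "warn" / "error"
    diagnostics: true,
    diagnosticsFilePath: 'diagnostics.json',   // placeholder output path
    timeout: 60000,                            // ms
    maxIndexedRecords: 500
  });
};

run().then(() => {
  process.exit(0);
});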

CrawlSiteOptionsCrawlerConfig

| Property | Required | Type | Description |
| --- | --- | --- | --- |
| startUrls | Y | string / string[] | list of URLs that the crawler will start from |
| scraperSettings | Y | ScraperSettings | HTML selectors telling the crawler which content to scrape for indexing |
| allowedDomains | N | string / string[] | list of allowed domains. When not specified, defaults to the domains of your startUrls |
| ignoreUrls | N | string / string[] | list of URL patterns to ignore |
| maxConcurrency | N | number | maximum concurrent puppeteer clusters to run |
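
To illustrate, a config with these properties (whether stored in the JSON file or passed inline) might look roughly like this sketch; the URLs, domains and patterns are placeholder assumptions:

// roughly the shape of the config, shown as a TypeScript object literal;
// URLs, domains and patterns are placeholders
const crawlerConfig = {
  startUrls: ['https://docs.example.com', 'https://docs.example.com/blog'],
  allowedDomains: 'docs.example.com',   // defaults to the domains of startUrls when omitted
  ignoreUrls: ['.*/private/.*'],        // placeholder URL pattern to skip
  maxConcurrency: 2,                    // maximum concurrent puppeteer clusters
  scraperSettings: {
    // see ScraperSettings below; selectors here are placeholders
    shared: { urlPattern: '.*', hierarchySelectors: { content: 'p' } },
    default: { urlPattern: '.*', hierarchySelectors: { l0: 'h1', l1: 'h2' } }
  }
};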

ScraperSettings

All of the scraper settings groups (each group except the default is tied to a specific URL pattern).

| Property | Required | Type | Description |
| --- | --- | --- | --- |
| default | Y | ScraperPageSettings | default scraper page settings. Applied when the scraped URL doesn't match any other scraper page settings group |
| [your scraper page-level settings group name] | N | ScraperPageSettings | page-level settings group. You can add as many as you want. Each group is applied to a given URL pattern: during crawling, the settings for each page are chosen based on which group's urlPattern field matches the page URL, falling back to default when no match is found |
| shared | Y | ScraperPageSettings | shared scraper settings. Settings defined here are applied to all pages unless overridden in default or in the specific settings group that matches the current page |
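
A trimmed sketch of a scraperSettings object with one extra page-level group; the group name blog, the URL patterns and the selectors are placeholder assumptions, and only a subset of the documented fields is shown:

const scraperSettings = {
  // applied to every page unless overridden by the group that matches the page URL
  shared: {
    urlPattern: '.*',
    hierarchySelectors: { content: 'p, li' }
  },
  // fallback group, used when no other group's urlPattern matches the page URL
  default: {
    urlPattern: '.*',
    hierarchySelectors: { l0: 'h1', l1: 'h2' }
  },
  // hypothetical page-level group, applied to pages whose URL matches its urlPattern
  blog: {
    urlPattern: '.*/blog/.*',
    hierarchySelectors: { l0: 'h1', l1: 'h2' },
    pageRank: 1                        // rank matched blog pages above the default of 0
  }
};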

ScraperPageSettings

A group of scraper settings (mostly hierarchy and metadata selectors) applied to a specific URL pattern.

| Property | Required | Type | Description |
| --- | --- | --- | --- |
| hierarchySelectors | Y | HierarchySelectors | selectors hierarchy (see below) |
| metadataSelectors | Y | Record<string, string> | metadata selectors. A mapping from HTML selectors to custom additional fields in the index, e.g. you can scrape meta tags of a certain content pattern and store them under a custom field |
| urlPattern | Y | string | URL pattern. During crawling, the settings group for each page is chosen based on which group's urlPattern field matches the page URL, falling back to default when no match is found |
| pageRank | N | number | custom ranking for the matched pages. Defaults to 0 |
| respectRobotsMeta | N | boolean | whether or not the crawler should respect the noindex meta tag. Defaults to false |
| excludeSelectors | N | string[] | list of HTML selectors to exclude from being scraped |
| userAgent | N | string | custom user agent to set when running puppeteer |
| headers | N | Record<string, string> | request headers to include when crawling the site |
| basicAuth | N | { user: string; password: string } | basic auth credentials |
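
As an illustration, a single page-level settings group might look roughly like the sketch below; the selectors, patterns, header values and credentials are placeholders, and the metadataSelectors mapping follows the selector-to-field description in the table above:

const docsPageSettings = {
  urlPattern: '.*/docs/.*',                     // placeholder pattern this group applies to
  hierarchySelectors: {
    l0: 'h1',
    l1: 'h2',
    content: 'p, li'
  },
  metadataSelectors: {
    // placeholder: scrape a meta tag and store it under a custom "category" field
    'meta[name="docs-category"]': 'category'
  },
  pageRank: 2,                                  // custom ranking for matched pages (defaults to 0)
  respectRobotsMeta: true,                      // respect the noindex meta tag (defaults to false)
  excludeSelectors: ['nav', 'footer'],          // placeholder selectors to leave out of scraping
  userAgent: 'my-crawler-bot',                  // placeholder custom user agent for puppeteer
  headers: { 'Accept-Language': 'en' },         // placeholder request header
  basicAuth: { user: '<user>', password: '<password>' }
};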

HierarchySelectors

Hierarchy selectors. Essentially a mapping from HTML selectors to indexed hierarchy levels.

| Property | Required | Type | Description |
| --- | --- | --- | --- |
| l0 | N | string | HTML selectors for matching l0, e.g. "span[class='myclass'], .myclass2" |
| l1 | N | string | HTML selectors for matching l1 |
| l2 | N | string | HTML selectors for matching l2 |
| l3 | N | string | HTML selectors for matching l3 |
| l4 | N | string | HTML selectors for matching l4 |
| content | N | string | HTML selectors for matching content |
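
For instance, a hierarchySelectors object for a typical documentation page might look like this sketch; the selectors themselves are placeholder assumptions:

const hierarchySelectors = {
  l0: 'h1',                                   // placeholder: page title
  l1: 'h2',                                   // placeholder: section headings
  l2: 'h3',                                   // placeholder: subsection headings
  l3: "span[class='myclass'], .myclass2",     // selector format example from the table above
  content: 'p, li'                            // placeholder: body text to index
};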