
web-tree-crawl

v1.1.4

A simple framework for crawling/scraping websites. The result is a tree where each node is a single request.


Note to English speakers

Many comments, issues, etc. are partially written in German. If you want something translated, create an issue and I'll take care of it.

Introduction

web-tree-crawl is available on npm and on GitLab.

Idea

The crawling process is tree-shaped: you start with a single URL (the root), download a document (a node), and discover new URLs (child nodes), which in turn will be downloaded. So every crawled document is a node in the tree and every URL is an edge. The tree spans only new edges; edges to already-known URLs will be stored, but not processed.

The end result will be a tree representing the crawl process. All discovered information will be stored in this tree.

The main difference between crawlers is which URLs and which data are scraped from discovered documents. So those two scrapers need to be supplied by the user, while the library web-tree-crawl takes care of everything else.
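
To make the idea concrete, here is a minimal sketch of the tree-shaped crawl loop. This is illustrative only and not the library's actual internals: the node shape and the fetchDocument helper are assumptions, while scrapeUrls and scrapeData stand in for the two user-supplied scrapers.

"use strict";
const https = require('https');

// Hypothetical HTTP helper; the real library handles this internally.
function fetchDocument(url) {
    return new Promise((resolve, reject) => {
        https.get(url, (res) => {
            let body = '';
            res.on('data', (chunk) => { body += chunk; });
            res.on('end', () => resolve(body));
        }).on('error', reject);
    });
}

// Illustrative crawl loop: every document becomes a node, every URL an edge,
// and only edges to unseen URLs are followed.
async function crawlTree(rootUrl, maxRequests, scrapeUrls, scrapeData) {
    const seen = new Set([rootUrl]); // already-known URLs are stored, not re-crawled
    const root = { url: rootUrl, data: null, children: [] };
    const queue = [root];
    let requests = 0;

    while (queue.length > 0 && requests < maxRequests) {
        const node = queue.shift();
        const content = await fetchDocument(node.url);
        requests += 1;
        node.data = scrapeData(content, node); // what to store for this document
        for (const url of scrapeUrls(content, node.url)) {
            if (!seen.has(url)) { // only new edges span the tree
                seen.add(url);
                const child = { url, data: null, children: [] };
                node.children.push(child);
                queue.push(child);
            }
        }
    }
    return root; // the tree representing the crawl process
}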

Example

Note: all examples use ECMAScript 6 (ES6).

Let's say you want the last couple of comics from xkcd.com. All you have to do is:

"use strict";
const crawler = require('web-tree-crawl');

// configure your crawler
let ts = new crawler("https://xkcd.com");
ts.config.maxRequests = 5;
ts.config.dataScraper = crawler.builtin.dataScraper.generalHtml;
ts.config.urlScraper = crawler.builtin.urlScraper.selectorFactory('a[rel="prev"]');

// execute!
ts.buildTree(function (root) {
    // print discovered data to stdout
    console.log(JSON.stringify(crawler.builtin.treeHelper.getDataAsFlatArray(root), null, "\t"));
});

For more examples, see: https://gitlab.com/wotanii/web-tree-crawl/tree/master/examples

Details/Documentation

Use web-tree-crawl like this:

  1. create the crawler object & set the initial URL
  2. modify the config object
  3. call buildTree & wait for the callback

Config

You will always want to define these config attributes:

  • maxRequests: how many documents may be crawled?
  • dataScraper: what data do you want to find?
  • urlScraper: how does the crawler look for new URLs?

There are more, but their defaults work well on most websites and are pretty much self-explanatory (if not, let me know by opening an issue).

Url Scraper

These are functions that scrape URLs from a document. The crawler applies this function to all crawled documents to discover new documents.

Create your own URL scraper or use a builtin. All URL scrapers must have this signature (see the sketch after the list):

  • parameters
    1. string: content of current document
    2. string: url of current document
  • returns
    1. string[]: discovered URLs
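
For example, a hand-written URL scraper matching this signature might look like the following. This is a hypothetical sketch: the regex-based link extraction is my own illustration, not a builtin, and it reuses the ts object from the example above.

// Hypothetical custom URL scraper with the required signature:
// (content, url) => string[]
function myUrlScraper(content, url) {
    // naive extraction of absolute http(s) links from the document
    return content.match(/https?:\/\/[^\s"'<>]+/g) || [];
}

ts.config.urlScraper = myUrlScraper;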

Data Scraper

These are functions that scrape data from a document. The crawler applies this function to all crawled documents to decide what data to store for each document.

The crawler will not use this data in any way, so you can return whatever you want.

Create your own data scraper or use a builtin. All data scrapers must have this signature (see the sketch after the list):

  • parameters
    1. string: content of current document
    2. string: current node
  • returns
    1. anything
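
As an illustration, a hand-written data scraper matching this signature could store each page's <title>. Both the function and the regex are hypothetical, not builtins:

// Hypothetical custom data scraper with the required signature:
// (content, node) => anything
function myDataScraper(content, node) {
    const match = content.match(/<title>([^<]*)<\/title>/i);
    // whatever is returned here is stored on the node in the result tree
    return { title: match ? match[1] : null };
}

ts.config.dataScraper = myDataScraper;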

Builtin

There are some static builtin functions that you don't need to use, but they will make your life easier. Some of these functions can be used directly and some are factories that return such functions.

Url Scraper

These are functions that scrape for URLs in common ways. Use them by putting them in your config like this:

ts.config.urlScraper = crawler.builtin.urlScraper.selectorFactory('a[rel="prev"]');

Data Scraper

These are functions that scrape for data in common ways. Use them by putting them in your config like this:

ts.config.dataScraper = crawler.builtin.dataScraper.generalHtml;

Tree Helper

These are functions that help extract information from the result tree. Use them once buildTree has finished.

They will either modify your tree (e.g. crawler.builtin.treeHelper.addParentsToNodes) or extract data from your tree (e.g. crawler.builtin.treeHelper.getDataAsFlatArray).
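
A short sketch of how these helpers might be combined once buildTree has finished. The helper names come from this readme; the exact shape of the data they return is an assumption:

ts.buildTree(function (root) {
    // add a parent reference to every node so the tree can be walked upwards
    // (assumption: this mutates the tree in place)
    crawler.builtin.treeHelper.addParentsToNodes(root);

    // collect everything the data scraper stored into a flat array
    const allData = crawler.builtin.treeHelper.getDataAsFlatArray(root);
    console.log(allData.length + " documents crawled");
});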

Dev-Setup

sudo apt install npm nodejs

git clone [email protected]:wotanii/web-tree-crawl.git
cd web-tree-crawl/
npm install

npm test

If tests fail with your setup, either create an issue or comment on an existing one.