

Crawly McCrawlface

A small crawler that downloads HTML from the web and applies content extraction.

Install

npm install crawly-mccrawlface

// Load the package (assuming the package's default export is the Crawler constructor)
const Crawler = require('crawly-mccrawlface');

// Create a crawler and supply the seed as a string or an array of strings
const crawler = new Crawler('https://budick.eu');
// or, if you want multiple domains:
// const crawler = new Crawler(['https://budick.eu', 'https://hackerberryfinn.com']);

// start crawling
crawler.start();
crawler.on('finished', () => {
  // The crawler has loaded all sites reachable from the seed within the seed's domain.
  // At this point the content of a site can be retrieved by its url.
});

Data extraction

The crawler uses Google's NLP API to extract data from text. To use this feature you have to supply the API key as an environment variable:

Windows: setx GOOGLE_NLP_API 1234APIKEY, or use the System dialog by typing 'environment' into the Start menu search box.

Unix: export GOOGLE_NLP_API=1234APIKEY

Or create a .env file with the content: GOOGLE_NLP_API=1234APIKEY

To accomplish this, it uses the google-nlp-api package.
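
If you use the .env approach, the variable still has to end up in process.env before the crawler runs; loading it with the dotenv package is one common way to do that (a minimal sketch; dotenv is an assumption here, not a documented dependency of crawly-mccrawlface):

// Load variables from a local .env file into process.env (dotenv is an assumed helper, not part of this package)
require('dotenv').config();

if (!process.env.GOOGLE_NLP_API) {
  console.warn('GOOGLE_NLP_API is not set; NLP-based data extraction will not work.');
}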

Caching

You can cache responses from websites by supplying a simple object that has get and set methods and some persistence.

Some examples:

Redis (with ioredis):

const Redis = require('ioredis');
const redis = new Redis({
    port: 6379,          // Redis port
    host: 'localhost',   // Redis host
    family: 4,
    password: 'superSecurePassword',
    db: 0
});

// Any object with get(key) and set(key, value, expire) works as a cache
const cache = {
    get: function (key) {
        return redis.get(key);
    },
    set: function (key, value, expire) {
        redis.set(key, value, 'EX', expire);
    }
};
crawler.setCache(cache);
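
For local experiments, a plain in-memory object with the same shape should also satisfy this interface (a minimal sketch; it keeps everything in a Map with no persistence, and the expiry handling is an assumption based on the set(key, value, expire) signature shown above):

// Minimal in-memory cache: same get/set shape as the Redis example, no persistence
const memory = new Map();
const inMemoryCache = {
    get: function (key) {
        const entry = memory.get(key);
        if (!entry) return null;
        if (entry.expiresAt && Date.now() > entry.expiresAt) {
            memory.delete(key);
            return null;
        }
        return entry.value;
    },
    set: function (key, value, expire) {
        // expire is treated as seconds here, mirroring the Redis 'EX' option above (an assumption)
        memory.set(key, { value: value, expiresAt: expire ? Date.now() + expire * 1000 : null });
    }
};
crawler.setCache(inMemoryCache);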

Usage in Meteor

=> here

Options

const options = {
    readyIn: 50,
    goHaywire: false,
    userAgent: 'CrawlyMcCrawlface',
    expireDefault: 7 * 24 * 60 * 60 * 1000
};
const crawler = new Crawler([/* ...some urls... */], options);

readyIn (Number): Number of sites that have to be loaded before the ready event is fired.

goHaywire (Boolean): By default the crawler only fetches content from the domains that were in the seed. In haywire mode the crawler never stops and goes crazy on the web. You should not use this mode for now, or use it at your own risk; I'm not your boss.

userAgent (String): User agent string sent with requests.

expireDefault (Number): Default expire value that is passed to the cache when a response is stored.

Filters

You can decide which sites are crawled by adding filters as strings or regular expressions. Filters are combined with OR.

crawler.addFilter('details.html');
// if (url.match('details.html')) { /* url is crawled */ }

crawler.addFilter(new RegExp('[0-9]{5}', 'i'));
// if (url.match('details.html') || url.match(/[0-9]{5}/)) { /* url is crawled */ }

Events:

Crawler

ready is fired when enough sites have been loaded to do a content extraction (default: 50, or set with options.readyIn) or when all sites of the seed's domain have been crawled. This is the first point where content extraction can be applied. If content is crawled from different domains, this event is not very helpful; use siteAdded or sitesChanged instead.

siteAdded is fired when a new site was added. It contains the new site as an object.

sitesChanged is fired when a new site was added. It contains the count of all sites.

finished is fired when the queue is empty. For default usage, this is the point when everything is ready.
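
A minimal sketch of wiring up these events (the handler parameter names are illustrative; the payloads follow the descriptions above):

crawler.on('ready', () => {
  // Enough sites are loaded to attempt content extraction
});

crawler.on('siteAdded', (site) => {
  // site is the newly added site object
  console.log('added:', site);
});

crawler.on('sitesChanged', (count) => {
  // count is the total number of sites seen so far
  console.log('sites so far:', count);
});

crawler.on('finished', () => {
  // The queue is empty; everything is ready
});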

API

todo

Test

Test with:

npm test

Content extraction

There are multiple algorithms used for content extraction:

Gold Miner

Gold Miner only works if at least two sites with the same template have been crawled. The extraction works by looking at the differences between the sites: nodes whose difference is greater than the mean difference of all nodes are extracted as content.
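
A toy sketch of that idea, operating on pre-computed per-node difference scores (the names and data shape are illustrative, not the package's internal API):

// Extract nodes whose difference score exceeds the mean difference of all nodes
function goldMinerExtract(nodes) {
  // nodes: [{ text: '...', difference: Number }, ...] -- an assumed shape for illustration
  const mean = nodes.reduce((sum, node) => sum + node.difference, 0) / nodes.length;
  return nodes.filter((node) => node.difference > mean).map((node) => node.text);
}

// Example: boilerplate (navigation, footer) barely differs between pages, content differs a lot
const content = goldMinerExtract([
  { text: 'Navigation', difference: 0.1 },
  { text: 'Article body of page A', difference: 0.9 },
  { text: 'Footer', difference: 0.05 }
]);
// content === ['Article body of page A']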

Link Quota Filtering + Text density + some counting

If only one site has been crawled and its content extracted, nodes are classified using link quota filtering, text density, and a count of special tags.

For questions or problems

Feel free to open an issue or send an email: [email protected]

License

AGPL-3.0