

Jason the Miner


Harvesting data at the <html> mine... Jason the Miner is a versatile Web scraper for Node.js.

⛏ Features

  • Composable: via a modular architecture based on pluggable processors. The output of one processor feeds the input of the next one. There are 3 types of processors:
    1. loaders: to fetch the data (via HTTP requests, by reading text files, etc.)
    2. parsers: to parse the data (HTML by default) & extract the relevant parts according to a predefined schema
    3. transformers: to transform and/or output the results (to a CSV file, via email, etc.)
  • Configurable: each processor can be chosen & configured independently
  • Extensible: you can register your own custom processors
  • CLI-friendly: Jason the Miner works well with pipes & redirections
  • Promise-based API
  • MIT-licensed

⛏ Installing

$ npm install -g jason-the-miner

⛏ Demos

Clone the project...

$ git clone https://github.com/mawrkus/jason-the-miner.git
$ cd jason-the-miner
$ npm install
$ npm run demos

...and have a look at the "demos" folder. Among the examples, you'll find scrapers for:

  • A simple GitHub search (JSON, CSV, Markdown output)
  • An extended GitHub search with issues (including following links & paginating issues)
  • Goodreads books, following links to Amazon to grab their product IDs
  • A Google search, following the results to find mobile apps
  • IMDb image gallery links (with pagination)
  • Mixcloud stats, templated & sent by mail
  • Mixcloud SPA scraping, controlling a headless browser
  • Avatar downloads
  • Bulk insertions into Elasticsearch from a CSV file
  • ...

⛏ Examples

CLI

Scraping the most popular JavaScript scrapers from GitHub:

// github-config.json
{
  "load": {
    "http": {
      "url": "https://github.com/search",
      "params": {
        "q": "scraper",
        "l": "JavaScript",
        "type": "Repositories",
        "s": "stars",
        "o": "desc"
      }
    }
  },
  "parse": {
    "html": {
      "repos": [".repo-list .repo-list-item h3 > a"]
    }
  },
  "transform": {
    "json-file": {
      "path": "./github-repos.json"
    }
  }
}

$ jason-the-miner -c github-config.json

Alternatively, with pipes & redirections:

// github-config.json
{
  "parse": {
    "html": {
      "repos": [".repo-list .repo-list-item h3 > a"]
    }
  }
}

$ curl "https://github.com/search?q=scraper&l=JavaScript&type=Repositories&s=stars&o=desc" | jason-the-miner -c github-config.json > github-repos.json

API

const JasonTheMiner = require('jason-the-miner');

const jason = new JasonTheMiner();

const load = {
  http: {
    url: "https://github.com/search",
    params: {
      q: "scraper",
      l: "JavaScript",
      type: "Repositories",
      s: "stars",
      o: "desc"
    }
  }
};

const parse = {
  html: {
    "repos": [".repo-list .repo-list-item h3 > a"]
  }
};

jason.harvest({ load, parse }).then(results => console.log(results));

⛏ The config file

{
  "load": {
    "[loader name]": {
      // loader options
    }
  },
  "parse": {
    "[parser name]": {
      // parser options
    }
  },
  "transform": {
    "[transformer name]": {
      // transformer options
    }
  }
}

Loaders

Jason the Miner comes with 5 built-in loaders:

| Name | Description | Options |
| --- | --- | --- |
| http | Uses axios as HTTP client | All axios request options + [_concurrency=1] (to limit the number of concurrent requests when following/paginating) & [_cache] (to cache responses on the file system) |
| browser | Uses puppeteer as browser | puppeteer launch, goto, screenshot, pdf and evaluate options |
| file | Reads the content of a file | path, [stream=false], [encoding="utf8"] & [_concurrency=1] (to limit the number of concurrent requests when paginating) |
| csv-file | Uses csv-parse to read a CSV file | All csv-parse options in a csv object + path + [encoding="utf8"] |
| stdin | Reads the content from the standard input | [encoding="utf8"] |

For example, an HTTP load config whose responses will be cached in the "tests/http-cache" folder:

...
"load": {
  "http": {
    "baseURL": "https://github.com",
    "url": "/search?l=JavaScript&o=desc&q=scraper&s=stars&type=Repositories",
    "_concurrency": 2,
    "_cache": {
      "_folder": "tests/http-cache"
    }
  }
}
...
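
Or, similarly, a config using the file loader to scrape a locally saved page. This is only a sketch based on the options listed in the table above; the path is an arbitrary example:

...
"load": {
  "file": {
    "path": "./downloads/github-search.html",
    "stream": false,
    "encoding": "utf8"
  }
}
...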

Check the demos folder for more examples.

Parsers

Currently, Jason the Miner comes with 2 built-in parsers:

| Name | Description | Options |
| --- | --- | --- |
| html | Parses HTML, built with Cheerio | A parse schema |
| csv | Parses CSV, built with csv-parse | All csv-parse options |
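
As an illustration, a parse config using the csv parser could look like the sketch below; the options are handed to csv-parse, and the exact option nesting should be double-checked against the demos:

...
"parse": {
  "csv": {
    "columns": true,
    "delimiter": ","
  }
}
...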

HTML schema definition

Examples
...
  "html": {
    // Single value
    "repo": ".repo-list .repo-list-item h3 > a"

    // Collection of values
    "repos": [".repo-list .repo-list-item h3 > a"]

    // Single object
    "repo": {
      "name": ".repo-list .repo-list-item h3 > a",
      "description": ".repo-list .repo-list-item div:first-child"
    }

    // Single object, providing a root selector _$
    "repo": {
      "_$": ".repo-list .repo-list-item",
      "name": "h3 > a",
      "description": "div:first-child"
    }

    // Collection of objects
    "repos": [{
      "_$": ".repo-list .repo-list-item",
      "name": "h3 > a",
      "description": "div:first-child"
    }]

    // Following
    "repos": [{
      "_$": ".repo-list .repo-list-item",
      "name": "h3 > a",
      "description": "div:first-child",
      "_follow": {
        "_link": "h3 > a",
        "stats": {
          "_$": ".pagehead-actions",
          "watchers": "li:nth-child(1) a.social-count",
          "stars": "li:nth-child(2) a.social-count",
          "forks": "li:nth-child(3) a.social-count"
        }
      }
    }]

    // Paginating
    "repos": [{
      "_$": ".repo-list .repo-list-item",
      "name": "h3 > a",
      "description": "div:first-child",
      "_paginate": {
        "_link": ".pagination > a[rel='next']",
        "_depth": 1
      }
    }]
  }
...

Full flavour

...
  "html": {
    "title": "title | trim",
    "metas": {
      "lang": "html < attr(lang)",
      "content-type": "meta[http-equiv='Content-Type'] < attr(content)"
    },
    "stylesheets": ["link[rel='stylesheet'] < attr(href)"],
    "repos": [{
      "_$": ".repo-list .repo-list-item ? text(crawler)",
      "_slice": "0,3",
      "name": "h3 > a",
      "last-update": "relative-time < attr(datetime)",
      "_follow": {
        "_link": "h3 > a",
        "description": "meta[property='og:description'] < attr(content) | trim",
        "url": "link[rel='canonical'] < attr(href)",
        "stats": {
          "_$": ".pagehead-actions",
          "watchers": "li:nth-child(1) a.social-count | trim",
          "stars": "li:nth-child(2) a.social-count | trim",
          "forks": "li:nth-child(3) a.social-count | trim"
        },
        "_follow": {
          "_link": ".js-repo-nav span[itemprop='itemListElement']:nth-child(2) > a",
          "open-issues": [{
            "_$": ".js-navigation-container li > div > div:nth-child(3)",
            "desc": "a:first-child | trim",
            "opened": "relative-time < attr(datetime)"
          }],
          "_paginate": {
            "_link": "a[rel='next']",
            "_slice": "0,1",
            "_depth": 2
          }
        }
      }
    }],
  }
...

As you can see, a schema is a plain object that recursively defines:

  • the names of the values/collection of values that you want to extract: "title" (single value), "metas" (object), "stylesheets" (collection of values), "repos" (collection of objects)
  • how to extract them: [selector] ? [matcher] < [extractor] | [filter] (check "Parse helpers" below)

Additional instructions can be passed to the parser:

  • _$ acts as a root selector: further parsing will happen in the context of the element identified by this selector
  • _slice limits the number of elements to parse, like String.prototype.slice(begin[, end])
  • _follow tells Jason to follow a single link (fetch new data) & to continue scraping after the new data is received
  • _paginate tells Jason to paginate (fetch & scrape new data) & to merge the new values in the current context, here multiple links can be selected to scrape in parallel multiple pages

Parse helpers

The following syntax specifies how to extract a value:

[property name]: [selector] ? [matcher] < [extractor] | [filter]

For instance:

...
"repos": [".repo-list-item h3 > a ? text(crawler) < attr(title) | trim"]
...

This will extract a "repos" array of values from the links identified by the ".repo-list-item h3 > a" selector, matching only the ones containing the text "crawler". The values will be retrieved from the "title" attribute of each link and will be trimmed.

Jason has 4 built-in element matchers:

  • text(regexString)
  • html(regexString)
  • attr(attributeName,regexString)
  • slice(begin,end)

They are used to test an element in order to decide whether to include it in or discard it from parsing. If not specified, Jason includes every element.
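
For instance, an attr() matcher that keeps only the links whose "href" attribute contains "node" (an illustrative selector and pattern, not taken from the demos):

...
"repos": [".repo-list-item h3 > a ? attr(href,node)"]
...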

7 built-in text extractors:

  • text([optionalStaticText]) (by default)
  • html()
  • attr(attributeName)
  • regex(regexString)
  • date(inputFormat,outputFormat) (parses a date with moment)
  • uuid() (generates a uuid v1 with uuid)
  • count() (counts the number of elements matching the selector, needs an array schema definition)

and 5 built-in text filters:

  • trim
  • single-space
  • lowercase
  • uppercase
  • json-parse (to parse JSON, like JSON-LD)
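
Putting a matcher, an extractor and a filter together (an illustrative schema line reusing the GitHub selector from the examples above):

...
"repos": [".repo-list-item h3 > a ? text(crawler) < attr(href) | lowercase"]
...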

Transformers

| Name | Description | Options |
| --- | --- | --- |
| stdout | Writes the results to stdout | [encoding="utf8"] |
| json-file | Writes the results to a JSON file | path & [encoding="utf8"] |
| csv-file | Writes the results to a CSV file using csv-stringify | csv: same as csv-stringify + path, [encoding='utf8'] and [append=false] (whether to append the results to an existing file or not) |
| download-file | Downloads files to a given folder using axios | [baseURL], [parseKey], [folder='.'], [namePattern='{name}'], [maxSizeInMb=1] & [concurrency=1] |
| email | Sends the results by email using nodemailer | Same as nodemailer, split between the smtp and message options |

Jason supports a single transformer or an array of transformers:

{
  ...
  "transform": [{
    "json-file": {
      "path": "./github-repos.json"
    }
  }, {
    "csv-file": {
      "path": "./github-repos.csv"
    }
  }]
}

⛏ Bulk processing

Parameters can be defined in a CSV file and applied to configure the processors:

{
  "bulk": {
    "csv-file": {
      "path": "./github-search-queries.csv",
      "csv": {
        "columns": true,
        "delimiter": ","
      }
    }
  },
  "load": {
    "http": {
      "baseURL": "https://github.com",
      "url": "/search?l={language}&o=desc&q={query}&s=stars&type=Repositories",
      "_concurrency": 2
    }
  },
  "parse": {
    "html": {
      "title": "< text(Best {language} repos)",
      "repos": [".repo-list .repo-list-item h3 > a"]
    }
  },
  "transform": {
    "json-file": {
      "path": "./github-repos-{language}.json"
    }
  }
}

github-search-queries.csv:

language,query
JavaScript,scraper
Python,scraper

⛏ API

constructor({ fallbacks = {} } = {})

fallbacks defines which processor to use when not explicitly configured (or missing in the config file):

  • load: 'identity',
  • parse: 'identity',
  • transform: 'identity',
  • bulk: null

The fallbacks change when using the CLI (see bin/jason-the-miner.js):

  • load: 'stdin',
  • parse: 'html',
  • transform: 'stdout',
  • bulk: null
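
For instance, to make the http loader and the json-file transformer the defaults whenever a config omits them (a sketch using the built-in processor names documented above):

const JasonTheMiner = require('jason-the-miner');

const jason = new JasonTheMiner({
  fallbacks: {
    load: 'http',
    transform: 'json-file'
  }
});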

loadConfig(configFile)

Loads a config from a JSON or JS file.

jason.loadConfig('./harvest-me.json');
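
A JS config file presumably exports the same structure as its JSON counterpart (a sketch assuming a plain CommonJS export; the file name is illustrative):

// harvest-me.js
module.exports = {
  load: { http: { url: 'https://github.com/search?q=scraper&l=JavaScript&type=Repositories' } },
  parse: { html: { repos: ['.repo-list .repo-list-item h3 > a'] } },
  transform: { 'json-file': { path: './github-repos.json' } }
};

jason.loadConfig('./harvest-me.js');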

harvest({ bulk, load, parse, transform } = {})

Launches the harvesting process:

jason
  .loadConfig('./config.json')
  .then(() => jason.harvest())
  .catch(error => console.error(error));

You can pass custom options to temporarily override the current config:

jason
  .loadConfig('./config.json')
  .then(() => jason.harvest({
    load: {
      http: {
        url: "https://github.com/search?q=scraper&l=Python&type=Repositories"
      }
    }
  }))
  .catch(error => console.error(error));

To permanently override the current config, you can modify Jason's config property:

const allResults = [];

jason
  .loadConfig('./harvest-me.json')
  .then(() => jason.harvest())
  .then((results) => {
    allResults.push(results);

    jason.config.load.http.url = 'https://github.com/search?q=scraper&l=Python&type=Repositories';

    return jason.harvest();
  })
  .then((results) => {
    allResults.push(results);
  })
  .catch(error => console.error(error));

registerHelper({ category, name, helper })

Registers a parse helper in one of the 3 categories: match, extract or filter. helper must be a function.

const url = require('url');

jason.registerHelper({
  category: 'filter',
  name: 'remove-query-params',
  helper: (href = '') => {
    if (!href || href === '#') {
      return href;
    }

    const { protocol, host, pathname } = url.parse(href);

    return `${protocol}//${host}${pathname}`;
  }
});
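
Once registered, the helper can be used in a schema like any built-in one (an illustrative line):

...
"url": ".repo-list-item h3 > a < attr(href) | remove-query-params"
...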

registerProcessor({ category, name, processor })

Registers a new processor in one of the 3 categories: load, parse or transform. processor must be a class implementing the run() method:

jason.registerProcessor({
  category: 'transform',
  name: 'template',
  processor: Templater
});

class Templater {
  constructor({ config }) {
    // automatically receives its config
  }

  /**
   * @param {*} results
   * @return {Promise.<*>}
   */
  run({ results }) {
    // must be implemented & must return a promise.
  }
}

jason.config.transform = {
  template: {
    "templatePath": "my-template.tpl",
    "outputPath": "my-page.html"
  }
};

Be aware that loaders must also implement the getConfig() and buildLoadOptions({ link }) methods. Have a look at the source code for more info.
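
As a rough sketch of what that could look like (the exact arguments and return values expected by Jason are not documented here, so treat this as an assumption and verify against the lib sources):

class MyFixtureLoader {
  constructor({ config }) {
    // automatically receives its config, like any other processor
    this._config = config;
  }

  getConfig() {
    // expected on loaders; presumably exposes the current load config
    return this._config;
  }

  buildLoadOptions({ link }) {
    // expected on loaders; presumably builds the options used to load a
    // followed/paginated link
    return { ...this._config, url: link };
  }

  /**
   * @return {Promise.<string>}
   */
  run() {
    // must return a promise resolving with the loaded data
    return Promise.resolve('<html><body>Hello, mine!</body></html>');
  }
}

jason.registerProcessor({
  category: 'load',
  name: 'fixture',
  processor: MyFixtureLoader
});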

⛏ Testing

$ git clone https://github.com/mawrkus/jason-the-miner.git
$ cd jason-the-miner
$ npm install
$ npm run test

⛏ Resources

  • Web Scraping With Node.js: https://www.smashingmagazine.com/2015/04/web-scraping-with-nodejs/
  • X-ray, The next web scraper. See through the noise: https://github.com/lapwinglabs/x-ray
  • Simple, lightweight & expressive web scraping with Node.js: https://github.com/eeshi/node-scrapy
  • Node.js Scraping Libraries: http://blog.webkid.io/nodejs-scraping-libraries/
  • https://www.scrapesentry.com/scraping-wiki/web-scraping-legal-or-illegal/
  • http://blog.icreon.us/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/
  • Web scraping o rastreo de webs y legalidad: https://www.youtube.com/watch?v=EJzugD0l0Bw
  • Scraper API blog: https://www.scraperapi.com/blog/

⛏ A final note...

Please take these guidelines into consideration when scraping, and make sure that:

  • The content being scraped is not copyright protected.
  • The act of scraping does not burden the services of the site being scraped.
  • The scraper does not violate the Terms of Use of the site being scraped.
  • The scraper does not gather sensitive user information.
  • The scraped content adheres to fair use standards.