
Little Scraper That Could 🚂

A little Node scraper with simple but powerful async flow-control goodness. Batteries included; features:

  • retries failed requests with exponential backoff
  • configurable parallelization
  • randomized delays for non-uniform load
  • browser header rotation to appear as a regular client browser
  • proxy support
  • 👑 the only library with support for generator functions as a source of urls (useful for APIs or websites that use pagination but won't tell you which page is the last one) — see the sketch after this list
  • can save results to a JSON file (beta feature)
  • all of the above is configurable
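
Since the generator-based URL source has no example in this README, here is a minimal sketch of what such a source might look like; the shape of each yielded item mirrors the fromUrls example below, and how exactly the generator is handed to the scraper is left to the library's API:

// a sketch of an open-ended url source: keeps producing url-with-context objects
// until the scraper stops asking, which is handy when the last page isn't known up front
function* pagedUrls () {
  let page = 0
  while (true) {
    yield {
      url: `https://www.npmjs.com/browse/star?offset=${page * 36}`,
      context: { pageNumber: page },
    }
    page += 1
  }
}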

How to use

import { log } from 'console-tools'

/* To scrape something you have to do 3 things:

1. Create an array of urls (with optional context) to scrape from
2. Define a scraping function that extracts results from a given HTML page
3. Configure and run the scraper with that function and the created array of urls

So let's do this.
*/

// #1 Let's define a function that just returns an array of urls we want to scrape from.
// We will use npm's most-starred packages listing as an example, with a simple context
// object that indicates on which page number a particular result was found.

const createUrls = () => {
  const createUrl = startFrom => `https://www.npmjs.com/browse/star?offset=${startFrom}`

  // each page had 36 packages as of Feb 20th, 2017; hardcoding the paging value for simplicity
  return [0, 36, 36 * 2, 36 * 3, 36 * 4]
    .map((startFrom, i) => ({
      url: createUrl(startFrom),
      context: { pageNumber: i },
    }))
}
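
// The array produced above looks like:
// [ { url: 'https://www.npmjs.com/browse/star?offset=0',  context: { pageNumber: 0 } },
//   { url: 'https://www.npmjs.com/browse/star?offset=36', context: { pageNumber: 1 } },
//   ...and so on for the remaining pages ]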

// #2 The scraping function is just a function that accepts an object consisting of
// the HTTP response and the url with its context. It will be called for each url created above.
// Let's use cheerio to scrape package names from npm's most-starred listing.

import cheerio from 'cheerio' // jQuery-like HTML manipulation

const scrapeTopStarredPackagesFromNpm = ({ response, urlWithContext }) => {
  const $ = cheerio.load(response.body)

  const $results = $('.package-widget .package-details .name')

  const values = $results
    .toArray()
    .map(elem => $(elem).text())

  log.done(urlWithContext.url)
  return values
}

// #3 Let's configure the scraper and run it

import createScraper from 'little-scraper'

const scrape = createScraper({
  scrapingFunc: scrapeTopStarredPackagesFromNpm,
  concurrency: 2,
  delay: 1000,
})

// now that everything is ready, starting the crawl is as simple as
async function runScraping () {
  const results = await scrape({ fromUrls: createUrls() })
  log.json(results)
}

export default runScraping
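
To actually run it, import the function from an entry point (the ./scrape file name is just an assumption for this sketch):

import runScraping from './scrape'

runScraping().catch(console.error)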

Scraper Configuration

{
  // a function that will be called for every result returned by requesting each url passed to the scraper
  scrapingFunc = (() => []),
  // default (pessimistic) delay between scraping function calls
  delay = 1000,

  // number of times to retry a request if it fails or returns a status code not listed in successStatusCodes
  retryAttempts = 3,
  // default retry delay is double the delay between requests, plus one second
  retryDelay = delay * 2 + 1000,
  // if the response status code is not one of those listed here, the request is treated as failed
  successStatusCodes = [200],
  // randomizes the given delay so it falls within the [x/2, x*2] range
  randomizeDelay = true,
  // 1 is equal to sequential execution
  concurrency = 1,
  // default name for the file used to store resulting data; this name is also used for the intermediate
  // files created to cache results when "cacheIntermediateResultsToFile" is set to true
  fileName = 'tmp',
  // for long-running scraping processes it's useful to dump results to files, so if something fails we don't lose
  // progress
  cacheIntermediateResultsToFile = false,
  writeResultsToFile = false,
  proxyUrl = ''
}
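
As a quick illustration of how these options combine, here is a sketch reusing the scraping function from the example above; the values are illustrative, and anything omitted keeps the defaults listed here:

import createScraper from 'little-scraper'

const scrape = createScraper({
  scrapingFunc: scrapeTopStarredPackagesFromNpm,
  concurrency: 4,                  // up to 4 requests in flight at once
  delay: 2000,                     // randomizeDelay is on by default, so the actual delay lands in [1000, 4000]
  retryAttempts: 5,                // retry failed requests up to 5 times with exponential backoff
  successStatusCodes: [200, 304],  // treat 304 responses as success too
  writeResultsToFile: true,        // save results to a JSON file (beta feature)
  fileName: 'top-starred',         // used instead of the default 'tmp' for the results file
})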