npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

@humanwhocodes/puppeteer-data-extractor

v0.4.0

Published

Query selector-based data extractor for Puppeteer.

Readme

Puppeteer Data Extractor

by Nicholas C. Zakas

If you find this useful, please consider supporting my work with a donation.

Description

A utility to extract data from a web page using Puppeteer.

Prerequisites

  • Node.js 12.22+

Usage

Install using npm or yarn:

npm install @humanwhocodes/puppeteer-data-extractor --save

# or

yarn add @humanwhocodes/puppeteer-data-extractor

Import into your project:

// CommonJS
const { DataExtractor } = require("@humanwhocodes/puppeteer-data-extractor");

// ESM
import { DataExtractor } from "@humanwhocodes/puppeteer-data-extractor";

API

After importing, create a new instance of DataExtractor. The constructor expects one object argument that defines the data schema.

For example:

const extractor = new DataExtractor({
    name: {
        type: "string",
        selector: "[itemprop='name']"
    },
    username: {
        type: "string",
        selector: "[itemprop='additionalName']"
    },
    stars: {
        type: "number",
        selector: "svg.octicon-star + span",
        optional: true
    }
});

// create a browser instance
const browser = await puppeteer.launch({
    args: ["--no-sandbox", "--disable-setuid-sandbox", "--no-gpu", "--disable-dev-shm-usage"],
    ignoreHTTPSErrors: true
});

// setup a page and make sure it's ready for extraction
const page = await browser.newPage();
await page.goto("https://humanwhocodes.com");
await page.waitForSelector("main");

// pull data from the page
const data = await extractor.extractFrom(page);

In the schema, the property names are the names you'd like to appear in your extract data while the property values instruct the data extractor on how to find and convert the data. In the previous example, you may end up with a data structure looking like this:

{
    "name": "Nicholas Zakas",
    "username": "nzakas",
    "stars": 60
}

Schema Types

There are several different schema types you can use.

Primitive Types

There are three primitive schema types: string, number, and boolean. Each type has three possible properties:

  1. selector (required) - the CSS selector to find the element.
  2. optional - a boolean indicating if the element can be safely skipped. When true, if the CSS selector evaluates to null then the field is set to undefined; when omitted or false, an error is thrown when the CSS selector evaluates to null.
  3. convert - a function used to convert the value into some other form. This function is run after the text is extracted and from the element and converted (for number and boolean), and before that text is inserted into the final data structure.

The primitive schema types all act the same except for how they convert the extracted value:

  • number will strip out any non-numeric characters and convert into a number. For instance, $5,000 would have the $ and , stripped and be converted into 5000.
  • boolean converts the values of true, t, yes, y and 1 into true (all of these are case insensitive); everything else is converted to false.

If you want a more specific conversion, you should use "string" and specify a convert function. Here's an example:

{
    name: {
        type: "string",
        selector: ".fullName",
        convert(value) {
            return value.split(" ");
        }
    }
}

"array" Type

The "array" type lets you specify a collection of elements whose text should be extracted and the results put into an array. Here, the selector property is expected to return more than one element, and there is an additional items property that contains another schema that will be used for each item in the array. For example:

{
    books: {
        type: "array",
        selector: "#sidebar > ul:nth-of-type(2) > li",
        items: {
            title: {
                type: "string",
                selector: "img"
            }
        }
    },
}

Here, an array is created based on the alt text of images in a list. Note that the selector for title is resolved relative to the parent selector, so img becomes #sidebar > ul:nth-of-type(2) > li img. The resulting data structure looks like this:

{
    "books": [
        { "title": "Understanding ECMAScript 6" },
        { "title": "Professional Ajax" },
        { "title": "Professional JavaScript" },
    ]
}

The elements of the array are always an object, but if you'd like them to be a primitive value, you can always provide a convert function, such as:

{
    books: {
        type: "array",
        selector: "#sidebar > ul:nth-of-type(2) > li",
        items: {
            title: {
                type: "string",
                selector: "img"
            }
        },
        convert(books) {
            return books.map(book => book.title);
        }
    },
}

"object" Type

The "object" type lets you specify a collection of properties whose text should be extracted and the results put into an object. There is an additional properties property that contains another schema. For example:

{
    book: {
        type: "object",
        selector: "#book",
        properties: {
            title: {
                type: "string",
                selector: ".title"
            },
            pubYear: {
                type: "number",
                selector: ".published-date"
            },
            author: {
                type: "string",
                selector: ".author"
            }
        }
    },
}

Here, an object is created with three properties. As with "array", the selectors inside of the property schemas are relative to the parent selector. The resulting data structure looks like this:

{
    "book": {
        "title": "Understanding ECMAScript 6",
        "pubYear": 2016,
        "author": "Nicholas C. Zakas"
    }
}

You can also use a convert function.

"custom" Type

The "custom" type lets you control exactly how data is extracted from the page by specifying an extract function. The extract receives the element indicated by selector and is executed in the context of the Puppeteer page, meaning it does not act as a closure. The element passed in is an HTMLElement instance that you can interrogate to find the data you want. Then, return a JSON-serializable value from extract. For example:

{
    archive: {
        type: "custom",
        selector: "#sidebar > ul:nth-of-type(4)",
        extract(element) {

            // this function is NOT a closure and runs in the Puppeteer page

            const result = [];

            for (const child of element.children) {
                result.push(Number(child.innerText))
            }

            return result;
        }
    },
}

You can also use a convert function with "custom", and that function does not execute inside of the Puppeteer page, so you can make further customizations to the returned data.

"switch" Type

The "switch" type lets you specify multiple possible values for the key, and the first one that matches will be the value. You do so by providing a cases array, each of which has a pattern to match (if) and a value to use (then). For example:

{
    references: {
        type: "switch",
        cases: [
            {
                if: "ol.references",
                then: {
                    type: "array",
                    selector: "ol.references > li",
                    items: {
                        name: {
                            type: "string"
                        }
                    }
                }
            },
            {
                if: "#references",
                then: {
                    type: "array",
                    selector: "#references + ol > li",
                    items: {
                        name: {
                            type: "string"
                        }
                    }
                }
            }
        ]
    }
}

In this example, the key references has two possible options. The first is to create an array based on the selector ol.references > li and the second is based on the selector #references + ol > li. If the first selector matches, then that case is executed and the second is not; if the first selector doesn't match, then the second selector is checked.

Note: If no cases match then an error is thrown.

Developer Setup

  1. Fork the repository
  2. Clone your fork
  3. Run npm install to setup dependencies
  4. Run npm test to run tests

License

Apache 2.0