extrae

v0.1.6

A web scraping framework written in CoffeeScript

No API? No problem!

Extrae is a framework that lets you easily extract data from web pages in a structured manner.

It is written in CoffeeScript and uses Backbone.js to define classes and models for the extracted data, cheerio to provide a jQuery-like node-selection API over HTML, and request to fetch HTML from the Internet.

Install from npm:

npm install extrae

A simple example

You have some HTML you want to extract movies from. The HTML looks like this:

html = """
<html><body>
    <ul id="movies">
        <li class="movie">
            <span class="title">The Terminator</span>
            <span class="year">1984</span>
        </li>
        <li class="movie">
            <span class="title">Terminator 2: Judgment Day</span>
            <span class="year">1991</span>
        </li>
    </ul>
</body></html>
"""

Let's extract all the movies and, for each movie, its title and year. The collection of nodes, one per movie, can be extracted with the string selector #movies .movie. Each matched element is then used as the base for finding the title via the selector .title and the year via .year.
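To see what those selectors match before bringing Extrae into it, the same traversal can be sketched with cheerio alone. This is only an illustration: it assumes cheerio is installed (npm install cheerio), the variable names are made up for the example, the HTML is a trimmed copy of the snippet above, and it is not meant to describe how Extrae works internally.

```coffeescript
# Assumption: cheerio is available; Extrae itself is not used here.
cheerio = require 'cheerio'

html = """
<ul id="movies">
    <li class="movie">
        <span class="title">The Terminator</span>
        <span class="year">1984</span>
    </li>
    <li class="movie">
        <span class="title">Terminator 2: Judgment Day</span>
        <span class="year">1991</span>
    </li>
</ul>
"""

$ = cheerio.load html

# The base selector yields one node per movie; the field selectors
# are then resolved relative to each of those nodes.
movies = for element in $('#movies .movie').toArray()
    base = $(element)
    {
        title: base.find('.title').text()
        year: parseInt base.find('.year').text(), 10
    }

console.log movies
# [ { title: 'The Terminator', year: 1984 },
#   { title: 'Terminator 2: Judgment Day', year: 1991 } ]
```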

You can define a model for each movie and the attributes to extract:

Extrae = require "extrae"

class MovieModel extends Extrae.Model
# add field definitions to the MovieModel prototype
MovieModel
    .addFieldDefinition 'title', new Extrae.Fields.StringField
    .addFieldDefinition 'year',  new Extrae.Fields.NumberField

Then add the rules to extract each field. A rule consists of a string selector and a function that extracts the data. The extractor function receives the element(s) matched by the selector as its parameter, so you can use the cheerio API to pull out the data.

# add rules to the MovieModel prototype
MovieModel
    .addExtractRule 'title', new Extrae.ExtractRule '.title', ($) -> $.text()
    .addExtractRule 'year' , new Extrae.ExtractRule '.year', ($) ->  parseInt $.text(), 10
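
Because an extractor is just a function, you can try one out on its own before attaching it to a rule. The sketch below assumes, as described above, that the function receives a cheerio selection. It calls cheerio directly (assumed to be installed), and the names extractTitle, extractYear and $doc exist only for this illustration.

```coffeescript
cheerio = require 'cheerio'

# The same extractor bodies as in the rules, as standalone functions
extractTitle = ($) -> $.text()
extractYear  = ($) -> parseInt $.text(), 10

# A single movie item to test against
$doc = cheerio.load '<li class="movie"><span class="title">The Terminator</span><span class="year">1984</span></li>'

title = extractTitle $doc('.title')
year  = extractYear $doc('.year')

console.log title, typeof year   # The Terminator number
```

Note that the year extractor returns a number rather than a string, which is what the NumberField definition expects to hold.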

Next, define a collection for the movies and set its model to the MovieModel defined in the previous step:

class MovieCollection extends Extrae.Collection
    model: MovieModel

With the data layer ready, let's create a scraper to extract the data:

scraper = new Extrae.Scraper \
                # base selector for the movie items for the collection
                '#movies .movie',
                # model or collection to extract the data and be returned
                MovieCollection

Now let's work the magic:


# scraper.scrape will return a MovieCollection instance with the
# extracted data
extractedCollection = scraper.scrape html

# using the Backbone.js toJSON method on the collection we can get all the
# data as plain JavaScript objects (POJOs)
extractedCollection.toJSON()

# [
#     { "title" : "The Terminator", "year" : 1984 },
#     { "title" : "Terminator 2: Judgment Day", "year" : 1991 }
# ]

# Use the data extracted wisely.
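
Since the result is described as a Backbone.js collection, the usual Backbone query helpers should be available on it as well. The sketch below uses a plain Backbone.Collection (assuming the backbone package is installed) seeded with the output shown above, rather than a real Extrae result, so treat it as an illustration of the Backbone API only.

```coffeescript
# Assumption: backbone is available; this collection stands in for
# the one returned by scraper.scrape.
Backbone = require 'backbone'

movies = new Backbone.Collection [
    { title: 'The Terminator', year: 1984 }
    { title: 'Terminator 2: Judgment Day', year: 1991 }
]

# pluck collects one attribute from every model
titles = movies.pluck 'title'

# findWhere returns the first model matching the given attributes
first = movies.findWhere(year: 1984)

console.log titles             # [ 'The Terminator', 'Terminator 2: Judgment Day' ]
console.log first.get 'title'  # The Terminator
```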

If the HTML to parse lives at a URL, use the UrlScraper class. Its constructor takes the URL as an additional first argument, and because fetching the data is asynchronous, the results are delivered in a callback. See the example:

scraper = new Extrae.UrlScraper \
                'http://example.com/movies.html',  # url for the resource
                '#movies .movie',  # base selector for the items
                MovieCollection  # model or collection for the results

# the UrlScraper is asynchronous, so data is handled in a callback
callback = (err, response, collection) ->
    # stop early if fetching or parsing failed
    return console.error err if err
    console.log collection.toJSON()

# scrape!
scraper.scrape callback
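
If you prefer promises to callbacks, the callback shape shown above is easy to wrap. This is a sketch and not part of Extrae: scrapeAsync is a hypothetical helper, and fakeScraper is a stand-in object used only to demonstrate it without touching the network.

```coffeescript
# Hypothetical helper: wraps scraper.scrape(callback) in a Promise
scrapeAsync = (scraper) ->
    new Promise (resolve, reject) ->
        scraper.scrape (err, response, collection) ->
            if err then reject err else resolve collection

# Stand-in for a UrlScraper so the example runs offline (not an Extrae object)
fakeScraper =
    scrape: (callback) ->
        collection = { toJSON: -> [{ title: 'The Terminator', year: 1984 }] }
        callback null, { statusCode: 200 }, collection

scrapeAsync(fakeScraper).then (collection) ->
    console.log collection.toJSON()
```

With a real scraper you would pass the UrlScraper instance instead of fakeScraper and handle failures with .catch.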