extrae

v0.1.6

Published

2 years ago

A web scraping framework written in coffeescript

carrasti

coffeescript web scraper spider crawler

No API? No problem!

Extrae is a framework that lets you easily extract data from web pages in a structured manner.

It is written in CoffeeScript and uses Backbone.js to define classes and models for the extracted data, cheerio to provide a jQuery-like node selection API over HTML, and request to fetch HTML from the Internet.

Install from npm:

npm install extrae

A simple example

You have some HTML you want to extract movies from. The HTML looks like this:

html = """
<html><body>
    <ul id="movies">
        <li class="movie">
            <span class="title">The Terminator</span>
            <span class="year">1984</span>
        </li>
        <li class="movie">
            <span class="title">Terminator 2: Judgment Day</span>
            <span class="year">1991</span>
        </li>
    </ul>
</body></html>
"""

Let's extract all the movies and, for each movie, its title and year. The nodes for the movies can be matched with the selector #movies .movie; each matched element is then used as the base to find the title via the selector .title and the year via .year.

You can define a model for each movie and the attributes to extract:

Extrae = require "extrae"

class MovieModel extends Extrae.Model
# add field definitions to the MovieModel prototype
MovieModel
    .addFieldDefinition 'title', new Extrae.Fields.StringField
    .addFieldDefinition 'year',  new Extrae.Fields.NumberField

And then the rules to extract every field. A rule consists of a string selector and a function that extracts the data. The extractor function receives the element(s) matched by the selector as its parameter, so you can use the cheerio API to extract data.

# add rules to the MovieModel prototype
MovieModel
    .addExtractRule 'title', new Extrae.ExtractRule '.title', ($) -> $.text()
    .addExtractRule 'year' , new Extrae.ExtractRule '.year', ($) ->  parseInt $.text(), 10
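
To make the role of a rule concrete, here is a minimal sketch of how a selector and an extractor function might be combined. It is not Extrae's implementation: fakeSelection, matched and applyRules are hypothetical stand-ins, so the snippet runs without cheerio or Extrae installed.

```coffeescript
# Hypothetical stand-in for a cheerio selection: only text() is modelled.
fakeSelection = (text) ->
  text: -> text

# Each rule pairs a selector with an extractor, mirroring the rules above.
rules =
  title:
    selector: '.title'
    extract: ($) -> $.text()
  year:
    selector: '.year'
    extract: ($) -> parseInt $.text(), 10

# Hypothetical lookup that plays the role of finding nodes under one movie item.
matched =
  '.title': fakeSelection 'The Terminator'
  '.year': fakeSelection '1984'

# Apply every rule: look up the node(s) for its selector, then run the extractor.
applyRules = (rules, find) ->
  result = {}
  for field, rule of rules
    result[field] = rule.extract find(rule.selector)
  result

movie = applyRules rules, (selector) -> matched[selector]
console.log movie
```

The year extractor is where the conversion happens: the node text is the string "1984", and parseInt with radix 10 turns it into the number 1984 before it is stored on the result.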

Next, define a collection for the movies and set the MovieModel from the previous step as its model:

class MovieCollection extends Extrae.Collection
    model: MovieModel

With the data layer ready, let's create a scraper to extract the data:

scraper = new Extrae.Scraper \
                # base selector for the movie items for the collection
                '#movies .movie',
                # model or collection to extract the data and be returned
                MovieCollection

Now let's work the magic:


# scraper.scrape will return a MovieCollection instance with the
# extracted data
extractedCollection = scraper.scrape html

# using the Backbone.js toJSON method on the collection we can get all the
# data as an array of plain JavaScript objects
extractedCollection.toJSON()

# [
#     { "title" : "The Terminator", "year" : 1984 },
#     { "title" : "Terminator 2: Judgment Day", "year" : 1991 }
# ]

# Use the data extracted wisely.

If the HTML to parse lives at a URL, use the UrlScraper class. Its constructor takes the URL as an additional first argument, and because fetching the data is asynchronous, the results are delivered to a callback. See the example:

scraper = new Extrae.UrlScraper \
                'http://example.com/movies.html',  # url for the resource
                '#movies .movie',  # base selector for the items
                MovieCollection  # model or collection for the results

# the UrlScraper is asynchronous so data is handled in a callback
callback = (err, response, collection) ->
    # report a failed fetch instead of reading the collection
    return console.error err if err
    console.log collection.toJSON()

# scrape!
scraper.scrape callback
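
If you prefer promises to callbacks, the (err, response, collection) callback shown above could be wrapped as sketched here. scrapeAsync is a hypothetical helper, not part of Extrae, and fakeScraper and fakeCollection are stand-ins with the same scrape(callback) and toJSON() shape so the snippet runs without network access. The statusCode field on the fake response is likewise an assumption.

```coffeescript
# Hypothetical helper: wrap any object exposing scrape(callback) in a Promise.
scrapeAsync = (scraper) ->
  new Promise (resolve, reject) ->
    scraper.scrape (err, response, collection) ->
      if err then reject err else resolve collection

# Stand-ins so the sketch runs offline; a real UrlScraper would go here instead.
fakeCollection =
  toJSON: -> [{ title: 'The Terminator', year: 1984 }]

fakeScraper =
  scrape: (callback) -> callback null, { statusCode: 200 }, fakeCollection

scrapeAsync(fakeScraper)
  .then (collection) -> console.log collection.toJSON()
  .catch (err) -> console.error 'scrape failed:', err
```

The wrapper resolves with the collection only, so a caller that also needs the response object would have to resolve with both values instead.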
