scrape-brrr

v1.1.0

Published

4 years ago

Simple web page scraping

Vulnerabilities: 0 high, 0 medium, 0 low

chauchakching

scraper parser crawl

scrape-brrr

Simple web page scraping.

Install

yarn add scrape-brrr

Try it online

Usage examples

The following examples use TypeScript-style imports. For plain Node.js, use:

const { scrape } = require('scrape-brrr')

Dead-simple usage

/**
 *  <body>
 *    <div>
 *      <span>
 *        <p>sentence 1</p>
 *        <p>sentence 2</p>
 *        <p>sentence 3</p>
 *      </span>
 *    </div>
 *    <p>footer</p>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', 'div p:not(:first-child)')
// ["sentence 2", "sentence 3"]

Scrape single item

/**
 *  <body>
 *    <div>Best wof</div>
 *    <span>Largest wof</span>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', [
  {
    name: 'stats',
    selector: 'div',
  },
  {
    name: 'another-stats',
    selector: 'span',
  },
])
// { 
//   stats: "Best wof",
//   "another-stats": "Largest wof"
// }

Scrape multiple items

/**
 *  <body>
 *    <div>
 *      <span class="name">husky</span>
 *      <span class="name">golden</span>
 *    </div>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', [{
  name: 'bestWofs',
  selector: 'div .name',
  many: true
}])
// { bestWofs: ["husky", "golden"] }

Nested fields

/**
 *  <body>
 *    <div>
 *      <span class="name">husky</span>
 *    </div>
 *    <div>
 *      <span class="name">golden</span>
 *    </div>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', [{
  name: 'bestWofs',
  selector: 'div',
  many: true,
  nested: [
    {
      name: 'name',
      selector: 'span',
    }
  ]
}])
// { 
//   bestWofs: [
//     { name: "husky" }, 
//     { name: "golden" },
//   ]
// }

Extract link / HTML element attribute

/**
 *  <body>
 *    <span class="title" id="best">Best wof</span>
 *    <a href="/other-stats">other stats</a>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', [
  {
    name: 'key',
    selector: 'span',
    attr: 'id'
  },
  {
    name: 'otherLink',
    selector: 'a',
    attr: 'href'
  },
])
// { 
//   key: "best",
//   otherLink: "/other-stats"
// }
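
Scraped href values come back exactly as written in the page, so a link such as "/other-stats" is still relative. A minimal sketch of resolving it against the page URL with the standard URL class follows; toAbsoluteUrl is a hypothetical helper written for this illustration, not part of scrape-brrr.

```typescript
// Hypothetical helper (not part of scrape-brrr): resolve a scraped href
// against the URL of the page it was scraped from.
function toAbsoluteUrl(href: string, pageUrl: string): string {
  // The WHATWG URL constructor accepts relative, root-relative and absolute inputs.
  return new URL(href, pageUrl).href
}

// Same shape as the result in the example above.
const scraped = { key: 'best', otherLink: '/other-stats' }

const absoluteLink = toAbsoluteUrl(scraped.otherLink, 'http://website.com')
console.log(absoluteLink) // http://website.com/other-stats
```

An href that is already absolute passes through unchanged, so the helper can be applied to every scraped link.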

Transform

/**
 *  <body>
 *    <div>
 *      <span class="rank">1</span>
 *      <span class="name">husky</span>
 *    </div>
 *    <div>
 *      <span class="rank">2</span>
 *      <span class="name">golden</span>
 *    </div>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', [{
  name: 'best',
  selector: 'div',
  many: true,
  nested: [
    {
      name: 'rank',
      selector: '.rank',
    },
    {
      name: 'name',
      selector: '.name',
    }
  ],
  transform: arr => arr[0]
}])
// { 
//   best: { rank: "1", name: "husky" },
// }
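
A transform can also reshape the whole list instead of picking one item. The sketch below assumes, as the arr => arr[0] example suggests, that transform receives the array of nested objects and that its return value becomes the field value. byRank and the two types are hypothetical names used only for this illustration.

```typescript
// Shape of one nested item as scraped: every value is text.
type RawWof = { rank: string; name: string }
// Shape after the transform: rank parsed into a number.
type Wof = { rank: number; name: string }

// Hypothetical transform: parse ranks as numbers and sort ascending.
const byRank = (arr: RawWof[]): Wof[] =>
  arr
    .map(({ rank, name }) => ({ rank: parseInt(rank, 10), name: name.trim() }))
    .sort((a, b) => a.rank - b.rank)

// Stand-in for what the nested fields above would produce, in arbitrary order.
const sample: RawWof[] = [
  { rank: '2', name: 'golden' },
  { rank: '1', name: 'husky' },
]

console.log(byRank(sample))
// [ { rank: 1, name: 'husky' }, { rank: 2, name: 'golden' } ]
```

Under that assumption, passing transform: byRank in the field definition would yield a list sorted by numeric rank.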

Websites with JavaScript-rendered content

Set the dynamic option to load the page with Puppeteer, so that the page's JavaScript runs before the content is scraped.

/**
 *  <body>
 *    <h1>
 *      tick tok tick tok
 *    </h1>
 *    <script>
 *      document.querySelector('h1').textContent = 'boom!'
 *    </script>
 *  </body>
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', 'h1', { dynamic: true })
// ["boom!"]
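
Launching a browser is much heavier than a plain HTTP fetch, so one possible pattern is to try the static path first and fall back to dynamic: true only when nothing is found. This is a sketch under assumptions: ScrapeFn is a simplified type written for this illustration (not taken from the package's typings), and scrapeWithFallback is a hypothetical helper. It only helps when the static page has no match at all; in the example above the static h1 already exists, so the fallback would not trigger there.

```typescript
// Simplified signature for the string-selector form of scrape() used in this
// README. This type is an assumption made for the sketch.
type ScrapeFn = (
  url: string,
  selector: string,
  options?: { dynamic?: boolean },
) => Promise<string[]>

// Hypothetical helper: try the cheap static fetch first, and only request
// the Puppeteer-backed dynamic mode when the static pass finds nothing.
async function scrapeWithFallback(
  scrape: ScrapeFn,
  url: string,
  selector: string,
): Promise<string[]> {
  const staticResult = await scrape(url, selector)
  if (staticResult.length > 0) return staticResult
  return scrape(url, selector, { dynamic: true })
}
```

The scrape function is passed in as an argument, so the helper can be exercised with a stub and does not depend on how the package is imported.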

Other features

Handles non-UTF-8 response charsets from the server (e.g. the Chinese encoding Big5).
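
This README does not say how the package does this internally. As an independent illustration of what charset handling involves, the sketch below reads the charset from a Content-Type header and decodes the response bytes with the standard TextDecoder. Both function names are hypothetical.

```typescript
// Hypothetical helper: extract the charset label from a Content-Type header,
// defaulting to UTF-8 when none is declared.
function charsetFromContentType(contentType: string | undefined): string {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType ?? '')
  return match ? match[1].toLowerCase() : 'utf-8'
}

// Hypothetical helper: decode raw response bytes using the declared charset.
// TextDecoder accepts labels such as 'big5' when the runtime ships full ICU data.
function decodeBody(bytes: Uint8Array, contentType: string | undefined): string {
  return new TextDecoder(charsetFromContentType(contentType)).decode(bytes)
}

console.log(charsetFromContentType('text/html; charset=Big5')) // big5
console.log(decodeBody(new Uint8Array([0x77, 0x6f, 0x66]), 'text/html; charset=utf-8')) // wof
```

Decoding the bytes as UTF-8 regardless of the declared charset would turn Big5 text into replacement characters, which is why the label has to be read before the body is converted to a string.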

Development

yarn install

yarn test
