htmlgrabr

v1.1.1

Published

4 years ago

A Node.js library to grab and clean HTML content.

ncarlier

HTMLGrabr library

A Node.js library to grab and clean HTML content.

Features

- Extract page content from a URL (HTMLGrabr.grabUrl(url: URL): GrabbedPage)
- Extract page content from a string (HTMLGrabr.grab(s: string): GrabbedPage)
- Extract Open Graph properties
- Clean the page content:
  - Extract the main HTML content using mozilla-readability
  - Sanitize the HTML content using DOMPurify, with some extras:
    - Remove unwanted links and images
    - Remove tracking pixels
    - Remove unwanted attributes (such as style, class, id, ...)
    - And more
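
To make one of these cleaning steps concrete, the sketch below shows a common heuristic for spotting tracking pixels: treating images declared as 1x1 or smaller as trackers. It is an illustration only, not htmlgrabr's actual implementation; the `isTrackingPixel` name and the attribute-object shape are assumptions.

```javascript
// Hypothetical heuristic, NOT taken from htmlgrabr's source.
// `attrs` is a plain object holding an <img> element's attributes,
// e.g. { src: '...', width: '1', height: '1' }.
function isTrackingPixel(attrs) {
  const width = parseInt(attrs.width, 10)
  const height = parseInt(attrs.height, 10)
  // Missing or non-numeric sizes parse to NaN; comparisons with NaN are
  // false, so images without explicit dimensions are kept.
  return width <= 1 && height <= 1
}

// Made-up sample data for illustration.
const images = [
  { src: 'https://example.com/photo.jpg', width: '640', height: '480' },
  { src: 'https://example.com/pixel.gif', width: '1', height: '1' },
  { src: 'https://example.com/logo.png' }
]

const kept = images.filter(img => !isTrackingPixel(img))
console.log(kept.map(img => img.src))
```

A real sanitizer would likely combine several signals (host block lists, hidden styles, and so on); declared size alone is only the simplest one.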

Usage

npm install --save htmlgrabr

Then, in your code:

const HTMLGrabr = require('htmlgrabr').HTMLGrabr
const { URL } = require('url')

const grabber = new HTMLGrabr()

grabber.grabUrl(new URL('https://about.readflow.app'))
  .then(page => {
    console.log(page)
  }, err => {
    console.error(err)
  })
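
The same call can also be written with async/await. The following is a minimal sketch: `grabTitle` is a hypothetical helper name, and it takes the grabber as a parameter so that it works with any object exposing `grabUrl(url)` as used above.

```javascript
const { URL } = require('url')

// Hypothetical helper (not part of htmlgrabr): grabs a page and returns its title.
// `grabber` is any object whose grabUrl(url: URL) returns a Promise,
// such as the HTMLGrabr instance created above.
async function grabTitle(grabber, href) {
  try {
    const page = await grabber.grabUrl(new URL(href))
    return page.title
  } catch (err) {
    // Same error handling as the .then() version: log the error, then give up.
    console.error(err)
    return null
  }
}

// Usage, assuming htmlgrabr is installed:
// const { HTMLGrabr } = require('htmlgrabr')
// grabTitle(new HTMLGrabr(), 'https://about.readflow.app').then(console.log)
```

Passing the grabber in, rather than creating it inside the helper, keeps the function easy to exercise with a stub object in tests.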

API

Create a new instance:

const HTMLGrabr = require('htmlgrabr').HTMLGrabr
const grabber = new HTMLGrabr(config)

Configuration object:

interface GrabberConfig {
  debug?: boolean                     // Print debug logs if true
  pretty?: boolean                    // Beautify HTML content if true
  isBlockedHost?: BlockedHostCtrlFunc // Function used to detect unwanted URLs
  rewriteURL?: URLRewriterFunc        // Function used to rewrite HTML src attributes
  rules?: Map<string, Rule>           // Rule definitions (see below)
  headers?: Headers                   // HTTP headers to set
}
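
As an illustration, a configuration object might be assembled as follows. This is a sketch under assumptions: the signatures of `BlockedHostCtrlFunc` and `URLRewriterFunc` are not shown here, so the code assumes the first receives a hostname and returns a boolean, and the second receives a URL string and returns the rewritten URL string. The host names and the proxy address are made up.

```javascript
// Hypothetical block list; the host names are examples only.
const blockedHosts = new Set(['ads.example.com', 'tracker.example.net'])

// ASSUMPTION: called with a hostname, returns true when it should be dropped.
function isBlockedHost(hostname) {
  return blockedHosts.has(hostname.toLowerCase())
}

// ASSUMPTION: called with the original src URL, returns the URL to use instead.
// Here every resource is routed through a made-up proxy endpoint.
function rewriteURL(src) {
  return 'https://proxy.example.org/?url=' + encodeURIComponent(src)
}

const config = {
  debug: false,  // no debug logs
  pretty: true,  // beautify the resulting HTML
  isBlockedHost,
  rewriteURL
}

// With htmlgrabr installed:
// const grabber = new (require('htmlgrabr').HTMLGrabr)(config)
```

All fields are optional, so a config can carry only the options that differ from the defaults.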

Rule definition:

export interface Rule {
  selector: string             // HTML query selector
  type: 'redirect' | 'content' // Rule type:
  // - 'redirect' will use 'src' or 'href' attribute to redirect content extraction
  // - 'content' to specify content to extract
}
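
For example, the `rules` map could be built like this. What the map keys represent is not stated here; the sketch assumes they are hostnames, and the selectors and host names are invented for illustration. `isValidRule` is a hypothetical helper, not part of the library.

```javascript
// ASSUMPTION: map keys are hostnames; this is not documented above.
const rules = new Map([
  // 'content': extract only the matching element as the page content.
  ['blog.example.com', { selector: 'article.post-body', type: 'content' }],
  // 'redirect': follow the href/src of the matching element and extract from there.
  ['news.example.org', { selector: 'a.original-story', type: 'redirect' }]
])

// Hypothetical sanity check mirroring the Rule interface.
function isValidRule(rule) {
  return typeof rule.selector === 'string' &&
    rule.selector.length > 0 &&
    (rule.type === 'redirect' || rule.type === 'content')
}

for (const [key, rule] of rules) {
  if (!isValidRule(rule)) throw new Error(`Invalid rule for ${key}`)
}

// With htmlgrabr installed:
// const grabber = new HTMLGrabr({ rules })
```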

Grab a page:

// grabUrl returns a Promise; await it inside an async function (or use .then)
const result = await grabber.grabUrl(new URL('https://...'))

Result object:

interface GrabbedPage {
  title: string        // Page title
  url: string | null   // Source URL
  image: string | null // Page illustration
  html: string         // HTML content
  text: string         // Text content (from HTML)
  excerpt: string      // Excerpt (from meta data or HTML)
  length: number       // Read length
  images: ImageMeta[]  // Embedded image URLs
}
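
A consumer might reduce this object to a short summary. In the sketch below, `summarize` is a hypothetical helper, and the sample page is hand-written to match the interface rather than being real library output. The shape of `ImageMeta` is not shown here, so images are only counted.

```javascript
// Hypothetical helper: condenses a GrabbedPage-shaped object into one line.
function summarize(page) {
  // url and image are nullable in the interface, so handle null explicitly.
  const source = page.url === null ? 'unknown source' : page.url
  const illustrated = page.image === null ? 'no illustration' : 'illustrated'
  return `${page.title} (${source}): ${page.text.length} chars, ` +
    `${page.images.length} image(s), ${illustrated}`
}

// Hand-written sample matching the GrabbedPage interface (not real output).
const samplePage = {
  title: 'Hello',
  url: null,
  image: null,
  html: '<p>Hello world</p>',
  text: 'Hello world',
  excerpt: 'Hello world',
  length: 11,
  images: []
}

console.log(summarize(samplePage))
```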
