

article-parser

Extract main article, main image and meta data from URL.


Installation

# npm
$ npm install article-parser

# pnpm
$ pnpm install article-parser

# yarn
$ yarn add article-parser

Usage

const { extract } = require('article-parser')

// es6 module syntax
import { extract } from 'article-parser'

// test
const url = 'https://dev.to/ndaidong/how-to-make-your-mongodb-container-more-secure-1646'

extract(url).then((article) => {
  console.log(article)
}).catch((err) => {
  console.trace(err)
})

Result:

{
  url: 'https://dev.to/ndaidong/how-to-make-your-mongodb-container-more-secure-1646',
  title: 'How to make your MongoDB container more secure?',
  description: 'Start it with docker   The most simple way to get MongoDB instance in your machine is using...',
  links: [
    'https://dev.to/ndaidong/how-to-make-your-mongodb-container-more-secure-1646'
  ],
  image: 'https://res.cloudinary.com/practicaldev/image/fetch/s--qByI1v3K--/c_imagga_scale,f_auto,fl_progressive,h_500,q_auto,w_1000/https://dev-to-uploads.s3.amazonaws.com/i/p4sfysev3s1jhw2ar2bi.png',
  content: '...', // full article content here
  author: '@ndaidong',
  source: 'dev.to',
  published: '',
  ttr: 162
}

APIs

extract(String url | String html)

Load and extract article data. Return a Promise object.

Example:

const { extract } = require('article-parser')

const getArticle = async (url) => {
  try {
    const article = await extract(url)
    return article
  } catch (err) {
    console.trace(err)
    return null
  }
}

getArticle('https://domain.com/path/to/article')

If the extraction works well, you should get an article object with a structure like the one below:

{
  "url": URI String,
  "title": String,
  "description": String,
  "image": URI String,
  "author": String,
  "content": HTML String,
  "published": Date String,
  "source": String, // original publisher
  "links": Array, // list of alternative links
  "ttr": Number, // time to read in seconds, 0 = unknown
}
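The `ttr` field lends itself to a "time to read" badge in a UI. The helper below is an illustrative sketch, not part of article-parser; it only formats the `ttr` value described above:

```javascript
// Illustrative helper (not part of article-parser): render the `ttr`
// field, which is time-to-read in seconds, with 0 meaning unknown.
const formatTTR = (ttr) => {
  if (!ttr || ttr <= 0) {
    return 'unknown reading time'
  }
  // round to whole minutes, but never show "0 min read"
  const minutes = Math.max(1, Math.round(ttr / 60))
  return `${minutes} min read`
}

console.log(formatTTR(162)) // the ttr from the sample result above
console.log(formatTTR(0))
```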

addQueryRules(Array queryRules)

Add custom rules to extract the main article from specific domains.

This can be useful when the default extraction algorithm fails, or when you want to remove some parts of the main article content.

Example:

const { addQueryRules, extract } = require('article-parser')

// extractor doesn't work for you!
extract('https://bad-website.domain/page/article')

// add some rules for bad-website.domain
addQueryRules([
  {
    patterns: [
      /http(s?):\/\/bad-website.domain\/*/
    ],
    selector: '#noop_article_locates_here',
    unwanted: [
      '.advertise-area',
      '.stupid-banner'
    ]
  }
])

// extractor will try to find article at `#noop_article_locates_here`

// call it again, hopefully it works for you now :)
extract('https://bad-website.domain/page/article')
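The `patterns` entries are regular expressions checked against the page URL. As a quick standalone sanity check (assuming a plain regex test is what decides whether a rule applies), you can verify that the pattern from the rule above matches your target URLs before registering it:

```javascript
// Standalone check: does the rule's pattern match the target URL?
// This only exercises the regex itself; the actual rule dispatch
// happens inside article-parser.
const pattern = /http(s?):\/\/bad-website.domain\/*/

console.log(pattern.test('https://bad-website.domain/page/article')) // true
console.log(pattern.test('https://another.domain/page/article')) // false
```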

While adding rules, you can specify a transform() function to fine-tune the article content more thoroughly.

Example rule with transformation:

const { addQueryRules } = require('article-parser')

addQueryRules([
  {
    patterns: [
      /http(s?):\/\/bad-website.domain\/*/
    ],
    selector: '#article_id_here',
    transform: ($) => {
      // $ is cheerio's DOM instance containing the article content,
      // so you can do everything cheerio supports
      // for example, here we replace all <h1></h1> with <b></b>
      $('h1').replaceWith(function () {
        const h1Html = $(this).html()
        return `<b>${h1Html}</b>`
      })
      // at the end, you must return $
      return $
    }
  }
])

Please refer to cheerio's docs for more info.

Configuration methods

In addition, this lib provides some methods to customize the default settings. Don't change them unless you have a reason to.

  • getParserOptions()
  • setParserOptions(Object parserOptions)
  • getRequestOptions()
  • setRequestOptions(Object requestOptions)
  • getSanitizeHtmlOptions()
  • setSanitizeHtmlOptions(Object sanitizeHtmlOptions)

Here are the default properties and values:

Object parserOptions:

{
  wordsPerMinute: 300, // to estimate "time to read"
  urlsCompareAlgorithm: 'levenshtein', // to find the best url from list
  descriptionLengthThreshold: 40, // min num of chars required for description
  descriptionTruncateLen: 156, // max num of chars generated for description
  contentLengthThreshold: 200 // content must have at least 200 chars
}

Read the string-comparison docs for more info about urlsCompareAlgorithm.
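To illustrate what the `'levenshtein'` algorithm measures when picking the best URL from a list, here is a minimal standalone edit-distance implementation. This is an illustrative sketch only; article-parser actually delegates to the string-comparison package:

```javascript
// Minimal Levenshtein edit distance: the number of single-character
// insertions, deletions and substitutions needed to turn a into b.
// Illustrative only; article-parser uses the string-comparison package.
const levenshtein = (a, b) => {
  // dp[i][j] = distance between the first i chars of a and first j of b
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  )
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + cost // substitution
      )
    }
  }
  return dp[a.length][b.length]
}

// Closer URLs score lower, so the best candidate is the minimum.
console.log(levenshtein('https://site.com/post', 'https://site.com/post/')) // 1
```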

Object requestOptions:

{
  headers: {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0',
    accept: 'text/html; charset=utf-8'
  },
  responseType: 'text',
  responseEncoding: 'utf8',
  timeout: 6e4,
  maxRedirects: 3
}

Read axios' request config for more info.
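To override a single request option without losing the rest, one approach is to build the full object yourself and pass it to setRequestOptions(). This is a sketch: whether setRequestOptions merges or replaces the defaults is not stated here, so copying the defaults (taken from the table above) is shown as the safe route:

```javascript
// Build a custom requestOptions object by copying the defaults and
// overriding only what you need, then pass it to setRequestOptions().
// The default values below are copied from the documentation above.
const defaultRequestOptions = {
  headers: {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0',
    accept: 'text/html; charset=utf-8'
  },
  responseType: 'text',
  responseEncoding: 'utf8',
  timeout: 6e4,
  maxRedirects: 3
}

const customRequestOptions = {
  ...defaultRequestOptions,
  timeout: 3e4, // fail faster: 30s instead of 60s
  headers: {
    ...defaultRequestOptions.headers,
    'user-agent': 'my-crawler/1.0' // identify your own client (hypothetical name)
  }
}

// then: setRequestOptions(customRequestOptions)
console.log(customRequestOptions.timeout) // 30000
```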

Object sanitizeHtmlOptions:

{
  allowedTags: [
    'h1', 'h2', 'h3', 'h4', 'h5',
    'u', 'b', 'i', 'em', 'strong', 'small', 'sup', 'sub',
    'div', 'span', 'p', 'article', 'blockquote', 'section',
    'details', 'summary',
    'pre', 'code',
    'ul', 'ol', 'li', 'dd', 'dl',
    'table', 'th', 'tr', 'td', 'thead', 'tbody', 'tfoot',
    'fieldset', 'legend',
    'figure', 'figcaption', 'img', 'picture',
    'video', 'audio', 'source',
    'iframe',
    'progress',
    'br', 'p', 'hr',
    'label',
    'abbr',
    'a',
    'svg'
  ],
  allowedAttributes: {
    a: ['href', 'target', 'title'],
    abbr: ['title'],
    progress: ['value', 'max'],
    img: ['src', 'srcset', 'alt', 'width', 'height', 'style', 'title'],
    picture: ['media', 'srcset'],
    video: ['controls', 'width', 'height', 'autoplay', 'muted'],
    audio: ['controls'],
    source: ['src', 'srcset', 'data-srcset', 'type', 'media', 'sizes'],
    iframe: ['src', 'frameborder', 'height', 'width', 'scrolling'],
    svg: ['width', 'height']
  },
  allowedIframeDomains: ['youtube.com', 'vimeo.com']
}

Read sanitize-html docs for more info.

Test

git clone https://github.com/ndaidong/article-parser.git
cd article-parser
npm install
npm test

# quick evaluation
npm run eval {URL_TO_PARSE_ARTICLE}

License

The MIT License (MIT)