article-parser-zic

v1.7.3

Extract clean article data from given URL.

Downloads

19

article-parser

Extract main article, main image and meta data from URL.

Installation

npm install article-parser

Usage

import ArticleParser from 'article-parser';

let url = 'http://yhoo.it/1MJUFov';

ArticleParser.extract(url).then((article) => {
  console.log(article);
}).catch((err) => {
  console.log(err);
});

APIs

  • configure(Object conf)
  • extract(String url)
  • parseWithEmbedly(String url [, String EmbedlyKey])
  • parseMeta(String html, String url)
  • getArticle(String html)
  • absolutify(String baseURL, String url)
  • purify(String url)

configure(Object conf)

{
  wordsPerMinute: Number, // default 300, used to estimate reading time
  blackList: Array, // a set of domains we don't want to parse
  exceptDomain: Array, // a set of domains that will be parsed using Embedly
  adsDomain: Array, // a set of domains whose URLs often contain utm_ or pk_ parameters we want to clean
  htmlRules: Object, // passed to sanitize-html to clean HTML, refer: https://www.npmjs.com/package/sanitize-html
  SoundCloudKey: String, // used to get audio duration. Get it here: https://developers.soundcloud.com/
  YouTubeKey: String, // used to get video duration. Get it here: https://console.developers.google.com/
  EmbedlyKey: String, // used to extract with the Embedly API. Refer: http://docs.embed.ly/docs/extract
}

The default configuration should work for most cases.
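
To illustrate what the wordsPerMinute option is for, here is a minimal sketch of a reading-time estimate. It is not taken from the library's source: the function name estimateReadingTime, the whitespace-based word count, and the result in seconds are all assumptions made for this example.

```javascript
// Hypothetical helper, not part of article-parser's API.
// Estimates reading time in seconds from a plain-text string.
function estimateReadingTime(text, wordsPerMinute = 300) {
  // Count whitespace-separated tokens; an empty string yields 0 words.
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  // Convert minutes to seconds and round up to a whole second.
  return Math.ceil((words / wordsPerMinute) * 60);
}

const sample = 'word '.repeat(600);
console.log(estimateReadingTime(sample)); // 600 words at the default 300 wpm
console.log(estimateReadingTime(sample, 200)); // a slower reading speed
```

A lower wordsPerMinute value gives a longer estimate for the same text, which is the only effect this sketch models.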

extract(String url)

Extract article data from specified url.

var ArticleParser = require('article-parser');

var url = 'http://yhoo.it/1MJUFov';

ArticleParser.extract(url).then((article) => {
  console.log(article);
}).catch((err) => {
  console.log(err);
});

Now article would be something like this:

{
  alias: 'how-to-stay-calm-when-you-know-you-ll-be-stressed-daniel-levitin-ted-talks-1449068980884',
  url: 'https://www.youtube.com/watch?v=8jPQjjsBbIc',
  canonicals: [ 'https://www.youtube.com/watch?v=8jPQjjsBbIc' ],
  title: 'How to Stay Calm When You Know You\'ll Be Stressed | Daniel Levitin | TED Talks',
  description: 'You\'re not at your best when you\'re stressed. In fact, your brain has evolved over millennia to release cortisol in stressful situations, inhibiting...',
  image: 'https://i.ytimg.com/vi/8jPQjjsBbIc/hqdefault.jpg',
  content: '<iframe width="480" height="270" src="https://www.youtube.com/embed/8jPQjjsBbIc?feature=oembed" frameborder="0" allowfullscreen></iframe>',
  author: 'TED',
  source: 'YouTube',
  domain: 'www.youtube.com',
  duration: 741,
  publishedTime: '2013-11-12T19:57:40+00:00'
}

parseWithEmbedly(String url [, String EmbedlyKey])

Extract article data from the specified URL using the Embedly Extract API.

The second parameter is optional. If you've already added your Embedly key via the configure() method, you can omit it here.

var ArticleParser = require('article-parser');

var url = 'http://yhoo.it/1MJUFov';

ArticleParser.parseWithEmbedly(url).then((article) => {
  console.log(article);
}).catch((err) => {
  console.log(err);
});

parseMeta(String html, String url)

Get metadata from a webpage's HTML.

var ArticleParser = require('article-parser');
var fetch = require('node-fetch');

var url = 'https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6';

fetch(url).then((res) => {
  return res.text();
}).then((html) => {
  let metaData = ArticleParser.parseMeta(html, url);
  return metaData;
}).then((metaData) => {
  console.log(metaData);
}).catch((err) => {
  console.log(err);
});

Now metaData would be something like this:

{
  url: 'https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6',
  canonical: 'https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6',
  title: 'Setup Rocket Chat within 10 minutes',
  description: 'Do you want to get your own Slack app for your company or your team. Rocket Chat may be what you need.',
  image: 'https://cdn-images-1.medium.com/max/800/1*9IX5MWrnaCBzzeS3h5N2oA.png',
  author: '@ndaidong',
  source: 'Medium',
  publishedTime: '2013-11-12T19:57:40+00:00'
}
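
Fields such as title, description and image usually come from meta tags in the page head. The sketch below shows one simple way to collect them with regular expressions. It is an illustration only and not how parseMeta is implemented: the function name readMetaTags is hypothetical, and a regex approach will miss tags that span several lines or contain a ">" inside an attribute value.

```javascript
// Hypothetical helper, not part of article-parser's API.
// Collects <meta property|name="..." content="..."> pairs into an object.
function readMetaTags(html) {
  const meta = {};
  const tags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of tags) {
    // The backreference \1 makes sure the closing quote matches the opening one,
    // so an apostrophe inside a double-quoted value is kept.
    const key = /\b(?:property|name)\s*=\s*(["'])(.*?)\1/i.exec(tag);
    const value = /\bcontent\s*=\s*(["'])(.*?)\1/i.exec(tag);
    if (key && value) {
      meta[key[2].toLowerCase()] = value[2];
    }
  }
  return meta;
}

const html = '<head>' +
  '<meta property="og:title" content="Setup Rocket Chat within 10 minutes">' +
  '<meta name="description" content="It\'s quick.">' +
  '</head>';
console.log(readMetaTags(html));
```

For real pages, a proper HTML parser is the safer choice than regular expressions.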

getArticle(String html)

Get main article content from webpage's html:

var ArticleParser = require('article-parser');
var fetch = require('node-fetch');

var url = 'https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6';

fetch(url).then((res) => {
  return res.text();
}).then((html) => {
  let content = ArticleParser.getArticle(html);
  return content;
})
.then((article) => {
  console.log(article);
})
.catch((err) => {
  console.log(err);
});

Now content would be the clean text of the main article extracted from the URL.

absolutify(String baseURL, String url)

Return an absolute url.

var ArticleParser = require('article-parser');

var imgSrc = ArticleParser.absolutify('https://www.awesome.com/articles/hello-world.html', '../images/avatar.png');
console.log(imgSrc); // https://www.awesome.com/images/avatar.png
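
For comparison, the same kind of resolution can be done with the WHATWG URL class built into Node.js. This is a sketch of the general technique, not the library's implementation; the function name resolveUrl is made up for this example.

```javascript
// Hypothetical helper, not part of article-parser's API.
// Resolves a possibly relative URL against a base URL.
function resolveUrl(baseURL, url) {
  // The URL constructor applies standard relative-reference resolution.
  return new URL(url, baseURL).href;
}

console.log(resolveUrl('https://www.awesome.com/articles/hello-world.html', '../images/avatar.png'));
// https://www.awesome.com/images/avatar.png
```

An input that is already absolute is returned in normalized form and the base is ignored.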

purify(String url)

Return a purified url.

var ArticleParser = require('article-parser');

var fullUrl = 'https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6#.98xbvjtjw?utm_medium=email&utm_source=Newsletter&utm_campaign=Autumn+Newsletter&utm_content=logo+link';
var goodURL = ArticleParser.purify(fullUrl);
console.log(goodURL); // https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6
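
As a rough illustration of this kind of cleanup, the sketch below drops the fragment and any query parameters whose names start with a tracking prefix. It is an assumption about the general approach, not the library's code; stripTracking and its default prefix list are hypothetical. In the URL above the utm_ parameters come after the "#", so they are part of the fragment and disappear along with it.

```javascript
// Hypothetical helper, not part of article-parser's API.
// Removes the fragment and tracking-style query parameters from a URL.
function stripTracking(rawUrl, prefixes = ['utm_', 'pk_']) {
  const u = new URL(rawUrl);
  u.hash = ''; // drop the fragment entirely
  // Copy the keys first so deleting does not disturb the iteration.
  for (const key of [...u.searchParams.keys()]) {
    if (prefixes.some((prefix) => key.startsWith(prefix))) {
      u.searchParams.delete(key);
    }
  }
  return u.href;
}

console.log(stripTracking('https://example.com/a?utm_source=x&id=7#frag'));
// https://example.com/a?id=7
```

Parameters that do not match a prefix, such as id above, are kept.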

Test

git clone https://github.com/ndaidong/article-parser.git
cd article-parser
npm install
npm test

License

The MIT License (MIT)