article-parser-zic

v1.7.3

Published

2 years ago

Extract clean article data from given URL.

Downloads

0High
0Medium
0Low

zicjin

article extractor parser readability util

article-parser

Extract main article, main image and meta data from URL.

Installation

npm install article-parser

Usage

import ArticleParser from 'article-parser';

let url = 'http://yhoo.it/1MJUFov';

ArticleParser.extract(url).then((article) => {
  console.log(article);
}).catch((err) => {
  console.log(err);
});

APIs

configure(Object conf)
extract(String url)
parseWithEmbedly(String url [, String EmbedlyKey])
parseMeta(String html, String url)
getArticle(String html)
absolutify(String baseURL, String url)
purify(String url)

configure(Object conf)

{
  wordsPerMinute: Number, // default 300, use to estimate time to read
  blackList: Array, // a set of domain we don't want to parse
  exceptDomain: Array, // a set of domain that will be parsed using Embedly
  adsDomain: Array, // a set of domain that often contains utm_, pk_ in URLs we want to clean
  htmlRules: Object, // passed to sanitize-html to clean HTML, refer: https://www.npmjs.com/package/sanitize-html
  SoundCloudKey: String, // use to get audio duration. Get it here https://developers.soundcloud.com/
  YouTubeKey: String, // use to get video duration. Get it here https://console.developers.google.com/,
  EmbedlyKey: String, // use to extract with Embedly API. Refer http://docs.embed.ly/docs/extract
}

Default configurations may work for most case.

extract(String url)

Extract article data from specified url.

var ArticleParser = require('article-parser');

var url = 'http://yhoo.it/1MJUFov';

ArticleParser.extract(url).then((article) => {
  console.log(article);
}).catch((err) => {
  console.log(err);
});

Now article would be something like this:

{
  alias: 'how-to-stay-calm-when-you-know-you-ll-be-stressed-daniel-levitin-ted-talks-1449068980884',
  url: 'https://www.youtube.com/watch?v=8jPQjjsBbIc',
  canonicals: [ 'https://www.youtube.com/watch?v=8jPQjjsBbIc' ],
  title: 'How to Stay Calm When You Know You\'ll Be Stressed | Daniel Levitin | TED Talks',
  description: 'You\'re not at your best when you\'re stressed. In fact, your brain has evolved over millennia to release cortisol in stressful situations, inhibiting...',
  image: 'https://i.ytimg.com/vi/8jPQjjsBbIc/hqdefault.jpg',
  content: '<iframe width="480" height="270" src="https://www.youtube.com/embed/8jPQjjsBbIc?feature=oembed" frameborder="0" allowfullscreen></iframe>',
  author: 'TED',
  source: 'YouTube',
  domain: 'www.youtube.com',
  duration: 741,
  publishedTime: '2013-11-12T19:57:40+00:00'
}

parseWithEmbedly(String url [, String EmbedlyKey])

Extract article data from specified url using Embedly Extract API:

The second parameter is optional. If you've added your Embedly key via configure() method, you can ignore it here.

var ArticleParser = require('article-parser');

var url = 'http://yhoo.it/1MJUFov';

ArticleParser.parseWithEmbedly(url).then((article) => {
  console.log(article);
}).catch((err) => {
  console.log(err);
});

parseMeta(String html, String url)

Get meta data from webpage's html.

var ArticleParser = require('article-parser');
var fetch = require('node-fetch');

var url = 'https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6';

fetch(url).then((res) => {
  return res.text();
}).then((html) => {
  let metaData = ArticleParser.parseMeta(html, url);
  return metaData;
});

Now metaData would be something like this:

{
  url: 'https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6',
  canonical: 'https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6',
  title: 'Setup Rocket Chat within 10 minutes',
  description: 'Do you want to get your own Slack app for your company or your team. Rocket Chat may be what you need.',
  image: 'https://cdn-images-1.medium.com/max/800/1*9IX5MWrnaCBzzeS3h5N2oA.png',
  author: '@ndaidong',
  source: 'Medium',
  publishedTime: '2013-11-12T19:57:40+00:00'
}

getArticle(String html)

Get main article content from webpage's html:

var ArticleParser = require('article-parser');
var fetch = require('node-fetch');

var url = 'https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6';

fetch(url).then((res) => {
  return res.text();
}).then((html) => {
  let content = ArticleParser.getArticle(html);
  return content;
})
.then((article) => {
  console.log(article);
})
.catch((err) => {
  console.log(err);
});

Now content would be clean text of main article extracted from url.

absolutify(String baseURL, String url)

Return an absolute url.

var imgSrc = absolutify('https://www.awesome.com/articles/hello-world.html', '../images/avatar.png');
console.log(imgSrc); // https://www.awesome.com/images/avatar.png

purify(String url)

Return a purified url.

var fullUrl = 'https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6#.98xbvjtjw?utm_medium=email&utm_source=Newsletter&utm_campaign=Autumn+Newsletter&utm_content=logo+link'
var goodURL = purify(fullUrl);
console.log(goodURL); // https://medium.com/@ndaidong/setup-rocket-chat-within-10-minutes-2b00f3366c6

Test

git clone https://github.com/ndaidong/article-parser.git
cd article-parser
npm install
npm test

License

The MIT License (MIT)