defuddle-js

v0.1.0

Published

3 months ago

Extract main content and metadata from HTML pages. Works in browser and Node.js.

Downloads

111

sonic0002

html content-extraction readability article metadata parser browser nodejs

defuddle-js

Extract main content and structured metadata from any HTML page. Works in the browser (via a `<script>` tag or an ESM import) and in Node.js.

Based on defuddle by kepano.

Installation

npm install defuddle-js

Usage

Browser (`<script>` tag)

<script src="dist/defuddle.umd.js"></script>
<script>
  const { Defuddle } = DefuddleLib;

  const result = new Defuddle(document).parse();
  console.log(result.title, result.content);

  // Or parse an HTML string (`html` is a string you already have):
  const doc = new DOMParser().parseFromString(html, 'text/html');
  const result2 = new Defuddle(doc, { url: 'https://example.com/article' }).parse();
</script>

Browser (ESM `import`)

import { Defuddle } from 'defuddle-js';

const html = '<html>...'; // your HTML string
const result = Defuddle.parse(html, { url: 'https://example.com/article' });
console.log(result.title, result.content);

Node.js

const { Defuddle } = require('defuddle-js');
const { parseHTML } = require('linkedom');

const html = '<html>...'; // your HTML string
const { document } = parseHTML(html);
const result = new Defuddle(document, { url: 'https://example.com/article' }).parse();

console.log(result.title);
console.log(result.author);
console.log(result.content); // cleaned HTML

Or using the static convenience method:

const result = Defuddle.parse(html, {
  url: 'https://example.com/article',
  parseHtml: html => require('linkedom').parseHTML(html).document,
});
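
To go from a URL to a parsed result, the static method can be wrapped in a small helper. The sketch below is hypothetical: `extractArticle`, `fetchText`, and `parse` are illustrative names, not part of defuddle-js. The fetcher and parser are passed in so the helper does not depend on a particular HTTP client or DOM implementation.

```javascript
// Hypothetical helper; not part of defuddle-js.
// fetchText: async (url) => HTML string
// parse:     (html, options) => result object, for example
//            (html, opts) => Defuddle.parse(html, { ...opts, parseHtml })
async function extractArticle(url, { fetchText, parse }) {
  const html = await fetchText(url);
  // Pass the page URL through so the parser can use it as the base URL.
  const result = parse(html, { url });
  // Keep only the fields this sketch cares about.
  return {
    url,
    title: result.title,
    author: result.author,
    content: result.content,
  };
}
```

In real use, `fetchText` could be something like `url => fetch(url).then(res => res.text())` on a runtime with a global `fetch`, and `parse` could wrap `Defuddle.parse` with the `parseHtml` option shown above.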

Output fields

| Field | Type | Description |
|-------|------|-------------|
| content | string | Cleaned HTML of the main content |
| title | string | Article title |
| description | string | Article description or excerpt |
| author | string | Author name |
| published | string | Publication date (ISO 8601) |
| site | string | Site name |
| domain | string | Domain (without www.) |
| favicon | string | Favicon URL |
| image | string | Featured image URL |
| language | string | Language code (e.g. en, fr-FR) |
| wordCount | number | Word count of the content |
| parseTime | number | Parse time in milliseconds |
| schemaOrgData | array\|null | Parsed JSON-LD schema.org data |
| metaTags | array | All collected meta tag objects |
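
As one way to consume these fields, the sketch below turns a result object into a YAML-style front matter block, for example to prepend to a saved note. `toFrontMatter` is a hypothetical helper, not part of defuddle-js, and the choice of fields is only an assumption about what is useful.

```javascript
// Hypothetical helper; not part of defuddle-js.
// Builds a YAML-style front matter block from a parse result,
// skipping fields that are missing or empty.
function toFrontMatter(result) {
  const fields = ['title', 'author', 'published', 'site', 'domain', 'language', 'wordCount'];
  const lines = fields
    .filter((key) => result[key] !== undefined && result[key] !== null && result[key] !== '')
    // JSON.stringify quotes strings and escapes special characters.
    .map((key) => `${key}: ${JSON.stringify(result[key])}`);
  return ['---', ...lines, '---'].join('\n');
}
```

Calling `toFrontMatter(result)` on the object returned by `parse()` gives a string that can be concatenated with `result.content` or with a Markdown conversion of it.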

Options

new Defuddle(doc, {
  url: 'https://example.com',      // used for URL resolution and domain extraction
  contentSelector: 'article.main', // override content selection
  debug: false,                    // log removal steps
}).parse();
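
The `url` option matters because relative links in a page can only be resolved against a base URL, and the `domain` field is derived from it. The sketch below illustrates both ideas with the standard `URL` class. It is not the library's implementation, and `resolveLink` and `domainOf` are hypothetical names.

```javascript
// Illustration only; defuddle-js may handle this differently internally.

// Resolve a possibly relative href against the page URL.
function resolveLink(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

// Hostname with a leading "www." removed, matching the description
// of the `domain` output field.
function domainOf(pageUrl) {
  return new URL(pageUrl).hostname.replace(/^www\./, '');
}
```

Without a `url`, a link such as `/img/cover.png` has no base to resolve against, which is why passing the option is advisable when parsing an HTML string rather than a live `document`.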

Development

npm test        # run tests
npm run build   # build dist/

License

MIT
