defuddle-js
v0.1.0
Extract main content and metadata from HTML pages. Works in browser and Node.js.
defuddle-js
Extract main content and structured metadata from any HTML page. Works in the browser (via a <script> tag) and in Node.js.
Based on defuddle by kepano.
Installation
npm install defuddle-js
Usage
Browser (<script> tag)
<script src="dist/defuddle.umd.js"></script>
<script>
const { Defuddle } = DefuddleLib;
const result = new Defuddle(document).parse();
console.log(result.title, result.content);
// Or parse an HTML string:
const doc = new DOMParser().parseFromString(html, 'text/html');
const result2 = new Defuddle(doc, { url: 'https://example.com/article' }).parse();
</script>
Browser (ESM import)
import { Defuddle } from 'defuddle-js';
const result = Defuddle.parse(html, { url: 'https://example.com/article' });
console.log(result.title, result.content);
Node.js
const { Defuddle } = require('defuddle-js');
const { parseHTML } = require('linkedom');
const html = '<html>...'; // your HTML string
const { document } = parseHTML(html);
const result = new Defuddle(document, { url: 'https://example.com/article' }).parse();
console.log(result.title);
console.log(result.author);
console.log(result.content); // cleaned HTML
Or using the static convenience method:
const result = Defuddle.parse(html, {
url: 'https://example.com/article',
parseHtml: html => require('linkedom').parseHTML(html).document,
});
Output fields
| Field | Type | Description |
|-------|------|-------------|
| content | string | Cleaned HTML of the main content |
| title | string | Article title |
| description | string | Article description or excerpt |
| author | string | Author name |
| published | string | Publication date (ISO 8601) |
| site | string | Site name |
| domain | string | Domain (without www.) |
| favicon | string | Favicon URL |
| image | string | Featured image URL |
| language | string | Language code (e.g. en, fr-FR) |
| wordCount | number | Word count of the content |
| parseTime | number | Parse time in milliseconds |
| schemaOrgData | array\|null | Parsed JSON-LD schema.org data |
| metaTags | array | All collected meta tag objects |
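The fields above can be consumed like any plain object. The following is a minimal sketch of a helper that turns a result into a short text summary. `summarizeResult` is a hypothetical name, not part of defuddle-js, and the sketch assumes that missing string fields come back empty or undefined, which this README does not confirm. It guards against `schemaOrgData` being `null`, as the table allows.

```javascript
// Hypothetical helper (not part of defuddle-js): build a short text summary
// from an object shaped like the output fields table.
function summarizeResult(result) {
  // schemaOrgData may be null, so fall back to an empty list before counting.
  const schemaItems = Array.isArray(result.schemaOrgData)
    ? result.schemaOrgData
    : [];

  const lines = [
    // Assumption: missing string fields are empty or undefined.
    `title: ${result.title || '(untitled)'}`,
    `author: ${result.author || '(unknown)'}`,
    `published: ${result.published || '(unknown)'}`,
    // Prefer the site name; fall back to the bare domain.
    `source: ${result.site || result.domain || '(unknown)'}`,
    `words: ${result.wordCount}`,
    `schema.org items: ${schemaItems.length}`,
  ];
  return lines.join('\n');
}

// A hand-written object standing in for a real parse result.
const exampleResult = {
  title: 'Hello',
  author: '',
  published: '2024-01-02',
  site: '',
  domain: 'example.com',
  wordCount: 120,
  schemaOrgData: null,
};

console.log(summarizeResult(exampleResult));
```

In real use, the object passed in would be the return value of `new Defuddle(document).parse()` rather than a hand-written stand-in.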
Options
new Defuddle(doc, {
url: 'https://example.com', // used for URL resolution and domain extraction
contentSelector: 'article.main', // override content selection
debug: false, // log removal steps
}).parse();
Development
npm test # run tests
npm run build # build dist/
License
MIT
