site-metadata-extractor
v1.1.0
web(site) resource metadata extractor
Site Metadata Extractor
Cleans and extracts a web(site) resource's metadata.
Metadata extraction fields currently supported:
| Name                     | Data Type      |
| ------------------------ | -------------- |
| author                   | array (jsonb)  |
| canonical_url            | string         |
| copyright                | string         |
| date (publish date)      | date           |
| description              | text           |
| favicon                  | text           |
| image (primary/og image) | text           |
| jsonld (structured data) | object (jsonb) |
| keywords                 | array (jsonb)  |
| lang                     | string         |
| locale                   | string         |
| origin                   | string         |
| publisher                | string         |
| site_name                | string         |
| tags                     | array (jsonb)  |
| title                    | string         |
| type                     | string         |
| truncated_text           | text           |
| status                   | string         |
| videos                   | array (jsonb)  |
| links                    | array (jsonb)  |
Install
NPM:
$ npm install site-metadata-extractor --save
Yarn:
$ yarn add site-metadata-extractor
Usage
Feed in the raw markup of a webpage to get the extracted metadata fields.
Modern typed API
The typed API is additive. The default export remains available for existing callers, while new callers can pass already-fetched public HTML into:
import {
  extractFromHtml,
  extractMetadataOnly,
  extractLazy,
  type ExtractedResource,
} from "site-metadata-extractor";

const resource: ExtractedResource = extractFromHtml(html, {
  inputUrl: "https://example.com/requested-url",
  finalUrl: "https://example.com/final-url",
  lang: "en",
});

This package does not fetch network resources. Fetch HTML in the calling application, apply any SSRF/private-network protections there, then pass the HTML string into the extractor.
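Because fetching and SSRF protection are left to the caller, a pre-fetch check might look like the sketch below. `isPublicHttpUrl` is a hypothetical helper, not part of this package. It only inspects the URL string: it rejects non-HTTP schemes, localhost, IPv6 literals, and private or loopback IPv4 literals. It does not resolve DNS, so a production guard would also need to validate the resolved address and every redirect hop.

```typescript
// Hypothetical pre-fetch guard; not part of site-metadata-extractor.
function isPublicHttpUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not an absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;

  const host = url.hostname.toLowerCase();
  if (/(^|\.)localhost$/.test(host)) return false;
  if (host.startsWith("[")) return false; // IPv6 literal: rejected outright in this sketch

  const ipv4 = /^(\d{1,3})\.(\d{1,3})\.\d{1,3}\.\d{1,3}$/.exec(host);
  if (!ipv4) return true; // a hostname; its DNS answer still needs checking at connect time

  const a = Number(ipv4[1]);
  const b = Number(ipv4[2]);
  if (a === 0 || a === 10 || a === 127) return false; // "this network", private, loopback
  if (a === 169 && b === 254) return false; // link-local
  if (a === 172 && b >= 16 && b <= 31) return false; // private
  if (a === 192 && b === 168) return false; // private
  return true;
}
```

The calling application would run this check before its HTTP request and only pass the response body to the extractor when the check succeeds.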
extractFromHtml(html, options) returns stable typed output for ingestion:
- URL fields: inputUrl, finalUrl, canonicalUrl, normalizedUrl, domain
- metadata: title, softTitle, description, author, publisher, siteName, lang, locale, publishedAt, modifiedAt
- assets: faviconCandidates, imageCandidates, primaryImage
- structured/raw data: jsonld, rawMeta
- content: links, videos, readableText, textStats
- extraction metadata: extraction.packageVersion, extraction.strategyVersion, extraction.warnings, extraction.confidence
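An ingestion pipeline usually needs one URL to key stored records on. The sketch below shows one way to choose among the URL fields listed above. `UrlFields` and `pickStorageKey` are hypothetical names, and the optional/nullable typing is an assumption for illustration; check the package's exported `ExtractedResource` type for the real declarations.

```typescript
// Structural subset of the URL fields listed above. The optional/nullable
// typing is an assumption made for this sketch, not the package's declared type.
interface UrlFields {
  inputUrl?: string | null;
  finalUrl?: string | null;
  canonicalUrl?: string | null;
  normalizedUrl?: string | null;
}

// Hypothetical helper: choose one URL to key stored records on,
// preferring the most processed form that is present.
function pickStorageKey(fields: UrlFields): string | undefined {
  return (
    fields.normalizedUrl ??
    fields.canonicalUrl ??
    fields.finalUrl ??
    fields.inputUrl ??
    undefined
  );
}
```

The preference order here is a design choice for the sketch: a normalized or canonical URL tends to collapse tracking-parameter variants of the same page, while `inputUrl` is only a last resort.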
extractMetadataOnly(html, options) returns the same shape but skips readable
text, link, and video extraction.
extractLazy(html, options) uses instance-local caches and exposes:
metadata(), readableText(), links(), videos(), and extract().
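To illustrate what an instance-local cache means in practice, the sketch below memoizes an expensive step per instance, so repeated calls reuse the first result. This is a generic pattern, not the package's actual implementation; `memoize`, `createLazyExtractor`, and the regex-based link scan are hypothetical stand-ins.

```typescript
// Generic per-instance memoization; a stand-in, not the package's code.
function memoize<T>(compute: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (!cached) cached = { value: compute() };
    return cached.value;
  };
}

// Hypothetical lazy wrapper: each call creates its own cache.
function createLazyExtractor(html: string) {
  let linkScans = 0; // counts how often the expensive scan actually runs

  const links = memoize(() => {
    linkScans += 1;
    const found: string[] = [];
    const re = /href="([^"]+)"/g; // naive scan, for illustration only
    let m: RegExpExecArray | null;
    while ((m = re.exec(html)) !== null) found.push(m[1]);
    return found;
  });

  return { links, stats: () => ({ linkScans }) };
}
```

Calling `links()` twice on the same instance runs the scan once; a second instance built from the same HTML starts with an empty cache.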
Exported output types include ExtractedResource, AssetCandidate,
ExtractedLink, ExtractedVideo, TextStats, and ExtractionMetadata.
Migration note: the legacy default export still returns the historical field
names such as canonicalLink, favicon, image, and text. New integrations
should prefer extractFromHtml so candidate URLs are resolved against
finalUrl/inputUrl, JSON-LD is always an array, oversized metadata is bounded,
and malformed JSON-LD is reported in extraction.warnings instead of being
logged.
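Two of the behaviors named in the migration note can be sketched in isolation: resolving candidate URLs against finalUrl/inputUrl, and collecting JSON-LD into a single array while recording malformed blocks as warnings. `resolveCandidate` and `parseJsonLdBlocks` are hypothetical helpers written for illustration; the package's internals and warning messages may differ.

```typescript
// Hypothetical helpers mirroring two behaviors from the migration note.

// Resolve a possibly relative candidate URL against finalUrl, falling back to inputUrl.
function resolveCandidate(
  candidate: string,
  base: { finalUrl?: string; inputUrl?: string },
): string | undefined {
  const baseUrl = base.finalUrl ?? base.inputUrl;
  try {
    return new URL(candidate, baseUrl).href;
  } catch {
    return undefined; // relative candidate with no usable base, or an invalid URL
  }
}

interface JsonLdResult {
  jsonld: unknown[];
  warnings: string[];
}

// Parse raw JSON-LD script contents into one flat array; report failures as warnings.
function parseJsonLdBlocks(blocks: string[]): JsonLdResult {
  const jsonld: unknown[] = [];
  const warnings: string[] = [];
  blocks.forEach((raw, index) => {
    try {
      const parsed: unknown = JSON.parse(raw);
      // A block may hold one object or an array of objects.
      if (Array.isArray(parsed)) jsonld.push(...parsed);
      else jsonld.push(parsed);
    } catch {
      warnings.push(`jsonld block ${index} could not be parsed`);
    }
  });
  return { jsonld, warnings };
}
```

Returning warnings as data instead of logging them lets the caller decide whether a malformed block should fail ingestion or only lower its trust in the result.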
From .html file:
import fs from "fs";
import path from "path";
import siteMetadataExtractor from "site-metadata-extractor";

const getMetadataFromFile = (filename) => {
  const filepath = path.resolve(__dirname, `../data/${filename}.html`);
  // read the saved page as a string of raw markup
  const markup = fs.readFileSync(filepath).toString();
  // feel free to use localhost as the second parameter for testing
  const metadata = siteMetadataExtractor(markup, "YOUR_SITE_ORIGIN_HERE");
  return metadata;
};

getMetadataFromFile("example");

From a server request:
import axios from 'axios';
import siteMetadataExtractor from 'site-metadata-extractor';

const processSite = async (url) => {
  return axios.get(url)
    .then(res => {
      const { headers } = res;
      // the header may be missing, so default to an empty string
      const contentType = headers['content-type'] || '';
      if (contentType.includes('text/html')) {
        return {
          body: res.data,
          url
        };
      } else {
        return {};
      }
    })
    .catch(err => {
      console.log(err);
    });
};

processSite('https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/')
  .then((data) => {
    // ...
    // data is undefined if the request failed and has no body for non-HTML responses
    if (data && data.body) {
      siteMetadataExtractor(data.body, "https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/", "en");
    }
    // ...
  });

Development
- Run: git clone https://github.com/sc10ntech/site-metadata-extractor.git
- Change into project directory and install deps: cd site-metadata-extractor && npm i
Credits & Disclaimer
site-metadata-extractor was inspired by, and tries to be the spiritual successor to, node-unfluff.
