@jcottam/html-metadata

v3.1.2

Published

5 months ago

This JavaScript library simplifies the extraction of HTML Meta and OpenGraph tags from HTML content or URLs.

jcottam

metadata cheerio seo open-graph og-tags html-metadata html-meta-tags

HTML Metadata

@jcottam/html-metadata is a lightweight, TypeScript-first JavaScript library for extracting HTML meta tags, Open Graph tags, and other metadata from HTML content or URLs. Perfect for social media sharing, SEO analysis, and web scraping applications.

Compatibility: Works in Node.js (CommonJS and ES modules) and modern browsers (ES6+).

Features

🚀 Fast & Lightweight - Built on Cheerio for optimal performance
📱 Open Graph Support - Extract all Open Graph meta tags for social media
🎯 TypeScript Ready - Full type definitions and IntelliSense support
🌐 URL & HTML Support - Extract from URLs or HTML strings directly
🔧 Configurable - Customizable extraction with filtering and timeout options
🛡️ Error Resilient - Graceful handling of malformed HTML and network errors
📦 Minimal Dependencies - Cheerio (for HTML parsing) is the only runtime dependency

Installation

npm install @jcottam/html-metadata

Usage

ES6/ESM Import

import { extractFromUrl, extractFromHTML } from "@jcottam/html-metadata"

CommonJS Require

const { extractFromUrl, extractFromHTML } = require("@jcottam/html-metadata")

Examples

Extract metadata from a URL

import { extractFromUrl } from "@jcottam/html-metadata"

// Basic usage
const metadata = await extractFromUrl("https://www.retool.com")
console.log(metadata)
// Output: { lang: "en", title: "Retool", "og:title": "...", "og:description": "...", ... }

// With options
const options = {
  timeout: 5000, // 5 second timeout
  metaTags: ["og:title", "og:description", "og:image"], // Only extract specific tags
}
const filteredMetadata = await extractFromUrl("https://example.com", options)

Extract metadata from HTML string

import { extractFromHTML } from "@jcottam/html-metadata"

const html = `
<html lang="en">
  <head>
    <title>My Website</title>
    <meta property="og:title" content="My Amazing Website" />
    <meta property="og:description" content="This is a brief description" />
    <meta property="og:image" content="https://example.com/image.jpg" />
    <link rel="icon" href="/favicon.ico" />
  </head>
</html>
`

const metadata = extractFromHTML(html)
console.log(metadata)
// Output: {
//   lang: "en",
//   title: "My Website",
//   "og:title": "My Amazing Website",
//   "og:description": "This is a brief description",
//   "og:image": "https://example.com/image.jpg",
//   favicon: "/favicon.ico"
// }

Resolve relative URLs with baseUrl

import { extractFromHTML } from "@jcottam/html-metadata"

const html = '<html><head><link rel="icon" href="/favicon.ico" /></head></html>'
const options = { baseUrl: "https://example.com" }
const metadata = extractFromHTML(html, options)
console.log(metadata.favicon) // "https://example.com/favicon.ico"
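
The same kind of resolution can be reproduced with the standard `URL` constructor. The sketch below is only an illustration of the behavior shown above, not the library's internal code; `resolveLink` is a hypothetical helper name.

```typescript
// Hypothetical helper (not part of @jcottam/html-metadata):
// resolve an extracted href against an optional base URL.
function resolveLink(href: string, baseUrl?: string): string {
  if (!baseUrl) return href // nothing to resolve against
  try {
    // The URL constructor resolves relative references and leaves absolute URLs unchanged
    return new URL(href, baseUrl).href
  } catch {
    return href // invalid base or href: return the value untouched
  }
}

console.log(resolveLink("/favicon.ico", "https://example.com"))
// "https://example.com/favicon.ico"
console.log(resolveLink("https://cdn.example.com/icon.png", "https://example.com"))
// "https://cdn.example.com/icon.png"
console.log(resolveLink("/favicon.ico"))
// "/favicon.ico"
```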

API Reference

Methods

`extractFromHTML(html: string, options?: Options): ExtractedData`

Extracts metadata from an HTML string.

Parameters:

html (string): The HTML content to parse
options (Options, optional): Configuration options

Returns: ExtractedData - Object containing extracted metadata

`extractFromUrl(url: string, options?: Options): Promise<ExtractedData | null>`

Extracts metadata from a URL by fetching the HTML content.

Parameters:

url (string): The URL to fetch and extract metadata from
options (Options, optional): Configuration options

Returns: Promise<ExtractedData | null> - Promise that resolves to extracted metadata or null if extraction fails

Types

`Options`

type Options = {
  /** Base URL for resolving relative links (e.g., favicon, apple-touch-icon) */
  baseUrl?: string
  /** Fetch timeout in milliseconds for URL extraction */
  timeout?: number
  /** Specific meta tags to extract. If not provided, all meta tags will be extracted */
  metaTags?: string[]
}
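
To make the `metaTags` allow-list concrete, here is a minimal sketch that applies the same idea to data that has already been extracted (for example, a cached result). `pickTags` and the `Extracted` alias are hypothetical names, and this sketch filters every key; the library may treat `lang`, `title`, and icon fields differently.

```typescript
type Extracted = { [key: string]: string | undefined }

// Hypothetical helper illustrating allow-list filtering of extracted metadata.
function pickTags(data: Extracted, metaTags?: string[]): Extracted {
  if (!metaTags) return { ...data } // no filter given: keep everything
  const picked: Extracted = {}
  for (const tag of metaTags) {
    const value = data[tag]
    if (value !== undefined) picked[tag] = value // skip tags the page did not define
  }
  return picked
}

const all: Extracted = {
  title: "My Website",
  "og:title": "My Amazing Website",
  "og:image": "https://example.com/image.jpg",
}

console.log(JSON.stringify(pickTags(all, ["og:title", "og:description"])))
// {"og:title":"My Amazing Website"}
```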

`ExtractedData`

type ExtractedData = {
  /** Language attribute from the HTML tag */
  lang?: string
  /** Page title from the title tag */
  title?: string
  /** Favicon URL */
  favicon?: string
  /** Apple touch icon URL */
  "apple-touch-icon"?: string
  /** Open Graph and other meta tag properties */
  [key: string]: string | undefined
}
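
Because every field is optional, calling code usually needs fallbacks. The sketch below builds a link preview from an `ExtractedData` value, preferring Open Graph tags and falling back to plain HTML values. `LinkPreview` and `toPreview` are hypothetical names, and the `description` and `twitter:image` keys are assumptions about which tags a page might define.

```typescript
// Type copied from the definition above so the example is self-contained
type ExtractedData = {
  lang?: string
  title?: string
  favicon?: string
  "apple-touch-icon"?: string
  [key: string]: string | undefined
}

type LinkPreview = { title: string; description: string; image?: string }

// Hypothetical helper: prefer Open Graph values, fall back to plain HTML ones.
function toPreview(data: ExtractedData): LinkPreview {
  return {
    title: data["og:title"] ?? data.title ?? "Untitled",
    description: data["og:description"] ?? data["description"] ?? "",
    image: data["og:image"] ?? data["twitter:image"],
  }
}

const preview = toPreview({
  title: "My Website",
  "og:image": "https://example.com/image.jpg",
})
console.log(preview.title) // "My Website"
console.log(preview.image) // "https://example.com/image.jpg"
```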

Example Response

{
  "lang": "en",
  "title": "Retool | The fastest way to build internal software.",
  "og:type": "website",
  "og:url": "https://retool.com/",
  "og:title": "Retool | The fastest way to build internal software.",
  "og:description": "Retool is the fastest way to build internal software. Use Retool's building blocks to build apps and workflow automations that connect to your databases and APIs, instantly.",
  "og:image": "https://d3399nw8s4ngfo.cloudfront.net/og-image-default.webp",
  "favicon": "/favicon.png",
  "apple-touch-icon": "/apple-touch-icon.png"
}

Browser Usage & CORS

When using extractFromUrl in browsers, you may encounter CORS restrictions, because the browser blocks cross-origin reads unless the target site allows them. Ways to work around this:

Server-side usage: Run extractFromUrl on a server
Proxy services: Use a CORS proxy like AllOrigins
Browser extensions: Use CORS-disabling browser extensions for development
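
As a rough sketch of the proxy option, the helper below builds a proxied URL. It assumes AllOrigins exposes the raw response body at `https://api.allorigins.win/raw?url=<encoded target>`; check the service's current documentation before relying on it. `viaCorsProxy` is a hypothetical name.

```typescript
// Assumption: AllOrigins returns the raw page body at /raw?url=<encoded target>.
const PROXY_ENDPOINT = "https://api.allorigins.win/raw"

// Hypothetical helper: build the proxied URL for a target page.
function viaCorsProxy(targetUrl: string): string {
  // encodeURIComponent keeps the target's own query string from being
  // mistaken for parameters of the proxy request
  return `${PROXY_ENDPOINT}?url=${encodeURIComponent(targetUrl)}`
}

console.log(viaCorsProxy("https://example.com/page?id=1"))
// "https://api.allorigins.win/raw?url=https%3A%2F%2Fexample.com%2Fpage%3Fid%3D1"
```

One way to use this is to fetch the proxied URL yourself and pass the returned HTML to extractFromHTML with baseUrl set to the original site, so that relative icon links resolve against the real origin rather than the proxy.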

Error Handling

The library handles errors gracefully:

import { extractFromUrl, extractFromHTML } from "@jcottam/html-metadata"

// Network errors return null
const result = await extractFromUrl("https://invalid-url.com")
if (result === null) {
  console.log("Failed to fetch or parse the URL")
}

// Malformed HTML is handled gracefully
const metadata = extractFromHTML(
  "<html><head><meta property='og:title' content='Test'"
)
console.log(metadata["og:title"]) // "Test"

Supported Meta Tags

The library extracts the following types of metadata:

HTML attributes: lang from <html> tag
Title: Content from <title> tag
Favicon: href from <link rel="icon"> tags
Apple Touch Icon: href from <link rel="apple-touch-icon"> tags
Meta tags: All <meta> tags with name or property attributes
Open Graph: All og:* properties
Twitter Cards: All twitter:* properties
Custom meta tags: Any custom meta tags you define
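
Since all of these land in one flat object, it can be handy to bucket the keys by namespace after extraction. The following is a minimal sketch with sample values; `groupByPrefix` and the `html` bucket name for unprefixed keys are hypothetical choices, not something the library provides.

```typescript
type Extracted = { [key: string]: string | undefined }

// Hypothetical helper: group keys by the prefix before the first colon.
// Keys without a colon (lang, title, favicon, ...) go into an "html" bucket.
function groupByPrefix(data: Extracted): Record<string, Record<string, string>> {
  const groups: Record<string, Record<string, string>> = {}
  for (const [key, value] of Object.entries(data)) {
    if (value === undefined) continue
    const colon = key.indexOf(":")
    const prefix = colon === -1 ? "html" : key.slice(0, colon)
    const name = colon === -1 ? key : key.slice(colon + 1)
    const bucket = groups[prefix] ?? (groups[prefix] = {})
    bucket[name] = value
  }
  return groups
}

const grouped = groupByPrefix({
  title: "Retool",
  "og:type": "website",
  "twitter:card": "summary",
})
console.log(JSON.stringify(grouped))
// {"html":{"title":"Retool"},"og":{"type":"website"},"twitter":{"card":"summary"}}
```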

Development

Prerequisites

Node.js 18+
npm

Setup

git clone https://github.com/jcottam/html-metadata.git
cd html-metadata
npm install

Scripts

npm run build    # Build the library
npm test         # Run tests
npm run release  # Release new version (manual)

Automated Workflow

This project uses automated dependency management and releases:

Renovate Bot: Automatically updates dependencies and creates pull requests
GitHub Actions: Automatically releases new versions when changes are pushed to main
Manual Release: Use npm run release for immediate releases or specific version bumps

Testing

The project uses Vitest for testing. Run tests with:

npm test

Dependencies

Cheerio: Fast, flexible HTML parsing (the only runtime dependency)
Vitest: Next-generation testing framework (development only)
Rollup: Module bundler for multiple formats (development only)

Contributing

We welcome contributions! Please follow these guidelines:

Fork the repository and create a feature branch
Make changes and ensure tests pass (npm test)
Add tests for new functionality
Update documentation if needed
Submit a pull request with a clear description

Development Guidelines

Follow TypeScript best practices
Add JSDoc comments for new functions
Ensure all tests pass
Update README for new features
Use conventional commit messages

License

MIT License - see LICENSE.md for details.