@sarfarajey/html-extract

v1.0.0

Published 3 months ago

HTML and feed extraction utilities — htmlToText, og:image, title, RSS/HTML fetch with timeout. Zero dependencies, edge-compatible.

Vulnerabilities: 0 high, 0 medium, 0 low

sarfarajey

html rss atom feed og-image open-graph metadata scraping link-preview extractor

@sarfarajey/html-extract

Lightweight HTML and feed extraction utilities. Zero dependencies, no native modules — works in Node 18+, browsers, Cloudflare Workers, Deno, and Bun.

Useful for:

RSS / Atom aggregators
Link-preview cards (og:image, title)
Article ingestion pipelines
Any place you need plain text from arbitrary HTML

Install

npm install @sarfarajey/html-extract

HTML → text

import { htmlToText } from '@sarfarajey/html-extract';

htmlToText('<p>Hello <b>world</b>.</p><script>evil()</script>');
// → 'Hello world.'

Uses DOMParser when available; otherwise falls back to a regex stripper that removes <script> and <style> content and decodes common entities. Cloudflare Workers and Node do not provide a global DOMParser, so they always take the fallback path.
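To make the fallback path concrete, here is a minimal sketch of a regex-based stripper. It is not the package's actual source: the names stripHtmlFallback and decodeEntities, the list of block-level tags, and the entity table are all assumptions made for illustration.

```typescript
// Hypothetical sketch of a regex fallback; not the package's real implementation.
const NAMED_ENTITIES: Record<string, string> = {
    amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0',
};

// Decode a small set of named entities plus decimal and hex numeric references.
function decodeEntities(text: string): string {
    return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body: string) => {
        if (body[0] === '#') {
            const isHex = body[1] === 'x' || body[1] === 'X';
            const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
            // Leave out-of-range references untouched instead of throwing.
            return code <= 0x10ffff ? String.fromCodePoint(code) : match;
        }
        return NAMED_ENTITIES[body.toLowerCase()] ?? match;
    });
}

function stripHtmlFallback(html: string): string {
    const withoutBlocks = html
        // Drop <script> and <style> elements together with their content.
        .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
        // Drop HTML comments.
        .replace(/<!--[\s\S]*?-->/g, ' ');
    const withoutTags = withoutBlocks
        // Block-level tags become a space so adjacent paragraphs do not fuse.
        .replace(/<\/?(?:p|div|br|li|tr|h[1-6])\b[^>]*>/gi, ' ')
        // Remaining (inline) tags are removed without adding whitespace.
        .replace(/<[^>]+>/g, '');
    return decodeEntities(withoutTags).replace(/\s+/g, ' ').trim();
}

console.log(stripHtmlFallback('<p>Hello <b>world</b>.</p><script>evil()</script>'));
// prints "Hello world."
```

A regex stripper like this is a pragmatic choice for runtimes without a DOM, but it is not a full HTML parser; malformed markup or a ">" inside an attribute value can still confuse it.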

Open Graph image

import { extractOgImage } from '@sarfarajey/html-extract';

extractOgImage(html);
// → 'https://example.com/cover.jpg' | null

Handles both attribute orderings (property first vs content first), a subtle source of bugs if you write the regex yourself. Normalises protocol-relative URLs.
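The attribute-ordering problem is easier to see in code. The sketch below is a hypothetical re-implementation, not the package's source: the function name extractOgImageSketch is invented, and prefixing protocol-relative URLs with "https:" is an assumption about what normalisation means here.

```typescript
// Hypothetical sketch; the real extractOgImage may differ in detail.
function extractOgImageSketch(html: string): string | null {
    // Ordering 1: <meta property="og:image" content="...">
    const propertyFirst =
        /<meta\b[^>]*\bproperty\s*=\s*["']og:image["'][^>]*\bcontent\s*=\s*["']([^"']*)["']/i;
    // Ordering 2: <meta content="..." property="og:image">
    const contentFirst =
        /<meta\b[^>]*\bcontent\s*=\s*["']([^"']*)["'][^>]*\bproperty\s*=\s*["']og:image["']/i;

    const match = propertyFirst.exec(html) ?? contentFirst.exec(html);
    const url = match?.[1]?.trim();
    if (!url) return null;

    // Assumption: protocol-relative URLs ("//cdn.example.com/a.png") are upgraded to https.
    return url.startsWith('//') ? `https:${url}` : url;
}

console.log(extractOgImageSketch('<meta content="//cdn.example.com/a.png" property="og:image" />'));
// prints "https://cdn.example.com/a.png"
```

A single regex that expects property before content silently returns nothing for the second ordering, which is why two patterns are tried in turn.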

Title

import { extractTitleFromHtml } from '@sarfarajey/html-extract';

extractTitleFromHtml('<title>Example &amp; Co</title>');
// → 'Example & Co'

Truncates to 200 characters and decodes common entities.
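As a rough illustration of the truncate-and-decode behaviour, here is a minimal sketch. The name extractTitleSketch, the null return for a missing title, the whitespace collapsing, and the choice to count code points rather than UTF-16 units are assumptions, not documented behaviour of the package.

```typescript
// Hypothetical sketch; not the package's actual source.
const TITLE_ENTITIES: Record<string, string> = {
    amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", '#39': "'",
};

function extractTitleSketch(html: string, maxLength = 200): string | null {
    const match = /<title\b[^>]*>([\s\S]*?)<\/title\s*>/i.exec(html);
    if (!match) return null;

    const decoded = (match[1] ?? '')
        // Decode a handful of common entities in a single pass.
        .replace(/&(amp|lt|gt|quot|apos|#39);/g, (whole, name: string) => TITLE_ENTITIES[name] ?? whole)
        // Collapse newlines and runs of spaces that often appear inside <title>.
        .replace(/\s+/g, ' ')
        .trim();

    // Array.from splits by code point, so an emoji is never cut in half.
    return Array.from(decoded).slice(0, maxLength).join('');
}

console.log(extractTitleSketch('<title>Example &amp; Co</title>'));
// prints "Example & Co"
```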

Fetch helpers

Both helpers add a 10-second abort timeout (configurable) and a desktop User-Agent, so that sites which gate on User-Agent do not respond with 403.

import { fetchHtml, fetchFeedXml } from '@sarfarajey/html-extract';

const html = await fetchHtml('https://example.com/article');
const xml  = await fetchFeedXml('https://example.com/feed.xml');

Options:

await fetchHtml(url, {
    timeoutMs: 5000,
    userAgent: 'my-bot/1.0',
});
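For readers curious how a timeout and User-Agent can be layered on top of fetch, the sketch below shows one common pattern using AbortController. It is an assumption about the general approach, not the package's code: fetchTextSketch, the fetchImpl option (added only so the function can be exercised without a network), and the User-Agent string are all hypothetical.

```typescript
// Hypothetical sketch of a fetch wrapper with an abort timeout; not the package's source.
// Illustrative desktop User-Agent; the value the package actually sends is not documented here.
const DESKTOP_UA =
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

interface FetchTextOptions {
    timeoutMs?: number;
    userAgent?: string;
    fetchImpl?: typeof fetch; // hypothetical injection point for tests
}

async function fetchTextSketch(url: string, options: FetchTextOptions = {}): Promise<string> {
    const { timeoutMs = 10000, userAgent = DESKTOP_UA, fetchImpl = fetch } = options;

    // Abort the request if it has not completed within timeoutMs.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);

    try {
        const response = await fetchImpl(url, {
            headers: { 'User-Agent': userAgent },
            signal: controller.signal,
        });
        if (!response.ok) {
            throw new Error(`Request to ${url} failed with status ${response.status}`);
        }
        return await response.text();
    } finally {
        // Always clear the timer so it cannot keep the process alive.
        clearTimeout(timer);
    }
}
```

Clearing the timer in a finally block matters in long-running processes: without it, every completed request leaves a pending timer behind until the timeout elapses.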

Both throw on any non-2xx response, so wrap each call in try/catch if you are iterating over a list of URLs.
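One way to apply that advice is a small loop that records failures instead of letting the first bad URL abort the batch. The helper below is a sketch: fetchEach and the FetchResult type are hypothetical names, and the fetcher is passed in as a parameter so that fetchHtml or fetchFeedXml can be supplied by the caller.

```typescript
// Hypothetical helper; not part of the package.
type FetchResult =
    | { url: string; ok: true; body: string }
    | { url: string; ok: false; error: string };

async function fetchEach(
    urls: string[],
    fetchOne: (url: string) => Promise<string>,
): Promise<FetchResult[]> {
    const results: FetchResult[] = [];
    for (const url of urls) {
        try {
            // A success is recorded with its body.
            results.push({ url, ok: true, body: await fetchOne(url) });
        } catch (error) {
            // A failure (non-2xx, timeout, network error) is recorded and the loop continues.
            const message = error instanceof Error ? error.message : String(error);
            results.push({ url, ok: false, error: message });
        }
    }
    return results;
}

// Usage with the package (assumes fetchHtml is imported as shown earlier):
//   const results = await fetchEach(urls, (url) => fetchHtml(url));
```

The loop is deliberately sequential to stay gentle on the target hosts; fetching in parallel with Promise.allSettled is a reasonable alternative when the URLs point at different sites.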

License

MIT
