@sarfarajey/html-extract
v1.0.0
HTML and feed extraction utilities — htmlToText, og:image, title, RSS/HTML fetch with timeout. Zero dependencies, edge-compatible.
Lightweight HTML and feed extraction utilities. Zero dependencies, no native modules — works in Node 18+, browsers, Cloudflare Workers, Deno, and Bun.
Useful for:
- RSS / Atom aggregators
- Link-preview cards (og:image, title)
- Article ingestion pipelines
- Any place you need plain text from arbitrary HTML
Install
npm install @sarfarajey/html-extract
HTML → text
import { htmlToText } from '@sarfarajey/html-extract';
htmlToText('<p>Hello <b>world</b>.</p><script>evil()</script>');
// → 'Hello world.'
Uses DOMParser when available; otherwise falls back to a regex stripper that removes <script> and <style> content and decodes common entities. The fallback path is what Cloudflare Workers and Node use.
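A regex fallback of the kind described above could look roughly like the sketch below. It illustrates the general technique and is not the package's source: the names `stripHtmlFallback` and `decodeEntities`, the entity table, and the list of tags treated as word breaks are all assumptions made for the example.

```javascript
// Hypothetical sketch of a regex-based HTML-to-text fallback.
// Not taken from @sarfarajey/html-extract; names and tag lists are assumptions.
const NAMED_ENTITIES = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'],
  ['quot', '"'], ['apos', "'"], ['nbsp', '\u00a0'],
]);

function decodeEntities(text) {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are left as written.
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });
}

function stripHtmlFallback(html) {
  // 1. Drop <script> and <style> elements together with their content.
  const withoutBlocks = html.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ');
  // 2. Turn block-level tags into spaces so adjacent paragraphs do not run together.
  const withBreaks = withoutBlocks.replace(/<\/?(p|div|br|li|tr|h[1-6])\b[^>]*>/gi, ' ');
  // 3. Remove every remaining tag without adding whitespace.
  const withoutTags = withBreaks.replace(/<[^>]+>/g, '');
  // 4. Decode entities last, so an escaped "&lt;b&gt;" survives as literal text.
  return decodeEntities(withoutTags).replace(/\s+/g, ' ').trim();
}

console.log(stripHtmlFallback('<p>Hello <b>world</b>.</p><script>evil()</script>'));
// → 'Hello world.'
```

Decoding entities after the tags are removed is the important ordering choice here: decoding first would turn escaped markup into real tags that the stripper then deletes.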
Open Graph image
import { extractOgImage } from '@sarfarajey/html-extract';
extractOgImage(html);
// → 'https://example.com/cover.jpg' | null
Handles both attribute orderings (property first vs content first), a subtle bug source if you write the regex yourself. Normalises protocol-relative URLs.
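To show why attribute order matters, here is a minimal sketch that avoids the problem by reading each attribute independently rather than matching `property` and `content` in one fixed-order regex. The function names `findOgImage` and `readAttribute` are hypothetical, and the choice of `https:` as the scheme for protocol-relative URLs is an assumption; the package may resolve them differently.

```javascript
// Hypothetical sketch; not the package's implementation.
// Reads one quoted attribute value out of a single tag string.
function readAttribute(tag, name) {
  const pattern = new RegExp('\\s' + name + '\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\')', 'i');
  const match = pattern.exec(tag);
  if (!match) return null;
  return match[1] !== undefined ? match[1] : match[2];
}

function findOgImage(html, defaultProtocol = 'https:') {
  // Collect every <meta ...> tag, then inspect attributes in any order.
  // Limitation: a '>' inside an attribute value would cut the tag short.
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    const property = readAttribute(tag, 'property');
    if (property === null || property.toLowerCase() !== 'og:image') continue;
    const content = readAttribute(tag, 'content');
    if (!content) continue;
    // Protocol-relative URLs ("//cdn...") get an explicit scheme (assumed https:).
    return content.startsWith('//') ? defaultProtocol + content : content;
  }
  return null;
}

console.log(findOgImage('<meta content="//cdn.example.com/a.png" property="og:image" />'));
// → 'https://cdn.example.com/a.png'
```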
Title
import { extractTitleFromHtml } from '@sarfarajey/html-extract';
extractTitleFromHtml('<title>Example &amp; Co</title>');
// → 'Example & Co'
Truncates to 200 characters and decodes common entities.
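The behaviour described above can be approximated in a few lines. This is only a sketch under stated assumptions: the name `findTitle`, the `null` return for a missing title, the short entity list, and the choice to decode before truncating are not confirmed by the package documentation.

```javascript
// Hypothetical sketch; not the package's implementation.
function findTitle(html, maxLength = 200) {
  const match = /<title\b[^>]*>([\s\S]*?)<\/title\s*>/i.exec(html);
  if (!match) return null; // assumed behaviour when no <title> is present
  const decoded = match[1]
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    // Decode &amp; last so "&amp;lt;" becomes "&lt;" rather than "<".
    .replace(/&amp;/g, '&');
  // Collapse whitespace, then cut to the length limit.
  return decoded.replace(/\s+/g, ' ').trim().slice(0, maxLength);
}

console.log(findTitle('<title>Example &amp; Co</title>'));
// → 'Example & Co'
```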
Fetch helpers
Both helpers add a 10-second abort timeout (configurable) and a desktop User-Agent so UA-gating sites do not 403.
import { fetchHtml, fetchFeedXml } from '@sarfarajey/html-extract';
const html = await fetchHtml('https://example.com/article');
const xml = await fetchFeedXml('https://example.com/feed.xml');
Options:
await fetchHtml(url, {
timeoutMs: 5000,
userAgent: 'my-bot/1.0',
});
Both throw on non-2xx responses, so wrap each URL in its own try/catch when iterating over a list.
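The timeout and the per-URL error handling can be sketched together. The code below is a self-contained illustration using the standard `fetch` and `AbortController` APIs, not the package's source: `fetchTextWithTimeout`, `fetchAll`, the `fetchImpl` option, and the placeholder User-Agent string are all hypothetical names and values chosen for the example.

```javascript
// Hypothetical sketch of an abort-on-timeout fetch plus a loop that
// isolates failures per URL. Not the package's implementation.
async function fetchTextWithTimeout(url, options = {}) {
  const {
    timeoutMs = 10000,
    userAgent = 'Mozilla/5.0 (X11; Linux x86_64)', // placeholder desktop UA
    fetchImpl = fetch, // injectable so the sketch can be exercised offline
  } = options;

  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, {
      signal: controller.signal,
      headers: { 'User-Agent': userAgent },
    });
    if (!response.ok) {
      throw new Error(`Request to ${url} failed with HTTP ${response.status}`);
    }
    return await response.text();
  } finally {
    // Always clear the timer so it cannot fire after the request settles.
    clearTimeout(timer);
  }
}

async function fetchAll(urls, options) {
  const pages = [];
  const failures = [];
  for (const url of urls) {
    try {
      pages.push({ url, body: await fetchTextWithTimeout(url, options) });
    } catch (error) {
      // One bad URL (timeout, 404, DNS failure) must not stop the rest.
      failures.push({ url, error });
    }
  }
  return { pages, failures };
}
```

With the package itself, the same loop shape applies: call `fetchHtml` or `fetchFeedXml` inside the `try` block and record the failure in the `catch`.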
License
MIT
