ogpeek

v0.5.0

Published

13 days ago

Peek into any page's Open Graph tags — parser, fetcher, and validator.


minjun-kim

open-graph ogp og parser debug validator ogpeek

ogpeek

peek into any page's Open Graph tags — and the favicon / JSON-LD signals that travel with them

Korean: README.ko.md

A small engine that handles parsing, fetching, and validating Open Graph tags in a single package. Open Graph stays the primary signal; alongside it, the engine also surfaces the auxiliary head metadata most pages ship with: favicons, apple-touch-icons, mask-icons, msapplication tiles, application-name / theme-color, and JSON-LD blocks. Single external dependency: htmlparser2. Runs on Node 20+, Bun, Workers, and the browser.

Install

npm install ogpeek
# or
pnpm add ogpeek
# or
yarn add ogpeek

Two entry points

| entry | purpose | runtime | dependencies |
| --- | --- | --- | --- |
| ogpeek | parse, validate, types | Node · Bun · Workers · browser | htmlparser2 |
| ogpeek/fetch | fetch a remote URL (timeout / size cap / redirect tracing) | anywhere globalThis.fetch exists | none (no Node built-ins) |

The root entry is pure logic, so as long as you do not import ogpeek/fetch, no fetching code or runtime-specific dependency comes along for the ride (htmlparser2 remains the only package dependency). The fetch subpath also avoids Node built-ins, so it loads as-is on edge and browser runtimes; SSRF policy decisions have been pushed out of the engine specifically to make this possible.

Quick start

import { parse } from "ogpeek";
import { fetchHtml } from "ogpeek/fetch";

const { html, finalUrl } = await fetchHtml("https://ogp.me");
const result = parse(html, { url: finalUrl });

console.log(result.ogp.title);
console.log(result.ogp.images);
for (const w of result.warnings) {
  console.log(`[${w.severity}] ${w.code}: ${w.message}`);
}

API

`parse(html: string, options?: ParseOptions): OgDebugResult`

html — the raw HTML string.
options.url — the base used to resolve relative URLs to absolute URLs. If omitted, the og:url declared in the document is used as the base.
options.jsonldScope — "head" | "document". Where to harvest <script type="application/ld+json"> blocks from. Default is "head" to keep the scan cost predictable; pass "document" to also walk <body> (JSON-LD is often placed there).
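
To illustrate what the options.url base does, here is a minimal sketch of relative-URL resolution using the standard URL constructor. The helper name resolveUrl and its parameters are hypothetical, not part of ogpeek; the fallback to og:url mirrors the behaviour described above, but the engine's actual implementation may differ.

```typescript
// Hypothetical helper: resolve a possibly-relative URL against a base.
// `explicitBase` stands in for options.url, `ogUrl` for the document's og:url.
function resolveUrl(
  value: string,
  explicitBase?: string,
  ogUrl?: string,
): string | null {
  const base = explicitBase ?? ogUrl;
  try {
    // new URL() ignores the base when `value` is already absolute.
    return new URL(value, base).href;
  } catch {
    // Relative value with no usable base, or a malformed URL.
    return null;
  }
}
```

With no base at all, a relative value cannot be resolved, which is the situation a warning such as URL_NOT_ABSOLUTE is meant to flag.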

The return shape:

type OgDebugResult = {
  ogp: OpenGraph;                  // normalized OG tree
  typed: TypedObject | null;       // article / book / profile / music.* / video.*
  twitter: Record<string, string>; // twitter:* passthrough
  raw: Array<{ property: string; content: string }>; // declaration order
  warnings: Warning[];
  // Auxiliary metadata travelling alongside OG:
  icons: Icon[];                   // <link rel="icon" | "apple-touch-icon" | ...>
  jsonld: JsonLd[];                // <script type="application/ld+json"> blocks
  meta: {
    title: string | null;
    canonical: string | null;      // <link rel="canonical">
    prefixDeclared: boolean;       // <html prefix="og: https://ogp.me/ns#">
    charset: string | null;
    applicationName: string | null;// <meta name="application-name">
    themeColor: string | null;     // <meta name="theme-color">
    msTileImage: string | null;    // <meta name="msapplication-TileImage">
    msTileColor: string | null;    // <meta name="msapplication-TileColor">
  };
};

Each structured property (og:image:width and friends) attaches to the most recent parent (og:image). If one appears before any parent, it is reported as an ORPHAN_STRUCTURED_PROPERTY warning.
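
The attachment rule can be sketched as a single pass over the tags in declaration order. This is an illustrative reconstruction limited to og:image, with hypothetical names (groupImages, OgImageNode); it is not the engine's code.

```typescript
type RawTag = { property: string; content: string };
type OgImageNode = { url: string; attrs: Record<string, string> };

// Walk tags in declaration order; og:image:* attaches to the latest og:image.
function groupImages(tags: RawTag[]): { images: OgImageNode[]; orphans: RawTag[] } {
  const images: OgImageNode[] = [];
  const orphans: RawTag[] = [];
  const prefix = "og:image:";
  for (const tag of tags) {
    if (tag.property === "og:image") {
      images.push({ url: tag.content, attrs: {} });
    } else if (tag.property.startsWith(prefix)) {
      const latest = images[images.length - 1];
      if (latest === undefined) {
        // No parent declared yet: this is the ORPHAN_STRUCTURED_PROPERTY case.
        orphans.push(tag);
      } else {
        latest.attrs[tag.property.slice(prefix.length)] = tag.content;
      }
    }
  }
  return { images, orphans };
}
```

Because attachment depends only on order, moving an og:image:width tag above its og:image changes which image it belongs to, or orphans it.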

Auxiliary metadata

Open Graph remains the primary signal. The auxiliary fields are surfaced so that "how does this page advertise itself elsewhere?" debugging stays in one place — they are intentionally kept thin (no schema.org rule checking, no manifest.json fetching).

type Icon = {
  rel: string;     // matched icon token, normalized to one of:
                   // "icon" | "apple-touch-icon"
                   // | "apple-touch-icon-precomposed" | "mask-icon"
                   // | "fluid-icon" (lower-cased)
                   //
                   // <link rel> is a space-separated token set, so a tag
                   // like `rel="shortcut icon"` (legacy IE) or
                   // `rel="icon apple-touch-icon"` (multi-role) is parsed
                   // per token. Multi-role declarations emit one Icon per
                   // matched token, sharing the same href.
  href: string;
  sizes?: string;  // "32x32 16x16" or "any"
  type?: string;   // "image/png"
  color?: string;  // mask-icon color
};

type JsonLd = {
  raw: string;            // original script body
  parsed: unknown | null; // JSON.parse result, or null on failure
  types: string[];        // every @type seen (recurses into @graph)
  error?: string;         // populated when parsed === null
};
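
As a rough illustration of how these two shapes could be produced, the sketch below splits a rel attribute into icon tokens and collects @type values from a JSON-LD body. The function names are hypothetical, and details such as de-duplication or how deep the recursion goes are assumptions rather than documented ogpeek behaviour.

```typescript
const ICON_RELS = new Set([
  "icon",
  "apple-touch-icon",
  "apple-touch-icon-precomposed",
  "mask-icon",
  "fluid-icon",
]);

// rel is a space-separated token set: return each recognised icon token once.
function matchIconRels(rel: string): string[] {
  const matched = new Set<string>();
  for (const token of rel.toLowerCase().split(/\s+/)) {
    if (ICON_RELS.has(token)) matched.add(token);
  }
  return Array.from(matched);
}

// Collect @type values from a node, recursing into arrays and @graph only.
function collectTypes(node: unknown, out: string[] = []): string[] {
  if (Array.isArray(node)) {
    for (const item of node) collectTypes(item, out);
  } else if (node !== null && typeof node === "object") {
    const obj = node as Record<string, unknown>;
    const t = obj["@type"];
    if (typeof t === "string") out.push(t);
    else if (Array.isArray(t)) {
      for (const v of t) if (typeof v === "string") out.push(v);
    }
    collectTypes(obj["@graph"], out);
  }
  return out;
}

type JsonLdRecord = { raw: string; parsed: unknown | null; types: string[]; error?: string };

// Parse one script body; a parse failure becomes data instead of an exception.
function readJsonLd(raw: string): JsonLdRecord {
  try {
    const parsed: unknown = JSON.parse(raw);
    return { raw, parsed, types: collectTypes(parsed) };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { raw, parsed: null, types: [], error };
  }
}
```

Under this sketch, `rel="shortcut icon"` yields only the icon token, since shortcut is not in the recognised set.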

Severity is set on every warning (error / warn / info). Consumers typically render all of them and let the user filter at display time; the engine never decides what is "important enough to show".
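
Display-time filtering can be as simple as bucketing warnings by severity on the consumer side. The sketch below uses a local WarningLike type that mirrors the severity, code, and message fields used in the quick start; the type and the function name are illustrative only.

```typescript
type Severity = "error" | "warn" | "info";
type WarningLike = { severity: Severity; code: string; message: string };

// Bucket warnings so a UI can toggle each severity independently.
function groupBySeverity(warnings: WarningLike[]): Record<Severity, WarningLike[]> {
  const groups: Record<Severity, WarningLike[]> = { error: [], warn: [], info: [] };
  for (const w of warnings) {
    groups[w.severity].push(w);
  }
  return groups;
}
```

Nothing is dropped in the grouping step, so the decision about what to show stays entirely with the caller.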

`fetchHtml(url: string, options?: FetchOptions): Promise<FetchResult>`

Fetches a remote URL and returns the HTML as a string. Timeout, response size cap, and redirect tracing are built in. Each request is sent with redirect: "manual", so fetchHtml follows redirects itself and options.guard runs again on every hop. The result includes redirects: { from, to, status }[], listing every redirect hop in order of occurrence, so a UI can replay the "URL entered → 302 → final" flow exactly.
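
For example, a UI could turn the redirects array into a one-line trail. The helper below is a hypothetical consumer-side sketch; only the { from, to, status } hop shape comes from the description above.

```typescript
type RedirectHop = { from: string; to: string; status: number };

// Render "entered URL -> status -> next URL -> ..." from the recorded hops.
function describeRedirects(requested: string, hops: RedirectHop[]): string {
  const parts: string[] = [requested];
  for (const hop of hops) {
    parts.push(String(hop.status), hop.to);
  }
  return parts.join(" -> ");
}
```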

options.userAgent — User-Agent for outbound requests. Default is a browser-like UA.
options.timeoutMs — request timeout. Default 8000.
options.maxBytes — response size cap. Default 5 MiB. The stream is cancelled when exceeded.
options.guard — (url: URL) => Promise<void> | void. Called right before the initial request and before every redirect hop. Throw a FetchError to block, just return to allow. If unset, no checks are performed — ogpeek does not make SSRF policy decisions.
options.fetch — (url: string, init: RequestInit) => Promise<Response>. A function that performs the HTTP transport for a single hop only. fetchHtml calls this for each redirect hop and reads back one response. Redirect tracing, timeout, maxBytes, content-type judgement, and guard invocation stay owned by fetchHtml, so this injection point is a narrow slot for "transport policy only" — custom dispatcher, DoH resolver, mTLS, etc. Default is globalThis.fetch.
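
The division of labour between guard, the injected transport, and redirect tracing can be pictured with a simplified hop loop. This is a sketch under assumptions, not fetchHtml's source: the name followManually, the maxHops default of 5, and the plain Error are all hypothetical, and timeout and size-cap handling are omitted.

```typescript
type Hop = { from: string; to: string; status: number };
type Transport = (url: string, init: RequestInit) => Promise<Response>;
type Guard = (url: URL) => Promise<void> | void;

async function followManually(
  start: string,
  transport: Transport,
  guard?: Guard,
  maxHops = 5, // assumed limit, for illustration only
): Promise<{ response: Response; finalUrl: string; redirects: Hop[] }> {
  const redirects: Hop[] = [];
  let current = new URL(start);
  for (let i = 0; i <= maxHops; i++) {
    // The guard sees the initial URL and every redirect target.
    if (guard) await guard(current);
    // The transport performs exactly one hop and never follows redirects.
    const response = await transport(current.href, { redirect: "manual" });
    const target = response.headers.get("location");
    const isRedirect = response.status >= 300 && response.status < 400;
    if (!isRedirect || target === null) {
      return { response, finalUrl: current.href, redirects };
    }
    // Location may be relative, so resolve it against the current URL.
    const next = new URL(target, current);
    redirects.push({ from: current.href, to: next.href, status: response.status });
    current = next;
  }
  throw new Error("too many redirects");
}
```

Keeping the loop outside the transport is what lets a custom transport stay small: it only needs to answer one request with one response.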

On failure it throws a FetchError (fields: code, status, message). The main codes: INVALID_URL, UNSUPPORTED_SCHEME, TIMEOUT, NETWORK, UPSTREAM_STATUS, NOT_HTML, TOO_LARGE, REDIRECT_LOOP, TOO_MANY_REDIRECTS, BAD_REDIRECT, GUARD_FAILED (when the guard threw something other than a FetchError).
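
A caller might translate these codes into user-facing messages. The sketch below duck-types on the documented code, status, and message fields instead of importing FetchError, so it stays self-contained; the message wording and the helper names are made up for illustration.

```typescript
type FetchFailure = { code: string; status?: number; message: string };

// Structural check on the documented fields; no import from ogpeek/fetch needed.
function isFetchFailure(err: unknown): err is FetchFailure {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as { code?: unknown; message?: unknown };
  return typeof candidate.code === "string" && typeof candidate.message === "string";
}

function toUserMessage(err: unknown): string {
  if (!isFetchFailure(err)) return "Unexpected error";
  switch (err.code) {
    case "TIMEOUT":
      return "The page took too long to respond.";
    case "NOT_HTML":
      return "The URL did not return an HTML document.";
    case "TOO_LARGE":
      return "The page is larger than the configured size cap.";
    case "UPSTREAM_STATUS":
      return `The server answered with status ${err.status ?? "unknown"}.`;
    default:
      // Fall back to the code and message for anything not mapped above.
      return `Fetch failed (${err.code}): ${err.message}`;
  }
}
```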

SSRF is the caller's responsibility

The engine does not make SSRF policy decisions. The definitions of "private range" and the behaviour of resolvers vary across cloud / on-prem / edge, so making the library own this responsibility leads to a combinatorial explosion. Instead a single guard hook lets the caller inject a guard appropriate to its deployment environment.

import { fetchHtml, FetchError } from "ogpeek/fetch";

await fetchHtml(userInput, {
  guard(url) {
    if (url.hostname === "169.254.169.254") {
      throw new FetchError("BLOCKED_METADATA", 400, "cloud metadata blocked");
    }
  },
});

A real-world guard layers hostname check → DNS resolve → IP-range classification. Use ipaddr.js to classify ranges; on Node, the canonical approach is to use undici's Agent({ connect: { lookup } }) to connect directly to the validated IP, which also defends against DNS rebinding. Edge runtimes (Cloudflare Workers and friends) do not let you open raw TCP, so the practical ceiling there is DoH (cloudflare-dns.com/dns-query) plus a hostname check. For the full threat model and reference implementations, see the OWASP SSRF Prevention Cheat Sheet. This repo's website/lib/ssrf-guard.ts is a concrete example of a Workers-compatible DoH guard.
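
As a starting point, the hostname layer alone might look like the sketch below. It is deliberately incomplete: it covers only localhost names and a few IPv4 literal ranges, performs no DNS resolution, and does not handle IPv6, so it is no substitute for the layered approach described above. The function names are hypothetical, and it throws a plain Error to stay self-contained; a real guard would throw FetchError, since anything else surfaces as GUARD_FAILED.

```typescript
// Hypothetical helper: true for a handful of non-public IPv4 literal ranges.
function isBlockedIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 ||                            // "this network"
    a === 10 ||                           // private 10.0.0.0/8
    a === 127 ||                          // loopback
    (a === 169 && b === 254) ||           // link-local, incl. cloud metadata
    (a === 172 && b >= 16 && b <= 31) ||  // private 172.16.0.0/12
    (a === 192 && b === 168)              // private 192.168.0.0/16
  );
}

// Matches the documented guard shape: (url: URL) => void, throw to block.
function basicGuard(url: URL): void {
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`unsupported scheme: ${url.protocol}`);
  }
  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost") || isBlockedIPv4(host)) {
    throw new Error(`blocked host: ${host}`);
  }
}
```

A hostname that merely resolves to a private address passes this check, which is exactly the gap the DNS-resolve and connect-to-validated-IP layers are meant to close.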

Warning codes

| code | severity | description |
| --- | --- | --- |
| OG_TITLE_MISSING | error | og:title is missing |
| OG_TITLE_TOO_LONG | warn | og:title exceeds 60 characters (truncated by KakaoTalk) |
| OG_TYPE_MISSING | error | og:type is missing |
| OG_IMAGE_MISSING | error | og:image is missing |
| OG_URL_MISSING | error | og:url is missing |
| OG_URL_MISMATCH | warn | og:url host/path disagrees with the actual request URL |
| OG_TYPE_UNKNOWN | warn | og:type value is not in the OGP-spec whitelist |
| URL_NOT_ABSOLUTE | warn | a URL-typed property is not absolute |
| DUPLICATE_SINGLETON | warn | a single-valued property is declared more than once |
| ORPHAN_STRUCTURED_PROPERTY | warn | a structured property appears with no parent |
| INVALID_DIMENSION | warn | width/height failed integer parsing |
| MISSING_PREFIX_ATTR | info | <html prefix> is not declared |
| JSONLD_PARSE_ERROR | warn | a <script type="application/ld+json"> block did not parse as JSON |
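
To make a few rows of the table concrete, the sketch below re-implements the four *_MISSING checks and the title-length check over a raw tag list. It is an illustration, not the engine's validator: the function name is hypothetical, and counting the 60-character limit in UTF-16 code units is an assumption.

```typescript
type Finding = { severity: "error" | "warn" | "info"; code: string; message: string };
type MetaTag = { property: string; content: string };

const REQUIRED: Array<[string, string]> = [
  ["og:title", "OG_TITLE_MISSING"],
  ["og:type", "OG_TYPE_MISSING"],
  ["og:image", "OG_IMAGE_MISSING"],
  ["og:url", "OG_URL_MISSING"],
];

function checkBasics(raw: MetaTag[]): Finding[] {
  const findings: Finding[] = [];
  const present = new Set(raw.map((tag) => tag.property));
  // One error per required property that never appears.
  for (const [property, code] of REQUIRED) {
    if (!present.has(property)) {
      findings.push({ severity: "error", code, message: `${property} is missing` });
    }
  }
  // Length is measured with String.length here (an assumption).
  const title = raw.find((tag) => tag.property === "og:title");
  if (title !== undefined && title.content.length > 60) {
    findings.push({
      severity: "warn",
      code: "OG_TITLE_TOO_LONG",
      message: "og:title exceeds 60 characters",
    });
  }
  return findings;
}
```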

Related projects

The web tool built on this engine: https://github.com/minjun0219/ogpeek

License

MIT.
