@mrmartineau/xtractr

v0.2.7

Published

3 months ago

Extract clean, structured content from web pages with automatic short-link expansion and lightweight page-type detection.

Downloads

157

0High
0Medium
0Low

mrmartineau

xtractr

Extract clean, structured content from web pages with automatic short-link expansion and lightweight page-type detection.

xtractr fetches a URL, follows redirects, parses the HTML, converts the main content to Markdown, and returns normalized metadata you can use in apps, pipelines, and AI workflows.

It is primarily intended for use inside a Cloudflare Worker runtime.

Features

Expands shortened URLs and returns the full redirect chain
Extracts readable page content with defuddle
Converts extracted HTML content to Markdown
Detects content/page type using:
- Open Graph (og:type)
- JSON-LD (@type) with nested traversal
- domain and file-extension fallback rules
Enforces a max page size (5MB) for safer fetching
Works in ESM/CJS builds with TypeScript declarations

Installation

npm install @mrmartineau/xtractr
bun add @mrmartineau/xtractr
pnpm add @mrmartineau/xtractr
yarn add @mrmartineau/xtractr

Usage

import { xtract } from '@mrmartineau/xtractr'

const result = await xtract('https://bit.ly/example')

console.log(result.title)
console.log(result.content) // markdown
console.log(result.redirectUrls)
console.log(result.pageType)

Cloudflare Worker example

import { xtract } from '@mrmartineau/xtractr'

export default {
  async fetch(request: Request): Promise<Response> {
    const { searchParams } = new URL(request.url)
    const target = searchParams.get('url')

    if (!target) {
      return new Response('Missing "url" query parameter', { status: 400 })
    }

    const data = await xtract(target)
    return Response.json(data)
  },
}

API

`xtract(targetUrl: string): Promise<XtractResponse>`

Fetches, parses, and extracts structured content from a URL.

`XtractResponse`

title: string - extracted page title
author: string - extracted author (if found)
published: string - published date string (if found)
description: string - summary/description (if found)
domain: string - source domain
content: string - extracted main content as Markdown
wordCount: number - estimated word count from extracted content
source: string - original input URL
url: string - final fetched URL
resolvedUrl: string - resolved URL after unshortening/fetch redirects
redirectUrls: string[] - full redirect chain
urlType: LinkType - detected type for the URL
pageType: LinkType - detected type for the extracted page
favicon?: string - favicon URL if available
image?: string - representative image URL if available
site?: string - site/publication name if available

`LinkType`

type LinkType =
  | 'link'
  | 'video'
  | 'audio'
  | 'recipe'
  | 'image'
  | 'document'
  | 'article'
  | 'game'
  | 'book'
  | 'event'
  | 'product'
  | 'note'
  | 'file'

Notes and limits

Non-HTML responses are rejected.
Responses larger than 5MB are rejected.
Redirect chasing is capped (currently 20 hops).
Intended runtime: Cloudflare Workers.
Also works in other runtimes that provide fetch (Node 18+ recommended).

Development

# Build (CJS + ESM + DTS)
npm run build

# Watch mode
npm run dev

# Lint/format
npm run check

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme