
marklift

v0.1.3


URL → Clean Markdown SDK (agent-optimized)


Marklift

URL → Clean Markdown — Fetch a webpage, extract the main content, and convert it to LLM-friendly Markdown. Built for agents and pipelines.

  • Fetches HTTP(S) URLs with configurable timeout and headers
  • Source types: website, twitter (Nitter), reddit — inferred from the URL when not specified; the Medium adapter has been removed for now
  • Extracts article content with Mozilla Readability (or raw body)
  • Converts to Markdown with Turndown and custom rules
  • Optimizes for agents: normalizes spacing, dedupes links, strips tracking params, optional chunking
  • Typed API and CLI
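The link-cleanup bullet above ("strips tracking params") can be illustrated with a small sketch. This is not marklift's actual implementation, and the exact set of parameters it strips may differ:

```javascript
// Illustration of tracking-parameter stripping using the standard URL API.
// The parameter list here is an assumption, not marklift's actual list.
const TRACKING_PARAMS = [
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
  "fbclid", "gclid",
];

function stripTrackingParams(rawUrl) {
  const url = new URL(rawUrl);
  for (const param of TRACKING_PARAMS) url.searchParams.delete(param);
  return url.toString();
}

console.log(stripTrackingParams("https://example.com/a?id=1&utm_source=x&fbclid=y"));
// → https://example.com/a?id=1
```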

Requirements: Node.js 18+


The Web Is Not LLM-Ready — Raw HTML is noisy, heavy, littered with tracking junk, inconsistent across sites, and expensive in tokens.

Install

npm install marklift

Usage

Programmatic

import { urlToMarkdown } from "marklift";

// source is inferred from URL when omitted (twitter/x.com → twitter, reddit → reddit, else website)
const result = await urlToMarkdown("https://example.com/article", {
  timeout: 10_000,
});
const tweet = await urlToMarkdown("https://x.com/user/status/123"); // uses twitter adapter

console.log(result.title);
console.log(result.markdown);
console.log(result.wordCount, result.sections.length, result.links.length);
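The inference rule from the comment above can be sketched as follows. This mirrors the documented mapping (twitter/x.com → twitter, reddit → reddit, else website) but is an illustration, not marklift's internal code:

```javascript
// Illustration of the documented source-inference rule.
function inferSource(rawUrl) {
  const host = new URL(rawUrl).hostname.replace(/^www\./, "");
  if (host === "twitter.com" || host === "x.com" || host.includes("nitter")) {
    return "twitter";
  }
  if (host === "reddit.com" || host.endsWith(".reddit.com")) {
    return "reddit";
  }
  return "website";
}

console.log(inferSource("https://x.com/user/status/123"));  // → twitter
console.log(inferSource("https://old.reddit.com/r/node"));  // → reddit
console.log(inferSource("https://example.com/article"));    // → website
```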

CLI

# Install globally to get the `marklift` command
npm install -g marklift

# Convert a URL to Markdown (prints to stdout). Source is inferred from URL.
marklift https://example.com
marklift https://x.com/user/status/123   # uses twitter adapter
marklift https://reddit.com/r/...         # uses reddit adapter

# Output full result as JSON
marklift https://example.com --json

# Options
marklift https://example.com --timeout 15000
marklift https://example.com --chunk-size 2000
marklift https://example.com --source website   # override inferred source

CLI options:

| Option | Description |
| ------ | ----------- |
| `--source <website\|twitter\|reddit>` | Source adapter (default: inferred from URL). Override when needed. |
| `--timeout <ms>` | Request timeout in milliseconds (default: 15000) |
| `--chunk-size <n>` | Split markdown into chunks of ~n characters |
| `--json` | Output full result as JSON instead of markdown |


API

urlToMarkdown(url, options?)

Converts a URL to clean Markdown. Returns a Promise<MarkdownResult>.

Options:

| Option | Type | Description |
| ------ | ---- | ----------- |
| `source` | `"website" \| "twitter" \| "reddit"` | Source adapter. Default: inferred from URL (twitter.com/x.com/nitter → twitter, reddit.com → reddit, else website). Override to force a specific adapter. |
| `timeout` | `number` | Request timeout in ms (default: 15000) |
| `headers` | `Record<string, string>` | Custom HTTP headers (e.g. User-Agent) |
| `chunkSize` | `number` | If set, `result.chunks` will contain token-safe chunks |
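To give a rough idea of what fence-safe chunking (the chunkSize option) involves, here is a sketch. It is illustrative only; marklift's actual splitter also protects tables and may size chunks differently:

````javascript
// Sketch of chunking that never splits inside a fenced code block.
function chunkMarkdown(markdown, chunkSize) {
  // Group lines into blocks; a fenced code block is kept indivisible.
  const blocks = [];
  let current = [];
  let inFence = false;
  for (const line of markdown.split("\n")) {
    current.push(line);
    if (line.trimStart().startsWith("```")) inFence = !inFence;
    if (!inFence && line.trim() === "") {
      blocks.push(current.join("\n") + "\n");
      current = [];
    }
  }
  if (current.length) blocks.push(current.join("\n"));

  // Pack whole blocks into chunks of roughly chunkSize characters.
  const contents = [];
  let buffer = "";
  for (const block of blocks) {
    if (buffer && buffer.length + block.length > chunkSize) {
      contents.push(buffer);
      buffer = "";
    }
    buffer += block;
  }
  if (buffer) contents.push(buffer);
  return contents.map((content, index) => ({ content, index, total: contents.length }));
}
````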

Result (MarkdownResult):

  • url — Original URL
  • title — Page title
  • description — Meta description (if present)
  • markdown — Full markdown with source-specific frontmatter (see below) + body
  • sections — { heading, content }[], split by heading (stable order)
  • links — Deduplicated links, sorted (tracking params stripped)
  • wordCount — Approximate word count
  • contentHash — SHA-256 of optimized markdown (stability checks)
  • metadata? — Structured metadata (OG, canonical, author, publishedAt, image, language)
  • chunks? — When chunkSize is set: { content, index, total }[] (no split inside code blocks or tables)

urlToMarkdownStream(url, options?)

Async generator that yields MarkdownChunk (meta, sections, links) as they are produced. Useful for streaming into an LLM or pipeline.

Markdown format (per source)

Each adapter outputs markdown with a frontmatter block (delimited by --- lines) followed by the body.

Website (and reddit) — format type: website. Medium is not currently supported.

---
source: https://example.com/article
canonical: https://example.com/article
title: Example Article Title
description: Short meta description
author: John Doe
published_at: 2025-01-12
language: en
content_hash: <sha256>
word_count: 1243
---
# Title

Body content…

Twitter:

---
platform: twitter
source: https://twitter.com/username/status/1234567890
tweet_id: 1234567890
author:
  name: Author Name
published_at: 2025-01-10T18:22:00Z
language: en
content_hash: <sha256>
---
Body content…
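A consumer can split this output back into frontmatter and body with a small sketch like the following. It handles only flat key: value lines; indented continuation lines (e.g. the nested author block in the twitter format) are skipped for simplicity:

```javascript
// Sketch: splitting marklift output into frontmatter fields and body,
// assuming the documented format of ---, key: value lines, ---, then body.
function splitFrontmatter(markdown) {
  const match = markdown.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return { frontmatter: {}, body: markdown };
  const frontmatter = {};
  for (const line of match[1].split("\n")) {
    const i = line.indexOf(":");
    // Only flat keys; indented (nested) lines are skipped.
    if (i > 0 && !line.startsWith(" ")) {
      frontmatter[line.slice(0, i).trim()] = line.slice(i + 1).trim();
    }
  }
  return { frontmatter, body: markdown.slice(match[0].length) };
}

const { frontmatter, body } = splitFrontmatter(
  "---\nsource: https://example.com\ntitle: Example\n---\n# Title\n\nBody…"
);
console.log(frontmatter.title); // → Example
console.log(body);              // → # Title ... Body…
```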

Errors

  • InvalidUrlError — Invalid or non-HTTP(S) URL
  • FetchError — Network error, timeout, or non-2xx response
  • ParseError — Readability or parsing failure

Production note: the website and reddit adapters use a browser-like User-Agent by default so that requests from servers and datacenters receive full HTML. The Twitter adapter keeps the Marklift User-Agent so Nitter continues to work. Override via the headers option if needed.


Example

import { urlToMarkdown, urlToMarkdownStream } from "marklift";

// One-shot (source inferred from URL)
const result = await urlToMarkdown("https://blog.example.com/post", {
  timeout: 10_000,
  chunkSize: 2000,
});
console.log(result.title, result.wordCount);
if (result.chunks) {
  for (const chunk of result.chunks) {
    // Send chunk to LLM, etc.
  }
}

// Streaming
for await (const chunk of urlToMarkdownStream(
  "https://blog.example.com/post"
)) {
  process.stdout.write(chunk.content);
}

Testing

npm test          # unit + E2E (E2E needs network)
npm run test:unit # unit only (no network)
npm run test:e2e  # E2E with real URLs only

Set SKIP_E2E=1 to skip E2E tests (e.g. in CI without network).


Contributing

Contributions are welcome. See CONTRIBUTING.md for setup, code style, and how to submit changes.


License

MIT