html2llm

v0.1.2

Published

11 hours ago

Convert HTML to CSX (Compact S-Expression) for token-efficient LLM context


crayoooon

html2llm

Convert HTML to CSX (Compact S-Expression) — a token-efficient format for feeding web content to LLMs.

Why CSX?

Raw HTML is token-expensive: every <div class="wrapper"> pays the cost of angle brackets, tag names, attribute syntax, and closing tags. CSX drops that syntactic overhead and encodes the same structure as compact S-expressions, a notation LLMs already read well.

| Format | Example |
|--------|---------|
| HTML | <p class="lead">Hello <strong>world</strong></p> |
| Hiccup | [:p {:class "lead"} "Hello " [:strong "world"]] |
| CSX | (p.lead Hello (b world)) |
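The three snippets in the table above can be measured directly. The sketch below only counts characters and does not call html2llm; character count is a rough proxy for token count, not an exact measure.

```javascript
// Character counts for the three encodings from the comparison table.
// Illustration only: this does not use html2llm itself.
const samples = {
  html: '<p class="lead">Hello <strong>world</strong></p>',
  hiccup: '[:p {:class "lead"} "Hello " [:strong "world"]]',
  csx: "(p.lead Hello (b world))",
};

const lengths = Object.fromEntries(
  Object.entries(samples).map(([name, text]) => [name, text.length]),
);

console.log(lengths);
// { html: 48, hiccup: 47, csx: 24 }
```

For this small example the CSX form is half the length of the HTML form. Real pages vary, as the benchmark section shows.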

Install

npm install html2llm

Usage

As a module

import { htmlToCSX, urlToCSX } from "html2llm";

// Convert HTML string to CSX
const html = `<article class="post">
  <h1>Title</h1>
  <p>This is <strong>important</strong> content.</p>
</article>`;

console.log(htmlToCSX(html));
// (article.post (h1 Title) (p "This is" (b important) content.))

console.log(htmlToCSX(html, { minify: false }));
// (article.post
//   (h1 Title)
//   (p "This is" (b important) content.))

// Fetch a URL and convert its HTML to CSX
const csx = await urlToCSX("https://example.com");
console.log(csx);
// (div (h1 "Example Domain") (p ...))

// Pretty-printed
const prettyCsx = await urlToCSX("https://example.com", { minify: false });

// Use headless Chromium to bypass JS anti-bot walls (zhihu, etc.)
const zhihuCsx = await urlToCSX("https://www.zhihu.com/question/...", {
  headless: true,
});

As a CLI

# From file
html2llm page.html

# From URL (fetches and converts)
html2llm https://example.com

# From URL with headless browser (bypasses anti-bot JS challenges)
html2llm https://www.zhihu.com/question/... --headless

# From stdin
curl -s https://example.com | html2llm

# Pretty-printed
html2llm page.html --pretty
html2llm https://example.com --pretty

As a web service (like Jina)

Start the server:

# Development (hot reload)
npm run dev:server

# Production
npm run build && npm start

Then curl any URL to get CSX:

# Fetch a page
curl "localhost:3000/https://example.com"

# Pretty-printed
curl "localhost:3000/https://example.com?pretty"

# Bypass JS anti-bot challenges with headless Chromium
curl "localhost:3000/https://www.zhihu.com/question/...?headless"

# Combine both
curl "localhost:3000/https://www.zhihu.com/question/...?headless&pretty"

# Auto-prepends https:// if you omit the protocol
curl "localhost:3000/example.com"

# Health check
curl "localhost:3000/health"

The server:

Listens on PORT env (default 3000)
Fetches URLs server-side with a 15s timeout
Limits responses to 5 MB
Returns CSX as text/plain; charset=utf-8
Sets X-Original-URL response header
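The path handling shown in the curl examples above can be sketched as follows. This is not the server's actual code: `parseRequest` and its return shape are hypothetical, and treating every query parameter other than `pretty` and `headless` as part of the target URL is an assumption.

```javascript
// Hypothetical helper: turn a request path such as
// "/example.com?pretty&headless" into a target URL plus options.
function parseRequest(pathAndQuery) {
  const raw = pathAndQuery.replace(/^\//, "");
  const queryStart = raw.indexOf("?");
  const target = queryStart === -1 ? raw : raw.slice(0, queryStart);
  const params = new URLSearchParams(
    queryStart === -1 ? "" : raw.slice(queryStart + 1),
  );

  // The two service flags are consumed here. Anything else is assumed to
  // belong to the target page and is passed through unchanged.
  const pretty = params.has("pretty");
  const headless = params.has("headless");
  params.delete("pretty");
  params.delete("headless");

  // Auto-prepend https:// when the protocol is omitted.
  const withProtocol = /^https?:\/\//i.test(target)
    ? target
    : `https://${target}`;
  const rest = params.toString();
  return {
    url: rest ? `${withProtocol}?${rest}` : withProtocol,
    pretty,
    headless,
  };
}

console.log(parseRequest("/example.com?pretty"));
// { url: 'https://example.com', pretty: true, headless: false }
```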

Public instance

A hosted instance is available at html2llm.cyncyn.xyz — use it like r.jina.ai:

# Fetch any webpage as CSX
curl "https://html2llm.cyncyn.xyz/https://example.com"

# Pretty-printed for readability
curl "https://html2llm.cyncyn.xyz/https://example.com?pretty"

# Feed to an LLM in one pipeline
curl -s "https://html2llm.cyncyn.xyz/https://example.com" | llm "summarize this page"

# Bypass JS anti-bot walls with headless Chromium
curl "https://html2llm.cyncyn.xyz/https://www.zhihu.com/question/...?headless"

# Omit https:// — auto-prepended
curl "https://html2llm.cyncyn.xyz/example.com"

# Health check
curl "https://html2llm.cyncyn.xyz/health"

Query params:

| Param | Effect |
|---|---|
| ?pretty | Indented, human-readable output |
| ?headless | Use headless Chromium to bypass JS challenges (slower) |
| ?pretty&headless | Combine both |
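From JavaScript, the request URL for the hosted instance can be assembled with a small helper. `serviceUrl` is a hypothetical name used for illustration; only the base URL and the two flags come from this README.

```javascript
// Hypothetical client helper for the hosted instance.
const BASE = "https://html2llm.cyncyn.xyz";

function serviceUrl(target, { pretty = false, headless = false } = {}) {
  const flags = [];
  if (pretty) flags.push("pretty");
  if (headless) flags.push("headless");
  // Flags are appended as bare query parameters, as in the curl examples.
  return `${BASE}/${target}${flags.length ? `?${flags.join("&")}` : ""}`;
}

console.log(serviceUrl("https://example.com", { pretty: true }));
// https://html2llm.cyncyn.xyz/https://example.com?pretty
```

The resulting string can be passed to `fetch` (built into Node.js 18 and later), and the response body read with `.text()`.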

Docker

docker build -t html2llm .
docker run -p 3000:3000 html2llm

# With custom port
docker run -p 8080:8080 -e PORT=8080 html2llm

CSX Format Rules

| Rule | Example |
|------|---------|
| Element | (tag children...) |
| ID shorthand | (div#main ...) |
| Class shorthand | (div.card.dark ...) |
| Attributes | (img src=photo.jpg alt=photo) |
| Quoted attr value | (img alt="a nice photo") |
| Single-word text | (p Hello) |
| Multi-word text | (p "Hello world") |
| Boolean attribute | (input type=checkbox checked) |
| Tag synonyms | strong→b, em→i |
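To show how the rules in the table above combine, here is a minimal serializer sketch. It is not html2llm's implementation: the node shape (`{ tag, attrs, children }`), the `toCSX` name, and the choice to quote only values containing whitespace are assumptions, and escaping of quotes or parentheses inside text is left out.

```javascript
// Tag synonyms from the rules table.
const SYNONYMS = new Map([["strong", "b"], ["em", "i"]]);

// Quote a value only when it contains whitespace (assumed rule).
const quoteIfNeeded = (s) => (/\s/.test(s) ? `"${s}"` : s);

// Hypothetical node shape: a string, or { tag, attrs?, children? }.
function toCSX(node) {
  if (typeof node === "string") return quoteIfNeeded(node.trim());

  const { tag, attrs = {}, children = [] } = node;
  const { id, class: className, ...rest } = attrs;

  // Head: tag (or its synonym), then #id and .class shorthands.
  let head = SYNONYMS.get(tag) ?? tag;
  if (id) head += `#${id}`;
  if (className) {
    head += className.split(/\s+/).filter(Boolean).map((c) => `.${c}`).join("");
  }

  const parts = [head];
  for (const [name, value] of Object.entries(rest)) {
    // Boolean attributes are written as a bare name.
    const isBoolean = value === true || value === "";
    parts.push(isBoolean ? name : `${name}=${quoteIfNeeded(String(value))}`);
  }
  for (const child of children) {
    const text = toCSX(child);
    if (text) parts.push(text); // skip whitespace-only text
  }
  return `(${parts.join(" ")})`;
}

const tree = {
  tag: "p",
  attrs: { class: "lead" },
  children: ["Hello ", { tag: "strong", children: ["world"] }],
};
console.log(toCSX(tree));
// (p.lead Hello (b world))
```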

Stripped: script, style, svg, noscript, iframe, link, meta, HTML comments, all non-semantic attributes.

Flattened: bare div/span with no attributes and a single child are hoisted.
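The stripping and flattening passes can be sketched on the same kind of tree. This is an illustration under stated assumptions, not the library's code: the node shape and the `simplify` name are hypothetical, stripping is assumed to run before flattening, and removal of non-semantic attributes is omitted because the README does not list which attributes count.

```javascript
// Tags listed under "Stripped" above.
const STRIPPED = new Set([
  "script", "style", "svg", "noscript", "iframe", "link", "meta",
]);

// Hypothetical node shape: a string, or { tag, attrs?, children? }.
// Returns null for nodes that should be dropped.
function simplify(node) {
  if (typeof node === "string") return node;
  if (STRIPPED.has(node.tag)) return null;

  const children = (node.children ?? [])
    .map(simplify)
    .filter((child) => child !== null);
  const attrs = node.attrs ?? {};
  const bare =
    (node.tag === "div" || node.tag === "span") &&
    Object.keys(attrs).length === 0;

  // A bare wrapper with exactly one child is replaced by that child.
  if (bare && children.length === 1) return children[0];
  return { ...node, attrs, children };
}

const page = {
  tag: "div",
  children: [
    {
      tag: "div",
      children: [
        { tag: "script", children: ["track()"] },
        { tag: "p", children: ["Hi"] },
      ],
    },
  ],
};
console.log(simplify(page));
// { tag: 'p', attrs: {}, children: [ 'Hi' ] }
```

Both wrapper divs disappear: once the script is stripped, each has a single child and no attributes.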

Benchmark

Token counts approximated at 4 characters/token (GPT-family).

| URL | HTML tokens | CSX tokens | Reduction |
|-----|-------------|------------|-----------|
| github.com | 142,589 | 33,022 | 76.8% |
| news.ycombinator.com | 8,619 | 5,793 | 32.8% |
| bbc.com/news | 81,821 | 20,038 | 75.5% |

Results from running npm run benchmark against live URLs on 2026-05-15.
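The Reduction column can be reproduced from the two token columns. The sketch below assumes the estimate rounds up with `Math.ceil`; the actual benchmark script may round differently, and `estimateTokens` and `reduction` are names chosen here for illustration.

```javascript
// Rough token estimate: 4 characters per token, as stated above.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Percentage reduction from HTML tokens to CSX tokens, one decimal place.
const reduction = (htmlTokens, csxTokens) =>
  ((1 - csxTokens / htmlTokens) * 100).toFixed(1) + "%";

console.log(reduction(142589, 33022)); // 76.8%
console.log(reduction(8619, 5793));    // 32.8%
console.log(reduction(81821, 20038));  // 75.5%
```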

License

MIT
