html2llm
v0.1.2
Convert HTML to CSX (Compact S-Expression) for token-efficient LLM context
html2llm
Convert HTML to CSX (Compact S-Expression) — a token-efficient format for feeding web content to LLMs.
Why CSX?
Raw HTML is token-expensive: every <div class="wrapper"> pays for angle brackets, tag names, attribute syntax, and closing tags. CSX drops that syntactic overhead and encodes the same structure as compact S-expressions, a notation LLMs already handle well.
| Format | Example |
|--------|---------|
| HTML | <p class="lead">Hello <strong>world</strong></p> |
| Hiccup | [:p {:class "lead"} "Hello " [:strong "world"]] |
| CSX | (p.lead Hello (b world)) |
Install
npm install html2llm
Usage
As a module
import { htmlToCSX, urlToCSX } from "html2llm";
// Convert HTML string to CSX
const html = `<article class="post">
<h1>Title</h1>
<p>This is <strong>important</strong> content.</p>
</article>`;
console.log(htmlToCSX(html));
// (article.post (h1 Title) (p "This is" (b important) content.))
console.log(htmlToCSX(html, { minify: false }));
// (article.post
// (h1 Title)
// (p "This is" (b important) content.))
// Fetch a URL and convert its HTML to CSX
const csx = await urlToCSX("https://example.com");
console.log(csx);
// (div (h1 "Example Domain") (p ...))
// Pretty-printed
const prettyCsx = await urlToCSX("https://example.com", { minify: false });
// Use headless Chromium to bypass JS anti-bot walls (zhihu, etc.)
const zhihuCsx = await urlToCSX("https://www.zhihu.com/question/...", {
headless: true,
});
As a CLI
# From file
html2llm page.html
# From URL (fetches and converts)
html2llm https://example.com
# From URL with headless browser (bypasses anti-bot JS challenges)
html2llm https://www.zhihu.com/question/... --headless
# From stdin
curl -s https://example.com | html2llm
# Pretty-printed
html2llm page.html --pretty
html2llm https://example.com --pretty
As a web service (like Jina)
Start the server:
# Development (hot reload)
npm run dev:server
# Production
npm run build && npm start
Then curl any URL to get CSX:
# Fetch a page
curl "localhost:3000/https://example.com"
# Pretty-printed
curl "localhost:3000/https://example.com?pretty"
# Bypass JS anti-bot challenges with headless Chromium
curl "localhost:3000/https://www.zhihu.com/question/...?headless"
# Combine both
curl "localhost:3000/https://www.zhihu.com/question/...?headless&pretty"
# Auto-prepends https:// if you omit the protocol
curl "localhost:3000/example.com"
# Health check
curl "localhost:3000/health"
The server:
- Listens on PORT env (default 3000)
- Fetches URLs server-side with a 15s timeout
- Limits responses to 5 MB
- Returns CSX as text/plain; charset=utf-8
- Sets X-Original-URL response header
Public instance
A hosted instance is available at html2llm.cyncyn.xyz — use it like r.jina.ai:
# Fetch any webpage as CSX
curl "https://html2llm.cyncyn.xyz/https://example.com"
# Pretty-printed for readability
curl "https://html2llm.cyncyn.xyz/https://example.com?pretty"
# Feed to an LLM in one pipeline
curl -s "https://html2llm.cyncyn.xyz/https://example.com" | llm "summarize this page"
# Bypass JS anti-bot walls with headless Chromium
curl "https://html2llm.cyncyn.xyz/https://www.zhihu.com/question/...?headless"
# Omit https:// — auto-prepended
curl "https://html2llm.cyncyn.xyz/example.com"
# Health check
curl "https://html2llm.cyncyn.xyz/health"
Query params:
| Param | Effect |
|---|---|
| ?pretty | Indented, human-readable output |
| ?headless | Use headless Chromium to bypass JS challenges (slower) |
| ?pretty&headless | Combine both |
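The URL scheme above (target URL in the path, flags in the query string, https:// prepended when the protocol is missing) can be sketched as a small parser. This is a hypothetical illustration, not the server's actual code: the names parseRequest and ParsedRequest are invented here, and the choice to pass any query parameters other than pretty and headless through to the target URL is an assumption the README does not confirm.

```typescript
// Hypothetical sketch of request parsing; not taken from the html2llm source.
interface ParsedRequest {
  target: string;    // URL the server would fetch
  pretty: boolean;   // ?pretty flag present
  headless: boolean; // ?headless flag present
}

function parseRequest(pathAndQuery: string): ParsedRequest {
  // "/https://example.com?pretty" -> "https://example.com?pretty"
  const raw = pathAndQuery.replace(/^\//, "");
  const qIndex = raw.indexOf("?");
  const base = qIndex === -1 ? raw : raw.slice(0, qIndex);
  const query = qIndex === -1 ? "" : raw.slice(qIndex + 1);

  const flags = new Set<string>();
  const kept: string[] = [];
  for (const part of query.split("&")) {
    if (part === "pretty" || part === "headless") {
      flags.add(part);
    } else if (part !== "") {
      // Assumption: unknown parameters belong to the target URL.
      kept.push(part);
    }
  }

  // Auto-prepend https:// when the protocol is omitted.
  const withScheme = /^https?:\/\//i.test(base) ? base : `https://${base}`;
  const target = kept.length > 0 ? `${withScheme}?${kept.join("&")}` : withScheme;
  return { target, pretty: flags.has("pretty"), headless: flags.has("headless") };
}
```

With this sketch, parseRequest("/example.com") yields the target https://example.com with both flags off, and parseRequest("/https://example.com/a?headless&pretty") yields https://example.com/a with both flags on.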
Docker
docker build -t html2llm .
docker run -p 3000:3000 html2llm
# With custom port
docker run -p 8080:8080 -e PORT=8080 html2llm
CSX Format Rules
| Rule | Example |
|------|---------|
| Element | (tag children...) |
| ID shorthand | (div#main ...) |
| Class shorthand | (div.card.dark ...) |
| Attributes | (img src=photo.jpg alt=photo) |
| Quoted attr value | (img alt="a nice photo") |
| Single-word text | (p Hello) |
| Multi-word text | (p "Hello world") |
| Boolean attribute | (input type=checkbox checked) |
| Tag synonyms | strong→b, em→i |
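To make the rules in the table concrete, here is a minimal serializer sketch that applies them to a hand-built node tree. It is an illustration under stated assumptions, not html2llm's implementation: the CsxNode and CsxElement types and the atom and toCsx functions are hypothetical, values are quoted only when they contain whitespace, and escaping of quotes inside values is left out because the README does not specify it.

```typescript
// Hypothetical node shape for illustration; not the library's internal types.
type CsxNode = string | CsxElement;
interface CsxElement {
  tag: string;
  attrs?: Record<string, string | true>; // true marks a boolean attribute
  children?: CsxNode[];
}

const TAG_SYNONYMS: Record<string, string> = { strong: "b", em: "i" };

// Single words stay bare; anything containing whitespace is quoted.
function atom(value: string): string {
  return /\s/.test(value) ? `"${value}"` : value;
}

function toCsx(node: CsxNode): string {
  if (typeof node === "string") return atom(node.trim());

  const { id, class: cls, ...rest } = node.attrs ?? {};
  let head = TAG_SYNONYMS[node.tag] ?? node.tag;
  if (typeof id === "string") head += `#${id}`; // ID shorthand
  if (typeof cls === "string") {
    // Class shorthand: "card dark" -> ".card.dark"
    head += cls.split(/\s+/).filter(Boolean).map((c) => `.${c}`).join("");
  }

  const attrParts = Object.entries(rest).map(([key, value]) =>
    value === true ? key : `${key}=${atom(value)}`
  );
  // Whitespace-only text nodes serialize to "" and are dropped.
  const childParts = (node.children ?? []).map(toCsx).filter((s) => s.length > 0);
  return `(${[head, ...attrParts, ...childParts].join(" ")})`;
}

console.log(toCsx({ tag: "input", attrs: { type: "checkbox", checked: true } }));
// (input type=checkbox checked)
```

Fed the article example from the Usage section as a tree, this sketch produces the same minified string shown there.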
Stripped: script, style, svg, noscript, iframe, link, meta, HTML comments, all non-semantic attributes.
Flattened: bare div/span with no attributes and a single child are hoisted.
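The stripping and flattening steps can be sketched as one recursive pass over a node tree. This is a rough illustration, not the library's code: the HtmlNode type and the simplify function are hypothetical, only tag stripping is covered (comments and non-semantic attributes are left out), and counting a wrapper's children after stripping is an assumption.

```typescript
// Hypothetical tree shape for illustration; not the library's internal types.
type HtmlNode = string | HtmlElementNode;
interface HtmlElementNode {
  tag: string;
  attrs: Record<string, string>;
  children: HtmlNode[];
}

const STRIPPED_TAGS = new Set(["script", "style", "svg", "noscript", "iframe", "link", "meta"]);

// Returns the simplified node, or null when the node is stripped entirely.
function simplify(node: HtmlNode): HtmlNode | null {
  if (typeof node === "string") return node;
  if (STRIPPED_TAGS.has(node.tag)) return null;

  const children = node.children
    .map(simplify)
    .filter((child): child is HtmlNode => child !== null);

  // A bare div/span (no attributes) with exactly one child is replaced by that child.
  const isBareWrapper =
    (node.tag === "div" || node.tag === "span") && Object.keys(node.attrs).length === 0;
  const [only] = children;
  if (isBareWrapper && children.length === 1 && only !== undefined) return only;

  return { ...node, children };
}
```

Because the recursion simplifies children first, nested wrappers such as a bare div inside a bare div collapse all the way down to the innermost meaningful element.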
Benchmark
Token counts approximated at 4 characters/token (GPT-family).
| URL | HTML tokens | CSX tokens | Reduction |
|-----|-------------|------------|-----------|
| github.com | 142,589 | 33,022 | 76.8% |
| news.ycombinator.com | 8,619 | 5,793 | 32.8% |
| bbc.com/news | 81,821 | 20,038 | 75.5% |
Results from running npm run benchmark against live URLs on 2026-05-15.
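The 4 characters/token approximation and the Reduction column can be reproduced with a few lines of arithmetic. This is a sketch of the calculation only, not the benchmark script itself: the function names are hypothetical, and rounding the token estimate up is an assumption.

```typescript
const CHARS_PER_TOKEN = 4;

// Rough token estimate: character count divided by 4, rounded up (assumed rounding).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Percentage of tokens saved by CSX relative to the HTML baseline.
function reductionPercent(htmlTokens: number, csxTokens: number): number {
  return (1 - csxTokens / htmlTokens) * 100;
}

const html = '<p class="lead">Hello <strong>world</strong></p>';
const csx = "(p.lead Hello (b world))";
console.log(estimateTokens(html), estimateTokens(csx)); // 12 6
console.log(reductionPercent(142589, 33022).toFixed(1)); // 76.8
```

The last line recomputes the github.com row from its two token counts.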
License
MIT
