@ahtmljs/schema
v0.9.5
AHTML semantic snapshot schema — types, validator, builder, and dual-format serializers (JSON + token-optimal compact text). The agent-readable contract layer for the web.
The AI-agent contract layer for any TypeScript app — types, validator, dual-format serializers, structural diff, linter, JWS signing, and pluggable KV/Cache interfaces. Used by every other @ahtmljs/* package.
npm install @ahtmljs/schema

import { snapshot, toCompact, validate } from '@ahtmljs/schema';

const snap = snapshot('https://shop.com/p/mbp-14', 'product_detail')
  .add({ id: 'p:mbp-14', type: 'product', name: 'MacBook Pro 14"',
         price: { amount: 1999, currency: 'USD' } })
  .build();

console.log(toCompact(snap)); // token-optimal text, feed straight to an LLM
console.log(validate(snap)); // structured issues, never throws

Zero runtime dependencies. ESM-only. Runs on Node 20+, Cloudflare Workers, Vercel Edge, Bun, and Deno, with no node:* imports anywhere in the package.
How well does an AI read it?
We asked four frontier models 20 questions about the same page — given in 4 different formats.
| Format you give the AI | Tokens used | Right answers |
|---|---:|---:|
| Plain HTML | 684 | 91% |
| llms.txt | 227 | 89% |
| AHTML compact | 338 | 95% |
| AHTML JSON | 365 | 100% |
AHTML JSON: every answer right. AHTML compact: ~50% fewer tokens than HTML — and more accurate.
- Real API calls to gpt-4o-mini, claude-haiku-4.5, gemini-2.5-flash, llama-3.3-70b at temperature=0.
- 20 hand-graded questions an AI agent actually wants to know: price, in stock?, SKU, return window, confirmation needed?, author, publication date, etc.
- Tokens counted with the official OpenAI + Anthropic tokenizers (gpt-tokenizer, @anthropic-ai/tokenizer). No text.length/4 guessing.
- Reproduce:
git clone https://github.com/DibbayajyotiRoy/AHTML && cp .env.example .env && bash scripts/run-llm-benchmark.sh
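The savings figures follow directly from the table. As a quick check, here is a small sketch that recomputes them; the token counts are copied from the table above, and the names HTML_TOKENS and savingsVsHtml are made up for this example.

```typescript
// Token counts copied from the benchmark table; plain HTML is the baseline.
const HTML_TOKENS = 684;

const results = [
  { format: 'llms.txt', tokens: 227 },
  { format: 'AHTML compact', tokens: 338 },
  { format: 'AHTML JSON', tokens: 365 },
];

// Percentage of tokens saved relative to plain HTML, rounded to one decimal.
function savingsVsHtml(tokens: number): number {
  return Math.round(((HTML_TOKENS - tokens) / HTML_TOKENS) * 1000) / 10;
}

const savings = results.map((r) => ({
  format: r.format,
  savedPct: savingsVsHtml(r.tokens),
}));
// llms.txt: 66.8, AHTML compact: 50.6, AHTML JSON: 46.6
```

llms.txt is the smallest input but also the least accurate row in the table, so token count alone is not the whole story.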
What this package gives you
- TypeScript types for Snapshot, six entity primitives (Product, Document, Task, Profile, Dataset, Conversation), plus Action, Policy, Provenance, Links, SnapshotDiff, Chunk (the RAG primitive).
- snapshot() builder DSL: fluent, typed, deterministic.
- Zero-dependency runtime validator (validate) returning structured issues with path + severity. Plus a throwing variant (validateStrict) for hot paths that want to bubble up an AHTMLError with code SCHEMA_INVALID.
- lint(s) quality linter with best-practice rules beyond validity: a priced product with no stock, a product-detail page with no actions, an action with charge_card side-effects but no required confirmation, a truncated dataset with no next link, a dangling action target. Every finding has a stable rule id you can suppress in CI.
- Two serializations:
  - toJson(s) / fromJson(text): canonical JSON, deterministic, signable. application/ahtml+json.
  - toCompact(s) / fromCompact(text): token-optimal text, lossless round-trip. application/ahtml+text. Default for LLMs.
- Streaming: toJsonSeq(s) produces NDJSON for application/ahtml+json-seq, so you can feed snapshots to an LLM as they assemble.
- diff(prev, next) / applyDiff(prev, d): structural snapshot diffing for the application/ahtml-diff+json incremental endpoint.
- computeEtag(s): content-addressed weak ETag, deterministic across runtimes.
- sign() / verifySnapshot() (new in v0.8): detached JWS over the canonical JSON via Web Crypto. Works on Workers, Edge, Bun, Deno.
- KvStore and CacheStore<T> interfaces: pluggable contracts implemented by @ahtmljs/next, the agent client, and any third-party Redis/D1/KV adapter.
- JSON Schema 2020-12 spec at ./schema.json, also published as application/schema+json for tool generation.
- Property-based fuzzing tests ensure every snapshot round-trips losslessly between compact and JSON forms.
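To make the difference between validity and lint quality concrete, here is a rough sketch of one rule from the list above: an action with a charge_card side effect but no required confirmation. It is not the package's implementation. The ActionLike and Finding shapes, the rule id, and the path format are all assumptions made for illustration; only the rule, path, and message field names come from the quickstart.

```typescript
// Hypothetical minimal shapes; the real Action type has more fields.
interface ActionLike {
  id: string;
  side_effects?: string[];
  confirmation?: string;
}

interface Finding {
  rule: string; // stable id that CI could suppress
  path: string;
  message: string;
}

// Flags actions that charge a card without requiring confirmation.
// The rule id 'charge-without-confirmation' is invented for this sketch.
function lintChargeWithoutConfirmation(actions: ActionLike[]): Finding[] {
  const findings: Finding[] = [];
  actions.forEach((action, index) => {
    const charges = action.side_effects?.includes('charge_card') ?? false;
    if (charges && action.confirmation !== 'required') {
      findings.push({
        rule: 'charge-without-confirmation',
        path: `actions[${index}]`,
        message: `action "${action.id}" charges a card but does not require confirmation`,
      });
    }
  });
  return findings;
}
```

A snapshot can pass validation and still trip a rule like this, which is why the two checks are kept separate.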
Quickstart — full builder
import { snapshot, toCompact, toJson, validate, lint } from '@ahtmljs/schema';

const snap = snapshot('https://shop.com/products/mbp-14', 'product_detail')
  .ttl(60)
  .policy({ agents_welcome: true, license: 'MIT', rate_limit: '100/min' })
  .add({
    id: 'product:mbp-14',
    type: 'product',
    name: 'MacBook Pro 14"',
    price: { amount: 1999, currency: 'USD' },
    stock: { status: 'in_stock', quantity: 42 },
  })
  .action({
    id: 'purchase',
    target: 'product:mbp-14',
    category: 'transact',
    execute_url: '/api/checkout',
    auth: 'required',
    cost: { amount: 1999, currency: 'USD', category: 'purchase' },
    reversible: { reversible: true, window: 'P30D', policy: 'full_refund' },
    side_effects: ['charge_card', 'email_buyer', 'decrement_stock'],
    confirmation: 'required',
  })
  .build();

console.log(toCompact(snap)); // token-optimal text, the default for LLM agents
console.log(toJson(snap)); // canonical JSON, signable

const issues = validate(snap);
if (issues.some((i) => i.severity === 'error')) throw new Error('invalid');

// validate() asks "is it legal?"; lint() asks "is it useful to an agent?"
for (const w of lint(snap)) {
  console.warn(`[${w.rule}] ${w.path}: ${w.message}`);
}

Diff and apply: incremental snapshots
For long-running agents, you do not want to re-send the whole snapshot every minute. The diff format is content-aware and stable.
import { diff, applyDiff, computeEtag } from '@ahtmljs/schema';
const d = diff(prev, next); // SnapshotDiff
const reconstructed = applyDiff(prev, d);
console.assert(computeEtag(reconstructed) === computeEtag(next));

computeEtag(s) is deterministic across Node, Workers, and Deno: the same snapshot produces the same etag everywhere, which is what makes the application/ahtml-diff+json endpoint cacheable.
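To show the idea behind diff and applyDiff, here is a minimal sketch of structural diffing over a list of entities keyed by id. It is not the SnapshotDiff wire format. The Entity and EntityDiff shapes and both function names are assumptions for this example, and comparing with JSON.stringify assumes a stable key order.

```typescript
interface Entity {
  id: string;
  [field: string]: unknown;
}

// Hypothetical diff shape, invented for this sketch.
interface EntityDiff {
  added: Entity[];
  removed: string[]; // ids of entities that disappeared
  changed: Entity[]; // full replacements for entities whose content differs
}

function diffEntities(prev: Entity[], next: Entity[]): EntityDiff {
  const prevById = new Map(prev.map((e) => [e.id, e] as const));
  const nextIds = new Set(next.map((e) => e.id));
  const added: Entity[] = [];
  const changed: Entity[] = [];
  for (const e of next) {
    const old = prevById.get(e.id);
    if (old === undefined) added.push(e);
    // Key-order-sensitive comparison; good enough for a sketch.
    else if (JSON.stringify(old) !== JSON.stringify(e)) changed.push(e);
  }
  const removed = prev.filter((e) => !nextIds.has(e.id)).map((e) => e.id);
  return { added, removed, changed };
}

// Rebuilds the next list from prev plus the diff.
// Added entities are appended, so their original position is not preserved.
function applyEntityDiff(prev: Entity[], d: EntityDiff): Entity[] {
  const removed = new Set(d.removed);
  const replacements = new Map(d.changed.map((e) => [e.id, e] as const));
  const kept = prev
    .filter((e) => !removed.has(e.id))
    .map((e) => replacements.get(e.id) ?? e);
  return [...kept, ...d.added];
}
```

When only a price or stock count changes, the diff carries one entity instead of the whole snapshot, which is the point of the incremental endpoint.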
Signing — detached JWS over canonical JSON (v0.8)
For supply-chain trust, sign the snapshot with Web Crypto. No Node crypto dependency.
import { sign, snapshot, verifySnapshot } from '@ahtmljs/schema';

const { publicKey, privateKey } = await crypto.subtle.generateKey(
  { name: 'ECDSA', namedCurve: 'P-256' }, true, ['sign', 'verify'],
);

const snap = snapshot('https://news.com/article/42', 'document_detail')
  .add({ id: 'doc:42', type: 'document', name: 'The story' })
  .build();

const signature = await sign(snap, { key: privateKey, kid: 'site-2026-q2' });
// signature is a detached JWS; ship it in the `AHTML-Signature` header

const result = await verifySnapshot(snap, signature, {
  trustedKeys: { 'site-2026-q2': publicKey },
});
if (!result.ok) throw result.error; // typed AHTMLError, code SIGNATURE_INVALID

Failures throw or return a typed AHTMLError with one of the 13 stable codes introduced in v0.6: SCHEMA_INVALID, DIFF_INVALID, COMPACT_PARSE, JSON_PARSE, ETAG_MISMATCH, NETWORK, HTTP_STATUS, AUTH_REQUIRED, POLICY_DENIED, RATE_LIMITED, TIMEOUT, CACHE_POISONED, SIGNATURE_INVALID. Agents can switch on err.code.
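Because the codes are stable strings, an agent can map each one to a recovery strategy. Here is a minimal sketch, assuming a local AhtmlErrorCode union copied from the list above and a Recovery type invented for this example. The mapping is illustrative, not something the package prescribes.

```typescript
// The 13 codes listed above, written as a local union type.
type AhtmlErrorCode =
  | 'SCHEMA_INVALID' | 'DIFF_INVALID' | 'COMPACT_PARSE' | 'JSON_PARSE'
  | 'ETAG_MISMATCH' | 'NETWORK' | 'HTTP_STATUS' | 'AUTH_REQUIRED'
  | 'POLICY_DENIED' | 'RATE_LIMITED' | 'TIMEOUT' | 'CACHE_POISONED'
  | 'SIGNATURE_INVALID';

// Hypothetical recovery strategies, invented for this sketch.
type Recovery = 'retry' | 'reauthenticate' | 'refetch' | 'abort';

function recoveryFor(code: AhtmlErrorCode): Recovery {
  switch (code) {
    case 'NETWORK':
    case 'TIMEOUT':
    case 'RATE_LIMITED':
      return 'retry'; // transient: back off and try again
    case 'AUTH_REQUIRED':
      return 'reauthenticate';
    case 'ETAG_MISMATCH':
    case 'CACHE_POISONED':
      return 'refetch'; // local state is stale: drop it and fetch a full snapshot
    case 'SCHEMA_INVALID':
    case 'DIFF_INVALID':
    case 'COMPACT_PARSE':
    case 'JSON_PARSE':
    case 'HTTP_STATUS':
    case 'POLICY_DENIED':
    case 'SIGNATURE_INVALID':
      return 'abort'; // retrying the same request is unlikely to help
  }
}
```

Listing every case without a default lets the TypeScript compiler flag the switch if a new code is added to the union later.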
KV and Cache interfaces
The schema package owns the contracts; adapters provide implementations.
import type { KvStore, CacheStore, Snapshot } from '@ahtmljs/schema';

const myKv: KvStore = {
  async get(key) { /* ... */ return null; },
  async set(key, value, opts) { /* ... */ },
  async delete(key) { /* ... */ },
};

const myCache: CacheStore<Snapshot> = {
  async get(key) { /* ... */ return null; },
  async set(key, value, ttlMs) { /* ... */ },
};

Pass either to @ahtmljs/next's plugin, to @ahtmljs/agent's client, or to your own Workers/Vercel KV layer. Snapshots are serializable via toJson so any string-keyed KV works.
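For local development or tests, an in-memory adapter is often enough. The sketch below assumes the cache contract is just the get/set pair shown above and redeclares it locally as CacheStoreLike; the real CacheStore<T> may declare more members. MemoryCache, read, write, and the injectable clock are all names made up for this example.

```typescript
// Local copy of the shape shown above (an assumption, not the package's type).
interface CacheStoreLike<T> {
  get(key: string): Promise<T | null>;
  set(key: string, value: T, ttlMs: number): Promise<void>;
}

class MemoryCache<T> implements CacheStoreLike<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Synchronous core: returns null for missing or expired keys.
  read(key: string): T | null {
    const entry = this.entries.get(key);
    if (entry === undefined) return null;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // evict lazily on read
      return null;
    }
    return entry.value;
  }

  write(key: string, value: T, ttlMs: number): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  // Async wrappers that satisfy the interface.
  async get(key: string): Promise<T | null> {
    return this.read(key);
  }

  async set(key: string, value: T, ttlMs: number): Promise<void> {
    this.write(key, value, ttlMs);
  }
}
```

A Redis or KV-backed adapter would keep the same two async methods and replace the Map with network calls.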
RAG chunks — deterministic, content-addressed
Chunk is the primitive every RAG pipeline needs but rarely standardizes: a stable id derived from content, a parent_id linking back to the entity, byte and token offsets, and an embedding_hint.
import { chunksFromEntity, computeChunkId } from '@ahtmljs/schema';
const chunks = chunksFromEntity(snap.entities[0], { targetTokens: 512 });
// each chunk.id is sha256(content) — same input always yields same id,
// so two agents indexing the same page never duplicate vectors

The chunk id is deterministic across runtimes, which is what makes "how to cite a web page in a RAG answer" actually reproducible.
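The content-addressing idea is easy to demonstrate in isolation. This sketch is not chunksFromEntity: it splits on a fixed character count rather than tokens, records character offsets rather than byte and token offsets, and uses 32-bit FNV-1a as a synchronous stand-in for SHA-256 so the example stays short. ChunkSketch, chunkText, and fnv1a32 are hypothetical names.

```typescript
// 32-bit FNV-1a over UTF-16 code units; a stand-in for sha256 in this sketch only.
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, '0');
}

interface ChunkSketch {
  id: string;        // derived from content only
  parent_id: string; // links back to the entity
  start: number;     // character offsets (the real Chunk uses byte and token offsets)
  end: number;
  content: string;
}

function chunkText(parentId: string, text: string, maxChars: number): ChunkSketch[] {
  if (maxChars < 1) throw new RangeError('maxChars must be at least 1');
  const chunks: ChunkSketch[] = [];
  for (let start = 0; start < text.length; start += maxChars) {
    const content = text.slice(start, start + maxChars);
    chunks.push({
      id: `chunk:${fnv1a32(content)}`,
      parent_id: parentId,
      start,
      end: start + content.length,
      content,
    });
  }
  return chunks;
}
```

Since the id depends only on the content, re-indexing an unchanged page yields the same ids, so an index can skip chunks it already holds.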
Emitters (v0.8)
The well-known descriptor, MCP tool list, OpenAPI 3.1 spec, and llms.txt emitters used to live in @ahtmljs/next. As of v0.8 they are extracted here under @ahtmljs/schema/emit/* and re-exported by adapters. Use them directly in any framework.
import { emitWellKnown, emitMcp, emitOpenApi, emitLlmsTxt }
  from '@ahtmljs/schema/emit';

const wellKnown = emitWellKnown({ origin: 'https://shop.com', routes });
const mcpTools = emitMcp({ snapshots }); // MCP spec 2025-11-25
const openapi = emitOpenApi({ snapshots }); // OpenAPI 3.1, JSON Schema 2020-12
const llmsTxt = emitLlmsTxt({ origin, routes }); // llmstxt.org format

This is what lets one plugin expose your site as an MCP server, an OpenAPI provider, a JSON-LD source, and an llms.txt, without duplicate code.
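For a sense of what an llms.txt emitter produces, here is a rough sketch of the llmstxt.org layout: an H1 title, a blockquote summary, then an H2 section of markdown links. It is not the package's emitLlmsTxt. The RouteInfo shape, the function name, and the "Pages" section title are assumptions for this example.

```typescript
// Hypothetical route description; the real emitter's input may differ.
interface RouteInfo {
  path: string;
  title: string;
  description: string;
}

function emitLlmsTxtSketch(
  siteName: string,
  summary: string,
  origin: string,
  routes: RouteInfo[],
): string {
  // One markdown link per route, resolved against the site origin.
  const links = routes.map(
    (r) => `- [${r.title}](${new URL(r.path, origin).href}): ${r.description}`,
  );
  return [`# ${siteName}`, '', `> ${summary}`, '', '## Pages', '', ...links, ''].join('\n');
}

const llmsTxtExample = emitLlmsTxtSketch('Shop', 'Laptops and accessories.', 'https://shop.com', [
  { path: '/products/mbp-14', title: 'MacBook Pro 14', description: 'Product detail page' },
]);
```

The same route list could drive the other emitters, which is the argument for keeping them next to the schema.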
Why this exists — concrete numbers
- 321 tests passing across the AHTML monorepo at v0.7. v0.8 adds JWS signing tests.
- 5 wire formats all defined here: application/ahtml+text, application/ahtml+json, application/ahtml+json-seq, application/ahtml-diff+json, application/schema+json.
- 0 runtime dependencies. The whole package fits within Cloudflare Workers' 1 MB script limit without bundler tricks.
- Lossless round-trip. Property-based fuzzing covers all six entity types: fromCompact(toCompact(s)) and fromJson(toJson(s)) both reproduce s exactly (deep equality); these are invariants, not aspirations.
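The round-trip invariant is a statement about deep equality: in JavaScript a parsed copy is never the same object reference as the original. One way to check it, sketched here with plain JSON.stringify / JSON.parse standing in for toJson / fromJson, is to compare a key-sorted canonical form of both values. The canonicalize and roundTripsThroughJson names are invented for this example.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serializes with object keys sorted, so equal values always yield equal strings.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== 'object') return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((v) => canonicalize(v)).join(',')}]`;
  const keys = Object.keys(value).sort();
  const members = keys.map((k) => `${JSON.stringify(k)}:${canonicalize(value[k] as Json)}`);
  return `{${members.join(',')}}`;
}

// True when a serialize/parse cycle reproduces the value exactly.
function roundTripsThroughJson(value: Json): boolean {
  const copy = JSON.parse(JSON.stringify(value)) as Json;
  return canonicalize(copy) === canonicalize(value);
}
```

A property-based test would generate many random snapshots and assert this for each one. The same canonical string is also what makes a deterministic ETag or signature possible.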
What is AHTML?
AHTML turns any website into an MCP server, an OpenAPI 3.1 provider, a JSON-LD source, and a token-optimal semantic snapshot — from one plugin. This package is the schema underneath. Most users want:
- @ahtmljs/next: Next.js plugin, auto-emits MCP + llms.txt + OpenAPI
- @ahtmljs/vite: Vite / SvelteKit / Astro / Remix plugin
- @ahtmljs/agent: typed client SDK with retry, timeout, request coalescing, streaming
- @ahtmljs/langchain: LangChain document loader
Discovery
Every AHTML-enabled site exposes /.well-known/ahtml.json. Point an agent at the origin and it discovers the snapshot routes, the MCP endpoint, the OpenAPI spec URL, and the llms.txt location.
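On the client side, discovery amounts to resolving one fixed path against the site origin. Here is a minimal sketch, assuming nothing about the descriptor's fields (it is returned as unknown). The wellKnownUrl and discover names and the FetchJson loader type are made up for this example.

```typescript
// Builds the descriptor URL from any URL on the site; only the origin is kept.
function wellKnownUrl(siteUrl: string): string {
  return new URL('/.well-known/ahtml.json', siteUrl).href;
}

// Hypothetical loader type, so the sketch does not depend on a particular runtime.
type FetchJson = (url: string) => Promise<unknown>;

// Fetches the descriptor; its shape is left untyped here on purpose.
async function discover(siteUrl: string, fetchJson: FetchJson): Promise<unknown> {
  return fetchJson(wellKnownUrl(siteUrl));
}

// With a global fetch, the loader could be:
//   (url) => fetch(url).then((r) => r.json())
```

Passing the loader in keeps the function easy to test and lets an agent reuse its own caching or retry layer.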
Documentation
- Repository: DibbayajyotiRoy/AHTML
- Spec: SPEC.md
- Plan / roadmap: PLAN.md
- For AI assistants: docs/agents.md
- Comparison vs MCP / llms.txt / schema.org / OpenAPI: docs/compare.md
Search keywords / Related
People land on this package looking for: ahtml, agent-readable html, mcp server nextjs, how to add mcp to a nextjs app, how to make my site readable by ai agents, how to expose my website as an mcp server, convert website to mcp server, generate llms.txt automatically from nextjs, json-ld vs llms.txt vs mcp, ai-ready website, machine-readable website, structured data for llms, well-known ahtml, token-efficient html, token-optimal compact text serializer for llm, snapshot for llm, json schema for ai snapshot, lossless json serializer, snapshot diff and apply, snapshot lint rules, snapshot signing jws, detached jws typescript, kv cache interface, rag chunk primitive, deterministic content-addressed chunk id, snapshot etag determinism, property-based fuzzing snapshot, json-ld extractor, schema.org to ahtml, openapi 3.1 generator, model context protocol, rss for ai agents, site to mcp server, best way to feed html to gpt, reduce tokens when scraping html for llm, rag pipeline for an entire website, how to cite a web page in a rag answer.
Compared with: firecrawl, scrapingbee, crawlee, apify, browserless, playwright scraper, puppeteer scraper, jina reader, r.jina.ai, schema.org, json-ld, llms.txt, llmstxt.org, anthropic mcp sdk, openai mcp sdk, cursor mcp, modelcontextprotocol typescript sdk, claude desktop mcp, fastmcp, mcp-framework, smithery mcp, vercel ai sdk, langchain webloader, cheerio loader, unstructured.io, readability.js, mozilla readability, trafilatura, diffbot, browserbase, spider rs, exa search, tavily, perplexity api, scrapegraph ai. AHTML is the publisher-side contract those tools would consume if your site emitted one.
AI agent build queries this package answers: build ai agent that browses the web, agent http fetching with cache, agent retry with backoff typescript, request coalescing fetch, typed errors for ai agent sdk, streaming snapshot to llm, llm context window optimizer, tokenizer for cost estimate o200k_base.
npm keywords
Current keywords in package.json: ahtml, agent, agent-web, semantic-web, ai, llm, crawler, mcp, model-context-protocol, llms-txt, json-ld, schema, openapi. Proposed additions for v0.8 (paste into package.json):
{
"keywords": [
"ahtml",
"agent",
"agent-web",
"semantic-web",
"ai",
"llm",
"crawler",
"mcp",
"mcp-server",
"model-context-protocol",
"llms-txt",
"json-ld",
"schema",
"openapi",
"openapi-3.1",
"json-schema",
"rag",
"rag-chunks",
"jws",
"detached-jws",
"web-crypto",
"snapshot",
"snapshot-diff",
"etag",
"kv-store",
"cache-store",
"edge-runtime",
"cloudflare-workers",
"vercel-edge",
"deno",
"bun",
"tokenizer",
"ai-agent",
"agent-sdk"
]
}

License
MIT — copyright Dibbayajyoti Roy.
