@ahtmljs/langchain

v0.4.0

Published

18 hours ago

LangChain.js document loader for AHTML. Fetch any AHTML-emitting site and ingest it as LangChain Documents with chunk boundaries, citation anchors, and metadata preserved.

Downloads

315

0High
0Medium
0Low

dibbayajyotiroy

ahtml langchain langchain-loader document-loader rag agent agent-web ai llm vector-db embeddings

@ahtmljs/langchain

LangChain.js document loader for AHTML.

Pull any AHTML-emitting site directly into your RAG pipeline. Returns LangChain Documents with chunk boundaries, citation anchors, and metadata preserved.

npm install @ahtmljs/langchain @ahtmljs/agent @ahtmljs/schema

📊 How well does an AI read it?

We asked an AI 20 questions about the same page — given in 4 different formats:

| Format you give the AI | Tokens used | Right answers | |---|---:|---:| | Plain HTML | 684 | 91% | | llms.txt | 227 | 89% | | AHTML compact | 338 | 95% | | AHTML JSON | 365 | 100% ✓ |

AHTML JSON: every answer right. AHTML compact: ~50% fewer tokens than HTML — and still more accurate. RAG pipelines ingest snapshots straight via this loader.

Real API calls to gpt-4o-mini, claude-haiku-4.5, gemini-2.5-flash, llama-3.3-70b at temperature=0.
20 hand-graded questions an AI agent actually wants to know: price, in stock?, SKU, return window, confirmation needed?, author, publication date, etc.
Tokens counted with the official OpenAI + Anthropic tokenizers (gpt-tokenizer, @anthropic-ai/tokenizer). No text.length/4 guessing.
Cost from real provider usage × public prices.
Reproduce: git clone https://github.com/DibbayajyotiRoy/AHTML && cp .env.example .env && bash scripts/run-llm-benchmark.sh

Full report · Source

Quickstart

import { AHTMLLoader } from '@ahtmljs/langchain';
import { Chroma } from '@langchain/community/vectorstores/chroma';
import { OpenAIEmbeddings } from '@langchain/openai';

const loader = new AHTMLLoader('https://docs.acmecloud.com');
const docs = await loader.load();

await Chroma.fromDocuments(docs, new OpenAIEmbeddings(), { collectionName: 'acmecloud-docs' });

That's it. The site's AHTML route serves typed entities; the loader maps each one to a LangChainDocument. For long-form document entities, each chunks[] entry becomes its own record with chunk_anchor preserved for citation-grounded answers.

Why this is better than HTML scraping loaders

| | HTML loader (@langchain/community/document_loaders/web/html) | AHTMLLoader | |---|---|---| | Source format | DOM scrape (Cheerio / Playwright) | typed semantic snapshot | | Tokens to embed | full HTML noise | only the agent-readable content | | Chunk boundaries | heuristic (paragraphs, separators) | publisher-defined, content-addressed | | Citation anchors | brittle CSS selectors | first-class chunk_anchor | | Stability | DOM churn breaks pipelines | etag-stable, content-addressed ids | | Auth | manual cookies | passes bearer to AHTML's auth: required |

API

`new AHTMLLoader(url | url[], options?)`

interface AHTMLLoaderOptions {
  fetch?: typeof fetch;       // custom fetch (e.g. for testing or proxies)
  agent?: string;             // User-Agent string
  bearer?: string;            // for auth: required content
  includeParent?: boolean;    // include parent doc as a record (default: true)
  filterType?: 'product' | 'document' | 'task' | 'profile' | 'dataset' | 'conversation';
}

`.load(): Promise<LangChainDocument[]>`

Fetches each URL, walks each entity, returns a flat array of Documents.

For document entities with chunks:

Parent doc record (full content)
One record per chunk with pageContent = the chunk's byte range and metadata.chunk_anchor for in-context citations

Metadata preserved on every record

| Field | Source | |---|---| | source | snapshot URL | | entity_id | the entity's stable id | | entity_type | product / document / task / profile / dataset / conversation | | page_type | snapshot's page_type | | fetched_at | timestamp | | etag | ETag for cache invalidation | | license | snapshot.policy.license for legal pipelines | | title, author, published_at (documents) | the typed fields | | chunk_id, chunk_anchor, byte_range (chunks) | from Document.chunks[] |

Citation example

A common RAG pattern: ground the answer in a source URL + anchor.

const docs = await new AHTMLLoader('https://docs.acmecloud.com/api/auth').load();
const chunk = docs.find((d) => d.metadata.chunk_anchor === '#bearer-tokens');

// In your prompt:
// "Source: {{source}}{{chunk_anchor}}\n{{pageContent}}"
// → "Source: https://docs.acmecloud.com/api/auth#bearer-tokens\n..."

License

MIT. See the main AHTML repository.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@ahtmljs/langchain

📊 How well does an AI read it?

Quickstart

Why this is better than HTML scraping loaders

API

new AHTMLLoader(url | url[], options?)

.load(): Promise<LangChainDocument[]>

Metadata preserved on every record

Citation example

License

`new AHTMLLoader(url | url[], options?)`

`.load(): Promise<LangChainDocument[]>`