llm-page-context
v0.1.0
Turn any web page into clean LLM-ready context strings and structured documents.
Turn any web page into:
- a structured in-memory document
- a minimal LLM-context row
- a ready-to-dump llm_context string
This package is designed for the common workflow:
- fetch a page in a real browser
- extract the main readable content
- convert it into Markdown/text
- return a shape that is easy to feed into an LLM or RAG pipeline
Install
npm
npm install llm-page-context
pnpm
pnpm add llm-page-context
Requirements
- Node.js 22+
The package uses Puppeteer, so the first install may download a browser binary.
If needed, install Chrome for Testing manually:
npx puppeteer browsers install chrome
Quick start
import {
extractLlmReadyPage,
toLlmContextString,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
const llmContext = toLlmContextString(page);
console.log(llmContext);
What you get back
extractLlmReadyPage(url) returns a structured in-memory object like:
{
schema_version: 1,
source: {
requested_url: "https://example.com/",
final_url: "https://example.com/",
captured_at: "2026-03-21T09:31:25.956Z",
content_hash: "..."
},
document: {
title: "Example Domain",
description: "This domain is for use in documentation examples...",
markdown: "...",
text: "...",
html: "...",
metadata: {
description: "..."
},
headings: [],
outgoing_links: [
"https://iana.org/domains/example"
]
},
stats: {
markdown_chars: 145,
text_chars: 141,
heading_count: 0,
link_count: 1,
estimated_tokens: 38
}
}
API
extractLlmReadyPage(url, options?)
Extracts one page and returns the full structured result in memory.
Example
import { extractLlmReadyPage } from "llm-page-context";
const page = await extractLlmReadyPage("https://react.dev/reference/react/useMemo");
console.log(page.document.title);
console.log(page.document.markdown);
Parameters
url: string - The page to extract.
options?: object - Optional extraction settings.
Supported options
headless?: boolean - Default: true. Run the browser headless or visible.
onlyMainContent?: boolean - Default: true. Uses Readability to prefer the main readable content instead of the whole page.
timeout?: number - Default: 90000. Navigation timeout in milliseconds.
bodyTimeout?: number - Default: 15000. Timeout while waiting for body.
waitUntil?: "load" | "domcontentloaded" | "networkidle0" | "networkidle2" - Default: "networkidle2". Puppeteer navigation wait mode.
userAgent?: string - Optional custom user agent.
browserArgs?: string[] - Optional Chromium launch args.
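The options above can be combined in a single call. The option names below are taken from the documented list; the values are purely illustrative:

```
import { extractLlmReadyPage } from "llm-page-context";

// All option names come from the documented list above;
// the values shown here are illustrative, not recommendations.
const page = await extractLlmReadyPage("https://example.com", {
  headless: true,
  onlyMainContent: true,
  timeout: 90_000,
  bodyTimeout: 15_000,
  waitUntil: "networkidle2",
  userAgent: "Mozilla/5.0 (compatible; DocsBot/1.0)", // hypothetical UA string
  browserArgs: ["--no-sandbox"],
});
```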
Returns
Returns the full structured page object.
toLlmContextString(page, options?)
Converts the structured page into a single string that can be dumped directly into an LLM prompt.
Example
import {
extractLlmReadyPage,
toLlmContextString,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://docs.stripe.com/payments/checkout");
const context = toLlmContextString(page);
console.log(context);
Output shape
It produces a string in this format:
Title: ...
URL: ...
Summary: ...
Headings:
- ...
- ...
Content:
...
Supported options
maxHeadings?: number - Default: 20. Limit how many headings are included in the string.
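The output shape above can be sketched as a plain formatting function. This is a hypothetical re-implementation for illustration only, not the package's actual code; the field names (`title`, `final_url`, `description`, `headings`, `markdown`) follow the documented page shape:

```javascript
// Sketch of how toLlmContextString might assemble its output.
// Assumption: headings is a plain string array, as the docs state.
function buildContextString(page, { maxHeadings = 20 } = {}) {
  const doc = page.document;
  const headingLines = (doc.headings ?? [])
    .slice(0, maxHeadings)
    .map((h) => `- ${h}`)
    .join("\n");
  return [
    `Title: ${doc.title}`,
    `URL: ${page.source.final_url}`,
    `Summary: ${doc.description}`,
    "Headings:",
    headingLines,
    "Content:",
    doc.markdown,
  ].join("\n");
}

// Example with a minimal fake page object:
const fakePage = {
  source: { final_url: "https://example.com/" },
  document: {
    title: "Example Domain",
    description: "Illustrative example.",
    headings: ["Intro"],
    markdown: "Example body text.",
  },
};
console.log(buildContextString(fakePage));
```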
Best use case
Use this when you want:
- one prompt-ready string
- no extra transformation step
- a simple context blob for ChatGPT / Claude / OpenAI API / local models
toMinimalLlmContextColumns(page)
Converts the structured page into a compact row-like object.
Example
import {
extractLlmReadyPage,
toMinimalLlmContextColumns,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
const row = toMinimalLlmContextColumns(page);
console.log(row);
Returned columns
{
url,
title,
summary,
content_markdown,
content_text,
headings,
estimated_tokens,
content_hash,
captured_at,
llm_context
}
Notes
- llm_context is already a string, so you can use row.llm_context directly.
- headings is a plain string array.
- estimated_tokens is a rough estimate, not tokenizer-exact.
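The column mapping can be pictured as a simple projection from the structured page onto the row shape above. This is a hypothetical sketch, not the package's implementation; field names follow the documented schemas:

```javascript
// Hypothetical sketch of the mapping performed by toMinimalLlmContextColumns.
// page follows the documented structured shape; llmContext is the
// prompt-ready string (in the real package it is produced internally).
function buildMinimalRow(page, llmContext) {
  const { source, document: doc, stats } = page;
  return {
    url: source.final_url,
    title: doc.title,
    summary: doc.description,
    content_markdown: doc.markdown,
    content_text: doc.text,
    headings: doc.headings, // plain string array
    estimated_tokens: stats.estimated_tokens,
    content_hash: source.content_hash,
    captured_at: source.captured_at,
    llm_context: llmContext, // already a string, usable directly
  };
}

// Example with a minimal fake page object:
const fakePage = {
  source: {
    final_url: "https://example.com/",
    content_hash: "abc123",
    captured_at: "2026-03-21T09:31:25.956Z",
  },
  document: {
    title: "Example Domain",
    description: "Illustrative example.",
    markdown: "Example body text.",
    text: "Example body text.",
    headings: [],
  },
  stats: { estimated_tokens: 38 },
};
const row = buildMinimalRow(fakePage, "Title: Example Domain");
console.log(row.url);
```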
Best use case
Use this when you want:
- one JSON row per page
- a CSV/SQLite/Postgres-friendly shape
- both structured columns and a ready-to-use prompt string
saveLlmReadyPage(page, options?)
Saves a previously extracted page to disk.
Example
import {
extractLlmReadyPage,
saveLlmReadyPage,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
const saved = await saveLlmReadyPage(page, {
outputDir: "./outputs"
});
console.log(saved);
Supported options
outputDir?: string - Default: outputs/easy-llm-ready.
fileSlug?: string - Override the filename stem.
Returns
{
url,
jsonPath,
mdPath,
title,
estimatedTokens
}
Usage patterns
1) Just get a prompt-ready string
import { extractLlmReadyPage, toLlmContextString } from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
const promptContext = toLlmContextString(page);
2) Build a RAG row
import {
extractLlmReadyPage,
toMinimalLlmContextColumns,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
const row = toMinimalLlmContextColumns(page);
const ragDoc = {
id: row.content_hash,
url: row.url,
title: row.title,
content: row.content_markdown,
headings: row.headings,
estimatedTokens: row.estimated_tokens,
};
3) Save files after custom post-processing
import {
extractLlmReadyPage,
saveLlmReadyPage,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
page.document.markdown = page.document.markdown.replace(/\n{3,}/g, "\n\n");
await saveLlmReadyPage(page, {
outputDir: "./clean-pages",
fileSlug: "example-page",
});
CLI
The package also exposes a CLI:
npx llm-page-context https://example.com
Print a quick summary
npx llm-page-context https://example.com
Print the minimal JSON row
npx llm-page-context https://example.com --minimal
Print the prompt-ready string only
npx llm-page-context https://example.com --context
Save JSON + Markdown files
npx llm-page-context https://example.com --save
Local dev examples
pnpm run extract -- https://example.com
pnpm run extract -- https://example.com --minimal
pnpm run extract -- https://example.com --context
pnpm run extract -- https://example.com --save
Schema details
source
Describes where the content came from.
- requested_url
- final_url
- captured_at
- content_hash
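A content hash such as content_hash is typically derived from the extracted content so that identical pages map to the same identifier. The exact input and algorithm used by this package are not documented here, so the following is an assumption (SHA-256 over the markdown), shown only to illustrate the idea:

```javascript
import { createHash } from "node:crypto";

// Assumption: hash the extracted markdown so repeated fetches of
// unchanged content produce the same id. The package's actual hash
// input and algorithm may differ.
function contentHash(markdown) {
  return createHash("sha256").update(markdown, "utf8").digest("hex");
}

console.log(contentHash("Example body text."));
```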
document
The extracted content.
- title
- description
- markdown
- text
- html
- metadata
- headings
- outgoing_links
stats
Useful operational metadata.
- markdown_chars
- text_chars
- heading_count
- link_count
- estimated_tokens
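The docs note that estimated_tokens is a rough estimate, not tokenizer-exact. A common heuristic (an assumption, not necessarily what this package uses) is about four characters per token for English text:

```javascript
// Rough token estimate: ~4 characters per token for English text.
// This is a common rule of thumb, not the package's actual formula:
// the example page reports text_chars 141 and estimated_tokens 38,
// while this heuristic gives 36, so the real formula differs slightly.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens("a".repeat(141))); // → 36
```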
Limitations
- Some pages still include navigation or boilerplate noise.
- Heavily authenticated or anti-bot protected sites may fail.
- Token count is approximate.
- Different sites may need different browser timing settings.
Attribution
This project is inspired by the open-source extraction pipeline in AnyCrawl.
- AnyCrawl: https://github.com/any4ai/AnyCrawl
See THIRD_PARTY_NOTICES.md for attribution and license details.
