llm-page-context
v0.1.0
Turn any web page into clean LLM-ready context strings and structured documents.
Turn any web page into:
- a structured in-memory document
- a minimal LLM-context row
- a ready-to-dump llm_context string
This package is designed for the common workflow:
- fetch a page in a real browser
- extract the main readable content
- convert it into Markdown/text
- return a shape that is easy to feed into an LLM or RAG pipeline
Install
npm
npm install llm-page-context
pnpm
pnpm add llm-page-context
Requirements
- Node.js 22+
The package uses Puppeteer, so the first install may download a browser binary.
If needed, install Chrome for Testing manually:
npx puppeteer browsers install chrome
Quick start
import {
extractLlmReadyPage,
toLlmContextString,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
const llmContext = toLlmContextString(page);
console.log(llmContext);
What you get back
extractLlmReadyPage(url) returns a structured in-memory object like:
{
schema_version: 1,
source: {
requested_url: "https://example.com/",
final_url: "https://example.com/",
captured_at: "2026-03-21T09:31:25.956Z",
content_hash: "..."
},
document: {
title: "Example Domain",
description: "This domain is for use in documentation examples...",
markdown: "...",
text: "...",
html: "...",
metadata: {
description: "..."
},
headings: [],
outgoing_links: [
"https://iana.org/domains/example"
]
},
stats: {
markdown_chars: 145,
text_chars: 141,
heading_count: 0,
link_count: 1,
estimated_tokens: 38
}
}
API
extractLlmReadyPage(url, options?)
Extracts one page and returns the full structured result in memory.
Example
import { extractLlmReadyPage } from "llm-page-context";
const page = await extractLlmReadyPage("https://react.dev/reference/react/useMemo");
console.log(page.document.title);
console.log(page.document.markdown);
Parameters
url: string - The page to extract.
options?: object - Optional extraction settings.
Supported options
headless?: boolean - Default: true. Run the browser headless or visible.
onlyMainContent?: boolean - Default: true. Uses Readability to prefer the main readable content instead of the whole page.
timeout?: number - Default: 90000. Navigation timeout in milliseconds.
bodyTimeout?: number - Default: 15000. Timeout while waiting for body.
waitUntil?: "load" | "domcontentloaded" | "networkidle0" | "networkidle2" - Default: "networkidle2". Puppeteer navigation wait mode.
userAgent?: string - Optional custom user agent.
browserArgs?: string[] - Optional Chromium launch args.
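The options above can be combined in a single call. The option names below are taken from the documented list; the values are purely illustrative:

```
import { extractLlmReadyPage } from "llm-page-context";

// All option names come from the documented list above;
// the values shown here are illustrative, not recommendations.
const page = await extractLlmReadyPage("https://example.com", {
  headless: true,
  onlyMainContent: true,
  timeout: 90_000,
  bodyTimeout: 15_000,
  waitUntil: "networkidle2",
  userAgent: "Mozilla/5.0 (compatible; DocsBot/1.0)", // hypothetical UA string
  browserArgs: ["--no-sandbox"],
});
```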
Returns
Returns the full structured page object.
toLlmContextString(page, options?)
Converts the structured page into a single string that can be dumped directly into an LLM prompt.
Example
import {
extractLlmReadyPage,
toLlmContextString,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://docs.stripe.com/payments/checkout");
const context = toLlmContextString(page);
console.log(context);
Output shape
It produces a string in this format:
Title: ...
URL: ...
Summary: ...
Headings:
- ...
- ...
Content:
...
Supported options
maxHeadings?: number - Default: 20. Limit how many headings are included in the string.
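The output shape above can be sketched as a plain formatting function. This is a hypothetical re-implementation for illustration only, not the package's actual code; the field names (`title`, `final_url`, `description`, `headings`, `markdown`) follow the documented page shape:

```javascript
// Sketch of how toLlmContextString might assemble its output.
// Assumption: headings is a plain string array, as the docs state.
function buildContextString(page, { maxHeadings = 20 } = {}) {
  const doc = page.document;
  const headingLines = (doc.headings ?? [])
    .slice(0, maxHeadings)
    .map((h) => `- ${h}`)
    .join("\n");
  return [
    `Title: ${doc.title}`,
    `URL: ${page.source.final_url}`,
    `Summary: ${doc.description}`,
    "Headings:",
    headingLines,
    "Content:",
    doc.markdown,
  ].join("\n");
}

// Example with a minimal fake page object:
const fakePage = {
  source: { final_url: "https://example.com/" },
  document: {
    title: "Example Domain",
    description: "Illustrative example.",
    headings: ["Intro"],
    markdown: "Example body text.",
  },
};
console.log(buildContextString(fakePage));
```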
Best use case
Use this when you want:
- one prompt-ready string
- no extra transformation step
- a simple context blob for ChatGPT / Claude / OpenAI API / local models
toMinimalLlmContextColumns(page)
Converts the structured page into a compact row-like object.
Example
import {
extractLlmReadyPage,
toMinimalLlmContextColumns,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
const row = toMinimalLlmContextColumns(page);
console.log(row);
Returned columns
{
url,
title,
summary,
content_markdown,
content_text,
headings,
estimated_tokens,
content_hash,
captured_at,
llm_context
}
Notes
- llm_context is already a string, so you can use row.llm_context directly.
- headings is a plain string array.
- estimated_tokens is a rough estimate, not tokenizer-exact.
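The column mapping can be pictured as a simple projection from the structured page onto the row shape above. This is a hypothetical sketch, not the package's implementation; field names follow the documented schemas:

```javascript
// Hypothetical sketch of the mapping performed by toMinimalLlmContextColumns.
// page follows the documented structured shape; llmContext is the
// prompt-ready string (in the real package it is produced internally).
function buildMinimalRow(page, llmContext) {
  const { source, document: doc, stats } = page;
  return {
    url: source.final_url,
    title: doc.title,
    summary: doc.description,
    content_markdown: doc.markdown,
    content_text: doc.text,
    headings: doc.headings, // plain string array
    estimated_tokens: stats.estimated_tokens,
    content_hash: source.content_hash,
    captured_at: source.captured_at,
    llm_context: llmContext, // already a string, usable directly
  };
}

// Example with a minimal fake page object:
const fakePage = {
  source: {
    final_url: "https://example.com/",
    content_hash: "abc123",
    captured_at: "2026-03-21T09:31:25.956Z",
  },
  document: {
    title: "Example Domain",
    description: "Illustrative example.",
    markdown: "Example body text.",
    text: "Example body text.",
    headings: [],
  },
  stats: { estimated_tokens: 38 },
};
const row = buildMinimalRow(fakePage, "Title: Example Domain");
console.log(row.url);
```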
Best use case
Use this when you want:
- one JSON row per page
- a CSV/SQLite/Postgres-friendly shape
- both structured columns and a ready-to-use prompt string
saveLlmReadyPage(page, options?)
Saves a previously extracted page to disk.
Example
import {
extractLlmReadyPage,
saveLlmReadyPage,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
const saved = await saveLlmReadyPage(page, {
outputDir: "./outputs"
});
console.log(saved);
Supported options
outputDir?: string - Default: outputs/easy-llm-ready.
fileSlug?: string - Override the filename stem.
Returns
{
url,
jsonPath,
mdPath,
title,
estimatedTokens
}
Usage patterns
1) Just get a prompt-ready string
import { extractLlmReadyPage, toLlmContextString } from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
const promptContext = toLlmContextString(page);
2) Build a RAG row
import {
extractLlmReadyPage,
toMinimalLlmContextColumns,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
const row = toMinimalLlmContextColumns(page);
const ragDoc = {
id: row.content_hash,
url: row.url,
title: row.title,
content: row.content_markdown,
headings: row.headings,
estimatedTokens: row.estimated_tokens,
};
3) Save files after custom post-processing
import {
extractLlmReadyPage,
saveLlmReadyPage,
} from "llm-page-context";
const page = await extractLlmReadyPage("https://example.com");
page.document.markdown = page.document.markdown.replace(/\n{3,}/g, "\n\n");
await saveLlmReadyPage(page, {
outputDir: "./clean-pages",
fileSlug: "example-page",
});
CLI
The package also exposes a CLI:
npx llm-page-context https://example.com
Print a quick summary
npx llm-page-context https://example.com
Print the minimal JSON row
npx llm-page-context https://example.com --minimal
Print the prompt-ready string only
npx llm-page-context https://example.com --context
Save JSON + Markdown files
npx llm-page-context https://example.com --save
Local dev examples
pnpm run extract -- https://example.com
pnpm run extract -- https://example.com --minimal
pnpm run extract -- https://example.com --context
pnpm run extract -- https://example.com --save
Schema details
source
Describes where the content came from.
- requested_url
- final_url
- captured_at
- content_hash
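A content hash such as content_hash is typically derived from the extracted content so that identical pages map to the same identifier. The exact input and algorithm used by this package are not documented here, so the following is an assumption (SHA-256 over the markdown), shown only to illustrate the idea:

```javascript
import { createHash } from "node:crypto";

// Assumption: hash the extracted markdown so repeated fetches of
// unchanged content produce the same id. The package's actual hash
// input and algorithm may differ.
function contentHash(markdown) {
  return createHash("sha256").update(markdown, "utf8").digest("hex");
}

console.log(contentHash("Example body text."));
```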
document
The extracted content.
- title
- description
- markdown
- text
- html
- metadata
- headings
- outgoing_links
stats
Useful operational metadata.
- markdown_chars
- text_chars
- heading_count
- link_count
- estimated_tokens
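The docs note that estimated_tokens is a rough estimate, not tokenizer-exact. A common heuristic (an assumption, not necessarily what this package uses) is about four characters per token for English text:

```javascript
// Rough token estimate: ~4 characters per token for English text.
// This is a common rule of thumb, not the package's actual formula:
// the example page reports text_chars 141 and estimated_tokens 38,
// while this heuristic gives 36, so the real formula differs slightly.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens("a".repeat(141))); // → 36
```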
Limitations
- Some pages still include navigation or boilerplate noise.
- Heavily authenticated or anti-bot protected sites may fail.
- Token count is approximate.
- Different sites may need different browser timing settings.
Attribution
This project is inspired by the open-source extraction pipeline in AnyCrawl.
- AnyCrawl: https://github.com/any4ai/AnyCrawl
See THIRD_PARTY_NOTICES.md for attribution and license details.
