
llm-page-context v0.1.0

Turn any web page into clean LLM-ready context strings and structured documents.

llm-page-context

Turn any web page into:

  • a structured in-memory document
  • a minimal LLM-context row
  • a ready-to-dump llm_context string

This package is designed for the common workflow:

  1. fetch a page in a real browser
  2. extract the main readable content
  3. convert it into Markdown/text
  4. return a shape that is easy to feed into an LLM or RAG pipeline

Install

npm

npm install llm-page-context

pnpm

pnpm add llm-page-context

Requirements

  • Node.js 22+

The package uses Puppeteer, so the first install may download a browser binary.

If needed, install Chrome for Testing manually:

npx puppeteer browsers install chrome

Quick start

import {
  extractLlmReadyPage,
  toLlmContextString,
} from "llm-page-context";

const page = await extractLlmReadyPage("https://example.com");
const llmContext = toLlmContextString(page);

console.log(llmContext);

What you get back

extractLlmReadyPage(url) returns a structured in-memory object like:

{
  schema_version: 1,
  source: {
    requested_url: "https://example.com/",
    final_url: "https://example.com/",
    captured_at: "2026-03-21T09:31:25.956Z",
    content_hash: "..."
  },
  document: {
    title: "Example Domain",
    description: "This domain is for use in documentation examples...",
    markdown: "...",
    text: "...",
    html: "...",
    metadata: {
      description: "..."
    },
    headings: [],
    outgoing_links: [
      "https://iana.org/domains/example"
    ]
  },
  stats: {
    markdown_chars: 145,
    text_chars: 141,
    heading_count: 0,
    link_count: 1,
    estimated_tokens: 38
  }
}

API

extractLlmReadyPage(url, options?)

Extracts one page and returns the full structured result in memory.

Example

import { extractLlmReadyPage } from "llm-page-context";

const page = await extractLlmReadyPage("https://react.dev/reference/react/useMemo");

console.log(page.document.title);
console.log(page.document.markdown);

Parameters

  • url: string

    • The page to extract.
  • options?: object

    • Optional extraction settings.

Supported options

  • headless?: boolean

    • Default: true
    • Run the browser headless or visible.
  • onlyMainContent?: boolean

    • Default: true
    • Uses Readability to prefer the main readable content instead of the whole page.
  • timeout?: number

    • Default: 90000
    • Navigation timeout in milliseconds.
  • bodyTimeout?: number

    • Default: 15000
    • Timeout in milliseconds while waiting for the page body to appear.
  • waitUntil?: "load" | "domcontentloaded" | "networkidle0" | "networkidle2"

    • Default: "networkidle2"
    • Puppeteer navigation wait mode.
  • userAgent?: string

    • Optional custom user agent.
  • browserArgs?: string[]

    • Optional Chromium launch args.
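As a sketch, the options above can be combined in a single call. The values below are illustrative, not recommendations, and the user agent string is hypothetical:

```javascript
import { extractLlmReadyPage } from "llm-page-context";

// Hypothetical tuning for a slow, script-heavy page.
const page = await extractLlmReadyPage("https://example.com", {
  headless: true,              // default; set to false to watch the browser
  onlyMainContent: true,       // default; Readability-based extraction
  waitUntil: "networkidle2",   // default navigation wait mode
  timeout: 120_000,            // allow more than the 90s navigation default
  userAgent: "my-crawler/1.0", // hypothetical custom user agent
});
```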

Returns

Returns the full structured page object.


toLlmContextString(page, options?)

Converts the structured page into a single string that can be dumped directly into an LLM prompt.

Example

import {
  extractLlmReadyPage,
  toLlmContextString,
} from "llm-page-context";

const page = await extractLlmReadyPage("https://docs.stripe.com/payments/checkout");
const context = toLlmContextString(page);

console.log(context);

Output shape

It produces a string in this format:

Title: ...

URL: ...

Summary: ...

Headings:
- ...
- ...

Content:
...

Supported options

  • maxHeadings?: number
    • Default: 20
    • Limit how many headings are included in the string.
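A sketch of trimming the heading list, assuming the default extraction call shown earlier:

```javascript
import { extractLlmReadyPage, toLlmContextString } from "llm-page-context";

const page = await extractLlmReadyPage("https://example.com");

// Keep at most 5 headings in the prompt-ready string.
const context = toLlmContextString(page, { maxHeadings: 5 });
```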

Best use case

Use this when you want:

  • one prompt-ready string
  • no extra transformation step
  • a simple context blob for ChatGPT / Claude / OpenAI API / local models

toMinimalLlmContextColumns(page)

Converts the structured page into a compact row-like object.

Example

import {
  extractLlmReadyPage,
  toMinimalLlmContextColumns,
} from "llm-page-context";

const page = await extractLlmReadyPage("https://example.com");
const row = toMinimalLlmContextColumns(page);

console.log(row);

Returned columns

{
  url,
  title,
  summary,
  content_markdown,
  content_text,
  headings,
  estimated_tokens,
  content_hash,
  captured_at,
  llm_context
}

Notes

  • llm_context is already a string, so you can use row.llm_context directly.
  • headings is a plain string array.
  • estimated_tokens is a rough estimate, not tokenizer-exact.
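The package does not document its exact estimator. A common heuristic for rough counts like this is about four characters per token for English text; the sketch below is hypothetical, not the package's actual formula:

```javascript
// Rough token estimate: ~4 characters per token for English text.
// Hypothetical sketch, not the package's actual estimator.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens("Example Domain")); // 14 chars -> 4
```

When token budgets matter, use a real tokenizer instead of a heuristic.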

Best use case

Use this when you want:

  • one JSON row per page
  • a CSV/SQLite/Postgres-friendly shape
  • both structured columns and a ready-to-use prompt string

saveLlmReadyPage(page, options?)

Saves a previously extracted page to disk.

Example

import {
  extractLlmReadyPage,
  saveLlmReadyPage,
} from "llm-page-context";

const page = await extractLlmReadyPage("https://example.com");
const saved = await saveLlmReadyPage(page, {
  outputDir: "./outputs"
});

console.log(saved);

Supported options

  • outputDir?: string

    • Default: outputs/easy-llm-ready
  • fileSlug?: string

    • Override the filename stem.

Returns

{
  url,
  jsonPath,
  mdPath,
  title,
  estimatedTokens
}

Usage patterns

1) Just get a prompt-ready string

import { extractLlmReadyPage, toLlmContextString } from "llm-page-context";

const page = await extractLlmReadyPage("https://example.com");
const promptContext = toLlmContextString(page);

2) Build a RAG row

import {
  extractLlmReadyPage,
  toMinimalLlmContextColumns,
} from "llm-page-context";

const page = await extractLlmReadyPage("https://example.com");
const row = toMinimalLlmContextColumns(page);

const ragDoc = {
  id: row.content_hash,
  url: row.url,
  title: row.title,
  content: row.content_markdown,
  headings: row.headings,
  estimatedTokens: row.estimated_tokens,
};

3) Save files after custom post-processing

import {
  extractLlmReadyPage,
  saveLlmReadyPage,
} from "llm-page-context";

const page = await extractLlmReadyPage("https://example.com");

page.document.markdown = page.document.markdown.replace(/\n{3,}/g, "\n\n");

await saveLlmReadyPage(page, {
  outputDir: "./clean-pages",
  fileSlug: "example-page",
});

CLI

The package also exposes a CLI. With no flags, it prints a quick summary:

npx llm-page-context https://example.com

Print the minimal JSON row

npx llm-page-context https://example.com --minimal

Print the prompt-ready string only

npx llm-page-context https://example.com --context

Save JSON + Markdown files

npx llm-page-context https://example.com --save

Local dev examples

pnpm run extract -- https://example.com
pnpm run extract -- https://example.com --minimal
pnpm run extract -- https://example.com --context
pnpm run extract -- https://example.com --save

Schema details

source

Describes where the content came from.

  • requested_url
  • final_url
  • captured_at
  • content_hash

document

The extracted content.

  • title
  • description
  • markdown
  • text
  • html
  • metadata
  • headings
  • outgoing_links

stats

Useful operational metadata.

  • markdown_chars
  • text_chars
  • heading_count
  • link_count
  • estimated_tokens

Limitations

  • Some pages still include navigation or boilerplate noise.
  • Heavily authenticated or anti-bot-protected sites may fail.
  • Token count is approximate.
  • Different sites may need different browser timing settings.

Attribution

This project is inspired by the open-source extraction pipeline in AnyCrawl.

  • AnyCrawl: https://github.com/any4ai/AnyCrawl

See THIRD_PARTY_NOTICES.md for attribution and license details.