page-content-for-ai

v1.0.0

Published

5 months ago

Extract web page content in a format optimized for AI/LLM consumption with semantic information about forms, buttons, tables, and interactive elements


dtkien

ai llm html-to-markdown web-scraping content-extraction semantic-html turndown page-content browser-extension ai-agent web-automation

page-content-for-ai

Extract web page content in a format optimized for AI/LLM consumption. Converts HTML to semantic markdown with enhanced information about forms, buttons, tables, and interactive elements.

Features

✨ Semantic Extraction

Preserves form inputs with their current values and states
Captures button states (expanded/collapsed/disabled)
Converts HTML and ARIA tables to markdown tables
Identifies semantic sections (header, nav, main, footer)

🎯 AI-Optimized

Clean markdown output suited to LLM context windows
Captures data-testid and component metadata
Includes page metadata (title, URL, language, viewport)
Tracks active input and scroll position

🚀 Modern Web Support

Handles ARIA roles (role="table", role="button", etc.)
Supports React/Vue/modern framework patterns
Works in both browser and Node.js environments
TypeScript support with full type definitions

Installation

npm install page-content-for-ai

Usage

Browser Environment

import { extractPageContent } from 'page-content-for-ai';

// Extract current page content
const content = extractPageContent(document.body, document, window);

console.log(content.title);          // "Example Page"
console.log(content.url);            // "https://example.com"
console.log(content.language);       // "en"
console.log(content.scrollPosition); // "25%"
console.log(content.content);        // Markdown representation

// Send to AI
const prompt = `Based on this page content, help the user:\n\n${content.content}`;

With Options

import { extractPageContent } from 'page-content-for-ai';

const content = extractPageContent(document.body, document, window, {
  includeFormData: true,   // Capture input values and states (default: true)
  includeTables: true,     // Convert tables to markdown (default: true)
  includeMetadata: true,   // Add data-testid attributes (default: true)
});

Browser Extension Example

// In your content script
import { extractPageContent } from 'page-content-for-ai';

chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'extractContent') {
    const content = extractPageContent(document.body, document, window);
    sendResponse(content);
  }
});
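
On the receiving side (popup or background script), the response arrives as untyped message data, so it can help to validate it before use. The following is a minimal sketch: `ExtractedPage` and `isExtractedPage` are hypothetical names, not part of the package, and the fields checked simply mirror the `PageContent` shape listed under Output Format.

```typescript
// Local mirror of the documented PageContent shape (assumption: all fields
// except activeInput are always present as strings).
interface ExtractedPage {
  title: string;
  url: string;
  description: string;
  viewport: string;
  language: string;
  scrollPosition: string;
  activeInput?: string;
  content: string;
}

// Hypothetical type guard: returns true only when `value` has every
// required string field, and activeInput is either absent or a string.
function isExtractedPage(value: unknown): value is ExtractedPage {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const record = value as Record<string, unknown>;
  const required = [
    "title",
    "url",
    "description",
    "viewport",
    "language",
    "scrollPosition",
    "content",
  ];
  for (const key of required) {
    if (typeof record[key] !== "string") {
      return false;
    }
  }
  const active = record["activeInput"];
  return active === undefined || typeof active === "string";
}
```

In a Manifest V3 extension, the value to check would typically be the result of `chrome.tabs.sendMessage(tabId, { action: 'extractContent' })`.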

Custom Turndown Service

import { extractPageContent } from 'page-content-for-ai';
import TurndownService from 'turndown';

// Customize markdown conversion
const customTurndown = new TurndownService({
  headingStyle: 'setext',
  hr: '---',
});

const content = extractPageContent(document.body, document, window, {
  turndownService: customTurndown,
});

Output Format

The extracted content includes:

interface PageContent {
  title: string;           // Page title
  url: string;             // Current URL
  description: string;     // Meta description
  viewport: string;        // Viewport size (e.g., "1920x1080")
  language: string;        // Page language
  scrollPosition: string;  // Scroll percentage
  activeInput?: string;    // Currently focused input (if any)
  content: string;         // Markdown content
}
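
As one way to use these fields together, the sketch below builds an LLM prompt that places the page metadata ahead of the markdown body. `buildPrompt` is a hypothetical helper, not part of the package, and the interface is repeated locally only so the example is self-contained.

```typescript
// Local copy of the PageContent shape documented above.
interface PageContent {
  title: string;
  url: string;
  description: string;
  viewport: string;
  language: string;
  scrollPosition: string;
  activeInput?: string;
  content: string;
}

// Hypothetical helper: combines metadata, markdown content, and the
// user's request into a single prompt string.
function buildPrompt(page: PageContent, userRequest: string): string {
  const meta = [
    `Title: ${page.title}`,
    `URL: ${page.url}`,
    `Language: ${page.language}`,
    `Viewport: ${page.viewport}`,
    `Scroll position: ${page.scrollPosition}`,
  ];
  // activeInput is optional, so it is only mentioned when present.
  if (page.activeInput) {
    meta.push(`Focused input: ${page.activeInput}`);
  }
  return [
    "You are helping a user with the web page described below.",
    meta.join("\n"),
    "Page content:",
    page.content,
    `User request: ${userRequest}`,
  ].join("\n\n");
}
```

Keeping the metadata in a short block of its own lets the model tell page state (scroll position, focused input) apart from the page text.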

Example Markdown Output

--- HEADER ---
[Homepage](/)
[BUTTON: Menu | collapsed]
--- END HEADER ---

--- MAIN ---
# Welcome to Example Page

[INPUT: Email address | type: email | required]
[INPUT: Password | type: password]
[BUTTON: Sign In]

**Table: User Data**
| Name | Status | Actions |
| --- | --- | --- |
| John Doe | Active | Edit |
| Jane Smith | Pending | Edit |
--- END MAIN ---

--- FOOTER ---
[Privacy Policy](/privacy)
© 2025 Example Inc.
--- END FOOTER ---

Features in Detail

Form Extraction

Captures form inputs with their current state:

[INPUT: Search query | type: search | value: "example"]
[INPUT: Email | type: email | required]
[x] Remember me
[ ] Send notifications
[SELECT: Country | selected: "United States"]
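
If downstream code needs these lines as structured data, they can be parsed back out of the markdown. The sketch below assumes the bracketed `[KIND: label | key: value | flag]` layout seen in the examples; that layout is inferred from this README, not from a documented grammar, and `parseElementLine` is a hypothetical helper.

```typescript
interface ParsedElement {
  kind: string;                       // e.g. "INPUT", "SELECT", "BUTTON"
  label: string;                      // text before the first "|"
  attributes: Record<string, string>; // "type: email" becomes { type: "email" }
  flags: string[];                    // bare tokens such as "required"
}

// Parses one bracketed element line; returns null for anything else
// (checkbox lines, links, plain text).
function parseElementLine(line: string): ParsedElement | null {
  const match = /^\[([A-Z]+):\s*(.*)\]$/.exec(line.trim());
  if (!match) {
    return null;
  }
  const parts = match[2].split("|").map((part) => part.trim());
  const element: ParsedElement = {
    kind: match[1],
    label: parts[0],
    attributes: {},
    flags: [],
  };
  for (const part of parts.slice(1)) {
    const colon = part.indexOf(":");
    if (colon === -1) {
      // No colon: treat the token as a boolean flag.
      if (part !== "") {
        element.flags.push(part);
      }
    } else {
      const key = part.slice(0, colon).trim();
      const raw = part.slice(colon + 1).trim();
      // Strip one pair of surrounding double quotes, if present.
      element.attributes[key] = raw.replace(/^"(.*)"$/, "$1");
    }
  }
  return element;
}
```

This simple split does not handle labels or values that themselves contain a `|` character.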

Button States

Enhanced button information:

[BUTTON: Submit]
[BUTTON: Menu | collapsed]
[BUTTON: Save | disabled]
[BUTTON: Language Selector | role: combobox]

Table Support

Both HTML and ARIA tables:

**Table: Monthly Sales**
| Month | Revenue | Growth |
| --- | --- | --- |
| January | $50,000 | 5% |
| February | $55,000 | 10% |
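
Because the tables are emitted as ordinary pipe-delimited markdown, they can be read back into row objects when needed. This is a minimal sketch with a hypothetical helper, `parseMarkdownTable`; it assumes a header row, one separator row, and no escaped `|` characters inside cells.

```typescript
// Converts the lines of a markdown table into one object per data row,
// keyed by the header cells. Lines not starting with "|" are ignored.
function parseMarkdownTable(lines: string[]): Array<Record<string, string>> {
  const splitRow = (row: string): string[] =>
    row
      .trim()
      .replace(/^\|/, "")
      .replace(/\|$/, "")
      .split("|")
      .map((cell) => cell.trim());

  const tableLines = lines.filter((line) => line.trim().charAt(0) === "|");
  if (tableLines.length < 2) {
    return [];
  }
  const headers = splitRow(tableLines[0]);
  // tableLines[1] is the "| --- | --- |" separator, so data starts at index 2.
  return tableLines.slice(2).map((row) => {
    const cells = splitRow(row);
    const record: Record<string, string> = {};
    headers.forEach((header, i) => {
      record[header] = cells[i] !== undefined ? cells[i] : "";
    });
    return record;
  });
}
```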

Semantic Sections

Clear section boundaries:

--- NAVIGATION (Main Menu) ---
[Home](/) [About](/about) [Contact](/contact)
--- END NAVIGATION ---

--- MAIN ---
Page content here...
--- END MAIN ---
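
These markers make it straightforward to split the output by section, for example to send only the main content to a model. The sketch below assumes the `--- NAME (Label) ---` / `--- END NAME ---` format shown in the examples; `splitSections` and `PageSection` are hypothetical names, not part of the package.

```typescript
interface PageSection {
  name: string;   // e.g. "NAVIGATION"
  label?: string; // e.g. "Main Menu", when the opening marker has one
  body: string;   // text between the opening and closing markers
}

// Splits markdown into top-level sections. Text outside any section and
// sections without a closing marker are dropped; nested markers stay in
// the body of the enclosing section.
function splitSections(markdown: string): PageSection[] {
  const sections: PageSection[] = [];
  let current: { name: string; label?: string; lines: string[] } | null = null;

  for (const line of markdown.split("\n")) {
    const trimmed = line.trim();
    const end = /^--- END ([A-Z]+) ---$/.exec(trimmed);
    const start = /^--- ([A-Z]+)(?: \((.+)\))? ---$/.exec(trimmed);

    if (current && end && end[1] === current.name) {
      sections.push({
        name: current.name,
        label: current.label,
        body: current.lines.join("\n").trim(),
      });
      current = null;
    } else if (!current && start) {
      current = { name: start[1], label: start[2], lines: [] };
    } else if (current) {
      current.lines.push(line);
    }
  }
  return sections;
}
```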

Use Cases

🤖 AI Chatbots - Provide page context to conversational AI
🔧 Browser Extensions - Extract content for AI-powered tools
📊 Web Scraping - Get clean, structured content for analysis
🧪 Testing - Verify page content in a readable format
📱 Mobile Apps - Parse web content for in-app AI features

Comparison with Alternatives

| Feature | page-content-for-ai | Mozilla Readability | Turndown | Cheerio |
|---------|---------------------|---------------------|----------|---------|
| Form State | ✅ | ❌ | ❌ | ❌ |
| Button States | ✅ | ❌ | ❌ | ❌ |
| ARIA Tables | ✅ | ❌ | ❌ | ❌ |
| Semantic Sections | ✅ | ✅ | ❌ | ❌ |
| Metadata | ✅ | Limited | ❌ | ❌ |
| AI-Optimized | ✅ | Partial | ❌ | ❌ |
| TypeScript | ✅ | ✅ | ✅ | ✅ |

Browser Compatibility

Works in all modern browsers:

Chrome/Edge 90+
Firefox 88+
Safari 14+

Node.js Usage

For server-side usage with JSDOM:

import { JSDOM } from 'jsdom';
import { extractPageContent } from 'page-content-for-ai';

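// Placeholder markup; replace with the HTML you want to process
const html = '<html>...</html>';
const dom = new JSDOM(html);

const content = extractPageContent(
  dom.window.document.body,
  dom.window.document,
  // JSDOM's window type differs from the global Window type, so a cast is needed
  dom.window as unknown as Window
);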

API Reference

`extractPageContent(body, document, window, options?)`

Extract page content as a structured object.

Parameters:

body: HTMLElement - The HTML element to extract (usually document.body)
document: Document - The document object
window: Window - The window object
options?: PageContentOptions - Configuration options

Returns: PageContent

`extractPageContentAsToml(body, document, window, options?)`

Legacy TOML format output (deprecated).

Returns: string - TOML formatted content

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

Acknowledgments

Built on top of Turndown for HTML to Markdown conversion
Inspired by Mozilla Readability for clean content extraction
Designed for modern AI/LLM applications
