page-content-for-ai
v1.0.0
Extract web page content in a format optimized for AI/LLM consumption with semantic information about forms, buttons, tables, and interactive elements
Extract web page content in a format optimized for AI/LLM consumption. Converts HTML to semantic markdown with enhanced information about forms, buttons, tables, and interactive elements.
Features
✨ Semantic Extraction
- Preserves form inputs with their current values and states
- Captures button states (expanded/collapsed/disabled)
- Converts HTML and ARIA tables to markdown tables
- Identifies semantic sections (header, nav, main, footer)
🎯 AI-Optimized
- Clean markdown output perfect for LLM context
- Captures data-testid and component metadata
- Includes page metadata (title, URL, language, viewport)
- Tracks active input and scroll position
🚀 Modern Web Support
- Handles ARIA roles (role="table", role="button", etc.)
- Supports React/Vue/modern framework patterns
- Works in both browser and Node.js environments
- TypeScript support with full type definitions
Installation
npm install page-content-for-ai
Usage
Browser Environment
import { extractPageContent } from 'page-content-for-ai';
// Extract current page content
const content = extractPageContent(document.body, document, window);
console.log(content.title); // "Example Page"
console.log(content.url); // "https://example.com"
console.log(content.language); // "en"
console.log(content.scrollPosition); // "25%"
console.log(content.content); // Markdown representation
// Send to AI
const prompt = `Based on this page content, help the user:\n\n${content.content}`;
With Options
import { extractPageContent } from 'page-content-for-ai';
const content = extractPageContent(document.body, document, window, {
  includeFormData: true, // Capture input values and states (default: true)
  includeTables: true,   // Convert tables to markdown (default: true)
  includeMetadata: true, // Add data-testid attributes (default: true)
});
Browser Extension Example
// In your content script
import { extractPageContent } from 'page-content-for-ai';
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'extractContent') {
    const content = extractPageContent(document.body, document, window);
    sendResponse(content);
  }
});
Custom Turndown Service
import { extractPageContent } from 'page-content-for-ai';
import TurndownService from 'turndown';
// Customize markdown conversion
const customTurndown = new TurndownService({
  headingStyle: 'setext',
  hr: '---',
});
const content = extractPageContent(document.body, document, window, {
  turndownService: customTurndown,
});
Output Format
The extracted content includes:
interface PageContent {
  title: string;          // Page title
  url: string;            // Current URL
  description: string;    // Meta description
  viewport: string;       // Viewport size (e.g., "1920x1080")
  language: string;       // Page language
  scrollPosition: string; // Scroll percentage
  activeInput?: string;   // Currently focused input (if any)
  content: string;        // Markdown content
}
Example Markdown Output
--- HEADER ---
[Homepage](/)
[BUTTON: Menu | collapsed]
--- END HEADER ---
--- MAIN ---
# Welcome to Example Page
[INPUT: Email address | type: email | required]
[INPUT: Password | type: password]
[BUTTON: Sign In]
**Table: User Data**
| Name | Status | Actions |
| --- | --- | --- |
| John Doe | Active | Edit |
| Jane Smith | Pending | Edit |
--- END MAIN ---
--- FOOTER ---
[Privacy Policy](/privacy)
© 2025 Example Inc.
--- END FOOTER ---
Features in Detail
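Before the per-feature details, a consumer-side sketch may help show how the section markers in the example output can be used. It is not part of the package: splitSections and the Section type are hypothetical names, and the "--- NAME ---" / "--- END NAME ---" marker format is assumed from the examples in this README rather than from a documented grammar.

```typescript
// Sketch only: marker format inferred from the README examples.
interface Section {
  kind: string;              // e.g. "HEADER", "MAIN", "NAVIGATION"
  label: string | undefined; // e.g. "Main Menu" from "--- NAVIGATION (Main Menu) ---"
  body: string;              // text between the opening and closing markers
}

// Matches "--- MAIN ---", "--- END MAIN ---" and "--- NAVIGATION (Main Menu) ---".
const SECTION_MARKER = /^--- (END )?([A-Z]+)(?: \((.+)\))? ---$/;

// Hypothetical helper: splits the markdown into top-level sections.
// Markers nested inside an open section are kept as plain body lines.
function splitSections(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: { kind: string; label: string | undefined; lines: string[] } | null = null;

  for (const rawLine of markdown.split('\n')) {
    const match = SECTION_MARKER.exec(rawLine.trim());
    const isEnd = match !== null && match[1] !== undefined;
    const kind = match?.[2];

    if (match && kind && !isEnd && current === null) {
      // Opening marker: start collecting lines for a new section.
      current = { kind, label: match[3], lines: [] };
    } else if (match && isEnd && current !== null && kind === current.kind) {
      // Matching closing marker: finish the current section.
      sections.push({ kind: current.kind, label: current.label, body: current.lines.join('\n').trim() });
      current = null;
    } else if (current !== null) {
      current.lines.push(rawLine);
    }
  }
  return sections;
}

const sampleOutput = [
  '--- HEADER ---',
  '[Homepage](/)',
  '[BUTTON: Menu | collapsed]',
  '--- END HEADER ---',
  '--- MAIN ---',
  '# Welcome to Example Page',
  '--- END MAIN ---',
].join('\n');

const mainSection = splitSections(sampleOutput).find((s) => s.kind === 'MAIN');
console.log(mainSection?.body); // prints: # Welcome to Example Page
```

Passing only the MAIN section to a model is one way to trim navigation and footer text from the context. Text outside any marker pair is dropped by this sketch.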
Form Extraction
Captures form inputs with their current state:
[INPUT: Search query | type: search | value: "example"]
[INPUT: Email | type: email | required]
[x] Remember me
[ ] Send notifications
[SELECT: Country | selected: "United States"]
Button States
Enhanced button information:
[BUTTON: Submit]
[BUTTON: Menu | collapsed]
[BUTTON: Save | disabled]
[BUTTON: Language Selector | role: combobox]
Table Support
Both HTML and ARIA tables:
**Table: Monthly Sales**
| Month | Revenue | Growth |
| --- | --- | --- |
| January | $50,000 | 5% |
| February | $55,000 | 10% |
Semantic Sections
Clear section boundaries:
--- NAVIGATION (Main Menu) ---
[Home](/) [About](/about) [Contact](/contact)
--- END NAVIGATION ---
--- MAIN ---
Page content here...
--- END MAIN ---
Use Cases
- 🤖 AI Chatbots - Provide page context to conversational AI
- 🔧 Browser Extensions - Extract content for AI-powered tools
- 📊 Web Scraping - Get clean, structured content for analysis
- 🧪 Testing - Verify page content in a readable format
- 📱 Mobile Apps - Parse web content for in-app AI features
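For the testing use case, one possible approach is to parse the bracketed element lines back into objects and assert on them. The sketch below assumes the "[KIND: label | key: value | flag]" notation shown in the examples above. parseElementLine and ElementLine are hypothetical names, not exports of this package, and a label or value that itself contains " | " would be split incorrectly.

```typescript
// Sketch only: the bracket notation is inferred from the README examples.
interface ElementLine {
  kind: string;                         // "INPUT", "BUTTON", "SELECT", ...
  label: string;                        // e.g. "Email address"
  attrs: Record<string, string | true>; // "type: email" -> "email", bare "required" -> true
}

// Hypothetical helper: returns null for lines that are not element lines
// (markdown links, checkboxes, headings, plain text).
function parseElementLine(line: string): ElementLine | null {
  const match = /^\[([A-Z]+): (.*)\]$/.exec(line.trim());
  if (!match) return null;

  const kind = match[1] ?? '';
  const [label = '', ...rest] = (match[2] ?? '').split(' | ');
  const attrs: Record<string, string | true> = {};

  for (const part of rest) {
    const sep = part.indexOf(': ');
    if (sep === -1) {
      attrs[part.trim()] = true; // bare flag such as "required" or "disabled"
    } else {
      const value = part.slice(sep + 2).trim();
      attrs[part.slice(0, sep).trim()] = value.replace(/^"(.*)"$/, '$1'); // drop surrounding quotes
    }
  }
  return { kind, label: label.trim(), attrs };
}

const pageMarkdown = [
  '[Homepage](/)',
  '[INPUT: Email address | type: email | required]',
  '[INPUT: Search query | type: search | value: "example"]',
  '[BUTTON: Save | disabled]',
].join('\n');

const parsedElements = pageMarkdown
  .split('\n')
  .map((line) => parseElementLine(line))
  .filter((el): el is ElementLine => el !== null);

const saveButton = parsedElements.find((el) => el.kind === 'BUTTON' && el.label === 'Save');
console.log(saveButton?.attrs['disabled']); // true
```

A test could then check, for example, that the Save button stays disabled until the required inputs have values.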
Comparison with Alternatives
| Feature | page-content-for-ai | Mozilla Readability | Turndown | Cheerio |
|---------|---------------------|---------------------|----------|---------|
| Form State | ✅ | ❌ | ❌ | ❌ |
| Button States | ✅ | ❌ | ❌ | ❌ |
| ARIA Tables | ✅ | ❌ | ❌ | ❌ |
| Semantic Sections | ✅ | ✅ | ❌ | ❌ |
| Metadata | ✅ | Limited | ❌ | ❌ |
| AI-Optimized | ✅ | Partial | ❌ | ❌ |
| TypeScript | ✅ | ✅ | ✅ | ✅ |
Browser Compatibility
Works in all modern browsers:
- Chrome/Edge 90+
- Firefox 88+
- Safari 14+
Node.js Usage
For server-side usage with JSDOM:
import { JSDOM } from 'jsdom';
import { extractPageContent } from 'page-content-for-ai';
const html = '<html>...</html>';
const dom = new JSDOM(html);
const content = extractPageContent(
  dom.window.document.body,
  dom.window.document,
  dom.window as any
);
API Reference
extractPageContent(body, document, window, options?)
Extract page content as a structured object.
Parameters:
- body: HTMLElement - The HTML element to extract (usually document.body)
- document: Document - The document object
- window: Window - The window object
- options?: PageContentOptions - Configuration options
Returns: PageContent
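As a rough illustration of consuming the returned object, the sketch below assembles an LLM prompt from a PageContent value. The interface is a local copy of the shape listed under Output Format so the snippet compiles on its own. buildPrompt and its maxChars parameter are hypothetical and not part of this package, and cutting by character count is only a crude stand-in for a real token budget.

```typescript
// Local copy of the PageContent shape from "Output Format".
interface PageContent {
  title: string;
  url: string;
  description: string;
  viewport: string;
  language: string;
  scrollPosition: string;
  activeInput?: string;
  content: string;
}

// Hypothetical helper: puts page metadata first, then the (possibly truncated)
// markdown, then the user's question.
function buildPrompt(page: PageContent, question: string, maxChars = 8000): string {
  const header = [
    `Title: ${page.title}`,
    `URL: ${page.url}`,
    `Language: ${page.language}`,
    `Scroll position: ${page.scrollPosition}`,
    // Only mention the focused input when one was reported.
    ...(page.activeInput ? [`Focused input: ${page.activeInput}`] : []),
  ].join('\n');

  const body =
    page.content.length > maxChars
      ? page.content.slice(0, maxChars) + '\n[content truncated]'
      : page.content;

  return `${header}\n\n${body}\n\nUser question: ${question}`;
}

// Example with a hand-written object; in a browser this would be the value
// returned by extractPageContent(document.body, document, window).
const examplePage: PageContent = {
  title: 'Example Page',
  url: 'https://example.com',
  description: '',
  viewport: '1920x1080',
  language: 'en',
  scrollPosition: '25%',
  content: '--- MAIN ---\n[BUTTON: Sign In]\n--- END MAIN ---',
};

console.log(buildPrompt(examplePage, 'How do I sign in?'));
```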
extractPageContentAsToml(body, document, window, options?)
Legacy TOML format output (deprecated).
Returns: string - TOML formatted content
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT © Trung Kien Dang
Acknowledgments
- Built on top of Turndown for HTML to Markdown conversion
- Inspired by Mozilla Readability for clean content extraction
- Designed for modern AI/LLM applications
