@mcp-b/smart-dom-reader

v3.0.0

Published

19 days ago

Token-efficient DOM extraction for AI agents - Extract interactive elements, semantic structure, and stable CSS selectors for LLM-powered browser automation



@mcp-b/smart-dom-reader extracts DOM structure optimized for AI/LLM consumption. Get stable CSS selectors, interactive elements, and semantic page structure while minimizing token usage. Perfect for AI-powered browser automation, userscript generation, and web scraping with Claude, ChatGPT, or any LLM.

Why Use @mcp-b/smart-dom-reader?

| Feature | Benefit |
| --- | --- |
| Token-Efficient | Progressive extraction minimizes LLM context window usage |
| Stable Selectors | Ranked CSS selectors (ID > data-testid > ARIA > classes) for reliable automation |
| AI-Optimized Output | Structured data designed for LLM understanding |
| Zero Dependencies | Lightweight, runs in any browser environment |
| Shadow DOM Support | Traverses shadow roots and iframes |
| Stateless API | Works with any document context - Puppeteer, Playwright, browser extensions |

Use Cases

AI Browser Automation: Generate robust selectors for Puppeteer/Playwright scripts
Userscript Generation: LLMs create browser automation scripts with stable selectors
Web Scraping: Extract structured data with semantic context for AI processing
Test Automation: Generate reliable test selectors with multiple fallback strategies
Accessibility Analysis: Extract ARIA labels, roles, and semantic landmarks

Key Features

Two extraction approaches: Progressive (step-by-step) and Full (single-pass)
Stateless architecture: All functions accept document/element parameters
Multiple selector strategies: CSS, XPath, text-based, data-testid
Smart content detection: Automatically identifies main content areas
Context preservation: Maintains element relationships and semantic context
Shadow DOM & iframe support: Traverses complex DOM structures
Token-efficient: Optimized for LLM context windows

Installation

npm install @mcp-b/smart-dom-reader

Two Extraction Approaches

1. Full Extraction (SmartDOMReader)

When to use: You need all information upfront and have sufficient token budget for processing the complete output. Best for automation, testing, and scenarios where you know exactly what you need.

import { SmartDOMReader } from '@mcp-b/smart-dom-reader';

// Pass document explicitly - no window dependency
const doc = document; // or any Document object

// Interactive mode - extract only interactive elements
const interactiveData = SmartDOMReader.extractInteractive(doc);

// Full mode - extract interactive + semantic elements
const fullData = SmartDOMReader.extractFull(doc);

// Custom options
const customData = SmartDOMReader.extractInteractive(doc, {
  mainContentOnly: true,
  viewportOnly: true,
  includeHidden: false,
});

2. Progressive Extraction (ProgressiveExtractor)

When to use: Working with AI/LLMs where token efficiency is critical. Allows making intelligent decisions at each step rather than extracting everything upfront.

import { ProgressiveExtractor } from '@mcp-b/smart-dom-reader';

// Step 1: Get high-level page structure (minimal tokens)
// Structure can be extracted from the whole document or a specific container element
const structure = ProgressiveExtractor.extractStructure(document);
console.log(structure.summary); // Quick stats about the page
console.log(structure.regions); // Map of page regions
console.log(structure.suggestions); // AI-friendly hints

// Step 2: Extract details from specific region based on structure
const mainContent = ProgressiveExtractor.extractRegion(
  structure.summary.mainContentSelector,
  document,
  { mode: 'interactive' }
);

// Step 3: Extract readable content from a region
const articleText = ProgressiveExtractor.extractContent('article.main-article', document, {
  includeHeadings: true,
  includeLists: true,
});

// Structure scoped to a container (e.g., navigation only)
const nav = document.querySelector('nav');
if (nav) {
  const navOutline = ProgressiveExtractor.extractStructure(nav);
  // navOutline.regions will only include elements within <nav>
}

Extraction Modes

Interactive Mode

Focuses on elements users can interact with:

Buttons and button-like elements
Links
Form inputs (text, select, textarea)
Clickable elements with handlers
Form structures and associations

Full Mode

Includes everything from interactive mode plus:

Semantic HTML elements (articles, sections, nav)
Headings hierarchy
Images with alt text
Tables and lists
Content structure and relationships

API Comparison

Full Extraction API

// Class-based with options
const reader = new SmartDOMReader({
  mode: 'interactive',
  mainContentOnly: true,
  viewportOnly: false,
});
const result = reader.extract(document);

// Static methods for convenience
SmartDOMReader.extractInteractive(document);
SmartDOMReader.extractFull(document);
SmartDOMReader.extractFromElement(element, 'interactive');

Progressive Extraction API

// Step 1: Structure overview (Document or Element)
const overview = ProgressiveExtractor.extractStructure(document);
// Returns: regions, forms, summary, suggestions

// Step 2: Region extraction
const region = ProgressiveExtractor.extractRegion(selector, document, options);
// Returns: Full SmartDOMResult for that region

// Step 3: Content extraction
const content = ProgressiveExtractor.extractContent(selector, document, { includeMedia: true });
// Returns: Text content, headings, lists, tables, media

Output Structure

Both approaches return structured data optimized for AI processing:

interface SmartDOMResult {
  mode: 'interactive' | 'full';
  timestamp: number;

  page: {
    url: string;
    title: string;
    hasErrors: boolean;
    isLoading: boolean;
    hasModals: boolean;
    hasFocus?: string;
  };

  landmarks: {
    navigation: string[];
    main: string[];
    forms: string[];
    headers: string[];
    footers: string[];
    articles: string[];
    sections: string[];
  };

  interactive: {
    buttons: ExtractedElement[];
    links: ExtractedElement[];
    inputs: ExtractedElement[];
    forms: FormInfo[];
    clickable: ExtractedElement[];
  };

  semantic?: {
    // Only in full mode
    headings: ExtractedElement[];
    images: ExtractedElement[];
    tables: ExtractedElement[];
    lists: ExtractedElement[];
    articles: ExtractedElement[];
  };

  metadata?: {
    // Only in full mode
    totalElements: number;
    extractedElements: number;
    mainContent?: string;
    language?: string;
  };
}
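
The `page` block can be used as a quick readiness check before acting on the extracted selectors. The following is a minimal sketch, not part of the library: `automationBlockers` is a hypothetical helper, and the interpretation of each flag (for example, that `hasModals` means something may be covering the page) is an assumption based on the field names alone.

```typescript
// The `page` block of SmartDOMResult, copied from the interface above.
interface PageState {
  url: string;
  title: string;
  hasErrors: boolean;
  isLoading: boolean;
  hasModals: boolean;
  hasFocus?: string;
}

// Hypothetical helper: list reasons an agent might wait or re-extract
// before trusting the selectors in the rest of the result.
function automationBlockers(page: PageState): string[] {
  const blockers: string[] = [];
  if (page.isLoading) blockers.push('page is still loading');
  if (page.hasModals) blockers.push('a modal may be covering the page');
  if (page.hasErrors) blockers.push('page reports errors');
  return blockers;
}

const blockers = automationBlockers({
  url: 'https://example.com/login',
  title: 'Login',
  hasErrors: false,
  isLoading: true,
  hasModals: false,
});
console.log(blockers); // [ 'page is still loading' ]
```

An empty array suggests the page state is settled enough to proceed; otherwise the agent could wait and extract again.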

Element Information

Each extracted element includes comprehensive selector strategies with ranking (stable-first):

interface ExtractedElement {
  tag: string;
  text: string;

  selector: {
    css: string; // Best CSS selector (ranked stable-first)
    xpath: string; // XPath selector
    textBased?: string; // Text-content based hint
    dataTestId?: string; // data-testid if available
    ariaLabel?: string; // ARIA label if available
    candidates?: Array<{
      type:
        | 'id'
        | 'data-testid'
        | 'role-aria'
        | 'name'
        | 'class-path'
        | 'css-path'
        | 'xpath'
        | 'text';
      value: string;
      score: number; // Higher = more stable/robust
    }>;
  };

  attributes: Record<string, string>;

  context: {
    nearestForm?: string;
    nearestSection?: string;
    nearestMain?: string;
    nearestNav?: string;
    parentChain: string[];
  };

  // Compact flags: only present when true to save tokens
  interaction: {
    click?: boolean;
    change?: boolean;
    submit?: boolean;
    nav?: boolean;
    disabled?: boolean;
    hidden?: boolean;
    role?: string; // aria role when present
    form?: string; // associated form selector
  };
}
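
Because `candidates` carries a score per selector, a consumer can choose its own stability threshold instead of always using `selector.css`. The sketch below assumes only the `selector` shape documented above; `pickBestSelector` is a hypothetical helper, not a library export, and it skips `xpath` and `text` candidates on the assumption that the caller needs a value usable with `querySelector`.

```typescript
// Selector shape copied from the ExtractedElement interface (relevant fields only).
interface SelectorCandidate {
  type: string;
  value: string;
  score: number; // Higher = more stable/robust
}

interface SelectorInfo {
  css: string;
  xpath: string;
  candidates?: SelectorCandidate[];
}

// Hypothetical helper: return the highest-scored CSS-compatible candidate at or
// above `minScore`, falling back to the default `css` selector.
function pickBestSelector(selector: SelectorInfo, minScore = 0): string {
  const ranked = [...(selector.candidates ?? [])]
    .filter((c) => c.type !== 'xpath' && c.type !== 'text')
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score);
  return ranked[0]?.value ?? selector.css;
}

const example: SelectorInfo = {
  css: 'form > button.submit-btn',
  xpath: '//form/button[1]',
  candidates: [
    { type: 'class-path', value: '.form-container .submit-btn', score: 50 },
    { type: 'data-testid', value: '[data-testid="submit"]', score: 90 },
  ],
};

console.log(pickBestSelector(example)); // [data-testid="submit"]
console.log(pickBestSelector(example, 95)); // form > button.submit-btn
```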

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| mode | 'interactive' \| 'full' | 'interactive' | Extraction mode |
| maxDepth | number | 5 | Maximum traversal depth |
| includeHidden | boolean | false | Include hidden elements |
| includeShadowDOM | boolean | true | Traverse shadow DOM |
| includeIframes | boolean | false | Traverse iframes |
| viewportOnly | boolean | false | Only visible viewport elements |
| mainContentOnly | boolean | false | Focus on main content area |
| customSelectors | string[] | [] | Additional selectors to extract |
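
To make the defaults concrete, here is a minimal sketch of how partial options could be overlaid on the documented defaults. The option names, types, and default values come from the table above; `ExtractionOptions`, `DEFAULT_OPTIONS`, and `resolveOptions` are hypothetical names used for illustration and may not match the library's internals.

```typescript
// Option names, types, and defaults as listed in the Options table.
interface ExtractionOptions {
  mode: 'interactive' | 'full';
  maxDepth: number;
  includeHidden: boolean;
  includeShadowDOM: boolean;
  includeIframes: boolean;
  viewportOnly: boolean;
  mainContentOnly: boolean;
  customSelectors: string[];
}

const DEFAULT_OPTIONS: ExtractionOptions = {
  mode: 'interactive',
  maxDepth: 5,
  includeHidden: false,
  includeShadowDOM: true,
  includeIframes: false,
  viewportOnly: false,
  mainContentOnly: false,
  customSelectors: [],
};

// Hypothetical helper: overlay caller-supplied options on the documented defaults.
// The array is copied so callers never share a reference with the defaults.
function resolveOptions(overrides: Partial<ExtractionOptions> = {}): ExtractionOptions {
  return {
    ...DEFAULT_OPTIONS,
    ...overrides,
    customSelectors: [...(overrides.customSelectors ?? DEFAULT_OPTIONS.customSelectors)],
  };
}

const resolved = resolveOptions({ viewportOnly: true, customSelectors: ['[data-cy]'] });
console.log(resolved.mode, resolved.viewportOnly, resolved.includeShadowDOM);
// interactive true true
```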

Use Cases

AI Userscript Generation (Progressive Approach)

// First, understand the page structure
const structure = ProgressiveExtractor.extractStructure(document);

// AI decides which region to focus on based on structure
const targetRegion = structure.regions.main?.selector || 'body';

// Extract detailed information from chosen region
const details = ProgressiveExtractor.extractRegion(targetRegion, document, {
  mode: 'interactive',
  viewportOnly: true,
});

// Generate userscript prompt with focused context
const prompt = `
  Page: ${details.page.title}
  Main form: ${details.interactive.forms[0]?.selector}
  Submit button: ${details.interactive.buttons.find((b) => b.text.includes('Submit'))?.selector.css}
  
  Write a userscript to auto-fill and submit this form.
`;

Test Automation (Full Extraction)

// Get all interactive elements at once
const testData = SmartDOMReader.extractInteractive(document, {
  customSelectors: ['[data-test]', '[data-cy]'],
});

// Use multiple selector strategies for robust testing
testData.interactive.buttons.forEach((button) => {
  console.log(`Button: ${button.text}`);
  console.log(`  CSS: ${button.selector.css}`);
  console.log(`  XPath: ${button.selector.xpath}`);
  console.log(`  TestID: ${button.selector.dataTestId}`);
  console.log(`  Ranked candidates:`, button.selector.candidates?.slice(0, 3));
});

Content Analysis (Progressive Approach)

// Get structure first
const structure = ProgressiveExtractor.extractStructure(document);

// Extract readable content from main area
const content = ProgressiveExtractor.extractContent(
  structure.summary.mainContentSelector || 'main',
  document,
  { includeHeadings: true, includeTables: true }
);

console.log(`Word count: ${content.metadata.wordCount}`);
console.log(`Headings: ${content.text.headings?.length}`);
console.log(`Has interactive elements: ${content.metadata.hasInteractive}`);

Stateless Architecture

All methods are stateless and accept document/element parameters explicitly:

// No window or document globals required
function extractFromIframe(iframe: HTMLIFrameElement) {
  const iframeDoc = iframe.contentDocument;
  if (iframeDoc) {
    return SmartDOMReader.extractInteractive(iframeDoc);
  }
}

// Works with any document context
function extractFromShadowRoot(shadowRoot: ShadowRoot) {
  const container = shadowRoot.querySelector('.container');
  if (container) {
    return SmartDOMReader.extractFromElement(container);
  }
}

/**
 * Stateless bundle string (for extensions / userScripts)
 *
 * The library also provides a self-contained IIFE bundle as a string
 * export that can be injected and executed without touching window scope.
 */
import { SMART_DOM_READER_BUNDLE } from '@mcp-b/smart-dom-reader/bundle-string';

function execute(method, args) {
  // Build a self-contained IIFE that runs the requested extraction method
  const code = `(() => {\n${SMART_DOM_READER_BUNDLE}\nreturn SmartDOMReaderBundle.executeExtraction(${JSON.stringify(
    method
  )}, ${JSON.stringify(args)});\n})()`;
  // inject `code` into the page (e.g., chrome.userScripts.execute)
  return code;
}

// Example: execute('extractStructure', { selector: undefined, formatOptions: { detail: 'summary' } });

// Note: The bundle contains guarded fallbacks (e.g., typeof require === 'function')
// that are no-ops in the browser; there are no runtime imports.

Design Philosophy

This library is designed to provide:

Token Efficiency: Progressive extraction minimizes token usage for AI applications
Flexibility: Choose between complete extraction or step-by-step approach
Statelessness: No global dependencies, works in any JavaScript environment
Multiple Selector Strategies: Robust element targeting with fallbacks
Semantic Understanding: Preserves meaning and relationships
Interactive Focus: Prioritizes elements users interact with
Context Preservation: Maintains element relationships
Framework Agnostic: Works with any web application

Frequently Asked Questions

How is this different from Cheerio or jsdom?

This library is AI-optimized:

Outputs structured data designed for LLM consumption
Provides ranked selectors with stability scores
Progressive extraction minimizes token usage
Preserves semantic context (landmarks, forms, interactivity)

Can I use this with Puppeteer/Playwright?

Yes! The library is stateless - pass any document object:

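// Assumes `browser` is an already-launched Puppeteer/Playwright browser and that
// SmartDOMReader has been loaded into the page (for example, by injecting the
// bundle); the page.evaluate callback runs in the page, not in Node.
const page = await browser.newPage();
await page.goto('https://example.com');
const result = await page.evaluate(() => {
  return SmartDOMReader.extractInteractive(document);
});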
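
The callback passed to `page.evaluate` runs inside the page, so the library has to exist there before it is called. One possible approach, sketched here under assumptions, reuses the bundle string from the Stateless Architecture section: build a self-contained expression and evaluate it as a string, which both Puppeteer and Playwright accept. `PageLike`, `buildExtractionScript`, and `extractInPage` are hypothetical names; `bundleSource` stands in for `SMART_DOM_READER_BUNDLE`.

```typescript
// Minimal structural type for the single call used here.
interface PageLike {
  evaluate(expression: string): Promise<unknown>;
}

// Wrap the bundle source and one extraction call in an IIFE so nothing leaks
// into the page's global scope.
function buildExtractionScript(bundleSource: string, method: string, args: unknown): string {
  return `(() => {\n${bundleSource}\nreturn SmartDOMReaderBundle.executeExtraction(${JSON.stringify(
    method
  )}, ${JSON.stringify(args)});\n})()`;
}

// Evaluate the generated expression in the page and return whatever it yields.
async function extractInPage(
  page: PageLike,
  bundleSource: string,
  method: string,
  args: unknown
): Promise<unknown> {
  return page.evaluate(buildExtractionScript(bundleSource, method, args));
}
```

With a real page this might be called as `await extractInPage(page, SMART_DOM_READER_BUNDLE, 'extractStructure', { formatOptions: { detail: 'summary' } })`. The return value must be serializable to cross from the page back to Node.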

How do selector rankings work?

Selectors are ranked by stability (higher = more reliable):

ID selectors (score: 100) - #unique-id
data-testid (score: 90) - [data-testid="submit"]
ARIA (score: 80) - [role="button"][aria-label="Submit"]
Name/ID attributes (score: 70) - input[name="email"]
Class paths (score: 50) - .form-container .submit-btn
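
The ranking above can be illustrated with a small, DOM-free sketch. It uses the scores listed in this answer, but everything else is an assumption: `ElementHints` and `rankCandidates` are hypothetical, attribute values are not escaped, and the class selector is a simplified compound selector rather than the ancestor path the library produces.

```typescript
// Hypothetical, DOM-free description of an element; not a library type.
interface ElementHints {
  tag: string;
  id?: string;
  testId?: string;
  role?: string;
  ariaLabel?: string;
  name?: string;
  classes?: string[];
}

interface Candidate {
  type: 'id' | 'data-testid' | 'role-aria' | 'name' | 'class-path';
  value: string;
  score: number;
}

// Emit one candidate per available attribute, then sort most stable first.
function rankCandidates(el: ElementHints): Candidate[] {
  const out: Candidate[] = [];
  if (el.id) out.push({ type: 'id', value: `#${el.id}`, score: 100 });
  if (el.testId) {
    out.push({ type: 'data-testid', value: `[data-testid="${el.testId}"]`, score: 90 });
  }
  if (el.role && el.ariaLabel) {
    const value = `[role="${el.role}"][aria-label="${el.ariaLabel}"]`;
    out.push({ type: 'role-aria', value, score: 80 });
  }
  if (el.name) out.push({ type: 'name', value: `${el.tag}[name="${el.name}"]`, score: 70 });
  if (el.classes && el.classes.length > 0) {
    const value = el.classes.map((c) => `.${c}`).join('');
    out.push({ type: 'class-path', value, score: 50 });
  }
  return out.sort((a, b) => b.score - a.score);
}

const ranked = rankCandidates({ tag: 'button', id: 'save', testId: 'submit', classes: ['btn'] });
console.log(ranked.map((c) => c.value));
// [ '#save', '[data-testid="submit"]', '.btn' ]
```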

Does it handle Shadow DOM?

Yes! Shadow roots are traversed by default (includeShadowDOM defaults to true); set includeShadowDOM: false to skip them.

What's the token overhead vs raw HTML?

Progressive extraction can reduce token usage by 80-95% compared to raw HTML, depending on the page and what you extract.
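
Since the saving varies by page, it can be worth measuring on your own targets. The sketch below is a rough estimate only: it assumes about four characters per token, which is a common rule of thumb and not a real tokenizer, and both function names are hypothetical.

```typescript
// Rough heuristic (an assumption, not a real tokenizer): ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Percentage reduction of `extracted` relative to `rawHtml`, rounded to one decimal.
function tokenReductionPercent(rawHtml: string, extracted: string): number {
  const raw = estimateTokens(rawHtml);
  if (raw === 0) return 0;
  const reduced = estimateTokens(extracted);
  return Math.round((1 - reduced / raw) * 1000) / 10;
}

// Synthetic inputs: 4000 characters of "HTML" vs 400 characters of extracted output.
console.log(tokenReductionPercent('x'.repeat(4000), 'y'.repeat(400))); // 90
```

In a browser, the two inputs might be `document.documentElement.outerHTML` and `JSON.stringify(structure)`; for accurate numbers, substitute the tokenizer of the model you actually use.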

Comparison with Alternatives

| Feature | @mcp-b/smart-dom-reader | Cheerio | jsdom | Raw DOM |
| --- | --- | --- | --- | --- |
| AI-Optimized Output | Yes | No | No | No |
| Ranked Selectors | Yes | No | No | No |
| Token Efficiency | Progressive | N/A | N/A | N/A |
| Shadow DOM | Yes | No | Limited | Yes |
| Browser Environment | Native | Parse only | Simulated | Native |
| Zero Dependencies | Yes | No | No | Yes |

Credits

Inspired by:

stacking-contexts-inspector - DOM traversal techniques
dom-to-semantic-markdown - Content scoring algorithms
z-context - Selector generation approaches

Related Packages

@mcp-b/global - W3C Web Model Context API polyfill
@mcp-b/transports - Browser-specific MCP transports
@mcp-b/chrome-devtools-mcp - Connect desktop AI agents to browser
@modelcontextprotocol/sdk - Official MCP SDK

Resources

License

MIT - see LICENSE for details

Support

MCP Server (Golden Path)

For AI agents, use the bundled MCP server, which returns XML-wrapped Markdown instead of JSON. This keeps responses concise and readable for LLMs while providing clear structural boundaries.

Output format: always an XML envelope with a single section tag containing Markdown in CDATA:
- Structure: <page title="..." url="...">\n <outline><![CDATA[ ...markdown... ]]></outline>\n</page>
- Region: <page ...>\n <section><![CDATA[ ...markdown... ]]></section>\n</page>
- Content: <page ...>\n <content><![CDATA[ ...markdown... ]]></content>\n</page>

Golden path sequence:
1. dom_extract_structure → get page outline and pick a target
2. dom_extract_region → get actionable selectors for that area
3. Write a script; if unstable, re-run with higher detail or limits
4. Optional: dom_extract_content for readable text context
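
If a client needs the Markdown without the wrapper, the envelope described above is simple enough to unpack with a pair of regular expressions. This is a minimal sketch under assumptions: `parseEnvelope` is a hypothetical helper, attribute values are assumed not to contain `>` or `"`, and XML entities in attributes are not decoded. A real XML parser would be more robust.

```typescript
interface Envelope {
  title: string;
  url: string;
  section: 'outline' | 'section' | 'content';
  markdown: string;
}

// Hypothetical helper: pull the page attributes and the CDATA Markdown out of
// an envelope; returns null when the text does not match the expected shape.
function parseEnvelope(xml: string): Envelope | null {
  const page = /<page\b([^>]*)>/.exec(xml);
  const body = /<(outline|section|content)>\s*<!\[CDATA\[([\s\S]*?)\]\]>\s*<\/\1>/.exec(xml);
  if (!page || !body) return null;

  const attrs = page[1] ?? '';
  const attr = (name: string): string => {
    const m = new RegExp(`\\b${name}="([^"]*)"`).exec(attrs);
    return m?.[1] ?? '';
  };

  return {
    title: attr('title'),
    url: attr('url'),
    section: body[1] as Envelope['section'],
    markdown: (body[2] ?? '').trim(),
  };
}

const sample =
  '<page title="Login" url="https://example.com/login">\n' +
  ' <outline><![CDATA[ # Login\n- form#login ]]></outline>\n' +
  '</page>';

console.log(parseEnvelope(sample)?.section); // outline
```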

Running the server

Ensure the library is built so the formatter is available:

pnpm -w --filter @mcp-b/smart-dom-reader run build

Build and update the embedded bundle, then start the MCP server (stdio):

pnpm --filter @mcp-b/smart-dom-reader bundle:mcp
pnpm --filter @mcp-b/smart-dom-reader-server run start

Or directly with tsx:

tsx smart-dom-reader/mcp-server/src/index.ts

Tool overview (inputs only)

browser_connect → { headless?: boolean, executablePath?: string }
browser_navigate → { url: string }
dom_extract_structure → { selector?: string, detail?: 'summary'|'region'|'deep', maxTextLength?: number, maxElements?: number }
dom_extract_region → { selector: string, options?: { mode?: 'interactive'|'full', includeHidden?: boolean, maxDepth?: number, detail?: 'summary'|'region'|'deep', maxTextLength?: number, maxElements?: number } }
dom_extract_content → { selector: string, options?: { includeHeadings?: boolean, includeLists?: boolean, includeMedia?: boolean, maxTextLength?: number, detail?: 'summary'|'region'|'deep', maxElements?: number } }
dom_extract_interactive → { selector?: string, options?: { viewportOnly?: boolean, maxDepth?: number, detail?: 'summary'|'region'|'deep', maxTextLength?: number, maxElements?: number } }
browser_screenshot → { path?: string, fullPage?: boolean }
browser_close → {}

All extraction tools return XML-wrapped Markdown with a short “Next:” instruction at the bottom to guide the following step.
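
For reference, this is roughly what a call to one of these tools looks like on the wire. MCP tool invocations are JSON-RPC 2.0 requests using the `tools/call` method; the input type below is copied from the `dom_extract_region` entry above, while `buildToolCall` and the type names are hypothetical. In practice an MCP client library normally builds this message for you.

```typescript
// Input shape for dom_extract_region, copied from the tool overview.
interface ExtractRegionInput {
  selector: string;
  options?: {
    mode?: 'interactive' | 'full';
    includeHidden?: boolean;
    maxDepth?: number;
    detail?: 'summary' | 'region' | 'deep';
    maxTextLength?: number;
    maxElements?: number;
  };
}

// JSON-RPC 2.0 request for the MCP "tools/call" method.
interface ToolCallRequest<T> {
  jsonrpc: '2.0';
  id: number;
  method: 'tools/call';
  params: { name: string; arguments: T };
}

// Hypothetical helper: wrap a tool name and its arguments in a request object.
function buildToolCall<T>(id: number, name: string, args: T): ToolCallRequest<T> {
  return { jsonrpc: '2.0', id, method: 'tools/call', params: { name, arguments: args } };
}

const request = buildToolCall<ExtractRegionInput>(2, 'dom_extract_region', {
  selector: 'main',
  options: { mode: 'interactive', detail: 'region', maxElements: 50 },
});

console.log(JSON.stringify(request, null, 2));
```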

Local Testing (Playwright)

Run the library in a real browser against local HTML (no network):

pnpm --filter @mcp-b/smart-dom-reader bundle:mcp
pnpm --filter @mcp-b/smart-dom-reader test:local

What it validates:

Stable selectors (ID, data-testid, role+aria, name/id)
Semantic extraction (headings/images/tables/lists)
Shadow DOM detection
