@agent-infra/browser-context

v0.2.2

Published

6 days ago

get browser context for AI Agent

Downloads

6,207

0High
0Medium
0Low

Browser Context

A powerful web content extraction library that extracts clean, readable content from web pages and converts it to Markdown format. Built for browser automation and content processing workflows.

Features

Smart Content Extraction: Uses advanced algorithms (Defuddle + Mozilla Readability) to extract main content from web pages
HTML to Markdown Conversion: Clean conversion with support for GitHub Flavored Markdown (GFM)
Browser Integration: Works seamlessly with Puppeteer and other browser automation tools
Fallback Strategy: Automatically falls back to Readability when primary extraction fails
Customizable: Configurable tag removal and conversion options

Installation

pnpm install @agent-infra/browser-context

Usage

Basic Content Extraction

Extract clean content from a web page using Puppeteer:

import { extractContent } from '@agent-infra/browser-context';
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/article');

// Extract content as Markdown
const result = await extractContent(page);
console.log(result.title); // Article title
console.log(result.content); // Clean Markdown content

await browser.close();

Manual Content Extraction

Extract content from HTML strings:

import {
  extractWithDefuddle,
  extractWithReadability,
} from '@agent-infra/browser-context';

// Using Defuddle (primary method)
const result1 = await extractWithDefuddle(htmlString, url, {
  markdown: true,
});

// Using Readability (fallback method)
const result2 = await extractWithReadability(page, {
  markdown: true,
});

HTML to Markdown Conversion

Convert HTML content to Markdown:

import { toMarkdown } from '@agent-infra/browser-context';

const html = '<h1>Title</h1><p>Content with <strong>bold</strong> text</p>';
const markdown = toMarkdown(html, {
  gfmExtension: true, // Enable GitHub Flavored Markdown
  codeBlockStyle: 'fenced', // Use fenced code blocks
  headingStyle: 'atx', // Use # style headings
});

console.log(markdown);
// # Title
//
// Content with **bold** text

Advanced HTML to Markdown Options

import {
  toMarkdown,
  DEFAULT_TAGS_TO_REMOVE,
} from '@agent-infra/browser-context';

const options = {
  gfmExtension: true,
  codeBlockStyle: 'fenced' as const,
  headingStyle: 'atx' as const,
  emDelimiter: '*',
  strongDelimiter: '**',
  removeTags: [...DEFAULT_TAGS_TO_REMOVE, 'footer', 'nav'], // Remove additional tags
};

const markdown = toMarkdown(htmlContent, options);

API Reference

`extractContent(page: Page)`

Main extraction function that automatically tries Defuddle first, then falls back to Readability.

Parameters:

page: Puppeteer page instance

Returns:

Promise<{title: string, content: string}>: Extracted title and Markdown content

`extractWithDefuddle(html: string, url: string, options: DefuddleOptions)`

Extract content using the Defuddle library.

Parameters:

html: HTML content string
url: Page URL
options: Defuddle configuration options

`extractWithReadability(page: Page, options?)`

Extract content using Mozilla's Readability algorithm.

Parameters:

page: Puppeteer page instance
options.markdown: Whether to convert to Markdown (default: false)

`toMarkdown(html: string, options?: ToMarkdownOptions)`

Convert HTML to Markdown format.

Parameters:

html: HTML content string
options: Conversion options

ToMarkdownOptions:

gfmExtension: Enable GitHub Flavored Markdown (default: true)
removeTags: Array of HTML tags to remove
codeBlockStyle: 'indented' | 'fenced'
headingStyle: 'setext' | 'atx'
emDelimiter: Emphasis delimiter
strongDelimiter: Strong emphasis delimiter

Content Extraction Strategy

The library uses a smart two-tier extraction strategy:

Primary: Defuddle library for accurate content extraction
Fallback: Mozilla Readability algorithm when Defuddle fails

This ensures maximum compatibility and extraction success across different website structures.

Removed HTML Elements

By default, the following HTML elements are removed during Markdown conversion:

script, style, link, head
iframe, video, audio, canvas
object, embed, noscript
aside, dialog

You can customize this list using the removeTags option.

Browser Compatibility

This library is designed to work with:

Puppeteer
Playwright
Any browser automation tool that provides a Page-like interface

License

Apache License 2.0.

Credits

Special thanks to the open source projects that inspired this toolkit:

readability - A standalone version of the readability lib

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

Browser Context

Features

Installation

Usage

Basic Content Extraction

Manual Content Extraction

HTML to Markdown Conversion

Advanced HTML to Markdown Options

API Reference

extractContent(page: Page)

extractWithDefuddle(html: string, url: string, options: DefuddleOptions)

extractWithReadability(page: Page, options?)

toMarkdown(html: string, options?: ToMarkdownOptions)

Content Extraction Strategy

Removed HTML Elements

Browser Compatibility

License

Credits

`extractContent(page: Page)`

`extractWithDefuddle(html: string, url: string, options: DefuddleOptions)`

`extractWithReadability(page: Page, options?)`

`toMarkdown(html: string, options?: ToMarkdownOptions)`