graby-ts-site-config

v1.1.1

Published

a year ago

Site configuration loader for Graby-TS with dynamic imports

Downloads

0High
0Medium
0Low

modos189

graby-ts readability site-config content-extraction fivefilters

Graby-TS Site Config

A dynamic site configuration loader for Graby-TS based on FiveFilters site patterns format. This library provides standardized content extraction rules for different websites, allowing for consistent extraction across a wide range of domains.

The site configuration rules are sourced from FiveFilters ftr-site-config, which contains a comprehensive collection of extraction rules for thousands of websites.

Features

Dynamically loads site-specific extraction rules
Well-typed with full TypeScript support
Memory-efficient with on-demand loading and caching
Compatible with all JavaScript environments
Supports wildcard domain patterns
Based on the established FiveFilters site patterns format

Installation

npm install graby-ts-site-config

Usage

import { SiteConfigManager } from 'graby-ts-site-config';

// Create a site config manager instance
const configManager = new SiteConfigManager();

async function extractContent(url) {
  // Get the site configuration for this URL
  const { hostname } = new URL(url);
  const config = await configManager.getConfigForHost(hostname);
  
  // Now use the configuration with your content extractor
  console.log('Using config:', config);
  
  // Example: checking if this site has specific extraction rules
  if (config.title && config.title.length > 0) {
    console.log('This site has custom title extraction rules');
  }
}

// Preload configs for frequently used sites
configManager.preloadConfigs(['medium.com', 'wikipedia.org']);

API

SiteConfigManager

`getConfigForHost(hostname: string): Promise<SiteConfig>`

Asynchronously loads and returns the configuration for the given hostname.

const config = await configManager.getConfigForHost('nytimes.com');

`hasConfigForHost(hostname: string): boolean`

Checks if a configuration exists for the given hostname.

if (configManager.hasConfigForHost('medium.com')) {
  console.log('Medium has custom extraction rules');
}

`preloadConfigs(hostnames: string[]): Promise<void>`

Preloads configurations for an array of hostnames to improve performance.

await configManager.preloadConfigs(['medium.com', 'wikipedia.org']);

`clearCache(): void`

Clears the internal configuration cache.

configManager.clearCache();

Configuration Fields

The SiteConfig object contains various fields that control how content is extracted from a website:

Content Selection (XPath expressions)

| Field | Type | Description | |-------|------|-------------| | title | string[] | XPath expressions to extract the page title | | body | string[] | XPath expressions to extract the article body content | | date | string[] | XPath expressions to extract the publication date | | author | string[] | XPath expressions to extract the author(s) information |

Content Cleaning

| Field | Type | Description | |-------|------|-------------| | strip | string[] | XPath expressions for elements to remove from the content | | strip_id_or_class | string[] | Element IDs or classes to remove from the content | | strip_image_src | string[] | Remove images with matching src attributes | | native_ad_clue | string[] | XPath expressions to identify native advertisements |

Processing Options

| Field | Type | Default | Description | |-------|------|---------|-------------| | prune | boolean | true | Clean content from non-essential elements using Readability algorithm | | autodetect_on_failure | boolean | true | Fall back to auto-detection if the pattern-based extraction fails | | insert_detected_image | boolean | true | Insert the main image detected from metadata | | skip_json_ld | boolean | true | Skip extraction from JSON-LD structured data |

Multi-page Handling

| Field | Type | Description | |-------|------|-------------| | single_page_link | string[] | XPath expressions to find the "view as single page" link | | single_page_link_in_feed | string[] | XPath for single-page links in feed items | | next_page_link | string[] | XPath expressions to find links to subsequent pages | | if_page_contains | string[] | XPath expressions for conditional processing of multi-page content |

Content Enhancement

| Field | Type | Description | |-------|------|-------------| | find_string | string[] | Strings to find and replace in the content | | replace_string | string[] | Replacement strings (paired with find_string) | | wrap_in | Record<string, string> | Wrap matching elements with specified tags | | src_lazy_load_attr | string[] | Image attribute names for lazy-loaded images |

HTTP Options

| Field | Type | Description | |-------|------|-------------| | http_header | Record<string, string> | Additional HTTP headers to send with requests |

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme