web-content-extract
v1.0.1
A library and command-line tool to extract clean content from web pages using Mozilla Readability and convert it to Markdown or JSON.
Web Content Extract
A library and command-line tool to extract clean content from web pages using Mozilla Readability, with enhanced SEO metadata extraction and JSON output support.
Features
- Extracts main content from web pages, filtering out ads, navigation, and other non-essential elements
- Converts extracted content to clean Markdown format
- Enhanced SEO metadata extraction including:
  - Standard meta tags (title, description, keywords, author)
  - Open Graph metadata
  - Schema.org itemprop metadata
  - rel="author" links
  - time tags for publication dates
- Supports multiple output formats: Markdown, YAML Front Matter, and JSON
- Can be used as a library in other projects or as a standalone CLI tool
- Built with TypeScript and Node.js
- Uses @mozilla/readability for accurate content extraction
- Uses ofetch for robust HTTP requests with timeout handling
Installation
npm install web-content-extract
Usage as a Library
import { extractContent } from "web-content-extract";
// Basic content extraction
const result = await extractContent("https://example.com");
console.log(result.content); // Markdown content
// Content extraction with SEO metadata
const resultWithSeo = await extractContent("https://example.com", true);
console.log(resultWithSeo.title); // Article title
console.log(resultWithSeo.seo); // SEO metadata
console.log(resultWithSeo.content); // Markdown content
// JSON output example
console.log(JSON.stringify(resultWithSeo, null, 2));
Usage as a CLI Tool
Run the tool using npx:
npx web-content-extract <url> [options]
Arguments
<url>: The URL of the web page to extract content from (required)
Options
-o, --output <file>: Output file path (default: stdout)
-s, --seo: Include SEO metadata in the output
-j, --json: Output in JSON format
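How the two format flags combine can be sketched as a small pure function. This is an illustration only: `CliFlags`, `OutputFormat`, and `selectFormat` are hypothetical names, not part of the package, and the behavior of `--json` without `--seo` is an assumption, since this README only shows `--json` together with `--seo`.

```typescript
// Hypothetical names for illustration; not part of web-content-extract.
type OutputFormat = "markdown" | "front-matter" | "json";

interface CliFlags {
  seo?: boolean;  // -s, --seo
  json?: boolean; // -j, --json
}

// Map the parsed flags to an output format.
// Assumption: --json selects JSON whether or not --seo is also set.
function selectFormat(flags: CliFlags): OutputFormat {
  if (flags.json) return "json";
  if (flags.seo) return "front-matter";
  return "markdown";
}

console.log(selectFormat({}));                        // markdown
console.log(selectFormat({ seo: true }));             // front-matter
console.log(selectFormat({ seo: true, json: true })); // json
```

Keeping the decision in one function would make it easy to test the flag combinations without running the CLI.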
Examples
Extract content and output to stdout:
npx web-content-extract https://example.com
Extract content and save to a file:
npx web-content-extract https://example.com -o output.md
Extract content with SEO metadata in YAML Front Matter format:
npx web-content-extract https://example.com --seo
Extract content with SEO metadata in JSON format:
npx web-content-extract https://example.com --seo --json
API
extractContent(url: string, includeSeo?: boolean): Promise<ExtractedContent>
Extracts content from a web page.
Parameters:
url: The URL of the web page to extract content from
includeSeo: Whether to include SEO metadata (default: false)
Returns: An object with the following properties:
content: The extracted content in Markdown format
title: The title of the article
seo: SEO metadata (only if includeSeo is true)
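To make the returned object more concrete, here is a minimal sketch of how a caller might turn it into YAML Front Matter followed by Markdown. `toFrontMatter` is a hypothetical helper, the interfaces are trimmed local copies of the shapes described in this README, and the choice and order of fields are assumptions; the tool's own front matter layout may differ.

```typescript
// Trimmed local copies of the shapes described in this README.
interface SeoMetadata {
  title?: string;
  description?: string;
  author?: string;
  publishedTime?: string;
}

interface ExtractedContent {
  content: string;
  title?: string;
  seo?: SeoMetadata;
}

// Hypothetical helper: render a result as YAML Front Matter plus Markdown.
// Values are JSON-quoted; a JSON string is also a valid YAML double-quoted string.
function toFrontMatter(result: ExtractedContent): string {
  const seo = result.seo ?? {};
  const fields: [string, string | undefined][] = [
    ["title", seo.title ?? result.title],
    ["description", seo.description],
    ["author", seo.author],
    ["publishedTime", seo.publishedTime],
  ];
  const lines = fields
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`);
  // With no metadata at all, fall back to the plain Markdown body.
  if (lines.length === 0) return result.content;
  return ["---", ...lines, "---", "", result.content].join("\n");
}

const sample: ExtractedContent = {
  title: "Hello",
  content: "# Hello\n\nBody text.",
  seo: { description: "A greeting", author: "Jane Doe" },
};
console.log(toFrontMatter(sample));
```

Fields that are missing from the metadata are skipped rather than written as empty values, so the front matter only contains what was actually found.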
ExtractedContent Interface
interface ExtractedContent {
  content: string;
  title?: string;
  seo?: SeoMetadata;
}
interface SeoMetadata {
  title?: string;
  description?: string;
  keywords?: string;
  author?: string;
  publishedTime?: string;
  siteName?: string;
  language?: string;
  openGraph?: {
    title?: string;
    type?: string;
    image?: string;
    url?: string;
    description?: string;
    siteName?: string;
    locale?: string;
  };
}
How It Works
- Fetch: Uses ofetch to retrieve the HTML content of the specified URL with a 10-second timeout
- Parse: Uses jsdom to create a DOM environment from the HTML
- Extract: Uses @mozilla/readability to identify and extract the main article content
- SEO Metadata: Extracts comprehensive SEO metadata from various sources:
  - Standard meta tags
  - Open Graph tags
  - Schema.org itemprop attributes
  - rel="author" links
  - time tags
- Convert: Uses turndown to convert the extracted HTML content to Markdown
- Output: Outputs the content in the requested format (Markdown, YAML Front Matter, or JSON)
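The SEO Metadata step merges values from several tag sources into one object. The sketch below illustrates that idea without any DOM library: it assumes the meta tags have already been read into plain records. `MetaTag`, `SeoSketch`, and `collectSeo` are hypothetical names, and the precedence shown (standard meta tags before Schema.org itemprop) is an assumption, not the package's documented behavior.

```typescript
// Hypothetical input shape: one record per <meta> tag already read from the DOM.
interface MetaTag {
  name?: string;     // <meta name="...">
  property?: string; // <meta property="og:...">
  itemprop?: string; // <meta itemprop="...">
  content: string;
}

// Trimmed version of the SeoMetadata shape described in this README.
interface SeoSketch {
  title?: string;
  description?: string;
  author?: string;
  openGraph?: { title?: string; description?: string; image?: string };
}

// Assumed precedence: standard meta tags first, then Schema.org itemprop.
// Open Graph values are kept separately under `openGraph`.
function collectSeo(tags: MetaTag[]): SeoSketch {
  const find = (pick: (t: MetaTag) => string | undefined, key: string) =>
    tags.find((t) => pick(t) === key)?.content;
  const byName = (key: string) => find((t) => t.name, key);
  const byItemprop = (key: string) => find((t) => t.itemprop, key);
  const byProperty = (key: string) => find((t) => t.property, key);

  const seo: SeoSketch = {
    title: byName("title") ?? byItemprop("name"),
    description: byName("description") ?? byItemprop("description"),
    author: byName("author") ?? byItemprop("author"),
  };
  const openGraph = {
    title: byProperty("og:title"),
    description: byProperty("og:description"),
    image: byProperty("og:image"),
  };
  // Attach the openGraph object only when at least one og: value was found.
  if (Object.values(openGraph).some((v) => v !== undefined)) {
    seo.openGraph = openGraph;
  }
  return seo;
}

const seo = collectSeo([
  { name: "description", content: "An example page" },
  { itemprop: "author", content: "Jane Doe" },
  { property: "og:title", content: "Example" },
]);
console.log(JSON.stringify(seo, null, 2));
```

Fields with no matching tag stay undefined, so they drop out of the JSON output instead of appearing as empty strings.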
Development
- Clone or download this repository
- Install dependencies: npm install
- Build the project: npm run build
Dependencies
- @mozilla/readability: Content extraction engine
- jsdom: DOM implementation for Node.js
- ofetch: Modern fetch implementation
- turndown: HTML to Markdown converter
- yargs: CLI argument parser
License
MIT
