trafilatura

v0.2.0

Published

13 days ago

Fast and accurate web content extraction


gorango

content-extraction crawling nlp scraping trafilatura

napi-rs-trafilatura

Fast and accurate web content extraction for Node.js.

High-performance NAPI bindings for rs-trafilatura - a Rust port of trafilatura. Extracts clean, readable content from web pages while removing boilerplate, navigation, and advertisements.

Features

Fast: 71 files/s for articles, 46 files/s overall (native Rust)
Accurate: F1 0.966 on ScrapingHub benchmark, F1 0.859 across 7 page types
Page Type Classification: Auto-detects 7 page types (article, forum, product, collection, listing, documentation, service)
Per-Type Extraction: Specialized extraction profiles for each page type
Extraction Quality Predictor: ML-based confidence scoring (0.0-1.0)
Markdown Output: GitHub Flavored Markdown with headings, lists, tables, bold/italic, code blocks
Rich Metadata: Title, author, date, description, categories, tags, license, images from JSON-LD, Open Graph, Dublin Core
Configurable: 28 options to tune precision/recall tradeoff, content selection, and output format
Robust: Handles malformed HTML with automatic character encoding detection

Installation

npm install trafilatura

Usage

import { extract } from 'trafilatura'

const html = `
<html>
  <head><title>Example Article</title></head>
  <body>
    <nav>Home | About | Contact</nav>
    <article>
      <h1>Main Title</h1>
      <p>This is the main content of the article.</p>
    </article>
    <footer>Copyright 2024</footer>
  </body>
</html>
`

const result = extract(html)
console.log('Title:', result.metadata.title)
console.log('Content:', result.contentText)
console.log('Page type:', result.metadata.pageType)
console.log('Quality:', result.extractionQuality)

With Options

import { extract } from 'trafilatura'

const result = extract(html, {
  outputMarkdown: true,
  includeImages: true,
  favorPrecision: true,
  url: 'https://example.com/article',
})

console.log(result.contentMarkdown)
console.log(result.images)

Page Type Override

import { extract } from 'trafilatura'

const result = extract(html, {
  pageType: 'product', // Force product page extraction
})
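Because pageType is a plain string, a typo would only surface at runtime. The sketch below guards the value before it is passed as an option. It assumes the accepted values are the seven lowercase names listed under Features; PAGE_TYPES, isPageType, and pageTypeOption are hypothetical helpers, not part of the package.

```typescript
// Assumption: the accepted override values are the seven lowercase names
// listed under Features. This list is not an export of the package.
const PAGE_TYPES = [
  'article',
  'forum',
  'product',
  'collection',
  'listing',
  'documentation',
  'service',
] as const

type PageType = (typeof PAGE_TYPES)[number]

// Hypothetical type guard: true only for one of the seven known names.
function isPageType(value: string): value is PageType {
  return (PAGE_TYPES as readonly string[]).includes(value)
}

// Hypothetical helper: builds an options fragment, omitting pageType when
// the value is unknown so that auto-detection stays in effect.
function pageTypeOption(value: string): { pageType?: PageType } {
  return isPageType(value) ? { pageType: value } : {}
}

console.log(pageTypeOption('product')) // { pageType: 'product' }
console.log(pageTypeOption('blog'))    // {}
```

The returned fragment can be spread into the options object passed to extract, for example `extract(html, { ...pageTypeOption(userInput) })`.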

Working with Bytes

For HTML with unknown encoding:

import fs from 'node:fs'
import { extractBytes } from 'trafilatura'

const htmlBuffer = await fs.promises.readFile('page.html')
const result = extractBytes(htmlBuffer, { url: 'https://example.com' })

API

extract(html: string, options?: Options): ExtractResult

Extracts content from an HTML string. The options argument is optional.

extractBytes(buffer: Buffer, options?: Options): ExtractResult

Extracts content from a Buffer of raw bytes, detecting the character encoding automatically. The options argument is optional.

ExtractResult

| Field                    | Type        | Description                             |
| ------------------------ | ----------- | --------------------------------------- |
| contentText              | string?     | Main content as plain text              |
| contentHtml              | string?     | Main content as HTML                    |
| contentMarkdown          | string?     | Main content as Markdown                |
| commentsText             | string?     | Comments section as text                |
| commentsHtml             | string?     | Comments section as HTML                |
| images                   | ImageData[] | Extracted images                        |
| metadata                 | Metadata    | Extracted metadata                      |
| classificationConfidence | number?     | ML classifier confidence (0.0-1.0)      |
| extractionQuality        | number      | Extraction quality confidence (0.0-1.0) |
| warnings                 | string[]    | Processing warnings                     |
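Since several content fields are optional and extractionQuality is a 0.0-1.0 score, callers will usually want to gate on quality and fall back between formats. The following is a minimal sketch of that pattern. ExtractResultLike is a local mirror of a few ExtractResult fields, pickContent is a hypothetical helper, and the 0.5 threshold is an arbitrary assumption, not a value recommended by the library.

```typescript
// Local mirror of the ExtractResult fields used here. The real type is
// exported by the package; this copy keeps the sketch self-contained.
interface ExtractResultLike {
  contentText?: string | null
  contentHtml?: string | null
  contentMarkdown?: string | null
  extractionQuality: number
  warnings: string[]
}

// Hypothetical helper: returns the preferred content string, or null when
// the quality score is below the threshold or no content was extracted.
// The default threshold of 0.5 is an assumption chosen for illustration.
function pickContent(result: ExtractResultLike, minQuality = 0.5): string | null {
  if (result.extractionQuality < minQuality) return null
  // Prefer Markdown, then plain text, then HTML.
  const content = result.contentMarkdown ?? result.contentText ?? result.contentHtml
  return content && content.trim().length > 0 ? content : null
}

// Stand-in objects shaped like ExtractResult; real ones come from extract(html).
const good: ExtractResultLike = { contentText: 'Main content', extractionQuality: 0.9, warnings: [] }
const weak: ExtractResultLike = { contentText: 'Menu', extractionQuality: 0.2, warnings: [] }

console.log(pickContent(good)) // Main content
console.log(pickContent(weak)) // null
```

A suitable threshold depends on the page mix being processed, so it is worth calibrating against a sample of known-good and known-bad pages.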

Options

| Option                | Type     | Description                      |
| --------------------- | -------- | -------------------------------- |
| includeComments       | boolean  | Include comments in output       |
| includeTables         | boolean  | Include tables                   |
| includeImages         | boolean  | Include images                   |
| includeLinks          | boolean  | Include links                    |
| favorPrecision        | boolean  | Favor precision over recall      |
| favorRecall           | boolean  | Favor recall over precision      |
| targetLanguage        | string   | Target language code             |
| url                   | string   | Source URL                       |
| authorBlacklist       | string[] | Author names to exclude          |
| deduplicate           | boolean  | Remove duplicate content         |
| minExtractedSize      | number   | Minimum extracted content size   |
| minExtractedLen       | number   | Minimum extracted length         |
| maxExtractedLen       | number   | Maximum extracted length         |
| minOutputSize         | number   | Minimum output size              |
| minOutputCommSize     | number   | Minimum comments size            |
| minScore              | number   | Minimum quality score            |
| maxDuplicateRatio     | number   | Max duplicate ratio threshold    |
| maxLinkDensity        | number   | Max link density threshold       |
| minParagraphCluster   | number   | Min paragraph cluster size       |
| includeFormatting     | boolean  | Include text formatting          |
| onlyWithMetadata      | boolean  | Only extract pages with metadata |
| maxTreeDepth          | number   | Maximum DOM tree depth           |
| minWordLength         | number   | Minimum word length              |
| useFallbackExtraction | boolean  | Use fallback extraction          |
| dedupCacheSize        | number   | Deduplication cache size         |
| includeTitleInContent | boolean  | Include title in content         |
| outputMarkdown        | boolean  | Output as Markdown               |
| pageType              | string   | Override page type               |
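Some of these options pull in opposite directions (favorPrecision versus favorRecall) or form ranges (minExtractedLen and maxExtractedLen). How the library resolves conflicting values is not documented here, so a client-side sanity check may help catch mistakes early. The sketch below is one way to do that. OptionsSubset and checkOptions are hypothetical and cover only four of the options.

```typescript
// Local subset of the Options fields checked here. The real type is
// exported by the package.
interface OptionsSubset {
  favorPrecision?: boolean
  favorRecall?: boolean
  minExtractedLen?: number
  maxExtractedLen?: number
}

// Hypothetical helper: returns a list of human-readable problems. An empty
// list means no conflict was found among the checked fields. Whether the
// library itself rejects these combinations is an open assumption.
function checkOptions(opts: OptionsSubset): string[] {
  const problems: string[] = []
  if (opts.favorPrecision && opts.favorRecall) {
    problems.push('favorPrecision and favorRecall are both set')
  }
  if (
    opts.minExtractedLen !== undefined &&
    opts.maxExtractedLen !== undefined &&
    opts.minExtractedLen > opts.maxExtractedLen
  ) {
    problems.push('minExtractedLen is greater than maxExtractedLen')
  }
  return problems
}

console.log(checkOptions({ favorPrecision: true, favorRecall: true }))
console.log(checkOptions({ minExtractedLen: 100, maxExtractedLen: 5000 }))
```

Running the check before calling extract keeps configuration mistakes separate from extraction failures when debugging.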

Build from Source

# Install build dependencies
npm install

# Build the native module
npm run build

Test

npm test

License

MIT

Acknowledgments

trafilatura - Original Python implementation by Adrien Barbaresi
rs-trafilatura - Rust port by Murrough Foley