
trafilatura v0.2.0

napi-rs-trafilatura

Fast and accurate web content extraction for Node.js.

High-performance NAPI bindings for rs-trafilatura - a Rust port of trafilatura. Extracts clean, readable content from web pages while removing boilerplate, navigation, and advertisements.

Features

  • Fast: 71 files/s for articles, 46 files/s overall (native Rust)
  • Accurate: F1 0.966 on ScrapingHub benchmark, F1 0.859 across 7 page types
  • Page Type Classification: Auto-detects 7 page types (article, forum, product, collection, listing, documentation, service)
  • Per-Type Extraction: Specialized extraction profiles for each page type
  • Extraction Quality Predictor: ML-based confidence scoring (0.0-1.0)
  • Markdown Output: GitHub Flavored Markdown with headings, lists, tables, bold/italic, code blocks
  • Rich Metadata: Title, author, date, description, categories, tags, license, images from JSON-LD, Open Graph, Dublin Core
  • Configurable: 28 options to tune precision/recall tradeoff, content selection, and output format
  • Robust: Handles malformed HTML with automatic character encoding detection
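The quality predictor and classifier confidence scores above lend themselves to gating downstream processing. A minimal sketch of such a gate, operating on a subset of the result shape; the 0.5 threshold, the stricter fallback bar, and the `acceptResult` helper are illustrative assumptions, not part of the library:

```typescript
// Subset of ExtractResult fields used for gating (illustrative).
interface ScoredResult {
  contentText?: string
  extractionQuality: number          // 0.0-1.0, from the quality predictor
  classificationConfidence?: number  // 0.0-1.0, from the page-type classifier
}

// Accept a result only when extraction quality clears a threshold;
// raise the bar when the page-type classifier itself was unsure.
function acceptResult(result: ScoredResult, minQuality = 0.5): boolean {
  const confident = (result.classificationConfidence ?? 1) >= 0.5
  const bar = confident ? minQuality : minQuality + 0.2
  return result.extractionQuality >= bar && Boolean(result.contentText)
}
```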

Installation

npm install trafilatura

Usage

import { extract } from 'trafilatura'

const html = `
<html>
  <head><title>Example Article</title></head>
  <body>
    <nav>Home | About | Contact</nav>
    <article>
      <h1>Main Title</h1>
      <p>This is the main content of the article.</p>
    </article>
    <footer>Copyright 2024</footer>
  </body>
</html>
`

const result = extract(html)
console.log('Title:', result.metadata.title)
console.log('Content:', result.contentText)
console.log('Page type:', result.metadata.pageType)
console.log('Quality:', result.extractionQuality)

With Options

import { extract } from 'trafilatura'

const result = extract(html, {
  outputMarkdown: true,
  includeImages: true,
  favorPrecision: true,
  url: 'https://example.com/article',
})

console.log(result.contentMarkdown)
console.log(result.images)

Page Type Override

import { extract } from 'trafilatura'

const result = extract(html, {
  pageType: 'product', // Force product page extraction
})

Working with Bytes

For HTML with unknown encoding:

import fs from 'node:fs'
import { extractBytes } from 'trafilatura'

const htmlBuffer = await fs.promises.readFile('page.html')
const result = extractBytes(htmlBuffer, { url: 'https://example.com' })

API

extract(html: string, options?: Options): ExtractResult

Extracts content from an HTML string. Accepts an optional Options object.

extractBytes(buffer: Buffer, options?: Options): ExtractResult

Extracts content from a Buffer, detecting the character encoding automatically. Accepts an optional Options object.

ExtractResult

| Field                    | Type        | Description                             |
| ------------------------ | ----------- | --------------------------------------- |
| contentText              | string?     | Main content as plain text              |
| contentHtml              | string?     | Main content as HTML                    |
| contentMarkdown          | string?     | Main content as Markdown                |
| commentsText             | string?     | Comments section as text                |
| commentsHtml             | string?     | Comments section as HTML                |
| images                   | ImageData[] | Extracted images                        |
| metadata                 | Metadata    | Extracted metadata                      |
| classificationConfidence | number?     | ML classifier confidence (0.0-1.0)      |
| extractionQuality        | number      | Extraction quality confidence (0.0-1.0) |
| warnings                 | string[]    | Processing warnings                     |
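Since the three content fields are optional and depend on which output options were set at extraction time, callers often need a single string regardless. A small sketch of that fallback; the `pickContent` helper and the preference order are assumptions for illustration:

```typescript
// Subset of ExtractResult fields relevant to content selection.
interface ContentFields {
  contentMarkdown?: string
  contentHtml?: string
  contentText?: string
  warnings: string[]
}

// Prefer Markdown, then HTML, then plain text; returns undefined
// when no content field was populated.
function pickContent(result: ContentFields): string | undefined {
  return result.contentMarkdown ?? result.contentHtml ?? result.contentText
}
```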

Options

| Option                | Type     | Description                      |
| --------------------- | -------- | -------------------------------- |
| includeComments       | boolean  | Include comments in output       |
| includeTables         | boolean  | Include tables                   |
| includeImages         | boolean  | Include images                   |
| includeLinks          | boolean  | Include links                    |
| favorPrecision        | boolean  | Favor precision over recall      |
| favorRecall           | boolean  | Favor recall over precision      |
| targetLanguage        | string   | Target language code             |
| url                   | string   | Source URL                       |
| authorBlacklist       | string[] | Author names to exclude          |
| deduplicate           | boolean  | Remove duplicate content         |
| minExtractedSize      | number   | Minimum extracted content size   |
| minExtractedLen       | number   | Minimum extracted length         |
| maxExtractedLen       | number   | Maximum extracted length         |
| minOutputSize         | number   | Minimum output size              |
| minOutputCommSize     | number   | Minimum comments size            |
| minScore              | number   | Minimum quality score            |
| maxDuplicateRatio     | number   | Max duplicate ratio threshold    |
| maxLinkDensity        | number   | Max link density threshold       |
| minParagraphCluster   | number   | Min paragraph cluster size       |
| includeFormatting     | boolean  | Include text formatting          |
| onlyWithMetadata      | boolean  | Only extract pages with metadata |
| maxTreeDepth          | number   | Maximum DOM tree depth           |
| minWordLength         | number   | Minimum word length              |
| useFallbackExtraction | boolean  | Use fallback extraction          |
| dedupCacheSize        | number   | Deduplication cache size         |
| includeTitleInContent | boolean  | Include title in content         |
| outputMarkdown        | boolean  | Output as Markdown               |
| pageType              | string   | Override page type               |
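Several of these options pull in the same direction: `favorPrecision` plus the size and link-density thresholds together prune short, link-heavy fragments. One way to keep such a combination reusable is a preset builder; this is a sketch, and the specific numeric values and the `strictOptions` helper are assumptions, not library defaults:

```typescript
// Subset of Options used by this preset (names from the table above).
interface PrecisionOptions {
  favorPrecision?: boolean
  minExtractedSize?: number
  maxLinkDensity?: number
  includeComments?: boolean
}

// An illustrative "strict" preset: favor precision, drop short fragments
// and link-heavy blocks, skip comments. Callers can override any field.
function strictOptions(overrides: Partial<PrecisionOptions> = {}): PrecisionOptions {
  return {
    favorPrecision: true,
    minExtractedSize: 250,
    maxLinkDensity: 0.4,
    includeComments: false,
    ...overrides,
  }
}
```

The spread of `overrides` last means caller-supplied values win, so the preset stays a default rather than a constraint.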

Build from Source

# Install build dependencies
npm install

# Build the native module
npm run build

Test

npm test

License

MIT

Acknowledgments