@firecrawl/html-extractor

v0.1.2

Fast HTML main-content extractor with Node.js bindings. Native Rust performance via napi-rs.

html-extractor

Fast HTML main-content extractor for Node.js / Bun. Native Rust performance via napi-rs.

Pulls the article body out of a raw HTML page and renders it as clean markdown — stripping nav, footers, related-stories rails, ads, and other site chrome. Page-type aware (article, forum, product, listing, documentation, etc.), with a per-extraction confidence score and harvested metadata (JSON-LD, OpenGraph, microformats).

Built by Firecrawl. Source: github.com/firecrawl/html-extractor.

Install

npm install @firecrawl/html-extractor
# or
bun add @firecrawl/html-extractor

Prebuilt binaries included for linux-x64, macOS ARM64, and windows-x64. No Rust toolchain needed.

Quick start

import { extract } from '@firecrawl/html-extractor'

const html = await fetch('https://example.com/article').then(r => r.text())
const result = await extract(html, { url: 'https://example.com/article' })

console.log(result.markdown)           // cleaned article as markdown
console.log(result.pageType)           // 'article' | 'forum' | 'product' | ...
console.log(result.extractionQuality)  // 0.0..1.0 confidence
console.log(result.metadata?.title)    // 'How to do the thing'

API

extract(html, options?): Promise<ExtractResult>

Asynchronous extraction. CPU work runs on a libuv worker thread, so the Node event loop isn't blocked for large pages. Prefer this for production use.

extractSync(html, options?): ExtractResult

Synchronous variant. Convenient for scripts and short HTML, but blocks the event loop while running. Don't use in request handlers.

version(): string

Returns the library version.
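
Because the async variant does its work off the event loop, several pages can be in flight at once. The sketch below shows one way to batch extractions with a concurrency cap. `extractMany`, `Page`, and `ExtractFn` are hypothetical names, not part of the package; the extractor is passed in as a parameter so the helper stays self-contained, and the default limit of 4 is an assumption chosen to match libuv's default thread pool size.

```typescript
// Assumed minimal shape of the async entry point; in real use the package's
// `extract` function would be passed in here.
type ExtractFn = (
  html: string,
  options?: { url?: string },
) => Promise<{ markdown: string }>

interface Page {
  url: string
  html: string
}

// Hypothetical helper: run at most `limit` extractions at a time and
// return the markdown in the same order as the input pages.
async function extractMany(
  pages: Page[],
  extractFn: ExtractFn,
  limit = 4,
): Promise<string[]> {
  const out: string[] = new Array(pages.length)
  let next = 0

  // Each worker pulls the next unclaimed index until none remain.
  async function worker(): Promise<void> {
    while (next < pages.length) {
      const i = next++
      const page = pages[i]
      const result = await extractFn(page.html, { url: page.url })
      out[i] = result.markdown
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, pages.length) },
    () => worker(),
  )
  await Promise.all(workers)
  return out
}
```

A call would look like `await extractMany(pages, extract)`. Raising the limit beyond the number of available worker threads is unlikely to help, since extra calls would presumably just queue.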

Options

interface ExtractOptions {
  url?: string                    // original URL, used for absolute-link rewriting + classification
  favorPrecision?: boolean        // be aggressive about dropping ambiguous content
  favorRecall?: boolean           // be lenient — keep ambiguous content (mutually exclusive with favorPrecision)
  outputText?: boolean            // also return a plain-text mirror of the markdown
  targetLanguage?: string         // hint for language detection
  pageTypeOverride?:              // skip the classifier and force a scoring profile
    | 'article'   | 'forum'
    | 'product'   | 'listing'
    | 'collection'| 'documentation'
    | 'service'   | 'other'
  includeLinks?: boolean          // preserve <a> as [text](href) in markdown (default true)
  includeTables?: boolean         // preserve tables as GFM tables (default true)
  includeImages?: boolean         // preserve <img> as ![alt](src) (default false)
  includeMetadata?: boolean       // populate the metadata field (default true)
  minExtractionLength?: number    // fall back if the kept subtree is shorter than this (default 25)
}
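
Since `favorPrecision` and `favorRecall` are mutually exclusive, it can help to derive both from a single setting. The following is a minimal sketch: `Mode`, `modeToFlags`, and `assertExclusive` are hypothetical helpers, and `ModeFlags` is a local mirror of the two documented flags rather than a type imported from the package. How the library itself reacts when both flags are set is not stated here, so the guard simply fails early.

```typescript
// Local mirror of the two flags from ExtractOptions; not imported from the package.
interface ModeFlags {
  favorPrecision?: boolean
  favorRecall?: boolean
}

type Mode = 'precision' | 'recall' | 'balanced'

// Hypothetical helper: map one mode name to a flag set, so the two
// flags can never both be true.
function modeToFlags(mode: Mode): ModeFlags {
  switch (mode) {
    case 'precision':
      return { favorPrecision: true }
    case 'recall':
      return { favorRecall: true }
    default:
      return {}
  }
}

// Hypothetical guard for options assembled elsewhere (for example from user config).
function assertExclusive(flags: ModeFlags): void {
  if (flags.favorPrecision && flags.favorRecall) {
    throw new Error('favorPrecision and favorRecall are mutually exclusive')
  }
}
```

The returned flags can be spread into the options object, for example `{ url, ...modeToFlags('precision') }`.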

Result shape

interface ExtractResult {
  markdown: string                // cleaned main content as GFM markdown
  text?: string                   // plain-text mirror, present when outputText: true
  pageType: string                // detected page type
  extractionQuality: number       // confidence in [0.0, 1.0]
  language?: string               // BCP-47 language tag
  metadata?: {
    title?: string
    description?: string
    author?: string
    publishedDate?: string
    siteName?: string
    imageUrl?: string
    canonicalUrl?: string
    language?: string
    keywords: string[]
  }
  stats?: {
    textChars: number
    elementCount: number
    usedFallback: boolean
    pageType: string
  }
  errorReason?: string            // present on low-confidence extractions
}
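
The confidence score, `errorReason`, and `stats.usedFallback` together give enough signal to decide what to do with a result. Below is one possible triage helper. `ResultLike`, `Verdict`, and `triage` are hypothetical names, and the 0.5 cutoff is an assumed value for illustration, not a threshold recommended by the library.

```typescript
// Local subset of the documented ExtractResult fields; not imported from the package.
interface ResultLike {
  markdown: string
  extractionQuality: number
  stats?: { usedFallback: boolean }
  errorReason?: string
}

type Verdict = 'accept' | 'review' | 'reject'

// Hypothetical policy: reject empty output, flag anything low-confidence
// or produced by a fallback, accept the rest.
function triage(result: ResultLike, minQuality = 0.5): Verdict {
  if (result.markdown.trim().length === 0) return 'reject'
  if (result.errorReason !== undefined) return 'review'
  if (result.extractionQuality < minQuality) return 'review'
  if (result.stats?.usedFallback) return 'review'
  return 'accept'
}
```

The right cutoff likely depends on the page types being processed, so it is worth calibrating against a sample of real pages.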

How the algorithm works

A trafilatura-inspired five-stage pipeline:

  1. Pre-clean — drop <script> / <style> / <head> / comments / hidden elements.
  2. Page-type classification — rules-based ladder (URL patterns, tag counts, class regexes, JSON-LD @type) picks a scoring profile.
  3. Score + select — every element is scored on 7 features (including text density, link density, tag weight, class hints, position, and parent chain); the highest-scoring subtree is the candidate main content.
  4. Fallback chain — if Stage 3 produced a degenerate result, fall through justext-style → readability-style → raw-text.
  5. Post-clean + render — strip leftover boilerplate from inside the kept subtree, render markdown, extract metadata.

See the Rust crate's README for architecture details and benchmarks.
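
To make the fallback stage concrete, here is a toy sketch of a fallback chain in TypeScript. It is not the crate's implementation: `Strategy`, `runChain`, and the two regex-based stand-in strategies are hypothetical, and only the general idea (try extractors in order, accept the first output that reaches a minimum length) is taken from the description above. The default of 25 mirrors the documented `minExtractionLength` default.

```typescript
// A strategy returns extracted text, or null when it finds nothing.
type Strategy = { name: string; run: (html: string) => string | null }

interface ChainResult {
  text: string
  strategy: string
  usedFallback: boolean
}

// Remove tags and collapse whitespace; a crude stand-in for real rendering.
function stripTags(html: string): string {
  return html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim()
}

// Try each strategy in order and accept the first whose output
// is at least minLength characters long.
function runChain(
  html: string,
  strategies: Strategy[],
  minLength = 25,
): ChainResult | null {
  for (let i = 0; i < strategies.length; i++) {
    const s = strategies[i]
    const text = s.run(html)?.trim() ?? ''
    if (text.length >= minLength) {
      return { text, strategy: s.name, usedFallback: i > 0 }
    }
  }
  return null
}

// Toy stand-ins for the primary and last-resort stages; neither
// reproduces the real scoring or fallback algorithms.
const toyStrategies: Strategy[] = [
  {
    name: 'scored-subtree',
    run: (html) => {
      const m = /<article[^>]*>([\s\S]*?)<\/article>/i.exec(html)
      return m ? stripTags(m[1]) : null
    },
  },
  { name: 'raw-text', run: (html) => stripTags(html) },
]
```

Ordering the strategies from most to least selective means the cheap raw-text pass only runs when the more precise ones come up short.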

License

Apache-2.0.