@tyronerossjr/blog-scraper

Powerful web scraping SDK for extracting blog articles and content. No LLM required.


Features

  • No LLM needed - Uses Mozilla Readability (92.2% F1 score) for content extraction
  • 🎯 3-tier filtering - URL patterns → content validation → quality scoring
  • ⚡ Fast - Extracts articles in 2-5 seconds
  • 🔧 Modular - Use high-level API or individual components
  • 📦 Zero config - Works out of the box
  • 🌐 Multi-source - RSS feeds, sitemaps, and HTML pages

Installation

npm install @tyronerossjr/blog-scraper

Quick Start

import { scrape } from '@tyronerossjr/blog-scraper';

// Simple usage - scrape a blog
const result = await scrape('https://example.com/blog');

console.log(`Found ${result.articles.length} articles`);
console.log(`Processing time: ${result.processingTime}ms`);

// Access articles
result.articles.forEach(article => {
  console.log(article.title);
  console.log(article.url);
  console.log(article.fullContentMarkdown); // Markdown format
  console.log(article.qualityScore); // 0-1 quality score
});

API Reference

High-Level API (Recommended)

scrape(url, options?)

Extract articles from a URL with automatic source detection.

import { scrape } from '@tyronerossjr/blog-scraper';

const result = await scrape('https://example.com', {
  // Optional configuration
  sourceType: 'auto',           // 'auto' | 'rss' | 'sitemap' | 'html'
  maxArticles: 50,              // Maximum articles to return
  extractFullContent: true,     // Extract full article content
  denyPaths: ['/about', '/contact'], // URL patterns to exclude
  qualityThreshold: 0.6         // Minimum quality score (0-1)
});

Returns:

{
  url: string;
  detectedType: 'rss' | 'sitemap' | 'html';
  confidence: 'high' | 'medium' | 'low';
  articles: ScrapedArticle[];
  extractionStats: {
    attempted: number;
    successful: number;
    failed: number;
    filtered: number;
    totalDiscovered: number;
    afterDenyFilter: number;
    afterContentValidation: number;
    afterQualityFilter: number;
  };
  processingTime: number;
  errors: string[];
  timestamp: string;
}
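
The per-tier counters in extractionStats make it easy to see where candidate articles were dropped. A minimal sketch using only the fields listed above (the URL is illustrative):

import { scrape } from '@tyronerossjr/blog-scraper';

const result = await scrape('https://example.com/blog');
const stats = result.extractionStats;

// Trace how the candidate pool shrinks through the pipeline
console.log(`Discovered: ${stats.totalDiscovered}`);
console.log(`After deny filter: ${stats.afterDenyFilter}`);
console.log(`After content validation: ${stats.afterContentValidation}`);
console.log(`After quality filter: ${stats.afterQualityFilter}`);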

quickScrape(url)

Fast URL-only extraction (no full content).

import { quickScrape } from '@tyronerossjr/blog-scraper';

const urls = await quickScrape('https://example.com/blog');
// Returns: ['url1', 'url2', 'url3', ...]

Modular API (Advanced)

Use individual components for granular control.

Content Extraction

import { ContentExtractor } from '@tyronerossjr/blog-scraper';

const extractor = new ContentExtractor();
const content = await extractor.extractContent('https://example.com/article');

console.log(content.title);
console.log(content.textContent);
console.log(content.wordCount);
console.log(content.readingTime);

Quality Scoring

import { calculateArticleQualityScore, getQualityBreakdown } from '@tyronerossjr/blog-scraper';

const score = calculateArticleQualityScore(extractedContent);
console.log(`Quality score: ${score}`); // 0-1

// Get detailed breakdown
const breakdown = getQualityBreakdown(extractedContent);
console.log(breakdown);
// {
//   contentValidation: 0.6,
//   publishedDate: 0.12,
//   author: 0.08,
//   schema: 0.08,
//   readingTime: 0.12,
//   total: 1.0,
//   passesThreshold: true
// }

Custom Quality Configuration

import { calculateArticleQualityScore } from '@tyronerossjr/blog-scraper';

const score = calculateArticleQualityScore(content, {
  contentWeight: 0.8,        // Increase content importance
  dateWeight: 0.05,          // Decrease date importance
  authorWeight: 0.05,
  schemaWeight: 0.05,
  readingTimeWeight: 0.05,
  threshold: 0.7             // Stricter threshold
});

RSS Discovery

import { RSSDiscovery } from '@tyronerossjr/blog-scraper';

const discovery = new RSSDiscovery();
const feeds = await discovery.discoverFeeds('https://example.com');

feeds.forEach(feed => {
  console.log(feed.url);
  console.log(feed.title);
  console.log(feed.confidence); // 0-1
});

Sitemap Parsing

import { SitemapParser } from '@tyronerossjr/blog-scraper';

const parser = new SitemapParser();
const entries = await parser.parseSitemap('https://example.com/sitemap.xml');

entries.forEach(entry => {
  console.log(entry.url);
  console.log(entry.lastmod);
  console.log(entry.priority);
});

HTML Scraping

import { HTMLScraper } from '@tyronerossjr/blog-scraper';

const scraper = new HTMLScraper();
const articles = await scraper.extractFromPage('https://example.com/blog', {
  selectors: {
    articleLinks: ['article a', '.post-link'],
    titleSelectors: ['h1', '.post-title'],
    dateSelectors: ['time', '.published-date']
  },
  filters: {
    minTitleLength: 10,
    maxTitleLength: 200
  }
});

Rate Limiting

import { ScrapingRateLimiter } from '@tyronerossjr/blog-scraper';

// Create custom rate limiter
const limiter = new ScrapingRateLimiter({
  maxConcurrent: 5,
  minTime: 1000  // 1 second between requests
});

// Use in your scraping logic
await limiter.execute('example.com', async () => {
  // Your scraping code here
});
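
In practice, the limiter can wrap per-domain extraction calls. A sketch combining it with ContentExtractor (the domain and URL are illustrative):

import { ScrapingRateLimiter, ContentExtractor } from '@tyronerossjr/blog-scraper';

const limiter = new ScrapingRateLimiter({ maxConcurrent: 2, minTime: 1000 });
const extractor = new ContentExtractor();

// Throttle requests so example.com sees at most one per second
const content = await limiter.execute('example.com', () =>
  extractor.extractContent('https://example.com/article')
);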

Circuit Breaker

import { CircuitBreaker } from '@tyronerossjr/blog-scraper';

const breaker = new CircuitBreaker('my-operation', {
  failureThreshold: 5,
  resetTimeout: 60000  // 1 minute
});

const result = await breaker.execute(async () => {
  // Your operation here
});
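
The breaker composes with the high-level API as well. A sketch that stops retrying a repeatedly failing source (the operation name and URL are illustrative):

import { CircuitBreaker, scrape } from '@tyronerossjr/blog-scraper';

const breaker = new CircuitBreaker('scrape-example', {
  failureThreshold: 5,
  resetTimeout: 60000
});

// After 5 consecutive failures, calls short-circuit until the 1-minute reset elapses
const result = await breaker.execute(() => scrape('https://example.com/blog'));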

Examples

Example 1: Scrape with Custom Deny Patterns

import { scrape } from '@tyronerossjr/blog-scraper';

const result = await scrape('https://techcrunch.com', {
  denyPaths: [
    '/',
    '/about',
    '/contact',
    '/tag/*',      // Exclude all tag pages
    '/author/*'    // Exclude all author pages
  ],
  maxArticles: 20
});

Example 2: Build Custom Pipeline

import {
  SourceOrchestrator,
  ContentExtractor,
  calculateArticleQualityScore
} from '@tyronerossjr/blog-scraper';

// Step 1: Discover articles
const orchestrator = new SourceOrchestrator();
const discovered = await orchestrator.processSource('https://example.com', {
  sourceType: 'auto'
});

// Step 2: Extract content
const extractor = new ContentExtractor();
const extracted = await Promise.all(
  discovered.articles
    .slice(0, 10)
    .map(a => extractor.extractContent(a.url))
);

// Step 3: Score and filter
const scored = extracted
  .filter(Boolean)
  .map(content => ({
    content,
    score: calculateArticleQualityScore(content!)
  }))
  .filter(item => item.score >= 0.7);

console.log(`Found ${scored.length} high-quality articles`);

Example 3: RSS-Only Scraping

import { scrape } from '@tyronerossjr/blog-scraper';

const result = await scrape('https://example.com', {
  sourceType: 'rss',           // Only use RSS feeds
  extractFullContent: false,   // Don't extract full content
  maxArticles: 100
});

How It Works

3-Tier Filtering System

Tier 1: URL Deny Patterns

  • Fast pattern-based filtering
  • Excludes non-article pages (/, /about, /tag/*, etc.)
  • Customizable patterns

Tier 2: Content Validation

  • Minimum 200 characters
  • Title length 10-200 characters
  • Text-to-HTML ratio ≥ 10%
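
A minimal sketch of these rules as a standalone check (a hypothetical helper, not part of the package API):

function passesContentValidation(text: string, title: string, html: string): boolean {
  if (text.length < 200) return false;                        // minimum 200 characters
  if (title.length < 10 || title.length > 200) return false;  // title length 10-200
  return text.length / html.length >= 0.1;                    // text-to-HTML ratio ≥ 10%
}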

Tier 3: Quality Scoring

  • Content quality: 60% weight
  • Publication date: 12% weight
  • Author/byline: 8% weight
  • Schema.org metadata: 8% weight
  • Reading time: 12% weight
  • Default threshold: 50%
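
To make the weighting concrete, here is a hypothetical worked example (illustrative values, not the package's internal code):

// Default Tier 3 weights from the list above
const weights = {
  content: 0.60,      // content quality
  date: 0.12,         // publication date
  author: 0.08,       // author/byline
  schema: 0.08,       // Schema.org metadata
  readingTime: 0.12   // reading time
};

// Suppose an article has strong content but no byline; each signal scores 0-1
const score =
  0.9 * weights.content +     // 0.54
  1.0 * weights.date +        // 0.12
  0.0 * weights.author +      // 0.00
  1.0 * weights.schema +      // 0.08
  1.0 * weights.readingTime;  // 0.12
// score = 0.86 ≥ 0.5 default threshold → article passes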

Auto-Detection Flow

  1. Try RSS feed (highest confidence)
  2. Discover RSS feeds from HTML
  3. Try sitemap parsing
  4. Discover sitemaps from domain
  5. Fall back to HTML link extraction
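
The outcome of this fallback chain is reported on the result object, so you can see which strategy succeeded (assuming 'auto' detection, as in the scrape example above):

import { scrape } from '@tyronerossjr/blog-scraper';

const result = await scrape('https://example.com');
console.log(result.detectedType); // 'rss' | 'sitemap' | 'html'
console.log(result.confidence);   // 'high' | 'medium' | 'low'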

TypeScript Support

Full TypeScript support with exported types:

import type {
  ScrapedArticle,
  ScraperTestResult,
  ScrapeOptions,
  ExtractedContent,
  QualityScoreConfig
} from '@tyronerossjr/blog-scraper';
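
For example, options and results can be typed end to end (a sketch assuming ScrapeOptions matches the options object shown under scrape(url, options?)):

import { scrape } from '@tyronerossjr/blog-scraper';
import type { ScrapeOptions, ScrapedArticle } from '@tyronerossjr/blog-scraper';

const options: ScrapeOptions = { maxArticles: 10 };
const result = await scrape('https://example.com/blog', options);
const articles: ScrapedArticle[] = result.articles;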

Performance

  • Single article extraction: ~2-5 seconds
  • Bundle size: ~150 KB (gzipped)
  • Memory usage: ~100 MB average
  • No external APIs: Zero API costs

Requirements

  • Node.js ≥ 18.0.0
  • No environment variables needed

License

MIT © Tyrone Ross

Contributing

Contributions welcome! Please open an issue or PR.

Built with ❤️ using Mozilla Readability