
@tyroneross/blog-content-scraper

v0.3.0


Intelligent web scraper SDK for extracting blog/news content. Supports RSS, sitemaps, and HTML with automatic detection.

Readme

Blog Content Scraper

Intelligent web scraper for extracting blog/news content from any website. Includes both a web UI for testing and a programmatic SDK for integration.

Quick Start (SDK)

import { scrapeWebsite } from './lib';

const result = await scrapeWebsite('https://techcrunch.com', {
  maxArticles: 5,
  extractFullContent: true
});

for (const article of result.articles) {
  console.log(article.title, article.qualityScore);
}

See SDK Documentation below for full API reference.


Web UI

Standalone web application for testing web scraping with intelligent content filtering. Built with Next.js, Mozilla Readability, and zero LLM dependencies.

Features

  • No configuration needed - Works immediately
  • 🎯 3-tier filtering - URL patterns → content validation → quality scoring
  • Fast - Mozilla Readability (92.2% F1 score)
  • 📊 Detailed stats - See filtering pipeline in action
  • 🎨 Clean UI - Built with Tailwind CSS
  • 🚀 Deploy anywhere - Vercel, Netlify, Docker, etc.

Quick Start

Local Development

  1. Install dependencies:
     npm install
  2. Run dev server:
     npm run dev
  3. Open browser:
     http://localhost:3000

Deployment

Vercel (Recommended)

  1. Install Vercel CLI:
     npm install -g vercel
  2. Deploy:
     vercel
  3. Production deploy:
     vercel --prod

Netlify

  1. Build command:
     npm run build
  2. Publish directory:
     .next
  3. Deploy:
     netlify deploy --prod

Docker

Dockerfile:

FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]

Build and run:

docker build -t scraper-app .
docker run -p 3000:3000 scraper-app

How It Works

3-Tier Filtering System

Tier 1: URL Deny Patterns

  • Blocks /, /about, /careers, /contact, /tag/*, etc.
  • Fast, pattern-based filtering

Tier 2: Content Validation

  • Minimum 200 characters
  • Title length 10-200 characters
  • Text-to-HTML ratio ≥ 10%

Tier 3: Metadata Scoring

  • Content quality: 60% weight
  • Publication date: 12% weight
  • Author/byline: 8% weight
  • Schema.org metadata: 8% weight
  • Reading time (2+ min): 12% weight
  • Default threshold: 50%
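The Tier 3 weights above combine into a single weighted score. The sketch below shows one way to express that computation; the interface and function names are illustrative, not the SDK's actual API.

```typescript
// Per-article signals feeding the Tier 3 score (illustrative shape).
interface ArticleSignals {
  contentQuality: number;   // 0-1, from extraction heuristics
  hasPublishedDate: boolean;
  hasAuthor: boolean;
  hasSchemaOrg: boolean;
  readingTimeOk: boolean;   // true if estimated reading time is 2+ min
}

// Weighted sum using the documented weights (60/12/8/8/12).
function metadataScore(s: ArticleSignals): number {
  return (
    0.6 * s.contentQuality +
    0.12 * (s.hasPublishedDate ? 1 : 0) +
    0.08 * (s.hasAuthor ? 1 : 0) +
    0.08 * (s.hasSchemaOrg ? 1 : 0) +
    0.12 * (s.readingTimeOk ? 1 : 0)
  );
}

const score = metadataScore({
  contentQuality: 0.8,
  hasPublishedDate: true,
  hasAuthor: false,
  hasSchemaOrg: true,
  readingTimeOk: true,
});
// 0.48 + 0.12 + 0 + 0.08 + 0.12 = 0.8, above the default 0.5 threshold
console.log(score);
```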

Technology Stack

  • Next.js 15 - React framework
  • TypeScript - Type safety
  • Tailwind CSS - Styling
  • Mozilla Readability - Content extraction
  • JSDOM - HTML parsing
  • Zod - Schema validation
  • Lucide React - Icons

Project Structure

scraper-app/
├── app/
│   ├── api/scraper-test/      # API route
│   │   └── route.ts
│   ├── layout.tsx              # Root layout
│   ├── page.tsx                # Homepage
│   └── globals.css             # Global styles
├── components/
│   ├── ScraperTester.tsx       # Main UI component
│   └── ScraperResults.tsx      # Results display
├── lib/
│   ├── types.ts                # TypeScript types
│   ├── quality-scorer.ts       # Quality scoring logic
│   └── content-extractor.ts    # Content extraction
├── public/                     # Static assets
├── package.json
├── tsconfig.json
├── tailwind.config.ts
└── next.config.js

Environment Variables

No environment variables required! The app works out of the box.

Performance

  • Single article: ~2-5 seconds
  • Bundle size: ~150 KB (gzipped)
  • Zero API costs: No external APIs used
  • Memory: ~100 MB average

Testing

F1 Score Validation

The 92.2% F1 score claim for Mozilla Readability is validated through automated testing using two approaches:

1. Dragnet Benchmark Dataset (Recommended)

Uses the established Dragnet benchmark dataset - a well-documented, peer-reviewed dataset used in academic research:

npm run test:f1:dragnet

Results: 91.4% F1 score (0.8 percentage points below the claimed 92.2%)

  • 📊 Dataset: 414 test articles (20 tested for efficiency)
  • 📚 Source: Published research (2013)
  • ✅ 100% extraction success rate
  • 📈 92.6% Precision, 92.3% Recall

2. Custom Test Dataset

Quick validation with curated test articles:

npm run test:f1

Results: 96.3% F1 score

  • 3 manually-labeled test articles
  • Useful for quick validation and development

What is F1 Score?

  • Precision: % of extracted content that is actually article content (not ads/navigation)
  • Recall: % of actual article content that was successfully extracted
  • F1 Score: Harmonic mean of precision and recall
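The harmonic mean above can be computed directly. The numbers here are illustrative, not the benchmark's actual measurements:

```typescript
// F1 is the harmonic mean of precision and recall (both in 0-1).
function f1(precision: number, recall: number): number {
  return (2 * precision * recall) / (precision + recall);
}

// Example: 90% precision, 95% recall
const score = f1(0.9, 0.95);
console.log(score.toFixed(3)); // 0.924
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a scraper cannot hide poor recall behind high precision or vice versa.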

Conclusion: The 92.2% F1 claim is validated using the established Dragnet benchmark dataset (91.4% achieved).

See tests/README.md for detailed testing documentation and how to add new test cases.

License

MIT

Contributing

Contributions welcome! Areas for improvement:

  • RSS/Sitemap discovery
  • Batch URL processing
  • Export functionality (CSV, JSON)
  • Custom quality scoring
  • Dark mode

Support

  • Issues: https://github.com/tyroneross/scraper-app/issues
  • Questions: Open a discussion

SDK Documentation

The SDK provides programmatic access to the scraping engine without the web UI.

Installation

npm install

Basic Usage

import { scrapeWebsite } from './lib';

const result = await scrapeWebsite('https://example.com/blog', {
  maxArticles: 10,           // Max articles to return (default: 10)
  extractFullContent: true,  // Get full article text (default: true)
  qualityThreshold: 0.5,     // Min quality score 0-1 (default: 0.5)
  sourceType: 'auto',        // 'auto' | 'rss' | 'sitemap' | 'html'
  allowPaths: ['/blog/*'],   // Only scrape these paths
  denyPaths: ['/about'],     // Skip these paths
  onProgress: (done, total) => console.log(`${done}/${total}`)
});

Response Format

{
  url: string;
  detectedType: 'rss' | 'sitemap' | 'html';
  articles: Array<{
    url: string;
    title: string;
    publishedDate: string;
    description?: string;
    fullContent?: string;          // Raw HTML
    fullContentMarkdown?: string;  // Formatted markdown
    fullContentText?: string;      // Plain text
    qualityScore: number;          // 0-1
    confidence: number;
    source: 'rss' | 'sitemap' | 'html';
  }>;
  stats: {
    totalDiscovered: number;
    afterQualityFilter: number;
    processingTime: number;
  };
  errors: string[];
}
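A common way to consume the response shape above is to filter by qualityScore and sort best-first. This helper is a hypothetical sketch, not part of the SDK:

```typescript
// Minimal slice of the article shape documented above.
interface ScrapedArticle {
  url: string;
  title: string;
  qualityScore: number; // 0-1
}

// Keep articles at or above a minimum score, highest score first.
function topArticles(
  articles: ScrapedArticle[],
  minScore = 0.5
): ScrapedArticle[] {
  return articles
    .filter((a) => a.qualityScore >= minScore)
    .sort((a, b) => b.qualityScore - a.qualityScore);
}

const best = topArticles([
  { url: 'https://example.com/a', title: 'A', qualityScore: 0.9 },
  { url: 'https://example.com/b', title: 'B', qualityScore: 0.4 },
  { url: 'https://example.com/c', title: 'C', qualityScore: 0.7 },
]);
console.log(best.map((a) => a.title)); // ['A', 'C']
```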

Advanced: Direct Orchestrator

import { globalSourceOrchestrator } from './lib';

const result = await globalSourceOrchestrator.processSource(url, {
  sourceType: 'auto',
  allowPaths: ['/news/*'],
  denyPaths: ['/about', '/careers/*']
});

// Enhance with full content (parallel processing)
const enhanced = await globalSourceOrchestrator.enhanceWithFullContent(
  result.articles,
  10,
  { concurrency: 5, onProgress: (done, total) => {} }
);

Rate Limiter Presets

import { createRateLimiter } from './lib';

const limiter = createRateLimiter('moderate'); // or 'conservative', 'aggressive'

| Preset       | Req/s | Max Concurrent | Per Host |
|--------------|-------|----------------|----------|
| conservative | 1     | 10             | 2        |
| moderate     | 2     | 20             | 3        |
| aggressive   | 4     | 30             | 5        |

Path Patterns

'/blog/*'      // Matches /blog/anything
'/news/2024/*' // Matches /news/2024/anything
'/about'       // Exact match

Default deny patterns: /, /about/*, /careers/*, /contact/*, /tag/*, /category/*, /login, /signup, /pricing/*
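The pattern semantics above can be sketched as follows, assuming a trailing `*` matches any non-empty suffix and everything else is an exact match; this is an illustration, not the SDK's actual matcher:

```typescript
// '/blog/*' matches '/blog/anything'; '/about' matches only '/about'.
function matchesPath(pattern: string, path: string): boolean {
  if (pattern.endsWith('/*')) {
    const prefix = pattern.slice(0, -1); // keep the trailing '/'
    return path.startsWith(prefix) && path.length > prefix.length;
  }
  return path === pattern;
}

console.log(matchesPath('/blog/*', '/blog/my-post')); // true
console.log(matchesPath('/blog/*', '/blog'));         // false (no suffix)
console.log(matchesPath('/about', '/about'));         // true
console.log(matchesPath('/about', '/about/team'));    // false
```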

Quality Scoring

Score weights:

  • Content quality: 60%
  • Publication date: 12%
  • Author/byline: 8%
  • Schema.org data: 8%
  • Reading time: 12%

Built with ❤️ using Mozilla Readability