@tyroneross/blog-content-scraper
v0.3.0
Blog Content Scraper
Intelligent web scraper for extracting blog/news content from any website. Includes both a web UI for testing and a programmatic SDK for integration.
Quick Start (SDK)
```typescript
import { scrapeWebsite } from './lib';

const result = await scrapeWebsite('https://techcrunch.com', {
  maxArticles: 5,
  extractFullContent: true
});

for (const article of result.articles) {
  console.log(article.title, article.qualityScore);
}
```

See SDK Documentation below for the full API reference.
Web UI
Standalone web application for testing web scraping with intelligent content filtering. Built with Next.js, Mozilla Readability, and zero LLM dependencies.
Features
- ✅ No configuration needed - Works immediately
- 🎯 3-tier filtering - URL patterns → content validation → quality scoring
- ⚡ Fast - Mozilla Readability (92.2% F1 score)
- 📊 Detailed stats - See filtering pipeline in action
- 🎨 Clean UI - Built with Tailwind CSS
- 🚀 Deploy anywhere - Vercel, Netlify, Docker, etc.
Quick Start
Local Development
- Install dependencies: `npm install`
- Run the dev server: `npm run dev`
- Open http://localhost:3000 in your browser.

Deployment
Vercel (Recommended)
- Install the Vercel CLI: `npm install -g vercel`
- Deploy: `vercel`
- Production deploy: `vercel --prod`

Netlify

- Build command: `npm run build`
- Publish directory: `.next`
- Deploy: `netlify deploy --prod`

Docker
```dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]
```

```bash
docker build -t scraper-app .
docker run -p 3000:3000 scraper-app
```

How It Works
3-Tier Filtering System
Tier 1: URL Deny Patterns
- Blocks `/`, `/about`, `/careers`, `/contact`, `/tag/*`, etc.
- Fast, pattern-based filtering
Tier 2: Content Validation
- Minimum 200 characters
- Title length 10-200 characters
- Text-to-HTML ratio ≥ 10%
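The Tier 2 rules can be expressed as a small predicate. The sketch below is illustrative only; the function name and signature are hypothetical, not the SDK's actual validator:

```typescript
// Hypothetical Tier 2 validator matching the rules listed above.
function passesContentValidation(html: string, text: string, title: string): boolean {
  if (text.length < 200) return false;                       // minimum 200 characters
  if (title.length < 10 || title.length > 200) return false; // title length 10-200
  const ratio = text.length / Math.max(html.length, 1);      // text-to-HTML ratio
  return ratio >= 0.1;                                       // must be >= 10%
}
```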
Tier 3: Metadata Scoring
- Content quality: 60% weight
- Publication date: 12% weight
- Author/byline: 8% weight
- Schema.org metadata: 8% weight
- Reading time (2+ min): 12% weight
- Default threshold: 50%
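The Tier 3 weights above amount to a weighted sum. This is a minimal sketch of that arithmetic; the interface and function names are illustrative, not the actual `lib/quality-scorer.ts` API:

```typescript
// Hypothetical signals an article might carry after extraction.
interface QualitySignals {
  contentQuality: number;    // 0-1, from text analysis (60% weight)
  hasPublishedDate: boolean; // 12% weight
  hasAuthor: boolean;        // 8% weight
  hasSchemaOrg: boolean;     // 8% weight
  readingMinutes: number;    // 12% weight when >= 2 minutes
}

// Weighted sum matching the percentages listed above.
function scoreArticle(s: QualitySignals): number {
  let score = 0.6 * s.contentQuality;
  if (s.hasPublishedDate) score += 0.12;
  if (s.hasAuthor) score += 0.08;
  if (s.hasSchemaOrg) score += 0.08;
  if (s.readingMinutes >= 2) score += 0.12;
  return score; // article passes when score >= 0.5 (default threshold)
}
```

An article with perfect content quality and every metadata signal scores 1.0; one with middling content and no metadata scores 0.3 and is filtered out at the default 50% threshold.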
Technology Stack
- Next.js 15 - React framework
- TypeScript - Type safety
- Tailwind CSS - Styling
- Mozilla Readability - Content extraction
- JSDOM - HTML parsing
- Zod - Schema validation
- Lucide React - Icons
Project Structure
```
scraper-app/
├── app/
│   ├── api/scraper-test/      # API route
│   │   └── route.ts
│   ├── layout.tsx             # Root layout
│   ├── page.tsx               # Homepage
│   └── globals.css            # Global styles
├── components/
│   ├── ScraperTester.tsx      # Main UI component
│   └── ScraperResults.tsx     # Results display
├── lib/
│   ├── types.ts               # TypeScript types
│   ├── quality-scorer.ts      # Quality scoring logic
│   └── content-extractor.ts   # Content extraction
├── public/                    # Static assets
├── package.json
├── tsconfig.json
├── tailwind.config.ts
└── next.config.js
```

Environment Variables
No environment variables required! The app works out of the box.
Performance
- Single article: ~2-5 seconds
- Bundle size: ~150 KB (gzipped)
- Zero API costs: No external APIs used
- Memory: ~100 MB average
Testing
F1 Score Validation
The 92.2% F1 score claim for Mozilla Readability is validated through automated testing using two approaches:
1. Dragnet Benchmark Dataset (Recommended)
Uses the established Dragnet benchmark dataset - a well-documented, peer-reviewed dataset used in academic research:
```bash
npm run test:f1:dragnet
```

Results: 91.4% F1 score (0.8 percentage points below the claimed 92.2%)
- 📊 Dataset: 414 test articles (a 20-article subset is run for speed)
- 📚 Source: Published research (2013)
- ✅ 100% extraction success rate
- 📈 92.6% Precision, 92.3% Recall
2. Custom Test Dataset
Quick validation with curated test articles:
```bash
npm run test:f1
```

Results: 96.3% F1 score
- 3 manually-labeled test articles
- Useful for quick validation and development
What is F1 Score?
- Precision: % of extracted content that is actually article content (not ads/navigation)
- Recall: % of actual article content that was successfully extracted
- F1 Score: Harmonic mean of precision and recall
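These definitions reduce to a few lines of arithmetic. The helper below is illustrative only; the real benchmark harness lives under tests/:

```typescript
// Compute F1 from counts of correctly extracted, spuriously extracted,
// and missed content units (characters or tokens).
function f1Score(truePositives: number, falsePositives: number, falseNegatives: number): number {
  const precision = truePositives / (truePositives + falsePositives);
  const recall = truePositives / (truePositives + falseNegatives);
  return (2 * precision * recall) / (precision + recall);
}
```

Note that benchmark harnesses often report the mean of per-article F1 scores, which can differ slightly from the F1 computed from averaged precision and recall; that is why 92.6% precision and 92.3% recall can coexist with a 91.4% aggregate F1.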
Conclusion: The achieved 91.4% on the established Dragnet benchmark is within one percentage point of the published 92.2% claim.
See tests/README.md for detailed testing documentation and how to add new test cases.
License
MIT
Contributing
Contributions welcome! Areas for improvement:
- RSS/Sitemap discovery
- Batch URL processing
- Export functionality (CSV, JSON)
- Custom quality scoring
- Dark mode
Support
- Issues: https://github.com/tyroneross/scraper-app/issues
- Questions: Open a discussion
SDK Documentation
The SDK provides programmatic access to the scraping engine without the web UI.
Installation
```bash
npm install
```

Basic Usage
```typescript
import { scrapeWebsite } from './lib';

const result = await scrapeWebsite('https://example.com/blog', {
  maxArticles: 10,          // Max articles to return (default: 10)
  extractFullContent: true, // Get full article text (default: true)
  qualityThreshold: 0.5,    // Min quality score 0-1 (default: 0.5)
  sourceType: 'auto',       // 'auto' | 'rss' | 'sitemap' | 'html'
  allowPaths: ['/blog/*'],  // Only scrape these paths
  denyPaths: ['/about'],    // Skip these paths
  onProgress: (done, total) => console.log(`${done}/${total}`)
});
```

Response Format
```typescript
{
  url: string;
  detectedType: 'rss' | 'sitemap' | 'html';
  articles: Array<{
    url: string;
    title: string;
    publishedDate: string;
    description?: string;
    fullContent?: string;         // Raw HTML
    fullContentMarkdown?: string; // Formatted markdown
    fullContentText?: string;     // Plain text
    qualityScore: number;         // 0-1
    confidence: number;
    source: 'rss' | 'sitemap' | 'html';
  }>;
  stats: {
    totalDiscovered: number;
    afterQualityFilter: number;
    processingTime: number;
  };
  errors: string[];
}
```

Advanced: Direct Orchestrator
```typescript
import { globalSourceOrchestrator } from './lib';

const result = await globalSourceOrchestrator.processSource(url, {
  sourceType: 'auto',
  allowPaths: ['/news/*'],
  denyPaths: ['/about', '/careers/*']
});

// Enhance with full content (parallel processing)
const enhanced = await globalSourceOrchestrator.enhanceWithFullContent(
  result.articles,
  10,
  { concurrency: 5, onProgress: (done, total) => {} }
);
```

Rate Limiter Presets
```typescript
import { createRateLimiter } from './lib';

const limiter = createRateLimiter('moderate'); // or 'conservative', 'aggressive'
```

| Preset | Req/s | Max Concurrent | Per Host |
|--------------|---|----|---|
| conservative | 1 | 10 | 2 |
| moderate | 2 | 20 | 3 |
| aggressive | 4 | 30 | 5 |
Path Patterns
```typescript
'/blog/*'       // Matches /blog/anything
'/news/2024/*'  // Matches /news/2024/anything
'/about'        // Exact match
```

Default deny patterns: `/`, `/about/*`, `/careers/*`, `/contact/*`, `/tag/*`, `/category/*`, `/login`, `/signup`, `/pricing/*`
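The pattern semantics above can be sketched as a small matcher. This is a hypothetical illustration of the syntax, not the SDK's internal implementation:

```typescript
// Hypothetical matcher for the '/path/*' pattern syntax shown above.
function matchesPath(pattern: string, path: string): boolean {
  if (pattern.endsWith('/*')) {
    // '/blog/*' matches '/blog/anything' (including nested segments)
    return path.startsWith(pattern.slice(0, -1));
  }
  return path === pattern; // otherwise require an exact match
}
```

Under these semantics `'/blog/*'` matches `/blog/post` but not `/blog` itself, and `'/about'` matches only `/about`, not `/about/team`.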
Quality Scoring
Score weights:
- Content quality: 60%
- Publication date: 12%
- Author/byline: 8%
- Schema.org data: 8%
- Reading time: 12%
Built with ❤️ using Mozilla Readability
