@tyroneross/blog-content-scraper
v0.3.0
Blog Content Scraper
Intelligent web scraper for extracting blog/news content from any website. Includes both a web UI for testing and a programmatic SDK for integration.
Quick Start (SDK)
```typescript
import { scrapeWebsite } from './lib';

const result = await scrapeWebsite('https://techcrunch.com', {
  maxArticles: 5,
  extractFullContent: true
});

for (const article of result.articles) {
  console.log(article.title, article.qualityScore);
}
```

See SDK Documentation below for the full API reference.
Web UI
Standalone web application for testing web scraping with intelligent content filtering. Built with Next.js, Mozilla Readability, and zero LLM dependencies.
Features
- ✅ No configuration needed - Works immediately
- 🎯 3-tier filtering - URL patterns → content validation → quality scoring
- ⚡ Fast - Mozilla Readability (92.2% F1 score)
- 📊 Detailed stats - See filtering pipeline in action
- 🎨 Clean UI - Built with Tailwind CSS
- 🚀 Deploy anywhere - Vercel, Netlify, Docker, etc.
Quick Start
Local Development
- Install dependencies: `npm install`
- Run the dev server: `npm run dev`
- Open http://localhost:3000 in your browser.

Deployment
Vercel (Recommended)
- Install the Vercel CLI: `npm install -g vercel`
- Deploy: `vercel`
- Production deploy: `vercel --prod`

Netlify

- Build command: `npm run build`
- Publish directory: `.next`
- Deploy: `netlify deploy --prod`

Docker
```dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]
```

```bash
docker build -t scraper-app .
docker run -p 3000:3000 scraper-app
```

How It Works
3-Tier Filtering System
Tier 1: URL Deny Patterns
- Blocks `/`, `/about`, `/careers`, `/contact`, `/tag/*`, etc.
- Fast, pattern-based filtering
Tier 2: Content Validation
- Minimum 200 characters
- Title length 10-200 characters
- Text-to-HTML ratio ≥ 10%
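The Tier 2 rules can be expressed as a small predicate. The sketch below is illustrative only; the function name and signature are hypothetical, not the SDK's actual validator:

```typescript
// Hypothetical Tier 2 validator matching the rules listed above.
function passesContentValidation(html: string, text: string, title: string): boolean {
  if (text.length < 200) return false;                       // minimum 200 characters
  if (title.length < 10 || title.length > 200) return false; // title length 10-200
  const ratio = text.length / Math.max(html.length, 1);      // text-to-HTML ratio
  return ratio >= 0.1;                                       // must be >= 10%
}
```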
Tier 3: Metadata Scoring
- Content quality: 60% weight
- Publication date: 12% weight
- Author/byline: 8% weight
- Schema.org metadata: 8% weight
- Reading time (2+ min): 12% weight
- Default threshold: 50%
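The Tier 3 weights above amount to a weighted sum. This is a minimal sketch of that arithmetic; the interface and function names are illustrative, not the actual `lib/quality-scorer.ts` API:

```typescript
// Hypothetical signals an article might carry after extraction.
interface QualitySignals {
  contentQuality: number;    // 0-1, from text analysis (60% weight)
  hasPublishedDate: boolean; // 12% weight
  hasAuthor: boolean;        // 8% weight
  hasSchemaOrg: boolean;     // 8% weight
  readingMinutes: number;    // 12% weight when >= 2 minutes
}

// Weighted sum matching the percentages listed above.
function scoreArticle(s: QualitySignals): number {
  let score = 0.6 * s.contentQuality;
  if (s.hasPublishedDate) score += 0.12;
  if (s.hasAuthor) score += 0.08;
  if (s.hasSchemaOrg) score += 0.08;
  if (s.readingMinutes >= 2) score += 0.12;
  return score; // article passes when score >= 0.5 (default threshold)
}
```

An article with perfect content quality and every metadata signal scores 1.0; one with middling content and no metadata scores 0.3 and is filtered out at the default 50% threshold.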
Technology Stack
- Next.js 15 - React framework
- TypeScript - Type safety
- Tailwind CSS - Styling
- Mozilla Readability - Content extraction
- JSDOM - HTML parsing
- Zod - Schema validation
- Lucide React - Icons
Project Structure
```
scraper-app/
├── app/
│   ├── api/scraper-test/      # API route
│   │   └── route.ts
│   ├── layout.tsx             # Root layout
│   ├── page.tsx               # Homepage
│   └── globals.css            # Global styles
├── components/
│   ├── ScraperTester.tsx      # Main UI component
│   └── ScraperResults.tsx     # Results display
├── lib/
│   ├── types.ts               # TypeScript types
│   ├── quality-scorer.ts      # Quality scoring logic
│   └── content-extractor.ts   # Content extraction
├── public/                    # Static assets
├── package.json
├── tsconfig.json
├── tailwind.config.ts
└── next.config.js
```

Environment Variables
No environment variables required! The app works out of the box.
Performance
- Single article: ~2-5 seconds
- Bundle size: ~150 KB (gzipped)
- Zero API costs: No external APIs used
- Memory: ~100 MB average
Testing
F1 Score Validation
The 92.2% F1 score claim for Mozilla Readability is validated through automated testing using two approaches:
1. Dragnet Benchmark Dataset (Recommended)
Uses the established Dragnet benchmark dataset - a well-documented, peer-reviewed dataset used in academic research:
```bash
npm run test:f1:dragnet
```

Results: 91.4% F1 score (0.8 percentage points below the claimed 92.2%)
- 📊 Dataset: 414 test articles (a 20-article subset is run for speed)
- 📚 Source: Published research (2013)
- ✅ 100% extraction success rate
- 📈 92.6% Precision, 92.3% Recall
2. Custom Test Dataset
Quick validation with curated test articles:
```bash
npm run test:f1
```

Results: 96.3% F1 score
- 3 manually-labeled test articles
- Useful for quick validation and development
What is F1 Score?
- Precision: % of extracted content that is actually article content (not ads/navigation)
- Recall: % of actual article content that was successfully extracted
- F1 Score: Harmonic mean of precision and recall
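These definitions reduce to a few lines of arithmetic. The helper below is illustrative only; the real benchmark harness lives under tests/:

```typescript
// Compute F1 from counts of correctly extracted, spuriously extracted,
// and missed content units (characters or tokens).
function f1Score(truePositives: number, falsePositives: number, falseNegatives: number): number {
  const precision = truePositives / (truePositives + falsePositives);
  const recall = truePositives / (truePositives + falseNegatives);
  return (2 * precision * recall) / (precision + recall);
}
```

Note that benchmark harnesses often report the mean of per-article F1 scores, which can differ slightly from the F1 computed from averaged precision and recall; that is why 92.6% precision and 92.3% recall can coexist with a 91.4% aggregate F1.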
Conclusion: The achieved 91.4% on the established Dragnet benchmark is within one percentage point of the published 92.2% claim.
See tests/README.md for detailed testing documentation and how to add new test cases.
License
MIT
Contributing
Contributions welcome! Areas for improvement:
- RSS/Sitemap discovery
- Batch URL processing
- Export functionality (CSV, JSON)
- Custom quality scoring
- Dark mode
Support
- Issues: https://github.com/tyroneross/scraper-app/issues
- Questions: Open a discussion
SDK Documentation
The SDK provides programmatic access to the scraping engine without the web UI.
Installation
```bash
npm install
```

Basic Usage
```typescript
import { scrapeWebsite } from './lib';

const result = await scrapeWebsite('https://example.com/blog', {
  maxArticles: 10,          // Max articles to return (default: 10)
  extractFullContent: true, // Get full article text (default: true)
  qualityThreshold: 0.5,    // Min quality score 0-1 (default: 0.5)
  sourceType: 'auto',       // 'auto' | 'rss' | 'sitemap' | 'html'
  allowPaths: ['/blog/*'],  // Only scrape these paths
  denyPaths: ['/about'],    // Skip these paths
  onProgress: (done, total) => console.log(`${done}/${total}`)
});
```

Response Format
```typescript
{
  url: string;
  detectedType: 'rss' | 'sitemap' | 'html';
  articles: Array<{
    url: string;
    title: string;
    publishedDate: string;
    description?: string;
    fullContent?: string;         // Raw HTML
    fullContentMarkdown?: string; // Formatted markdown
    fullContentText?: string;     // Plain text
    qualityScore: number;         // 0-1
    confidence: number;
    source: 'rss' | 'sitemap' | 'html';
  }>;
  stats: {
    totalDiscovered: number;
    afterQualityFilter: number;
    processingTime: number;
  };
  errors: string[];
}
```

Advanced: Direct Orchestrator
```typescript
import { globalSourceOrchestrator } from './lib';

const result = await globalSourceOrchestrator.processSource(url, {
  sourceType: 'auto',
  allowPaths: ['/news/*'],
  denyPaths: ['/about', '/careers/*']
});

// Enhance with full content (parallel processing)
const enhanced = await globalSourceOrchestrator.enhanceWithFullContent(
  result.articles,
  10,
  { concurrency: 5, onProgress: (done, total) => {} }
);
```

Rate Limiter Presets
```typescript
import { createRateLimiter } from './lib';

const limiter = createRateLimiter('moderate'); // or 'conservative', 'aggressive'
```

| Preset | Req/s | Max Concurrent | Per Host |
|--------------|---|----|---|
| conservative | 1 | 10 | 2 |
| moderate | 2 | 20 | 3 |
| aggressive | 4 | 30 | 5 |
Path Patterns
```typescript
'/blog/*'       // Matches /blog/anything
'/news/2024/*'  // Matches /news/2024/anything
'/about'        // Exact match
```

Default deny patterns: `/`, `/about/*`, `/careers/*`, `/contact/*`, `/tag/*`, `/category/*`, `/login`, `/signup`, `/pricing/*`
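The pattern semantics above can be sketched as a small matcher. This is a hypothetical illustration of the syntax, not the SDK's internal implementation:

```typescript
// Hypothetical matcher for the '/path/*' pattern syntax shown above.
function matchesPath(pattern: string, path: string): boolean {
  if (pattern.endsWith('/*')) {
    // '/blog/*' matches '/blog/anything' (including nested segments)
    return path.startsWith(pattern.slice(0, -1));
  }
  return path === pattern; // otherwise require an exact match
}
```

Under these semantics `'/blog/*'` matches `/blog/post` but not `/blog` itself, and `'/about'` matches only `/about`, not `/about/team`.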
Quality Scoring
Score weights:
- Content quality: 60%
- Publication date: 12%
- Author/byline: 8%
- Schema.org data: 8%
- Reading time: 12%
Built with ❤️ using Mozilla Readability
