@xmer/rss-reader
v1.0.0
A robust TypeScript RSS feed reader with Puppeteer-based scraping for JavaScript-rendered content. Designed to handle challenging RSS feeds that require full browser execution, with built-in browser pool management, link extraction, and configurable content parsing.
Key Features
- JavaScript-Rendered Feed Support: Uses Puppeteer to handle feeds that require browser execution
- Browser Pool Management: Maintains warm browser instances for <5s feed fetch latency
- Comprehensive Link Extraction: Extracts all links from HTML content (HTTP, HTTPS, magnet, etc.)
- Configurable Title Cleaning: Remove site-specific suffixes/prefixes with regex patterns
- Retry Logic: Automatic retry with exponential backoff for transient failures
- TypeScript-First: Full type safety with detailed interface definitions
- Error Handling: Comprehensive error hierarchy for precise error handling
- Memory Efficient: Configurable browser pool size to balance performance and resource usage
Table of Contents
- Installation
- Quick Start
- FitGirl Repacks Example
- Configuration Options
- API Reference
- Error Handling
- Security Considerations
- Performance
- Docker/CI Deployment
- Examples
- Contributing
- License
Installation
npm install @xmer/rss-reader
Requirements
- Node.js 18.0.0 or higher
- Sufficient system resources for Puppeteer (default: 2 browser instances)
Dependencies
All dependencies are bundled with the package:
- puppeteer ^24.34.0
- cheerio ^1.0.0-rc.12
- xml2js ^0.6.2
Quick Start
import { RssReader } from '@xmer/rss-reader';
// Create reader instance
const reader = new RssReader();
// Initialize browser pool (REQUIRED)
await reader.initialize();
try {
// Fetch and parse feed
const feed = await reader.fetchAndParse('https://example.com/feed.xml');
console.log(`Feed: ${feed.title}`);
console.log(`Items: ${feed.items.length}`);
// Access parsed items
for (const item of feed.items) {
console.log(`${item.title} - ${item.links.length} links`);
console.log(`Published: ${item.publishedAt.toISOString()}`);
}
} finally {
// CRITICAL: Always close to prevent zombie browsers
await reader.close();
}
FitGirl Repacks Example
Primary use case: Parsing FitGirl Repacks feed with title cleaning and magnet link extraction.
import { RssReader, FeedFetchError, ParseError, type RssItem } from '@xmer/rss-reader';
const reader = new RssReader({
// Remove site branding from titles
titleSuffixPattern: /[–-]\s*FitGirl Repacks/,
// Extract WordPress-style item IDs
itemIdPattern: /[?&]p=(\d+)/,
// Increase pool size for better throughput
browserPoolSize: 3,
// More retries for potentially flaky connections
retryAttempts: 5,
// Longer timeout for slow-loading feeds
timeout: 60000
});
await reader.initialize();
try {
const feed = await reader.fetchAndParse('https://fitgirl-repacks.site/feed/');
for (const item of feed.items) {
// Title is cleaned: "Game Name – FitGirl Repacks" -> "Game Name"
console.log(`Game: ${item.title}`);
console.log(`Post ID: ${item.itemId}`);
// Extract magnet links
const magnetLinks = item.links.filter(link => link.startsWith('magnet:'));
console.log(`Magnet links: ${magnetLinks.length}`);
if (magnetLinks.length > 0) {
console.log(`Download: ${magnetLinks[0]}`);
}
}
} catch (error) {
if (error instanceof FeedFetchError) {
console.error(`Failed to fetch feed: ${error.message}`);
} else if (error instanceof ParseError) {
console.error(`Failed to parse item: ${error.itemTitle}`);
}
} finally {
await reader.close();
}
Configuration Options
Configure RssReader behavior by passing options to the constructor:
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| titleSuffixPattern | RegExp | undefined | Pattern to remove from end of titles (e.g., /[–-]\s*Site Name/) |
| titlePrefixPattern | RegExp | undefined | Pattern to remove from start of titles |
| itemIdPattern | RegExp | /[?&]p=(\d+)/ | Pattern to extract item ID from URLs (WordPress-style by default) |
| browserPoolSize | number | 2 | Number of browser instances in pool (balance performance vs resources) |
| retryAttempts | number | 3 | Maximum retry attempts for transient failures |
| timeout | number | 30000 | Timeout for feed fetching in milliseconds |
| linkValidationPattern | RegExp | undefined | Optional regex pattern to validate extracted links |
| disableSandbox | boolean | false | DANGER: Disable Chrome sandbox (only for Docker/CI) |
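The regex-driven options above can be tried with plain RegExp values before wiring them into a reader; no reader instance is needed. The URLs below are made-up examples:

```typescript
// The default itemIdPattern from the table above, applied directly.
const itemIdPattern = /[?&]p=(\d+)/;
const match = itemIdPattern.exec('https://example.com/?p=12345');
console.log(match?.[1]); // "12345"

// A feed with path-based URLs would need a custom pattern instead:
const pathPattern = /\/(\d+)\/$/;
console.log(pathPattern.exec('https://example.com/posts/678/')?.[1]); // "678"
```

Testing patterns this way makes it easy to confirm a site's URL format before passing the pattern as `itemIdPattern`.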
Configuration Examples
Basic configuration:
const reader = new RssReader({
browserPoolSize: 3,
timeout: 45000
});
Site-specific extraction:
const reader = new RssReader({
titleSuffixPattern: /\s*-\s*Blog Name$/,
titlePrefixPattern: /^\[News\]\s*/,
itemIdPattern: /\/(\d+)\/$/ // Extract from URL path
});
High-performance setup:
const reader = new RssReader({
browserPoolSize: 5,
retryAttempts: 5,
timeout: 60000
});
API Reference
RssReader Class
Main entry point for RSS feed reading.
Constructor
new RssReader(config?: RssReaderConfig)
Creates a new RssReader instance with optional configuration.
Parameters:
config (RssReaderConfig): Optional configuration object
Example:
const reader = new RssReader({
browserPoolSize: 2,
retryAttempts: 3
});
initialize()
async initialize(): Promise<void>
Initializes the browser pool. MUST be called before fetchAndParse().
Throws:
BrowserError: If browser pool initialization fails
Example:
await reader.initialize();
fetchAndParse()
async fetchAndParse(feedUrl: string): Promise<RssFeed>
Fetches and parses an RSS feed.
Parameters:
feedUrl (string): URL of the RSS feed to fetch
Returns:
Promise<RssFeed>: Parsed RSS feed with all items
Throws:
FeedFetchError: If fetching fails after all retries
InvalidFeedError: If feed structure is invalid
ParseError: If item parsing fails
BrowserError: If browser pool is not initialized
Example:
const feed = await reader.fetchAndParse('https://example.com/feed.xml');
console.log(`Found ${feed.items.length} items`);
close()
async close(): Promise<void>
Closes the browser pool and cleans up resources. MUST be called when done to prevent zombie browser processes.
Example:
try {
await reader.fetchAndParse(url);
} finally {
await reader.close();
}
getPoolStats()
getPoolStats(): BrowserPoolStats
Returns current browser pool statistics.
Returns:
BrowserPoolStats: Object containing total, available, and inUse counts
Example:
const stats = reader.getPoolStats();
console.log(`Pool: ${stats.inUse}/${stats.total} in use`);
isInitialized()
isInitialized(): boolean
Checks if the reader is initialized.
Returns:
boolean: true if initialized
Static Utilities
Utility methods for standalone use without creating a reader instance:
extractLinks()
static extractLinks(html: string): string[]
Extracts all links from HTML content.
Parameters:
html (string): HTML content to parse
Returns:
string[]: Array of extracted links
Example:
const links = RssReader.extractLinks('<a href="https://example.com">Link</a>');
extractItemId()
static extractItemId(url: string, pattern?: RegExp): string | undefined
Extracts item ID from URL using regex pattern.
Parameters:
url (string): URL to parse
pattern (RegExp): Optional regex pattern (default: WordPress-style)
Returns:
string | undefined: Extracted ID or undefined if not found
Example:
const id = RssReader.extractItemId('https://example.com/?p=12345');
// Returns: "12345"cleanTitle()
static cleanTitle(title: string, options?: TitleCleanOptions): stringCleans title by removing patterns and normalizing whitespace.
Parameters:
title(string): Title to cleanoptions(TitleCleanOptions): Optional cleaning options
Returns:
string: Cleaned title
Example:
const cleaned = RssReader.cleanTitle('Game – Site Name', {
suffixPattern: /[–-]\s*Site Name/
});
// Returns: "Game"validateLink()
static validateLink(link: string, pattern?: RegExp): booleanValidates link against optional pattern.
Parameters:
link(string): Link to validatepattern(RegExp): Optional validation pattern
Returns:
boolean:trueif valid
Type Interfaces
RssFeed
interface RssFeed {
title: string; // Feed title
feedUrl: string; // Feed URL
description?: string; // Feed description
items: RssItem[]; // Parsed items
fetchedAt: Date; // Fetch timestamp
}
RssItem
interface RssItem {
title: string; // Item title (cleaned)
link: string; // Item link/URL
publishedAt: Date; // Publication date
links: string[]; // All extracted links from content
itemId?: string; // Optional item ID extracted from URL
rawContent?: string; // Optional raw HTML content (UNSANITIZED)
metadata?: Record<string, unknown>; // Optional additional metadata
}
SECURITY WARNING: The rawContent field contains unsanitized HTML. See Security Considerations.
RssReaderConfig
See Configuration Options table above.
BrowserPoolStats
interface BrowserPoolStats {
total: number; // Total browsers in pool
available: number; // Available browsers
inUse: number; // Browsers currently in use
}
Error Handling
All errors extend the base RssReaderError class for easy identification.
Error Types
RssReaderError Base error class for all package errors.
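Because every error extends the shared base class, a single instanceof check on RssReaderError catches all package errors at once. The sketch below illustrates this; the class names come from the docs in this section, but the constructor signatures and fields are assumptions, not the package's actual source:

```typescript
// Hypothetical sketch of the error hierarchy; shapes are assumed.
class RssReaderError extends Error {
  constructor(message: string) {
    super(message);
    this.name = new.target.name; // preserves subclass name in logs
  }
}

class FeedFetchError extends RssReaderError {
  constructor(message: string, public readonly feedUrl: string) {
    super(message);
  }
}

// One base-class check distinguishes package errors from everything else:
const err: unknown = new FeedFetchError('timeout', 'https://example.com/feed.xml');
if (err instanceof RssReaderError) {
  console.log(`RSS reader failure: ${err.message}`);
}
```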
FeedFetchError Thrown when feed fetching fails (network issues, timeout, Puppeteer errors).
try {
await reader.fetchAndParse(url);
} catch (error) {
if (error instanceof FeedFetchError) {
console.error(`Failed to fetch ${error.feedUrl}: ${error.message}`);
// Retry with exponential backoff or alert
}
}
InvalidFeedError Thrown when feed structure is invalid or cannot be parsed as RSS/Atom.
catch (error) {
if (error instanceof InvalidFeedError) {
console.error(`Invalid feed at ${error.feedUrl}`);
// Skip this feed or alert administrator
}
}
ParseError Thrown when individual item parsing fails.
catch (error) {
if (error instanceof ParseError) {
console.error(`Failed to parse item: ${error.itemTitle}`);
// Log and continue with other items
}
}
BrowserError Thrown when browser operations fail (launch, crash, pool exhaustion).
catch (error) {
if (error instanceof BrowserError) {
console.error(`Browser error during ${error.operation}`);
// Reinitialize browser pool
await reader.close();
await reader.initialize();
}
}
LinkNotFoundError Thrown when no links are found in item content (may be expected for some feeds).
catch (error) {
if (error instanceof LinkNotFoundError) {
console.warn(`No links found in: ${error.itemTitle}`);
// Expected for some items - log and continue
}
}
Error Handling Best Practices
import {
RssReader,
FeedFetchError,
BrowserError,
ParseError
} from '@xmer/rss-reader';
const reader = new RssReader();
await reader.initialize();
try {
const feed = await reader.fetchAndParse(url);
// Process items
for (const item of feed.items) {
try {
await processItem(item);
} catch (itemError) {
// Log individual item failures but continue
console.error(`Failed to process ${item.title}:`, itemError);
}
}
} catch (error) {
if (error instanceof FeedFetchError) {
// Network/Puppeteer failure - retry or alert
console.error('Feed fetch failed:', error.message);
} else if (error instanceof BrowserError) {
// Browser crashed - reinitialize pool
console.error('Browser error:', error.message);
await reader.close();
await reader.initialize();
} else if (error instanceof ParseError) {
// Parse failure - log and skip
console.error('Parse error:', error.message);
} else {
// Unexpected error
console.error('Unexpected error:', error);
}
} finally {
// ALWAYS cleanup
await reader.close();
}
Security Considerations
CRITICAL: Chrome Sandbox
The disableSandbox option removes Chrome's process isolation security layer.
NEVER set disableSandbox: true in production environments unless:
- Running inside a Docker container with proper isolation
- Running in a CI/CD pipeline with ephemeral environments
- You fully understand the security implications
Why this is dangerous:
- Compromised websites can escape the browser and access your system
- Malicious content in RSS feeds can execute arbitrary code
- No process isolation between browser tabs
Safe usage in Docker:
const reader = new RssReader({
// Only safe inside Docker containers
disableSandbox: process.env.RUNNING_IN_DOCKER === 'true'
});
XSS Warning: rawContent Field
The rawContent field in RssItem contains unsanitized HTML from the RSS feed.
NEVER render this content directly:
// DANGEROUS - XSS vulnerability
element.innerHTML = item.rawContent;
Always sanitize before rendering:
import DOMPurify from 'dompurify';
// SAFE - sanitized HTML
const clean = DOMPurify.sanitize(item.rawContent);
element.innerHTML = clean;
Recommended sanitization libraries:
- DOMPurify - Client-side HTML sanitization
- sanitize-html - Server-side HTML sanitization
Production Security Best Practices
- Run in isolated containers: Use Docker with proper resource limits
- Sanitize all output: Never trust content from RSS feeds
- Validate links: Use linkValidationPattern to restrict allowed link formats
- Monitor resource usage: Browser instances consume significant memory
- Set reasonable timeouts: Prevent indefinite hangs on malicious feeds
- Log security events: Track failed fetches and suspicious content
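To illustrate the link-validation point above, here is a hypothetical allowlist pattern of the kind you might pass as linkValidationPattern, shown as a plain filter so it runs standalone (the URLs are examples):

```typescript
// Hypothetical allowlist: accept only https:// and magnet: links,
// rejecting plain-http and script URLs that feeds should never yield.
const linkValidationPattern = /^(https:\/\/|magnet:\?)/;

const extracted = [
  'https://example.com/post/1',
  'http://insecure.example.com',  // rejected: unencrypted http
  'magnet:?xt=urn:btih:abcdef',
  'javascript:alert(1)',          // rejected: script URL
];

const allowed = extracted.filter(link => linkValidationPattern.test(link));
console.log(allowed); // only the https and magnet entries survive
```

An allowlist (match what you accept) is safer here than a denylist, since feed content is untrusted and new hostile schemes are cheap to mint.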
Performance
Performance Targets
- Feed fetch: <5s (with warm browser pool)
- Parse per item: <100ms
- Memory: ~300MB with 2 browser pool
- Browser startup: ~2s cold start
- Max concurrent: Limited by browserPoolSize
Optimization Tips
Browser Pool Sizing:
// Low concurrency (1-2 feeds simultaneously)
browserPoolSize: 2 // Default, ~300MB memory
// Medium concurrency (3-5 feeds simultaneously)
browserPoolSize: 5 // ~750MB memory
// High concurrency (10+ feeds simultaneously)
browserPoolSize: 10 // ~1.5GB memory
Timeout Configuration:
// Fast, reliable feeds
timeout: 15000 // 15s
// Slow or unreliable feeds
timeout: 60000 // 60s
// Very slow feeds (use sparingly)
timeout: 120000 // 2 minutes
Retry Strategy:
// Reliable network
retryAttempts: 2
// Unreliable network or flaky feeds
retryAttempts: 5
// Critical feeds that must succeed
retryAttempts: 10
Memory Management
Monitor browser pool health:
const stats = reader.getPoolStats();
if (stats.inUse === stats.total) {
console.warn('Browser pool exhausted - consider increasing poolSize');
}
Monitoring
// Track fetch performance
const start = Date.now();
const feed = await reader.fetchAndParse(url);
const duration = Date.now() - start;
console.log(`Fetched ${feed.items.length} items in ${duration}ms`);
console.log(`Avg per item: ${duration / feed.items.length}ms`);
// Monitor pool utilization
const stats = reader.getPoolStats();
console.log(`Pool utilization: ${(stats.inUse / stats.total * 100).toFixed(1)}%`);
Docker/CI Deployment
Dockerfile Example
FROM node:18-alpine
# Install Chromium dependencies
RUN apk add --no-cache \
chromium \
nss \
freetype \
harfbuzz \
ca-certificates \
ttf-freefont
# Tell Puppeteer to skip installing Chrome (use system Chromium)
ENV PUPPETEER_SKIP_DOWNLOAD=true \
PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
RUNNING_IN_DOCKER=true
WORKDIR /app
# Copy package files
COPY package*.json ./
# Install dependencies
RUN npm ci --only=production
# Copy application code
COPY . .
# Build TypeScript
RUN npm run build
# Run as non-root user
RUN addgroup -g 1001 -S nodejs && \
adduser -S nodejs -u 1001
USER nodejs
CMD ["node", "dist/index.js"]Docker Compose Example
version: '3.8'
services:
rss-reader:
build: .
environment:
- NODE_ENV=production
- RUNNING_IN_DOCKER=true
deploy:
resources:
limits:
memory: 1G
cpus: '1.0'
reservations:
memory: 512M
security_opt:
- no-new-privileges:true
cap_drop:
- ALL
Application Code for Docker
import { RssReader } from '@xmer/rss-reader';
const reader = new RssReader({
// Safe in Docker container
disableSandbox: process.env.RUNNING_IN_DOCKER === 'true',
browserPoolSize: 3,
timeout: 60000
});
await reader.initialize();
try {
const feed = await reader.fetchAndParse(url);
// Process feed
} finally {
await reader.close();
}
CI/CD (GitHub Actions) Example
name: RSS Reader CI
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Node.js
uses: actions/setup-node@v3
with:
node-version: '18'
- name: Install dependencies
run: npm ci
- name: Install Chromium dependencies
run: |
sudo apt-get update
sudo apt-get install -y \
chromium-browser \
libx11-xcb1 \
libxcomposite1 \
libxcursor1 \
libxdamage1 \
libxi6 \
libxtst6 \
libnss3 \
libcups2 \
libxss1 \
libxrandr2 \
libasound2 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libpangocairo-1.0-0 \
libgtk-3-0
- name: Run tests
run: npm test
env:
PUPPETEER_EXECUTABLE_PATH: /usr/bin/chromium-browser
Examples
Basic Usage
import { RssReader } from '@xmer/rss-reader';
const reader = new RssReader();
await reader.initialize();
try {
const feed = await reader.fetchAndParse('https://blog.example.com/feed.xml');
for (const item of feed.items) {
console.log(item.title);
console.log(item.publishedAt.toLocaleDateString());
console.log(item.links.join(', '));
console.log('---');
}
} finally {
await reader.close();
}
Custom Title Cleaning
const reader = new RssReader({
titleSuffixPattern: /\s*-\s*TechBlog$/,
titlePrefixPattern: /^\[Article\]\s*/
});
await reader.initialize();
try {
const feed = await reader.fetchAndParse('https://techblog.example.com/feed');
// Titles are cleaned:
// "[Article] TypeScript Tips - TechBlog" -> "TypeScript Tips"
for (const item of feed.items) {
console.log(item.title);
}
} finally {
await reader.close();
}
Extracting Specific Link Types
const reader = new RssReader();
await reader.initialize();
try {
const feed = await reader.fetchAndParse('https://downloads.example.com/feed');
for (const item of feed.items) {
// Extract only magnet links
const magnetLinks = item.links.filter(link => link.startsWith('magnet:'));
// Extract only HTTP(S) links
const httpLinks = item.links.filter(link =>
link.startsWith('http://') || link.startsWith('https://')
);
console.log(`${item.title}:`);
console.log(` Magnet: ${magnetLinks.length}`);
console.log(` HTTP: ${httpLinks.length}`);
}
} finally {
await reader.close();
}
Processing Multiple Feeds
const reader = new RssReader({
browserPoolSize: 5, // Support concurrent processing
retryAttempts: 3
});
await reader.initialize();
try {
const feedUrls = [
'https://feed1.example.com/rss',
'https://feed2.example.com/rss',
'https://feed3.example.com/rss'
];
// Process feeds concurrently
const results = await Promise.allSettled(
feedUrls.map(url => reader.fetchAndParse(url))
);
for (const result of results) {
if (result.status === 'fulfilled') {
const feed = result.value;
console.log(`${feed.title}: ${feed.items.length} items`);
} else {
console.error(`Failed to fetch feed: ${result.reason}`);
}
}
} finally {
await reader.close();
}
Advanced Error Handling
import {
RssReader,
FeedFetchError,
BrowserError,
ParseError,
InvalidFeedError,
LinkNotFoundError
} from '@xmer/rss-reader';
const reader = new RssReader();
await reader.initialize();
try {
const feed = await reader.fetchAndParse(url);
for (const item of feed.items) {
console.log(`${item.title}: ${item.links.length} links`);
}
} catch (error) {
if (error instanceof FeedFetchError) {
console.error(`Network error for ${error.feedUrl}:`, error.message);
// Implement exponential backoff retry
} else if (error instanceof InvalidFeedError) {
console.error(`Invalid feed structure: ${error.feedUrl}`);
// Alert administrator - feed may have changed format
} else if (error instanceof BrowserError) {
console.error(`Browser crashed during ${error.operation}`);
// Reinitialize browser pool
await reader.close();
await reader.initialize();
} else if (error instanceof ParseError) {
console.error(`Parse error for item ${error.itemTitle}:`, error.message);
// Log and continue - some items may still be valid
} else if (error instanceof LinkNotFoundError) {
console.warn(`No links in ${error.itemTitle}`);
// Expected for some feeds - log warning
} else {
console.error('Unexpected error:', error);
throw error;
}
} finally {
await reader.close();
}
Using Static Utilities
import { RssReader } from '@xmer/rss-reader';
// Extract links from HTML without creating a reader instance
const html = '<a href="https://example.com">Link 1</a><a href="magnet:?xt=...">Magnet</a>';
const links = RssReader.extractLinks(html);
console.log(links); // ['https://example.com', 'magnet:?xt=...']
// Extract item ID from URL
const id = RssReader.extractItemId('https://blog.com/?p=12345');
console.log(id); // '12345'
// Clean title
const cleaned = RssReader.cleanTitle('Game Name – Site Branding', {
suffixPattern: /[–-]\s*Site Branding/
});
console.log(cleaned); // 'Game Name'
// Validate link
const isValid = RssReader.validateLink('https://example.com', /^https:\/\//);
console.log(isValid); // true
Contributing
Contributions are welcome! Please see the repository for contribution guidelines.
License
MIT License - see LICENSE file for details.
Made with TypeScript, Puppeteer, Cheerio, and xml2js.
