@anisirji/web-extractor
Powerful web content extraction SDK with intelligent URL handling, content cleaning, and comprehensive metadata extraction.
Features
✨ Smart URL Handling
- URL validation and normalization
- Subdomain detection
- Duplicate URL filtering
- Pattern-based URL filtering
🧹 Content Cleaning
- Automatic markdown/HTML/text extraction
- Whitespace normalization
- Word counting
- Language detection
📊 Rich Metadata
- Scraping timestamps
- Word counts
- Page descriptions
- Status codes
- Custom metadata support
🚀 Easy to Use
- Simple, intuitive API
- TypeScript support
- Promise-based
- Comprehensive error handling
📖 Documentation
- Testing Guide - Comprehensive guide on testing the SDK
- Test Results - Latest test results for astratechai.com
- API Documentation - Complete API reference and examples
Installation
npm install @anisirji/web-extractor
Quick Start
Extract a Single Page
import { WebExtractor } from '@anisirji/web-extractor';
const extractor = new WebExtractor({
apiKey: 'your-firecrawl-api-key'
});
// Extract single page
const page = await extractor.extractPage('https://example.com');
console.log(page.title);
console.log(page.content);
console.log(page.metadata.wordCount);
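Since the API is Promise-based, failures (an invalid URL, an API error) surface as rejections. A minimal sketch of handling them; the exact error shape is not documented in this README:
try {
  const page = await extractor.extractPage('https://example.com');
  console.log(`Extracted "${page.title}" (${page.metadata.wordCount} words)`);
} catch (error) {
  // Error shape is not documented here; inspect and handle as appropriate
  console.error('Extraction failed:', error);
}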
Extract Entire Website
const result = await extractor.extractWebsite('https://docs.example.com', {
maxPages: 20,
includeSubdomains: false,
titlePrefix: 'Docs',
maxDepth: 3
});
console.log(`Extracted ${result.pages.length} pages`);
console.log(`Success rate: ${result.stats.successRate}%`);
console.log(`Total words: ${result.stats.totalWords}`);
for (const page of result.pages) {
console.log(`${page.title} - ${page.url}`);
}
API Reference
WebExtractor
Main class for web extraction.
Constructor
new WebExtractor(config: WebExtractorConfig)
Config Options:
- apiKey (required): Your Firecrawl API key
- baseUrl (optional): Custom Firecrawl API URL
- timeout (optional): Request timeout in ms (default: 30000)
- debug (optional): Enable debug logging (default: false)
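For reference, a constructor call that sets every documented option might look like the sketch below; the baseUrl value is a placeholder, not a real endpoint:
const extractor = new WebExtractor({
  apiKey: 'your-firecrawl-api-key',          // required
  baseUrl: 'https://firecrawl.example.com',  // optional: custom Firecrawl API URL (placeholder)
  timeout: 60000,                            // optional: request timeout in ms (default: 30000)
  debug: true                                // optional: enable debug logging (default: false)
});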
Methods
extractPage(url, options?)
Extract content from a single page.
await extractor.extractPage('https://example.com', {
onlyMainContent: true, // Extract only main content
format: 'markdown', // 'markdown' | 'html' | 'text'
waitFor: 1000 // Wait time before extraction (ms)
});
Returns: Promise<ExtractedPage>
extractWebsite(url, options?)
Extract content from entire website (crawl).
await extractor.extractWebsite('https://example.com', {
maxPages: 10, // Maximum pages to scrape
includeSubdomains: false, // Include subdomains
titlePrefix: 'My Site', // Prefix for all titles
maxDepth: 3, // Maximum crawl depth
followExternalLinks: false, // Follow external links
includePatterns: [/\/docs\//], // URL patterns to include
excludePatterns: [/\/blog\//], // URL patterns to exclude
onlyMainContent: true, // Extract only main content
format: 'markdown' // Output format
});
Returns: Promise<ExtractionResult>
URL Utilities
Powerful URL manipulation utilities.
import {
normalizeUrl,
validateUrl,
deduplicateUrls,
isSameDomain,
extractDomain
} from '@anisirji/web-extractor';
// Normalize URL
const normalized = normalizeUrl('https://Example.com/path/?b=2&a=1#hash', {
lowercase: true, // Convert to lowercase
removeTrailingSlash: true, // Remove trailing slash
removeFragment: true, // Remove #hash
sortQueryParams: true // Sort query params
});
// => 'https://example.com/path?a=1&b=2'
// Validate URL
const urlObj = validateUrl('https://example.com'); // Returns URL object or throws
// Deduplicate URLs
const unique = deduplicateUrls([
'https://example.com/page',
'https://example.com/page/',
'https://EXAMPLE.COM/page'
]);
// => ['https://example.com/page']
// Check same domain
isSameDomain('https://example.com', 'https://example.com/page'); // true
isSameDomain('https://example.com', 'https://other.com'); // false
// Extract domain
extractDomain('https://blog.example.com/page'); // => 'blog.example.com'
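These utilities compose naturally. A sketch of a pre-crawl cleanup pass, using only the functions and options shown above (the candidate list is illustrative):
import { normalizeUrl, deduplicateUrls, isSameDomain } from '@anisirji/web-extractor';

const candidates = [
  'https://Example.com/docs/',
  'https://example.com/docs',
  'https://example.com/blog#intro',
  'https://other.com/docs'
];

// Keep same-domain URLs, normalize them, then drop duplicates
const crawlList = deduplicateUrls(
  candidates
    .filter(url => isSameDomain(url, 'https://example.com'))
    .map(url => normalizeUrl(url, {
      lowercase: true,
      removeTrailingSlash: true,
      removeFragment: true,
      sortQueryParams: true
    }))
);
// => ['https://example.com/docs', 'https://example.com/blog'] (expected, given the options above)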
Content Utilities
Content processing utilities.
import {
cleanContent,
countWords,
generateExcerpt,
detectLanguage
} from '@anisirji/web-extractor';
// Clean content
const cleaned = cleanContent(' text\n\n\n\nmore text ');
// => 'text\n\nmore text'
// Count words
countWords('Hello world from TermiX'); // => 4
// Generate excerpt
generateExcerpt('Very long content here...', 10);
// => the first 10 words of the content, followed by '...'
// Detect language
detectLanguage('This is an English text'); // => 'en'
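The content helpers pair naturally with extraction. A sketch that builds a summary record for a page (the shape of summary is illustrative):
import {
  WebExtractor,
  cleanContent,
  countWords,
  generateExcerpt,
  detectLanguage
} from '@anisirji/web-extractor';

const extractor = new WebExtractor({ apiKey: 'your-firecrawl-api-key' });
const page = await extractor.extractPage('https://example.com');

// Normalize whitespace once, then derive everything else from the cleaned text
const content = cleanContent(page.content);
const summary = {
  title: page.title,
  excerpt: generateExcerpt(content, 30),
  words: countWords(content),
  language: detectLanguage(content)
};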
Advanced Examples
Filter URLs by Pattern
const result = await extractor.extractWebsite('https://docs.example.com', {
maxPages: 50,
// Only include documentation pages
includePatterns: [
/\/docs\//,
/\/api\//,
/\/guides\//
],
// Exclude blog and changelog
excludePatterns: [
/\/blog\//,
/\/changelog\//
]
});
Custom Processing Pipeline
const result = await extractor.extractWebsite('https://example.com', {
maxPages: 30
});
// Filter by word count
const substantialPages = result.pages.filter(
page => page.metadata.wordCount > 500
);
// Group by language
const byLanguage = result.pages.reduce((acc, page) => {
const lang = page.metadata.language || 'unknown';
acc[lang] = acc[lang] || [];
acc[lang].push(page);
return acc;
}, {});
// Calculate reading time
const withReadingTime = result.pages.map(page => ({
...page,
readingTimeMinutes: Math.ceil(page.metadata.wordCount / 200)
}));
Batch Processing with Error Handling
const urls = [
'https://example.com/page1',
'https://example.com/page2',
'https://example.com/page3'
];
const results = await Promise.allSettled(
urls.map(url => extractor.extractPage(url))
);
const successful = results
.filter(r => r.status === 'fulfilled')
.map(r => r.value);
// Pair each result with its URL before filtering, so indexes stay aligned
const failed = results
  .map((result, i) => ({ url: urls[i], result }))
  .filter(({ result }) => result.status === 'rejected')
  .map(({ url, result }) => ({ url, error: result.reason }));
console.log(`Success: ${successful.length}, Failed: ${failed.length}`);
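Because failed pairs each URL with its error, a retry pass is straightforward. One illustrative second attempt:
// Retry each failed URL once; inspect `retried` the same way as `results`
const retried = await Promise.allSettled(
  failed.map(({ url }) => extractor.extractPage(url))
);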
Types
ExtractedPage
interface ExtractedPage {
title: string;
content: string;
url: string;
metadata: PageMetadata;
}
PageMetadata
interface PageMetadata {
scrapedAt: Date;
sourceUrl: string;
description?: string;
wordCount: number;
language?: string;
statusCode?: number;
[key: string]: any; // Custom metadata
}
ExtractionResult
interface ExtractionResult {
pages: ExtractedPage[];
totalPages: number;
failed: FailedExtraction[];
stats: ExtractionStats;
}
ExtractionStats
interface ExtractionStats {
duration: number; // Total time in ms
successRate: number; // Success rate %
totalWords: number; // Total words extracted
avgWordsPerPage: number; // Average words per page
}
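Putting the types together: a small helper that formats an extraction summary. This assumes the package exports these interfaces for consumers, which the TypeScript support noted above suggests:
import type { ExtractionResult } from '@anisirji/web-extractor';

function summarize(result: ExtractionResult): string {
  const { duration, successRate, totalWords, avgWordsPerPage } = result.stats;
  return [
    `Pages: ${result.pages.length}/${result.totalPages} (failed: ${result.failed.length})`,
    `Success rate: ${successRate}%`,
    `Words: ${totalWords} total, ${avgWordsPerPage} per page`,
    `Duration: ${(duration / 1000).toFixed(1)}s`
  ].join('\n');
}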
Use Cases
- 📚 Documentation Scraping: Extract and index documentation sites
- 🧠 Knowledge Base Building: Build AI knowledge bases from websites
- 🔍 Content Analysis: Analyze website content and structure
- 📊 SEO Analysis: Extract metadata for SEO analysis
- 🤖 AI Training Data: Collect training data for AI models
- 📝 Content Migration: Migrate content from old to new sites
Requirements
- Node.js >= 16
- Firecrawl API key (get one at https://firecrawl.dev)
Testing
Run the comprehensive test suite:
# Unit tests
npm test
# Integration test with astratechai.com
npm run test:astratechai
# Basic usage example
npm run test:integration
See the Testing Guide for detailed instructions on creating tests for your own websites.
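As a starting point, a minimal smoke test for your own site might look like the sketch below; the file name, environment variable, and pass criteria are all illustrative:
// smoke-test.mjs (hypothetical) - run with: node smoke-test.mjs
import { WebExtractor } from '@anisirji/web-extractor';

const extractor = new WebExtractor({ apiKey: process.env.FIRECRAWL_API_KEY });

const result = await extractor.extractWebsite('https://your-site.example', {
  maxPages: 5
});

// Illustrative pass criteria: nothing failed, something extracted
if (result.failed.length > 0 || result.pages.length === 0) {
  console.error('Smoke test failed:', result.failed);
  process.exit(1);
}
console.log(`OK: ${result.pages.length} pages, ${result.stats.totalWords} words`);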
License
MIT
Repository
https://github.com/anisirji/llm-web-extractor
Built with ❤️ by anisirji
