@anisirji/web-extractor

Powerful web content extraction SDK with intelligent URL handling, content cleaning, and comprehensive metadata extraction.

Features

🔗 Smart URL Handling

  • URL validation and normalization
  • Subdomain detection
  • Duplicate URL filtering
  • Pattern-based URL filtering

🧹 Content Cleaning

  • Automatic markdown/HTML/text extraction
  • Whitespace normalization
  • Word counting
  • Language detection

📊 Rich Metadata

  • Scraping timestamps
  • Word counts
  • Page descriptions
  • Status codes
  • Custom metadata support

🚀 Easy to Use

  • Simple, intuitive API
  • TypeScript support
  • Promise-based
  • Comprehensive error handling

📖 Documentation

Installation

npm install @anisirji/web-extractor

Quick Start

Extract a Single Page

import { WebExtractor } from '@anisirji/web-extractor';

const extractor = new WebExtractor({
  apiKey: 'your-firecrawl-api-key'
});

// Extract single page
const page = await extractor.extractPage('https://example.com');

console.log(page.title);
console.log(page.content);
console.log(page.metadata.wordCount);

Extract Entire Website

const result = await extractor.extractWebsite('https://docs.example.com', {
  maxPages: 20,
  includeSubdomains: false,
  titlePrefix: 'Docs',
  maxDepth: 3
});

console.log(`Extracted ${result.pages.length} pages`);
console.log(`Success rate: ${result.stats.successRate}%`);
console.log(`Total words: ${result.stats.totalWords}`);

for (const page of result.pages) {
  console.log(`${page.title} - ${page.url}`);
}

API Reference

WebExtractor

Main class for web extraction.

Constructor

new WebExtractor(config: WebExtractorConfig)

Config Options:

  • apiKey (required): Your Firecrawl API key
  • baseUrl (optional): Custom Firecrawl API URL
  • timeout (optional): Request timeout in ms (default: 30000)
  • debug (optional): Enable debug logging (default: false)
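
A construction sketch that exercises every documented option; the baseUrl and timeout values below are illustrative placeholders, not library defaults.

import { WebExtractor } from '@anisirji/web-extractor';

const extractor = new WebExtractor({
  apiKey: 'your-firecrawl-api-key',          // required: Firecrawl API key
  baseUrl: 'https://api.firecrawl.example',  // optional: custom Firecrawl API URL (placeholder)
  timeout: 60_000,                           // optional: request timeout in ms (default: 30000)
  debug: true                                // optional: debug logging (default: false)
});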

Methods

extractPage(url, options?)

Extract content from a single page.

await extractor.extractPage('https://example.com', {
  onlyMainContent: true,  // Extract only main content
  format: 'markdown',     // 'markdown' | 'html' | 'text'
  waitFor: 1000          // Wait time before extraction (ms)
});

Returns: Promise<ExtractedPage>
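
Because the SDK is promise-based, failures surface as rejections. A minimal sketch of wrapping extractPage in try/catch; the error shape isn't specified above, so the catch only logs it:

try {
  const page = await extractor.extractPage('https://example.com/docs', {
    onlyMainContent: true,
    format: 'markdown'
  });
  console.log(`${page.title} (${page.metadata.wordCount} words)`);
} catch (err) {
  // The exact error type isn't documented here; treat it as unknown
  console.error('Extraction failed:', err instanceof Error ? err.message : err);
}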

extractWebsite(url, options?)

Extract content from an entire website (crawl).

await extractor.extractWebsite('https://example.com', {
  maxPages: 10,                    // Maximum pages to scrape
  includeSubdomains: false,        // Include subdomains
  titlePrefix: 'My Site',          // Prefix for all titles
  maxDepth: 3,                     // Maximum crawl depth
  followExternalLinks: false,      // Follow external links
  includePatterns: [/\/docs\//],   // URL patterns to include
  excludePatterns: [/\/blog\//],   // URL patterns to exclude
  onlyMainContent: true,           // Extract only main content
  format: 'markdown'               // Output format
});

Returns: Promise<ExtractionResult>
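
The returned ExtractionResult (see Types below) carries a failed array alongside the successful pages. A short sketch of checking it after a crawl; the exact fields of FailedExtraction aren't documented here, so the entries are logged as-is:

const result = await extractor.extractWebsite('https://example.com', { maxPages: 10 });

console.log(`Extracted ${result.pages.length} of ${result.totalPages} pages`);
if (result.failed.length > 0) {
  // FailedExtraction's shape isn't specified above, so just surface the raw entries
  console.warn('Some pages failed to extract:', result.failed);
}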

URL Utilities

Powerful URL manipulation utilities.

import {
  normalizeUrl,
  validateUrl,
  deduplicateUrls,
  isSameDomain,
  extractDomain
} from '@anisirji/web-extractor';

// Normalize URL
const normalized = normalizeUrl('https://Example.com/path/?b=2&a=1#hash', {
  lowercase: true,           // Convert to lowercase
  removeTrailingSlash: true, // Remove trailing slash
  removeFragment: true,      // Remove #hash
  sortQueryParams: true      // Sort query params
});
// => 'https://example.com/path?a=1&b=2'

// Validate URL
const urlObj = validateUrl('https://example.com'); // Returns URL object or throws

// Deduplicate URLs
const unique = deduplicateUrls([
  'https://example.com/page',
  'https://example.com/page/',
  'https://EXAMPLE.COM/page'
]);
// => ['https://example.com/page']

// Check same domain
isSameDomain('https://example.com', 'https://example.com/page'); // true
isSameDomain('https://example.com', 'https://other.com'); // false

// Extract domain
extractDomain('https://blog.example.com/page'); // => 'blog.example.com'

Content Utilities

Content processing utilities.

import {
  cleanContent,
  countWords,
  generateExcerpt,
  detectLanguage
} from '@anisirji/web-extractor';

// Clean content
const cleaned = cleanContent('  text\n\n\n\nmore text  ');
// => 'text\n\nmore text'

// Count words
countWords('Hello world from TermiX'); // => 4

// Generate excerpt
generateExcerpt('Very long content here...', 10);
// => the first 10 words of the content, followed by '...'

// Detect language
detectLanguage('This is an English text'); // => 'en'

Advanced Examples

Filter URLs by Pattern

const result = await extractor.extractWebsite('https://docs.example.com', {
  maxPages: 50,
  // Only include documentation pages
  includePatterns: [
    /\/docs\//,
    /\/api\//,
    /\/guides\//
  ],
  // Exclude blog and changelog
  excludePatterns: [
    /\/blog\//,
    /\/changelog\//
  ]
});

Custom Processing Pipeline

const result = await extractor.extractWebsite('https://example.com', {
  maxPages: 30
});

// Filter by word count
const substantialPages = result.pages.filter(
  page => page.metadata.wordCount > 500
);

// Group by language
const byLanguage = result.pages.reduce((acc, page) => {
  const lang = page.metadata.language || 'unknown';
  acc[lang] = acc[lang] || [];
  acc[lang].push(page);
  return acc;
}, {});

// Calculate reading time
const withReadingTime = result.pages.map(page => ({
  ...page,
  readingTimeMinutes: Math.ceil(page.metadata.wordCount / 200)
}));

Batch Processing with Error Handling

const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

const results = await Promise.allSettled(
  urls.map(url => extractor.extractPage(url))
);

const successful = results
  .filter(r => r.status === 'fulfilled')
  .map(r => r.value);

// Pair each result with its URL before filtering so the indices stay aligned
const failed = results
  .map((result, i) => ({ result, url: urls[i] }))
  .filter(({ result }) => result.status === 'rejected')
  .map(({ result, url }) => ({ url, error: result.reason }));

console.log(`Success: ${successful.length}, Failed: ${failed.length}`);

Types

ExtractedPage

interface ExtractedPage {
  title: string;
  content: string;
  url: string;
  metadata: PageMetadata;
}

PageMetadata

interface PageMetadata {
  scrapedAt: Date;
  sourceUrl: string;
  description?: string;
  wordCount: number;
  language?: string;
  statusCode?: number;
  [key: string]: any;  // Custom metadata
}
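
PageMetadata's index signature allows arbitrary custom fields; a small sketch of reading one defensively. The 'author' key is purely hypothetical: which custom keys appear (if any) depends on the page and isn't specified above.

const page = await extractor.extractPage('https://example.com/article');

const author = page.metadata['author']; // hypothetical custom key, typed as any
if (typeof author === 'string') {
  console.log(`Author: ${author}`);
}
console.log(`Scraped at ${page.metadata.scrapedAt.toISOString()}`);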

ExtractionResult

interface ExtractionResult {
  pages: ExtractedPage[];
  totalPages: number;
  failed: FailedExtraction[];
  stats: ExtractionStats;
}

ExtractionStats

interface ExtractionStats {
  duration: number;           // Total time in ms
  successRate: number;        // Success rate %
  totalWords: number;         // Total words extracted
  avgWordsPerPage: number;    // Average words per page
}

Use Cases

  • 📚 Documentation Scraping: Extract and index documentation sites (see the sketch after this list)
  • 🧠 Knowledge Base Building: Build AI knowledge bases from websites
  • 🔍 Content Analysis: Analyze website content and structure
  • 📊 SEO Analysis: Extract metadata for SEO analysis
  • 🤖 AI Training Data: Collect training data for AI models
  • 📝 Content Migration: Migrate content from old to new sites
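
A minimal sketch of the documentation-scraping use case: crawl a docs site and write each page to a local markdown file. The output directory and slug logic are illustrative choices, and it assumes Node's built-in fs/path modules and format: 'markdown':

import { mkdir, writeFile } from 'node:fs/promises';
import { join } from 'node:path';

const result = await extractor.extractWebsite('https://docs.example.com', {
  maxPages: 50,
  includePatterns: [/\/docs\//],
  format: 'markdown'
});

await mkdir('./extracted-docs', { recursive: true });

for (const page of result.pages) {
  // Derive a file name from the URL path; plain illustration, not part of the SDK
  const slug = new URL(page.url).pathname.replace(/\W+/g, '-').replace(/^-|-$/g, '') || 'index';
  await writeFile(join('./extracted-docs', `${slug}.md`), `# ${page.title}\n\n${page.content}\n`);
}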

Requirements

  • A Firecrawl API key (passed as apiKey in the WebExtractor config)

Testing

Run the comprehensive test suite:

# Unit tests
npm test

# Integration test with astratechai.com
npm run test:astratechai

# Basic usage example
npm run test:integration

See Testing Guide for detailed instructions on creating tests for your own websites.
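
The project's test framework isn't specified above; as an illustration, here is a standalone unit test for normalizeUrl using Node's built-in node:test runner, based on the behaviour documented in URL Utilities:

import { test } from 'node:test';
import assert from 'node:assert/strict';
import { normalizeUrl } from '@anisirji/web-extractor';

test('normalizeUrl lowercases, strips fragments, and sorts query params', () => {
  const normalized = normalizeUrl('https://Example.com/path/?b=2&a=1#hash', {
    lowercase: true,
    removeTrailingSlash: true,
    removeFragment: true,
    sortQueryParams: true
  });
  assert.equal(normalized, 'https://example.com/path?a=1&b=2');
});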

License

MIT

Repository

https://github.com/anisirji/llm-web-extractor

Built with ❤️ by anisirji