scrapex

Modern web scraper with LLM-enhanced extraction, extensible pipeline, and pluggable parsers.

Release: v1.0.0 (API stable; breaking changes documented below).

Features

  • LLM-Ready Output - Content extracted as Markdown, optimized for AI/LLM consumption
  • Provider-Agnostic LLM - Works with OpenAI, Anthropic, Ollama, LM Studio, or any OpenAI-compatible API
  • Vector Embeddings - Generate embeddings with OpenAI, Azure, Cohere, HuggingFace, Ollama, or local Transformers.js
  • Extensible Pipeline - Pluggable extractors with priority-based execution
  • Smart Extraction - Uses Mozilla Readability for content, Cheerio for metadata
  • Markdown Parsing - Parse markdown content, awesome lists, and GitHub repos
  • RSS/Atom Feeds - Parse RSS 2.0, RSS 1.0 (RDF), and Atom feeds with pagination support
  • Content Normalization - Clean, embedding-ready text with boilerplate removal
  • TypeScript First - Full type safety with comprehensive type exports
  • Dual Format - ESM and CommonJS builds

Installation

npm install scrapex

Optional Peer Dependencies

# For LLM features
npm install openai           # OpenAI/Ollama/LM Studio
npm install @anthropic-ai/sdk  # Anthropic Claude

# For local embeddings (zero API cost)
npm install @huggingface/transformers onnxruntime-node

# For JavaScript-rendered pages
npm install puppeteer

Quick Start

import { scrape } from 'scrapex';

const result = await scrape('https://example.com/article');

console.log(result.title);       // "Article Title"
console.log(result.content);     // Markdown content
console.log(result.textContent); // Plain text (lower tokens)
console.log(result.excerpt);     // First ~300 chars
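
scrapex also ships a CommonJS build (see the Dual Format feature above). A minimal sketch, assuming the CJS entry exposes the same named exports as the ESM build:

// CommonJS usage; assumes the CJS build mirrors the ESM named exports
const { scrape } = require('scrapex');

scrape('https://example.com/article').then((result) => {
  console.log(result.title);
});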

API Reference

scrape(url, options?)

Fetch and extract metadata and content from a URL.

import { scrape } from 'scrapex';

const result = await scrape('https://example.com', {
  timeout: 10000,
  userAgent: 'MyBot/1.0',
  extractContent: true,
  maxContentLength: 50000,
  normalize: {
    mode: 'full',
    removeBoilerplate: true,
  },
  respectRobots: false,
});

scrapeHtml(html, url, options?)

Extract from raw HTML without fetching.

import { scrapeHtml } from 'scrapex';

const html = await fetch('https://example.com').then(r => r.text());
const result = await scrapeHtml(html, 'https://example.com');

Result Object (ScrapedData)

interface ScrapedData {
  // Identity
  url: string;
  canonicalUrl: string;
  domain: string;

  // Basic metadata
  title: string;
  description: string;
  image?: string;
  favicon?: string;

  // Content (LLM-optimized)
  content: string;      // Markdown format
  textContent: string;  // Plain text
  excerpt: string;      // ~300 char preview
  wordCount: number;

  // Context
  author?: string;
  publishedAt?: string;
  modifiedAt?: string;
  siteName?: string;
  language?: string;

  // Classification
  contentType: 'article' | 'repo' | 'docs' | 'package' | 'video' | 'tool' | 'product' | 'unknown';
  keywords: string[];

  // Structured data
  jsonLd?: Record<string, unknown>[];
  links?: ExtractedLink[];

  // LLM Enhancements (when enabled)
  summary?: string;
  suggestedTags?: string[];
  entities?: ExtractedEntities;
  extracted?: Record<string, unknown>;

  // Custom extractor results
  custom?: Record<string, unknown>;

  // Embeddings (when enabled)
  embeddings?: EmbeddingResult;

  // Normalized text (when enabled)
  normalizedText?: string;
  normalizationMeta?: NormalizationMeta;
  normalizedBlocks?: ContentBlock[];

  // Meta
  scrapedAt: string;
  scrapeTimeMs: number;
  error?: string;
}
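
A short usage sketch reading a few of these fields (the URL and the branching are illustrative only):

import { scrape } from 'scrapex';

const result = await scrape('https://github.com/facebook/react');

// Branch on the detected content type
if (result.contentType === 'repo') {
  console.log(result.title, result.siteName);
}

// Always present
console.log(result.wordCount, result.keywords);

// Populated only when LLM enhancements are enabled
console.log(result.summary ?? 'no summary');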

LLM Integration

Using OpenAI

import { scrape } from 'scrapex';
import { createOpenAI } from 'scrapex/llm';

const llm = createOpenAI({ apiKey: 'sk-...' });

const result = await scrape('https://example.com/article', {
  llm,
  enhance: ['summarize', 'tags', 'entities', 'classify'],
});

console.log(result.summary);       // AI-generated summary
console.log(result.suggestedTags); // ['javascript', 'web', ...]
console.log(result.entities);      // { people: [], organizations: [], ... }

Embeddings

Generate vector embeddings from scraped content for semantic search, RAG, and similarity matching:

import { scrape } from 'scrapex';
import { createOpenAIEmbedding } from 'scrapex/embeddings';

const result = await scrape('https://example.com/article', {
  embeddings: {
    provider: { type: 'custom', provider: createOpenAIEmbedding() },
    model: 'text-embedding-3-small',
  },
});

if (result.embeddings?.status === 'success') {
  console.log(result.embeddings.vector); // [0.023, -0.041, ...]
}

Features include:

  • Multiple providers - OpenAI, Azure, Cohere, HuggingFace, Ollama, Transformers.js
  • PII redaction - Automatically redact emails, phones, SSNs before sending to APIs
  • Smart chunking - Split long content with configurable overlap
  • Caching - Content-addressable cache to avoid redundant API calls
  • Resilience - Retry, circuit breaker, rate limiting
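
The sketch below shows where these features plug into the embeddings options; the commented-out option names are assumptions for illustration only, not the actual EmbeddingOptions fields.

import { scrape } from 'scrapex';
import { createOpenAIEmbedding } from 'scrapex/embeddings';

const result = await scrape('https://example.com/article', {
  embeddings: {
    provider: { type: 'custom', provider: createOpenAIEmbedding() },
    model: 'text-embedding-3-small',
    // chunking: { maxTokens: 512, overlap: 64 }, // smart chunking (hypothetical names)
    // redactPII: true,                           // PII redaction (hypothetical name)
    // cache: true,                               // content-addressable cache (hypothetical name)
  },
});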

See the Embeddings Guide for full documentation.

Content Normalization

Clean, embedding-ready text with boilerplate removal and block classification:

const result = await scrape(url, {
  normalize: {
    mode: 'full',            // or 'summary' for score-ranked blocks
    removeBoilerplate: true, // filter nav, footer, promos
    maxChars: 5000,          // truncate at sentence boundary
  },
});

console.log(result.normalizedText);    // Clean text ready for embedding
console.log(result.normalizationMeta); // { charCount, tokenEstimate, hash, ... }

Standalone Normalization

Use normalization without scraping:

import { parseBlocks, normalizeText } from 'scrapex';
import { load } from 'cheerio';

const $ = load(html);
const blocks = parseBlocks($);
const result = await normalizeText(blocks, { mode: 'full' });

console.log(result.text);
console.log(result.meta);

Custom Classifiers

Filter blocks with custom logic:

import { combineClassifiers, defaultBlockClassifier } from 'scrapex';

const myClassifier = combineClassifiers(
  defaultBlockClassifier,
  (block) => {
    if (block.text.includes('Advertisement')) {
      return { accept: false, label: 'ad' };
    }
    return { accept: true };
  }
);

const result = await scrape(url, {
  normalize: { blockClassifier: myClassifier },
});

Normalization Options

interface NormalizeOptions {
  mode?: 'summary' | 'full';       // Output mode (default: 'full')
  maxChars?: number;               // Max characters in output
  minChars?: number;               // Minimum characters required
  maxBlocks?: number;              // Max blocks to process (default: 2000)
  truncate?: 'sentence' | 'word' | 'char';  // Truncation strategy
  dropSelectors?: string[];        // Selectors to drop before parsing
  removeBoilerplate?: boolean;     // Filter nav/footer/promos (default: true)
  decodeEntities?: boolean;        // Decode HTML entities (default: true)
  normalizeUnicode?: boolean;      // Normalize Unicode to NFC (default: true)
  preserveLineBreaks?: boolean;    // Preserve paragraph breaks (default: true)
  stripLinks?: boolean;            // Strip Markdown links (default: true)
  includeHtml?: boolean;           // Include raw HTML in blocks (default: false)
  languageHint?: string;           // Language hint for metadata
  blockClassifier?: ContentBlockClassifier; // Custom classifier
  debug?: boolean;                 // Include blocks in output
}

Breaking Changes

  • LLM provider classes (e.g., AnthropicProvider) were removed. Use preset factories like createOpenAI, createAnthropic, createOllama, and createLMStudio instead.

Using Anthropic Claude

import { createAnthropic } from 'scrapex/llm';

const llm = createAnthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: 'claude-3-5-haiku-20241022', // or 'claude-3-5-sonnet-20241022'
});

const result = await scrape(url, { llm, enhance: ['summarize'] });

Using Ollama (Local)

import { createOllama } from 'scrapex/llm';

const llm = createOllama({ model: 'llama3.2' });

const result = await scrape(url, { llm, enhance: ['summarize'] });

Using LM Studio (Local)

import { createLMStudio } from 'scrapex/llm';

const llm = createLMStudio({ model: 'local-model' });

const result = await scrape(url, { llm, enhance: ['summarize'] });

Structured Extraction

Extract specific data using a schema:

const result = await scrape('https://example.com/product', {
  llm,
  extract: {
    productName: 'string',
    price: 'number',
    features: 'string[]',
    inStock: 'boolean',
    sku: 'string?', // optional
  },
});

console.log(result.extracted);
// { productName: "Widget", price: 29.99, features: [...], inStock: true }

Custom Extractors

Create custom extractors to add domain-specific extraction logic:

import { scrape, type Extractor, type ExtractionContext } from 'scrapex';

const recipeExtractor: Extractor = {
  name: 'recipe',
  priority: 60, // Higher = runs earlier

  async extract(context: ExtractionContext) {
    const { $ } = context;

    return {
      custom: {
        ingredients: $('.ingredients li').map((_, el) => $(el).text()).get(),
        cookTime: $('[itemprop="cookTime"]').attr('content'),
        servings: $('[itemprop="recipeYield"]').text(),
      },
    };
  },
};

const result = await scrape('https://example.com/recipe', {
  extractors: [recipeExtractor],
});

console.log(result.custom?.ingredients);

Replacing Default Extractors

const result = await scrape(url, {
  replaceDefaultExtractors: true,
  extractors: [myCustomExtractor],
});

Markdown Parsing

Parse markdown content:

import { MarkdownParser, extractListLinks, groupByCategory } from 'scrapex/parsers';

// Parse any markdown
const parser = new MarkdownParser();
const result = parser.parse(markdownContent);

console.log(result.data.title);
console.log(result.data.sections);
console.log(result.data.links);
console.log(result.data.codeBlocks);

// Extract links from markdown lists and group by category
const links = extractListLinks(markdownContent);
const grouped = groupByCategory(links);

grouped.forEach((categoryLinks, category) => {
  console.log(`${category}: ${categoryLinks.length} links`);
});

GitHub Utilities

import {
  isGitHubRepo,
  parseGitHubUrl,
  toRawUrl,
} from 'scrapex/parsers';

isGitHubRepo('https://github.com/owner/repo');
// true

parseGitHubUrl('https://github.com/facebook/react');
// { owner: 'facebook', repo: 'react' }

toRawUrl('https://github.com/owner/repo');
// 'https://raw.githubusercontent.com/owner/repo/main/README.md'
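
These utilities compose with the markdown parser. A hedged sketch that fetches a repository README via toRawUrl (which, as shown above, assumes README.md on the main branch) and parses it; the repository URL is illustrative:

import { toRawUrl, MarkdownParser } from 'scrapex/parsers';

const rawUrl = toRawUrl('https://github.com/sindresorhus/awesome');
const markdown = await fetch(rawUrl).then(r => r.text()); // global fetch, Node 20+

const parser = new MarkdownParser();
const result = parser.parse(markdown);

console.log(result.data.title);
console.log(result.data.links);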

RSS/Atom Feed Parsing

Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds:

import { RSSParser } from 'scrapex';

const parser = new RSSParser();
const result = parser.parse(feedXml, 'https://example.com/feed.xml');

console.log(result.data.format);  // 'rss2' | 'rss1' | 'atom'
console.log(result.data.title);   // Feed title
console.log(result.data.items);   // Array of feed items

Supported formats:

  • rss2 - RSS 2.0 (most common format)
  • rss1 - RSS 1.0 (RDF-based, older format)
  • atom - Atom 1.0 (modern format with better semantics)

Feed Item Structure

interface FeedItem {
  id: string;
  title: string;
  link: string;
  description?: string;
  content?: string;
  author?: string;
  publishedAt?: string;      // ISO 8601
  rawPublishedAt?: string;   // Original date string
  updatedAt?: string;        // Atom only
  categories: string[];
  enclosure?: FeedEnclosure; // Podcast/media attachments
  customFields?: Record<string, string>;
}

Fetching and Parsing Feeds

import { fetchFeed, paginateFeed } from 'scrapex';

// Fetch and parse in one call
const result = await fetchFeed('https://example.com/feed.xml');
console.log(result.data.items);

// Paginate through feeds with rel="next" links (Atom)
for await (const page of paginateFeed('https://example.com/atom')) {
  console.log(`Page with ${page.items.length} items`);
}

Discovering Feeds in HTML

import { discoverFeeds } from 'scrapex';

const html = await fetch('https://example.com').then(r => r.text());
const feedUrls = discoverFeeds(html, 'https://example.com');
// ['https://example.com/feed.xml', 'https://example.com/atom.xml']

Filtering by Date

import { RSSParser, filterByDate } from 'scrapex';

const parser = new RSSParser();
const result = parser.parse(feedXml);

const recentItems = filterByDate(result.data.items, {
  after: new Date('2024-01-01'),
  before: new Date('2024-12-31'),
  includeUndated: false,
});

Converting to Markdown/Text

import { RSSParser, feedToMarkdown, feedToText } from 'scrapex';

const parser = new RSSParser();
const result = parser.parse(feedXml);

// Convert to markdown (great for LLM consumption)
const markdown = feedToMarkdown(result.data, { maxItems: 10 });

// Convert to plain text
const text = feedToText(result.data);

Normalizing Feed Items

Convert feed item content to clean, embedding-ready text:

import { RSSParser, normalizeFeedItem } from 'scrapex';

const parser = new RSSParser();
const result = parser.parse(feedXml);

for (const item of result.data.items) {
  const normalized = await normalizeFeedItem(item, {
    mode: 'full',
    removeBoilerplate: true,
  });

  console.log(normalized.text); // Clean text from content/description
  console.log(normalized.meta); // { charCount, tokenEstimate, hash, ... }
}

Custom Fields (Podcast/Media)

Extract custom namespace fields like iTunes podcast tags:

const parser = new RSSParser({
  customFields: {
    duration: 'itunes\\:duration',
    explicit: 'itunes\\:explicit',
    rating: 'media\\:rating',
  },
});

const result = parser.parse(podcastXml);
const item = result.data.items[0];

console.log(item.customFields?.duration);  // '10:00'
console.log(item.customFields?.explicit);  // 'no'

Attribute Extraction

Use selector@attr syntax to extract XML attribute values:

const parser = new RSSParser({
  customFields: {
    // Extract url attribute from media:thumbnail element
    thumbnail: 'media\\:thumbnail@url',
    // Extract url attribute from media:content element
    mediaUrl: 'media\\:content@url',
  },
});

const result = parser.parse(mediaRssFeed);
console.log(result.data.items[0]?.customFields?.thumbnail);
// => "https://example.com/images/thumbnail.jpg"

Security

The RSS parser enforces strict URL security:

  • HTTPS-only links (feed parsing only): RSSParser resolves all links to HTTPS only; non-HTTPS URLs (http, javascript, data, file) are rejected and returned as empty strings, preventing malicious links in untrusted feeds.
  • XML Mode: Feeds are parsed with Cheerio's { xml: true } mode, which disables HTML entity processing and prevents XSS vectors.

Note: The public URL utilities (resolveUrl, isValidUrl, etc.) accept both http: and https: URLs. Protocol-relative URLs (e.g., //example.com/path) are resolved against the base URL's protocol by the standard URL constructor.
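
A small sketch of the feed-link behavior described above, using an inline feed with a non-HTTPS link (the XML is illustrative; the expected output follows from the rule above):

import { RSSParser } from 'scrapex';

const feedXml = `<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example</title>
  <item><title>Item</title><link>javascript:alert(1)</link></item>
</channel></rss>`;

const parser = new RSSParser();
const result = parser.parse(feedXml, 'https://example.com/feed.xml');

// Non-HTTPS links are rejected and returned as empty strings
console.log(result.data.items[0]?.link); // ''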

URL Utilities

import {
  isValidUrl,
  normalizeUrl,
  extractDomain,
  resolveUrl,
  isExternalUrl,
} from 'scrapex';

isValidUrl('https://example.com');
// true

normalizeUrl('https://example.com/page?utm_source=twitter');
// 'https://example.com/page' (tracking params removed)

extractDomain('https://www.example.com/path');
// 'example.com'

resolveUrl('/path', 'https://example.com/page');
// 'https://example.com/path'

isExternalUrl('https://other.com', 'example.com');
// true

Error Handling

import { scrape, ScrapeError } from 'scrapex';

try {
  const result = await scrape('https://example.com');
} catch (error) {
  if (error instanceof ScrapeError) {
    console.log(error.code);       // 'FETCH_FAILED' | 'TIMEOUT' | 'INVALID_URL' | ...
    console.log(error.statusCode); // HTTP status if available
    console.log(error.isRetryable()); // true for network errors
  }
}

Error codes:

  • FETCH_FAILED - Network request failed
  • TIMEOUT - Request timed out
  • INVALID_URL - URL is malformed
  • BLOCKED - Access denied (403)
  • NOT_FOUND - Page not found (404)
  • ROBOTS_BLOCKED - Blocked by robots.txt
  • PARSE_ERROR - HTML parsing failed
  • LLM_ERROR - LLM provider error
  • VALIDATION_ERROR - Schema validation failed
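
Since isRetryable() flags transient failures such as network errors, a simple retry wrapper can be layered on top of scrape(). A minimal sketch; the attempt count and backoff are arbitrary:

import { scrape, ScrapeError } from 'scrapex';

async function scrapeWithRetry(url: string, attempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await scrape(url);
    } catch (error) {
      const retryable = error instanceof ScrapeError && error.isRetryable();
      if (!retryable || attempt >= attempts) throw error;
      // Linear backoff between attempts
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
  }
}

const result = await scrapeWithRetry('https://example.com/article');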

Robots.txt

import { scrape, checkRobotsTxt } from 'scrapex';

// Check before scraping
const check = await checkRobotsTxt('https://example.com/path');
if (check.allowed) {
  const result = await scrape('https://example.com/path');
}

// Or let scrape() handle it
const result = await scrape('https://example.com/path', {
  respectRobots: true, // Throws if blocked
});

Built-in Extractors

| Extractor | Priority | Description |
|------------------|----------|--------------------------|
| MetaExtractor | 100 | OG, Twitter, meta tags |
| JsonLdExtractor | 80 | JSON-LD structured data |
| ContentExtractor | 50 | Readability + Turndown |
| FaviconExtractor | 70 | Favicon discovery |
| LinksExtractor | 30 | Content link extraction |

Configuration

Options

interface ScrapeOptions {
  timeout?: number;              // Default: 10000ms
  userAgent?: string;            // Custom user agent
  extractContent?: boolean;      // Default: true
  maxContentLength?: number;     // Default: 50000 chars
  fetcher?: Fetcher;             // Custom fetcher
  extractors?: Extractor[];      // Additional extractors
  replaceDefaultExtractors?: boolean;
  respectRobots?: boolean;       // Check robots.txt
  llm?: LLMProvider;             // LLM provider
  enhance?: EnhancementType[];   // LLM enhancements
  extract?: ExtractionSchema;    // Structured extraction
  embeddings?: EmbeddingOptions; // Vector embeddings
  normalize?: NormalizeOptions;  // Content normalization
}

Enhancement Types

type EnhancementType =
  | 'summarize'  // Generate summary
  | 'tags'       // Extract keywords/tags
  | 'entities'   // Extract named entities
  | 'classify';  // Classify content type

Requirements

  • Node.js 20+
  • TypeScript 5.0+ (for type imports)

Get Help

Contributing

Contributions are welcome! Bug reports, feature requests, and pull requests are all appreciated.

See CONTRIBUTING.md for guidelines on:

  • Reporting bugs
  • Suggesting features
  • Submitting pull requests
  • Development setup

Support

If you find scrapex useful, consider supporting its development:

Buy Me A Coffee

License

MIT

Author

Rakesh Paul - binaryroute