
@robot-resources/scraper

v0.1.0


Context compression for AI agents. Fetch → Extract → Convert pipeline without LLM dependency.



Reduces web page tokens by 70–80% for AI agent consumption. Features a three-tier fetch with automatic fallback, BFS multi-page crawling, and robots.txt compliance.

Installation

npm install @robot-resources/scraper

Optional peer dependencies (install only what you need):

npm install impit          # Stealth mode — TLS fingerprint impersonation
npm install playwright     # Render mode — headless browser for JS-rendered pages

Quick Start

import { scrape } from '@robot-resources/scraper';

const result = await scrape('https://example.com/article');

console.log(result.markdown);     // Compressed content
console.log(result.tokenCount);   // Estimated tokens
console.log(result.title);        // Page title

Fetch Modes

Control how pages are fetched with the mode option:

| Mode | How | When to use |
|------|-----|-------------|
| 'fast' | Plain HTTP fetch | Default sites, APIs, docs |
| 'stealth' | TLS fingerprint impersonation (impit) | Sites with anti-bot protection |
| 'render' | Headless Playwright browser | JS-rendered SPAs, dynamic content |
| 'auto' | Fast first, falls back to stealth on 403/challenge | Default — best for unknown sites |

// Explicit stealth for a protected site
const result = await scrape('https://protected-site.com', { mode: 'stealth' });

// Auto mode (default) — tries fast, falls back to stealth if blocked
const result = await scrape('https://unknown-site.com');
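
The fallback policy behind 'auto' mode can be pictured with a small self-contained sketch. The mock fetchers and the `looksBlocked` heuristic below are illustrative assumptions, not the package's internals:

```typescript
// Sketch of an 'auto'-style fallback: try the fast tier first and retry
// with the stealth tier when the response looks like an anti-bot block.

type FetchTier = 'fast' | 'stealth';

interface MockResponse {
  status: number;
  html: string;
}

// Hypothetical stand-ins for the real fetch tiers.
function fetchFastMock(url: string): MockResponse {
  return url.includes('protected')
    ? { status: 403, html: '' }            // simulated anti-bot block
    : { status: 200, html: '<p>ok</p>' };
}

function fetchStealthMock(_url: string): MockResponse {
  return { status: 200, html: '<p>ok (stealth)</p>' };
}

function looksBlocked(res: MockResponse): boolean {
  // 403/429 are typical block signals; real checks also sniff challenge pages.
  return res.status === 403 || res.status === 429;
}

function autoFetch(url: string): { res: MockResponse; tier: FetchTier } {
  const fast = fetchFastMock(url);
  if (!looksBlocked(fast)) return { res: fast, tier: 'fast' };
  return { res: fetchStealthMock(url), tier: 'stealth' };
}

const open = autoFetch('https://example.com');
const blocked = autoFetch('https://protected-site.com');
```

The point of the pattern is that the cheap path is always attempted first, so well-behaved sites never pay the stealth tier's overhead.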

Crawling Multiple Pages

import { crawl } from '@robot-resources/scraper';

const result = await crawl({
  url: 'https://docs.example.com',
  depth: 2,            // Max link depth (default: 2)
  limit: 20,           // Max pages (default: 50)
  mode: 'auto',        // Fetch mode per page
  concurrency: 3,      // Parallel fetches (default: 3)
  respectRobots: true,  // Obey robots.txt (default: true)
  include: ['**/docs/**'],   // Only crawl docs paths
  exclude: ['**/archive/**'], // Skip archive
});

console.log(`Crawled ${result.totalCrawled} pages in ${result.duration}ms`);

for (const page of result.pages) {
  console.log(`[depth ${page.depth}] ${page.title}: ${page.tokenCount} tokens`);
}

The crawler uses BFS link discovery, seeds from sitemap.xml when available, and respects crawl-delay from robots.txt.
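
The BFS loop with depth, limit, and exclude filtering can be sketched over an in-memory link graph instead of the network. The names (`graph`, `crawlBfs`) and the prefix-based exclude check are illustrative, not the package's internals:

```typescript
// Breadth-first crawl over a mock site: visit in level order, skip excluded
// paths, stop expanding at maxDepth, and cap the total page count.

const graph: Record<string, string[]> = {
  '/docs': ['/docs/a', '/docs/b', '/archive/old'],
  '/docs/a': ['/docs/b', '/docs/a/deep'],
  '/docs/b': [],
  '/docs/a/deep': [],
  '/archive/old': [],
};

function crawlBfs(start: string, maxDepth: number, limit: number): string[] {
  const visited = new Set<string>([start]);
  const order: string[] = [];
  const frontier: Array<{ url: string; depth: number }> = [{ url: start, depth: 0 }];

  while (frontier.length > 0 && order.length < limit) {
    const { url, depth } = frontier.shift()!;
    if (url.startsWith('/archive')) continue;  // exclude pattern
    order.push(url);
    if (depth >= maxDepth) continue;           // stop expanding at max depth
    for (const next of graph[url] ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        frontier.push({ url: next, depth: depth + 1 });
      }
    }
  }
  return order;
}

const pages = crawlBfs('/docs', 2, 20);
// Visits /docs, /docs/a, /docs/b, /docs/a/deep; /archive/old is excluded.
```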

API

scrape(url, options?)

Fetch a URL and return compressed markdown.

const result = await scrape('https://example.com', {
  mode: 'auto',          // Fetch mode (default: 'auto')
  timeout: 5000,         // Request timeout ms (default: 10000)
  maxRetries: 2,         // Retry attempts (default: 3)
  userAgent: '...',      // Custom user agent
  respectRobots: false,  // Check robots.txt (default: false)
});

Returns: ScrapeResult

interface ScrapeResult {
  markdown: string;      // Compressed content
  tokenCount: number;    // Estimated token count
  title?: string;        // Page title
  author?: string;       // Author if found
  siteName?: string;     // Site name if found
  publishedAt?: string;  // Publish date if found
  url: string;           // Final URL after redirects
}

crawl(options)

BFS multi-page crawl from a starting URL.

const result = await crawl({
  url: 'https://example.com',   // Starting URL (required)
  depth: 2,                     // Max depth (default: 2)
  limit: 50,                    // Max pages (default: 50)
  mode: 'auto',                 // Fetch mode (default: 'auto')
  include: ['**/blog/**'],      // Include patterns (glob)
  exclude: ['**/admin/**'],     // Exclude patterns (glob)
  timeout: 10000,               // Per-page timeout ms
  concurrency: 3,               // Parallel fetches (default: 3)
  respectRobots: true,          // Obey robots.txt (default: true)
});

Returns: CrawlResult

interface CrawlResult {
  pages: CrawlPageResult[];  // Scraped pages (extends ScrapeResult + depth)
  totalDiscovered: number;   // Total URLs found
  totalCrawled: number;      // Successfully scraped
  totalSkipped: number;      // Skipped (robots, filter, limit)
  errors: CrawlError[];      // Per-URL errors
  duration: number;          // Total ms
}

Individual Layers

For advanced usage, use the pipeline layers directly:

import {
  fetchUrl,
  fetchStealth,
  fetchRender,
  extractContent,
  convertToMarkdown,
  estimateTokens,
} from '@robot-resources/scraper';

// Layer 1: Fetch HTML (choose your tier)
const fetched = await fetchUrl('https://example.com');
// or: await fetchStealth(url, options)
// or: await fetchRender(url, options)

// Layer 2: Extract main content
const extracted = await extractContent(fetched);

// Layer 3: Convert to markdown
const converted = await convertToMarkdown(extracted);

// Token estimation
const htmlTokens = estimateTokens(fetched.html);
console.log(`Compressed ${htmlTokens} → ${converted.tokenCount} tokens`);

Robots & Sitemap

import {
  isAllowedByRobots,
  getCrawlDelay,
  getSitemapUrls,
  parseSitemap,
} from '@robot-resources/scraper';

const allowed = await isAllowedByRobots('https://example.com/page');
const delay = await getCrawlDelay('https://example.com');
const entries = await parseSitemap('https://example.com/sitemap.xml');
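
What an `isAllowedByRobots`-style check does under the hood can be sketched in a few lines: parse the `Disallow` rules for the wildcard user-agent group and test a path prefix. This is a simplified assumption of the mechanics; real parsers also handle `Allow`, wildcards, and per-agent groups:

```typescript
// Minimal robots.txt Disallow parser and path check (wildcard agent only).

function parseDisallows(robotsTxt: string): string[] {
  const rules: string[] = [];
  let applies = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim();   // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key.trim())) applies = value === '*';
    else if (applies && /^disallow$/i.test(key.trim()) && value) rules.push(value);
  }
  return rules;
}

function isAllowed(path: string, disallows: string[]): boolean {
  // Disallow rules match by path prefix.
  return !disallows.some((rule) => path.startsWith(rule));
}

const disallows = parseDisallows(`
User-agent: *
Disallow: /admin
Disallow: /tmp/
`);

const adminAllowed = isAllowed('/admin/settings', disallows);  // blocked
const docsAllowed = isAllowed('/docs/intro', disallows);       // allowed
```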

Error Handling

import { scrape, FetchError, ExtractionError } from '@robot-resources/scraper';

try {
  const result = await scrape(url);
} catch (error) {
  if (error instanceof FetchError) {
    console.log('Fetch failed:', error.statusCode, error.retryable);
  }
  if (error instanceof ExtractionError) {
    console.log('Extraction failed:', error.code);
  }
}
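
The `retryable` flag pairs naturally with a retry wrapper. `MyFetchError` below is a locally defined stand-in, not the class exported by the package; the shape of the flag mirrors the example above:

```typescript
// Retry a function while failures are marked retryable, up to maxRetries.

class MyFetchError extends Error {
  constructor(public statusCode: number, public retryable: boolean) {
    super(`fetch failed with ${statusCode}`);
  }
}

function withRetries<T>(fn: (attempt: number) => T, maxRetries: number): T {
  for (let attempt = 0; ; attempt++) {
    try {
      return fn(attempt);
    } catch (err) {
      const retryable = err instanceof MyFetchError && err.retryable;
      if (!retryable || attempt >= maxRetries) throw err;
      // A real implementation would sleep with exponential backoff here.
    }
  }
}

// Succeeds on the third attempt; 503 is treated as retryable.
let calls = 0;
const result = withRetries((attempt) => {
  calls++;
  if (attempt < 2) throw new MyFetchError(503, true);
  return 'ok';
}, 3);
```

Non-retryable errors (and exhausted budgets) are re-thrown so the caller can still branch on `FetchError` vs `ExtractionError` as shown above.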

Token Reduction

| Page Type | HTML Tokens | Markdown Tokens | Reduction |
|-----------|-------------|-----------------|-----------|
| News article | ~15,000 | ~3,000 | 80% |
| Documentation | ~12,000 | ~2,500 | 79% |
| Blog post | ~8,000 | ~1,800 | 77% |
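
These ratios depend on how tokens are estimated. A common heuristic, assumed here rather than confirmed as the package's exact method, is roughly four characters per token for English text; markup inflates the character count without adding readable content:

```typescript
// Rough token estimate (~4 chars/token) and reduction calculation.

function estimateTokensApprox(text: string): number {
  return Math.ceil(text.length / 4);
}

function reductionPercent(htmlTokens: number, markdownTokens: number): number {
  return Math.round((1 - markdownTokens / htmlTokens) * 100);
}

const html = '<div class="post"><p>Hello, world. This is an article.</p></div>';
const markdown = 'Hello, world. This is an article.';

const htmlTokens = estimateTokensApprox(html);  // tags inflate the count
const mdTokens = estimateTokensApprox(markdown);
const saved = reductionPercent(15000, 3000);    // the table's first row
```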

Requirements

  • Node.js 18+
  • ESM or CommonJS

License

MIT