@robot-resources/scraper

v0.5.0, published 3 months ago

Context compression for AI agents. Fetch -> Extract -> Convert pipeline without LLM dependency.

Vulnerabilities: 0 high, 0 medium, 0 low

Maintainers: maudrani, manusovi6

Keywords: ai agents context compression markdown web scraping tokens llm


Median token reduction of 91% for AI agent consumption, verified across 41 pages (see Token Reduction). Includes a 3-tier fetch with auto-fallback, BFS multi-page crawl, and robots.txt compliance.

Installation

npm install @robot-resources/scraper

Optional peer dependencies (install only what you need):

npm install impit          # Stealth mode — TLS fingerprint impersonation
npm install playwright     # Render mode — headless browser for JS-rendered pages

Quick Start

import { scrape } from '@robot-resources/scraper';

const result = await scrape('https://example.com/article');

console.log(result.markdown);     // Compressed content
console.log(result.tokenCount);   // Estimated tokens
console.log(result.title);        // Page title

Fetch Modes

Control how pages are fetched with the mode option:

| Mode | How | When to use |
|------|-----|-------------|
| 'fast' | Plain HTTP fetch | Standard sites, APIs, docs |
| 'stealth' | TLS fingerprint impersonation (impit) | Sites with anti-bot protection |
| 'render' | Headless Playwright browser | JS-rendered SPAs, dynamic content |
| 'auto' | Fast first, falls back to stealth on 403/challenge | Default mode; best for unknown sites |

// Explicit stealth for a protected site
const protectedResult = await scrape('https://protected-site.com', { mode: 'stealth' });

// Auto mode (default): tries fast, falls back to stealth if blocked
const autoResult = await scrape('https://unknown-site.com');
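
The fallback decision behind 'auto' mode can be pictured as a loop over fetch tiers. The following is a minimal sketch under stated assumptions, not the package's internals: `looksBlocked`, `pickMode`, and the challenge markers are hypothetical, and the `attempt` callback is synchronous only to keep the example short (real fetches are asynchronous).

```typescript
type FetchMode = 'fast' | 'stealth' | 'render';

interface FetchOutcome {
  status: number;
  body: string;
}

// Hypothetical markers; the package's real challenge detection is not documented here.
const CHALLENGE_MARKERS = ['captcha', 'challenge'];

// Treats a 403 or a body containing a challenge marker as "blocked".
function looksBlocked(outcome: FetchOutcome): boolean {
  if (outcome.status === 403) return true;
  const body = outcome.body.toLowerCase();
  return CHALLENGE_MARKERS.some((marker) => body.includes(marker));
}

// Tries each tier in order and returns the first mode whose outcome is not blocked.
function pickMode(
  tiers: FetchMode[],
  attempt: (mode: FetchMode) => FetchOutcome,
): FetchMode | undefined {
  for (const mode of tiers) {
    if (!looksBlocked(attempt(mode))) return mode;
  }
  return undefined; // every tier was blocked
}

// Usage: the fast tier gets a 403, so the stealth tier is selected.
const chosenMode = pickMode(['fast', 'stealth'], (mode) =>
  mode === 'fast'
    ? { status: 403, body: '' }
    : { status: 200, body: '<html>ok</html>' },
);
console.log(chosenMode);
```

Passing the tier list in explicitly keeps the order of escalation visible to the caller, which is one way to reason about when the heavier modes are worth their cost.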

Crawling Multiple Pages

import { crawl } from '@robot-resources/scraper';

const result = await crawl({
  url: 'https://docs.example.com',
  depth: 2,            // Max link depth (default: 2)
  limit: 20,           // Max pages (default: 50)
  mode: 'auto',        // Fetch mode per page
  concurrency: 3,      // Parallel fetches (default: 3)
  respectRobots: true,  // Obey robots.txt (default: true)
  include: ['**/docs/**'],   // Only crawl docs paths
  exclude: ['**/archive/**'], // Skip archive
});

console.log(`Crawled ${result.totalCrawled} pages in ${result.duration}ms`);

for (const page of result.pages) {
  console.log(`[depth ${page.depth}] ${page.title}: ${page.tokenCount} tokens`);
}

The crawler uses BFS link discovery, seeds from sitemap.xml when available, and respects crawl-delay from robots.txt.
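
To illustrate how breadth-first discovery interacts with the depth and limit options, here is a minimal sketch over an in-memory link graph. It is not the package's crawler: `bfsCrawl`, `LinkGraph`, and `Visit` are hypothetical names, and sitemap seeding, crawl-delay, and network access are left out.

```typescript
// Maps each URL to the links found on that page (stand-in for real fetching).
type LinkGraph = Record<string, string[]>;

interface Visit {
  url: string;
  depth: number;
}

function bfsCrawl(
  graph: LinkGraph,
  start: string,
  maxDepth: number,
  limit: number,
): Visit[] {
  const seen = new Set<string>([start]);
  const queue: Visit[] = [{ url: start, depth: 0 }];
  const pages: Visit[] = [];

  while (queue.length > 0 && pages.length < limit) {
    const current = queue.shift()!; // FIFO order gives breadth-first traversal
    pages.push(current);

    // Pages at the maximum depth are kept, but their links are not followed.
    if (current.depth >= maxDepth) continue;

    for (const link of graph[current.url] ?? []) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push({ url: link, depth: current.depth + 1 });
      }
    }
  }
  return pages;
}

const siteGraph: LinkGraph = {
  '/': ['/a', '/b'],
  '/a': ['/c'],
  '/b': ['/c'],
  '/c': ['/d'],
};
console.log(bfsCrawl(siteGraph, '/', 2, 10).map((page) => page.url));
```

Because the queue is first-in, first-out, all pages at depth 1 are visited before any page at depth 2, so a small limit favors pages close to the starting URL.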

API

`scrape(url, options?)`

Fetch a URL and return compressed markdown.

const result = await scrape('https://example.com', {
  mode: 'auto',          // Fetch mode (default: 'auto')
  timeout: 5000,         // Request timeout ms (default: 10000)
  maxRetries: 2,         // Retry attempts (default: 3)
  userAgent: '...',      // Custom user agent
  respectRobots: false,  // Check robots.txt (default: false)
});

Returns: ScrapeResult

interface ScrapeResult {
  markdown: string;      // Compressed content
  tokenCount: number;    // Estimated token count
  title?: string;        // Page title
  author?: string;       // Author if found
  siteName?: string;     // Site name if found
  publishedAt?: string;  // Publish date if found
  url: string;           // Final URL after redirects
}
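
Since each result carries a tokenCount, one natural use is fitting several pages into a fixed context budget. The sketch below assumes only the `url` and `tokenCount` fields shown in ScrapeResult; `BudgetedPage` and `packWithinBudget` are hypothetical helpers, not part of the package.

```typescript
// Subset of the ScrapeResult fields needed for budgeting.
interface BudgetedPage {
  url: string;
  tokenCount: number;
}

// Greedily keeps pages, in the given order, while they still fit in the budget.
function packWithinBudget(pages: BudgetedPage[], budget: number): BudgetedPage[] {
  const chosen: BudgetedPage[] = [];
  let used = 0;
  for (const page of pages) {
    if (used + page.tokenCount <= budget) {
      chosen.push(page);
      used += page.tokenCount;
    }
  }
  return chosen;
}

const candidatePages: BudgetedPage[] = [
  { url: 'https://example.com/a', tokenCount: 400 },
  { url: 'https://example.com/b', tokenCount: 700 },
  { url: 'https://example.com/c', tokenCount: 300 },
];
// With a budget of 800 tokens, page b does not fit after a, but c still does.
console.log(packWithinBudget(candidatePages, 800).map((page) => page.url));
```

A greedy pass is not optimal in general, but it preserves the caller's ordering, which usually reflects relevance.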

`crawl(options)`

BFS multi-page crawl from a starting URL.

const result = await crawl({
  url: 'https://example.com',   // Starting URL (required)
  depth: 2,                     // Max depth (default: 2)
  limit: 50,                    // Max pages (default: 50)
  mode: 'auto',                 // Fetch mode (default: 'auto')
  include: ['**/blog/**'],      // Include patterns (glob)
  exclude: ['**/admin/**'],     // Exclude patterns (glob)
  timeout: 10000,               // Per-page timeout ms
  concurrency: 3,               // Parallel fetches (default: 3)
  respectRobots: true,          // Obey robots.txt (default: true)
});

Returns: CrawlResult

interface CrawlResult {
  pages: CrawlPageResult[];  // Scraped pages (extends ScrapeResult + depth)
  totalDiscovered: number;   // Total URLs found
  totalCrawled: number;      // Successfully scraped
  totalSkipped: number;      // Skipped (robots, filter, limit)
  errors: CrawlError[];      // Per-URL errors
  duration: number;          // Total ms
}
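
The include and exclude options take glob patterns. How the package evaluates them (glob dialect, full URL versus path) is not stated here, so the following is only a sketch under assumptions: `**` matches any characters including `/`, `*` matches within one path segment, every other character is literal, patterns are tested against the full URL, and exclude takes precedence. `globToRegExp` and `shouldCrawl` are hypothetical names.

```typescript
// Converts a simple glob into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters except '*', which is handled below.
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, '\\$&');
  const pattern = escaped
    .replace(/\*\*/g, '\u0000') // protect '**' before handling single '*'
    .replace(/\*/g, '[^/]*') // '*' stays within one path segment
    .replace(/\u0000/g, '.*'); // '**' may span segments
  return new RegExp(`^${pattern}$`);
}

// Exclude wins over include; an empty include list allows everything.
function shouldCrawl(
  url: string,
  include: string[] = [],
  exclude: string[] = [],
): boolean {
  const matchesAny = (globs: string[]) =>
    globs.some((glob) => globToRegExp(glob).test(url));
  if (matchesAny(exclude)) return false;
  return include.length === 0 || matchesAny(include);
}

console.log(
  shouldCrawl('https://docs.example.com/docs/intro', ['**/docs/**'], ['**/archive/**']),
);
```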

Individual Layers

For advanced usage, use the pipeline layers directly:

import {
  fetchUrl,
  fetchStealth,
  fetchRender,
  extractContent,
  convertToMarkdown,
  estimateTokens,
} from '@robot-resources/scraper';

// Layer 1: Fetch HTML (choose your tier)
const fetched = await fetchUrl('https://example.com');
// or: await fetchStealth(url, options)
// or: await fetchRender(url, options)

// Layer 2: Extract main content
const extracted = await extractContent(fetched);

// Layer 3: Convert to markdown
const converted = await convertToMarkdown(extracted);

// Token estimation
const htmlTokens = estimateTokens(fetched.html);
console.log(`Compressed ${htmlTokens} → ${converted.tokenCount} tokens`);
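
How estimateTokens computes its estimate is not described here. For intuition only, a common rough heuristic is about four characters per token for English text; the sketch below uses that assumption, and `roughTokenEstimate` is a hypothetical stand-in rather than the package's function.

```typescript
// Rough heuristic: about 4 characters per token (an assumption, not the package's method).
const CHARS_PER_TOKEN = 4;

function roughTokenEstimate(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

const rawHtml = '<html><body><nav>Menu</nav><article>Hello world</article></body></html>';
const mainText = 'Hello world';
console.log(
  `Roughly ${roughTokenEstimate(rawHtml)} tokens of HTML vs ${roughTokenEstimate(mainText)} tokens of content`,
);
```

Character-based estimates drift for code, non-English text, and markup-heavy input, so they are best used for comparisons rather than exact budgeting.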

Robots & Sitemap

import {
  isAllowedByRobots,
  getCrawlDelay,
  getSitemapUrls,
  parseSitemap,
} from '@robot-resources/scraper';

const allowed = await isAllowedByRobots('https://example.com/page');
const delay = await getCrawlDelay('https://example.com');
const entries = await parseSitemap('https://example.com/sitemap.xml');
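
For readers unfamiliar with robots.txt, the kind of check behind isAllowedByRobots and getCrawlDelay can be sketched as follows. This is a deliberately simplified parser, not the package's implementation: it handles one User-agent line per group, plain prefix matching for Disallow, and Crawl-delay, and it ignores Allow rules and wildcards. `parseRobots`, `isPathAllowed`, and `RobotsRules` are hypothetical names.

```typescript
interface RobotsRules {
  disallow: string[];
  crawlDelay?: number; // seconds
}

// Collects Disallow and Crawl-delay directives for a single user agent.
function parseRobots(text: string, agent: string = '*'): RobotsRules {
  const rules: RobotsRules = { disallow: [] };
  let applies = false;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const separator = line.indexOf(':');
    if (separator === -1) continue;

    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();

    if (field === 'user-agent') {
      applies = value === agent;
    } else if (applies && field === 'disallow' && value !== '') {
      rules.disallow.push(value);
    } else if (applies && field === 'crawl-delay' && value !== '') {
      const seconds = Number(value);
      if (!Number.isNaN(seconds)) rules.crawlDelay = seconds;
    }
  }
  return rules;
}

// A path is allowed unless it starts with one of the disallowed prefixes.
function isPathAllowed(path: string, rules: RobotsRules): boolean {
  return !rules.disallow.some((prefix) => path.startsWith(prefix));
}

const sampleRobots = 'User-agent: *\nDisallow: /admin\nCrawl-delay: 2\n';
const sampleRules = parseRobots(sampleRobots);
console.log(isPathAllowed('/docs/intro', sampleRules), sampleRules.crawlDelay);
```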

Error Handling

import { scrape, FetchError, ExtractionError } from '@robot-resources/scraper';

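const url = 'https://example.com/article';

try {
  const result = await scrape(url);
  console.log(result.markdown);
} catch (error) {
  if (error instanceof FetchError) {
    // The fetch layer failed; statusCode and retryable describe the failure
    console.log('Fetch failed:', error.statusCode, error.retryable);
  } else if (error instanceof ExtractionError) {
    // The page was fetched but content extraction failed
    console.log('Extraction failed:', error.code);
  } else {
    throw error; // Not a scraper error; let it propagate
  }
}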
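
When a FetchError is marked retryable, callers often wait with exponential backoff before trying again. The sketch below shows one conventional policy; it is an assumption for illustration, not the package's retry logic. Which status codes the package treats as retryable is not stated here, and `isRetryableStatus` and `backoffDelay` are hypothetical helpers.

```typescript
// Conventional choice: timeouts, rate limits, and server errors are worth retrying.
function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || (status >= 500 && status < 600);
}

// Exponential backoff: baseMs, 2x baseMs, 4x baseMs, ... capped at capMs.
function backoffDelay(attempt: number, baseMs: number = 500, capMs: number = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Delay schedule for three retries after a 503 response.
if (isRetryableStatus(503)) {
  const schedule = [0, 1, 2].map((attempt) => backoffDelay(attempt));
  console.log(schedule); // [500, 1000, 2000]
}
```

Capping the delay keeps a long retry sequence from stalling a crawl, at the cost of hitting a struggling server somewhat more often.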

Token Reduction

Verified across 41 pages (March 2026):

| Page Type | HTML Tokens | Scraper Tokens | Reduction |
|-----------|-------------|----------------|-----------|
| Landing pages & SPAs | ~237,000 | ~380 | 99% |
| GitHub repositories | ~110,000 | ~479 | 99% |
| API reference (MDN) | ~55,000 | ~6,349 | 88% |
| Wikipedia articles | ~187,000 | ~42,039 | 77% |
| Blog posts & essays | ~20,000 | ~15,639 | 22-92% |
| Median across all types | | | 91% |
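
The reduction column follows from the two token counts as 1 minus the ratio of scraper tokens to HTML tokens. A minimal sketch of that arithmetic, plus a median helper, is shown below; `reductionPercent` and `median` are hypothetical helpers, and because the table values are approximate, recomputed percentages can differ from the published ones by about a point.

```typescript
// Percentage of tokens removed when going from raw HTML to scraper output.
function reductionPercent(htmlTokens: number, outputTokens: number): number {
  if (htmlTokens <= 0) throw new RangeError('htmlTokens must be positive');
  return (1 - outputTokens / htmlTokens) * 100;
}

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  if (values.length === 0) throw new RangeError('median of an empty list');
  const sorted = values.slice().sort((a, b) => a - b);
  const middle = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[middle]
    : (sorted[middle - 1] + sorted[middle]) / 2;
}

// API reference row: ~55,000 HTML tokens down to ~6,349 scraper tokens.
console.log(reductionPercent(55000, 6349).toFixed(1)); // "88.5"
```

Note that the 91% figure is a median over the 41 individual pages, so it cannot be recomputed from the per-type rows alone.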

Requirements

Node.js 18+
ESM or CommonJS

Related

@robot-resources/scraper-mcp - MCP server for AI agents
@robot-resources/scraper-tracking - Usage tracking
scraper.robotresources.ai - Hosted API
Robot Resources - Human Resources, but for your AI agents

License

MIT
