@monostate/node-scraper

Lightning-fast web scraping with intelligent fallback system - 11.35x faster than Firecrawl


Quick Start

Installation

npm install @monostate/node-scraper
# or
yarn add @monostate/node-scraper
# or  
pnpm add @monostate/node-scraper

Fixed in v1.8.1: Critical production fix - browser-pool.js now included in npm package.

New in v1.8.0: Bulk scraping with automatic request queueing, progress tracking, and streaming results! Process hundreds of URLs efficiently. Plus critical memory leak fix with browser pooling.

Fixed in v1.7.0: Critical cross-platform compatibility fix - binaries are now correctly downloaded per platform instead of being bundled.

New in v1.6.0: Method override support! Force specific scraping methods with method parameter for testing and optimization.

New in v1.5.0: AI-powered Q&A! Ask questions about any website using OpenRouter, OpenAI, or built-in AI.

Also in v1.3.0: PDF parsing support added! Automatically extracts text, metadata, and page count from PDF documents.

Also in v1.2.0: Lightpanda binary is now automatically downloaded and configured during installation! No manual setup required.

Zero-Configuration Setup

The package now automatically:

  • Downloads the correct Lightpanda binary for your platform (macOS, Linux, Windows/WSL)
  • Configures binary paths and permissions
  • Validates installation health on first use
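
If you want to confirm the setup succeeded (for example, in CI), here is a minimal sketch using only the documented healthCheck() and cleanup() methods:

import { BNCASmartScraper } from '@monostate/node-scraper';

const scraper = new BNCASmartScraper();
const health = await scraper.healthCheck();
console.log(health.status); // 'healthy' or 'unhealthy'
await scraper.cleanup();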

Basic Usage

import { smartScrape, smartScreenshot, quickShot } from '@monostate/node-scraper';

// Simple one-line scraping
const result = await smartScrape('https://example.com');
console.log(result.content); // Extracted content
console.log(result.method);  // Method used: direct-fetch, lightpanda, or puppeteer

// Take a screenshot
const screenshot = await smartScreenshot('https://example.com');
console.log(screenshot.screenshot); // Base64 encoded image

// Quick screenshot (optimized for speed)
const quick = await quickShot('https://example.com');
console.log(quick.screenshot); // Fast screenshot capture

// PDF parsing (automatic detection)
const pdfResult = await smartScrape('https://example.com/document.pdf');
console.log(pdfResult.content); // Extracted text, metadata, page count

Advanced Usage

import { BNCASmartScraper } from '@monostate/node-scraper';

const scraper = new BNCASmartScraper({
  timeout: 10000,
  verbose: true,
  lightpandaPath: './lightpanda' // optional
});

const result = await scraper.scrape('https://complex-spa.com');
console.log(result.stats); // Performance statistics

await scraper.cleanup(); // Clean up resources

Browser Pool Configuration (New in v1.8.0)

The package now includes automatic browser instance pooling to prevent memory leaks:

// Browser pool is managed automatically with these defaults:
// - Max 3 concurrent browser instances
// - 5 second idle timeout before cleanup
// - Automatic reuse of browser instances

// For heavy workloads, you can manually clean up:
const scraper = new BNCASmartScraper();
// ... perform multiple scrapes ...
await scraper.cleanup(); // Closes all browser instances

Important: The convenience functions (smartScrape, smartScreenshot, etc.) automatically handle cleanup. You only need to call cleanup() when using the BNCASmartScraper class directly.
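
For example (a sketch using only the calls documented in this README):

import { smartScrape, BNCASmartScraper } from '@monostate/node-scraper';

// Convenience function: cleanup is handled for you
const quick = await smartScrape('https://example.com');

// Class usage: clean up once you're done scraping
const scraper = new BNCASmartScraper();
const first = await scraper.scrape('https://example.com/a');
const second = await scraper.scrape('https://example.com/b');
await scraper.cleanup(); // closes pooled browser instances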

Method Override (New in v1.6.0)

Force a specific scraping method instead of using automatic fallback:

// Force direct fetch (no browser)
const result = await smartScrape('https://example.com', { method: 'direct' });

// Force Lightpanda browser
const result = await smartScrape('https://example.com', { method: 'lightpanda' });

// Force Puppeteer (full Chrome)
const result = await smartScrape('https://example.com', { method: 'puppeteer' });

// Auto mode (default - intelligent fallback)
const result = await smartScrape('https://example.com', { method: 'auto' });

Important: When forcing a method, no fallback occurs if it fails. This is useful for:

  • Testing specific methods in isolation
  • Optimizing for known site requirements
  • Debugging method-specific issues

Error Response for Forced Methods:

{
  success: false,
  error: "Lightpanda scraping failed: [specific error]",
  method: "lightpanda",
  errorType: "network|timeout|parsing|service_unavailable",
  details: "Additional error context"
}
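
For instance, a forced-method failure can be handled like this (the URL is a placeholder; the fields follow the error shape above):

const result = await smartScrape('https://example.com', { method: 'lightpanda' });

if (!result.success) {
  // No fallback was attempted because the method was forced
  console.error(`${result.method} failed (${result.errorType}): ${result.error}`);

  // Optionally retry with full Chrome
  const retry = await smartScrape('https://example.com', { method: 'puppeteer' });
}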

Bulk Scraping (New in v1.8.0)

Process multiple URLs efficiently with automatic request queueing and progress tracking:

import { bulkScrape } from '@monostate/node-scraper';

// Basic bulk scraping
const urls = [
  'https://example1.com',
  'https://example2.com',
  'https://example3.com',
  // ... hundreds more
];

const results = await bulkScrape(urls, {
  concurrency: 5,  // Process 5 URLs at a time
  continueOnError: true,  // Don't stop on failures
  progressCallback: (progress) => {
    console.log(`Progress: ${progress.percentage.toFixed(1)}% (${progress.processed}/${progress.total})`);
  }
});

console.log(`Success: ${results.stats.successful}, Failed: ${results.stats.failed}`);
console.log(`Total time: ${results.stats.totalTime}ms`);
console.log(`Average time per URL: ${results.stats.averageTime}ms`);

Streaming Results

For large datasets, use streaming to process results as they complete:

import { bulkScrapeStream } from '@monostate/node-scraper';

await bulkScrapeStream(urls, {
  concurrency: 10,
  onResult: async (result) => {
    // Process each successful result immediately
    await saveToDatabase(result);
    console.log(`✓ ${result.url} - ${result.duration}ms`);
  },
  onError: async (error) => {
    // Handle errors as they occur
    console.error(`✗ ${error.url} - ${error.error}`);
  },
  progressCallback: (progress) => {
    process.stdout.write(`\rProcessing: ${progress.percentage.toFixed(1)}%`);
  }
});

Features:

  • Automatic request queueing (no more memory errors!)
  • Configurable concurrency control
  • Real-time progress tracking
  • Continue on error or stop on first failure
  • Detailed statistics and method tracking
  • Browser instance pooling for efficiency

For detailed examples and advanced usage, see BULK_SCRAPING.md.
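
One pattern these two APIs make easy is a retry pass: collect failures during a streaming run, then re-run just those URLs at lower concurrency. A sketch (assumes urls is defined as above and the onError payload matches the earlier example):

import { bulkScrape, bulkScrapeStream } from '@monostate/node-scraper';

const failed = [];

await bulkScrapeStream(urls, {
  concurrency: 5,
  onResult: async (result) => {
    // persist successful results as they arrive
  },
  onError: async (error) => {
    failed.push(error.url);
  }
});

// Second, gentler pass over the failures
if (failed.length > 0) {
  const retryResults = await bulkScrape(failed, {
    concurrency: 2,
    continueOnError: true
  });
  console.log(`Recovered: ${retryResults.stats.successful}`);
}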

How It Works

BNCA uses a sophisticated multi-tier system with intelligent detection:

1. 🔄 Direct Fetch (Fastest)

  • Pure HTTP requests with intelligent HTML parsing
  • Performance: Sub-second responses
  • Success rate: 75% of websites
  • PDF Detection: Automatically detects PDFs by URL, content-type, or magic bytes

2. 🐼 Lightpanda Browser (Fast)

  • Lightweight browser engine (2-3x faster than Chromium)
  • Performance: Fast JavaScript execution
  • Fallback triggers: SPA detection

3. 🔵 Puppeteer (Complete)

  • Full Chromium browser for maximum compatibility
  • Performance: Complete JavaScript execution
  • Fallback triggers: Complex interactions needed

📄 PDF Parser (Specialized)

  • Automatic PDF detection and parsing
  • Features: Text extraction, metadata, page count
  • Smart Detection: Works even when PDFs are served with wrong content-types
  • Performance: Typically 100-500ms for most PDFs

📸 Screenshot Methods

  • Chrome CLI: Direct Chrome screenshot capture
  • Quickshot: Optimized with retry logic and smart timeouts

📊 Performance Benchmark

| Site Type   | BNCA    | Firecrawl | Speed Advantage |
|-------------|---------|-----------|-----------------|
| Wikipedia   | 154ms   | 4,662ms   | 30.3x faster    |
| Hacker News | 1,715ms | 4,644ms   | 2.7x faster     |
| GitHub      | 9,167ms | 9,790ms   | 1.1x faster     |

Average: 11.35x faster than Firecrawl with 100% reliability

🎛️ API Reference

Convenience Functions

smartScrape(url, options?)

Quick scraping with intelligent fallback.

smartScreenshot(url, options?)

Take a screenshot of any webpage.

quickShot(url, options?)

Optimized screenshot capture for maximum speed.

Parameters:

  • url (string): URL to scrape/capture
  • options (object, optional): Configuration options

Returns: Promise (resolves to a result object; key fields shown below)
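
The resolved object isn't fully specified here, but based on the examples elsewhere in this README it includes at least these fields:

const result = await smartScrape('https://example.com');

console.log(result.success);               // true on success
console.log(result.content);               // extracted content
console.log(result.method);                // 'direct-fetch', 'lightpanda', or 'puppeteer'
console.log(result.performance.totalTime); // total scrape time in ms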

BNCASmartScraper

Main scraper class with advanced features.

Constructor Options

const scraper = new BNCASmartScraper({
  timeout: 10000,           // Request timeout in ms
  retries: 2,               // Number of retries per method
  verbose: false,           // Enable detailed logging
  lightpandaPath: './lightpanda', // Path to Lightpanda binary
  userAgent: 'Mozilla/5.0 ...',   // Custom user agent
});

Methods

scraper.scrape(url, options?)

Scrape a URL with intelligent fallback.

const result = await scraper.scrape('https://example.com');

scraper.screenshot(url, options?)

Take a screenshot of a webpage.

const result = await scraper.screenshot('https://example.com');
const img = result.screenshot; // data:image/png;base64,...

scraper.quickshot(url, options?)

Quick screenshot capture, optimized for speed with retry logic.

const result = await scraper.quickshot('https://example.com');
// 2-3x faster than regular screenshot

scraper.getStats()

Get performance statistics.

const stats = scraper.getStats();
console.log(stats.successRates); // Success rates by method

scraper.healthCheck()

Check availability of all scraping methods.

const health = await scraper.healthCheck();
console.log(health.status); // 'healthy' or 'unhealthy'

scraper.cleanup()

Clean up resources (close browser instances).

await scraper.cleanup();

AI-Powered Q&A

Ask questions about any website and get AI-generated answers:

// Method 1: Using your own OpenRouter API key
const scraper = new BNCASmartScraper({
  openRouterApiKey: 'your-openrouter-api-key'
});
const result = await scraper.askAI('https://example.com', 'What is this website about?');

// Method 2: Using OpenAI API (or compatible endpoints)
const scraper = new BNCASmartScraper({
  openAIApiKey: 'your-openai-api-key',
  // Optional: Use a compatible endpoint like Groq, Together AI, etc.
  openAIBaseUrl: 'https://api.groq.com/openai'
});
const result = await scraper.askAI('https://example.com', 'What services do they offer?');

// Method 3: One-liner with OpenRouter
import { askWebsiteAI } from '@monostate/node-scraper';
const answer = await askWebsiteAI('https://example.com', 'What is the main topic?', {
  openRouterApiKey: process.env.OPENROUTER_API_KEY
});

// Method 4: Using BNCA backend API (requires BNCA API key)
const scraper = new BNCASmartScraper({
  apiKey: 'your-bnca-api-key'
});
const result = await scraper.askAI('https://example.com', 'What products are featured?');

API Key Priority:

  1. OpenRouter API key (openRouterApiKey)
  2. OpenAI API key (openAIApiKey)
  3. BNCA backend API (apiKey)
  4. Local fallback (pattern matching - no API key required)
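
In practice you can pass several keys and let the priority order decide. A sketch using environment variables (the variable names are your choice):

const scraper = new BNCASmartScraper({
  openRouterApiKey: process.env.OPENROUTER_API_KEY, // 1st priority if set
  openAIApiKey: process.env.OPENAI_API_KEY,         // 2nd priority
  apiKey: process.env.BNCA_API_KEY                  // 3rd priority
  // With none set, the local pattern-matching fallback is used
});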

Configuration Options:

const result = await scraper.askAI(url, question, {
  // OpenRouter specific
  openRouterApiKey: 'sk-or-...',
  model: 'meta-llama/llama-4-scout:free', // Default model
  
  // OpenAI specific
  openAIApiKey: 'sk-...',
  openAIBaseUrl: 'https://api.openai.com', // Or compatible endpoint
  model: 'gpt-3.5-turbo',
  
  // Shared options
  temperature: 0.3,
  maxTokens: 500
});

Response Format:

{
  success: true,
  answer: "This website is about...",
  method: "direct-fetch",     // Scraping method used
  scrapeTime: 1234,          // Time to scrape in ms
  processing: "openrouter"   // AI processing method used
}

📄 PDF Support

BNCA automatically detects and parses PDF documents:

const pdfResult = await smartScrape('https://example.com/document.pdf');

// Parsed content includes:
const content = JSON.parse(pdfResult.content);
console.log(content.title);          // PDF title
console.log(content.author);         // Author name
console.log(content.pages);          // Number of pages
console.log(content.text);           // Full extracted text
console.log(content.creationDate);   // Creation date
console.log(content.metadata);       // Additional metadata

PDF Detection Methods:

  • URL ending with .pdf
  • Content-Type header application/pdf
  • Binary content starting with %PDF (magic bytes)
  • Works with PDFs served as application/octet-stream (e.g., GitHub raw files)

Limitations:

  • Maximum file size: 20MB
  • Text extraction only (no image OCR)
  • Requires pdf-parse dependency (automatically installed)
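
Since oversized or malformed PDFs can fail to parse, it's worth guarding the result. A sketch (the URL is a placeholder; the error fields follow the shape shown earlier):

const pdfResult = await smartScrape('https://example.com/large-report.pdf');

if (pdfResult.success) {
  const content = JSON.parse(pdfResult.content);
  console.log(`${content.pages} pages extracted`);
} else {
  // e.g. the 20MB size limit was exceeded
  console.error(pdfResult.error);
}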

📱 Next.js Integration

API Route Example

// app/api/scrape/route.js (App Router; Pages Router API routes use a default handler(req, res) export instead)
import { smartScrape } from '@monostate/node-scraper';

export async function POST(request) {
  try {
    const { url } = await request.json();
    const result = await smartScrape(url);
    
    return Response.json({
      success: true,
      data: result.content,
      method: result.method,
      time: result.performance.totalTime
    });
  } catch (error) {
    return Response.json({
      success: false,
      error: error.message
    }, { status: 500 });
  }
}

React Hook Example

// hooks/useScraper.js
import { useState } from 'react';

export function useScraper() {
  const [loading, setLoading] = useState(false);
  const [data, setData] = useState(null);
  const [error, setError] = useState(null);

  const scrape = async (url) => {
    setLoading(true);
    setError(null);
    
    try {
      const response = await fetch('/api/scrape', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ url })
      });
      
      const result = await response.json();
      
      if (result.success) {
        setData(result.data);
      } else {
        setError(result.error);
      }
    } catch (err) {
      setError(err.message);
    } finally {
      setLoading(false);
    }
  };

  return { scrape, loading, data, error };
}

Component Example

// components/ScraperDemo.jsx
import { useState } from 'react';
import { useScraper } from '../hooks/useScraper';

export default function ScraperDemo() {
  const { scrape, loading, data, error } = useScraper();
  const [url, setUrl] = useState('');

  const handleScrape = () => {
    if (url) scrape(url);
  };

  return (
    <div className="p-4">
      <div className="flex gap-2 mb-4">
        <input
          type="url"
          value={url}
          onChange={(e) => setUrl(e.target.value)}
          placeholder="Enter URL to scrape..."
          className="flex-1 px-3 py-2 border rounded"
        />
        <button
          onClick={handleScrape}
          disabled={loading}
          className="px-4 py-2 bg-blue-500 text-white rounded disabled:opacity-50"
        >
          {loading ? 'Scraping...' : 'Scrape'}
        </button>
      </div>
      
      {error && (
        <div className="p-3 bg-red-100 text-red-700 rounded mb-4">
          Error: {error}
        </div>
      )}
      
      {data && (
        <div className="p-3 bg-green-100 rounded">
          <h3 className="font-bold mb-2">Scraped Content:</h3>
          <pre className="text-sm overflow-auto">{data}</pre>
        </div>
      )}
    </div>
  );
}

⚠️ Important Notes

Server-Side Only

BNCA is designed for server-side use only due to:

  • Browser automation requirements (Puppeteer)
  • File system access for Lightpanda binary
  • CORS restrictions in browsers

Next.js Deployment

  • Use in API routes, not client components
  • Ensure Node.js 18+ in production environment
  • Consider adding Lightpanda binary to deployment
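
If your deployment bundles server code (e.g. serverless output tracing), excluding the package from bundling usually helps. A minimal sketch for Next.js 13/14; the option moved to top-level serverExternalPackages in Next.js 15, so verify against your version:

// next.config.js
module.exports = {
  experimental: {
    serverComponentsExternalPackages: ['@monostate/node-scraper']
  }
};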

Lightpanda Setup (Optional)

For maximum performance, install Lightpanda:

# macOS ARM64
curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-aarch64-macos
chmod +x lightpanda

# Linux x64
curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-x86_64-linux
chmod +x lightpanda
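
Then point the scraper at the downloaded binary via the lightpandaPath constructor option shown earlier:

const scraper = new BNCASmartScraper({
  lightpandaPath: './lightpanda'
});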

🔒 Privacy & Security

  • No external API calls for scraping - all processing is local (the optional AI Q&A feature calls only the provider you configure)
  • No data collection - your data stays private
  • Respects robots.txt (optional enforcement)
  • Configurable rate limiting

📝 TypeScript Support

Full TypeScript definitions included:

import { BNCASmartScraper, ScrapingResult, ScrapingOptions } from '@monostate/node-scraper';

const scraper: BNCASmartScraper = new BNCASmartScraper({
  timeout: 5000,
  verbose: true
});

const result: ScrapingResult = await scraper.scrape('https://example.com');

Changelog

v1.8.1 (Latest)

  • Critical Production Fix: browser-pool.js is now included in the npm package

v1.8.0

  • Bulk Scraping: Automatic request queueing, progress tracking, and streaming results for processing hundreds of URLs
  • Browser Pooling: Automatic browser instance pooling fixes a critical memory leak

v1.7.0

  • Cross-Platform Fix: Binaries are now correctly downloaded per platform instead of being bundled

v1.6.0

  • Method Override: Force specific scraping methods with method parameter
  • Enhanced Error Handling: Categorized error types for better debugging
  • Fallback Chain Tracking: See which methods were attempted in auto mode
  • Graceful Failures: No automatic fallback when method is forced

v1.5.0

  • AI-Powered Q&A: Ask questions about any website and get AI-generated answers
  • OpenRouter Support: Native integration with OpenRouter API for advanced AI models
  • OpenAI Support: Compatible with OpenAI and OpenAI-compatible endpoints (Groq, Together AI, etc.)
  • Smart Fallback: Automatic fallback chain: OpenRouter -> OpenAI -> Backend API -> Local processing
  • One-liner AI: New askWebsiteAI() convenience function for quick AI queries
  • Enhanced TypeScript: Complete type definitions for all AI features

v1.4.0

  • Internal release (skipped for public release)

v1.3.0

  • PDF Support: Full PDF parsing with text extraction, metadata, and page count
  • Smart PDF Detection: Detects PDFs by URL patterns, content-type, or magic bytes
  • Robust Parsing: Handles PDFs served with incorrect content-types (e.g., GitHub raw files)
  • Fast Performance: PDF parsing typically completes in 100-500ms
  • Comprehensive Extraction: Title, author, creation date, page count, and full text

v1.2.0

  • Auto-Installation: Lightpanda binary is now automatically downloaded during npm install
  • Cross-Platform Support: Automatic detection and installation for macOS, Linux, and Windows/WSL
  • Improved Performance: Enhanced binary detection and ES6 module compatibility
  • Better Error Handling: More robust installation scripts with retry logic
  • Zero Configuration: No manual setup required - works out of the box

v1.1.1

  • Bug fixes and stability improvements
  • Enhanced Puppeteer integration

v1.1.0

  • Added screenshot capabilities
  • Improved fallback system
  • Performance optimizations

🤝 Contributing

See the main repository for contribution guidelines.

📄 License

MIT License - see LICENSE file for details.


Built with ❤️ for fast, reliable web scraping

⭐ Star on GitHub | 📖 Full Documentation