@monostate/node-scraper
Intelligent web scraping with multi-tier fallback — 11x faster than traditional scrapers
Install
```bash
npm install @monostate/node-scraper
```

LightPanda is downloaded automatically on install. Puppeteer is an optional peer dependency for full browser fallback.
Usage
```js
import { smartScrape, smartScreenshot, quickShot } from '@monostate/node-scraper';

// Scrape with automatic method selection
const result = await smartScrape('https://example.com');
console.log(result.content);
console.log(result.method); // 'direct-fetch' | 'lightpanda' | 'puppeteer'

// Screenshots
const screenshot = await smartScreenshot('https://example.com');
const quick = await quickShot('https://example.com'); // optimized for speed

// PDFs are detected and parsed automatically
const pdf = await smartScrape('https://example.com/doc.pdf');
```

Force a specific method

```js
const result = await smartScrape('https://example.com', { method: 'direct' });
// Also: 'lightpanda', 'puppeteer', 'auto' (default)
```

No fallback occurs when a method is forced — useful for testing and debugging.
Advanced usage
```js
import { BNCASmartScraper } from '@monostate/node-scraper';

const scraper = new BNCASmartScraper({
  timeout: 10000,
  verbose: true,
});

const result = await scraper.scrape('https://complex-spa.com');
const stats = scraper.getStats();
const health = await scraper.healthCheck();
await scraper.cleanup();
```

Bulk scraping
```js
import { bulkScrape, bulkScrapeStream } from '@monostate/node-scraper';

const results = await bulkScrape(urls, {
  concurrency: 5,
  continueOnError: true,
  progressCallback: (p) => console.log(`${p.percentage.toFixed(1)}%`),
});

// Or stream results as they complete
await bulkScrapeStream(urls, {
  concurrency: 10,
  onResult: async (result) => await saveToDatabase(result),
  onError: async (error) => console.error(error.url, error.error),
});
```

See BULK_SCRAPING.md for full documentation.
Browser sessions
Persistent browser sessions with real-time control. Three modes: headless, visual, and computer-use (covered in the next section).
```js
import { createSession } from '@monostate/node-scraper';

// Headless (default) — LightPanda with Chrome fallback
const session = await createSession({ mode: 'auto' });
await session.goto('https://example.com');
const content = await session.extractContent();
const state = await session.getPageState({ includeScreenshot: true });
await session.close();

// Visual — Chrome with headless: false for dev/debug
const visual = await createSession({ mode: 'visual' });
await visual.goto('https://example.com');
await visual.screenshot(); // real Chrome rendering
await visual.close();
```

Session methods: goto, click, type, scroll, hover, select, pressKey, goBack, goForward, screenshot, extractContent, getPageState, waitFor, evaluate, getCookies, setCookies.
Computer use (coordinate-based browser control)
For AI agents that navigate by pixel coordinates — useful for anti-bot sites, dynamic UIs, or anything that can't be scraped with selectors.
```js
import { createSession, LocalProvider } from '@monostate/node-scraper';

// LocalProvider runs Xvfb + Chrome + xdotool (Linux only)
const session = await createSession({
  mode: 'computer-use',
  provider: new LocalProvider({ screenWidth: 1280, screenHeight: 800, enableVnc: true }),
});

await session.goto('https://example.com');

// Coordinate-based actions (delegated to provider)
await session.clickAt(640, 400);
await session.typeText('hello world');
await session.mouseMove(100, 200);
await session.drag(10, 20, 300, 400);
await session.scrollAt(640, 400, 'down', 5);
const pos = await session.getCursorPosition();
const size = await session.getScreenSize();

// Selector-based actions still work (via Puppeteer CDP)
await session.click('#submit');
await session.type('#search', 'query');

// VNC streaming URL (if provider supports it)
console.log(session.getVncUrl());

await session.close();
```

Custom providers
Implement ComputerUseProvider to connect any VM/container backend:
```js
import { ComputerUseProvider } from '@monostate/node-scraper';

class MyProvider extends ComputerUseProvider {
  async start() {
    // Provision VM, return { cdpUrl, vncUrl, screenSize }
  }
  async mouseClick(x, y, button) { /* ... */ }
  async screenshot() { /* ... */ }
  async stop() { /* cleanup */ }
}
```

AI-powered Q&A
Ask questions about any website using OpenRouter, OpenAI, or local fallback:
```js
import { askWebsiteAI } from '@monostate/node-scraper';

const answer = await askWebsiteAI('https://example.com', 'What is this site about?', {
  openRouterApiKey: process.env.OPENROUTER_API_KEY,
});
```

API key priority: OpenRouter > OpenAI > BNCA backend > local pattern matching (no key needed).
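That priority order can be sketched as a simple resolver — `resolveAIBackend` and the `bncaApiKey` option name are illustrative, not the package's actual internals:

```javascript
// Sketch of the documented key priority:
// OpenRouter > OpenAI > BNCA backend > local pattern matching.
function resolveAIBackend(opts = {}) {
  if (opts.openRouterApiKey) return 'openrouter';
  if (opts.openAIApiKey) return 'openai';
  if (opts.bncaApiKey) return 'bnca';     // hypothetical option name
  return 'local';                          // no key needed
}
```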
How it works
The scraper uses a three-tier fallback system:
1. Direct fetch — pure HTTP with HTML parsing. Sub-second, handles ~75% of sites.
2. LightPanda — lightweight browser engine, 2-3x faster than Chromium. Handles SPAs.
3. Puppeteer — full Chromium for maximum compatibility.
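The tiers above amount to "try the cheapest method first, escalate on failure". A minimal sketch of that control flow (the `methods` entries stand in for the real scrapers):

```javascript
// Sketch: try each method in order until one succeeds, recording
// why each earlier tier failed.
async function scrapeWithFallback(url, methods) {
  const errors = [];
  for (const { name, fn } of methods) {
    try {
      const content = await fn(url);
      return { success: true, method: name, content };
    } catch (err) {
      errors.push({ method: name, error: err.message }); // fall through
    }
  }
  return { success: false, errors };
}
```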
Additional specialized handlers:
- PDF parser — Automatic detection by URL, content-type, or magic bytes. Extracts text, metadata, and page count.
- Screenshots — Chrome CLI capture with retry logic and smart timeouts.
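Magic-byte detection relies on the fact that every PDF file begins with the bytes `%PDF-`. A minimal sniff (illustrative, not the package's parser):

```javascript
// Sketch: check the first five bytes for the PDF file signature.
function looksLikePdf(buf) {
  return buf.length >= 5 && buf.subarray(0, 5).toString('latin1') === '%PDF-';
}
```

This catches PDFs served with a misleading URL or a generic content-type header.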
Browser instances are pooled (max 3, 5s idle timeout) to prevent memory leaks.
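The pooling idea (cap live instances, close them after a short idle period) can be sketched as follows — this is an illustrative model, not the package's internal implementation:

```javascript
// Sketch: at most `max` browsers alive; released browsers are closed
// after `idleMs` unless re-acquired first.
class BrowserPool {
  constructor({ max = 3, idleMs = 5000, launch }) {
    this.max = max;
    this.idleMs = idleMs;
    this.launch = launch;   // async factory for a new browser
    this.idle = [];
    this.count = 0;
  }
  async acquire() {
    const entry = this.idle.pop();
    if (entry) {
      clearTimeout(entry.timer); // reuse instead of relaunching
      return entry.browser;
    }
    if (this.count >= this.max) throw new Error('pool exhausted');
    this.count++;
    return this.launch();
  }
  release(browser) {
    const timer = setTimeout(() => {
      // idle too long: evict and close
      this.idle = this.idle.filter((e) => e.browser !== browser);
      this.count--;
      browser.close();
    }, this.idleMs);
    timer.unref?.(); // don't keep the process alive for the idle timer
    this.idle.push({ browser, timer });
  }
}
```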
Performance
| Site Type | node-scraper | Firecrawl | Speedup |
|-------------|--------------|-----------|---------|
| Wikipedia | 154ms | 4,662ms | 30.3x |
| Hacker News | 1,715ms | 4,644ms | 2.7x |
| GitHub | 9,167ms | 9,790ms | 1.1x |
Average: 11.35x faster with 100% reliability.
API Reference
Convenience functions
| Function | Description |
|----------|-------------|
| smartScrape(url, opts?) | Scrape with intelligent fallback |
| smartScreenshot(url, opts?) | Full page screenshot |
| quickShot(url, opts?) | Fast screenshot capture |
| bulkScrape(urls, opts?) | Batch scrape multiple URLs |
| bulkScrapeStream(urls, opts?) | Stream results as they complete |
| askWebsiteAI(url, question, opts?) | AI Q&A about a webpage |
BNCASmartScraper options
```js
{
  timeout: 10000,           // Request timeout (ms)
  retries: 2,               // Retries per method
  verbose: false,           // Detailed logging
  lightpandaPath: './bin/lightpanda',
  lightpandaFormat: 'html', // 'html' or 'markdown'
  userAgent: 'Mozilla/5.0 ...',
  openRouterApiKey: '...',
  openAIApiKey: '...',
  openAIBaseUrl: 'https://api.openai.com',
}
```

Methods

- scraper.scrape(url, opts?) — Scrape with fallback
- scraper.screenshot(url, opts?) — Take screenshot
- scraper.quickshot(url, opts?) — Fast screenshot
- scraper.askAI(url, question, opts?) — AI Q&A
- scraper.getStats() — Performance statistics
- scraper.healthCheck() — Check method availability
- scraper.cleanup() — Close browser instances
Response shape
```js
{
  success: true,
  content: '...',          // Extracted content (JSON string)
  method: 'direct-fetch',  // Method used
  url: 'https://...',
  performance: { totalTime: 154 },
  stats: { ... }
}
```

TypeScript

Full type definitions are included (index.d.ts).

```ts
import { BNCASmartScraper, ScrapingResult } from '@monostate/node-scraper';
```

Notes
- Server-side only — requires filesystem access and browser automation.
- Node.js 20+ required.
- LightPanda binary is auto-installed. Puppeteer is optional.
- No external API calls for scraping — all processing is local.
Changelog
See CHANGELOG.md for the full release history.
License
MIT
