# @flyrank/flyscrape (v1.0.16)

A powerful, modular web scraping and crawling library for Node.js, inspired by Crawl4AI. Features stealth mode, LLM extraction, and markdown processing.
## The Ultimate Node.js Web Scraping & Crawling Engine
FlyScrape is a Node.js package, inspired by Crawl4AI, that makes it easy to integrate powerful scrapers and crawlers directly into your web applications. Designed for the modern web, it provides modular, production-ready tools to extract clean, structured data ready for RAG pipelines, AI agents, or advanced analytics.
Whether you’re building a content aggregator, an AI agent, or a complex data pipeline, FlyScrape simplifies web crawling and scraping while giving you maximum flexibility and performance.
- LLM-Ready Output: Generates smart Markdown with headings, tables, code blocks, and citation hints optimized for RAG.
- Production Grade: Built for reliability with retry strategies, caching, and robust error handling.
- Full Control: Customize every aspect of the crawl with hooks, custom transformers, and flexible configurations.
- Anti-Blocking: Integrated stealth techniques to bypass WAFs and bot detection systems.
- Developer Experience: Fully typed in TypeScript with a modular architecture for easy extensibility.
## 🚀 Quick Start

### 1. Installation

```shell
npm install @flyrank/flyscrape
# or
yarn add @flyrank/flyscrape
# or
pnpm add @flyrank/flyscrape
```

### 2. Basic Crawl
```javascript
import { AsyncWebCrawler } from "@flyrank/flyscrape";

async function main() {
  const crawler = new AsyncWebCrawler();
  await crawler.start();

  // Crawl a URL and get clean Markdown
  const result = await crawler.arun("https://example.com");
  if (result.success) {
    console.log(result.markdown);
  }

  await crawler.close();
}

main();
```

### 3. Content-Only Mode (Smart Cleaning)
Extract only the main article content, removing all UI clutter.
```javascript
const result = await crawler.arun("https://blog.example.com/guide", {
  contentOnly: true,
  excludeMedia: true, // Remove images/videos
});
```

## ✨ Features
- 🧹 Clean Markdown: Generates clean, structured Markdown with accurate formatting.
- 🎯 Fit Markdown: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- 🔗 Citations and References: Converts page links into a numbered reference list with clean citations.
- 🛠️ Custom Strategies: Users can create their own Markdown generation strategies tailored to specific needs.
- 📚 BM25 Algorithm: Employs BM25-based filtering for extracting core information and removing irrelevant content.
- 🖼️ Media Support: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- 🚀 Dynamic Crawling: Execute JavaScript and wait (async or sync) for dynamic content extraction.
- 📸 Screenshots: Capture page screenshots during crawling for debugging or analysis.
- 📂 Raw Data Crawling: Directly process raw HTML (`raw:`) or local files (`file://`).
- 🔗 Comprehensive Link Extraction: Extracts internal and external links, plus embedded iframe content.
- 🛠️ Customizable Hooks: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs).
- 💾 Caching: Cache data for improved speed and to avoid redundant fetches.
- 📄 Metadata Extraction: Retrieve structured metadata (OpenGraph, Twitter Cards) from web pages.
- 📡 IFrame Content Extraction: Seamless extraction from embedded iframe content.
- 🕵️ Lazy Load Handling: Waits for images to fully load, ensuring no content is missed due to lazy loading.
- 🔄 Full-Page Scanning: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
- 🗺️ Sitemap Crawling: Fetch and crawl from sitemaps (and sitemap indexes) with optional category breakdown (e.g. products, pages, blogs).
- 🧠 AI-Powered Extraction: Seamlessly integrate with OpenAI and other LLMs to extract structured JSON data.
- 🧹 Smart Content Cleaning: Automatically strips navigation, ads, footers, and boilerplate.
- 📝 LLM-Ready Markdown: Converts HTML to clean, semantic Markdown, optimized for RAG (Retrieval-Augmented Generation) pipelines.
- 👻 Stealth Mode: Integrated evasion techniques (user-agent rotation, fingerprinting protection) to bypass WAFs.
- ⚡ Hybrid Caching: Memory and disk-based caching to speed up redundant crawls.
- 🚫 Resource Blocking: Block unnecessary assets (images, CSS, fonts) for faster loading.
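Of the features above, the BM25 filtering is the most algorithmic. As a rough illustration of how such relevance filtering works in general (a generic sketch, not FlyScrape's actual implementation; `tokenize` and `bm25Scores` are our own names):

```typescript
// Minimal Okapi BM25 scorer: ranks text chunks against a query so that
// low-scoring (irrelevant) chunks can be dropped. Illustrative only.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function bm25Scores(query: string, docs: string[], k1 = 1.5, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const avgLen = tokenized.reduce((s, d) => s + d.length, 0) / docs.length;
  const terms = tokenize(query);

  // Document frequency per query term
  const df = new Map<string, number>();
  for (const t of terms) {
    df.set(t, tokenized.filter((d) => d.includes(t)).length);
  }

  return tokenized.map((doc) => {
    let score = 0;
    for (const t of terms) {
      const tf = doc.filter((w) => w === t).length;
      if (tf === 0) continue;
      const n = df.get(t)!;
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (doc.length / avgLen)));
    }
    return score;
  });
}
```

Chunks whose score against a query (or the page title) falls below a threshold can then be dropped before Markdown generation.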
## 🧩 API Service (n8n)
FlyScrape includes a provider-agnostic API service that registers providers from environment variables and exposes REST endpoints for n8n and other workflow tools.
### Environment Variables
- `API_PROVIDER_<NAME>_ENDPOINT`
- `API_PROVIDER_<NAME>_AUTH_TYPE` (`api_key`, `oauth`, `basic`, `none`)
- `API_PROVIDER_<NAME>_API_KEY`
- `API_PROVIDER_<NAME>_API_KEY_HEADER`
- `API_PROVIDER_<NAME>_API_KEY_PREFIX`
- `API_PROVIDER_<NAME>_OAUTH_TOKEN`
- `API_PROVIDER_<NAME>_OAUTH_HEADER`
- `API_PROVIDER_<NAME>_USERNAME`
- `API_PROVIDER_<NAME>_PASSWORD`
- `API_PROVIDER_<NAME>_RATE_LIMIT`
- `API_PROVIDER_<NAME>_RATE_WINDOW_MS`
- `API_PROVIDER_<NAME>_LOG_LEVEL`
- `API_PROVIDER_<NAME>_HEALTH_ENDPOINT`
- `API_PROVIDER_<NAME>_TIMEOUT_MS`
- `API_SERVICE_PORT`
- `API_SERVICE_BASE_PATH`
- `API_SERVICE_LOG_LEVEL`
- `API_SERVICE_MAX_BODY_BYTES`
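To give a feel for how `<NAME>`-scoped variables map to provider configs, here is a hedged sketch of the discovery step (field names, the `loadProviders` helper, and the parsing rules are illustrative, not the service's actual code; it also assumes single-token provider names like `OPENAI`):

```typescript
// Scan an env map for API_PROVIDER_<NAME>_ENDPOINT keys and build one
// config object per provider. Illustrative sketch only.
type ProviderConfig = {
  name: string;
  endpoint: string;
  authType: string;
  rateLimit?: number;
};

function loadProviders(env: Record<string, string | undefined>): ProviderConfig[] {
  const providers: ProviderConfig[] = [];
  const re = /^API_PROVIDER_([A-Z0-9]+)_ENDPOINT$/;
  for (const key of Object.keys(env)) {
    const m = key.match(re);
    if (!m || !env[key]) continue;
    const prefix = `API_PROVIDER_${m[1]}_`;
    const rate = env[prefix + "RATE_LIMIT"];
    providers.push({
      name: m[1].toLowerCase(),
      endpoint: env[key]!,
      // Default to "none" when no auth type is configured
      authType: env[prefix + "AUTH_TYPE"] ?? "none",
      rateLimit: rate ? Number(rate) : undefined,
    });
  }
  return providers;
}
```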
### REST Endpoints
- `GET /health`
- `GET /v1/providers`
- `GET /v1/providers/:name`
- `GET /v1/providers/:name/health`
- `POST /v1/providers/:name/request`
### Response Format
```json
{
  "success": true,
  "requestId": "uuid",
  "data": {}
}
```

```json
{
  "success": false,
  "requestId": "uuid",
  "error": {
    "code": "ERROR_CODE",
    "message": "Human readable message",
    "details": {}
  }
}
```

### Example .env
```shell
API_PROVIDER_OPENAI_ENDPOINT=https://api.openai.com/v1
API_PROVIDER_OPENAI_AUTH_TYPE=api_key
API_PROVIDER_OPENAI_API_KEY=sk-...
API_PROVIDER_OPENAI_API_KEY_HEADER=Authorization
API_PROVIDER_OPENAI_API_KEY_PREFIX=Bearer
API_PROVIDER_OPENAI_RATE_LIMIT=120
API_PROVIDER_OPENAI_RATE_WINDOW_MS=60000
API_PROVIDER_OPENAI_LOG_LEVEL=info
API_PROVIDER_OPENAI_HEALTH_ENDPOINT=https://api.openai.com/v1/models
API_SERVICE_PORT=3000
API_SERVICE_BASE_PATH=/v1
API_SERVICE_LOG_LEVEL=info
```

### Example Request
```shell
curl -X POST http://localhost:3000/v1/providers/openai/request \
  -H "Content-Type: application/json" \
  -d '{
    "method": "POST",
    "path": "/chat/completions",
    "body": {
      "model": "gpt-4o-mini",
      "messages": [{ "role": "user", "content": "Hello" }]
    }
  }'
```

### Run the Service
```shell
bun run api-service
```

### Docker
```shell
docker build -t flyscrape-api .
docker run --env-file .env -p 3000:3000 flyscrape-api
```

## 🔬 Advanced Usage Examples
Keep your session alive across multiple requests to look like a real user and avoid being blocked.
```javascript
const sessionId = "my-session-1";

// First request: creates the session, saves cookies/local storage
await crawler.arun("https://example.com/login", {
  session_id: sessionId,
});

// Second request: reuses the same session (cookies are preserved!)
await crawler.arun("https://example.com/dashboard", {
  session_id: sessionId,
});

// Clean up when done
await crawler.closeSession(sessionId);
```

Use impit under the hood to mimic real browser TLS fingerprints without the overhead of a full browser.
```javascript
// Fast mode (no browser, but stealthy TLS fingerprint)
const result = await crawler.arun("https://example.com", {
  jsExecution: false, // Disables Playwright, enables impit
});
```

Enable advanced anti-detection features to bypass WAFs and bot detection systems.
```javascript
const crawler = new AsyncWebCrawler({
  stealth: true, // Enable stealth mode
  headless: true,
});
await crawler.start();
```

Need full control? Provide a `customTransformer` to define exactly how HTML maps to Markdown.
```javascript
const result = await crawler.arun("https://example.com", {
  processing: {
    markdown: {
      customTransformer: (html) => {
        // Your custom logic here
        return myCustomConverter(html);
      },
    },
  },
});
```

Handle modern SPAs with ease using built-in scrolling and wait strategies.
```javascript
const result = await crawler.arun("https://infinite-scroll.com", {
  autoScroll: true, // Automatically scroll to the bottom
  waitMode: "networkidle", // Wait for the network to settle
});
```

Inject custom logic at key stages of the crawling process.
```javascript
const result = await crawler.arun("https://example.com", {
  hooks: {
    onPageCreated: async (page) => {
      // Set cookies or modify the environment
      await page.context().addCookies([...]);
    },
    onLoad: async (page) => {
      // Interact with the page
      await page.click("#accept-cookies");
    },
  },
});
```

Process raw HTML or local files directly without a web server.
```javascript
// Raw HTML
await crawler.arun("raw:<html><body><h1>Hello</h1></body></html>");

// Local file
await crawler.arun("file:///path/to/local/file.html");
```

Define a schema and let the LLM do the work.
```javascript
const schema = {
  type: "object",
  properties: {
    title: { type: "string" },
    price: { type: "number" },
    features: { type: "array", items: { type: "string" } },
  },
};

const result = await crawler.arun("https://store.example.com/product/123", {
  extraction: {
    type: "llm",
    schema: schema,
    provider: myOpenAIProvider, // Your LLM provider instance
  },
});
```

Crawl all pages listed in a sitemap (or sitemap index) in one call. Sitemaps are fetched over HTTP with timeouts and redirect limits; optional category counts (e.g. products, pages, blogs, collections) are supported.
```javascript
import {
  AsyncWebCrawler,
  fetchSitemapUrls,
  getSitemapIndexCategories,
} from "@flyrank/flyscrape";

// Option A: Crawl all pages from a sitemap
const crawler = new AsyncWebCrawler();
await crawler.start();
const results = await crawler.crawlFromSitemap(
  "https://www.flyrank.com/sitemap.xml",
  { jsExecution: false }, // fast fetch-only mode
  { maxUrls: 1000, timeout: 10_000 }
);
await crawler.close();

// Option B: Get only the list of URLs from the sitemap
const urls = await fetchSitemapUrls("https://www.flyrank.com/sitemap.xml", {
  sameOriginOnly: true,
  maxUrls: 500,
});

// Option C: Get categorized counts (e.g. products (6), pages (6), blogs (12))
const { categories, totalSitemaps } = await getSitemapIndexCategories(
  "https://www.flyrank.com/sitemap.xml"
);
for (const [name, info] of Object.entries(categories)) {
  console.log(`${name} (${info.count})`);
}

// Option D: Crawl and get the category breakdown in one call
const out = await crawler.crawlFromSitemap(
  "https://www.flyrank.com/sitemap.xml",
  { jsExecution: false },
  { includeSitemapCategories: true }
);
if (!Array.isArray(out)) {
  console.log("Categories:", out.sitemapCategories.categories);
  // out.results = crawl results
}
```

## 🤝 Contributing
We welcome contributions! Please see our Contribution Guidelines for details on how to get started.
## 📄 License
This project is licensed under the MIT License.
