# @flyrank/flyscrape (v1.0.16)

A powerful, modular web scraping and crawling library for Node.js, inspired by Crawl4AI. Features stealth mode, LLM extraction, and markdown processing.
## The Ultimate Node.js Web Scraping & Crawling Engine
FlyScrape is a Node.js package, inspired by Crawl4AI, that makes it easy to integrate powerful scrapers and crawlers directly into your web applications. Designed for the modern web, it provides modular, production-ready tools to extract clean, structured data ready for RAG pipelines, AI agents, or advanced analytics.
Whether you’re building a content aggregator, an AI agent, or a complex data pipeline, FlyScrape simplifies web crawling and scraping while giving you maximum flexibility and performance.
- LLM-Ready Output: Generates smart Markdown with headings, tables, code blocks, and citation hints optimized for RAG.
- Production Grade: Built for reliability with retry strategies, caching, and robust error handling.
- Full Control: Customize every aspect of the crawl with hooks, custom transformers, and flexible configurations.
- Anti-Blocking: Integrated stealth techniques to bypass WAFs and bot detection systems.
- Developer Experience: Fully typed in TypeScript with a modular architecture for easy extensibility.
## 🚀 Quick Start

### 1. Installation

```shell
npm install @flyrank/flyscrape
# or
yarn add @flyrank/flyscrape
# or
pnpm add @flyrank/flyscrape
```

### 2. Basic Crawl
```javascript
import { AsyncWebCrawler } from "@flyrank/flyscrape";

async function main() {
  const crawler = new AsyncWebCrawler();
  await crawler.start();

  // Crawl a URL and get clean Markdown
  const result = await crawler.arun("https://example.com");
  if (result.success) {
    console.log(result.markdown);
  }

  await crawler.close();
}

main();
```

### 3. Content-Only Mode (Smart Cleaning)
Extract only the main article content, removing all UI clutter.
```javascript
const result = await crawler.arun("https://blog.example.com/guide", {
  contentOnly: true,
  excludeMedia: true, // Remove images/videos
});
```

## ✨ Features
- 🧹 Clean Markdown: Generates clean, structured Markdown with accurate formatting.
- 🎯 Fit Markdown: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- 🔗 Citations and References: Converts page links into a numbered reference list with clean citations.
- 🛠️ Custom Strategies: Users can create their own Markdown generation strategies tailored to specific needs.
- 📚 BM25 Algorithm: Employs BM25-based filtering for extracting core information and removing irrelevant content.
- 🖼️ Media Support: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- 🚀 Dynamic Crawling: Execute JavaScript and wait (async or sync) for dynamic content extraction.
- 📸 Screenshots: Capture page screenshots during crawling for debugging or analysis.
- 📂 Raw Data Crawling: Directly process raw HTML (`raw:`) or local files (`file://`).
- 🔗 Comprehensive Link Extraction: Extracts internal and external links, plus embedded iframe content.
- 🛠️ Customizable Hooks: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs).
- 💾 Caching: Cache data for improved speed and to avoid redundant fetches.
- 📄 Metadata Extraction: Retrieve structured metadata (OpenGraph, Twitter Cards) from web pages.
- 📡 IFrame Content Extraction: Seamless extraction from embedded iframe content.
- 🕵️ Lazy Load Handling: Waits for images to fully load, ensuring no content is missed due to lazy loading.
- 🔄 Full-Page Scanning: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
- 🗺️ Sitemap Crawling: Fetch and crawl from sitemaps (and sitemap indexes) with optional category breakdown (e.g. products, pages, blogs).
- 🧠 AI-Powered Extraction: Seamlessly integrate with OpenAI and other LLMs to extract structured JSON data.
- 🧹 Smart Content Cleaning: Automatically strips navigation, ads, footers, and boilerplate.
- 📝 LLM-Ready Markdown: Converts HTML to clean, semantic Markdown, optimized for RAG (Retrieval-Augmented Generation) pipelines.
- 👻 Stealth Mode: Integrated evasion techniques (user-agent rotation, fingerprinting protection) to bypass WAFs.
- ⚡ Hybrid Caching: Memory and disk-based caching to speed up redundant crawls.
- 🚫 Resource Blocking: Block unnecessary assets (images, CSS, fonts) for faster loading.
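Of the features above, the BM25 filtering is the most algorithmic. As a rough illustration of how such relevance filtering works in general (a generic sketch, not FlyScrape's actual implementation; `tokenize` and `bm25Scores` are our own names):

```typescript
// Minimal Okapi BM25 scorer: ranks text chunks against a query so that
// low-scoring (irrelevant) chunks can be dropped. Illustrative only.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function bm25Scores(query: string, docs: string[], k1 = 1.5, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const avgLen = tokenized.reduce((s, d) => s + d.length, 0) / docs.length;
  const terms = tokenize(query);

  // Document frequency per query term
  const df = new Map<string, number>();
  for (const t of terms) {
    df.set(t, tokenized.filter((d) => d.includes(t)).length);
  }

  return tokenized.map((doc) => {
    let score = 0;
    for (const t of terms) {
      const tf = doc.filter((w) => w === t).length;
      if (tf === 0) continue;
      const n = df.get(t)!;
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (doc.length / avgLen)));
    }
    return score;
  });
}
```

Chunks whose score against a query (or the page title) falls below a threshold can then be dropped before Markdown generation.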
## 🧩 API Service (n8n)
FlyScrape includes a provider-agnostic API service that registers providers from environment variables and exposes REST endpoints for n8n and other workflow tools.
### Environment Variables
- `API_PROVIDER_<NAME>_ENDPOINT`
- `API_PROVIDER_<NAME>_AUTH_TYPE` (`api_key`, `oauth`, `basic`, `none`)
- `API_PROVIDER_<NAME>_API_KEY`
- `API_PROVIDER_<NAME>_API_KEY_HEADER`
- `API_PROVIDER_<NAME>_API_KEY_PREFIX`
- `API_PROVIDER_<NAME>_OAUTH_TOKEN`
- `API_PROVIDER_<NAME>_OAUTH_HEADER`
- `API_PROVIDER_<NAME>_USERNAME`
- `API_PROVIDER_<NAME>_PASSWORD`
- `API_PROVIDER_<NAME>_RATE_LIMIT`
- `API_PROVIDER_<NAME>_RATE_WINDOW_MS`
- `API_PROVIDER_<NAME>_LOG_LEVEL`
- `API_PROVIDER_<NAME>_HEALTH_ENDPOINT`
- `API_PROVIDER_<NAME>_TIMEOUT_MS`
- `API_SERVICE_PORT`
- `API_SERVICE_BASE_PATH`
- `API_SERVICE_LOG_LEVEL`
- `API_SERVICE_MAX_BODY_BYTES`
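To give a feel for how `<NAME>`-scoped variables map to provider configs, here is a hedged sketch of the discovery step (field names, the `loadProviders` helper, and the parsing rules are illustrative, not the service's actual code; it also assumes single-token provider names like `OPENAI`):

```typescript
// Scan an env map for API_PROVIDER_<NAME>_ENDPOINT keys and build one
// config object per provider. Illustrative sketch only.
type ProviderConfig = {
  name: string;
  endpoint: string;
  authType: string;
  rateLimit?: number;
};

function loadProviders(env: Record<string, string | undefined>): ProviderConfig[] {
  const providers: ProviderConfig[] = [];
  const re = /^API_PROVIDER_([A-Z0-9]+)_ENDPOINT$/;
  for (const key of Object.keys(env)) {
    const m = key.match(re);
    if (!m || !env[key]) continue;
    const prefix = `API_PROVIDER_${m[1]}_`;
    const rate = env[prefix + "RATE_LIMIT"];
    providers.push({
      name: m[1].toLowerCase(),
      endpoint: env[key]!,
      // Default to "none" when no auth type is configured
      authType: env[prefix + "AUTH_TYPE"] ?? "none",
      rateLimit: rate ? Number(rate) : undefined,
    });
  }
  return providers;
}
```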
### REST Endpoints
- `GET /health`
- `GET /v1/providers`
- `GET /v1/providers/:name`
- `GET /v1/providers/:name/health`
- `POST /v1/providers/:name/request`
### Response Format
```json
{
  "success": true,
  "requestId": "uuid",
  "data": {}
}
```

```json
{
  "success": false,
  "requestId": "uuid",
  "error": {
    "code": "ERROR_CODE",
    "message": "Human readable message",
    "details": {}
  }
}
```

### Example .env
```shell
API_PROVIDER_OPENAI_ENDPOINT=https://api.openai.com/v1
API_PROVIDER_OPENAI_AUTH_TYPE=api_key
API_PROVIDER_OPENAI_API_KEY=sk-...
API_PROVIDER_OPENAI_API_KEY_HEADER=Authorization
API_PROVIDER_OPENAI_API_KEY_PREFIX=Bearer
API_PROVIDER_OPENAI_RATE_LIMIT=120
API_PROVIDER_OPENAI_RATE_WINDOW_MS=60000
API_PROVIDER_OPENAI_LOG_LEVEL=info
API_PROVIDER_OPENAI_HEALTH_ENDPOINT=https://api.openai.com/v1/models
API_SERVICE_PORT=3000
API_SERVICE_BASE_PATH=/v1
API_SERVICE_LOG_LEVEL=info
```

### Example Request
```shell
curl -X POST http://localhost:3000/v1/providers/openai/request \
  -H "Content-Type: application/json" \
  -d '{
    "method": "POST",
    "path": "/chat/completions",
    "body": {
      "model": "gpt-4o-mini",
      "messages": [{ "role": "user", "content": "Hello" }]
    }
  }'
```

### Run the Service
```shell
bun run api-service
```

### Docker
```shell
docker build -t flyscrape-api .
docker run --env-file .env -p 3000:3000 flyscrape-api
```

## 🔬 Advanced Usage Examples
Keep your session alive across multiple requests to look like a real user and avoid being blocked.
```javascript
const sessionId = "my-session-1";

// First request: creates the session, saves cookies/local storage
await crawler.arun("https://example.com/login", {
  session_id: sessionId,
});

// Second request: reuses the same session (cookies are preserved!)
await crawler.arun("https://example.com/dashboard", {
  session_id: sessionId,
});

// Clean up when done
await crawler.closeSession(sessionId);
```

Use impit under the hood to mimic real browser TLS fingerprints without the overhead of a full browser.
```javascript
// Fast mode (no browser, but stealthy TLS fingerprint)
const result = await crawler.arun("https://example.com", {
  jsExecution: false, // Disables Playwright, enables impit
});
```

Enable advanced anti-detection features to bypass WAFs and bot detection systems.
```javascript
const crawler = new AsyncWebCrawler({
  stealth: true, // Enable stealth mode
  headless: true,
});
await crawler.start();
```

Need full control? Provide a `customTransformer` to define exactly how HTML maps to Markdown.
```javascript
const result = await crawler.arun("https://example.com", {
  processing: {
    markdown: {
      customTransformer: (html) => {
        // Your custom logic here
        return myCustomConverter(html);
      },
    },
  },
});
```

Handle modern SPAs with ease using built-in scrolling and wait strategies.
```javascript
const result = await crawler.arun("https://infinite-scroll.com", {
  autoScroll: true, // Automatically scroll to the bottom
  waitMode: "networkidle", // Wait for the network to settle
});
```

Inject custom logic at key stages of the crawling process.
```javascript
const result = await crawler.arun("https://example.com", {
  hooks: {
    onPageCreated: async (page) => {
      // Set cookies or modify the environment
      await page.context().addCookies([...]);
    },
    onLoad: async (page) => {
      // Interact with the page
      await page.click("#accept-cookies");
    },
  },
});
```

Process raw HTML or local files directly without a web server.
```javascript
// Raw HTML
await crawler.arun("raw:<html><body><h1>Hello</h1></body></html>");

// Local file
await crawler.arun("file:///path/to/local/file.html");
```

Define a schema and let the LLM do the work.
```javascript
const schema = {
  type: "object",
  properties: {
    title: { type: "string" },
    price: { type: "number" },
    features: { type: "array", items: { type: "string" } },
  },
};

const result = await crawler.arun("https://store.example.com/product/123", {
  extraction: {
    type: "llm",
    schema: schema,
    provider: myOpenAIProvider, // Your LLM provider instance
  },
});
```

Crawl all pages listed in a sitemap (or sitemap index) in one call. Sitemaps are fetched over HTTP with timeouts and redirect limits; optional category counts (e.g. products, pages, blogs, collections) are supported.
```javascript
import {
  AsyncWebCrawler,
  fetchSitemapUrls,
  getSitemapIndexCategories,
} from "@flyrank/flyscrape";

// Option A: Crawl all pages from a sitemap
const crawler = new AsyncWebCrawler();
await crawler.start();
const results = await crawler.crawlFromSitemap(
  "https://www.flyrank.com/sitemap.xml",
  { jsExecution: false }, // fast fetch-only mode
  { maxUrls: 1000, timeout: 10_000 }
);
await crawler.close();

// Option B: Get only the list of URLs from the sitemap
const urls = await fetchSitemapUrls("https://www.flyrank.com/sitemap.xml", {
  sameOriginOnly: true,
  maxUrls: 500,
});

// Option C: Get categorized counts (e.g. products (6), pages (6), blogs (12))
const { categories, totalSitemaps } = await getSitemapIndexCategories(
  "https://www.flyrank.com/sitemap.xml"
);
for (const [name, info] of Object.entries(categories)) {
  console.log(`${name} (${info.count})`);
}

// Option D: Crawl and get the category breakdown in one call
const out = await crawler.crawlFromSitemap(
  "https://www.flyrank.com/sitemap.xml",
  { jsExecution: false },
  { includeSitemapCategories: true }
);
if (!Array.isArray(out)) {
  console.log("Categories:", out.sitemapCategories.categories);
  // out.results = crawl results
}
```

## 🤝 Contributing
We welcome contributions! Please see our Contribution Guidelines for details on how to get started.
## 📄 License
This project is licensed under the MIT License.
