crawl4ai

v1.0.1

TypeScript SDK for Crawl4AI REST API - Bun & Node.js compatible

Crawl4AI TypeScript SDK

A type-safe TypeScript SDK for the Crawl4AI REST API. Built for modern JavaScript/TypeScript environments with full Bun and Node.js compatibility.

🚀 Features

  • Full TypeScript Support - Complete type definitions for all API endpoints and responses
  • Bun & Node.js Compatible - Works seamlessly in both runtimes
  • Modern Async/Await - Promise-based API for all operations
  • Comprehensive Coverage - All Crawl4AI endpoints including specialized features
  • Smart Error Handling - Custom error classes with retry logic and timeouts
  • Batch Processing - Efficiently crawl multiple URLs in a single request
  • Input Validation - Built-in URL validation and parameter checking
  • Debug Mode - Optional request/response logging for development
  • Zero Dependencies - Uses only native fetch API

📦 Installation

Using Bun (Recommended)

bun add crawl4ai

Using npm/yarn

npm install crawl4ai
# or
yarn add crawl4ai

📚 About Crawl4AI

⚠️ Unofficial Package: This is an unofficial TypeScript SDK for the Crawl4AI REST API, created for personal use to provide a type-safe way to interact with it.

🏗️ Prerequisites

  1. Crawl4AI Server Running

    You can use the hosted version or run your own:

    # Using Docker
    docker run -p 11235:11235 unclecode/crawl4ai:latest
       
    # With LLM support
    docker run -p 11235:11235 \
      -e OPENAI_API_KEY=your_openai_key \
      -e ANTHROPIC_API_KEY=your_anthropic_key \
      unclecode/crawl4ai:latest
  2. TypeScript (optional, for TypeScript projects)

    bun add -d typescript
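
With the server from step 1 running, you can sanity-check connectivity before your first crawl. A minimal sketch using the client's health() method (covered under Utility Methods below), assuming the default Docker port mapping shown above:

import Crawl4AI from 'crawl4ai';

const client = new Crawl4AI({ baseUrl: 'http://localhost:11235' });

// Should resolve with the server's health status if the container is up
console.log(await client.health());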

🚀 Quick Start

Basic Usage

import Crawl4AI from 'crawl4ai';

// Initialize the client
const client = new Crawl4AI({
  baseUrl: 'http://localhost:11235', // your Crawl4AI server
  apiToken: 'your_token_here', // Optional
  timeout: 30000,
  debug: true // Enable request/response logging
});

// Perform a basic crawl
const results = await client.crawl({
  urls: 'https://example.com',
  browser_config: {
    headless: true,
    viewport: { width: 1920, height: 1080 }
  },
  crawler_config: {
    cache_mode: 'bypass',
    word_count_threshold: 10
  }
});

const result = results[0]; // API returns array of results
console.log('Title:', result.metadata?.title);
console.log('Content:', result.markdown?.slice(0, 200));

Configuration Options

const client = new Crawl4AI({
  baseUrl: 'https://your-crawl4ai-server.com',
  apiToken: 'optional_api_token',
  timeout: 60000,          // Request timeout in ms
  retries: 3,              // Number of retry attempts
  retryDelay: 1000,        // Delay between retries in ms
  throwOnError: true,      // Throw on HTTP errors
  debug: false,            // Enable debug logging
  defaultHeaders: {        // Additional headers
    'User-Agent': 'MyApp/1.0'
  }
});

📖 API Reference

Core Methods

crawl(request) - Main Crawl Endpoint

Crawl one or more URLs with full configuration options:

const results = await client.crawl({
  urls: ['https://example.com', 'https://example.org'],
  browser_config: {
    headless: true,
    simulate_user: true,
    magic: true // Anti-detection features
  },
  crawler_config: {
    cache_mode: 'bypass',
    extraction_strategy: {
      type: 'json_css',
      params: { /* CSS extraction config */ }
    }
  }
});

Content Generation

markdown(request) - Get Markdown

Extract markdown with various filters:

const markdown = await client.markdown({
  url: 'https://example.com',
  filter: 'fit',  // 'raw' | 'fit' | 'bm25' | 'llm'
  query: 'search query for bm25/llm filters'
});

html(request) - Get Processed HTML

Get sanitized HTML for schema extraction:

const html = await client.html({
  url: 'https://example.com'
});

screenshot(request) - Capture Screenshot

Capture full-page screenshots:

const screenshotBase64 = await client.screenshot({
  url: 'https://example.com',
  screenshot_wait_for: 2,  // Wait 2 seconds before capture
  output_path: '/path/to/save.png'  // Optional: save to file
});
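
If you'd rather handle the file yourself instead of passing output_path, you can write the returned value out directly. A minimal Node/Bun sketch, assuming the result is a base64-encoded PNG (as the variable name above suggests):

import { writeFileSync } from 'node:fs';

// Decode the base64 payload and write it to disk as a PNG
writeFileSync('screenshot.png', Buffer.from(screenshotBase64, 'base64'));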

pdf(request) - Generate PDF

Generate PDF documents:

const pdfData = await client.pdf({
  url: 'https://example.com',
  output_path: '/path/to/save.pdf'  // Optional: save to file
});

JavaScript Execution

executeJs(request) - Run JavaScript

Execute JavaScript on the page and get full crawl results:

const result = await client.executeJs({
  url: 'https://example.com',
  scripts: [
    'return document.title;',
    'return document.querySelectorAll("a").length;',
    'window.scrollTo(0, document.body.scrollHeight);'
  ]
});

console.log('JS Results:', result.js_execution_result);

AI/LLM Features

ask(params) - Get Library Context

Get Crawl4AI documentation for AI assistants:

const answer = await client.ask({
  query: 'extraction strategies',
  context_type: 'doc',  // 'code' | 'doc' | 'all'
  max_results: 10
});

llm(url, query) - LLM Endpoint

Process URLs with LLM:

const response = await client.llm(
  'https://example.com',
  'What is the main purpose of this website?'
);

Utility Methods

// Test connection
const isConnected = await client.testConnection();
// Pass { throwOnError: true } to surface error details:
// await client.testConnection({ throwOnError: true });

// Get health status
const health = await client.health();

// Get API version (also accepts { throwOnError: true })
const version = await client.version();

// Get Prometheus metrics
const metrics = await client.metrics();

// Update configuration
client.setApiToken('new_token');
client.setBaseUrl('https://new-url.com');
client.setDebug(true);

🎯 Data Extraction Strategies

CSS Selector Extraction

Extract structured data using CSS selectors:

const results = await client.crawl({
  urls: 'https://news.ycombinator.com',
  crawler_config: {
    extraction_strategy: {
      type: 'json_css',
      params: {
        schema: {
          baseSelector: '.athing',
          fields: [
            {
              name: 'title',
              selector: '.titleline > a',
              type: 'text'
            },
            {
              name: 'url', 
              selector: '.titleline > a',
              type: 'href'
            },
            {
              name: 'score',
              selector: '+ tr .score',
              type: 'text'
            }
          ]
        }
      }
    }
  }
});

const posts = JSON.parse(results[0].extracted_content || '[]');
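
Since extracted_content is a JSON string, you can also give the parsed value a shape. A sketch with a hypothetical HNPost interface matching the schema above:

interface HNPost {
  title: string;
  url: string;
  score: string;
}

const typedPosts = JSON.parse(results[0].extracted_content || '[]') as HNPost[];
console.log(typedPosts[0]?.title);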

LLM-Based Extraction

Use AI models for intelligent data extraction:

const results = await client.crawl({
  urls: 'https://www.bbc.com/news',
  crawler_config: {
    extraction_strategy: {
      type: 'llm',
      params: {
        provider: 'openai/gpt-4o-mini',
        api_token: process.env.OPENAI_API_KEY,
        schema: {
          type: 'object',
          properties: {
            headline: { type: 'string' },
            summary: { type: 'string' },
            author: { type: 'string' },
            tags: {
              type: 'array',
              items: { type: 'string' }
            }
          }
        },
        extraction_type: 'schema',
        instruction: 'Extract news articles with their key information'
      }
    }
  }
});
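
As with CSS extraction, the structured output comes back as a JSON string in extracted_content, so it is parsed the same way:

const articles = JSON.parse(results[0].extracted_content || '[]');
console.log(articles);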

Cosine Similarity Extraction

Filter content based on semantic similarity:

const results = await client.crawl({
  urls: 'https://example.com/blog',
  crawler_config: {
    extraction_strategy: {
      type: 'cosine',
      params: {
        semantic_filter: 'artificial intelligence machine learning',
        word_count_threshold: 50,
        max_dist: 0.3,
        top_k: 5
      }
    }
  }
});

🛠️ Error Handling

The SDK provides custom error handling with detailed information:

import { Crawl4AIError } from 'crawl4ai';

try {
  const results = await client.crawl({ urls: 'https://example.com' });
} catch (error) {
  if (error instanceof Crawl4AIError) {
    console.error('API Error:', error.message);
    console.error('Status:', error.status);
    console.error('Details:', error.data);
  } else {
    console.error('Unexpected error:', error);
  }
}

🧪 Testing

Run the test suite:

# Run all tests
bun test

# Run specific test file
bun test src/sdk.test.ts

📚 Examples

Run the included examples:

# Basic usage
bun run example:basic

# Advanced features
bun run example:advanced

# LLM extraction
bun run example:llm

🔒 Security & Best Practices

Authentication

Always use API tokens in production:

const client = new Crawl4AI({
  baseUrl: 'https://your-crawl4ai-server.com',
  apiToken: process.env.CRAWL4AI_API_TOKEN
});

Rate Limiting

Implement client-side throttling:

// Sequential processing with delays
for (const url of urls) {
  const results = await client.crawl({ urls: url });
  await new Promise(resolve => setTimeout(resolve, 1000)); // 1s delay
}
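
Alternatively, since crawl() accepts an array of URLs (see Batch Processing above), you can send fixed-size chunks and pace the chunks instead of individual requests; a sketch:

const chunkSize = 5;

for (let i = 0; i < urls.length; i += chunkSize) {
  const chunk = urls.slice(i, i + chunkSize);
  // One request covers the whole chunk; the server crawls them together
  const results = await client.crawl({ urls: chunk });
  console.log(`Crawled ${results.length} pages`);
  await new Promise(resolve => setTimeout(resolve, 1000)); // pause between chunks
}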

Input Validation

The SDK automatically validates URLs before making requests. Invalid URLs will throw a Crawl4AIError.
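
For example, a malformed URL is rejected client-side before any request is sent; a minimal sketch:

import { Crawl4AIError } from 'crawl4ai';

try {
  await client.crawl({ urls: 'not-a-valid-url' });
} catch (error) {
  if (error instanceof Crawl4AIError) {
    console.error('Validation failed:', error.message);
  }
}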

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This SDK is released under the MIT License.

🙏 Acknowledgments

Built for the amazing Crawl4AI project by @unclecode and the Crawl4AI community.