# Deepcrawl SDK
TypeScript SDK for the Deepcrawl API: web scraping and crawling with comprehensive error handling.
## ⚡ Why Deepcrawl SDK?
- 🏗️ oRPC-Powered: Built on oRPC framework for type-safe RPC
- 🔒 Type-Safe: End-to-end TypeScript with error handling
- 🖥️ Server-Side Only: Designed for Node.js, Cloudflare Workers, and Next.js Server Actions
- 🪶 Lightweight: Minimal bundle size with tree-shaking support
- 🛡️ Error Handling: Comprehensive, typed errors with context
- 🔄 Retry Logic: Built-in exponential backoff for transient failures
- ⚡ Connection Pooling: Automatic HTTP connection reuse (Node.js)
## 📦 Installation
```bash
npm install deepcrawl
# or
yarn add deepcrawl
# or
pnpm add deepcrawl
```
Zod v4 ships with the SDK as a runtime dependency and is mirrored as a peer dependency. If your app already provides Zod ≥4.1, your package manager will dedupe it; otherwise, the bundled copy means no extra install step.
## 🚀 Quick Start
```ts
import { DeepcrawlApp } from 'deepcrawl';

const deepcrawl = new DeepcrawlApp({
  apiKey: process.env.DEEPCRAWL_API_KEY
});

const result = await deepcrawl.readUrl('https://example.com');
console.log(result.markdown);
```
## 📦 Package Exports
The SDK uses dedicated export paths for better tree-shaking and organization:
### Main Export (SDK Client)
```ts
import { DeepcrawlApp } from 'deepcrawl';
```
### Types Export
```ts
import type {
  // Configuration
  DeepcrawlConfig,
  // API Types
  ReadUrlOptions,
  ReadUrlResponse,
  GetMarkdownOptions,
  GetMarkdownResponse,
  ExtractLinksOptions,
  ExtractLinksResponse,
  GetLinksOptions,
  GetLinksResponse,
  // Activity Logs
  ActivityLogEntry,
  ListLogsOptions,
  ListLogsResponse,
  GetOneLogOptions,
  // Metadata & Metrics
  Metadata,
  MetricsOptions,
  Metrics,
  // Links
  LinksTree,
  LinkItem,
  SocialMediaLink,
  // Errors
  DeepcrawlError,
  DeepcrawlReadError,
  DeepcrawlLinksError,
  DeepcrawlRateLimitError,
  DeepcrawlAuthError,
  DeepcrawlValidationError,
  DeepcrawlNotFoundError,
  DeepcrawlServerError,
  DeepcrawlNetworkError,
} from 'deepcrawl/types';
```
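These exports are types only (erased at compile time) and can annotate your own values; for example, using fields that appear in the API examples below:
```ts
import type { ReadUrlOptions } from 'deepcrawl/types';

// Checked against the SDK's option types at compile time
const options: ReadUrlOptions = {
  metadata: true,
  markdown: true,
};
```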
### Schemas Export
```ts
import {
  z,
  // Request Schemas
  ReadUrlOptionsSchema,
  GetMarkdownOptionsSchema,
  ExtractLinksOptionsSchema,
  GetLinksOptionsSchema,
  ListLogsOptionsSchema,
  GetOneLogOptionsSchema,
  // Response Schemas
  ReadUrlResponseSchema,
  GetMarkdownResponseSchema,
  ExtractLinksResponseSchema,
  GetLinksResponseSchema,
  ListLogsResponseSchema,
  // Metadata & Metrics
  MetadataSchema,
  MetricsOptionsSchema,
  MetricsSchema,
  // Links
  LinksTreeSchema,
  // Services
  CacheOptionsSchema
} from 'deepcrawl/schemas';
```
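Because these are plain Zod schemas, they can also validate untrusted input at runtime; a small sketch using standard Zod parsing:
```ts
import { ReadUrlOptionsSchema } from 'deepcrawl/schemas';

// Hypothetical untrusted input, e.g. parsed from a request body
const rawInput: unknown = { metadata: true, markdown: true };

const parsed = ReadUrlOptionsSchema.safeParse(rawInput);
if (parsed.success) {
  console.log(parsed.data);
} else {
  console.error(parsed.error.issues);
}
```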
Importing `z` from `deepcrawl/zod/v4` (or from `deepcrawl/schemas`) reuses the SDK's Zod runtime, so schema composition works even if your app already has its own Zod installation.
### Zod Helper
```ts
import { z } from 'deepcrawl/zod/v4';
import { ReadUrlOptionsSchema } from 'deepcrawl/schemas';

const CustomSchema = ReadUrlOptionsSchema.extend({
  customFlag: z.boolean().default(false),
});
```
Use this helper when you compose Zod schemas with the SDK's public schemas or utilities; it avoids instance-mismatch issues in projects that install multiple copies of Zod.
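The extended schema from the snippet above behaves like any other Zod schema, so parsing works as usual:
```ts
const parsed = CustomSchema.safeParse({ customFlag: true });
if (parsed.success) {
  console.log(parsed.data.customFlag); // true
} else {
  console.log(parsed.error.issues); // e.g. fields required by the base schema
}
```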
### Utilities Export
```ts
import {
  // Zod schema helper
  OptionalBoolWithDefault,
  // Pagination normalization
  normalizeListLogsPagination
} from 'deepcrawl/types/utils';

// Example: Create optional boolean schema with default
const schema = OptionalBoolWithDefault(true);

// Example: Normalize pagination input
const normalized = normalizeListLogsPagination({ limit: 150, offset: -5 });
// Returns: { limit: 100, offset: 0 } (clamped to valid ranges)
```
## 📖 API Methods
### `readUrl(url, options?)`
Extract clean content and metadata from any URL.
```ts
import { DeepcrawlApp } from 'deepcrawl';
import type { ReadUrlOptions } from 'deepcrawl/types';

const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });

const result = await deepcrawl.readUrl('https://example.com', {
  metadata: true,
  markdown: true,
  cleanedHtml: true,
  metricsOptions: { enabled: true }
});

console.log(result.markdown);
console.log(result.metadata?.title);
console.log(result.metrics?.readableDuration);
```
### `getMarkdown(url, options?)`
Simplified method to get just markdown content.
```ts
import { DeepcrawlApp } from 'deepcrawl';

const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });

const result = await deepcrawl.getMarkdown('https://example.com', {
  metricsOptions: { enabled: true }
});

console.log(result.markdown);
```
### `extractLinks(url, options?)`
Extract all links from a page with powerful filtering options.
```ts
import { DeepcrawlApp } from 'deepcrawl';
import type { ExtractLinksOptions } from 'deepcrawl/types';

const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });

const result = await deepcrawl.extractLinks('https://example.com', {
  includeInternal: true,
  includeExternal: false,
  includeEmails: false,
  includePhoneNumbers: false,
  includeSocialMedia: false,
  metricsOptions: { enabled: true }
});

console.log(result.tree.internal);
console.log(result.tree.socialMedia);
```
### `listLogs(options?)`
Retrieve activity logs with paginated results and filtering.
```ts
import { DeepcrawlApp } from 'deepcrawl';
import type { ListLogsOptions } from 'deepcrawl/types';

const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });

const result = await deepcrawl.listLogs({
  limit: 50,
  offset: 0,
  path: 'read-getMarkdown',
  success: true,
  startDate: '2025-01-01T00:00:00Z',
  endDate: '2025-12-31T23:59:59Z',
  orderBy: 'requestTimestamp',
  orderDir: 'desc'
});

console.log(result.logs);
console.log(result.meta.hasMore);
```
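Because `meta.hasMore` indicates whether further pages exist, you can page through the full history with an offset loop. A sketch, reusing the client from above:
```ts
import type { ActivityLogEntry } from 'deepcrawl/types';

async function fetchAllLogs(): Promise<ActivityLogEntry[]> {
  const all: ActivityLogEntry[] = [];
  const limit = 100; // maximum page size (see the clamping rules above)
  let offset = 0;
  while (true) {
    const page = await deepcrawl.listLogs({ limit, offset });
    all.push(...page.logs);
    if (!page.meta.hasMore) break;
    offset += limit;
  }
  return all;
}
```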
### `getOneLog(options)`
Get a single activity log entry by ID.
```ts
import { DeepcrawlApp } from 'deepcrawl';
import type { GetOneLogOptions } from 'deepcrawl/types';

const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });

const log = await deepcrawl.getOneLog({ id: 'request-id-123' });

console.log(log.path);
console.log(log.response);
```
## 🌟 Real-World Usage Examples
### E-commerce Product Monitoring
```ts
import { DeepcrawlApp } from 'deepcrawl';
import type { ReadUrlOptions } from 'deepcrawl/types';

const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });

async function monitorProduct(productUrl: string) {
  try {
    const result = await deepcrawl.readUrl(productUrl, {
      metadata: true,
      cleanedHtml: true
    });
    return {
      title: result.metadata?.title,
      lastChecked: new Date().toISOString()
    };
  } catch (error) {
    if (error.isRateLimit?.()) {
      console.log(`Rate limited. Retry after ${error.retryAfter}s`);
      await new Promise(r => setTimeout(r, error.retryAfter * 1000));
      return monitorProduct(productUrl);
    }
    throw error;
  }
}
```
### Content Aggregation Pipeline
```ts
import { DeepcrawlApp } from 'deepcrawl';
import type { ReadUrlResponse } from 'deepcrawl/types';

const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });

async function aggregateArticles(urls: string[]) {
  const results = await Promise.allSettled(
    urls.map(url => deepcrawl.readUrl(url, {
      metadata: true,
      markdown: true
    }))
  );
  return results.map((result, index) => ({
    url: urls[index],
    success: result.status === 'fulfilled',
    data: result.status === 'fulfilled' ? result.value : null,
    error: result.status === 'rejected' ? result.reason.message : null
  }));
}
```
### Next.js Server Actions
```ts
// app/actions/scrape.ts
'use server';

import { DeepcrawlApp } from 'deepcrawl';
import { headers } from 'next/headers';
import { revalidatePath } from 'next/cache';

export async function scrapeUrlAction(url: string) {
  const deepcrawl = new DeepcrawlApp({
    apiKey: process.env.DEEPCRAWL_API_KEY,
    headers: await headers(),
  });
  try {
    const result = await deepcrawl.readUrl(url, {
      metadata: true,
      markdown: true,
    });
    revalidatePath('/dashboard');
    return {
      success: true,
      data: {
        title: result.metadata?.title,
        content: result.markdown,
      }
    };
  } catch (error) {
    return {
      success: false,
      error: {
        message: error.message,
        retryable: error.isRateLimit?.() || error.isNetwork?.(),
      }
    };
  }
}
```
### React Hook with Error Handling
Note: this hook creates the SDK client in the browser, which exposes your API key; for client-facing apps, prefer the Server Actions pattern covered under Security Best Practices below.
```tsx
import { useState, useCallback, useMemo } from 'react';
import { DeepcrawlApp } from 'deepcrawl';
import type { ReadUrlResponse } from 'deepcrawl/types';

export function useScraping(apiKey: string) {
  const [data, setData] = useState<ReadUrlResponse | null>(null);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState<string | null>(null);
  // Memoize the client so it isn't recreated on every render
  const deepcrawl = useMemo(() => new DeepcrawlApp({ apiKey }), [apiKey]);

  const scrape = useCallback(async (url: string) => {
    setLoading(true);
    setError(null);
    try {
      const result = await deepcrawl.readUrl(url, { metadata: true });
      setData(result);
    } catch (err) {
      setError(err.message);
    } finally {
      setLoading(false);
    }
  }, [deepcrawl]);

  return { data, loading, error, scrape };
}
```
### Activity Logging with Server Actions
```ts
// app/actions/logs.ts
'use server';

import { DeepcrawlApp } from 'deepcrawl';
import type { ListLogsResponse } from 'deepcrawl/types';

const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });

export async function getActivityLogs() {
  try {
    const logs = await deepcrawl.listLogs({
      limit: 50,
      offset: 0
    });
    return { success: true, data: logs };
  } catch (error) {
    return {
      success: false,
      error: error instanceof Error ? error.message : 'Failed to fetch logs'
    };
  }
}
```
## 🛡️ Error Handling
### Error Classes
```ts
import type {
  DeepcrawlError,
  DeepcrawlReadError,
  DeepcrawlLinksError,
  DeepcrawlRateLimitError,
  DeepcrawlAuthError,
  DeepcrawlValidationError,
  DeepcrawlNotFoundError,
  DeepcrawlServerError,
  DeepcrawlNetworkError,
} from 'deepcrawl/types';
```
### Try/Catch Pattern
```ts
import { DeepcrawlApp } from 'deepcrawl';

const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });
const url = 'https://example.com';

try {
  const result = await deepcrawl.readUrl(url);
} catch (error) {
  if (error.isRateLimit?.()) {
    console.log(`Retry after ${error.retryAfter}s`);
  } else if (error.isRead?.()) {
    console.log(`Failed to read: ${error.message}`);
  }
}
```
### Instance Type Checking
```ts
import { DeepcrawlApp } from 'deepcrawl';

const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });
const url = 'https://example.com';

try {
  const result = await deepcrawl.readUrl(url);
} catch (error) {
  // Check the error type using instance methods
  if (error.isAuth?.()) {
    console.log('Authentication failed');
  } else if (error.isValidation?.()) {
    console.log('Invalid request parameters');
  }
}
```
### Error Properties
All errors include:
- `code: string` - oRPC error code
- `status: number` - HTTP status
- `message: string` - User-friendly error message
- `data: any` - Raw error data from the API

Rate limit errors also include:
- `retryAfter: number` - Seconds to wait
- `operation: string` - Which operation was rate limited

Read/Links errors also include:
- `targetUrl: string` - The URL that failed
- `success: false` - Always false for errors
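A quick illustration of these properties, using the instance-method checks shown earlier:
```ts
try {
  await deepcrawl.readUrl('https://example.com');
} catch (error) {
  if (error.isRateLimit?.()) {
    // Rate limit errors carry retryAfter and operation
    console.log(`${error.operation} rate limited, retry in ${error.retryAfter}s`);
  } else if (error.isRead?.()) {
    // Read errors carry targetUrl, plus the common code/status/message
    console.log(`Read failed for ${error.targetUrl} (${error.code}, HTTP ${error.status})`);
  }
}
```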
## 🔧 Configuration
```ts
import { DeepcrawlApp } from 'deepcrawl';
import type { DeepcrawlConfig } from 'deepcrawl/types';

const deepcrawl = new DeepcrawlApp({
  apiKey: process.env.DEEPCRAWL_API_KEY,
  baseUrl: 'https://api.deepcrawl.dev',
  headers: {
    'User-Agent': 'MyApp/1.0'
  },
  fetch: customFetch,
  fetchOptions: {
    timeout: 30000
  }
});
```
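As one example, the `customFetch` above could be a thin logging wrapper (a sketch; anything matching the standard `fetch` signature should work):
```ts
// Hypothetical wrapper passed via the `fetch` config option
const customFetch: typeof fetch = async (input, init) => {
  const url =
    typeof input === 'string' ? input : input instanceof URL ? input.href : input.url;
  console.log('[deepcrawl] request:', url);
  return fetch(input, init);
};
```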
### Connection Pooling (Node.js)
Automatic HTTP connection pooling in Node.js:
```ts
// Automatic configuration
{
  keepAlive: true,
  maxSockets: 10,
  maxFreeSockets: 5,
  timeout: 60000,
  keepAliveMsecs: 30000
}
```
Benefits:
- ⚡ Faster for concurrent requests (see the sketch below)
- 🔄 Connection reuse reduces handshake overhead
- 🎯 Auto-cleanup of idle connections
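Pooling pays off when requests run concurrently rather than sequentially. A minimal sketch, reusing a configured client:
```ts
const urls = [
  'https://example.com/a',
  'https://example.com/b',
  'https://example.com/c',
];

// Concurrent reads reuse pooled connections in Node.js
const pages = await Promise.all(urls.map((url) => deepcrawl.readUrl(url)));
console.log(pages.map((page) => page.markdown?.length));
```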
## 🔒 Security Best Practices
### Next.js Server Actions (Recommended)
```ts
// ✅ SECURE: lib/deepcrawl.ts
// ('use server' files may only export async functions, so the shared
// client lives in a server-only module instead.)
import 'server-only';
import { DeepcrawlApp } from 'deepcrawl';

export const deepcrawlClient = new DeepcrawlApp({
  apiKey: process.env.DEEPCRAWL_API_KEY
});
```
```ts
// ✅ SECURE: app/actions/scrape.ts
'use server';

import { deepcrawlClient } from '@/lib/deepcrawl';

export async function scrapeAction(url: string) {
  return deepcrawlClient.readUrl(url);
}
```
```tsx
// ✅ SECURE: Client component
'use client';

import { scrapeAction } from '@/app/actions/scrape';

export function ScrapeButton() {
  const handleClick = async () => {
    const result = await scrapeAction('https://example.com');
    console.log(result);
  };
  return <button onClick={handleClick}>Scrape</button>;
}
```
### What NOT to Do
```tsx
// ❌ INSECURE: Direct SDK usage in client components
'use client';

import { DeepcrawlApp } from 'deepcrawl';

export function BadComponent() {
  const deepcrawl = new DeepcrawlApp({
    apiKey: process.env.DEEPCRAWL_API_KEY // ❌ Exposes API key!
  });
}
```
## 🌍 Environment Support
⚠️ Server-Side Only: The Deepcrawl SDK is designed for server-side use:
- ✅ Node.js (18+) with connection pooling
- ✅ Cloudflare Workers
- ✅ Vercel Edge Runtime
- ✅ Next.js Server Actions (recommended)
- ✅ Deno, Bun, and other modern runtimes
- ❌ Browser environments (use Server Actions instead)
## 📄 License
MIT - see LICENSE for details.
## 🤝 Support
Built with ❤️ by @felixLu.
