@upcrawl/sdk
v1.5.0
@upcrawl/sdk
Official Node.js/Browser SDK for the Upcrawl API. Extract data from any website with a single API call.
Installation
```bash
npm install @upcrawl/sdk
```

Or with yarn:

```bash
yarn add @upcrawl/sdk
```

Quick Start
```js
import Upcrawl from '@upcrawl/sdk';

// Set your API key (get one at https://upcrawl.dev)
Upcrawl.setApiKey('uc-your-api-key');

// Scrape a webpage
const result = await Upcrawl.scrape({
  url: 'https://example.com',
  type: 'markdown'
});

console.log(result.markdown);
```

Usage
Setting API Key
The API key must be set before making any requests:
```js
import Upcrawl from '@upcrawl/sdk';

Upcrawl.setApiKey('uc-your-api-key');
```

Or using named imports:

```js
import { setApiKey } from '@upcrawl/sdk';

setApiKey('uc-your-api-key');
```

Scraping a Single URL
```js
import Upcrawl from '@upcrawl/sdk';

Upcrawl.setApiKey('uc-your-api-key');

const result = await Upcrawl.scrape({
  url: 'https://example.com',
  type: 'markdown',      // 'markdown' or 'html'
  onlyMainContent: true, // Remove nav, ads, footers
  extractMetadata: true  // Get title, description, etc.
});

console.log(result.markdown);
console.log(result.metadata?.title);
```

Batch Scraping
Scrape multiple URLs in a single request:
```js
const result = await Upcrawl.batchScrape({
  urls: [
    'https://example.com/page1',
    'https://example.com/page2',
    // You can also pass detailed options per URL:
    { url: 'https://example.com/page3', type: 'html' }
  ],
  type: 'markdown'
});

console.log(`Scraped ${result.successful} of ${result.total} pages`);

result.results.forEach(page => {
  if (page.success) {
    console.log(`${page.url}: ${page.markdown?.length} chars`);
  } else {
    console.log(`${page.url}: Failed - ${page.error}`);
  }
});
```

Web Search
Search the web and get structured results:
```js
const result = await Upcrawl.search({
  queries: ['latest AI news 2025'],
  limit: 10,
  location: 'US'
});

result.results.forEach(queryResult => {
  console.log(`Query: ${queryResult.query}`);
  queryResult.results.forEach(item => {
    console.log(`- ${item.title}`);
    console.log(`  ${item.url}`);
  });
});
```

Generate PDF from HTML
Generate a PDF from HTML content:
```js
const result = await Upcrawl.generatePdf({
  html: '<html><body><h1>Invoice</h1><p>Total: $500</p></body></html>',
  title: 'invoice-123',
  pageSize: 'A4',
  printBackground: true,
  margin: { top: '20mm', right: '20mm', bottom: '20mm', left: '20mm' }
});

console.log(result.url); // Download URL for the PDF
```

Generate PDF from URL
Generate a PDF from any webpage:
```js
const result = await Upcrawl.generatePdfFromUrl({
  url: 'https://example.com/report',
  title: 'report',
  pageSize: 'Letter',
  landscape: true
});

console.log(result.url); // Download URL for the PDF
```

Execute Code
Run code in an isolated sandbox environment:
```js
const result = await Upcrawl.executeCode({
  code: 'print("Hello, World!")',
  language: 'python'
});

console.log(result.stdout);          // "Hello, World!\n"
console.log(result.exitCode);        // 0
console.log(result.executionTimeMs); // 95.23
console.log(result.memoryUsageMb);   // 8.45
```

Each execution runs in its own isolated subprocess inside a Kata micro-VM with no network access. Code is cleaned up immediately after execution.
```js
// Multi-line code with imports
const result = await Upcrawl.executeCode({
  code: `
import json
data = {"name": "Upcrawl", "version": 1}
print(json.dumps(data, indent=2))
`
});

console.log(result.stdout);
// {
//   "name": "Upcrawl",
//   "version": 1
// }
```

Domain Filtering
Filter search results by domain:
```js
// Only include specific domains
const result = await Upcrawl.search({
  queries: ['machine learning tutorials'],
  includeDomains: ['medium.com', 'towardsdatascience.com']
});

// Or exclude domains
const result2 = await Upcrawl.search({
  queries: ['javascript frameworks'],
  excludeDomains: ['pinterest.com', 'quora.com']
});
```

LLM Summarization
Ask the API to summarize scraped content:
```js
const result = await Upcrawl.scrape({
  url: 'https://example.com/product',
  type: 'markdown',
  summary: {
    query: 'Extract the product name, price, and key features in JSON format'
  }
});

console.log(result.content); // Summarized content
```

LLM Tool Definitions
The SDK includes pre-built tool definitions that you can pass directly to LLMs. Available in two formats:
- Vercel AI SDK: works with `generateText` and `streamText` from the `ai` package
- OpenAI: works with `chat.completions.create` from the `openai` package
Install the peer dependency for whichever format you need:
```bash
npm install @upcrawl/sdk ai      # for Vercel AI SDK
npm install @upcrawl/sdk openai  # for OpenAI
```

Vercel AI SDK
```js
import Upcrawl from '@upcrawl/sdk';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

Upcrawl.setApiKey('uc-your-api-key');

// Pass all tools
const { text } = await generateText({
  model: openai('gpt-4.1'),
  tools: Upcrawl.tools.aiSdk.all,
  prompt: 'Find top AI startups and analyze their pricing',
});

// Or pick specific ones
const { text: text2 } = await generateText({
  model: openai('gpt-4.1'),
  tools: {
    webSearch: Upcrawl.tools.aiSdk.webSearch,
    scrape: Upcrawl.tools.aiSdk.scrape,
  },
  prompt: 'Search for and scrape the top 3 results',
});
```

OpenAI
```js
import Upcrawl from '@upcrawl/sdk';
import OpenAI from 'openai';

Upcrawl.setApiKey('uc-your-api-key');
const client = new OpenAI();

// Pass all tool definitions
const response = await client.chat.completions.create({
  model: 'gpt-4.1',
  messages: [{ role: 'user', content: 'Search for AI trends' }],
  tools: Upcrawl.tools.openai.all,
});

// Handle tool calls
for (const toolCall of response.choices[0].message.tool_calls ?? []) {
  const result = await Upcrawl.tools.openai.execute(
    toolCall.function.name,
    JSON.parse(toolCall.function.arguments),
  );
  console.log(result);
}
```

Available Tools
| Tool | Description |
|------|-------------|
| webSearch | Search the web and get results with URLs, titles, descriptions |
| scrape | Scrape a URL and get clean markdown or HTML content |
| executeCode | Execute Python code in an isolated sandbox |
Configuration
Custom Base URL
For self-hosted instances or testing:
```js
Upcrawl.setBaseUrl('https://your-instance.com/v1');
```

Request Timeout
Set a custom timeout (in milliseconds):
```js
Upcrawl.setTimeout(60000); // 60 seconds
```

Configure Multiple Options
```js
Upcrawl.configure({
  apiKey: 'uc-your-api-key',
  baseUrl: 'https://api.upcrawl.dev/v1',
  timeout: 120000
});
```

Error Handling
The SDK throws UpcrawlError for API errors:
```js
import Upcrawl, { UpcrawlError } from '@upcrawl/sdk';

try {
  const result = await Upcrawl.scrape({ url: 'https://example.com' });
} catch (error) {
  if (error instanceof UpcrawlError) {
    console.error(`Error ${error.status}: ${error.message}`);
    console.error(`Code: ${error.code}`);
  }
}
```

Common error codes:
| Status | Code | Description |
|--------|------|-------------|
| 401 | UNAUTHORIZED | Invalid or missing API key |
| 403 | FORBIDDEN | Access forbidden |
| 429 | RATE_LIMIT_EXCEEDED | Too many requests |
| 500 | INTERNAL_ERROR | Server error |
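A 401 or 403 will not succeed on retry, but a 429 (and often a 500) is transient, so wrapping calls in a retry with exponential backoff is a reasonable pattern. A minimal sketch; `backoffDelayMs`, `isRetryable`, and `withRetry` are illustrative helpers, not SDK exports:

```javascript
// Exponential backoff: delay doubles per attempt, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Retry rate limits and server errors; never retry auth failures.
function isRetryable(status) {
  return status === 429 || status >= 500;
}

// Run fn, retrying retryable UpcrawlError-style failures with backoff.
async function withRetry(fn, maxAttempts = 4) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt + 1 >= maxAttempts || !isRetryable(error.status)) throw error;
      await new Promise(resolve => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

Any SDK call can be wrapped, e.g. `withRetry(() => Upcrawl.scrape({ url: 'https://example.com' }))`.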
TypeScript Support
The SDK includes full TypeScript definitions:
```ts
import Upcrawl, {
  ScrapeOptions,
  ScrapeResponse,
  SearchOptions,
  SearchResponse,
  BatchScrapeOptions,
  BatchScrapeResponse,
  GeneratePdfOptions,
  PdfResponse,
  ExecuteCodeOptions,
  ExecuteCodeResponse
} from '@upcrawl/sdk';

const options: ScrapeOptions = {
  url: 'https://example.com',
  type: 'markdown'
};

const result: ScrapeResponse = await Upcrawl.scrape(options);
```

API Reference
Methods
| Method | Description |
|--------|-------------|
| Upcrawl.setApiKey(key) | Set the API key globally |
| Upcrawl.setBaseUrl(url) | Set custom base URL |
| Upcrawl.setTimeout(ms) | Set request timeout |
| Upcrawl.configure(config) | Configure multiple options |
| Upcrawl.scrape(options) | Scrape a single URL |
| Upcrawl.batchScrape(options) | Scrape multiple URLs |
| Upcrawl.search(options) | Search the web |
| Upcrawl.generatePdf(options) | Generate PDF from HTML |
| Upcrawl.generatePdfFromUrl(options) | Generate PDF from a URL |
| Upcrawl.executeCode(options) | Execute code in an isolated sandbox |
UpcrawlConfig
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| apiKey | string | No | Your Upcrawl API key |
| baseUrl | string | No | Custom API base URL |
| timeout | number | No | Request timeout in milliseconds |
SummaryQuery
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| query | string | Yes | Query/instruction for content summarization |
ScrapeOptions
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| url | string | Yes | URL to scrape (required) |
| type | "html" \| "markdown" | No | Output format: html or markdown. Defaults to "html" |
| onlyMainContent | boolean | No | Extract only main content (removes nav, ads, footers). Defaults to true |
| extractMetadata | boolean | No | Whether to extract page metadata |
| summary | object | No | Summary query for LLM summarization |
| timeoutMs | number | No | Custom timeout in milliseconds (1000-120000) |
| waitUntil | "load" \| "domcontentloaded" \| "networkidle" | No | Wait strategy for page load |
ScrapeMetadata
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| title | string | No | |
| description | string | No | |
| canonicalUrl | string | No | |
| finalUrl | string | No | |
| contentType | string | No | |
| contentLength | number | No | |
ScrapeResponse
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| url | string | Yes | Original URL that was scraped |
| html | string \| null | No | Rendered HTML content (when type is html) |
| markdown | string \| null | No | Content converted to Markdown (when type is markdown) |
| statusCode | number \| null | Yes | HTTP status code |
| success | boolean | Yes | Whether scraping was successful |
| error | string | No | Error message if scraping failed |
| timestamp | string | Yes | ISO timestamp when scraping completed |
| loadTimeMs | number | Yes | Time taken to load and render the page in milliseconds |
| metadata | object | No | Additional page metadata |
| retryCount | number | Yes | Number of retry attempts made |
| cost | number | No | Cost in USD for this scrape operation |
| content | string \| null | No | Content after summarization (when summary query provided) |
BatchScrapeOptions
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| urls | (string \| object)[] | Yes | Array of URLs to scrape (strings or detailed request objects) |
| type | "html" \| "markdown" | No | Output format: html or markdown |
| onlyMainContent | boolean | No | Extract only main content (removes nav, ads, footers) |
| summary | object | No | Summary query for LLM summarization |
| batchTimeoutMs | number | No | Global timeout for entire batch operation in milliseconds (10000-600000) |
| failFast | boolean | No | Whether to stop on first error |
BatchScrapeResponse
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| results | object[] | Yes | Array of scrape results |
| total | number | Yes | Total number of URLs processed |
| successful | number | Yes | Number of successful scrapes |
| failed | number | Yes | Number of failed scrapes |
| totalTimeMs | number | Yes | Total time taken for batch operation in milliseconds |
| timestamp | string | Yes | Timestamp when batch operation completed |
| cost | number | No | Total cost in USD for all scrape operations |
SearchOptions
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| queries | string[] | Yes | Array of search queries to execute (1-20) |
| limit | number | No | Number of results per query (1-100). Defaults to 10 |
| location | string | No | Location for search (e.g., "IN", "US") |
| includeDomains | string[] | No | Domains to include (will add site: to query) |
| excludeDomains | string[] | No | Domains to exclude (will add -site: to query) |
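The `includeDomains` and `excludeDomains` rows above describe query rewriting with `site:` operators. Conceptually the rewrite looks something like this sketch; `applyDomainFilters` illustrates the described behavior and is not an SDK export, and the server's exact operator syntax may differ:

```javascript
// Append site:/-site: operators to a search query, mirroring how
// includeDomains and excludeDomains are described in SearchOptions.
function applyDomainFilters(query, includeDomains = [], excludeDomains = []) {
  const include = includeDomains.map(d => `site:${d}`).join(' OR ');
  const exclude = excludeDomains.map(d => `-site:${d}`).join(' ');
  return [query, include && `(${include})`, exclude].filter(Boolean).join(' ');
}
```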
SearchResultWeb
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| url | string | Yes | URL of the search result |
| title | string | Yes | Title of the search result |
| description | string | Yes | Description/snippet of the search result |
SearchResultItem
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| query | string | Yes | The search query |
| success | boolean | Yes | Whether the search was successful |
| results | object[] | Yes | Parsed search result links |
| error | string | No | Error message if failed |
| loadTimeMs | number | No | Time taken in milliseconds |
| cost | number | No | Cost in USD for this query |
SearchResponse
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| results | object[] | Yes | Array of search results per query |
| total | number | Yes | Total number of queries |
| successful | number | Yes | Number of successful searches |
| failed | number | Yes | Number of failed searches |
| totalTimeMs | number | Yes | Total time in milliseconds |
| timestamp | string | Yes | ISO timestamp |
| cost | number | No | Total cost in USD |
PdfMargin
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| top | string | No | |
| right | string | No | |
| bottom | string | No | |
| left | string | No | |
GeneratePdfOptions
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| html | string | Yes | Complete HTML content to convert to PDF (required) |
| title | string | No | Title used for the exported filename |
| pageSize | "A4" \| "Letter" \| "Legal" | No | Page size. Defaults to "A4" |
| landscape | boolean | No | Landscape orientation. Defaults to false |
| margin | object | No | Page margins (e.g., { top: "20mm", right: "20mm", bottom: "20mm", left: "20mm" }) |
| printBackground | boolean | No | Print background graphics and colors. Defaults to true |
| skipChartWait | boolean | No | Skip waiting for chart rendering signal. Defaults to false |
| timeoutMs | number | No | Timeout in milliseconds (5000-120000). Defaults to 30000 |
GeneratePdfFromUrlOptions
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| url | string | Yes | URL to navigate to and convert to PDF (required) |
| title | string | No | Title used for the exported filename |
| pageSize | "A4" \| "Letter" \| "Legal" | No | Page size. Defaults to "A4" |
| landscape | boolean | No | Landscape orientation. Defaults to false |
| margin | object | No | Page margins |
| printBackground | boolean | No | Print background graphics and colors. Defaults to true |
| timeoutMs | number | No | Timeout in milliseconds (5000-120000). Defaults to 30000 |
PdfResponse
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| success | boolean | Yes | Whether PDF generation succeeded |
| url | string | No | Public URL of the generated PDF |
| filename | string | No | Generated filename |
| blobName | string | No | Blob storage path |
| error | string | No | Error message on failure |
| durationMs | number | Yes | Total time taken in milliseconds |
ExecuteCodeOptions
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| code | string | Yes | Code to execute (required) |
| language | "python" | No | Language runtime. Defaults to "python" |
ExecuteCodeResponse
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| stdout | string | Yes | Standard output from the executed code |
| stderr | string | Yes | Standard error from the executed code |
| exitCode | number | Yes | Process exit code (0 = success, 124 = timeout) |
| executionTimeMs | number | Yes | Execution time in milliseconds |
| timedOut | boolean | Yes | Whether execution was killed due to timeout |
| memoryUsageMb | number | No | Peak memory usage in megabytes |
| error | string | No | Error message if execution infrastructure failed |
| cost | number | No | Cost in USD for this execution |
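Given the exit-code semantics above (0 = success, 124 = timeout), a caller can classify a response with a small helper; `summarizeExecution` is an illustrative name, not part of the SDK:

```javascript
// Map an ExecuteCodeResponse-shaped object to a short outcome string
// using the documented semantics: 0 = success, 124 = timeout.
function summarizeExecution(result) {
  if (result.error) return `infrastructure error: ${result.error}`;
  if (result.timedOut || result.exitCode === 124) return 'timed out';
  if (result.exitCode === 0) return 'ok';
  return `failed with exit code ${result.exitCode}`;
}
```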
UpcrawlErrorResponse
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| error | object | Yes | |
| statusCode | number | No | |
CreateBrowserSessionOptions
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| width | number | No | Browser viewport width (800-3840). Defaults to 1280 |
| height | number | No | Browser viewport height (600-2160). Defaults to 720 |
| headless | boolean | No | Run browser in headless mode. Defaults to true |
BrowserSession
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| sessionId | string | Yes | Unique session identifier |
| wsEndpoint | string | Yes | WebSocket URL for connecting with Playwright/Puppeteer |
| vncUrl | string \| null | Yes | VNC URL for viewing the browser (if available) |
| affinityCookie | string | No | Affinity cookie for sticky session routing (format: SCRAPER_AFFINITY=xxx) - extracted from response headers |
| createdAt | Date | Yes | Session creation timestamp |
| width | number | Yes | Browser viewport width |
| height | number | Yes | Browser viewport height |
License
MIT
