olyptik

v0.1.18

Published

7 months ago

Official TypeScript SDK for Olyptik API


Olyptik - Node.js SDK

Get started with the Olyptik Node.js/TypeScript SDK for web crawling and content extraction

Installation

Install the SDK using npm:

npm install olyptik

Configuration

First, initialize the SDK with your API key, which you can find on the settings page. You can pass the key directly, as in the snippet that follows, or read it from an environment variable.

import Olyptik from 'olyptik';

// Initialize with API key
const client = new Olyptik({ apiKey: 'your-api-key' });
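To illustrate the environment-variable option, here is a minimal sketch of a helper that reads the key and fails fast when it is missing. The variable name `OLYPTIK_API_KEY` and the helper `resolveApiKey` are assumptions made for this example; the SDK documentation does not name either.

```typescript
// Hypothetical variable name: the SDK docs do not specify one.
const API_KEY_ENV_VAR = 'OLYPTIK_API_KEY';

// Return the API key from the given environment map,
// throwing when the variable is absent or blank.
function resolveApiKey(env: Record<string, string | undefined>): string {
  const apiKey = env[API_KEY_ENV_VAR]?.trim();
  if (!apiKey) {
    throw new Error(`Missing ${API_KEY_ENV_VAR} environment variable`);
  }
  return apiKey;
}

// Usage with the client constructor shown above:
//   const client = new Olyptik({ apiKey: resolveApiKey(process.env) });
```

Passing the environment map as an argument keeps the helper easy to test without touching the real process environment.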

Usage

Starting a Crawl

The SDK allows you to start web crawls with various configuration options:

Minimal settings crawl:

const crawl = await client.runCrawl({
  startUrl: 'https://example.com',
  maxResults: 10
});

Full example:

const crawl = await client.runCrawl({
  startUrl: 'https://example.com',
  maxResults: 100,
  maxDepth: 10,
  includeLinks: true,
  useSitemap: false,
  entireWebsite: false, 
  excludeNonMainTags: true,
  deduplicateContent: true,
  extraction: "",
  timeout: 60,
  engineType: "auto",
  useStaticIps: false
});

Get crawl

Retrieve a crawl by its ID. The response is a Crawl object:

const fetchedCrawl = await client.getCrawl(crawl.id);
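A common use of `getCrawl` is to wait for a crawl to finish. The sketch below polls until the status is no longer "RUNNING", assuming `runCrawl` returns while the crawl is still in progress. The helper `waitForCrawl`, the `CrawlLike` and `CrawlReader` interfaces, and the default interval and attempt budget are all hypothetical; only `getCrawl` and the status strings come from this README.

```typescript
// Minimal structural types so the sketch compiles on its own;
// in real code these would come from the SDK.
interface CrawlLike {
  id: string;
  status: string;
}

interface CrawlReader {
  getCrawl(crawlId: string): Promise<CrawlLike>;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Poll getCrawl until the status is no longer "RUNNING"
// or the attempt budget runs out.
async function waitForCrawl(
  reader: CrawlReader,
  crawlId: string,
  intervalMs = 5_000,
  maxAttempts = 120,
): Promise<CrawlLike> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const current = await reader.getCrawl(crawlId);
    if (current.status !== 'RUNNING') {
      return current; // SUCCEEDED, FAILED, TIMED_OUT, ABORTED, or ERROR
    }
    await sleep(intervalMs);
  }
  throw new Error(`Crawl ${crawlId} still running after ${maxAttempts} checks`);
}

// Usage (client is the Olyptik instance created earlier):
//   const finished = await waitForCrawl(client, crawl.id);
```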

Query crawls

const result: PaginationResult<Crawl> = await client.queryCrawls({
  startUrls: ['https://example.com'],
  status: [CrawlStatus.SUCCEEDED],
  page: 0,
});

console.log("Crawls: ", result.results);
console.log("Page: ", result.page);
console.log("Total pages: ", result.totalPages);
console.log("Count of items per page: ", result.limit);
console.log("Total matched crawls: ", result.totalResults);

Getting Crawl Results

Retrieve the results of your crawl using the crawl ID. The results are paginated, and you can specify the page number and limit per page.

const limit = 50;
const page = 0;
const results: PaginationResult<CrawlResult> = await client.getCrawlResults(crawl.id, page, limit);
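Because results are paginated, collecting everything means requesting page after page. The following is a rough sketch, assuming pages are zero-based (as in the example above) and that each response carries the `results` and `totalPages` fields shown in the Query crawls example. The helper `fetchAllResults` and the `Page` and `ResultsReader` interfaces are hypothetical names used only here.

```typescript
// Structural stand-ins for the SDK types, so the sketch is self-contained.
interface Page<T> {
  results: T[];
  page: number;
  totalPages: number;
}

interface ResultsReader<T> {
  getCrawlResults(crawlId: string, page: number, limit: number): Promise<Page<T>>;
}

// Collect every result of a crawl by walking pages 0 .. totalPages - 1.
async function fetchAllResults<T>(
  reader: ResultsReader<T>,
  crawlId: string,
  limit = 50,
): Promise<T[]> {
  const all: T[] = [];
  let page = 0;
  let totalPages = 1; // replaced by the real value after the first response
  while (page < totalPages) {
    const response = await reader.getCrawlResults(crawlId, page, limit);
    all.push(...response.results);
    totalPages = response.totalPages;
    page += 1;
  }
  return all;
}

// Usage:
//   const everything = await fetchAllResults(client, crawl.id);
```

For large crawls it may be preferable to process each page as it arrives rather than hold all results in memory.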

Abort a crawl

const abortedCrawl: Crawl = await client.abortCrawl(crawl.id);

Get crawl logs

Retrieve logs for a specific crawl to monitor its progress and debug issues:

const page = 1;
const limit = 1200;
const logs: PaginationResult<CrawlLog> = await client.getCrawlLogs(crawl.id, page, limit);

Scrape multiple URLs

Scrape up to 30 URLs at once without following links:

const scrapeResponse: ScrapeResponse = await client.scrape({
  urls: ['https://example.com', 'https://example.com/about'],
  includeLinks: true,
  excludeNonMainTags: true,
  deduplicateContent: true,
  extraction: "",
  timeout: 5,
  engineType: "auto",
  useStaticIps: false
});

for (const result of scrapeResponse.results) {
  if (result.isSuccess) {
    console.log(`URL: ${result.url}`);
    console.log(`Title: ${result.title}`);
    console.log(`Links found: ${result.links.length}`);
  } else {
    console.log(`Failed to scrape ${result.url}: ${result.errorMessage}`);
  }
}
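When a URL list exceeds the 30-URL limit, one option is to split it into batches and call `scrape` once per batch. This is a minimal sketch; `chunkUrls` and `MAX_URLS_PER_SCRAPE` are hypothetical names, and the SDK may or may not offer batching of its own.

```typescript
// Per-call limit stated in this README.
const MAX_URLS_PER_SCRAPE = 30;

// Split a URL list into consecutive batches of at most `size` entries.
function chunkUrls(urls: string[], size: number = MAX_URLS_PER_SCRAPE): string[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: string[][] = [];
  for (let start = 0; start < urls.length; start += size) {
    batches.push(urls.slice(start, start + size));
  }
  return batches;
}

// Usage:
//   for (const urls of chunkUrls(allUrls)) {
//     const response = await client.scrape({ urls });
//   }
```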

Objects

RunCrawlPayload

You must provide at least one of the following: maxResults, useSitemap, or entireWebsite.

| Property | Type | Required | Default | Description |
|--------|------|----------|---------|-------------|
| startUrl | string | ✅ | - | The URL to start crawling from |
| maxResults | number | ❌ | - | Maximum number of results to collect (1-5,000) |
| useSitemap | boolean | ❌ | false | Whether to use sitemap.xml to crawl the website |
| entireWebsite | boolean | ❌ | false | Whether to use sitemap.xml and all found links to crawl the website |
| maxDepth | number | ❌ | 10 | Maximum depth of pages to crawl (1-100) |
| includeLinks | boolean | ❌ | true | Whether to include links in the crawl results' markdown |
| excludeNonMainTags | boolean | ❌ | true | Whether to exclude non-main HTML tags (header, footer, aside, etc.) from the crawl results |
| deduplicateContent | boolean | ❌ | true | Remove duplicate content from markdown that appears on multiple pages |
| extraction | string | ❌ | "" | Instructions defining how the AI should extract specific content from the crawl results |
| timeout | number | ❌ | 60 | Timeout duration in minutes |
| engineType | string | ❌ | "auto" | The engine to use: "auto", "cheerio" (fast, static sites), "playwright" (dynamic sites) |
| useStaticIps | boolean | ❌ | false | Whether to use static IPs for the crawl |
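The constraints above can be checked on the client before a request is sent. The sketch below is one possible reading of them: it treats `useSitemap: false` and `entireWebsite: false` as not provided, and it assumes the numeric limits apply to integers. The `CrawlOptions` interface and `validateCrawlOptions` helper are hypothetical; the API may validate differently.

```typescript
// Subset of the payload fields that carry documented constraints.
interface CrawlOptions {
  startUrl: string;
  maxResults?: number;
  useSitemap?: boolean;
  entireWebsite?: boolean;
  maxDepth?: number;
}

// True when value is an integer within [min, max].
function isIntegerInRange(value: number, min: number, max: number): boolean {
  return Number.isInteger(value) && value >= min && value <= max;
}

// Return a list of problems; an empty list means the payload passes these checks.
function validateCrawlOptions(options: CrawlOptions): string[] {
  const problems: string[] = [];
  const { maxResults, useSitemap, entireWebsite, maxDepth } = options;

  // Assumption: a false flag does not satisfy the "at least one" rule.
  if (maxResults === undefined && !useSitemap && !entireWebsite) {
    problems.push('Provide at least one of maxResults, useSitemap, or entireWebsite');
  }
  if (maxResults !== undefined && !isIntegerInRange(maxResults, 1, 5000)) {
    problems.push('maxResults must be an integer between 1 and 5000');
  }
  if (maxDepth !== undefined && !isIntegerInRange(maxDepth, 1, 100)) {
    problems.push('maxDepth must be an integer between 1 and 100');
  }
  return problems;
}
```

Returning a list rather than throwing on the first problem lets a caller report every issue at once.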

Crawl

| Property | Type | Description |
|-------|------|-------------|
| id | string | Unique crawl identifier |
| status | string | Current status ("RUNNING", "SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED", "ERROR") |
| startUrls | string[] | Starting URLs |
| includeLinks | boolean | Whether links are included |
| maxDepth | number | Maximum crawl depth |
| maxResults | number | Maximum number of results |
| teamId | string | Team identifier |
| createdAt | string | Creation timestamp |
| completedAt | string \| null | Completion timestamp |
| durationInSeconds | number | Total duration |
| totalPages | number | Number of results found |
| useSitemap | boolean | Whether sitemap was used |
| entireWebsite | boolean | Whether both the sitemap and all found links were used |
| deduplicateContent | boolean | Whether duplicate content appearing on multiple pages was removed from the markdown |
| extraction | string | Instructions defining how the AI should extract specific content from the crawl results |
| excludeNonMainTags | boolean | Whether non-main HTML tags were excluded |
| timeout | number | The timeout of the crawl in minutes |

CrawlResult

Each crawl result includes:

| Property | Type | Description |
|-------|------|-------------|
| id | string | Unique identifier for the page result |
| crawlId | string | Unique identifier for the crawl |
| url | string | The crawled URL |
| title | string | Page title extracted from the HTML |
| markdown | string | Extracted content in markdown format |
| depthOfUrl | number | How deep this URL was in the crawl (0 = start URL) |
| isSuccess | boolean | Whether the crawl was successful |
| error | string | Error message if the crawl failed |
| createdAt | string | ISO timestamp when the result was created |

CrawlLog

Each crawl log includes:

| Property | Type | Description |
|-------|------|-------------|
| id | string | Unique identifier for the log entry |
| message | string | Log message |
| level | string | Log level: "info", "debug", "warn", or "error" |
| description | string | Detailed description of the log entry |
| crawlId | string | Unique identifier for the crawl |
| teamId | string \| null | Team identifier |
| data | object \| null | Additional data associated with the log entry |
| createdAt | Date | Timestamp when the log was created |

StartScrapePayload

| Property | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| urls | string[] | ✅ | - | Array of URLs to scrape (max 30 URLs) |
| includeLinks | boolean | ❌ | true | Whether to include links in the scrape results' markdown |
| excludeNonMainTags | boolean | ❌ | true | Whether to exclude non-main tags from the scrape results' markdown |
| deduplicateContent | boolean | ❌ | true | Whether to remove duplicate text fragments that appeared in multiple scraped pages |
| extraction | string | ❌ | "" | Instructions defining how the AI should extract specific content from the scrape results |
| timeout | number | ❌ | 5 | Timeout duration in minutes |
| engineType | string | ❌ | "auto" | The engine to use: "auto", "cheerio" (fast, static sites), "playwright" (dynamic sites) |
| useStaticIps | boolean | ❌ | false | Whether to use static IPs for the scrape |

ScrapeResponse

The response from a scrape operation:

| Property | Type | Description |
|-------|------|-------------|
| id | string | Unique scrape identifier |
| teamId | string | Team identifier |
| projectId | string | Project identifier |
| results | UrlResult[] | Array of scrape results |
| timeout | number | Timeout in minutes |
| origin | string | Origin of the scrape ("api" or "web") |
| createdAt | Date | Creation timestamp |
| updatedAt | Date | Last update timestamp |

UrlResult

Each URL scrape result includes:

| Property | Type | Description |
|-------|------|-------------|
| url | string | The URL that was scraped |
| isSuccess | boolean | Whether the scrape was successful |
| title | string | Page title |
| markdown | string | Extracted content in markdown format |
| links | string[] | Links found on the page |
| duplicatesRemovedCount | number | Number of duplicate content blocks removed |
| errorCode | number | Error code if the scrape failed |
| errorMessage | string | Error message if the scrape failed |

Error Handling

The SDK throws errors for various scenarios. Always wrap your calls in try-catch blocks:

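import { AxiosError } from 'axios';

try {
  const crawl = await client.runCrawl({
    startUrl: 'https://example.com',
    maxResults: 10
  });
} catch (error) {
  if (error instanceof AxiosError) {
    // The API returned an error response (response is undefined on network failures)
    console.error('API Error:', error.response?.status, error.response?.data);
  } else {
    // Not an HTTP error: rethrow so it is not silently swallowed
    throw error;
  }
}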
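If the same handling is needed in several places, it can be pulled into a small helper. The sketch below checks the error's shape structurally instead of using `instanceof AxiosError`, so it runs without importing axios. That is a swapped-in technique for illustration, and `hasResponse` and `describeError` are hypothetical names, not part of the SDK.

```typescript
// Shape of an error that carries an HTTP response, as Axios errors do.
interface ErrorWithResponse {
  response: { status: number; data: unknown };
}

// Narrow an unknown value to an error that has a numeric response status.
function hasResponse(error: unknown): error is ErrorWithResponse {
  if (typeof error !== 'object' || error === null) {
    return false;
  }
  const response = (error as { response?: unknown }).response;
  return (
    typeof response === 'object' &&
    response !== null &&
    typeof (response as { status?: unknown }).status === 'number'
  );
}

// Turn any thrown value into a single log-friendly line.
function describeError(error: unknown): string {
  if (hasResponse(error)) {
    return `API error ${error.response.status}: ${JSON.stringify(error.response.data)}`;
  }
  if (error instanceof Error) {
    return `Request failed: ${error.message}`;
  }
  return `Unknown error: ${String(error)}`;
}

// Usage:
//   try { await client.runCrawl({ startUrl, maxResults: 10 }); }
//   catch (error) { console.error(describeError(error)); }
```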