olyptik
v0.1.18
Official TypeScript SDK for Olyptik API
Olyptik - Node.js SDK
Get started with the Olyptik Node.js/TypeScript SDK for web crawling and content extraction
Installation
Install the SDK using npm:
npm install olyptik
Configuration
First, initialize the SDK with your API key, which you can get from the settings page. You can either pass it directly or use environment variables.
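For the environment-variable route, a minimal sketch might look like the following. The variable name `OLYPTIK_API_KEY` and the `resolveApiKey` helper are assumptions made for illustration; this README does not name an environment variable, so check the official docs for the one the SDK actually reads.

```typescript
// Hypothetical variable name: the README does not specify one.
const API_KEY_ENV_VAR = "OLYPTIK_API_KEY";

// Reads the key from an environment-like object and fails fast if it is absent or blank.
function resolveApiKey(env: Record<string, string | undefined>): string {
  const apiKey = env[API_KEY_ENV_VAR]?.trim();
  if (!apiKey) {
    throw new Error(`Missing ${API_KEY_ENV_VAR} environment variable`);
  }
  return apiKey;
}
```

In a Node.js process this could be wired up as `new Olyptik({ apiKey: resolveApiKey(process.env) })`. Passing the key directly looks like this: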
import Olyptik from 'olyptik';
// Initialize with API key
const client = new Olyptik({ apiKey: 'your-api-key' });
Usage
Starting a Crawl
The SDK allows you to start web crawls with various configuration options:
Minimal settings crawl:
const crawl = await client.runCrawl({
  startUrl: 'https://example.com',
  maxResults: 10
});
Full example:
const crawl = await client.runCrawl({
  startUrl: 'https://example.com',
  maxResults: 100,
  maxDepth: 10,
  includeLinks: true,
  useSitemap: false,
  entireWebsite: false,
  excludeNonMainTags: true,
  deduplicateContent: true,
  extraction: "",
  timeout: 60,
  engineType: "auto",
  useStaticIps: false
});
Get crawl
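Retrieve a crawl by its ID. The response is a crawl object.
// Use a new name: `const crawl = ...getCrawl(crawl.id)` would reference itself before initialization
const fetchedCrawl = await client.getCrawl(crawl.id);
Query crawls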
const result: PaginationResult<Crawl> = await client.queryCrawls({
  startUrls: ['https://example.com'],
  status: [CrawlStatus.SUCCEEDED],
  page: 0,
});
console.log("Crawls: ", result.results);
console.log("Page: ", result.page);
console.log("Total pages: ", result.totalPages);
console.log("Count of items per page: ", result.limit);
console.log("Total matched crawls: ", result.totalResults);
Getting Crawl Results
Retrieve the results of your crawl using the crawl ID. The results are paginated, and you can specify the page number and limit per page.
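To collect every page rather than a single one, a small helper along these lines could be used. It is a sketch, not part of the SDK: `Page`, `FetchPage`, and `fetchAllPages` are hypothetical names, and it assumes pages are zero-based and that `totalPages` is the total page count, as the `queryCrawls` example suggests.

```typescript
// Minimal shape of one page, using fields shown in the queryCrawls example.
interface Page<T> {
  results: T[];
  page: number;
  totalPages: number;
}

// Any function that returns one page, e.g. a wrapper around getCrawlResults.
type FetchPage<T> = (page: number, limit: number) => Promise<Page<T>>;

// Collects items from every page, starting at page 0 (assumed zero-based).
async function fetchAllPages<T>(fetchPage: FetchPage<T>, limit = 50): Promise<T[]> {
  const items: T[] = [];
  let page = 0;
  let totalPages = 1; // replaced by the real value after the first request
  while (page < totalPages) {
    const current = await fetchPage(page, limit);
    items.push(...current.results);
    totalPages = current.totalPages;
    page += 1;
  }
  return items;
}
```

With the client from the earlier examples, this might be called as `fetchAllPages((page, limit) => client.getCrawlResults(crawl.id, page, limit))`. A single page can be requested directly: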
const limit = 50;
const page = 0;
const results: PaginationResult<CrawlResult> = await client.getCrawlResults(crawl.id, page, limit);
Abort a crawl
const abortedCrawl: Crawl = await client.abortCrawl(crawl.id);
Get crawl logs
Retrieve logs for a specific crawl to monitor its progress and debug issues:
const page = 1;
const limit = 1200;
const logs: PaginationResult<CrawlLog> = await client.getCrawlLogs(crawl.id, page, limit);
Scrape multiple URLs
Scrape up to 30 URLs at once without following links:
const scrapeResponse: ScrapeResponse = await client.scrape({
  urls: ['https://example.com', 'https://example.com/about'],
  includeLinks: true,
  excludeNonMainTags: true,
  deduplicateContent: true,
  extraction: "",
  timeout: 5,
  engineType: "auto",
  useStaticIps: false
});
for (const result of scrapeResponse.results) {
  if (result.isSuccess) {
    console.log(`URL: ${result.url}`);
    console.log(`Title: ${result.title}`);
    console.log(`Links found: ${result.links.length}`);
  } else {
    console.log(`Failed to scrape ${result.url}: ${result.errorMessage}`);
  }
}
Objects
RunCrawlPayload
You must provide at least one of the following: maxResults, useSitemap, or entireWebsite.
| Property | Type | Required | Default | Description |
|--------|------|----------|---------|-------------|
| startUrl | string | ✅ | - | The URL to start crawling from |
| maxResults | number | ❌ | - | Maximum number of results to collect (1-5,000) |
| useSitemap | boolean | ❌ | false | Whether to use sitemap.xml to crawl the website |
| entireWebsite | boolean | ❌ | false | Whether to use sitemap.xml and all found links to crawl the website |
| maxDepth | number | ❌ | 10 | Maximum depth of pages to crawl (1-100) |
| includeLinks | boolean | ❌ | true | Whether to include links in the crawl results' markdown |
| excludeNonMainTags | boolean | ❌ | true | Whether to exclude non-main HTML tags (header, footer, aside, etc.) from the crawl results |
| deduplicateContent | boolean | ❌ | true | Remove duplicate content from markdown that appears on multiple pages |
| extraction | string | ❌ | "" | Instructions defining how the AI should extract specific content from the crawl results |
| timeout | number | ❌ | 60 | Timeout duration in minutes |
| engineType | string | ❌ | "auto" | The engine to use: "auto", "cheerio" (fast, static sites), "playwright" (dynamic sites) |
| useStaticIps | boolean | ❌ | false | Whether to use static IPs for the crawl |
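The constraints in this table can be checked on the client before a request is sent. The sketch below is illustrative only: `CrawlLimits` and `validateCrawlLimits` are hypothetical names, and it assumes that `useSitemap: false` or `entireWebsite: false` does not count as "providing" the option. The API may apply its own, possibly different, validation.

```typescript
// Subset of RunCrawlPayload fields covered by the rules in the table above.
interface CrawlLimits {
  maxResults?: number;
  useSitemap?: boolean;
  entireWebsite?: boolean;
  maxDepth?: number;
}

// Returns a list of problems; an empty list means all checks passed.
function validateCrawlLimits(payload: CrawlLimits): string[] {
  const problems: string[] = [];
  const { maxResults, useSitemap, entireWebsite, maxDepth } = payload;

  // Assumption: a flag set to false is treated the same as an omitted flag.
  if (maxResults === undefined && !useSitemap && !entireWebsite) {
    problems.push("Provide at least one of maxResults, useSitemap, or entireWebsite");
  }
  // The negated range test also rejects NaN.
  if (maxResults !== undefined && !(maxResults >= 1 && maxResults <= 5000)) {
    problems.push("maxResults must be between 1 and 5000");
  }
  if (maxDepth !== undefined && !(maxDepth >= 1 && maxDepth <= 100)) {
    problems.push("maxDepth must be between 1 and 100");
  }
  return problems;
}
```

Returning a list instead of throwing on the first problem lets a caller report every issue at once.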
Crawl
| Property | Type | Description |
|-------|------|-------------|
| id | string | Unique crawl identifier |
| status | string | Current status ("RUNNING", "SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED", "ERROR") |
| startUrls | string[] | Starting URLs |
| includeLinks | boolean | Whether links are included |
| maxDepth | number | Maximum crawl depth |
| maxResults | number | Maximum number of results |
| teamId | string | Team identifier |
| createdAt | string | Creation timestamp |
| completedAt | string \| null | Completion timestamp |
| durationInSeconds | number | Total duration |
| totalPages | number | Number of results found |
| useSitemap | boolean | Whether sitemap was used |
| entireWebsite | boolean | Whether to use both sitemap and all found links |
| deduplicateContent | boolean | Whether duplicate content that appears on multiple pages is removed from the markdown |
| extraction | string | Instructions defining how the AI should extract specific content from the crawl results |
| excludeNonMainTags | boolean | Whether non-main HTML tags were excluded |
| timeout | number | The timeout of the crawl in minutes |
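Because `status` reports "RUNNING" while a crawl is in progress, one way to wait for completion is to poll `getCrawl` until the status changes. The following is a sketch under assumptions: `CrawlLike`, `sleep`, and `waitForCrawl` are hypothetical helpers, the interval and attempt budget are arbitrary, and every status other than "RUNNING" is treated as final.

```typescript
// Only the fields this helper needs from the Crawl object.
interface CrawlLike {
  id: string;
  status: string;
}

// Resolves after roughly `ms` milliseconds.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the crawl leaves "RUNNING" or the attempt budget runs out.
async function waitForCrawl(
  getCrawl: (id: string) => Promise<CrawlLike>,
  crawlId: string,
  intervalMs = 5000,
  maxAttempts = 120,
): Promise<CrawlLike> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const crawl = await getCrawl(crawlId);
    // Assumption: anything other than "RUNNING" is a final state.
    if (crawl.status !== "RUNNING") return crawl;
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`Crawl ${crawlId} still running after ${maxAttempts} checks`);
}
```

With the client from the earlier examples, this might be called as `waitForCrawl((id) => client.getCrawl(id), crawl.id)`. Taking `getCrawl` as a parameter keeps the helper independent of the SDK and easy to test with a stub.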
CrawlResult
Each crawl result includes:
| Property | Type | Description |
|-------|------|-------------|
| id | string | Unique identifier for the page result |
| crawlId | string | Unique identifier for the crawl |
| url | string | The crawled URL |
| title | string | Page title extracted from the HTML |
| markdown | string | Extracted content in markdown format |
| depthOfUrl | number | How deep this URL was in the crawl (0 = start URL) |
| isSuccess | boolean | Whether the crawl was successful |
| error | string | Error message if the crawl failed |
| createdAt | string | ISO timestamp when the result was created |
CrawlLog
Each crawl log includes:
| Property | Type | Description |
|-------|------|-------------|
| id | string | Unique identifier for the log entry |
| message | string | Log message |
| level | string | Log level: "info", "debug", "warn", or "error" |
| description | string | Detailed description of the log entry |
| crawlId | string | Unique identifier for the crawl |
| teamId | string \| null | Team identifier |
| data | object \| null | Additional data associated with the log entry |
| createdAt | Date | Timestamp when the log was created |
StartScrapePayload
| Property | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| urls | string[] | ✅ | - | Array of URLs to scrape (max 30 URLs) |
| includeLinks | boolean | ❌ | true | Whether to include links in the scrape results' markdown |
| excludeNonMainTags | boolean | ❌ | true | Whether to exclude non-main tags from the scrape results' markdown |
| deduplicateContent | boolean | ❌ | true | Whether to remove duplicate text fragments that appeared in multiple scraped pages |
| extraction | string | ❌ | "" | Instructions defining how the AI should extract specific content from the scrape results |
| timeout | number | ❌ | 5 | Timeout duration in minutes |
| engineType | string | ❌ | "auto" | The engine to use: "auto", "cheerio" (fast, static sites), "playwright" (dynamic sites) |
| useStaticIps | boolean | ❌ | false | Whether to use static IPs for the scrape |
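Since `urls` accepts at most 30 entries, a longer list has to be split across several scrape calls. A minimal sketch of that batching, where `MAX_URLS_PER_SCRAPE` and `chunkUrls` are hypothetical names rather than SDK exports:

```typescript
// Limit taken from the `urls` row in the table above.
const MAX_URLS_PER_SCRAPE = 30;

// Splits a URL list into consecutive batches of at most `size` entries.
function chunkUrls(urls: string[], size = MAX_URLS_PER_SCRAPE): string[][] {
  if (size < 1) {
    throw new Error("size must be at least 1");
  }
  const batches: string[][] = [];
  for (let start = 0; start < urls.length; start += size) {
    batches.push(urls.slice(start, start + size));
  }
  return batches;
}
```

Each batch could then be passed as the `urls` value of a separate `client.scrape` call. Whether to run those calls one after another or in parallel depends on rate limits this README does not describe.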
ScrapeResponse
The response from a scrape operation:
| Property | Type | Description |
|-------|------|-------------|
| id | string | Unique scrape identifier |
| teamId | string | Team identifier |
| projectId | string | Project identifier |
| results | UrlResult[] | Array of scrape results |
| timeout | number | Timeout in minutes |
| origin | string | Origin of the scrape ("api" or "web") |
| createdAt | Date | Creation timestamp |
| updatedAt | Date | Last update timestamp |
UrlResult
Each URL scrape result includes:
| Property | Type | Description |
|-------|------|-------------|
| url | string | The URL that was scraped |
| isSuccess | boolean | Whether the scrape was successful |
| title | string | Page title |
| markdown | string | Extracted content in markdown format |
| links | string[] | Links found on the page |
| duplicatesRemovedCount | number | Number of duplicate content blocks removed |
| errorCode | number | Error code if the scrape failed |
| errorMessage | string | Error message if the scrape failed |
Error Handling
The SDK throws errors for various scenarios. Always wrap your calls in try-catch blocks:
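import { AxiosError } from 'axios';
try {
  const crawl = await client.runCrawl({
    startUrl: 'https://example.com',
    maxResults: 10
  });
} catch (error) {
  if (error instanceof AxiosError) {
    // API returned an error response (response is undefined for network failures)
    console.error('API Error:', error.response?.status, error.response?.data);
  } else {
    // Not an HTTP error from the API; rethrow so it is not silently swallowed
    throw error;
  }
}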
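If importing `AxiosError` is not convenient, the same check can be approximated structurally. This is a sketch, not SDK behavior: `ApiErrorLike`, `isApiErrorLike`, and `describeError` are hypothetical helpers that only look for the `response.status` and `response.data` fields read in the catch block above.

```typescript
// Shape of the fields read in the catch block above.
interface ApiErrorLike {
  response: { status: number; data?: unknown };
}

// Type guard: true when the value carries a response object with a numeric status.
function isApiErrorLike(error: unknown): error is ApiErrorLike {
  if (typeof error !== "object" || error === null) return false;
  const response = (error as { response?: unknown }).response;
  return (
    typeof response === "object" &&
    response !== null &&
    typeof (response as { status?: unknown }).status === "number"
  );
}

// Turns any thrown value into a single log-friendly string.
function describeError(error: unknown): string {
  if (isApiErrorLike(error)) {
    return `API error ${error.response.status}: ${JSON.stringify(error.response.data)}`;
  }
  return error instanceof Error ? error.message : String(error);
}
```

Inside a `catch (error)` block, `console.error(describeError(error))` would then cover API responses, network failures without a response, and non-Error values in one place.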