LLMCrawl JavaScript SDK
The official JavaScript SDK for LLMCrawl, providing a simple and powerful way to scrape websites, crawl multiple pages, and extract structured data using AI.
Installation
npm install @llmcrawl/llmcrawl-js
Note: Version 1.0.0 introduces breaking changes. If you're upgrading from 0.x, please use named imports:
import { LLMCrawl } from '@llmcrawl/llmcrawl-js'
Quick Start
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
const client = new LLMCrawl({
apiKey: "your-api-key-here",
});
// Scrape a single page
const result = await client.scrape("https://example.com");
console.log(result.data?.markdown);
Features
- 🌐 Single Page Scraping - Extract content from individual web pages
- 🕷️ Website Crawling - Crawl entire websites with customizable options
- 🗺️ Site Mapping - Get all URLs from a website
- 🤖 AI-Powered Extraction - Extract structured data using custom schemas
- 📷 Screenshot Capture - Take screenshots of web pages
- ⚙️ Flexible Configuration - Extensive customization options
API Reference
Constructor
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
const client = new LLMCrawl({
apiKey: "your-api-key",
baseUrl: "https://api.llmcrawl.dev", // Optional custom base URL
});
Scraping
scrape(url, options?)
Scrape a single webpage and extract its content.
const result = await client.scrape("https://example.com", {
formats: ["markdown", "html", "links"],
headers: {
"User-Agent": "Mozilla/5.0...",
},
waitFor: 3000, // Wait 3 seconds for page to load
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
price: { type: "number" },
description: { type: "string" },
},
required: ["title", "price"],
},
},
});
if (result.success) {
console.log("Markdown:", result.data.markdown);
console.log("Extracted data:", result.data.extract);
console.log("Links:", result.data.links);
}
Options:
- formats: Array of formats to return ('markdown', 'html', 'rawHtml', 'links', 'screenshot', 'screenshot@fullPage')
- headers: Custom HTTP headers
- includeTags: HTML tags to include in output
- excludeTags: HTML tags to exclude from output
- timeout: Request timeout in milliseconds (1000-90000)
- waitFor: Delay before capturing content (0-60000ms)
- extract: AI extraction configuration
- webhookUrls: URLs to send results to
- metadata: Additional metadata to include
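Several of these options can be combined in a single call. A minimal sketch (the URL, tag lists, and metadata values are placeholders, and includeTags/excludeTags are assumed to take arrays of tag names):
// Sketch combining several of the scrape options listed above
const result = await client.scrape("https://example.com", {
  formats: ["markdown", "screenshot@fullPage"], // markdown plus a full-page screenshot
  includeTags: ["article", "main"],             // assumed: tags to keep in the output
  excludeTags: ["nav", "footer"],               // assumed: tags to drop from the output
  timeout: 30000,                               // fail the request after 30 seconds
  waitFor: 2000,                                // wait 2 seconds before capturing content
  metadata: { source: "readme-example" },       // arbitrary metadata to include
});
if (result.success) {
  console.log(result.data?.markdown);
}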
Crawling
crawl(url, options?)
Start a crawl job to scrape multiple pages from a website.
const crawlResult = await client.crawl("https://example.com", {
limit: 100,
maxDepth: 3,
includePaths: ["/blog/*", "/docs/*"],
excludePaths: ["/admin/*"],
scrapeOptions: {
formats: ["markdown"],
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
content: { type: "string" },
},
},
},
},
});
if (crawlResult.success) {
console.log("Crawl started with ID:", crawlResult.id);
}
Options:
- limit: Maximum number of pages to crawl
- maxDepth: Maximum crawl depth
- includePaths: Array of path patterns to include
- excludePaths: Array of path patterns to exclude
- allowBackwardLinks: Allow crawling backward links
- allowExternalLinks: Allow crawling external domains
- ignoreSitemap: Ignore robots.txt and sitemap.xml
- scrapeOptions: Scraping options for each page
- webhookUrls: URLs for webhook notifications
- webhookMetadata: Additional webhook metadata
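The webhook and link-following options aren't shown in the example above, so here is a sketch of a crawl that uses them; the webhook URL and metadata values are placeholders, and webhookUrls is assumed to take an array of URL strings:
// Sketch of a crawl with webhook notifications and link-following options
const crawlResult = await client.crawl("https://example.com", {
  limit: 50,
  maxDepth: 2,
  allowExternalLinks: false, // stay on the starting domain
  ignoreSitemap: true,       // discover pages by following links rather than the sitemap
  scrapeOptions: { formats: ["markdown"] },
  webhookUrls: ["https://hooks.example.com/llmcrawl"], // placeholder endpoint
  webhookMetadata: { project: "docs-crawl" },          // sent along with webhook notifications
});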
getCrawlStatus(jobId)
Check the status of a crawl job.
const status = await client.getCrawlStatus("crawl-job-id");
if (status.success) {
console.log("Status:", status.status);
console.log("Progress:", `${status.completed}/${status.total}`);
console.log("Data:", status.data); // Array of scraped pages
}cancelCrawl(jobId)
Cancel a running crawl job.
const result = await client.cancelCrawl("crawl-job-id");
if (result.success) {
console.log("Crawl cancelled:", result.message);
}
Site Mapping
map(url, options?)
Get all URLs from a website without scraping their content.
const mapResult = await client.map("https://example.com", {
limit: 1000,
includeSubdomains: true,
search: "documentation",
});
if (mapResult.success) {
console.log("Found URLs:", mapResult.links);
}
Options:
- limit: Maximum number of links to return (1-5000)
- includeSubdomains: Include subdomain URLs
- search: Filter links by search query
- ignoreSitemap: Ignore robots.txt and sitemap.xml
- includePaths: Path patterns to include
- excludePaths: Path patterns to exclude
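For targeted mapping, the path filters can be combined with the sitemap option. A sketch (the paths are placeholders, and includePaths/excludePaths are assumed to take the same glob-style patterns as crawl()):
// Sketch of a map call restricted to documentation paths
const docsMap = await client.map("https://example.com", {
  limit: 500,
  includePaths: ["/docs/*"],         // only keep documentation URLs
  excludePaths: ["/docs/archive/*"], // skip archived pages
  ignoreSitemap: false,              // let the sitemap seed URL discovery
});
if (docsMap.success) {
  console.log("Documentation URLs found:", docsMap.links?.length);
}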
AI-Powered Data Extraction
LLMCrawl supports advanced AI-powered data extraction using custom schemas:
const result = await client.scrape("https://example-store.com/product/123", {
formats: ["markdown"],
extract: {
mode: "llm",
schema: {
type: "object",
properties: {
productName: { type: "string" },
price: { type: "number" },
description: { type: "string" },
inStock: { type: "boolean" },
reviews: {
type: "array",
items: {
type: "object",
properties: {
rating: { type: "number" },
comment: { type: "string" },
author: { type: "string" },
},
},
},
},
required: ["productName", "price"],
},
systemPrompt: "Extract product information from this e-commerce page.",
prompt: "Focus on getting accurate pricing and availability information.",
},
});
if (result.success && result.data.extract) {
const productData = JSON.parse(result.data.extract);
console.log("Product:", productData.productName);
console.log("Price:", productData.price);
console.log("In Stock:", productData.inStock);
}
Error Handling
All methods return a response object with a success field:
const result = await client.scrape("https://example.com");
if (result.success) {
// Handle successful response
console.log(result.data);
} else {
// Handle error
console.error("Error:", result.error);
console.error("Details:", result.details);
}
Type Definitions
The SDK includes comprehensive TypeScript types:
import {
ScrapeResponse,
CrawlResponse,
CrawlStatusResponse,
Document,
ExtractOptions,
ScrapeOptions,
CrawlerOptions,
} from "@llmcrawl/llmcrawl-js";
Examples
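Typed Scraping Helper
A sketch of how the exported types can be used in your own wrappers; the helper function and its defaults are illustrative, not part of the SDK:
import { LLMCrawl, ScrapeOptions, ScrapeResponse } from "@llmcrawl/llmcrawl-js";
// Illustrative helper (not part of the SDK): scrape a page and return its markdown, if any
async function scrapeMarkdown(
  client: LLMCrawl,
  url: string,
  options?: ScrapeOptions
): Promise<string | undefined> {
  const result: ScrapeResponse = await client.scrape(url, {
    formats: ["markdown"],
    ...options,
  });
  return result.success ? result.data?.markdown : undefined;
}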
E-commerce Product Scraping
const product = await client.scrape("https://store.example.com/product/123", {
formats: ["markdown"],
extract: {
schema: {
type: "object",
properties: {
name: { type: "string" },
price: { type: "number" },
rating: { type: "number" },
availability: { type: "string" },
},
},
},
});
News Article Extraction
const article = await client.scrape("https://news.example.com/article/123", {
formats: ["markdown", "html"],
extract: {
schema: {
type: "object",
properties: {
headline: { type: "string" },
author: { type: "string" },
publishDate: { type: "string" },
content: { type: "string" },
tags: { type: "array", items: { type: "string" } },
},
},
},
});
Documentation Crawling
// Start crawling documentation
const crawl = await client.crawl("https://docs.example.com", {
limit: 500,
includePaths: ["/docs/*", "/api/*"],
excludePaths: ["/docs/internal/*"],
scrapeOptions: {
formats: ["markdown"],
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
section: { type: "string" },
content: { type: "string" },
},
},
},
},
});
// Poll for completion
if (crawl.success) {
let status = await client.getCrawlStatus(crawl.id);
while (status.success && status.status === "scraping") {
await new Promise((resolve) => setTimeout(resolve, 5000));
status = await client.getCrawlStatus(crawl.id);
}
if (status.success && status.status === "completed") {
console.log("Crawl completed!");
console.log("Pages scraped:", status.data.length);
}
}
Support
- 📧 Email: [email protected]
- 📚 Documentation: https://docs.llmcrawl.dev
- 🐛 Issues: https://github.com/LLMCrawl/llmcrawl-js/issues
License
MIT License - see LICENSE file for details.
