# @anycrawl/js-sdk

A lightweight ESM JavaScript/TypeScript client for the AnyCrawl API.
## Requirements
- Node.js 18+ (Node 20 recommended; matches CI)
- ESM (the package ships as an ES module)
## Install

```bash
pnpm add @anycrawl/js-sdk
```

## Quickstart

```ts
import { AnyCrawlClient } from "@anycrawl/js-sdk";
// Prefer loading from env in your app framework
const client = new AnyCrawlClient(process.env.ANYCRAWL_API_KEY || "<YOUR_API_KEY>");
// Base URL defaults to https://api.anycrawl.dev, but you can override:
// const client = new AnyCrawlClient(process.env.ANYCRAWL_API_KEY!, "https://api.anycrawl.dev");
// Health
await client.healthCheck(); // -> { status: "ok" }
// Scrape
const scrape = await client.scrape({
    url: "https://example.com",
    engine: "cheerio",
    formats: ["markdown"],
});
// Crawl (async job)
const job = await client.createCrawl({
    url: "https://anycrawl.dev",
    engine: "cheerio",
    max_depth: 3,
    strategy: "same-domain",
    limit: 50,
    scrape_options: { formats: ["markdown"] },
});
const status = await client.getCrawlStatus(job.job_id);
const page1 = await client.getCrawlResults(job.job_id, 0);
// Cancel if needed
// await client.cancelCrawl(job.job_id);
// Blocking helper (waits until the crawl finishes and aggregates all pages)
const aggregated = await client.crawl(
    {
        url: "https://anycrawl.dev",
        engine: "cheerio",
        max_depth: 3,
        strategy: "same-domain",
        limit: 50,
    },
    2, // poll interval (seconds)
    10 * 60_000 // optional timeout (ms)
);
// aggregated.data contains all results
// Search (optionally enrich with scraping)
const results = await client.search({
    query: "OpenAI ChatGPT",
    engine: "google",
    pages: 1,
    limit: 10,
    scrape_options: { engine: "cheerio", formats: ["markdown"] },
});
```

## Usage details
### Client

```ts
new AnyCrawlClient(apiKey: string, baseUrl = "https://api.anycrawl.dev", onAuthFailure?: () => void)
```

- `apiKey`: Bearer token for API calls.
- `baseUrl`: Override if self-hosting.
- `onAuthFailure`: Invoked on 401/403. Useful to trigger sign-out or refresh.

You can also set `LOG_LEVEL` to control internal logs (`debug`, `info`, `warn`, `error`).
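For instance, a client wired with all three constructor arguments (the callback body here is purely illustrative):

```ts
import { AnyCrawlClient } from "@anycrawl/js-sdk";

// Optional: raise internal SDK log verbosity before constructing the client.
process.env.LOG_LEVEL = "debug";

const client = new AnyCrawlClient(
    process.env.ANYCRAWL_API_KEY!,
    "https://api.anycrawl.dev", // the default; override when self-hosting
    () => {
        // Invoked on 401/403, e.g. sign the user out or refresh the key
        console.warn("AnyCrawl authentication failed");
    }
);
```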
### scrape(input)

```ts
import type { Engine, ScrapeFormat } from "@anycrawl/js-sdk";
const engine: Engine = "cheerio";
const formats: ScrapeFormat[] = ["markdown", "html"];
await client.scrape({
url: "https://example.com",
engine,
// Optional:
proxy: "http://user:pass@host:port",
formats,
timeout: 60_000,
retry: true,
wait_for: 3000,
include_tags: ["article", "main"],
exclude_tags: ["nav", "footer"],
json_options: { user_prompt: "Extract title", schema: { type: "object" } },
extract_source: "markdown", // or "html"
});Returns either a success object with content or a failure with error.
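A sketch of handling both branches; the field names assumed below are not confirmed by this README, so check the exported `ScrapeResult` type for the authoritative shape:

```ts
const result = await client.scrape({
    url: "https://example.com",
    engine: "cheerio",
    formats: ["markdown"],
});

// Assumed shape: failures carry an `error` string, successes carry content.
if ("error" in result && result.error) {
    console.error("Scrape failed:", result.error);
} else {
    console.log("Scrape succeeded");
}
```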
### createCrawl(input), getCrawlStatus(jobId), getCrawlResults(jobId, skip?), cancelCrawl(jobId)

```ts
const job = await client.createCrawl({
    url: "https://site.com/docs",
    engine: "playwright",
    max_depth: 5,
    strategy: "same-domain",
    limit: 100,
    scrape_options: { formats: ["markdown"] },
    include_paths: ["/docs/*"],
    exclude_paths: ["/admin/*"],
});

const status = await client.getCrawlStatus(job.job_id);
const page = await client.getCrawlResults(job.job_id, 0);
// await client.cancelCrawl(job.job_id);
```

### crawl(input, pollIntervalSeconds = 2, timeoutMs?)
Convenience wrapper that creates a crawl, polls status until it completes, and returns aggregated results of all pages. Throws on failed/cancelled jobs or when the timeout is reached.
```ts
// Type signature
async function crawl(
    input: CrawlRequest,
    pollIntervalSeconds?: number, // default 2s
    timeoutMs?: number // optional
): Promise<CrawlAndWaitResult>;

// Example
try {
    const aggregated = await client.crawl(
        {
            url: "https://anycrawl.dev",
            engine: "cheerio",
            max_depth: 3,
            strategy: "same-domain",
            limit: 50,
            // Scrape options are nested under scrape_options
            scrape_options: { formats: ["markdown"] },
        },
        3, // poll every 3s
        5 * 60_000 // timeout after 5 minutes
    );
    console.log(aggregated.total, aggregated.completed);
    console.log(aggregated.data.length, "pages aggregated");
} catch (err) {
    // Handles API/network/auth errors, job failed/cancelled, or timeout
    console.error(err);
}
```

- Returns: `CrawlAndWaitResult` with `job_id`, `status`, `total`, `completed`, `creditsUsed`, and aggregated `data`.
- Polling states: waits until `completed`; throws on `failed` or `cancelled`.
- Use `getCrawlStatus`/`getCrawlResults` if you prefer manual pagination/progress, as in the sketch below.
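A manual loop equivalent in spirit to `crawl()`. The `status`/`total`/`completed` fields on the status response and the `data` array on the results response are assumptions borrowed from `CrawlAndWaitResult`; verify them against `CrawlStatusResponse` and `CrawlResultsResponse`:

```ts
const job = await client.createCrawl({
    url: "https://anycrawl.dev",
    engine: "cheerio",
    max_depth: 2,
    strategy: "same-domain",
    limit: 20,
    scrape_options: { formats: ["markdown"] },
});

// Poll until the job leaves the in-progress states.
let status = await client.getCrawlStatus(job.job_id);
while (status.status !== "completed") {
    if (status.status === "failed" || status.status === "cancelled") {
        throw new Error(`Crawl ended as ${status.status}`);
    }
    console.log(`progress: ${status.completed}/${status.total}`);
    await new Promise((resolve) => setTimeout(resolve, 2_000));
    status = await client.getCrawlStatus(job.job_id);
}

// Page through results using the skip parameter.
const pages: unknown[] = [];
let skip = 0;
for (;;) {
    const batch = await client.getCrawlResults(job.job_id, skip);
    if (!batch.data?.length) break;
    pages.push(...batch.data);
    skip += batch.data.length;
}
console.log(pages.length, "results collected");
```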
### search(input)

```ts
await client.search({
    query: "best js tutorials",
    engine: "google",
    limit: 20,
    offset: 0,
    pages: 2,
    lang: "en",
    country: "US",
    scrape_options: { engine: "cheerio", formats: ["markdown"] },
    safeSearch: 1,
});
```

Notes:

- `scrape` has no `scrape_options`.
- `crawl` only accepts nested `scrape_options`; top-level only allows crawl strategy fields and optional `retry`.
- `search` supports optional `scrape_options`; when provided, it must include its `engine`.
## Error handling

All methods throw standard `Error` with readable messages. Examples:

- Authentication errors: `Authentication failed: <message>` (401/403); triggers `onAuthFailure` if provided
- Payment required: `Payment required: <message>. current_credits=<n>` (402)
- API errors: `API Error <status>: <message>`
- Network issues: `Network error: Unable to reach AnyCrawl API`
- Other request errors: `Request error: <message>`
Wrap calls in try/catch to handle errors in your app.
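Since all failures surface as plain `Error` instances, one approach (a sketch, not an SDK-provided helper) is to branch on the message prefixes listed above:

```ts
try {
    await client.scrape({ url: "https://example.com", engine: "cheerio" });
} catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    if (message.startsWith("Authentication failed")) {
        // 401/403: rotate the key or re-authenticate
    } else if (message.startsWith("Payment required")) {
        // 402: top up credits
    } else if (message.startsWith("Network error")) {
        // transient: retry with backoff
    } else {
        console.error(message);
    }
}
```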
Notes:

- Some endpoints return HTTP 2xx with `{ success: false, error?: string }`. The SDK converts these into `Request error: <message>` (e.g., `Request error: Scraping failed`).
- 401 and 403 are both treated as authentication failures and will invoke `onAuthFailure` when set.
- For 402, when the response includes `current_credits`, the message is specialized as shown above; otherwise it falls back to `API Error 402: <message>`.
## API surface

- `healthCheck(): Promise<{ status: string }>`
- `setAuthFailureCallback(cb: () => void): void`
- `scrape(input: ScrapeRequest): Promise<ScrapeResult>`
- `createCrawl(input: CrawlRequest): Promise<CrawlJobResponse>`
- `getCrawlStatus(jobId: string): Promise<CrawlStatusResponse>`
- `getCrawlResults(jobId: string, skip?: number): Promise<CrawlResultsResponse>`
- `crawl(input: CrawlRequest, pollIntervalSeconds?: number, timeoutMs?: number): Promise<CrawlAndWaitResult>`
- `cancelCrawl(jobId: string): Promise<{ job_id: string; status: string }>`
- `search(input: SearchRequest): Promise<SearchResult[]>`
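`setAuthFailureCallback` attaches what is presumably the same 401/403 handler as the constructor's `onAuthFailure` argument, but after construction, e.g.:

```ts
client.setAuthFailureCallback(() => {
    // React to 401/403 after the fact, e.g. clear cached credentials
    console.warn("AnyCrawl credentials rejected");
});
```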
Type definitions are exported from `@anycrawl/js-sdk` for TypeScript users.
