@webclaw/sdk
v0.3.0
TypeScript SDK for the Webclaw web extraction API
Installation
npm install @webclaw/sdk
pnpm add @webclaw/sdk
yarn add @webclaw/sdk
bun add @webclaw/sdk
Quick Start
import { Webclaw } from "@webclaw/sdk";
const client = new Webclaw({ apiKey: "wc-YOUR_API_KEY" });
const result = await client.scrape({ url: "https://example.com", formats: ["markdown"] });
console.log(result.markdown);
Endpoints
Scrape
Extract content from a single URL. Supports multiple output formats, CSS selectors for targeting specific elements, and cache control.
const result = await client.scrape({
url: "https://example.com",
formats: ["markdown", "text", "llm", "json"],
include_selectors: ["article", ".content"],
exclude_selectors: ["nav", "footer"],
only_main_content: true,
no_cache: true,
});
result.url // string
result.markdown // string | undefined
result.text // string | undefined
result.llm // string | undefined
result.json // unknown | undefined
result.metadata // { title?, description?, language?, ... }
result.cache // { status: "hit" | "miss" | "bypass" }
result.warning // string | undefined
Vertical extractors
28 site-specific extractors that return typed JSON (GitHub, Reddit, Amazon, YouTube, PyPI, HuggingFace, Trustpilot, etc.) instead of generic markdown. See the catalog for the full list.
// Discover available extractors
const catalog = await client.listExtractors();
catalog.extractors.forEach((e) => console.log(e.name, "-", e.label));
// Run a specific extractor
const pr = await client.scrapeVertical(
"github_pr",
"https://github.com/rust-lang/rust/pull/123456",
);
console.log(pr.data); // { title, state, author, commits, reviews, ... }
// Amazon product as typed JSON
const product = await client.scrapeVertical(
"amazon_product",
"https://www.amazon.com/dp/B0C6KKQ7ND",
);
console.log(product.data.price, product.data.rating);
The data field is extractor-specific; call listExtractors() to discover what each returns.
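Because the data field varies by extractor, it can help to narrow it at runtime before reading fields from it. The following is a minimal sketch, not part of the SDK: ProductLike, isProductLike, and describeProduct are hypothetical names, and the assumed shape (numeric price and rating) is inferred only from the example above and may not match the real payload.

```typescript
// Hypothetical shape for illustration; the real amazon_product payload may differ.
interface ProductLike {
  price: number;
  rating: number;
}

// Type guard: true only when `data` is an object with numeric price and rating.
function isProductLike(data: unknown): data is ProductLike {
  if (typeof data !== "object" || data === null) return false;
  const record = data as Record<string, unknown>;
  return typeof record.price === "number" && typeof record.rating === "number";
}

// Returns a display string, or a fallback when the payload has another shape.
function describeProduct(data: unknown): string {
  return isProductLike(data)
    ? `price=${data.price} rating=${data.rating}`
    : "unrecognized payload";
}
```

Passing product.data through describeProduct keeps an unexpected payload from surfacing later as an undefined field access.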
Search
Web search with optional parallel scraping of each result page.
const result = await client.search({
query: "web scraping tools 2026",
num_results: 10,
scrape: true,
formats: ["markdown"],
country: "us",
lang: "en",
topic: "technology",
});
for (const r of result.results) {
console.log(r.title, r.url, r.snippet);
console.log(r.markdown); // present when scrape: true
}
Map
Discover URLs from a site's sitemap.
const result = await client.map({ url: "https://example.com" });
console.log(`Found ${result.count} URLs`);
result.urls.forEach((url) => console.log(url));
Endpoints
Discover API endpoints embedded in a page's JavaScript. The call scans inline <script> bodies plus <script src> bundles for request paths, absolute URLs, GraphQL, and WebSocket endpoints, surfacing the request layer that map (sitemap-based) can't see.
const result = await client.endpoints({
url: "https://example.com",
include_third_party: false, // default; set true to include other hosts
max_bundles: 20, // default & max; bundles fetched on top of inline JS
});
console.log(`${result.endpoint_count} endpoints across ${result.bundles_scanned} bundles`);
for (const ep of result.endpoints) {
console.log(ep.kind, ep.value, ep.first_party ? "(1st-party)" : "(3rd-party)");
}
result.hosts // distinct hosts seen, e.g. ["api.example.com"]
result.truncated // true if results were capped by max_bundles
Security: endpoints, hosts, and their fields are extracted from page content (inline scripts and fetched bundles), which is attacker-influenced. The SDK does not sanitize them. Never feed a returned value or source field into another request, shell command, eval, or SQL query without your own validation.
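One way to apply that validation before issuing a follow-up request is a host allowlist. This is a minimal sketch under stated assumptions: toTrustedUrl is a hypothetical helper, not an SDK function, and it only accepts absolute http(s) URLs. Relative request paths would need to be resolved against a base you trust first.

```typescript
// Returns a parsed URL only if `value` is an absolute http(s) URL on an allowed host.
// Relative paths and other schemes (javascript:, file:, ws:, ...) are rejected.
// Hosts in `allowedHosts` should be lowercase, since URL normalizes hostnames.
function toTrustedUrl(value: string, allowedHosts: ReadonlySet<string>): URL | null {
  let parsed: URL;
  try {
    parsed = new URL(value);
  } catch {
    return null; // not an absolute URL
  }
  if (parsed.protocol !== "https:" && parsed.protocol !== "http:") return null;
  if (!allowedHosts.has(parsed.hostname)) return null;
  return parsed;
}
```

Matching on the parsed hostname rather than on a string prefix is deliberate: a value such as https://api.example.com.evil.test/ or https://api.example.com@evil.test/ starts with a trusted-looking string but resolves to a different host.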
Batch
Scrape multiple URLs in parallel with configurable concurrency.
const result = await client.batch({
urls: ["https://a.com", "https://b.com", "https://c.com"],
formats: ["markdown"],
concurrency: 5,
});
for (const item of result.results) {
if ("error" in item) console.error(item.url, item.error);
else console.log(item.url, item.markdown?.length);
}
Extract
LLM-powered structured data extraction. Provide a JSON schema for typed output, or a natural-language prompt for flexible extraction.
// Schema-based extraction
const result = await client.extract({
url: "https://example.com/pricing",
schema: {
type: "object",
properties: {
plans: { type: "array", items: { type: "object" } },
},
},
});
console.log(result.data);
// Prompt-based extraction
const result2 = await client.extract({
url: "https://example.com",
prompt: "Extract all pricing tiers with names and prices",
});
console.log(result2.data);
Summarize
Generate a concise summary of a page's content.
const result = await client.summarize({
url: "https://example.com/blog/long-article",
max_sentences: 3,
});
console.log(result.summary);
Diff
Detect content changes on a page. Optionally provide a previous state to diff against.
const result = await client.diff({
url: "https://example.com",
previous: { title: "Old Title", body: "Old content..." },
});
console.log(result.changes);
Brand
Extract brand identity information (name, colors, fonts, logos) from a URL.
const result = await client.brand({ url: "https://example.com" });
console.log(result); // { name, colors, fonts, logos, ... }
Research
Start an async deep research job. The SDK automatically polls until the job completes.
const result = await client.research(
{
query: "How do modern web crawlers handle JavaScript rendering?",
max_sources: 15,
deep: true,
},
{ interval: 3_000, maxWait: 600_000 },
);
console.log(result.report);
console.log("Sources:", result.sources?.length);
console.log("Findings:", result.findings?.length);
You can also poll manually using getResearchStatus:
const job = await client.research({ query: "AI trends 2026" });
// ... or check status independently:
const status = await client.getResearchStatus(job.id);
Crawl
Start an async crawl job that discovers and scrapes pages from a root URL.
const job = await client.crawl({
url: "https://example.com",
max_depth: 3,
max_pages: 100,
use_sitemap: true,
});
console.log("Job ID:", job.id);
Poll with waitForCompletion, which resolves when the crawl finishes or fails:
const result = await job.waitForCompletion({
interval: 2_000, // polling interval in ms
maxWait: 300_000, // max wait time in ms (5 min)
});
console.log(`Status: ${result.status}`);
console.log(`${result.completed}/${result.total} pages`);
for (const page of result.pages) {
console.log(page.url, page.markdown?.length);
}
Or check status manually at any time:
const status = await job.getStatus();
// or: const status = await client.getCrawlStatus(job.id);
Watch
Monitor URLs for content changes. Create watchers, check them on demand, and receive webhook notifications when content changes.
Create a watch
const watch = await client.watchCreate({
url: "https://example.com/pricing",
name: "Pricing page",
interval_minutes: 60,
webhook_url: "https://your-server.com/webhooks/webclaw",
});
console.log("Watch ID:", watch.id);
List all watches
const watches = await client.watchList(10, 0); // limit, offset
for (const w of watches) {
console.log(w.id, w.url, w.active);
}
Get a single watch
const watch = await client.watchGet("watch_abc123");
console.log(watch.last_checked_at, watch.last_changed_at);
Trigger an immediate check
const updated = await client.watchCheck("watch_abc123");
console.log(updated.last_checked_at);
Delete a watch
await client.watchDelete("watch_abc123");
Firecrawl v2 compatibility
The API also exposes a Firecrawl-compatible surface at /v2/scrape, /v2/crawl, and /v2/search. This SDK does not wrap these endpoints yet (future work); call them directly if you need Firecrawl drop-in compatibility today.
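Calling one of these endpoints directly could look roughly like the sketch below. Only the /v2/scrape path and the default base URL come from this README. The Bearer authorization header and the { url, formats } request body are assumptions modelled on Firecrawl's public API and are not confirmed here, and buildV2ScrapeRequest is a hypothetical helper, so check the API documentation before relying on any of it.

```typescript
interface PreparedRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// ASSUMPTIONS (not confirmed by this README): Bearer-token auth and a
// Firecrawl-style JSON body of { url, formats }.
function buildV2ScrapeRequest(
  apiKey: string,
  targetUrl: string,
  baseUrl = "https://api.webclaw.io",
): PreparedRequest {
  return {
    // Resolve the documented path against the base URL.
    url: new URL("/v2/scrape", baseUrl).toString(),
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url: targetUrl, formats: ["markdown"] }),
    },
  };
}
```

The helper only prepares the request, so it can be inspected or logged before anything is sent. Passing its url and init fields to the native fetch would issue the call.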
Error Handling
All errors extend WebclawError, so you can catch broadly or handle specific cases.
import {
WebclawError,
AuthenticationError,
NotFoundError,
RateLimitError,
TimeoutError,
} from "@webclaw/sdk";
try {
await client.scrape({ url: "https://example.com" });
} catch (err) {
if (err instanceof RateLimitError) {
console.error("Rate limited, retry after:", err.retryAfter, "s");
} else if (err instanceof AuthenticationError) {
console.error("Bad API key");
} else if (err instanceof NotFoundError) {
console.error("Resource not found");
} else if (err instanceof TimeoutError) {
console.error("Request timed out");
} else if (err instanceof WebclawError) {
console.error("API error:", err.message, err.status, err.body);
}
}
Configuration
const client = new Webclaw({
apiKey: process.env.WEBCLAW_API_KEY!,
baseUrl: "https://api.webclaw.io", // default
timeout: 60_000, // ms, default 30_000
});
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| apiKey | string | required | Your Webclaw API key |
| baseUrl | string | https://api.webclaw.io | API base URL |
| timeout | number | 30000 | Request timeout in milliseconds |
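A request timeout is often paired with retries. The sketch below is one possible wrapper, not an SDK feature: retryDelayMs and withRetry are hypothetical helpers, the backoff constants are arbitrary, and the only thing taken from this README is that a rate-limit error carries a retryAfter value in seconds.

```typescript
// Computes how long to wait before the next try, in ms. `attempt` is 1-based.
// Prefers a `retryAfter` value (seconds) when the error carries one,
// otherwise falls back to capped exponential backoff.
function retryDelayMs(err: unknown, attempt: number, baseMs = 500, capMs = 30_000): number {
  const retryAfter = (err as { retryAfter?: unknown } | null)?.retryAfter;
  if (typeof retryAfter === "number" && retryAfter >= 0) {
    return Math.min(retryAfter * 1000, capMs);
  }
  return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}

// Calls `fn` up to `maxAttempts` times in total, waiting between tries.
// Errors for which `shouldRetry` returns false are rethrown immediately.
async function withRetry<T>(
  fn: () => Promise<T>,
  shouldRetry: (err: unknown) => boolean,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts || !shouldRetry(err)) throw err;
      await new Promise((resolve) => setTimeout(resolve, retryDelayMs(err, attempt)));
    }
  }
}
```

A typical shouldRetry predicate would return true for RateLimitError and TimeoutError instances and false for everything else, so authentication and not-found errors fail fast.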
TypeScript
Full type definitions are included for every request and response. All types are exported from the package root:
import type {
ScrapeRequest,
ScrapeResponse,
CrawlRequest,
CrawlStatusResponse,
EndpointsRequest,
EndpointsResponse,
SearchRequest,
SearchResponse,
ExtractRequest,
ExtractResponse,
ResearchRequest,
ResearchResponse,
WatchCreateRequest,
WatchResponse,
// ... and more
} from "@webclaw/sdk";
Highlights
- Zero runtime dependencies. Uses native fetch.
- ESM + CJS dual output via tsup.
- Full TypeScript types for every request and response.
- Automatic polling for async jobs (crawl, research).
- Node.js 18+.
License
MIT
