Scraper SDK (TypeScript)
A powerful SDK for web scraping with integrated LangExtract support for structured data extraction. Features pre-built workflows for common domains and BYO (Bring Your Own) schema capabilities.
Install
# npm
npm install @intrface/scraper-sdk @intrface/scraper-contracts
# pnpm
pnpm add @intrface/scraper-sdk @intrface/scraper-contracts
# yarn
yarn add @intrface/scraper-sdk @intrface/scraper-contracts
Quick Start
Server-to-server usage (recommended). Never expose API keys to the browser.
import { ScraperSDK } from "@intrface/scraper-sdk";
const sdk = new ScraperSDK({
baseURL: process.env.SCRAPER_API_BASE_URL || "https://scrape-production-xxxx.up.railway.app/api/v1",
apiKey: process.env.SCRAPER_API_KEY, // sent as X-API-Key
// Optional: if your backend issues JWTs
// token: process.env.API_TOKEN,
});
Frontend apps should call your own backend (Next.js API routes, Convex actions), which then calls the scraper service using the SDK. Do not put secrets in NEXT_PUBLIC_* environment variables.
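For example, a minimal sketch of such a proxy as a Next.js App Router route handler; the route path, request body shape, and the job.id field are illustrative assumptions, not part of the SDK:
// app/api/scrape/route.ts
import { NextResponse } from "next/server";
import { ScraperSDK } from "@intrface/scraper-sdk";
// Instantiated server-side only; these env vars are never exposed to the browser.
const sdk = new ScraperSDK({
  baseURL: process.env.SCRAPER_API_BASE_URL!,
  apiKey: process.env.SCRAPER_API_KEY!,
});
export async function POST(request: Request) {
  // The request body shape here is your own API contract, not the SDK's.
  const { startUrl } = await request.json();
  const job = await sdk.createJob({
    start_url: startUrl,
    extraction_workflow: "tourism",
    max_depth: 2,
  });
  // job.id is assumed; adjust to the fields your backend actually returns.
  return NextResponse.json({ jobId: job.id });
}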
import { ScraperSDK } from "@intrface/scraper-sdk";
// Local development example. The same rule applies: keep API keys server-side.
const sdk = new ScraperSDK({
baseURL: process.env.SCRAPER_API_BASE_URL || "http://localhost:8000/api/v1",
apiKey: process.env.SCRAPER_API_KEY, // sent as X-API-Key
});
Usage Examples
FS/DS/UP (Phase 2) quick examples: full-site discovery (FS), direct scraping (DS), and document processing (UP)
Per-request features and events (superuser-only overrides)
- features: request-scoped toggle to enable a gated flow for a single job (requires superuser on backend)
- events: explicit event subscription; the backend clamps the list to the events allowed by policy
import { ScraperSDK } from "@intrface/scraper-sdk";
const sdk = new ScraperSDK({ baseURL: process.env.SCRAPER_API_BASE_URL!, apiKey: process.env.SCRAPER_API_KEY! });
// FS: discover full site map and docs
await sdk.discover({
startUrl: "https://example.com",
maxDepth: 2,
maxPages: 50,
webhookUrl: `${process.env.CONVEX_SITE_URL}/scraper/webhook/`,
extractionWorkflow: "custom", // optional SF
schemaInline: { type: "object", properties: { title: { type: "string" } } },
// Request-scoped overrides (backend superuser-only)
features: { fs: true, perItemWebhooks: true },
events: ["JOB_COMPLETED", "SITEMAP_READY", "PAGE_PROCESSED"],
});
// DS: targeted scrape of URLs (with optional SF)
await sdk.scrape({
urls: ["https://example.com/a", "https://example.com/b"],
webhookUrl: `${process.env.CONVEX_SITE_URL}/scraper/webhook/`,
extractionWorkflow: "default",
features: { ds: true },
events: ["JOB_COMPLETED", "PAGE_PROCESSED"],
});
// UP: process documents by URL list
await sdk.processDocs({
urls: ["https://example.com/report.pdf"],
webhookUrl: `${process.env.CONVEX_SITE_URL}/scraper/webhook/`,
features: { up: true },
events: ["JOB_COMPLETED", "DOC_PROCESSED"],
});
1. Pre-built Workflows
Use optimized extraction workflows for common domains:
// Tourism workflow - extract travel information
const tourismJob = await sdk.createJob({
start_url: "https://travel-site.com",
extraction_workflow: "tourism",
max_depth: 2
});
// Government workflow - extract public sector data
const govJob = await sdk.createJob({
start_url: "https://city.gov",
extraction_workflow: "government",
max_depth: 1
});
// Funding workflow - extract grant opportunities
const fundingJob = await sdk.createJob({
start_url: "https://grants.org",
extraction_workflow: "funding",
max_depth: 2
});
2. Custom Schema (BYO Schema)
Define your own extraction schema:
// Inline schema
const customJob = await sdk.createCustomJob({
start_url: "https://example.org",
max_depth: 2,
schema_inline: {
type: "object",
properties: {
title: { type: "string" },
price: { type: "number" },
availability: { type: "boolean" },
features: {
type: "array",
items: { type: "string" }
}
},
required: ["title", "price"]
}
});
// Schema from URL
const urlJob = await sdk.createCustomJob({
start_url: "https://example.org",
schema_url: "https://cdn.example.com/schemas/product.v1.json"
});
// Schema from registry (if configured)
const registryJob = await sdk.createCustomJob({
start_url: "https://example.org",
schema_id: "product.v1"
});
3. Accessing Results
// Poll for completion
const job = await sdk.getJob(jobId);
if (job.status === "completed") {
// Access structured data and extraction spans
job.pages?.forEach(page => {
console.log("URL:", page.url);
console.log("Structured Data:", page.structured);
console.log("Extraction Spans:", page.extraction_spans);
console.log("Workflow Version:", page.workflow_version);
});
}
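If you are not using webhooks, a small polling loop built on getJob works too; a minimal sketch, assuming the status field settles to "completed" or "failed" (the "failed" value and the interval are assumptions):
import { ScraperSDK } from "@intrface/scraper-sdk";
// Poll getJob until the job reaches a terminal state.
async function waitForJob(sdk: ScraperSDK, jobId: string, intervalMs = 5000) {
  for (;;) {
    const job = await sdk.getJob(jobId);
    // "failed" is an assumed terminal status; adjust to your backend's values.
    if (job.status === "completed" || job.status === "failed") return job;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}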
4. Advanced Extraction Options
Fine-tune extraction with overrides:
const advancedJob = await sdk.createJob({
start_url: "https://example.org",
extraction_workflow: "tourism",
extraction_overrides: {
model_id: "gpt-4-turbo", // Use a different model
passes: 3, // More extraction passes
prompt: "Focus on luxury hotels", // Custom prompt
examples: [ // Provide examples
{
input: "5-star Ritz Carlton...",
output: { name: "Ritz Carlton", rating: 5 }
}
]
}
});
5. Raw Content (No Extraction)
// Get raw HTML/text without extraction
const rawJob = await sdk.createJob({
start_url: "https://example.org",
max_depth: 1
// No extraction_workflow specified = raw content only
});
Authentication
The backend uses simple API key authentication via the X-API-Key header.
Webhook authentication (outbound, from scraper to your app)
- Signed with HMAC-SHA256. For Convex endpoints (/scraper/webhook/), headers include:
- X-Timestamp (milliseconds since epoch)
- X-Signature = HMAC_SHA256(`${ts}.${rawBody}`) using your shared secret
- Always use HTTPS; verify the signature and enforce a 5-minute clock-skew window.
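A minimal verification sketch for a Node-style handler, assuming the shared secret lives in SCRAPER_WEBHOOK_SECRET (variable name hypothetical), the signature is hex-encoded, and you have access to the exact raw request body:
import { createHmac, timingSafeEqual } from "node:crypto";
// Verify an incoming scraper webhook. rawBody must be the exact bytes that were signed.
export function verifyWebhook(rawBody: string, timestamp: string, signature: string): boolean {
  // Reject requests outside the 5-minute clock-skew window (X-Timestamp is in milliseconds).
  const ageMs = Math.abs(Date.now() - Number(timestamp));
  if (!Number.isFinite(ageMs) || ageMs > 5 * 60 * 1000) return false;
  // Recompute HMAC_SHA256(`${ts}.${rawBody}`) with the shared secret.
  const secret = process.env.SCRAPER_WEBHOOK_SECRET!; // env var name is an assumption
  const expected = createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest("hex");
  // Constant-time comparison to avoid timing leaks.
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  return a.length === b.length && timingSafeEqual(a, b);
}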
// Server-side (recommended): do not expose API keys in the browser
const sdk = new ScraperSDK({
baseURL: process.env.SCRAPER_API_BASE_URL!,
apiKey: process.env.SCRAPER_API_KEY!, // maps to X-API-Key
});
// Optional: if your backend also supports JWT, you can pass a token
const sdkWithJwt = new ScraperSDK({
baseURL: process.env.SCRAPER_API_BASE_URL!,
token: process.env.API_TOKEN!, // sent as Authorization: Bearer <token>
});
Security notes:
- Never put API keys in client-side code or NEXT_PUBLIC_* env vars.
- Use server-to-server calls (Convex/Next.js API routes/Edge functions) to talk to the scraper service.
Error Handling
try {
const job = await sdk.createJob({ start_url: "..." });
} catch (error) {
if (error instanceof SDKError) {
console.error(`API Error ${error.status}: ${error.message}`);
// Handle specific status codes
if (error.status === 429) {
// Rate limited - retry later
}
}
}
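For 429s specifically, a backoff-and-retry wrapper is one way to recover; a minimal sketch, assuming SDKError is exported from the package and using an arbitrary exponential backoff schedule:
import { SDKError } from "@intrface/scraper-sdk"; // export path is an assumption
// Retry a call with exponential backoff when the API answers 429 (rate limited).
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const rateLimited = error instanceof SDKError && error.status === 429;
      if (!rateLimited || attempt >= maxAttempts) throw error;
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000)); // 2s, 4s, 8s, ...
    }
  }
}
// Usage: const job = await withRetry(() => sdk.createJob({ start_url: "https://example.org" }));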
Environment Variables
# .env.local (server-side only; never expose these to the browser)
SCRAPER_API_BASE_URL=http://localhost:8000/api/v1
SCRAPER_API_KEY=your-api-key
# Optional: only if your backend issues JWTs
API_TOKEN=your-secret-token
TypeScript Types
The SDK is fully typed. Import types as needed:
import type {
ScraperJobsCreate,
JobsResponseV1,
JobsPageItemV1
} from "@intrface/scraper-sdk";
// Type-safe job creation
const jobConfig: ScraperJobsCreate = {
start_url: "https://example.org",
extraction_workflow: "tourism",
max_depth: 2
};
Workflow Reference
Available Pre-built Workflows
| Workflow | Description | Best For |
|----------|-------------|----------|
| tourism | Travel destinations, hotels, attractions | Travel sites, booking platforms |
| government | Officials, services, policies, meetings | Government websites, civic data |
| funding | Grants, deadlines, eligibility | Grant databases, funding portals |
BYO Schema Guidelines
- Use JSON Schema Draft 7 format
- Keep schemas focused - extract only what you need
- Provide examples in extraction_overrides for better accuracy
- Host large schemas externally (schema_url) rather than inline
- Version your schemas for reproducibility (see the sketch after this list)
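For instance, a small versioned Draft 7 schema you might host and reference via schema_url (shown as a TypeScript object for readability; the $id and fields are illustrative):
// Contents of product.v1.json; host the JSON document at your schema_url.
const productSchemaV1 = {
  $schema: "http://json-schema.org/draft-07/schema#",
  $id: "https://cdn.example.com/schemas/product.v1.json",
  title: "Product v1",
  type: "object",
  properties: {
    title: { type: "string" },
    price: { type: "number" },
  },
  required: ["title", "price"],
};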
API Methods
| Method | Description |
|--------|-------------|
| createJob(config) | Create a scraping job with optional extraction |
| createCustomJob(config) | Create a job with BYO schema (sets extraction_workflow: "custom") |
| getJob(jobId) | Get job status and results |
| listJobs(params?) | List jobs with pagination |
| cancelJob(jobId) | Cancel a running job |
Note: listPages and getPageStructured are experimental and disabled by default. Enable with experimentalPagesAPI: true only if your backend exposes the pages endpoints.
Building from Source
# Clone the repository
git clone https://github.com/basicalex/scrape.git
cd scrape/packages/scraper-sdk
# Install dependencies
npm install
# Build
npm run build
# Run tests
npm test
Contributing
Contributions welcome! Please read our Contributing Guide in the repo root.
License
MIT (see package.json license).
