Scraper SDK (TypeScript)
A powerful SDK for web scraping with integrated LangExtract support for structured data extraction. Features pre-built workflows for common domains and BYO (Bring Your Own) schema capabilities.
Install
# npm
npm install @intrface/scraper-sdk @intrface/scraper-contracts
# pnpm
pnpm add @intrface/scraper-sdk @intrface/scraper-contracts
# yarn
yarn add @intrface/scraper-sdk @intrface/scraper-contracts
Quick Start
Server-to-server usage (recommended). Never expose API keys to the browser.
import { ScraperSDK } from "@intrface/scraper-sdk";
const sdk = new ScraperSDK({
baseURL: process.env.SCRAPER_API_BASE_URL || "https://scrape-production-xxxx.up.railway.app/api/v1",
apiKey: process.env.SCRAPER_API_KEY, // sent as X-API-Key
// Optional: if your backend issues JWTs
// token: process.env.API_TOKEN,
});
Frontend apps should call your own backend (Next.js API routes, Convex actions), which then calls the scraper service using the SDK. Do not put secrets in NEXT_PUBLIC_* environment variables.
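For example, a minimal sketch of such a proxy as a Next.js App Router route handler; the route path, request body shape, and the job.id field are illustrative assumptions, not part of the SDK:
// app/api/scrape/route.ts
import { NextResponse } from "next/server";
import { ScraperSDK } from "@intrface/scraper-sdk";
// Instantiated server-side only; these env vars are never exposed to the browser.
const sdk = new ScraperSDK({
  baseURL: process.env.SCRAPER_API_BASE_URL!,
  apiKey: process.env.SCRAPER_API_KEY!,
});
export async function POST(request: Request) {
  // The request body shape here is your own API contract, not the SDK's.
  const { startUrl } = await request.json();
  const job = await sdk.createJob({
    start_url: startUrl,
    extraction_workflow: "tourism",
    max_depth: 2,
  });
  // job.id is assumed; adjust to the fields your backend actually returns.
  return NextResponse.json({ jobId: job.id });
}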
import { ScraperSDK } from "@intrface/scraper-sdk";
// Local development example. The same rule applies: keep API keys server-side.
const sdk = new ScraperSDK({
baseURL: process.env.SCRAPER_API_BASE_URL || "http://localhost:8000/api/v1",
apiKey: process.env.SCRAPER_API_KEY, // sent as X-API-Key
});
Usage Examples
FS/DS/UP (Phase 2) quick examples: full-site discovery (FS), direct scraping (DS), and document processing (UP)
Per-request features and events (superuser-only overrides)
- features: request-scoped toggle to enable a gated flow for a single job (requires superuser on backend)
- events: explicit event subscription; the backend clamps the list to the events allowed by policy
import { ScraperSDK } from "@intrface/scraper-sdk";
const sdk = new ScraperSDK({ baseURL: process.env.SCRAPER_API_BASE_URL!, apiKey: process.env.SCRAPER_API_KEY! });
// FS: discover full site map and docs
await sdk.discover({
startUrl: "https://example.com",
maxDepth: 2,
maxPages: 50,
webhookUrl: `${process.env.CONVEX_SITE_URL}/scraper/webhook/`,
extractionWorkflow: "custom", // optional SF
schemaInline: { type: "object", properties: { title: { type: "string" } } },
// Request-scoped overrides (backend superuser-only)
features: { fs: true, perItemWebhooks: true },
events: ["JOB_COMPLETED", "SITEMAP_READY", "PAGE_PROCESSED"],
});
// DS: targeted scrape of URLs (with optional SF)
await sdk.scrape({
urls: ["https://example.com/a", "https://example.com/b"],
webhookUrl: `${process.env.CONVEX_SITE_URL}/scraper/webhook/`,
extractionWorkflow: "default",
features: { ds: true },
events: ["JOB_COMPLETED", "PAGE_PROCESSED"],
});
// UP: process documents by URL list
await sdk.processDocs({
urls: ["https://example.com/report.pdf"],
webhookUrl: `${process.env.CONVEX_SITE_URL}/scraper/webhook/`,
features: { up: true },
events: ["JOB_COMPLETED", "DOC_PROCESSED"],
});
1. Pre-built Workflows
Use optimized extraction workflows for common domains:
// Tourism workflow - extract travel information
const tourismJob = await sdk.createJob({
start_url: "https://travel-site.com",
extraction_workflow: "tourism",
max_depth: 2
});
// Government workflow - extract public sector data
const govJob = await sdk.createJob({
start_url: "https://city.gov",
extraction_workflow: "government",
max_depth: 1
});
// Funding workflow - extract grant opportunities
const fundingJob = await sdk.createJob({
start_url: "https://grants.org",
extraction_workflow: "funding",
max_depth: 2
});
2. Custom Schema (BYO Schema)
Define your own extraction schema:
// Inline schema
const customJob = await sdk.createCustomJob({
start_url: "https://example.org",
max_depth: 2,
schema_inline: {
type: "object",
properties: {
title: { type: "string" },
price: { type: "number" },
availability: { type: "boolean" },
features: {
type: "array",
items: { type: "string" }
}
},
required: ["title", "price"]
}
});
// Schema from URL
const urlJob = await sdk.createCustomJob({
start_url: "https://example.org",
schema_url: "https://cdn.example.com/schemas/product.v1.json"
});
// Schema from registry (if configured)
const registryJob = await sdk.createCustomJob({
start_url: "https://example.org",
schema_id: "product.v1"
});
3. Accessing Results
// Poll for completion
const job = await sdk.getJob(jobId);
if (job.status === "completed") {
// Access structured data and extraction spans
job.pages?.forEach(page => {
console.log("URL:", page.url);
console.log("Structured Data:", page.structured);
console.log("Extraction Spans:", page.extraction_spans);
console.log("Workflow Version:", page.workflow_version);
});
}
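If you are not using webhooks, a small polling loop built on getJob works too; a minimal sketch, assuming the status field settles to "completed" or "failed" (the "failed" value and the interval are assumptions):
import { ScraperSDK } from "@intrface/scraper-sdk";
// Poll getJob until the job reaches a terminal state.
async function waitForJob(sdk: ScraperSDK, jobId: string, intervalMs = 5000) {
  for (;;) {
    const job = await sdk.getJob(jobId);
    // "failed" is an assumed terminal status; adjust to your backend's values.
    if (job.status === "completed" || job.status === "failed") return job;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}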
4. Advanced Extraction Options
Fine-tune extraction with overrides:
const advancedJob = await sdk.createJob({
start_url: "https://example.org",
extraction_workflow: "tourism",
extraction_overrides: {
model_id: "gpt-4-turbo", // Use a different model
passes: 3, // More extraction passes
prompt: "Focus on luxury hotels", // Custom prompt
examples: [ // Provide examples
{
input: "5-star Ritz Carlton...",
output: { name: "Ritz Carlton", rating: 5 }
}
]
}
});
5. Raw Content (No Extraction)
// Get raw HTML/text without extraction
const rawJob = await sdk.createJob({
start_url: "https://example.org",
max_depth: 1
// No extraction_workflow specified = raw content only
});
Authentication
The backend uses simple API key authentication via the X-API-Key header.
Webhook authentication (outbound, from scraper to your app)
- Signed with HMAC-SHA256. For Convex endpoints (/scraper/webhook/), headers include:
- X-Timestamp (milliseconds since epoch)
- X-Signature = HMAC_SHA256(`${ts}.${rawBody}`) using your shared secret
- Always use HTTPS; verify the signature and enforce a 5-minute clock-skew window.
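A minimal verification sketch for a Node-style handler, assuming the shared secret lives in SCRAPER_WEBHOOK_SECRET (variable name hypothetical), the signature is hex-encoded, and you have access to the exact raw request body:
import { createHmac, timingSafeEqual } from "node:crypto";
// Verify an incoming scraper webhook. rawBody must be the exact bytes that were signed.
export function verifyWebhook(rawBody: string, timestamp: string, signature: string): boolean {
  // Reject requests outside the 5-minute clock-skew window (X-Timestamp is in milliseconds).
  const ageMs = Math.abs(Date.now() - Number(timestamp));
  if (!Number.isFinite(ageMs) || ageMs > 5 * 60 * 1000) return false;
  // Recompute HMAC_SHA256(`${ts}.${rawBody}`) with the shared secret.
  const secret = process.env.SCRAPER_WEBHOOK_SECRET!; // env var name is an assumption
  const expected = createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest("hex");
  // Constant-time comparison to avoid timing leaks.
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  return a.length === b.length && timingSafeEqual(a, b);
}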
// Server-side (recommended): do not expose API keys in the browser
const sdk = new ScraperSDK({
baseURL: process.env.SCRAPER_API_BASE_URL!,
apiKey: process.env.SCRAPER_API_KEY!, // maps to X-API-Key
});
// Optional: if your backend also supports JWT, you can pass a token
const sdkWithJwt = new ScraperSDK({
baseURL: process.env.SCRAPER_API_BASE_URL!,
token: process.env.API_TOKEN!, // sent as Authorization: Bearer <token>
});
Security notes:
- Never put API keys in client-side code or NEXT_PUBLIC_* env vars.
- Use server-to-server calls (Convex/Next.js API routes/Edge functions) to talk to the scraper service.
Error Handling
try {
const job = await sdk.createJob({ start_url: "..." });
} catch (error) {
if (error instanceof SDKError) {
console.error(`API Error ${error.status}: ${error.message}`);
// Handle specific status codes
if (error.status === 429) {
// Rate limited - retry later
}
}
}
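For 429s specifically, a backoff-and-retry wrapper is one way to recover; a minimal sketch, assuming SDKError is exported from the package and using an arbitrary exponential backoff schedule:
import { SDKError } from "@intrface/scraper-sdk"; // export path is an assumption
// Retry a call with exponential backoff when the API answers 429 (rate limited).
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const rateLimited = error instanceof SDKError && error.status === 429;
      if (!rateLimited || attempt >= maxAttempts) throw error;
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000)); // 2s, 4s, 8s, ...
    }
  }
}
// Usage: const job = await withRetry(() => sdk.createJob({ start_url: "https://example.org" }));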
Environment Variables
# .env.local (server-side only; never expose these to the browser)
SCRAPER_API_BASE_URL=http://localhost:8000/api/v1
SCRAPER_API_KEY=your-api-key
# Optional: only if your backend issues JWTs
API_TOKEN=your-secret-token
TypeScript Types
The SDK is fully typed. Import types as needed:
import type {
ScraperJobsCreate,
JobsResponseV1,
JobsPageItemV1
} from "@intrface/scraper-sdk";
// Type-safe job creation
const jobConfig: ScraperJobsCreate = {
start_url: "https://example.org",
extraction_workflow: "tourism",
max_depth: 2
};
Workflow Reference
Available Pre-built Workflows
| Workflow | Description | Best For |
|----------|-------------|----------|
| tourism | Travel destinations, hotels, attractions | Travel sites, booking platforms |
| government | Officials, services, policies, meetings | Government websites, civic data |
| funding | Grants, deadlines, eligibility | Grant databases, funding portals |
BYO Schema Guidelines
- Use JSON Schema Draft 7 format
- Keep schemas focused - extract only what you need
- Provide examples in extraction_overrides for better accuracy
- Host large schemas externally (schema_url) rather than inline
- Version your schemas for reproducibility (see the sketch after this list)
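For instance, a small versioned Draft 7 schema you might host and reference via schema_url (shown as a TypeScript object for readability; the $id and fields are illustrative):
// Contents of product.v1.json; host the JSON document at your schema_url.
const productSchemaV1 = {
  $schema: "http://json-schema.org/draft-07/schema#",
  $id: "https://cdn.example.com/schemas/product.v1.json",
  title: "Product v1",
  type: "object",
  properties: {
    title: { type: "string" },
    price: { type: "number" },
  },
  required: ["title", "price"],
};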
API Methods
| Method | Description |
|--------|-------------|
| createJob(config) | Create a scraping job with optional extraction |
| createCustomJob(config) | Create a job with BYO schema (sets extraction_workflow: "custom") |
| getJob(jobId) | Get job status and results |
| listJobs(params?) | List jobs with pagination |
| cancelJob(jobId) | Cancel a running job |
Note: listPages and getPageStructured are experimental and disabled by default. Enable with experimentalPagesAPI: true only if your backend exposes the pages endpoints.
Building from Source
# Clone the repository
git clone https://github.com/basicalex/scrape.git
cd scrape/packages/scraper-sdk
# Install dependencies
npm install
# Build
npm run build
# Run tests
npm test
Contributing
Contributions welcome! Please read our Contributing Guide in the repo root.
License
MIT (see package.json license).
