@jobo-ai/scraper-sdk v0.3.0
# @jobo/scraper-sdk
TypeScript SDK for building Jobo scraper providers — scrape job listings and details from any ATS.
## Overview
The Scraper SDK provides the types, base classes, and utilities needed to build a scraper provider that integrates with the Jobo platform. Each provider targets a specific ATS (Applicant Tracking System) and implements a standardized interface for discovering and extracting job postings.
## Key Concepts

- **Provider** — A scraper implementation targeting one ATS (e.g. Greenhouse, Lever, Workday).
- **Manifest** — A `manifest.json` declaring the provider's identity, URL patterns, and configuration.
- **Context** — Platform-injected services (HTTP client, browser client, logger) that handle proxying, fingerprinting, and rate limiting. Scrapers must not make HTTP requests outside the context.
- **Category** — Scrapers are either `"full"` (all details on the listing page) or `"split"` (minimal listings plus a secondary detail fetch).
- **Transport** — How data is retrieved: `"api"` (JSON/REST), `"html"` (static HTML), or `"browser"` (headless JS rendering).
## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                      Jobo Platform                      │
│  ┌──────────┐  ┌──────────┐  ┌────────────────────┐     │
│  │  Proxy   │  │ Finger-  │  │   Rate Limiter     │     │
│  │  Pool    │  │ printing │  │  & Retry Engine    │     │
│  └────┬─────┘  └────┬─────┘  └─────────┬──────────┘     │
│       └─────────────┴──────────────────┘                │
│                     │                                   │
│             ┌───────▼────────┐                          │
│             │ ScraperContext │ ← Injected into provider │
│             │  .http         │                          │
│             │  .browser      │                          │
│             │  .log          │                          │
│             └───────┬────────┘                          │
└─────────────────────┼───────────────────────────────────┘
                      │
           ┌──────────▼──────────┐
           │    Your Provider    │ ← You build this
           │  validate()         │
           │  scrapeListings()   │
           │  scrapeJobDetails() │
           └─────────────────────┘
```

## Installation

```sh
npm install @jobo/scraper-sdk
```

## Modules
| Import Path | Description |
| --------------------------- | ----------------------------------------------------------- |
| @jobo/scraper-sdk | Core types, interfaces, errors, defineProvider() helper |
| @jobo/scraper-sdk/runtime | Base classes, HTML parsing, pagination & data helpers |
| @jobo/scraper-sdk/build | Compiler for bundling provider projects |
| @jobo/scraper-sdk/testing | Test runner for exercising providers against live endpoints |
## Quick Start

### 1. Project Structure

```
my-ats-scraper/
├── src/
│   ├── manifest.json   ← Provider metadata
│   └── provider.ts     ← Scraper implementation
├── package.json
└── tsconfig.json
```

### 2. Manifest (`src/manifest.json`)

```json
{
  "providerId": "greenhouse",
  "displayName": "Greenhouse",
  "category": "split",
  "transport": "api",
  "urlPatterns": [
    "https://boards.greenhouse.io/**",
    "https://*.greenhouse.io/**"
  ],
  "companyIdFromUrl": "https://boards.greenhouse.io/:companyId",
  "runtimeConfig": {
    "maxConcurrentDetails": 5,
    "delayBetweenRequestsMs": 200
  }
}
```

### 3. Provider Implementation

#### Full-Details Scraper (API-based)
When the ATS API returns all job details in the listing response:
```ts
import {
  defineProvider,
  ScraperContext,
  ListingsResult,
  PaginationCursor,
  JobDetails,
} from "@jobo/scraper-sdk";
import { currentPage, pageCursor, buildMeta } from "@jobo/scraper-sdk/runtime";

interface ApiJob {
  id: number;
  title: string;
  content: string;
  location: { name: string };
  departments: { name: string }[];
  absolute_url: string;
  updated_at: string;
  metadata: { name: string; value: string }[];
}

interface ApiResponse {
  jobs: ApiJob[];
  meta: { total: number; per_page: number };
}

export default defineProvider({
  validate(url) {
    const match = url.match(/boards\.greenhouse\.io\/(\w+)/);
    return {
      isValid: !!match,
      companyId: match?.[1],
      baseUrl: match
        ? `https://boards-api.greenhouse.io/v1/boards/${match[1]}`
        : undefined,
    };
  },

  async scrapeListings(
    ctx: ScraperContext,
    cursor?: PaginationCursor,
  ): Promise<ListingsResult> {
    const page = currentPage(cursor);
    const res = await ctx.http.get<ApiResponse>(
      `${ctx.baseUrl}/jobs?page=${page}&per_page=50`,
    );

    const listings: JobDetails[] = res.data.jobs.map((job) => ({
      externalId: String(job.id),
      title: job.title,
      description: job.content,
      listingUrl: job.absolute_url,
      applyUrl: `${job.absolute_url}#app`,
      locations: [{ text: job.location.name }],
      postedAt: job.updated_at,
      meta: buildMeta(
        ["departments", job.departments.map((d) => d.name)],
        ...job.metadata.map((m) => [m.name, m.value] as [string, string]),
      ),
    }));

    const totalPages = Math.ceil(res.data.meta.total / res.data.meta.per_page);
    return {
      listings,
      hasMore: page < totalPages,
      nextCursor: page < totalPages ? pageCursor(page + 1) : undefined,
      totalCount: res.data.meta.total,
    };
  },
});
```

#### Split Scraper (HTML-based)
When the listing page only has links and details require a secondary fetch:
```ts
import {
  defineProvider,
  ScraperContext,
  ListingsResult,
  JobDetailsResult,
  JobListing,
} from "@jobo/scraper-sdk";
import {
  extractLinks,
  extractText,
  extractJsonLd,
  findJsonLdByType,
  resolveUrl,
  parseLocation,
  buildMeta,
} from "@jobo/scraper-sdk/runtime";

export default defineProvider({
  validate(url) {
    const match = url.match(/jobs\.example\.com\/(\w+)/);
    return {
      isValid: !!match,
      companyId: match?.[1],
      baseUrl: match ? `https://jobs.example.com/${match[1]}` : undefined,
    };
  },

  async scrapeListings(ctx: ScraperContext): Promise<ListingsResult> {
    // Fetch the listing page (static HTML, no JS needed)
    const res = await ctx.http.get<string>(`${ctx.baseUrl}/jobs`, {
      responseType: "text",
    });

    // Extract job links
    const links = extractLinks(res.data, /\/jobs\/[\w-]+$/);
    const listings: JobListing[] = links.map((link) => {
      const slug = link.split("/").pop()!;
      return {
        externalId: slug,
        applyUrl: resolveUrl(`${link}/apply`, ctx.baseUrl),
        listingUrl: resolveUrl(link, ctx.baseUrl),
      };
    });

    return { listings, hasMore: false };
  },

  async scrapeJobDetails(
    ctx: ScraperContext,
    listing: JobListing,
  ): Promise<JobDetailsResult> {
    const res = await ctx.http.get<string>(listing.listingUrl!, {
      responseType: "text",
    });
    const html = res.data;

    // Try JSON-LD first (most reliable)
    const jsonLd = extractJsonLd(html);
    const jobPosting = findJsonLdByType(jsonLd, "JobPosting");
    if (jobPosting) {
      return {
        details: {
          ...listing,
          title: jobPosting.title as string,
          description: jobPosting.description as string,
          listingUrl: listing.listingUrl!,
          locations: [
            parseLocation(
              (jobPosting.jobLocation as any)?.address?.addressLocality ?? "",
            ),
          ],
          companyName: (jobPosting.hiringOrganization as any)?.name,
          postedAt: jobPosting.datePosted as string,
          meta: buildMeta([
            "employmentType",
            jobPosting.employmentType as string,
          ]),
        },
      };
    }

    // Fall back to HTML extraction
    const title = extractText(html, "h1");
    const description = extractText(html, 'div class="job-description"');
    if (!title || !description) {
      return {
        details: null,
        error: "Could not extract job details from HTML",
      };
    }

    return {
      details: {
        ...listing,
        title,
        description,
        listingUrl: listing.listingUrl!,
        locations: [
          parseLocation(extractText(html, 'span class="location"') ?? ""),
        ],
      },
    };
  },
});
```

#### Browser-Based Scraper (JS-rendered pages)
When the ATS is a SPA that requires JavaScript execution:
```ts
import {
  defineProvider,
  ScraperContext,
  ListingsResult,
} from "@jobo/scraper-sdk";
import {
  extractText,
  extractLinks,
  resolveUrl,
} from "@jobo/scraper-sdk/runtime";

export default defineProvider({
  validate(url) {
    return { isValid: url.includes("spa-ats.example.com") };
  },

  async scrapeListings(ctx: ScraperContext): Promise<ListingsResult> {
    // Render the page with JavaScript enabled
    const page = await ctx.browser.renderPage({
      url: `${ctx.baseUrl}/careers`,
      waitUntil: "selector",
      waitForSelector: ".job-card",
      javascript: true,
    });

    // Now parse the rendered HTML
    const links = extractLinks(page.html, /\/careers\/\d+/);
    return {
      listings: links.map((link) => ({
        externalId: link.split("/").pop()!,
        applyUrl: resolveUrl(link, ctx.baseUrl),
        listingUrl: resolveUrl(link, ctx.baseUrl),
        title: extractText(page.html, ".job-card-title") ?? undefined,
      })),
      hasMore: false,
    };
  },
});
```

## API Reference
### Core Types (`@jobo/scraper-sdk`)

#### Data Models
| Type | Description |
| ------------------ | ------------------------------------------------------------------------------------------------------- |
| JobListing | A job listing from a search/listing page. Minimum: externalId + applyUrl. |
| JobDetails | Full job details. Extends JobListing with required title, description, locations, listingUrl. |
| JobLocation | A geographic location. Minimum: text. Optional: city, state, country, isRemote. |
| MetadataMap | Record<string, MetadataValue> — arbitrary key-value metadata. |
| MetadataValue | string \| number \| boolean \| null \| MetadataValue[] \| { [key: string]: MetadataValue } |
| PaginationCursor | Opaque cursor for paginated scraping. Types: offset, cursor, page, url, custom. |
| ListingsResult | Result of scrapeListings(): { listings, hasMore, nextCursor?, totalCount? }. |
| JobDetailsResult | Result of scrapeJobDetails(): { details, isExpired?, error? }. |
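For orientation, the documented minimums can be sketched as TypeScript interfaces. This is an illustration of the fields listed in the table above, not the SDK's actual declarations:

```ts
// Illustrative shapes only — field names follow the table above, but the
// SDK's real declarations may differ in detail.
type MetadataValue =
  | string | number | boolean | null
  | MetadataValue[] | { [key: string]: MetadataValue };

interface JobLocation {
  text: string;        // required minimum
  city?: string;
  state?: string;
  country?: string;
  isRemote?: boolean;
}

interface JobListing {
  externalId: string;  // required minimum
  applyUrl: string;    // required minimum
  listingUrl?: string;
  title?: string;
}

interface JobDetails extends JobListing {
  title: string;       // required on details
  description: string;
  listingUrl: string;
  locations: JobLocation[];
  postedAt?: string;
  meta?: Record<string, MetadataValue>;
}

// A minimal valid listing, per the table:
const listing: JobListing = {
  externalId: "12345",
  applyUrl: "https://boards.greenhouse.io/acme/jobs/12345#app",
};
```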
#### Provider Interface
| Type | Description |
| ------------------------- | ------------------------------------------------------------ |
| ScraperProvider | The main interface every scraper must implement. |
| ScraperProviderManifest | Schema for manifest.json. |
| ScraperCategory | "full" or "split". |
| ScraperTransport | "api", "html", or "browser". |
| ValidationResult | Result of validate(): { isValid, companyId?, baseUrl? }. |
| defineProvider(p) | Type-narrowing helper that returns the provider as-is. |
#### Platform Context
| Type | Description |
| ---------------- | -------------------------------------------------------------------------------------------------------- |
| ScraperContext | Injected into every scraper method. Contains http, browser, log, baseUrl, companyId, config. |
| HttpClient | HTTP client: request(), get(), post(). Handles proxying and fingerprinting. |
| BrowserClient | Browser client: renderPage(). Handles headless rendering with JS. |
| ScraperLogger | Structured logger: debug(), info(), warn(), error(). |
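Pieced together from the table above and the Quick Start examples, the injected context plausibly has a shape like the following. This is an illustrative sketch — the method signatures are guesses, not the SDK's actual declaration:

```ts
// Illustrative only — field names follow the table above; signatures are
// inferred from the usage shown in this README's examples.
interface ScraperContext {
  http: {
    get<T>(url: string, opts?: { responseType?: "json" | "text" }): Promise<{ data: T }>;
    post<T>(url: string, body?: unknown): Promise<{ data: T }>;
  };
  browser: {
    renderPage(opts: {
      url: string;
      waitUntil?: "load" | "selector";
      waitForSelector?: string;
      javascript?: boolean;
    }): Promise<{ html: string }>;
  };
  log: {
    debug(msg: string): void;
    info(msg: string): void;
    warn(msg: string): void;
    error(msg: string): void;
  };
  baseUrl: string;
  companyId?: string;
  config: Record<string, unknown>;
}

// A stub like this is handy in unit tests that exercise parsing logic:
const stubCtx: ScraperContext = {
  http: {
    get: async () => ({ data: null as any }),
    post: async () => ({ data: null as any }),
  },
  browser: { renderPage: async () => ({ html: "" }) },
  log: { debug() {}, info() {}, warn() {}, error() {} },
  baseUrl: "https://boards-api.greenhouse.io/v1/boards/acme",
  companyId: "acme",
  config: {},
};
```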
#### Errors
| Error Class | Code | Retryable | Use Case |
| --------------------- | --------------- | ---------- | ---------------------------------- |
| ScraperError | (base) | — | Base class for all scraper errors. |
| NetworkError | NETWORK_ERROR | ✓ | DNS, connection, TLS failures. |
| HttpError | HTTP_ERROR | 429/5xx: ✓ | HTTP error responses. |
| RateLimitError | RATE_LIMITED | ✓ | 429 Too Many Requests. |
| ParseError | PARSE_ERROR | ✗ | HTML/JSON structure changed. |
| AuthenticationError | AUTH_REQUIRED | ✗ | Login wall, CAPTCHA. |
| ExpiredError | EXPIRED | ✗ | Job posting removed. |
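The Retryable column can be read as: the platform inspects the thrown error's type to decide whether to retry. A sketch of how such a hierarchy fits together — the class names and codes come from the table above, but the constructor shapes and the `retryable` field are assumptions:

```ts
// Sketch of the error hierarchy above; the real SDK's constructors and
// fields may differ.
class ScraperError extends Error {
  constructor(message: string, public code: string, public retryable: boolean) {
    super(message);
    this.name = this.constructor.name;
  }
}

class NetworkError extends ScraperError {
  constructor(message: string) { super(message, "NETWORK_ERROR", true); }
}

class HttpError extends ScraperError {
  constructor(message: string, public status: number) {
    // Per the table: only 429 and 5xx responses are worth retrying.
    super(message, "HTTP_ERROR", status === 429 || status >= 500);
  }
}

class ParseError extends ScraperError {
  constructor(message: string) { super(message, "PARSE_ERROR", false); }
}

// A retry engine can then branch purely on the error type:
function shouldRetry(err: unknown): boolean {
  return err instanceof ScraperError && err.retryable;
}

const a = shouldRetry(new HttpError("Too Many Requests", 429)); // → true
const b = shouldRetry(new ParseError("selector not found"));    // → false
```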
### Runtime Utilities (`@jobo/scraper-sdk/runtime`)

#### Base Classes
| Class | Description |
| ------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| BaseFullDetailsScraper | Abstract base for full-details scrapers. Implement validate() + scrapeListings(). |
| BaseSplitScraper | Abstract base for split scrapers. Implement validate() + scrapeListings() + scrapeJobDetails(). |
| mergeListingWithDetails(listing, details) | Merge a listing with partial detail data into a complete JobDetails. |
#### HTML Parsing
| Function | Description |
| ----------------------------------------------- | ----------------------------------------------------------- |
| stripHtml(html) | Strip all HTML tags, decode entities, normalize whitespace. |
| stripHtmlAndTruncate(html, maxLength) | Strip HTML and truncate with .... |
| extractText(html, selector) | Extract inner text of the first matching tag. |
| extractAttribute(html, tagPattern, attribute) | Extract an attribute value from a tag. |
| extractMetaByName(html, name) | Extract <meta name="..." content="...">. |
| extractMetaByProperty(html, property) | Extract <meta property="..." content="..."> (Open Graph). |
| extractJsonLd(html) | Parse all <script type="application/ld+json"> blocks. |
| findJsonLdByType(items, type) | Find a JSON-LD object by @type. |
| extractLinks(html, pattern?) | Extract all <a href> values, optionally filtered. |
| resolveUrl(url, baseUrl) | Resolve a relative URL against a base. |
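For intuition about what the extraction helpers do, here are naive regex-based stand-ins for `extractLinks` and `resolveUrl`. These are hypothetical re-implementations of the documented behavior, not the SDK's code (which is presumably more robust):

```ts
// Naive stand-in: collect all <a href> values, optionally filtered by pattern.
function extractLinks(html: string, pattern?: RegExp): string[] {
  const links: string[] = [];
  const re = /<a\b[^>]*\bhref="([^"]+)"/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    if (!pattern || pattern.test(m[1])) links.push(m[1]);
  }
  return links;
}

// Naive stand-in: resolve a relative URL against a base using the URL API.
function resolveUrl(url: string, baseUrl: string): string {
  return new URL(url, baseUrl).toString();
}

const html = '<a href="/jobs/eng-1">Eng</a> <a href="/about">About</a>';
const jobLinks = extractLinks(html, /\/jobs\/[\w-]+$/); // → ["/jobs/eng-1"]
const abs = resolveUrl("/jobs/eng-1", "https://jobs.example.com/acme");
// → "https://jobs.example.com/jobs/eng-1"
```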
#### Pagination Helpers
| Function | Description |
| -------------------------- | --------------------------------------------- |
| offsetCursor(offset) | Create an offset-based cursor. |
| pageCursor(page) | Create a page-number cursor. |
| cursorPagination(cursor) | Create an opaque cursor. |
| urlCursor(url) | Create a URL-based cursor. |
| currentPage(cursor?) | Parse page number from cursor (default: 1). |
| currentOffset(cursor?) | Parse offset from cursor (default: 0). |
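The cursor helpers pair up: constructors (`pageCursor`, `offsetCursor`) on one side, parsers (`currentPage`, `currentOffset`) on the other. A sketch assuming `PaginationCursor` is a tagged union — the SDK's actual representation may differ:

```ts
// Assumed representation: a tagged union matching the documented cursor types.
type PaginationCursor =
  | { type: "page"; page: number }
  | { type: "offset"; offset: number }
  | { type: "cursor"; value: string }
  | { type: "url"; url: string };

const pageCursor = (page: number): PaginationCursor => ({ type: "page", page });
const offsetCursor = (offset: number): PaginationCursor => ({ type: "offset", offset });

// Parse the page number from a cursor; default to the first page.
function currentPage(cursor?: PaginationCursor): number {
  return cursor?.type === "page" ? cursor.page : 1;
}

// Parse the offset from a cursor; default to the start.
function currentOffset(cursor?: PaginationCursor): number {
  return cursor?.type === "offset" ? cursor.offset : 0;
}

const first = currentPage(undefined);            // → 1
const third = currentPage(pageCursor(3));        // → 3
const off = currentOffset(offsetCursor(100));    // → 100
```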
#### Data Helpers
| Function | Description |
| ---------------------------------- | ----------------------------------------------------------- |
| parseLocation(text) | Parse a location string into a JobLocation. |
| parseLocations(text, delimiter?) | Parse multiple locations from a delimited string. |
| parseDate(dateStr) | Parse a date string into ISO 8601 (handles relative dates). |
| buildMeta(...entries) | Build a metadata map, skipping null/undefined values. |
| asString(value) | Safely extract a string from unknown data. |
| asNumber(value) | Safely extract a number from unknown data. |
| asStringArray(value) | Safely extract a string array from unknown data. |
| deduplicateListings(listings) | Remove duplicate listings by externalId. |
| extractPathSegment(url, index) | Extract a URL path segment by position. |
| buildUrl(base, path, params?) | Build a URL from base + path segments + query params. |
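`buildMeta`'s documented skip-empty behavior, sketched as a hypothetical implementation (the SDK's signature may differ in detail):

```ts
// Build a metadata map from [key, value] entries, skipping null/undefined
// values — a sketch of the documented behavior, not the SDK's code.
type Entry = [string, unknown];

function buildMeta(...entries: Entry[]): Record<string, unknown> {
  const meta: Record<string, unknown> = {};
  for (const [key, value] of entries) {
    if (value !== null && value !== undefined) {
      meta[key] = value;
    }
  }
  return meta;
}

const meta = buildMeta(
  ["department", "Engineering"],
  ["compensation", null],          // dropped
  ["tags", ["remote", "senior"]],
);
// → { department: "Engineering", tags: ["remote", "senior"] }
```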
### Build Tools (`@jobo/scraper-sdk/build`)
| Export | Description |
| ---------------------------------- | -------------------------------------------------------------------------------- |
| compileProvider(config) | Compile a provider project into a deployable JS bundle. |
| watchProvider(config, onChange?) | Watch mode: recompile on file changes. |
| BuildConfig | Build configuration: projectDir, outDir?, minify?, sourcemap?. |
| BuildResult | Build result: success, outputDir, files, errors, warnings, duration. |
### Testing (`@jobo/scraper-sdk/testing`)
| Export | Description |
| ---------------------- | ----------------------------------------------------------------------------------------------- |
| runTest(config) | Run a comprehensive test against a live ATS endpoint. |
| runTestSuite(config) | Run a full tests.json suite against a provider. |
| TestConfig | Test config: provider, url, maxPages?, maxDetails?, verbose?. |
| TestResult | Test result: success, totalListings, issues, listings, detailResults, qualityScore. |
| TestSuiteResult | Suite result: success, total, passed, failed, cases[], durationMs. |
| TestIssue | A data quality issue: severity, phase, message, externalId?. |
| QualityScore | Metadata quality score: overall (0-100), breakdown, findings. |
## Manifest Reference
| Field | Type | Required | Description |
| -------------------------------------- | ------------------------------ | -------- | ---------------------------------------------------------------------------------------- |
| providerId | string | ✓ | Unique identifier (e.g. "greenhouse"). |
| displayName | string | ✓ | Human-readable name (e.g. "Greenhouse"). |
| category | "full" \| "split" | ✓ | Whether listings contain all details or need secondary fetch. |
| transport | "api" \| "html" \| "browser" | ✓ | Primary data retrieval method. |
| urlPatterns | string[] | ✓ | Glob patterns identifying this ATS (supports *, **, :param). |
| companyIdFromUrl | string \| string[] | | URL pattern(s) for extracting companyId (:param syntax). |
| jobIdFromUrl | string \| string[] | | URL pattern(s) for extracting jobId from a job detail URL. Used by jobId test cases. |
| jsProviderName | string | | Global JS variable name when bundled. |
| runtimeConfig.maxConcurrentDetails | number | | Max parallel detail fetches (split only). Default: 3. |
| runtimeConfig.delayBetweenRequestsMs | number | | Min delay between requests. Default: 0. |
| runtimeConfig.pageSettleDelayMs | number | | Delay after page nav (browser only). Default: 0. |
| runtimeConfig.maxListingsPerRun | number | | Max listings per run. 0 = unlimited. |
| runtimeConfig.maxRetries | number | | Max retry attempts. Default: 3. |
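The `urlPatterns` glob syntax (`*`, `**`, `:param`) can be pictured as a translation to regular expressions: `*` matches within one path segment, `**` matches across segments, and `:param` matches a single segment. The matcher below is a hypothetical illustration — the platform's real semantics may differ at the edges:

```ts
// Hypothetical glob-to-regex translation for urlPatterns — illustration only.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters (not * or :)
    .replace(/\*\*/g, "\u0000")            // placeholder so * and ** don't collide
    .replace(/\*/g, "[^/]*")               // *  = anything within one segment
    .replace(/\u0000/g, ".*")              // ** = anything, across segments
    .replace(/:([A-Za-z]\w*)/g, "[^/]+");  // :param = exactly one segment
  return new RegExp(`^${escaped}$`);
}

const board = patternToRegExp("https://boards.greenhouse.io/**");
const withParam = patternToRegExp("https://boards.greenhouse.io/:companyId");

const m1 = board.test("https://boards.greenhouse.io/acme/jobs/123"); // → true
const m2 = board.test("https://jobs.lever.co/acme");                 // → false
const m3 = withParam.test("https://boards.greenhouse.io/acme");      // → true
```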
## Testing (`tests.json`)

Every provider should have a `src/tests.json` file. The test suite is run with `jobo-scraper test` (no `--url` flag). Four test types are supported:
### urlPattern — Offline URL matching

Checks the URL against the provider's `urlPatterns` without making any HTTP requests.

```json
{ "type": "urlPattern", "name": "Company board matches", "url": "https://jobs.ashbyhq.com/ramp", "expect": { "matches": true } }
{ "type": "urlPattern", "name": "Job detail URL matches", "url": "https://jobs.ashbyhq.com/ramp/f2ad6068-02e7-4986-967e-804ecef9e043", "expect": { "matches": true } }
{ "type": "urlPattern", "name": "Wrong domain does not match", "url": "https://jobs.lever.co/ramp", "expect": { "matches": false } }
```

### validation — Provider URL validation
Calls `provider.validate(url)` and checks `isValid`, `companyId`, and `baseUrl`.
{ "type": "validation", "name": "Extracts companyId", "url": "https://jobs.ashbyhq.com/ramp", "expect": { "isValid": true, "companyId": "ramp", "baseUrl": "https://jobs.ashbyhq.com/ramp" } }
{ "type": "validation", "name": "Rejects non-Ashby URL", "url": "https://jobs.lever.co/ramp", "expect": { "isValid": false } }| Assertion | Description |
| ----------- | ------------------------------------------------------------- |
| isValid | Required. Whether validate() should return isValid: true. |
| companyId | Optional. Expected companyId extracted from the URL. |
| baseUrl | Optional. Expected baseUrl returned by validate(). |
### jobId — Job ID extraction from URL

Uses the manifest's `jobIdFromUrl` patterns to extract `jobId` and `companyId` from a job detail URL. No HTTP requests.
```json
{
  "type": "jobId",
  "name": "Extracts jobId and companyId",
  "url": "https://jobs.ashbyhq.com/ramp/f2ad6068-02e7-4986-967e-804ecef9e043",
  "expect": {
    "jobId": "f2ad6068-02e7-4986-967e-804ecef9e043",
    "companyId": "ramp"
  }
}
```

| Assertion | Description |
| ----------- | ----------------------------------------------------- |
| jobId | Required. Expected job ID extracted from the URL. |
| companyId | Optional. Expected company ID extracted from the URL. |
Requires `jobIdFromUrl` in `manifest.json`, e.g. `"jobIdFromUrl": ["https://jobs.ashbyhq.com/:companyId/:jobId"]`.
### scrape — Live scrape with assertions
Runs a full live scrape and validates listing/detail counts, required fields, and extracted metadata.
```json
{
  "type": "scrape",
  "name": "Ramp jobs",
  "url": "https://jobs.ashbyhq.com/ramp",
  "maxPages": 1,
  "maxDetails": 3,
  "expect": {
    "minListings": 1,
    "listingHasFields": ["externalId", "applyUrl", "listingUrl"],
    "detailHasFields": ["title", "description", "listingUrl", "applyUrl"],
    "detailMetaHasKeys": ["employment_type", "department"]
  }
}
```

| Assertion | Description |
| ------------------- | ----------------------------------------------------------------------- |
| minListings | Minimum number of listings expected. |
| listingHasFields | Fields that must be non-null/non-empty on every listing. |
| detailHasFields | Fields that must be non-null/non-empty on every detail. |
| detailMetaHasKeys | Meta keys that must be present on at least one detail's meta map. |
## Important Rules

- **Never make HTTP requests directly.** Always use `ctx.http` or `ctx.browser`. The platform handles proxying, fingerprinting, rate limiting, and retries.
- **External IDs must be deterministic.** The same job must always produce the same `externalId`. This is critical for deduplication across scrape runs.
- **Return `applyUrl` on every listing.** This is the minimum viable field — even split scrapers must provide it.
- **Use typed errors.** Throw `HttpError`, `ParseError`, `RateLimitError`, etc. instead of generic `Error`. The platform uses error types to decide retry strategy.
- **Metadata keys should be camelCase.** Common keys: `employmentType`, `experienceLevel`, `department`, `workplaceType`, `compensation`, `benefits`, `tags`.
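The deterministic-ID rule in practice: derive `externalId` only from stable inputs such as the ATS's own job ID or the job URL, never from timestamps or random values. `stableExternalId` below is a hypothetical helper, not part of the SDK:

```ts
// Hypothetical helper: prefer the ATS's own job ID; fall back to a
// deterministic slice of the job URL. Same input → same ID on every run.
function stableExternalId(job: { id?: string | number; url: string }): string {
  if (job.id !== undefined) return String(job.id);
  // Never use Date.now(), Math.random(), or scrape-run state here —
  // that would break deduplication across runs.
  const slug = new URL(job.url).pathname.split("/").filter(Boolean).pop();
  if (!slug) throw new Error(`Cannot derive externalId from ${job.url}`);
  return slug;
}

const byId = stableExternalId({ id: 4012, url: "https://jobs.example.com/acme/4012" });
// → "4012"
const bySlug = stableExternalId({ url: "https://jobs.example.com/acme/senior-engineer" });
// → "senior-engineer"
```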
## Scripts
| Command | Description |
| ------------------- | -------------------------------- |
| npm run build | Build the SDK with tsup. |
| npm run dev | Watch mode — rebuild on changes. |
| npm run typecheck | Run TypeScript type checking. |
| npm run clean | Remove the dist/ directory. |
## License
MIT
