@jobo/scraper-sdk

TypeScript SDK for building Jobo scraper providers — scrape job listings and details from any ATS.

Overview

The Scraper SDK provides the types, base classes, and utilities needed to build a scraper provider that integrates with the Jobo platform. Each provider targets a specific ATS (Applicant Tracking System) and implements a standardized interface for discovering and extracting job postings.

Key Concepts

  • Provider — A scraper implementation targeting one ATS (e.g. Greenhouse, Lever, Workday).
  • Manifest — A manifest.json declaring the provider's identity, URL patterns, and configuration.
  • Context — Platform-injected services (HTTP client, browser client, logger) that handle proxying, fingerprinting, and rate limiting. Scrapers must not make HTTP requests outside the context.
  • Category — Scrapers are either "full" (all details on the listing page) or "split" (minimal listings + secondary detail fetch).
  • Transport — How data is retrieved: "api" (JSON/REST), "html" (static HTML), or "browser" (headless JS rendering).

Architecture

┌──────────────────────────────────────────────────────────┐
│                      Jobo Platform                       │
│  ┌──────────┐  ┌──────────┐  ┌────────────────────┐      │
│  │  Proxy   │  │ Finger-  │  │   Rate Limiter     │      │
│  │  Pool    │  │ printing │  │   & Retry Engine   │      │
│  └────┬─────┘  └────┬─────┘  └─────────┬──────────┘      │
│       └─────────────┴──────────────────┘                 │
│                     │                                    │
│             ┌───────▼────────┐                           │
│             │ ScraperContext │  ← Injected into provider │
│             │   .http        │                           │
│             │   .browser     │                           │
│             │   .log         │                           │
│             └───────┬────────┘                           │
└─────────────────────┼────────────────────────────────────┘
                      │
              ┌───────▼────────────┐
              │  Your Provider     │  ← You build this
              │  validate()        │
              │  scrapeListings()  │
              │  scrapeJobDetails()│
              └────────────────────┘

Installation

npm install @jobo/scraper-sdk

Modules

| Import Path                 | Description                                                  |
| --------------------------- | ------------------------------------------------------------ |
| @jobo/scraper-sdk           | Core types, interfaces, errors, defineProvider() helper      |
| @jobo/scraper-sdk/runtime   | Base classes, HTML parsing, pagination & data helpers        |
| @jobo/scraper-sdk/build     | Compiler for bundling provider projects                      |
| @jobo/scraper-sdk/testing   | Test runner for exercising providers against live endpoints  |
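
A provider project typically touches all four entry points. As a quick orientation, the imports below use only names documented in this README:

import { defineProvider, ParseError } from "@jobo/scraper-sdk";
import { pageCursor, parseLocation, buildMeta } from "@jobo/scraper-sdk/runtime";
import { compileProvider } from "@jobo/scraper-sdk/build";
import { runTestSuite } from "@jobo/scraper-sdk/testing";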

Quick Start

1. Project Structure

my-ats-scraper/
├── src/
│   ├── manifest.json      ← Provider metadata
│   └── provider.ts        ← Scraper implementation
├── package.json
└── tsconfig.json

2. Manifest (src/manifest.json)

{
  "providerId": "greenhouse",
  "displayName": "Greenhouse",
  "category": "split",
  "transport": "api",
  "urlPatterns": [
    "https://boards.greenhouse.io/**",
    "https://*.greenhouse.io/**"
  ],
  "companyIdFromUrl": "https://boards.greenhouse.io/:companyId",
  "runtimeConfig": {
    "maxConcurrentDetails": 5,
    "delayBetweenRequestsMs": 200
  }
}

3. Provider Implementation

Full-Details Scraper (API-based)

When the ATS API returns all job details in the listing response:

import {
  defineProvider,
  ScraperContext,
  ListingsResult,
  PaginationCursor,
  JobDetails,
} from "@jobo/scraper-sdk";
import { currentPage, pageCursor, buildMeta } from "@jobo/scraper-sdk/runtime";

interface ApiJob {
  id: number;
  title: string;
  content: string;
  location: { name: string };
  departments: { name: string }[];
  absolute_url: string;
  updated_at: string;
  metadata: { name: string; value: string }[];
}

interface ApiResponse {
  jobs: ApiJob[];
  meta: { total: number; per_page: number };
}

export default defineProvider({
  validate(url) {
    const match = url.match(/boards\.greenhouse\.io\/(\w+)/);
    return {
      isValid: !!match,
      companyId: match?.[1],
      baseUrl: match
        ? `https://boards-api.greenhouse.io/v1/boards/${match[1]}`
        : undefined,
    };
  },

  async scrapeListings(
    ctx: ScraperContext,
    cursor?: PaginationCursor,
  ): Promise<ListingsResult> {
    const page = currentPage(cursor);
    const res = await ctx.http.get<ApiResponse>(
      `${ctx.baseUrl}/jobs?page=${page}&per_page=50`,
    );

    const listings: JobDetails[] = res.data.jobs.map((job) => ({
      externalId: String(job.id),
      title: job.title,
      description: job.content,
      listingUrl: job.absolute_url,
      applyUrl: `${job.absolute_url}#app`,
      locations: [{ text: job.location.name }],
      postedAt: job.updated_at,
      meta: buildMeta(
        ["departments", job.departments.map((d) => d.name)],
        ...job.metadata.map((m) => [m.name, m.value] as [string, string]),
      ),
    }));

    const totalPages = Math.ceil(res.data.meta.total / res.data.meta.per_page);

    return {
      listings,
      hasMore: page < totalPages,
      nextCursor: page < totalPages ? pageCursor(page + 1) : undefined,
      totalCount: res.data.meta.total,
    };
  },
});

Split Scraper (HTML-based)

When the listing page only has links and details require a secondary fetch:

import {
  defineProvider,
  ScraperContext,
  ListingsResult,
  JobDetailsResult,
  JobListing,
} from "@jobo/scraper-sdk";
import {
  extractLinks,
  extractText,
  extractJsonLd,
  findJsonLdByType,
  resolveUrl,
  parseLocation,
  buildMeta,
} from "@jobo/scraper-sdk/runtime";

export default defineProvider({
  validate(url) {
    const match = url.match(/jobs\.example\.com\/(\w+)/);
    return {
      isValid: !!match,
      companyId: match?.[1],
      baseUrl: match ? `https://jobs.example.com/${match[1]}` : undefined,
    };
  },

  async scrapeListings(ctx: ScraperContext): Promise<ListingsResult> {
    // Fetch the listing page (static HTML, no JS needed)
    const res = await ctx.http.get<string>(`${ctx.baseUrl}/jobs`, {
      responseType: "text",
    });

    // Extract job links
    const links = extractLinks(res.data, /\/jobs\/[\w-]+$/);

    const listings: JobListing[] = links.map((link) => {
      const slug = link.split("/").pop()!;
      return {
        externalId: slug,
        applyUrl: resolveUrl(`${link}/apply`, ctx.baseUrl),
        listingUrl: resolveUrl(link, ctx.baseUrl),
      };
    });

    return { listings, hasMore: false };
  },

  async scrapeJobDetails(
    ctx: ScraperContext,
    listing: JobListing,
  ): Promise<JobDetailsResult> {
    const res = await ctx.http.get<string>(listing.listingUrl!, {
      responseType: "text",
    });

    const html = res.data;

    // Try JSON-LD first (most reliable)
    const jsonLd = extractJsonLd(html);
    const jobPosting = findJsonLdByType(jsonLd, "JobPosting");

    if (jobPosting) {
      return {
        details: {
          ...listing,
          title: jobPosting.title as string,
          description: jobPosting.description as string,
          listingUrl: listing.listingUrl!,
          locations: [
            parseLocation(
              (jobPosting.jobLocation as any)?.address?.addressLocality ?? "",
            ),
          ],
          companyName: (jobPosting.hiringOrganization as any)?.name,
          postedAt: jobPosting.datePosted as string,
          meta: buildMeta([
            "employmentType",
            jobPosting.employmentType as string,
          ]),
        },
      };
    }

    // Fallback to HTML extraction
    const title = extractText(html, "h1");
    const description = extractText(html, 'div class="job-description"');

    if (!title || !description) {
      return {
        details: null,
        error: "Could not extract job details from HTML",
      };
    }

    return {
      details: {
        ...listing,
        title,
        description,
        listingUrl: listing.listingUrl!,
        locations: [
          parseLocation(extractText(html, 'span class="location"') ?? ""),
        ],
      },
    };
  },
});

Browser-Based Scraper (JS-rendered pages)

When the ATS is a SPA that requires JavaScript execution:

import {
  defineProvider,
  ScraperContext,
  ListingsResult,
} from "@jobo/scraper-sdk";
import {
  extractText,
  extractLinks,
  resolveUrl,
} from "@jobo/scraper-sdk/runtime";

export default defineProvider({
  validate(url) {
    return { isValid: url.includes("spa-ats.example.com") };
  },

  async scrapeListings(ctx: ScraperContext): Promise<ListingsResult> {
    // Render the page with JavaScript enabled
    const page = await ctx.browser.renderPage({
      url: `${ctx.baseUrl}/careers`,
      waitUntil: "selector",
      waitForSelector: ".job-card",
      javascript: true,
    });

    // Now parse the rendered HTML
    const links = extractLinks(page.html, /\/careers\/\d+/);

    return {
      listings: links.map((link) => ({
        externalId: link.split("/").pop()!,
        applyUrl: resolveUrl(link, ctx.baseUrl),
        listingUrl: resolveUrl(link, ctx.baseUrl),
        title: extractText(page.html, ".job-card-title") ?? undefined,
      })),
      hasMore: false,
    };
  },
});

API Reference

Core Types (@jobo/scraper-sdk)

Data Models

| Type               | Description                                                                                      |
| ------------------ | ------------------------------------------------------------------------------------------------ |
| JobListing         | A job listing from a search/listing page. Minimum: externalId + applyUrl.                         |
| JobDetails         | Full job details. Extends JobListing with required title, description, locations, listingUrl.     |
| JobLocation        | A geographic location. Minimum: text. Optional: city, state, country, isRemote.                   |
| MetadataMap        | Record<string, MetadataValue> — arbitrary key-value metadata.                                      |
| MetadataValue      | string \| number \| boolean \| null \| MetadataValue[] \| { [key: string]: MetadataValue }        |
| PaginationCursor   | Opaque cursor for paginated scraping. Types: offset, cursor, page, url, custom.                    |
| ListingsResult     | Result of scrapeListings(): { listings, hasMore, nextCursor?, totalCount? }.                       |
| JobDetailsResult   | Result of scrapeJobDetails(): { details, isExpired?, error? }.                                     |
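
To make the listing/details split concrete, here is a minimal sketch (field values are invented):

import { JobListing, JobDetails } from "@jobo/scraper-sdk";

// Minimum viable listing: externalId + applyUrl.
const listing: JobListing = {
  externalId: "12345",
  applyUrl: "https://jobs.example.com/acme/12345/apply",
};

// JobDetails extends JobListing with required title, description,
// locations, and listingUrl.
const details: JobDetails = {
  ...listing,
  title: "Senior Backend Engineer",
  description: "Own our ingestion pipeline.",
  listingUrl: "https://jobs.example.com/acme/12345",
  locations: [
    { text: "Berlin, Germany", city: "Berlin", country: "Germany", isRemote: false },
  ],
};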

Provider Interface

| Type                      | Description                                                   |
| ------------------------- | ------------------------------------------------------------- |
| ScraperProvider           | The main interface every scraper must implement.              |
| ScraperProviderManifest   | Schema for manifest.json.                                     |
| ScraperCategory           | "full" or "split".                                            |
| ScraperTransport          | "api", "html", or "browser".                                  |
| ValidationResult          | Result of validate(): { isValid, companyId?, baseUrl? }.      |
| defineProvider(p)         | Type-narrowing helper that returns the provider as-is.        |

Platform Context

| Type             | Description                                                                                     |
| ---------------- | ----------------------------------------------------------------------------------------------- |
| ScraperContext   | Injected into every scraper method. Contains http, browser, log, baseUrl, companyId, config.     |
| HttpClient       | HTTP client: request(), get(), post(). Handles proxying and fingerprinting.                      |
| BrowserClient    | Browser client: renderPage(). Handles headless rendering with JS.                                |
| ScraperLogger    | Structured logger: debug(), info(), warn(), error().                                             |

Errors

| Error Class           | Code            | Retryable  | Use Case                           |
| --------------------- | --------------- | ---------- | ---------------------------------- |
| ScraperError          | (base)          | —          | Base class for all scraper errors. |
| NetworkError          | NETWORK_ERROR   | ✓          | DNS, connection, TLS failures.     |
| HttpError             | HTTP_ERROR      | 429/5xx: ✓ | HTTP error responses.              |
| RateLimitError        | RATE_LIMITED    | ✓          | 429 Too Many Requests.             |
| ParseError            | PARSE_ERROR     | ✗          | HTML/JSON structure changed.       |
| AuthenticationError   | AUTH_REQUIRED   | ✗          | Login wall, CAPTCHA.               |
| ExpiredError          | EXPIRED         | ✗          | Job posting removed.               |
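
A sketch of how typed errors might be thrown from a scraper. The constructors are assumed here to take a message string; check the actual signatures before relying on this:

import { RateLimitError, ParseError } from "@jobo/scraper-sdk";
import { extractJsonLd } from "@jobo/scraper-sdk/runtime";

function parseListingPage(status: number, html: string) {
  if (status === 429) {
    // Retryable: the platform's retry engine backs off and tries again.
    throw new RateLimitError("ATS throttled the request");
  }
  const blocks = extractJsonLd(html);
  if (blocks.length === 0) {
    // Not retryable: the page structure has likely changed.
    throw new ParseError("No JSON-LD blocks found on listing page");
  }
  return blocks;
}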

Runtime Utilities (@jobo/scraper-sdk/runtime)

Base Classes

| Class                                       | Description                                                                                        |
| ------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| BaseFullDetailsScraper                      | Abstract base for full-details scrapers. Implement validate() + scrapeListings().                   |
| BaseSplitScraper                            | Abstract base for split scrapers. Implement validate() + scrapeListings() + scrapeJobDetails().     |
| mergeListingWithDetails(listing, details)   | Merge a listing with partial detail data into a complete JobDetails.                                |
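
If you prefer classes over defineProvider(), a split provider can subclass the base. This is a sketch; the abstract method signatures are assumed to mirror the provider interface above:

import {
  ScraperContext,
  ListingsResult,
  JobDetailsResult,
  JobListing,
  ValidationResult,
} from "@jobo/scraper-sdk";
import { BaseSplitScraper } from "@jobo/scraper-sdk/runtime";

class MyAtsScraper extends BaseSplitScraper {
  validate(url: string): ValidationResult {
    return { isValid: url.includes("jobs.example.com") };
  }

  async scrapeListings(ctx: ScraperContext): Promise<ListingsResult> {
    // A real provider would fetch via ctx.http here.
    return { listings: [], hasMore: false };
  }

  async scrapeJobDetails(ctx: ScraperContext, listing: JobListing): Promise<JobDetailsResult> {
    return { details: null, error: "not implemented" };
  }
}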

HTML Parsing

| Function                                        | Description                                                  |
| ----------------------------------------------- | ------------------------------------------------------------ |
| stripHtml(html)                                 | Strip all HTML tags, decode entities, normalize whitespace.   |
| stripHtmlAndTruncate(html, maxLength)           | Strip HTML and truncate with ....                             |
| extractText(html, selector)                     | Extract inner text of the first matching tag.                 |
| extractAttribute(html, tagPattern, attribute)   | Extract an attribute value from a tag.                        |
| extractMetaByName(html, name)                   | Extract <meta name="..." content="...">.                      |
| extractMetaByProperty(html, property)           | Extract <meta property="..." content="..."> (Open Graph).     |
| extractJsonLd(html)                             | Parse all <script type="application/ld+json"> blocks.         |
| findJsonLdByType(items, type)                   | Find a JSON-LD object by @type.                               |
| extractLinks(html, pattern?)                    | Extract all <a href> values, optionally filtered.             |
| resolveUrl(url, baseUrl)                        | Resolve a relative URL against a base.                        |
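
These helpers compose when falling back from structured metadata to raw HTML. A sketch:

import {
  extractMetaByProperty,
  extractText,
  stripHtmlAndTruncate,
} from "@jobo/scraper-sdk/runtime";

function titleAndSummary(html: string) {
  // Prefer Open Graph metadata, then fall back to the first <h1>.
  const title = extractMetaByProperty(html, "og:title") ?? extractText(html, "h1");
  // Plain-text summary capped at 300 characters.
  const summary = stripHtmlAndTruncate(html, 300);
  return { title, summary };
}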

Pagination Helpers

| Function                   | Description                                    |
| -------------------------- | ---------------------------------------------- |
| offsetCursor(offset)       | Create an offset-based cursor.                 |
| pageCursor(page)           | Create a page-number cursor.                   |
| cursorPagination(cursor)   | Create an opaque cursor.                       |
| urlCursor(url)             | Create a URL-based cursor.                     |
| currentPage(cursor?)       | Parse page number from cursor (default: 1).    |
| currentOffset(cursor?)     | Parse offset from cursor (default: 0).         |
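
For an offset-paginated API, the helpers pair up as follows (a sketch; the page size of 50 is arbitrary):

import { PaginationCursor } from "@jobo/scraper-sdk";
import { currentOffset, offsetCursor } from "@jobo/scraper-sdk/runtime";

const PAGE_SIZE = 50;

function nextCursorFor(cursor: PaginationCursor | undefined, received: number) {
  const offset = currentOffset(cursor); // 0 on the first call
  // The platform feeds nextCursor back into the next scrapeListings() call.
  return received === PAGE_SIZE ? offsetCursor(offset + PAGE_SIZE) : undefined;
}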

Data Helpers

| Function                           | Description                                                  |
| ---------------------------------- | ------------------------------------------------------------ |
| parseLocation(text)                | Parse a location string into a JobLocation.                   |
| parseLocations(text, delimiter?)   | Parse multiple locations from a delimited string.             |
| parseDate(dateStr)                 | Parse a date string into ISO 8601 (handles relative dates).   |
| buildMeta(...entries)              | Build a metadata map, skipping null/undefined values.         |
| asString(value)                    | Safely extract a string from unknown data.                    |
| asNumber(value)                    | Safely extract a number from unknown data.                    |
| asStringArray(value)               | Safely extract a string array from unknown data.              |
| deduplicateListings(listings)      | Remove duplicate listings by externalId.                      |
| extractPathSegment(url, index)     | Extract a URL path segment by position.                       |
| buildUrl(base, path, params?)      | Build a URL from base + path segments + query params.         |
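
Typical usage when normalizing scraped fields (a sketch; the inputs are illustrative):

import {
  parseLocations,
  parseDate,
  asString,
  buildUrl,
} from "@jobo/scraper-sdk/runtime";

// "Berlin; Remote (EU)" parses into two JobLocation objects.
const locations = parseLocations("Berlin; Remote (EU)", ";");

// Relative dates are normalized to ISO 8601.
const postedAt = parseDate("3 days ago");

// Defensive extraction from an untyped API payload.
const payload: unknown = { department: "Engineering" };
const department = asString((payload as { department?: unknown }).department);

// Produces https://api.example.com/v1/jobs?page=2
const url = buildUrl("https://api.example.com", "/v1/jobs", { page: 2 });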

Build Tools (@jobo/scraper-sdk/build)

| Export                             | Description                                                            |
| ---------------------------------- | ----------------------------------------------------------------------- |
| compileProvider(config)            | Compile a provider project into a deployable JS bundle.                 |
| watchProvider(config, onChange?)   | Watch mode: recompile on file changes.                                  |
| BuildConfig                        | Build configuration: projectDir, outDir?, minify?, sourcemap?.          |
| BuildResult                        | Build result: success, outputDir, files, errors, warnings, duration.    |
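
A minimal build script might look like this (a sketch; the paths are placeholders):

import { compileProvider } from "@jobo/scraper-sdk/build";

const result = await compileProvider({
  projectDir: "./my-ats-scraper",
  outDir: "./dist",
  minify: true,
});

if (!result.success) {
  console.error(result.errors);
  process.exit(1);
}
console.log(`Bundled ${result.files.length} file(s) in ${result.duration}ms`);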

Testing (@jobo/scraper-sdk/testing)

| Export                 | Description                                                                          |
| ---------------------- | ------------------------------------------------------------------------------------ |
| runTest(config)        | Run a comprehensive test against a live ATS endpoint.                                 |
| runTestSuite(config)   | Run a full tests.json suite against a provider.                                       |
| TestConfig             | Test config: provider, url, maxPages?, maxDetails?, verbose?.                         |
| TestResult             | Test result: success, totalListings, issues, listings, detailResults, qualityScore.   |
| TestSuiteResult        | Suite result: success, total, passed, failed, cases[], durationMs.                    |
| TestIssue              | A data quality issue: severity, phase, message, externalId?.                          |
| QualityScore           | Metadata quality score: overall (0-100), breakdown, findings.                         |
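
Programmatic use might look like this (a sketch; the URL and limits are placeholders):

import { runTest } from "@jobo/scraper-sdk/testing";
import provider from "./src/provider";

const result = await runTest({
  provider,
  url: "https://boards.greenhouse.io/examplecompany",
  maxPages: 1,
  maxDetails: 3,
  verbose: true,
});

console.log(result.success, result.totalListings, result.qualityScore.overall);
for (const issue of result.issues) {
  console.warn(`[${issue.severity}] ${issue.phase}: ${issue.message}`);
}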

Manifest Reference

| Field                                  | Type                           | Required | Description                                                                            |
| -------------------------------------- | ------------------------------ | -------- | --------------------------------------------------------------------------------------- |
| providerId                             | string                         | ✓        | Unique identifier (e.g. "greenhouse").                                                   |
| displayName                            | string                         | ✓        | Human-readable name (e.g. "Greenhouse").                                                 |
| category                               | "full" \| "split"              | ✓        | Whether listings contain all details or need secondary fetch.                            |
| transport                              | "api" \| "html" \| "browser"   | ✓        | Primary data retrieval method.                                                           |
| urlPatterns                            | string[]                       | ✓        | Glob patterns identifying this ATS (supports *, **, :param).                             |
| companyIdFromUrl                       | string \| string[]             |          | URL pattern(s) for extracting companyId (:param syntax).                                 |
| jobIdFromUrl                           | string \| string[]             |          | URL pattern(s) for extracting jobId from a job detail URL. Used by jobId test cases.     |
| jsProviderName                         | string                         |          | Global JS variable name when bundled.                                                    |
| runtimeConfig.maxConcurrentDetails     | number                         |          | Max parallel detail fetches (split only). Default: 3.                                    |
| runtimeConfig.delayBetweenRequestsMs   | number                         |          | Min delay between requests. Default: 0.                                                  |
| runtimeConfig.pageSettleDelayMs        | number                         |          | Delay after page nav (browser only). Default: 0.                                         |
| runtimeConfig.maxListingsPerRun        | number                         |          | Max listings per run. 0 = unlimited.                                                     |
| runtimeConfig.maxRetries               | number                         |          | Max retry attempts. Default: 3.                                                          |

Testing (tests.json)

Every provider should have a src/tests.json file. The test suite is run with jobo-scraper test (no --url flag). Four test types are supported:

urlPattern — Offline URL matching

Checks the URL against the provider's urlPatterns without making any HTTP requests.

{ "type": "urlPattern", "name": "Company board matches", "url": "https://jobs.ashbyhq.com/ramp", "expect": { "matches": true } }
{ "type": "urlPattern", "name": "Job detail URL matches", "url": "https://jobs.ashbyhq.com/ramp/f2ad6068-02e7-4986-967e-804ecef9e043", "expect": { "matches": true } }
{ "type": "urlPattern", "name": "Wrong domain does not match", "url": "https://jobs.lever.co/ramp", "expect": { "matches": false } }

validation — Provider URL validation

Calls provider.validate(url) and checks isValid, companyId, and baseUrl.

{ "type": "validation", "name": "Extracts companyId", "url": "https://jobs.ashbyhq.com/ramp", "expect": { "isValid": true, "companyId": "ramp", "baseUrl": "https://jobs.ashbyhq.com/ramp" } }
{ "type": "validation", "name": "Rejects non-Ashby URL", "url": "https://jobs.lever.co/ramp", "expect": { "isValid": false } }

| Assertion   | Description                                                    |
| ----------- | -------------------------------------------------------------- |
| isValid     | Required. Whether validate() should return isValid: true.      |
| companyId   | Optional. Expected companyId extracted from the URL.            |
| baseUrl     | Optional. Expected baseUrl returned by validate().              |

jobId — Job ID extraction from URL

Uses the manifest's jobIdFromUrl patterns to extract jobId and companyId from a job detail URL. No HTTP requests.

{
  "type": "jobId",
  "name": "Extracts jobId and companyId",
  "url": "https://jobs.ashbyhq.com/ramp/f2ad6068-02e7-4986-967e-804ecef9e043",
  "expect": {
    "jobId": "f2ad6068-02e7-4986-967e-804ecef9e043",
    "companyId": "ramp"
  }
}

| Assertion   | Description                                             |
| ----------- | ------------------------------------------------------- |
| jobId       | Required. Expected job ID extracted from the URL.       |
| companyId   | Optional. Expected company ID extracted from the URL.   |

Requires jobIdFromUrl in manifest.json, e.g. "jobIdFromUrl": ["https://jobs.ashbyhq.com/:companyId/:jobId"]

scrape — Live scrape with assertions

Runs a full live scrape and validates listing/detail counts, required fields, and extracted metadata.

{
  "type": "scrape",
  "name": "Ramp jobs",
  "url": "https://jobs.ashbyhq.com/ramp",
  "maxPages": 1,
  "maxDetails": 3,
  "expect": {
    "minListings": 1,
    "listingHasFields": ["externalId", "applyUrl", "listingUrl"],
    "detailHasFields": ["title", "description", "listingUrl", "applyUrl"],
    "detailMetaHasKeys": ["employment_type", "department"]
  }
}

| Assertion           | Description                                                          |
| ------------------- | -------------------------------------------------------------------- |
| minListings         | Minimum number of listings expected.                                  |
| listingHasFields    | Fields that must be non-null/non-empty on every listing.              |
| detailHasFields     | Fields that must be non-null/non-empty on every detail.               |
| detailMetaHasKeys   | Meta keys that must be present on at least one detail's meta map.     |
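
Putting the four types together, a provider's src/tests.json might look like the following. The top-level shape (a bare array of cases) is an assumption here; adjust it to whatever jobo-scraper test actually expects:

[
  { "type": "urlPattern", "name": "Board URL matches", "url": "https://jobs.ashbyhq.com/ramp", "expect": { "matches": true } },
  { "type": "validation", "name": "Extracts companyId", "url": "https://jobs.ashbyhq.com/ramp", "expect": { "isValid": true, "companyId": "ramp" } },
  { "type": "jobId", "name": "Extracts jobId", "url": "https://jobs.ashbyhq.com/ramp/f2ad6068-02e7-4986-967e-804ecef9e043", "expect": { "jobId": "f2ad6068-02e7-4986-967e-804ecef9e043" } },
  { "type": "scrape", "name": "Ramp jobs", "url": "https://jobs.ashbyhq.com/ramp", "maxPages": 1, "expect": { "minListings": 1 } }
]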

Important Rules

  1. Never make HTTP requests directly. Always use ctx.http or ctx.browser. The platform handles proxying, fingerprinting, rate limiting, and retries.

  2. External IDs must be deterministic. The same job must always produce the same externalId. This is critical for deduplication across scrape runs (see the sketch after this list).

  3. Return applyUrl on every listing. This is the minimum viable field — even split scrapers must provide it.

  4. Use typed errors. Throw HttpError, ParseError, RateLimitError, etc. instead of generic Error. The platform uses error types to decide retry strategy.

  5. Metadata keys should be camelCase. Common keys: employmentType, experienceLevel, department, workplaceType, compensation, benefits, tags.
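
Regarding rule 2, a sketch of deterministic versus non-deterministic external IDs:

import { randomUUID } from "node:crypto";

function externalIdFor(job: { id: number }): string {
  // Good: derived from a stable upstream identifier. Same job, same ID, every run.
  return String(job.id);
}

function externalIdFromSlug(listingUrl: string): string {
  // Also fine when the ATS exposes no numeric ID: a stable URL slug.
  return new URL(listingUrl).pathname.split("/").pop()!;
}

function brokenExternalId(): string {
  // Bad: changes on every run, so deduplication across runs breaks.
  return randomUUID();
}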

Scripts

| Command             | Description                       |
| ------------------- | --------------------------------- |
| npm run build       | Build the SDK with tsup.          |
| npm run dev         | Watch mode — rebuild on changes.  |
| npm run typecheck   | Run TypeScript type checking.     |
| npm run clean       | Remove the dist/ directory.       |

License

MIT