@sendwithxmit/serverless-agent-browser

Run Lightpanda headless browser as a serverless function on AWS Lambda, Vercel, and other Node.js serverless platforms. Includes a built-in crawler with BFS/DFS traversal, sitemap discovery, and robots.txt compliance.

Lightpanda uses 9x less memory and is 11x faster than headless Chrome. It uses a full V8 JavaScript engine, so React/Vue/Angular SPAs render natively. This package bundles the binary compressed with Brotli, decompresses it to /tmp on cold start, and caches it for warm invocations.

Install

npm install @sendwithxmit/serverless-agent-browser

Note: Before publishing, run npm run prepare-binary to download and compress the Lightpanda Linux binary into the package.

Quick Start

One-shot page fetch

import Lightpanda from "@sendwithxmit/serverless-agent-browser";

// Fetch a page as markdown (JS-rendered, works with React/Vue/SPAs)
const markdown = await Lightpanda.fetch("https://example.com", {
  dump: "markdown",
});

// Fetch as HTML
const html = await Lightpanda.fetch("https://example.com", { dump: "html" });

// Fetch semantic tree (great for LLMs)
const tree = await Lightpanda.fetch("https://example.com", {
  dump: "semantic_tree",
});

Discover sitemaps

import { discoverSitemaps } from "@sendwithxmit/serverless-agent-browser/crawler";

const result = await discoverSitemaps("https://example.com");
console.log(`Found ${result.totalUrls} pages across ${result.sitemaps.length} sitemaps`);
console.log(result.urls); // [{ url, lastmod, priority, source }]

Crawl multiple pages

import { crawlBatch } from "@sendwithxmit/serverless-agent-browser/crawler";

// First batch
let result = await crawlBatch({
  url: "https://example.com",
  source: "all",        // discover via sitemaps + link following
  strategy: "bfs",      // breadth-first
  maxPages: 50,
  concurrency: 3,       // 3 pages in parallel per batch
});

console.log(result.pages);  // [{ url, markdown, links, images, metadata, ... }]

// Continue crawling with cursor until done
while (result.cursor) {
  result = await crawlBatch({ url: "https://example.com" }, result.cursor);
  console.log(`Crawled ${result.stats.totalPagesCrawled} pages total`);
}

Start a CDP server (Puppeteer/Playwright)

import Lightpanda from "@sendwithxmit/serverless-agent-browser";

const { wsEndpoint, kill } = await Lightpanda.serve({ port: 9222 });

import puppeteer from "puppeteer-core";
const browser = await puppeteer.connect({ browserWSEndpoint: wsEndpoint });
const page = await browser.newPage();
await page.goto("https://example.com");

kill();

API

Lightpanda.fetch(url, options?)

Single-page fetch using Lightpanda's built-in fetch mode. Returns page content as a string.

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| dump | "html" \| "markdown" \| "semantic_tree" \| "semantic_tree_text" | "html" | Output format |
| timeout | number | 30000 | Timeout in milliseconds |
| withFrames | boolean | false | Include frame content |
| withBase | boolean | false | Include base URL |
| obeyRobots | boolean | false | Respect robots.txt |
| strip | string[] | [] | Strip modes: "js", "css", "ui" |
| insecureTls | boolean | false | Disable TLS host verification (Lambda/AL2023 workaround) |
| userAgentSuffix | string | — | Append to User-Agent header |

Lightpanda.executablePath(input?)

Returns the path to the Lightpanda binary, decompressing from Brotli on first call.

Lightpanda.serve(options?)

Starts a CDP WebSocket server. Returns { process, wsEndpoint, host, port, kill }.

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| host | string | "127.0.0.1" | Host to bind |
| port | number | 9222 | Port to listen on |
| maxConnections | number | 16 | Max simultaneous CDP connections |
| timeout | number | 30 | Inactivity timeout in seconds |


Crawler

Import from @sendwithxmit/serverless-agent-browser/crawler.

crawlBatch(config, cursor?)

Crawl pages with BFS/DFS traversal. Designed for serverless — uses cursor-based pagination to work within time limits (60s on Vercel Pro).

Config options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| url | string | — | Required. Starting URL |
| source | "all" \| "sitemaps" \| "links" | "all" | URL discovery method |
| strategy | "bfs" \| "dfs" | "bfs" | Traversal strategy |
| maxDepth | number | 10 | Max link-follow depth from seed |
| maxPages | number | 100 | Max pages total across all batches |
| concurrency | number | 3 | Pages fetched in parallel per batch |
| timeBudget | number | 55000 | ms budget per invocation (leave margin for response) |
| render | boolean | true | Use Lightpanda JS rendering. false = plain HTTP only |
| obeyRobots | boolean | true | Respect robots.txt and Crawl-delay |
| extractMetadata | boolean | false | Fetch `<head>` metadata (OG, Twitter, JSON-LD). Adds a plain HTTP call per page |
| includeSubdomains | boolean | false | Follow links to subdomains |
| includeExternalLinks | boolean | false | Follow links to external domains |
| includePatterns | string[] | [] | Only crawl URLs matching these patterns (* and ** wildcards) |
| excludePatterns | string[] | [] | Skip URLs matching these patterns (takes priority over include) |
| pageTimeout | number | 15000 | Timeout per page in ms |
| strip | string[] | ["css"] | Strip modes passed to Lightpanda |
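As a rough illustration of how the * and ** wildcards in includePatterns/excludePatterns could behave (* matches within one path segment, ** matches across segments, and exclude wins over include), here is a hedged sketch. matchesPattern and isCrawlable are hypothetical helpers, not the package's actual matcher:

```typescript
// Hypothetical sketch of `*` vs `**` wildcard matching; not the package's
// real implementation, just the semantics described in the table above.
function matchesPattern(url: string, pattern: string): boolean {
  const regexSource = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000")            // placeholder so ** survives the next step
    .replace(/\*/g, "[^/]*")               // * = anything within one path segment
    .replace(/\u0000/g, ".*");             // ** = anything, across segments
  return new RegExp(`^${regexSource}$`).test(url);
}

// Exclude patterns take priority over include, per the table above.
function isCrawlable(url: string, include: string[], exclude: string[]): boolean {
  if (exclude.some((p) => matchesPattern(url, p))) return false;
  if (include.length === 0) return true;
  return include.some((p) => matchesPattern(url, p));
}
```

For example, `https://example.com/blog/*` would match `/blog/post` but not `/blog/2024/post`, while `https://example.com/blog/**` matches both.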

Response:

{
  pages: [{
    url: string;
    status: "completed" | "error" | "disallowed" | "skipped";
    status_code: number | null;
    markdown: string;           // JS-rendered content
    metadata: PageMetadata | null; // when extractMetadata: true
    links: PageLink[];          // { url, anchor_text, type, location }
    images: PageImage[];        // { url, alt }
    depth: number;
    fetchTime: number;          // ms
    error?: string;
  }];
  cursor: string | null;        // null = crawl complete
  stats: {
    pagesCrawled: number;       // this batch
    totalPagesCrawled: number;  // all batches
    pagesRemaining: number;
    elapsed: number;
    stopReason: "complete" | "limit" | "time_budget" | "crawl_delay";
  };
}

discoverSitemaps(domain, options?)

Discover all sitemaps for a domain and enumerate their URLs. No browser rendering needed — all plain HTTP.

const result = await discoverSitemaps("https://example.com", {
  maxSitemaps: 50,  // cap recursion
  maxDepth: 3,      // max sitemap index depth
});

// result: { domain, sitemaps[], totalUrls, urls[], hasMore }

Metadata extraction

When extractMetadata: true, each page gets an extra plain HTTP fetch (~50-100ms) to parse <head> tags:

interface PageMetadata {
  title: string | null;
  meta_description: string | null;
  canonical_url: string | null;
  language: string | null;
  author: string | null;
  robots: string | null;
  og: { title, description, image, url, type, site_name } | null;
  twitter: { card, title, description, image, site } | null;
  jsonLd: unknown[] | null;
}
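To make the `<head>` parsing step concrete, here is a minimal sketch of pulling Open Graph tags out of fetched HTML. extractOpenGraph is a hypothetical helper, not the package's parser, and a real implementation would use an HTML parser rather than a regex:

```typescript
// Illustrative sketch only: collect og:* meta tags from a <head> fragment.
// Assumes attribute order property-then-content and double-quoted values.
function extractOpenGraph(html: string): Record<string, string> {
  const og: Record<string, string> = {};
  const metaTag = /<meta\s+property="og:([^"]+)"\s+content="([^"]*)"/g;
  for (const match of html.matchAll(metaTag)) {
    og[match[1]] = match[2]; // e.g. og.title, og.image
  }
  return og;
}
```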

Link categorization

Links extracted from markdown are automatically categorized:

  • type: "internal" (same domain) or "external"
  • location: "content" (body) or "navigational" (header/footer)
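The internal/external split can be sketched as a hostname comparison between the page URL and the (possibly relative) link target. categorizeLink is a hypothetical helper; the package's actual logic may also account for the includeSubdomains option:

```typescript
// Sketch of internal/external categorization: a link is "internal" when it
// resolves to the same hostname as the page it was found on.
function categorizeLink(pageUrl: string, linkUrl: string): "internal" | "external" {
  try {
    // Relative links resolve against the page URL, so "/about" stays internal.
    return new URL(linkUrl, pageUrl).hostname === new URL(pageUrl).hostname
      ? "internal"
      : "external";
  } catch {
    return "external"; // treat unparsable URLs as external
  }
}
```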

Robots.txt utilities

import {
  fetchRobotsTxt,
  parseRobotsTxt,
  isUrlAllowed,
  getCrawlDelay,
  getSitemapUrls,
} from "@sendwithxmit/serverless-agent-browser/crawler";

const content = await fetchRobotsTxt("https://example.com");
const rules = parseRobotsTxt(content!, "MyCrawler");
console.log(isUrlAllowed("https://example.com/admin", rules)); // false
console.log(getCrawlDelay(rules)); // 2000 (ms) or null
console.log(getSitemapUrls(rules)); // ["https://example.com/sitemap.xml"]

Vercel Deployment

The examples/vercel/ directory contains a working Vercel app with three endpoints:

{
  "functions": {
    "api/**/*.ts": {
      "memory": 1024,
      "maxDuration": 60,
      "includeFiles": "node_modules/@sendwithxmit/serverless-agent-browser/bin/**"
    }
  }
}

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/fetch?url=...&format=... | GET | Single page fetch |
| /api/discover?url=... | GET | Sitemap discovery + page count |
| /api/crawl | POST | Multi-page crawl with cursor pagination |

Crawl endpoint example

# Start a crawl
curl -X POST https://your-app.vercel.app/api/crawl \
  -H 'Content-Type: application/json' \
  -d '{"config": {"url": "https://example.com", "maxPages": 10}}'

# Continue with cursor
curl -X POST https://your-app.vercel.app/api/crawl \
  -H 'Content-Type: application/json' \
  -d '{"config": {"url": "https://example.com"}, "cursor": "base64..."}'
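In application code, the same cursor loop can be driven programmatically. This sketch injects the POST function so the loop stays testable; in real use postCrawl would `fetch()` your deployed /api/crawl endpoint, and both the helper name and the injected-function design are illustrative, not part of the package:

```typescript
// Sketch: drain a crawl by re-POSTing the returned cursor until it is null.
type CrawlResponse = { pages: unknown[]; cursor: string | null };

async function crawlAll(
  config: { url: string; maxPages?: number },
  postCrawl: (body: { config: typeof config; cursor?: string }) => Promise<CrawlResponse>,
): Promise<unknown[]> {
  const pages: unknown[] = [];
  let cursor: string | undefined;
  do {
    const res = await postCrawl(cursor ? { config, cursor } : { config });
    pages.push(...res.pages); // accumulate this batch's pages
    cursor = res.cursor ?? undefined; // null cursor means the crawl is complete
  } while (cursor);
  return pages;
}
```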

AWS Lambda Deployment

import Lightpanda from "@sendwithxmit/serverless-agent-browser";

export const handler = async (event) => {
  const content = await Lightpanda.fetch(event.url, {
    dump: event.format || "markdown",
    insecureTls: true, // Required on Lambda/AL2023
    timeout: 25_000,
  });
  return { statusCode: 200, body: content };
};

Testing

npm test

48 tests covering robots.txt parsing, metadata extraction, link categorization, cursor round-trips, sitemap parsing, and integration tests with real HTTP calls.

CI runs on every push via GitHub Actions (Node 20 + 22).

Binary Preparation

npm run prepare-binary              # Download both x64 and arm64
npm run prepare-binary -- --arch x64  # x64 only
LIGHTPANDA_RELEASE=v1.0.0 npm run prepare-binary  # Specific release

Size Budget

| Component | Size |
|-----------|------|
| Lightpanda binary (uncompressed) | ~107 MB |
| Brotli-compressed (.br) | ~26 MB |
| npm package total | ~27 MB |

How It Works

  1. Build time: prepare-binary downloads the Lightpanda Linux binary and compresses with Brotli (~4x compression)
  2. Cold start: Decompresses binary to /tmp/lightpanda (~3-4s)
  3. Warm start: Binary already cached in /tmp, skip decompression
  4. Fetch: Spawns binary as child process. Full V8 JS engine renders SPAs natively
  5. Crawl: BFS/DFS traversal with cursor-based pagination. Up to 3 pages in parallel per invocation. Respects robots.txt and Crawl-delay
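The cold/warm start behavior in steps 2-3 amounts to a decompress-once cache. This is a sketch of that pattern using Node's built-in Brotli support; the function name and paths are illustrative, not the package's internals:

```typescript
// Sketch of cold-start caching: decompress the Brotli-packed binary to a
// cache path on first call, return the cached copy on warm invocations.
import { brotliDecompressSync } from "node:zlib";
import { existsSync, readFileSync, writeFileSync, chmodSync } from "node:fs";

function ensureBinary(compressedPath: string, cachePath: string): string {
  if (!existsSync(cachePath)) {
    // Cold start: decompress once (the slow ~3-4s step).
    const binary = brotliDecompressSync(readFileSync(compressedPath));
    writeFileSync(cachePath, binary);
    chmodSync(cachePath, 0o755); // the binary must be executable
  }
  // Warm start: the cached copy in /tmp is reused as-is.
  return cachePath;
}
```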

Known Issues

  • TLS on Lambda/AL2023: Lightpanda's Zig-based cert loader doesn't load CA certs correctly on Amazon Linux 2023. Use insecureTls: true as a workaround

License

AGPL-3.0 — Same as Lightpanda