web-page-extractor

v1.1.2

Extract full-page HTML (with inlined CSS), screenshots, and a PDF from any URL using a headless browser. Optionally upload everything to S3. Diff two captured HTML files.

Install

npm install web-page-extractor

Requires Node.js ≥ 18. Puppeteer will download Chromium automatically on first install.


Quick start

const { extract } = require('web-page-extractor');

// `await` must run inside an async function (or an ES module); wrap the snippet accordingly.
const result = await extract('https://example.com');

console.log(result.meta.title);        // "Example Domain"
console.log(result.images);            // ['https://...']
// result.html          — self-contained HTML with CSS inlined
// result.screenshotFull — Buffer (full-page PNG)
// result.screenshotPages — Buffer[] (viewport-sized PNG slices)
// result.pdf           — Buffer (PDF, one page per slice)

API

extract(urls, options?)

Launches a headless browser, loads the page(s), and returns extracted data.

// Single URL → single result object
const result = await extract('https://example.com');

// Multiple URLs → array of result objects (run in parallel)
const results = await extract(['https://example.com', 'https://example.org']);

Options

| Option | Type | Default | Description |
|---|---|---|---|
| viewportWidth | number | 1280 | Browser viewport width (px) |
| viewportHeight | number | 800 | Viewport height (px); also the height of each paged screenshot slice |
| timeout | number | 30000 | Navigation timeout (ms) |
| waitUntil | string | 'networkidle2' | When to consider the page loaded: 'load', 'domcontentloaded', 'networkidle0', 'networkidle2' |
| scrollDelay | number | 300 | Delay between scroll steps (ms); increase for lazy-loaded pages |
| fullPageScreenshot | boolean | true | Capture a single tall full-page PNG |
| pagedScreenshots | boolean | true | Capture viewport-sized PNG slices |
| pageOverlap | number | 0 | Pixel overlap between consecutive slices |
| generatePDF | boolean | true | Merge paged screenshots into a PDF |
| inlineCSS | boolean | true | Fetch and inline external stylesheets, which makes the HTML self-contained |
| extraHeaders | object | {} | Extra HTTP request headers |
| cookieConsent | boolean\|string\|object | false | Auto-click common cookie consent controls before scrolling and capture |
| s3 | S3Options | (none) | Upload all artifacts to S3 (see below) |

Cookie consent bypass

Set cookieConsent when cookie banners block the captured HTML or screenshots.

// Click common "accept all" / "accept cookies" buttons
const accepted = await extract('https://example.com', {
  cookieConsent: true,
});

// Prefer rejecting optional cookies
const rejected = await extract('https://example.com', {
  cookieConsent: 'reject',
});

// Use custom selectors or button text for a specific site
const custom = await extract('https://example.com', {
  cookieConsent: {
    action: 'accept',
    selectors: ['#cookie-accept', '[data-testid="accept-cookies"]'],
    text: ['save choices', 'continue without selecting'],
    attempts: 3,
    settleDelay: 750,
  },
});

Supported actions are 'accept', 'reject', and 'dismiss'. The handler checks the page and accessible frames, tries custom selectors/text first, then falls back to common consent labels.
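
The three accepted shapes of cookieConsent (boolean, action string, config object) can be thought of as collapsing into one config object. The sketch below is illustrative only: normalizeCookieConsent is a hypothetical helper, not part of the package's API, and defaulting a config object without an action to 'accept' is an assumption.

```javascript
// Hypothetical helper: shows how the three documented option shapes
// could map onto a single config object. Not the package's own code.
const SUPPORTED_ACTIONS = ['accept', 'reject', 'dismiss'];

function normalizeCookieConsent(option) {
  // false / undefined: the feature stays off (the documented default)
  if (!option) return null;

  let config;
  if (option === true) {
    config = { action: 'accept' };            // true clicks "accept" buttons
  } else if (typeof option === 'string') {
    config = { action: option };              // e.g. 'reject'
  } else {
    config = { action: 'accept', ...option }; // assumption: 'accept' when omitted
  }

  if (!SUPPORTED_ACTIONS.includes(config.action)) {
    throw new TypeError(`Unsupported cookieConsent action: ${config.action}`);
  }
  // Custom selectors/text are optional; default both to empty lists
  return { selectors: [], text: [], ...config };
}

console.log(normalizeCookieConsent('reject'));
```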

Result object (no S3)

{
  url:             string,      // original URL
  html:            string,      // full HTML with CSS inlined
  images:          string[],    // all image URLs found on the page
  meta:            { title: string, description: string|null },
  screenshotFull:  Buffer,      // full-page PNG
  screenshotPages: Buffer[],    // viewport-sized PNG slices
  pdf:             Buffer,      // merged PDF
  s3:              null,
  error:           Error|null
}
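
Because every result carries an error field, a batch can be split into successes and failures before anything is written to disk. The helper below is a sketch: partitionResults is a hypothetical name, and it assumes failed captures come back with error set instead of throwing, as the multi-URL example in this README suggests.

```javascript
// Hypothetical helper: split extract() results by their `error` field.
// Accepts a single result object or an array of them.
function partitionResults(results) {
  const list = Array.isArray(results) ? results : [results];
  const ok = [];
  const failed = [];
  for (const result of list) {
    (result.error ? failed : ok).push(result);
  }
  return { ok, failed };
}

// Usage with stand-in objects shaped like the documented result
const { ok, failed } = partitionResults([
  { url: 'https://example.com', error: null },
  { url: 'https://example.org', error: new Error('Navigation timeout') },
]);
console.log(ok.length, failed.length);
```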

Result object (with S3)

When s3 options are provided, raw Buffers are replaced with S3 URLs and the result is fully JSON-serializable:

{
  url:    string,
  html:   string,        // HTML string is always returned locally too
  images: string[],
  meta:   { title: string, description: string|null },
  s3: {
    html:            string,    // S3 URL of the HTML file
    screenshotFull:  string,    // S3 URL of the full-page PNG
    screenshotPages: string[],  // S3 URLs of the paged PNG slices
    pdf:             string,    // S3 URL of the PDF
  },
  error: Error|null
}
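
One caveat when persisting this object: JSON.stringify turns an Error into {} because message and stack are not enumerable properties. A minimal sketch of a manifest writer that keeps the error details follows; toManifest is a hypothetical helper, not something the package exports.

```javascript
// Hypothetical helper: serialize a result (S3 variant) to JSON while
// keeping the error's name and message, which JSON.stringify would drop.
function toManifest(result) {
  const error = result.error
    ? { name: result.error.name, message: result.error.message }
    : null;
  return JSON.stringify({ ...result, error }, null, 2);
}

// Usage with a stand-in object shaped like the documented result
const manifest = toManifest({
  url: 'https://example.com',
  s3: null,
  error: new Error('Navigation timeout'),
});
console.log(manifest);
```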

S3 upload

Pass an s3 object in the options to automatically upload all artifacts. Each run is stored under {prefix}/{hostname}/{timestamp}/.

const result = await extract('https://example.com', {
  s3: {
    accessKeyId:     'AKIA...',
    secretAccessKey: 'secret...',
    region:          'us-east-1',
    bucket:          'my-bucket',

    // optional
    prefix:   'web-captures/',   // key prefix / folder
    acl:      'public-read',     // canned ACL (omit for private buckets)
    presign:  3600,              // return pre-signed GET URLs valid for N seconds
    endpoint: 'https://...',     // custom endpoint for MinIO, Cloudflare R2, etc.
  }
});

console.log(result.s3.pdf);
// https://my-bucket.s3.us-east-1.amazonaws.com/web-captures/example.com/2024-04-15T12-00-00/page.pdf
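
To predict where a run will land, the key layout can be reproduced locally. This is a sketch under assumptions: buildRunPrefix is a hypothetical helper, and the timestamp format (ISO time in UTC, seconds precision, colons replaced by hyphens) is inferred from the example URL above, not taken from the package source.

```javascript
// Hypothetical helper: rebuild the {prefix}/{hostname}/{timestamp}/ folder key.
// The timestamp format is inferred from the example URL, not confirmed.
function buildRunPrefix(prefix, pageUrl, date = new Date()) {
  const { hostname } = new URL(pageUrl);
  // '2024-04-15T12:00:00.000Z' -> '2024-04-15T12-00-00'
  const stamp = date.toISOString().slice(0, 19).replace(/:/g, '-');
  // Trim slashes so 'web-captures/' and 'web-captures' behave the same
  const cleanPrefix = (prefix || '').replace(/^\/+|\/+$/g, '');
  return [cleanPrefix, hostname, stamp].filter(Boolean).join('/') + '/';
}

console.log(
  buildRunPrefix('web-captures/', 'https://example.com', new Date('2024-04-15T12:00:00Z'))
);
// web-captures/example.com/2024-04-15T12-00-00/
```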

S3 options

| Option | Type | Required | Description |
|---|---|---|---|
| accessKeyId | string | yes | AWS access key ID |
| secretAccessKey | string | yes | AWS secret access key |
| region | string | yes | AWS region (e.g. 'us-east-1') |
| bucket | string | yes | S3 bucket name |
| prefix | string | no | Key prefix/folder (e.g. 'captures/') |
| acl | string | no | Canned ACL (e.g. 'public-read'). Omit for private buckets |
| presign | number | no | Return pre-signed GET URLs valid for this many seconds |
| endpoint | string | no | Custom endpoint for S3-compatible stores (MinIO, Cloudflare R2, DigitalOcean Spaces, …) |


imagesToPDF(pngBuffers)

Convert an array of PNG buffers into a single PDF. Each image becomes one page, sized to match the image dimensions.

const { imagesToPDF } = require('web-page-extractor');
const fs = require('fs');

const pages = [
  fs.readFileSync('page-1.png'),
  fs.readFileSync('page-2.png'),
];

const pdf = await imagesToPDF(pages);
fs.writeFileSync('output.pdf', pdf);

diffHTML(url1, url2, options?)

Fetch two HTML files and return a line-level diff.

const { diffHTML } = require('web-page-extractor');

const result = await diffHTML(
  'https://bucket.s3.region.amazonaws.com/capture-1/page.html',
  'https://bucket.s3.region.amazonaws.com/capture-2/page.html',
  {
    // optional — only needed for private S3 objects
    s3: {
      accessKeyId:     'AKIA...',
      secretAccessKey: 'secret...',
      region:          'us-east-1',
    }
  }
);

Works with any public HTTPS URL, not just S3.

Diff result

{
  url1:      string,
  url2:      string,
  identical: boolean,
  stats: {
    added:     number,   // lines only in url2
    removed:   number,   // lines only in url1
    unchanged: number,   // lines in both
  },
  unified: string,       // full unified diff (git diff style)
  changes: [             // only added/removed chunks — unchanged lines omitted
    { type: 'removed', value: '<title>Old Title</title>\n', count: 1 },
    { type: 'added',   value: '<title>New Title</title>\n', count: 1 },
    ...
  ]
}

When identical is true, unified is an empty string and changes is empty.
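
The stats block lends itself to simple thresholds, for example flagging a capture when a large share of its lines changed. The sketch below uses a hypothetical helper, changeRatio, and a stand-in stats object; the ratio definition is an illustration, not something the package computes.

```javascript
// Hypothetical helper: share of lines that differ between the two files,
// computed from the documented `stats` fields.
function changeRatio(stats) {
  const changed = stats.added + stats.removed;
  const total = changed + stats.unchanged;
  return total === 0 ? 0 : changed / total;
}

// Usage with a stand-in object shaped like the documented diff result
const sampleStats = { added: 1, removed: 1, unchanged: 8 };
console.log(changeRatio(sampleStats)); // 0.2
```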


Examples

Save everything locally

const { extract } = require('web-page-extractor');
const fs = require('fs');

const result = await extract('https://news.ycombinator.com', {
  viewportWidth: 1440,
  viewportHeight: 900,
  scrollDelay: 150,
});

fs.writeFileSync('page.html', result.html);
fs.writeFileSync('full.png', result.screenshotFull);
fs.writeFileSync('page.pdf', result.pdf);

result.screenshotPages.forEach((buf, i) =>
  fs.writeFileSync(`page-${i + 1}.png`, buf)
);

console.log('Images found:', result.images);

Capture multiple URLs and upload to S3

const { extract } = require('web-page-extractor');

const results = await extract(
  ['https://example.com', 'https://example.org'],
  {
    s3: {
      accessKeyId: process.env.AWS_ACCESS_KEY_ID,
      secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
      region: 'eu-west-1',
      bucket: 'my-captures',
      prefix: 'daily/',
      acl: 'public-read',
    },
  }
);

results.forEach(r => {
  if (r.error) {
    console.error(r.url, r.error.message);
  } else {
    console.log(r.url, '→', r.s3.pdf);
  }
});

Diff two S3 captures

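const { diffHTML } = require('web-page-extractor');

// url1 and url2 point at two previously captured HTML files
const url1 = 'https://bucket.s3.region.amazonaws.com/capture-1/page.html';
const url2 = 'https://bucket.s3.region.amazonaws.com/capture-2/page.html';

const diff = await diffHTML(url1, url2, {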
  s3: { accessKeyId: '...', secretAccessKey: '...', region: 'us-east-1' }
});

console.log(`+${diff.stats.added} -${diff.stats.removed} lines changed`);
diff.changes.forEach(c =>
  console.log(`[${c.type}] ${c.value.trim().slice(0, 80)}`)
);

How it works

  • HTML: page.content() is returned after all external <link rel="stylesheet"> tags are fetched and replaced with inline <style> blocks, making the file self-contained.
  • Images: collected from <img src/srcset>, inline background-image styles, and og:image / twitter:image meta tags.
  • Screenshots: Puppeteer takes a single screenshot with fullPage: true. The full-page buffer is then split pixel-perfectly into viewport-sized slices using pngjs, so the paged screenshots are exact crops of the full capture.
  • PDF: pdfkit assembles the slices into a multi-page PDF where each page is sized to its image.
  • S3: all uploads run in parallel (Promise.all). Keys follow the pattern {prefix}/{hostname}/{ISO-timestamp}/ so each run gets its own folder.
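
The slicing step can be pictured as plain arithmetic over the full-page height. The sketch below is one plausible reading of viewportHeight and pageOverlap, not the package's actual implementation: sliceOffsets is a hypothetical function, and the choice to let the last slice be shorter than the viewport is an assumption.

```javascript
// Hypothetical sketch: compute the crop rectangles (y offset and height)
// for viewport-sized slices of a full-page screenshot.
function sliceOffsets(totalHeight, viewportHeight, overlap = 0) {
  if (overlap < 0 || overlap >= viewportHeight) {
    throw new RangeError('overlap must be >= 0 and smaller than viewportHeight');
  }
  const step = viewportHeight - overlap; // each slice starts `overlap` px early
  const slices = [];
  for (let y = 0; y < totalHeight; y += step) {
    const height = Math.min(viewportHeight, totalHeight - y);
    slices.push({ y, height });
    // Stop once a slice reaches the bottom, so no slice lies fully in the overlap
    if (y + height >= totalHeight) break;
  }
  return slices;
}

console.log(sliceOffsets(2000, 800));
// [ { y: 0, height: 800 }, { y: 800, height: 800 }, { y: 1600, height: 400 } ]
```

With overlap set to 100 the same page yields slices starting at 0, 700, and 1400, so consecutive slices share a 100 px band.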

License

MIT