web-page-extractor

v1.1.2

Published

3 months ago

Extract full-page HTML (with inlined CSS), screenshots, and a PDF from any URL using a headless browser. Optionally upload everything to S3. Diff two captured HTML files.

manasmishra13

puppeteer scraper screenshot html-extractor headless pdf s3 web-capture diff html-diff

Install

npm install web-page-extractor

Requires Node.js ≥ 18. Puppeteer will download Chromium automatically on first install.

Quick start

const { extract } = require('web-page-extractor');

// CommonJS modules have no top-level await, so run the call inside an async function.
(async () => {
  const result = await extract('https://example.com');

  console.log(result.meta.title);        // "Example Domain"
  console.log(result.images);            // ['https://...']
  // result.html            : self-contained HTML with CSS inlined
  // result.screenshotFull  : Buffer (full-page PNG)
  // result.screenshotPages : Buffer[] (viewport-sized PNG slices)
  // result.pdf             : Buffer (PDF, one page per slice)
})();

API

`extract(urls, options?)`

Launches a headless browser, loads the page(s), and returns extracted data.

// Single URL → single result object
const result = await extract('https://example.com');

// Multiple URLs → array of result objects (run in parallel)
const results = await extract(['https://example.com', 'https://example.org']);

Options

| Option | Type | Default | Description |
|---|---|---|---|
| viewportWidth | number | 1280 | Browser viewport width (px) |
| viewportHeight | number | 800 | Viewport height (px); also the height of each paged screenshot slice |
| timeout | number | 30000 | Navigation timeout (ms) |
| waitUntil | string | 'networkidle2' | When to consider the page loaded: 'load', 'domcontentloaded', 'networkidle0', 'networkidle2' |
| scrollDelay | number | 300 | Delay between scroll steps (ms); increase for lazy-loaded pages |
| fullPageScreenshot | boolean | true | Capture a single tall full-page PNG |
| pagedScreenshots | boolean | true | Capture viewport-sized PNG slices |
| pageOverlap | number | 0 | Pixel overlap between consecutive slices |
| generatePDF | boolean | true | Merge paged screenshots into a PDF |
| inlineCSS | boolean | true | Fetch and inline external stylesheets, which makes the HTML self-contained |
| extraHeaders | object | {} | Extra HTTP request headers |
| cookieConsent | boolean \| string \| object | false | Auto-click common cookie consent controls before scrolling and capture |
| s3 | S3Options | none | Upload all artifacts to S3 (see below) |
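
To see how viewportHeight and pageOverlap interact, the sketch below computes the crop regions for a page of a given pixel height. The function name sliceRegions and the treatment of the final, shorter slice are assumptions made for illustration; the package's own slicing may behave differently.

```javascript
// Hypothetical helper (not part of web-page-extractor): list the vertical
// crop regions for a page that is `totalHeight` pixels tall.
function sliceRegions(totalHeight, viewportHeight = 800, pageOverlap = 0) {
  // Each new slice starts this many pixels below the previous one.
  const step = viewportHeight - pageOverlap;
  if (step <= 0) {
    throw new RangeError('pageOverlap must be smaller than viewportHeight');
  }

  const regions = [];
  for (let top = 0; top < totalHeight; top += step) {
    // The last slice may be shorter than the viewport (an assumption here).
    const height = Math.min(viewportHeight, totalHeight - top);
    regions.push({ top, height });
    if (top + height >= totalHeight) break; // reached the bottom of the page
  }
  return regions;
}

console.log(sliceRegions(2000, 800, 0));
// [ { top: 0, height: 800 }, { top: 800, height: 800 }, { top: 1600, height: 400 } ]
console.log(sliceRegions(2000, 800, 100).length); // 3
```

With a non-zero overlap, consecutive slices share pageOverlap pixels, so content cut at a slice boundary appears whole in at least one slice.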

Cookie consent bypass

Set cookieConsent when cookie banners block the captured HTML or screenshots.

// Click common "accept all" / "accept cookies" buttons
const accepted = await extract('https://example.com', {
  cookieConsent: true,
});

// Prefer rejecting optional cookies
const rejected = await extract('https://example.com', {
  cookieConsent: 'reject',
});

// Use custom selectors or button text for a specific site
const custom = await extract('https://example.com', {
  cookieConsent: {
    action: 'accept',
    selectors: ['#cookie-accept', '[data-testid="accept-cookies"]'],
    text: ['save choices', 'continue without selecting'],
    attempts: 3,
    settleDelay: 750,
  },
});

Supported actions are 'accept', 'reject', and 'dismiss'. The handler checks the page and accessible frames, tries custom selectors/text first, then falls back to common consent labels.
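
The three accepted shapes of cookieConsent (boolean, string, object) can be thought of as shorthand for one configuration object. The sketch below shows one way such normalization might look. The function name normalizeCookieConsent and the default values for attempts and settleDelay are placeholders chosen for illustration, not the package's documented defaults.

```javascript
// Hypothetical normalization of the cookieConsent option.
const CONSENT_ACTIONS = ['accept', 'reject', 'dismiss'];

// Assumed placeholder defaults; the real defaults are not documented here.
const CONSENT_DEFAULTS = {
  action: 'accept',
  selectors: [],
  text: [],
  attempts: 1,
  settleDelay: 500,
};

function normalizeCookieConsent(option) {
  if (!option) return null;                        // false / undefined: disabled
  let config;
  if (option === true) {
    config = { ...CONSENT_DEFAULTS };              // true: accept with defaults
  } else if (typeof option === 'string') {
    config = { ...CONSENT_DEFAULTS, action: option };
  } else if (typeof option === 'object') {
    config = { ...CONSENT_DEFAULTS, ...option };   // custom selectors, text, etc.
  } else {
    throw new TypeError('cookieConsent must be a boolean, string, or object');
  }
  if (!CONSENT_ACTIONS.includes(config.action)) {
    throw new RangeError(`Unsupported cookieConsent action: ${config.action}`);
  }
  return config;
}

console.log(normalizeCookieConsent('reject').action); // reject
```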

Result object (no S3)

{
  url:             string,      // original URL
  html:            string,      // full HTML with CSS inlined
  images:          string[],    // all image URLs found on the page
  meta:            { title: string, description: string|null },
  screenshotFull:  Buffer,      // full-page PNG
  screenshotPages: Buffer[],    // viewport-sized PNG slices
  pdf:             Buffer,      // merged PDF
  s3:              null,
  error:           Error|null
}

Result object (with S3)

When s3 options are provided, raw Buffers are replaced with S3 URLs and the result is fully JSON-serializable:

{
  url:    string,
  html:   string,        // HTML string is always returned locally too
  images: string[],
  meta:   { title: string, description: string|null },
  s3: {
    html:            string,    // S3 URL of the HTML file
    screenshotFull:  string,    // S3 URL of the full-page PNG
    screenshotPages: string[],  // S3 URLs of the paged PNG slices
    pdf:             string,    // S3 URL of the PDF
  },
  error: Error|null
}
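
One detail to keep in mind when serializing a result: JSON.stringify turns an Error instance into an empty object, because message and stack are not enumerable. The sketch below shows one possible way to keep the error message when storing results as JSON. The helper name toSerializable is hypothetical and not part of the package.

```javascript
// Hypothetical helper: replace the Error instance with a plain object
// so the failure reason survives JSON.stringify.
function toSerializable(result) {
  const { error, ...rest } = result;
  return {
    ...rest,
    error: error ? { name: error.name, message: error.message } : null,
  };
}

// A mock failed result, shaped like the objects described above.
const failed = {
  url: 'https://example.com',
  s3: null,
  error: new Error('Navigation timeout'),
};

console.log(JSON.stringify(failed.error));
// {}
console.log(JSON.stringify(toSerializable(failed).error));
// {"name":"Error","message":"Navigation timeout"}
```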

S3 upload

Pass an s3 object in the options to automatically upload all artifacts. Each run is stored under {prefix}/{hostname}/{timestamp}/.

const result = await extract('https://example.com', {
  s3: {
    accessKeyId:     'AKIA...',
    secretAccessKey: 'secret...',
    region:          'us-east-1',
    bucket:          'my-bucket',

    // optional
    prefix:   'web-captures/',   // key prefix / folder
    acl:      'public-read',     // canned ACL (omit for private buckets)
    presign:  3600,              // return pre-signed GET URLs valid for N seconds
    endpoint: 'https://...',     // custom endpoint for MinIO, Cloudflare R2, etc.
  }
});

console.log(result.s3.pdf);
// https://my-bucket.s3.us-east-1.amazonaws.com/web-captures/example.com/2024-04-15T12-00-00/page.pdf
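
The key layout can be reproduced with a few lines of standard JavaScript, which is handy for predicting where an artifact will land. This is a sketch only: the function name buildS3Key is hypothetical, and the timestamp format (ISO 8601 without milliseconds, colons replaced by hyphens) is inferred from the example URL above rather than confirmed by the package.

```javascript
// Hypothetical helper: build an object key following the
// {prefix}/{hostname}/{timestamp}/ pattern.
function buildS3Key(prefix, pageUrl, date, filename) {
  const cleanPrefix = (prefix || '').replace(/^\/+|\/+$/g, ''); // trim slashes
  const hostname = new URL(pageUrl).hostname;
  // "2024-04-15T12:00:00.000Z" -> "2024-04-15T12-00-00" (assumed format)
  const timestamp = date.toISOString().slice(0, 19).replace(/:/g, '-');
  return [cleanPrefix, hostname, timestamp, filename].filter(Boolean).join('/');
}

console.log(
  buildS3Key('web-captures/', 'https://example.com', new Date('2024-04-15T12:00:00Z'), 'page.pdf')
);
// web-captures/example.com/2024-04-15T12-00-00/page.pdf
```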

S3 options

| Option | Type | Required | Description |
|---|---|---|---|
| accessKeyId | string | yes | AWS access key ID |
| secretAccessKey | string | yes | AWS secret access key |
| region | string | yes | AWS region (e.g. 'us-east-1') |
| bucket | string | yes | S3 bucket name |
| prefix | string | no | Key prefix/folder (e.g. 'captures/') |
| acl | string | no | Canned ACL (e.g. 'public-read'). Omit for private buckets |
| presign | number | no | Return pre-signed GET URLs valid for this many seconds |
| endpoint | string | no | Custom endpoint for S3-compatible stores (MinIO, Cloudflare R2, DigitalOcean Spaces, …) |
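
Because credentials often come from environment variables that may be unset, it can help to check the required fields before starting a capture. The following is a minimal sketch; missingS3Options is a hypothetical helper, and the package may perform its own validation.

```javascript
// The four options marked "yes" in the Required column above.
const REQUIRED_S3_OPTIONS = ['accessKeyId', 'secretAccessKey', 'region', 'bucket'];

// Hypothetical helper: return the names of required options that are
// missing or empty, so the caller can fail fast with a clear message.
function missingS3Options(s3 = {}) {
  return REQUIRED_S3_OPTIONS.filter(
    (key) => typeof s3[key] !== 'string' || s3[key].length === 0
  );
}

const s3 = {
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
  region: 'eu-west-1',
  bucket: 'my-captures',
};

const missing = missingS3Options(s3);
if (missing.length > 0) {
  console.warn('Missing S3 options:', missing.join(', '));
}
```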

`imagesToPDF(pngBuffers)`

Convert an array of PNG buffers into a single PDF. Each image becomes one page, sized to match the image dimensions.

const { imagesToPDF } = require('web-page-extractor');
const fs = require('fs');

const pages = [
  fs.readFileSync('page-1.png'),
  fs.readFileSync('page-2.png'),
];

const pdf = await imagesToPDF(pages);
fs.writeFileSync('output.pdf', pdf);
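
Sizing each PDF page to its image requires knowing the PNG's pixel dimensions. As an illustration of where that information lives, the sketch below reads width and height directly from the PNG header (the IHDR chunk) using only Node's Buffer API. The helper pngSize is hypothetical; the package itself may obtain dimensions through a PNG library instead.

```javascript
// The fixed 8-byte signature that starts every PNG file.
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Hypothetical helper: read the pixel dimensions from a PNG buffer.
// Layout: signature (8 bytes), chunk length (4), chunk type "IHDR" (4),
// then width and height as big-endian 32-bit integers.
function pngSize(buffer) {
  if (buffer.length < 24 || !buffer.subarray(0, 8).equals(PNG_SIGNATURE)) {
    throw new TypeError('Not a PNG buffer');
  }
  if (buffer.toString('ascii', 12, 16) !== 'IHDR') {
    throw new TypeError('PNG is missing its IHDR chunk');
  }
  return {
    width: buffer.readUInt32BE(16),
    height: buffer.readUInt32BE(20),
  };
}

// Usage (assuming page-1.png exists):
// const { width, height } = pngSize(require('fs').readFileSync('page-1.png'));
```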

`diffHTML(url1, url2, options?)`

Fetch two HTML files and return a line-level diff.

const { diffHTML } = require('web-page-extractor');

const result = await diffHTML(
  'https://bucket.s3.region.amazonaws.com/capture-1/page.html',
  'https://bucket.s3.region.amazonaws.com/capture-2/page.html',
  {
    // optional — only needed for private S3 objects
    s3: {
      accessKeyId:     'AKIA...',
      secretAccessKey: 'secret...',
      region:          'us-east-1',
    }
  }
);

Works with any public HTTPS URL, not just S3.

Diff result

{
  url1:      string,
  url2:      string,
  identical: boolean,
  stats: {
    added:     number,   // lines only in url2
    removed:   number,   // lines only in url1
    unchanged: number,   // lines in both
  },
  unified: string,       // full unified diff (git diff style)
  changes: [             // only added/removed chunks — unchanged lines omitted
    { type: 'removed', value: '<title>Old Title</title>\n', count: 1 },
    { type: 'added',   value: '<title>New Title</title>\n', count: 1 },
    ...
  ]
}

When identical is true, unified is an empty string and changes is empty.
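
To make the meaning of added, removed, and unchanged concrete, here is a small line-level diff based on the longest common subsequence (LCS). It is an illustrative sketch, not the package's implementation: the function name lineDiff is hypothetical, it emits one change entry per line instead of grouped chunks with a count, and it uses memory proportional to the product of the two line counts, so it suits small inputs only.

```javascript
// Hypothetical LCS-based line diff producing a result shaped like the one above.
function lineDiff(oldText, newText) {
  const a = oldText.split('\n');
  const b = newText.split('\n');

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk both inputs, recording only lines that differ.
  const changes = [];
  let i = 0;
  let j = 0;
  while (i < a.length || j < b.length) {
    if (i < a.length && j < b.length && a[i] === b[j]) {
      i++; j++;                                         // unchanged line
    } else if (j < b.length && (i === a.length || lcs[i][j + 1] > lcs[i + 1][j])) {
      changes.push({ type: 'added', value: b[j++] });   // line only in newText
    } else {
      changes.push({ type: 'removed', value: a[i++] }); // line only in oldText
    }
  }

  const added = changes.filter((c) => c.type === 'added').length;
  return {
    identical: changes.length === 0,
    stats: { added, removed: changes.length - added, unchanged: lcs[0][0] },
    changes,
  };
}

console.log(lineDiff('a\nb\nc', 'a\nx\nc').stats);
// { added: 1, removed: 1, unchanged: 2 }
```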

Examples

Save everything locally

const { extract } = require('web-page-extractor');
const fs = require('fs');

const result = await extract('https://news.ycombinator.com', {
  viewportWidth: 1440,
  viewportHeight: 900,
  scrollDelay: 150,
});

fs.writeFileSync('page.html', result.html);
fs.writeFileSync('full.png', result.screenshotFull);
fs.writeFileSync('page.pdf', result.pdf);

result.screenshotPages.forEach((buf, i) =>
  fs.writeFileSync(`page-${i + 1}.png`, buf)
);

console.log('Images found:', result.images);

Capture multiple URLs and upload to S3

const results = await extract(
  ['https://example.com', 'https://example.org'],
  {
    s3: {
      accessKeyId: process.env.AWS_ACCESS_KEY_ID,
      secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
      region: 'eu-west-1',
      bucket: 'my-captures',
      prefix: 'daily/',
      acl: 'public-read',
    },
  }
);

results.forEach(r => {
  if (r.error) {
    console.error(r.url, r.error.message);
  } else {
    console.log(r.url, '→', r.s3.pdf);
  }
});

Diff two S3 captures

const { diffHTML } = require('web-page-extractor');

// url1 and url2 are placeholders for the URLs of two captured page.html files.
const diff = await diffHTML(url1, url2, {
  s3: { accessKeyId: '...', secretAccessKey: '...', region: 'us-east-1' }
});

console.log(`+${diff.stats.added} -${diff.stats.removed} lines changed`);
diff.changes.forEach(c =>
  console.log(`[${c.type}] ${c.value.trim().slice(0, 80)}`)
);

How it works

HTML — page.content() is returned after all external <link rel="stylesheet"> tags are fetched and replaced with inline <style> blocks, making the file self-contained.
Images — collected from <img src/srcset>, inline background-image styles, and og:image / twitter:image meta tags.
Screenshots — a single fullPage: true Puppeteer screenshot is taken. The full-page buffer is then split pixel-perfectly into viewport-sized slices using pngjs, so the paged screenshots are exact crops of the full capture.
PDF — pdfkit assembles the slices into a multi-page PDF where each page is sized to its image.
S3 — all uploads run in parallel (Promise.all). Keys follow the pattern {prefix}/{hostname}/{ISO-timestamp}/ so each run gets its own folder.
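
The image collection step above can be sketched with plain string handling. The helpers below (urlsFromSrcset, urlsFromInlineStyle, collectImageUrls) are hypothetical names, and the parsing is deliberately naive: splitting srcset on commas breaks for URLs that themselves contain commas, such as some data URIs. The package's own extraction runs inside the browser and may differ.

```javascript
// Hypothetical helper: extract candidate URLs from an <img srcset> value.
// Each candidate looks like "url descriptor", separated by commas.
function urlsFromSrcset(srcset, baseUrl) {
  return srcset
    .split(',')
    .map((candidate) => candidate.trim().split(/\s+/)[0])
    .filter(Boolean)
    .map((url) => new URL(url, baseUrl).href); // resolve relative URLs
}

// Hypothetical helper: extract url(...) references from an inline style string.
function urlsFromInlineStyle(style, baseUrl) {
  const matches = style.matchAll(/url\(\s*(['"]?)(.*?)\1\s*\)/g);
  return Array.from(matches, (m) => new URL(m[2], baseUrl).href);
}

// Combine both sources and drop duplicates while keeping first-seen order.
function collectImageUrls({ srcsets = [], styles = [] }, baseUrl) {
  const urls = [
    ...srcsets.flatMap((s) => urlsFromSrcset(s, baseUrl)),
    ...styles.flatMap((s) => urlsFromInlineStyle(s, baseUrl)),
  ];
  return [...new Set(urls)];
}

console.log(urlsFromSrcset('/a.jpg 1x, /b.jpg 2x', 'https://example.com/page'));
// [ 'https://example.com/a.jpg', 'https://example.com/b.jpg' ]
```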

License

MIT
