web-page-extractor
v1.1.2
Extract full-page HTML (with inlined CSS), screenshots, and a PDF from any URL using a headless browser. Optionally upload everything to S3. Diff two captured HTML files.
Install
npm install web-page-extractor
Requires Node.js ≥ 18. Puppeteer will download Chromium automatically on first install.
Quick start
const { extract } = require('web-page-extractor');

// CommonJS has no top-level await, so run the call inside an async function
// (the later snippets assume the same async context).
async function main() {
  const result = await extract('https://example.com');
  console.log(result.meta.title); // "Example Domain"
  console.log(result.images); // ['https://...']
  // result.html: self-contained HTML with CSS inlined
  // result.screenshotFull: Buffer (full-page PNG)
  // result.screenshotPages: Buffer[] (viewport-sized PNG slices)
  // result.pdf: Buffer (PDF, one page per slice)
}

main().catch(console.error);
API
extract(urls, options?)
Launches a headless browser, loads the page(s), and returns extracted data.
// Single URL → single result object
const result = await extract('https://example.com');
// Multiple URLs → array of result objects (run in parallel)
const results = await extract(['https://example.com', 'https://example.org']);
Options
| Option | Type | Default | Description |
|---|---|---|---|
| viewportWidth | number | 1280 | Browser viewport width (px) |
| viewportHeight | number | 800 | Viewport height — also the height of each paged screenshot slice |
| timeout | number | 30000 | Navigation timeout (ms) |
| waitUntil | string | 'networkidle2' | When to consider the page loaded: 'load', 'domcontentloaded', 'networkidle0', 'networkidle2' |
| scrollDelay | number | 300 | ms between scroll steps — increase for lazy-loaded pages |
| fullPageScreenshot | boolean | true | Capture a single tall full-page PNG |
| pagedScreenshots | boolean | true | Capture viewport-sized PNG slices |
| pageOverlap | number | 0 | Pixel overlap between consecutive slices |
| generatePDF | boolean | true | Merge paged screenshots into a PDF |
| inlineCSS | boolean | true | Fetch and inline external stylesheets — makes the HTML self-contained |
| extraHeaders | object | {} | Extra HTTP request headers |
| cookieConsent | boolean \| string \| object | false | Auto-click common cookie consent controls before scrolling and capture |
| s3 | S3Options | — | Upload all artifacts to S3 (see below) |
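To get a feel for how viewportHeight and pageOverlap interact, the sketch below estimates where each paged slice would start on a page of a given total height. sliceOffsets is a hypothetical helper, not part of the package. It assumes each slice starts (viewportHeight - pageOverlap) pixels below the previous one and that the last slice may be shorter than the viewport; the package's actual slicing may differ.

```javascript
// Hypothetical helper, not part of web-page-extractor's API.
// Assumes each slice starts (viewportHeight - pageOverlap) px below the previous one.
function sliceOffsets(totalHeight, viewportHeight = 800, pageOverlap = 0) {
  const step = viewportHeight - pageOverlap;
  if (step <= 0) {
    throw new RangeError('pageOverlap must be smaller than viewportHeight');
  }
  const slices = [];
  for (let top = 0; top < totalHeight; top += step) {
    // The final slice is clipped to the remaining page height (an assumption).
    slices.push({ top, height: Math.min(viewportHeight, totalHeight - top) });
    if (top + viewportHeight >= totalHeight) break; // this slice reached the bottom
  }
  return slices;
}

console.log(sliceOffsets(2000, 800, 0));
// [ { top: 0, height: 800 }, { top: 800, height: 800 }, { top: 1600, height: 400 } ]
console.log(sliceOffsets(2000, 800, 100).map((s) => s.top));
// [ 0, 700, 1400 ]
```

A larger pageOverlap repeats more content between consecutive slices, which can help when text would otherwise be cut at a slice boundary.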
Cookie consent bypass
Set cookieConsent when cookie banners block the captured HTML or screenshots.
// Click common "accept all" / "accept cookies" buttons
const accepted = await extract('https://example.com', {
  cookieConsent: true,
});
// Prefer rejecting optional cookies
const rejected = await extract('https://example.com', {
  cookieConsent: 'reject',
});
// Use custom selectors or button text for a specific site
const custom = await extract('https://example.com', {
  cookieConsent: {
    action: 'accept',
    selectors: ['#cookie-accept', '[data-testid="accept-cookies"]'],
    text: ['save choices', 'continue without selecting'],
    attempts: 3,
    settleDelay: 750,
  },
});
Supported actions are 'accept', 'reject', and 'dismiss'. The handler checks the page and accessible frames, tries custom selectors/text first, then falls back to common consent labels.
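The three accepted forms of cookieConsent (boolean, string, object) can be read as shorthand for one object shape. The sketch below normalizes them and rejects unknown actions before a capture is started. normalizeCookieConsent is a hypothetical helper, not part of the package; treating true as 'accept' is an assumption based on the first example above, and no defaults for attempts or settleDelay are assumed.

```javascript
// Hypothetical helper, not part of web-page-extractor's API.
const ACTIONS = ['accept', 'reject', 'dismiss'];

function normalizeCookieConsent(option) {
  if (!option) return null; // false or undefined: consent handling disabled
  if (option === true) option = { action: 'accept' }; // assumed meaning of `true`
  if (typeof option === 'string') option = { action: option };
  const action = option.action || 'accept';
  if (!ACTIONS.includes(action)) {
    throw new TypeError(`Unknown cookieConsent action: ${action}`);
  }
  // Custom selectors/text are kept as-is; empty arrays mean "fallback labels only".
  return {
    ...option,
    action,
    selectors: option.selectors || [],
    text: option.text || [],
  };
}

console.log(normalizeCookieConsent('reject'));
// { action: 'reject', selectors: [], text: [] }
```

Validating the option up front surfaces a typo such as 'decline' immediately instead of after a browser launch.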
Result object (no S3)
{
  url: string,               // original URL
  html: string,              // full HTML with CSS inlined
  images: string[],          // all image URLs found on the page
  meta: { title: string, description: string|null },
  screenshotFull: Buffer,    // full-page PNG
  screenshotPages: Buffer[], // viewport-sized PNG slices
  pdf: Buffer,               // merged PDF
  s3: null,
  error: Error|null
}
Result object (with S3)
When s3 options are provided, raw Buffers are replaced with S3 URLs and the result is fully JSON-serializable:
{
  url: string,
  html: string,                // HTML string is always returned locally too
  images: string[],
  meta: { title: string, description: string|null },
  s3: {
    html: string,              // S3 URL of the HTML file
    screenshotFull: string,    // S3 URL of the full-page PNG
    screenshotPages: string[], // S3 URLs of the paged PNG slices
    pdf: string,               // S3 URL of the PDF
  },
  error: Error|null
}
S3 upload
Pass an s3 object in the options to automatically upload all artifacts. Each run is stored under {prefix}/{hostname}/{timestamp}/.
const result = await extract('https://example.com', {
  s3: {
    accessKeyId: 'AKIA...',
    secretAccessKey: 'secret...',
    region: 'us-east-1',
    bucket: 'my-bucket',
    // optional
    prefix: 'web-captures/',  // key prefix / folder
    acl: 'public-read',       // canned ACL (omit for private buckets)
    presign: 3600,            // return pre-signed GET URLs valid for N seconds
    endpoint: 'https://...',  // custom endpoint for MinIO, Cloudflare R2, etc.
  }
});
console.log(result.s3.pdf);
// https://my-bucket.s3.us-east-1.amazonaws.com/web-captures/example.com/2024-04-15T12-00-00/page.pdf
S3 options
| Option | Type | Required | Description |
|---|---|---|---|
| accessKeyId | string | yes | AWS access key ID |
| secretAccessKey | string | yes | AWS secret access key |
| region | string | yes | AWS region (e.g. 'us-east-1') |
| bucket | string | yes | S3 bucket name |
| prefix | string | no | Key prefix/folder (e.g. 'captures/') |
| acl | string | no | Canned ACL (e.g. 'public-read'). Omit for private buckets |
| presign | number | no | Return pre-signed GET URLs valid for this many seconds |
| endpoint | string | no | Custom endpoint for S3-compatible stores (MinIO, Cloudflare R2, DigitalOcean Spaces, …) |
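The {prefix}/{hostname}/{timestamp}/ layout can be reproduced locally, for example to predict where a capture will land or to group earlier runs by host. The sketch below is a hypothetical helper, not part of the package. The timestamp format is an assumption inferred from the example URL above (2024-04-15T12-00-00), and trimming stray slashes from the prefix is a choice made here, not documented behavior.

```javascript
// Hypothetical helper mirroring the documented {prefix}/{hostname}/{timestamp}/ layout.
function captureKeyPrefix(pageUrl, prefix = '', date = new Date()) {
  const { hostname } = new URL(pageUrl);
  // Assumed format: ISO timestamp to the second, with ':' replaced by '-'.
  const timestamp = date.toISOString().slice(0, 19).replace(/:/g, '-');
  // Trim leading/trailing slashes so 'web-captures/' does not produce '//'.
  const cleanPrefix = prefix.replace(/^\/+|\/+$/g, '');
  return [cleanPrefix, hostname, timestamp].filter(Boolean).join('/') + '/';
}

console.log(
  captureKeyPrefix(
    'https://example.com/some/page',
    'web-captures/',
    new Date('2024-04-15T12:00:00Z')
  )
);
// web-captures/example.com/2024-04-15T12-00-00/
```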
imagesToPDF(pngBuffers)
Convert an array of PNG buffers into a single PDF. Each image becomes one page, sized to match the image dimensions.
const { imagesToPDF } = require('web-page-extractor');
const fs = require('fs');
const pages = [
  fs.readFileSync('page-1.png'),
  fs.readFileSync('page-2.png'),
];
const pdf = await imagesToPDF(pages);
fs.writeFileSync('output.pdf', pdf);
diffHTML(url1, url2, options?)
Fetch two HTML files and return a line-level diff.
const { diffHTML } = require('web-page-extractor');
const result = await diffHTML(
  'https://bucket.s3.region.amazonaws.com/capture-1/page.html',
  'https://bucket.s3.region.amazonaws.com/capture-2/page.html',
  {
    // optional: only needed for private S3 objects
    s3: {
      accessKeyId: 'AKIA...',
      secretAccessKey: 'secret...',
      region: 'us-east-1',
    }
  }
);
Works with any public HTTPS URL, not just S3.
Diff result
{
  url1: string,
  url2: string,
  identical: boolean,
  stats: {
    added: number,     // lines only in url2
    removed: number,   // lines only in url1
    unchanged: number, // lines in both
  },
  unified: string,     // full unified diff (git diff style)
  changes: [           // only added/removed chunks; unchanged lines omitted
    { type: 'removed', value: '<title>Old Title</title>\n', count: 1 },
    { type: 'added', value: '<title>New Title</title>\n', count: 1 },
    ...
  ]
}
When identical is true, unified is an empty string and changes is empty.
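To illustrate what the stats fields count, here is a minimal line-level comparison based on a longest-common-subsequence table. lineDiffStats is a hypothetical helper written for this README, not the algorithm diffHTML uses internally, and its numbers may differ from the package's on edge cases such as trailing newlines.

```javascript
// Hypothetical helper: counts added/removed/unchanged lines between two strings.
// Uses an O(n*m) LCS table, so it is only suitable for small inputs.
function lineDiffStats(a, b) {
  const x = a.split('\n');
  const y = b.split('\n');
  // lcs[i][j] = length of the longest common subsequence of x[i..] and y[j..]
  const lcs = Array.from({ length: x.length + 1 }, () =>
    new Array(y.length + 1).fill(0)
  );
  for (let i = x.length - 1; i >= 0; i--) {
    for (let j = y.length - 1; j >= 0; j--) {
      lcs[i][j] =
        x[i] === y[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  const unchanged = lcs[0][0];
  return {
    added: y.length - unchanged,   // lines only in b
    removed: x.length - unchanged, // lines only in a
    unchanged,                     // lines in both, in the same order
    identical: a === b,
  };
}

console.log(lineDiffStats('a\nb\nc', 'a\nx\nc'));
// { added: 1, removed: 1, unchanged: 2, identical: false }
```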
Examples
Save everything locally
const { extract } = require('web-page-extractor');
const fs = require('fs');
const result = await extract('https://news.ycombinator.com', {
  viewportWidth: 1440,
  viewportHeight: 900,
  scrollDelay: 150,
});
fs.writeFileSync('page.html', result.html);
fs.writeFileSync('full.png', result.screenshotFull);
fs.writeFileSync('page.pdf', result.pdf);
result.screenshotPages.forEach((buf, i) =>
  fs.writeFileSync(`page-${i + 1}.png`, buf)
);
console.log('Images found:', result.images);
Capture multiple URLs and upload to S3
const results = await extract(
  ['https://example.com', 'https://example.org'],
  {
    s3: {
      accessKeyId: process.env.AWS_ACCESS_KEY_ID,
      secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
      region: 'eu-west-1',
      bucket: 'my-captures',
      prefix: 'daily/',
      acl: 'public-read',
    },
  }
);
results.forEach(r => {
  if (r.error) {
    console.error(r.url, r.error.message);
  } else {
    console.log(r.url, '→', r.s3.pdf);
  }
});
Diff two S3 captures
const { diffHTML } = require('web-page-extractor');
const diff = await diffHTML(url1, url2, {
  s3: { accessKeyId: '...', secretAccessKey: '...', region: 'us-east-1' }
});
console.log(`+${diff.stats.added} -${diff.stats.removed} lines changed`);
diff.changes.forEach(c =>
  console.log(`[${c.type}] ${c.value.trim().slice(0, 80)}`)
);
How it works
- HTML: `page.content()` is returned after all external `<link rel="stylesheet">` tags are fetched and replaced with inline `<style>` blocks, making the file self-contained.
- Images: collected from `<img src/srcset>`, inline `background-image` styles, and `og:image` / `twitter:image` meta tags.
- Screenshots: a single `fullPage: true` Puppeteer screenshot is taken. The full-page buffer is then split pixel-perfectly into viewport-sized slices using `pngjs`, so the paged screenshots are exact crops of the full capture.
- PDF: `pdfkit` assembles the slices into a multi-page PDF where each page is sized to its image.
- S3: all uploads run in parallel (`Promise.all`). Keys follow the pattern `{prefix}/{hostname}/{ISO-timestamp}/` so each run gets its own folder.
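As a rough illustration of the image collection step, the sketch below pulls URLs out of a srcset attribute and an inline background-image style, then resolves and de-duplicates them against the page URL. All three functions are hypothetical and deliberately simplified (for instance, URLs containing a comma followed by a space are not handled); they are not the package's code.

```javascript
// Hypothetical helpers; simplified parsing, not the package's implementation.

// "a.png 1x, b.png 2x" -> ['a.png', 'b.png']
function urlsFromSrcset(srcset) {
  return srcset
    .split(/,\s+/) // candidates are separated by a comma plus whitespace
    .map((candidate) => candidate.trim().split(/\s+/)[0]) // drop the 1x / 480w descriptor
    .filter(Boolean);
}

// "background-image: url('hero.jpg'), url(tile.png)" -> ['hero.jpg', 'tile.png']
function urlsFromBackgroundImage(style) {
  const urls = [];
  const re = /url\(\s*(['"]?)(.*?)\1\s*\)/g; // optional matching quotes around the URL
  let match;
  while ((match = re.exec(style)) !== null) urls.push(match[2]);
  return urls;
}

// Resolve relative URLs against the page URL and remove duplicates.
function resolveAll(urls, baseUrl) {
  return [...new Set(urls.map((u) => new URL(u, baseUrl).href))];
}

const found = [
  ...urlsFromSrcset('a.png 1x, /b.png 2x'),
  ...urlsFromBackgroundImage("background-image: url('a.png')"),
];
console.log(resolveAll(found, 'https://example.com/x/'));
// [ 'https://example.com/x/a.png', 'https://example.com/b.png' ]
```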
License
MIT
