# srcfull
srcfull is a package-first toolkit for extracting and upgrading web image URLs.
It is designed as a standalone library and CLI for image extraction and source resolution. The focus is:
- extract image candidates from HTML
- filter obvious junk like logos and icons
- resolve CDN/transformed URLs back to larger originals
- probe likely source variants when no curated pattern exists
- optionally plug in HTML fetchers like ScrapingBee and fallback image providers like Firecrawl
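The "resolve transformed URLs back to larger originals" step can be sketched generically as stripping common resize query parameters (an illustration of the idea only; srcfull's real resolver uses curated patterns and variant probing):

```ts
// Query params commonly added by image CDNs to downscale an original
// (illustrative list, not srcfull's actual pattern data).
const RESIZE_PARAMS = ["w", "h", "q", "width", "height", "quality", "fit"];

function stripResizeParams(rawUrl: string): string {
  const url = new URL(rawUrl);
  for (const param of RESIZE_PARAMS) {
    url.searchParams.delete(param);
  }
  return url.toString();
}

// stripResizeParams("https://cdn.example.com/image.jpg?w=400&q=80")
// yields the bare "https://cdn.example.com/image.jpg"
```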
It handles the page-shape problems that usually make this kind of package annoying in practice:
- relative image paths resolved against the page URL
- lazy-loaded image attributes like `data-src`, `data-srcset`, and `data-original`
- `img srcset`, `picture source`, inline background images, and social/meta image tags
- private-host blocking for both page scraping and image validation
- `HEAD` fallback to ranged `GET` for hosts that refuse metadata requests
- persistent file-backed cache/pattern stores for repeat runs
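Handling `srcset` means picking the best candidate from a comma-separated list of URL/descriptor pairs. A minimal sketch of that selection (generic illustration, not srcfull's parser) looks like:

```ts
// Pick the largest-width candidate from an img srcset value.
// Density descriptors like "2x" are treated as width 0 here for simplicity.
function largestFromSrcset(srcset: string): string | null {
  const candidates = srcset
    .split(",")
    .map((part) => part.trim().split(/\s+/))
    .filter(([url]) => url.length > 0)
    .map(([url, descriptor]) => ({
      url,
      width: descriptor?.endsWith("w") ? parseInt(descriptor, 10) : 0,
    }));
  if (candidates.length === 0) return null;
  candidates.sort((a, b) => b.width - a.width);
  return candidates[0].url;
}
```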
## Install

```sh
pnpm install
pnpm build
```

## Library Usage
```ts
import { scrapePage, resolveImageUrl } from "srcfull";

const resolved = await resolveImageUrl(
  "https://cdn.example.com/image.jpg?w=400&q=80"
);

const page = await scrapePage("https://example.com/product-page");
```

`scrapePage()` normalizes relative candidates against the page URL before validation and resolution, so typical product/article HTML works without extra preprocessing.
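That normalization step works, in principle, like resolving each raw `src` against the page URL with the WHATWG `URL` constructor (a generic illustration of the behavior, not srcfull's internal code):

```ts
// Resolve a relative image candidate against the page it came from.
function normalizeCandidate(rawSrc: string, pageUrl: string): string {
  return new URL(rawSrc, pageUrl).toString();
}

// normalizeCandidate("/images/hero.jpg", "https://example.com/product-page")
// resolves to "https://example.com/images/hero.jpg"
```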
If you need rendered HTML instead of plain fetch, inject a custom fetcher:
```ts
import { scrapePage } from "srcfull";
import { createScrapingBeeHtmlFetcher } from "srcfull/providers/scrapingbee";

const fetchHtml = createScrapingBeeHtmlFetcher({
  apiKey: process.env.SCRAPINGBEE_API_KEY!,
});

const result = await scrapePage("https://example.com", { fetchHtml });
```

If you want the built-in fetcher with different timeout or header behavior:
```ts
import { createDefaultHtmlFetcher, scrapePage } from "srcfull";

const fetchHtml = createDefaultHtmlFetcher({
  timeoutMs: 15_000,
  headers: {
    "Accept-Language": "en-GB,en;q=0.9",
  },
});

const result = await scrapePage("https://example.com", { fetchHtml });
```

For image-only fallback:

```ts
import { createFirecrawlImageFallback } from "srcfull/providers/firecrawl";
```

If you want candidate extraction without the rest of the pipeline:
```ts
import { extractImageCandidatesFromHtml } from "srcfull";

const candidates = extractImageCandidatesFromHtml(
  html,
  "https://example.com/product-page"
);
```

For repeat jobs, persist cache and learned patterns on disk:
```ts
import {
  createFileCache,
  createFilePatternStore,
  resolveImageUrl,
} from "srcfull";

const cache = createFileCache({ filePath: ".srcfull/cache.json" });
const patternStore = createFilePatternStore({
  filePath: ".srcfull/patterns.json",
});

const result = await resolveImageUrl("https://cdn.example.com/photo.jpg?w=400", {
  cache,
  patternStore,
});
```

## CLI
```sh
srcfull resolve 'https://cdn.example.com/photo.jpg?w=300'
srcfull scrape 'https://example.com/listing' --max-images=12
srcfull scrape 'https://example.com/listing' --max-images=12 --min-size=300 --resolve-concurrency=8
srcfull --version
```

The JSON response from `scrape` includes `stats.returned` as well as `found`, `resolved`, `failed`, and `durationMs`.
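Based on the field names listed above, the stats object has roughly this shape (the type and sample values are illustrative, not srcfull's published typings):

```ts
// Shape of the scrape stats, inferred from the documented field names.
interface ScrapeStats {
  found: number; // candidates discovered in the HTML
  resolved: number; // candidates successfully resolved
  failed: number; // candidates that failed resolution
  returned: number; // images included in the response (after limits/filters)
  durationMs: number; // wall-clock time for the whole scrape
}

const example: ScrapeStats = {
  found: 18,
  resolved: 12,
  failed: 6,
  returned: 12,
  durationMs: 2340,
};
```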
## Demo Page

There is a self-contained demo page at `docs/demo/index.html`.

```sh
pnpm demo:build
pnpm demo:serve
```

The page is generated from real calls to the package, so the HTML samples, extracted candidates, resolved URLs, and persisted cache/pattern snapshots are actual outputs rather than hand-written mockups.
## Development

```sh
pnpm test
pnpm test:live-patterns
pnpm typecheck
pnpm build
```

`pnpm test:live-patterns` revalidates the researched real-world CDN fixtures in `test/fixtures/curated-patterns.json` against the network.
