# srcfull
srcfull is a package-first toolkit for extracting and upgrading web image URLs.
It is designed as a standalone library and CLI for image extraction and source resolution. The focus is:
- extract image candidates from HTML
- filter obvious junk like logos and icons
- resolve CDN/transformed URLs back to larger originals
- probe likely source variants when no curated pattern exists
- optionally plug in HTML fetchers like ScrapingBee and fallback image providers like Firecrawl
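The "resolve transformed URLs back to larger originals" step can be sketched generically as stripping common resize query parameters (an illustration of the idea only; srcfull's real resolver uses curated patterns and variant probing):

```ts
// Query params commonly added by image CDNs to downscale an original
// (illustrative list, not srcfull's actual pattern data).
const RESIZE_PARAMS = ["w", "h", "q", "width", "height", "quality", "fit"];

function stripResizeParams(rawUrl: string): string {
  const url = new URL(rawUrl);
  for (const param of RESIZE_PARAMS) {
    url.searchParams.delete(param);
  }
  return url.toString();
}

// stripResizeParams("https://cdn.example.com/image.jpg?w=400&q=80")
// yields the bare "https://cdn.example.com/image.jpg"
```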
It handles the page-shape problems that usually make this kind of package annoying in practice:
- relative image paths resolved against the page URL
- lazy-loaded image attributes like `data-src`, `data-srcset`, and `data-original`
- `img srcset`, `picture source`, inline background images, and social/meta image tags
- private-host blocking for both page scraping and image validation
- `HEAD` fallback to ranged `GET` for hosts that refuse metadata requests
- persistent file-backed cache/pattern stores for repeat runs
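Handling `srcset` means picking the best candidate from a comma-separated list of URL/descriptor pairs. A minimal sketch of that selection (generic illustration, not srcfull's parser) looks like:

```ts
// Pick the largest-width candidate from an img srcset value.
// Density descriptors like "2x" are treated as width 0 here for simplicity.
function largestFromSrcset(srcset: string): string | null {
  const candidates = srcset
    .split(",")
    .map((part) => part.trim().split(/\s+/))
    .filter(([url]) => url.length > 0)
    .map(([url, descriptor]) => ({
      url,
      width: descriptor?.endsWith("w") ? parseInt(descriptor, 10) : 0,
    }));
  if (candidates.length === 0) return null;
  candidates.sort((a, b) => b.width - a.width);
  return candidates[0].url;
}
```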
## Install

```sh
pnpm install
pnpm build
```

## Library Usage
```ts
import { scrapePage, resolveImageUrl } from "srcfull";

const resolved = await resolveImageUrl(
  "https://cdn.example.com/image.jpg?w=400&q=80"
);

const page = await scrapePage("https://example.com/product-page");
```

`scrapePage()` normalizes relative candidates against the page URL before validation and resolution, so typical product/article HTML works without extra preprocessing.
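That normalization step works, in principle, like resolving each raw `src` against the page URL with the WHATWG `URL` constructor (a generic illustration of the behavior, not srcfull's internal code):

```ts
// Resolve a relative image candidate against the page it came from.
function normalizeCandidate(rawSrc: string, pageUrl: string): string {
  return new URL(rawSrc, pageUrl).toString();
}

// normalizeCandidate("/images/hero.jpg", "https://example.com/product-page")
// resolves to "https://example.com/images/hero.jpg"
```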
If you need rendered HTML instead of plain fetch, inject a custom fetcher:
```ts
import { scrapePage } from "srcfull";
import { createScrapingBeeHtmlFetcher } from "srcfull/providers/scrapingbee";

const fetchHtml = createScrapingBeeHtmlFetcher({
  apiKey: process.env.SCRAPINGBEE_API_KEY!,
});

const result = await scrapePage("https://example.com", { fetchHtml });
```

If you want the built-in fetcher with different timeout or header behavior:
```ts
import { createDefaultHtmlFetcher, scrapePage } from "srcfull";

const fetchHtml = createDefaultHtmlFetcher({
  timeoutMs: 15_000,
  headers: {
    "Accept-Language": "en-GB,en;q=0.9",
  },
});

const result = await scrapePage("https://example.com", { fetchHtml });
```

For image-only fallback:

```ts
import { createFirecrawlImageFallback } from "srcfull/providers/firecrawl";
```

If you want candidate extraction without the rest of the pipeline:
```ts
import { extractImageCandidatesFromHtml } from "srcfull";

const candidates = extractImageCandidatesFromHtml(
  html,
  "https://example.com/product-page"
);
```

For repeat jobs, persist cache and learned patterns on disk:
```ts
import {
  createFileCache,
  createFilePatternStore,
  resolveImageUrl,
} from "srcfull";

const cache = createFileCache({ filePath: ".srcfull/cache.json" });
const patternStore = createFilePatternStore({
  filePath: ".srcfull/patterns.json",
});

const result = await resolveImageUrl("https://cdn.example.com/photo.jpg?w=400", {
  cache,
  patternStore,
});
```

## CLI
```sh
srcfull resolve 'https://cdn.example.com/photo.jpg?w=300'
srcfull scrape 'https://example.com/listing' --max-images=12
srcfull scrape 'https://example.com/listing' --max-images=12 --min-size=300 --resolve-concurrency=8
srcfull --version
```

The JSON response from `scrape` includes `stats.returned` as well as `found`, `resolved`, `failed`, and `durationMs`.
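Based on the field names listed above, the stats object has roughly this shape (the type and sample values are illustrative, not srcfull's published typings):

```ts
// Shape of the scrape stats, inferred from the documented field names.
interface ScrapeStats {
  found: number; // candidates discovered in the HTML
  resolved: number; // candidates successfully resolved
  failed: number; // candidates that failed resolution
  returned: number; // images included in the response (after limits/filters)
  durationMs: number; // wall-clock time for the whole scrape
}

const example: ScrapeStats = {
  found: 18,
  resolved: 12,
  failed: 6,
  returned: 12,
  durationMs: 2340,
};
```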
## Demo Page

There is a self-contained demo page at `docs/demo/index.html`.

```sh
pnpm demo:build
pnpm demo:serve
```

The page is generated from real calls to the package, so the HTML samples, extracted candidates, resolved URLs, and persisted cache/pattern snapshots are actual outputs rather than hand-written mockups.
## Development

```sh
pnpm test
pnpm test:live-patterns
pnpm typecheck
pnpm build
```

`pnpm test:live-patterns` revalidates the researched real-world CDN fixtures in `test/fixtures/curated-patterns.json` against the network.
