
ak-image-scrape v0.1.0 · 112 downloads

ak-image-scrape

Multi-source image scraper with optional AI curation (Gemini Vision via Vertex AI) and GCS upload. Hybrid Node CLI + bundled Python scrapers.

Point it at any topic, get a clean folder of high-quality reference images.

ak-image-scrape "studio ghibli totoro" --curate --gcs-bucket gs://my-bucket/totoro/

What it does

search query → 5 parallel scrapers → dedup + min-dim filter
            → [Gemini Vision curation] → PNG normalize + rename
            → [GCS upload]

Sources (all run in parallel by default):

  • bing — icrawler Bing Images
  • baidu — icrawler Baidu Images
  • ddgs — DuckDuckGo image search
  • jjlimm — Selenium-driven Bing scraper (catches images icrawler misses)
  • gallerydl — gallery-dl for DeviantArt + ArtStation
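If Chrome or the gallery-dl prerequisites aren't available, the run can be narrowed to a subset of sources with the `--sources` flag (the query here is illustrative):

```shell
# Run only the icrawler/DuckDuckGo sources; skip Selenium (jjlimm) and gallery-dl
ak-image-scrape "pixel art landscapes" --sources bing,baidu,ddgs --per-query 30
```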

Install

npm install -g ak-image-scrape

Requires:

  • Node ≥ 18
  • Python 3.10+ (auto-detected)
  • Chrome (for the Selenium-based jjlimm scraper)
  • gcloud auth application-default login (only if using --curate or --gcs-bucket)

First run creates a Python venv at ~/.cache/ak-image-scrape/venv/ (takes ~60–120 s); subsequent runs reuse it. Override the location with AK_IMAGE_SCRAPE_VENV=/some/path or XDG_CACHE_HOME=/some/cache.
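For example, to keep the venv on a different disk (the path below is illustrative):

```shell
# Point the bundled Python venv at a custom location before the first run
export AK_IMAGE_SCRAPE_VENV=/data/cache/ak-image-scrape-venv
ak-image-scrape "retro sci-fi posters"
```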

Quick start

# Scrape only
ak-image-scrape "fantastic four 90s comic" --per-query 50

# Scrape + AI-curate (requires Vertex AI auth)
ak-image-scrape "fantastic four 90s comic" --curate --prefix ff

# Full pipeline + upload
ak-image-scrape "fantastic four 90s comic" --curate --gcs-bucket gs://my-bucket/ff/

Output layout

<out>/
├── final/        ← normalized PNGs: <prefix>-NNNN.png (this is the result)
├── _pool/        ← survivors of dedup + min-dim, before curation
├── _raw/         ← per-source raw downloads
│   ├── bing/<query-slug>/...
│   ├── ddgs/...
│   └── ...
├── _logs/        ← per-source subprocess stdout/stderr
└── .curate-cache.json  ← Gemini decisions cache (if --curate)
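The numbering in final/ can be previewed with plain printf; this sketch assumes NNNN means zero-padding to four digits, as the placeholder suggests:

```shell
# Reproduce the <prefix>-NNNN.png naming (assumption: NNNN is zero-padded to 4 digits)
printf '%s-%04d.png\n' img 7
# img-0007.png
```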

Flags

Core

| Flag | Default | Description |
|------|---------|-------------|
| <query> | required | Primary search query (positional) |
| --queries-file path | | Newline-delimited file of additional queries |
| --out dir | ./scraped | Output directory |
| --per-query N | 100 | Images per query per source |
| --min-dim N | 600 | Minimum width/height (pixels) for kept images |
| --sources s1,s2,... | all | Subset of bing,baidu,ddgs,jjlimm,gallerydl |
| --prefix s | img | Output filename prefix → <prefix>-NNNN.png |
| --concurrency N | 5 | Max parallel scraper subprocesses |

Curation (requires --curate)

| Flag | Default | Description |
|------|---------|-------------|
| --curate | false | Enable AI curation pass |
| --curate-model id | gemini-3.1-flash-lite-preview | Vertex AI Gemini model |
| --curate-prompt-file path | | Custom prompt; replaces built-in template |
| --curate-concurrency N | 8 | Parallel Gemini classifications |
| --vertex-project id | $GOOGLE_CLOUD_PROJECT | Vertex AI project |
| --vertex-region r | us-central1 | Vertex AI region |

GCS upload

| Flag | Default | Description |
|------|---------|-------------|
| --gcs-bucket gs://b/path/ | | Upload destination. Skipped if absent. |
| --gcs-concurrency N | 50 | Parallel uploads |

Other

| Flag | Default | Description |
|------|---------|-------------|
| --resume | true | Reuse cached curation decisions |
| --dry-run | false | Plan only; no scrape/curate/upload |
| --verbose / -v | false | Verbose logging |
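A --dry-run pass is a cheap way to sanity-check a flag combination before committing to a long scrape (the query is illustrative):

```shell
# Print the plan without scraping, curating, or uploading
ak-image-scrape "art deco architecture" --curate --dry-run --verbose
```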

Authentication

Vertex AI (for --curate)

gcloud auth application-default login
export GOOGLE_CLOUD_PROJECT=your-project
export CLOUD_ML_REGION=us-central1   # optional

Or pass --vertex-project and --vertex-region explicitly.

Google Cloud Storage (for --gcs-bucket)

Uses the same ADC credentials. The signed-in account needs storage.objects.create on the destination bucket.

Cost (per 1000 images)

  • Scraping: free (rate-limited by source)
  • Curation (Gemini 3.1 Flash Lite): ~$0.20
  • GCS storage (Standard, us-central1): ~$0.02/GB/month
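As a back-of-envelope check using the figures above (the 5,000-image batch size is illustrative):

```shell
# Curating 5,000 images at ~$0.20 per 1,000 images
awk 'BEGIN { printf "%.2f\n", 5000 / 1000 * 0.20 }'
# 1.00
```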

Library API

import { run } from 'ak-image-scrape';

const result = await run({
  query: 'studio ghibli totoro',
  perQuery: 50,
  curate: true,
  gcsBucket: 'gs://my-bucket/totoro/',
  prefix: 'tt',
  out: './totoro-out',
});
console.log(`${result.count} PNGs in ${result.outDir}`);

Programmatic source list

import { ALL_SOURCES, SOURCES } from 'ak-image-scrape';
console.log(ALL_SOURCES); // ['bing', 'baidu', 'ddgs', 'jjlimm', 'gallerydl']

Adding new sources

See docs/ADDING_SOURCES.md.

Troubleshooting

No python3/python interpreter found — install Python 3.10+ on your $PATH.

venv install failed — delete ~/.cache/ak-image-scrape/venv/ (or whatever AK_IMAGE_SCRAPE_VENV points to) and re-run; ensure you have build tools (xcode-select --install on macOS).

jjlimm finds 0 images — Selenium needs Chrome installed. Check _logs/jjlimm.log for ChromeDriver errors. Source DOM rot is expected; see ADDING_SOURCES.md for repair pattern.

Vertex permission denied — verify gcloud auth application-default print-access-token works and the account has roles/aiplatform.user on the project.

GCS 403 — confirm the ADC account has storage.objectAdmin on the bucket.

License

MIT