# ak-image-scrape
Multi-source image scraper with optional AI curation (Gemini Vision via Vertex AI) and GCS upload. Hybrid Node CLI + bundled Python scrapers.
Point it at any topic, get a clean folder of high-quality reference images.
```sh
ak-image-scrape "studio ghibli totoro" --curate --gcs-bucket gs://my-bucket/totoro/
```

## What it does
```text
search query → 5 parallel scrapers → dedup + min-dim filter
            → [Gemini Vision curation] → PNG normalize + rename
            → [GCS upload]
```

Sources (all run in parallel by default):

- `bing` — icrawler Bing Images
- `baidu` — icrawler Baidu Images
- `ddgs` — DuckDuckGo image search
- `jjlimm` — Selenium-driven Bing scraper (catches images icrawler misses)
- `gallerydl` — gallery-dl for DeviantArt + ArtStation
## Install
```sh
npm install -g ak-image-scrape
```

Requires:

- Node ≥ 18
- Python 3.10+ (auto-detected)
- Chrome (for the Selenium-based `jjlimm` scraper)
- `gcloud auth application-default login` (only if using `--curate` or `--gcs-bucket`)
First run installs a Python venv to `~/.cache/ak-image-scrape/venv/` (~60–120 s). Subsequent runs reuse it. Override the location with `AK_IMAGE_SCRAPE_VENV=/some/path` or `XDG_CACHE_HOME=/some/cache`.
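The cache-directory resolution can be sketched like this. The helper name is hypothetical, and the precedence order (`AK_IMAGE_SCRAPE_VENV` over `XDG_CACHE_HOME`, falling back to `~/.cache`) is an assumption based on the description above:

```javascript
// Hypothetical sketch of the venv-location logic described above.
function venvDir(env, home) {
  if (env.AK_IMAGE_SCRAPE_VENV) return env.AK_IMAGE_SCRAPE_VENV; // explicit override
  const cacheRoot = env.XDG_CACHE_HOME || `${home}/.cache`;      // XDG fallback
  return `${cacheRoot}/ak-image-scrape/venv`;
}
```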
## Quick start
```sh
# Scrape only
ak-image-scrape "fantastic four 90s comic" --per-query 50

# Scrape + AI-curate (requires Vertex AI auth)
ak-image-scrape "fantastic four 90s comic" --curate --prefix ff

# Full pipeline + upload
ak-image-scrape "fantastic four 90s comic" --curate --gcs-bucket gs://my-bucket/ff/
```

## Output layout
```text
<out>/
├── final/              ← normalized PNGs: <prefix>-NNNN.png (this is the result)
├── _pool/              ← survivors of dedup + min-dim, before curation
├── _raw/               ← per-source raw downloads
│   ├── bing/<query-slug>/...
│   ├── ddgs/...
│   └── ...
├── _logs/              ← per-source subprocess stdout/stderr
└── .curate-cache.json  ← Gemini decisions cache (if --curate)
```

## Flags
### Core
| Flag | Default | Description |
|------|---------|-------------|
| `<query>` | required | Primary search query (positional) |
| `--queries-file path` | | Newline-delimited file of additional queries |
| `--out dir` | `./scraped` | Output directory |
| `--per-query N` | 100 | Images per query per source |
| `--min-dim N` | 600 | Minimum width/height (pixels) for kept images |
| `--sources s1,s2,...` | all | Subset of `bing,baidu,ddgs,jjlimm,gallerydl` |
| `--prefix s` | `img` | Output filename prefix → `<prefix>-NNNN.png` |
| `--concurrency N` | 5 | Max parallel scraper subprocesses |
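The `--prefix` naming scheme (`<prefix>-NNNN.png`) can be illustrated with a small sketch; `outputName` is a hypothetical helper, and 4-digit zero padding is an assumption based on the `NNNN` placeholder:

```javascript
// Hypothetical helper: builds <prefix>-NNNN.png names, assuming 4-digit zero padding.
function outputName(prefix, index) {
  return `${prefix}-${String(index).padStart(4, '0')}.png`;
}

console.log(outputName('ff', 12)); // → ff-0012.png
```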
### Curation (requires `--curate`)
| Flag | Default | Description |
|------|---------|-------------|
| `--curate` | false | Enable AI curation pass |
| `--curate-model id` | `gemini-3.1-flash-lite-preview` | Vertex AI Gemini model |
| `--curate-prompt-file path` | | Custom prompt; replaces built-in template |
| `--curate-concurrency N` | 8 | Parallel Gemini classifications |
| `--vertex-project id` | `$GOOGLE_CLOUD_PROJECT` | Vertex AI project |
| `--vertex-region r` | `us-central1` | Vertex AI region |
### GCS upload
| Flag | Default | Description |
|------|---------|-------------|
| `--gcs-bucket gs://b/path/` | | Upload destination. Skipped if absent. |
| `--gcs-concurrency N` | 50 | Parallel uploads |
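A bounded worker pool like the one `--gcs-concurrency` implies can be sketched as below. `uploadAll` and `uploadOne` are hypothetical stand-ins, not the package's API:

```javascript
// Sketch: run at most `concurrency` uploads in flight at once.
async function uploadAll(files, uploadOne, concurrency = 50) {
  const queue = [...files];
  const workers = Array.from(
    { length: Math.min(concurrency, queue.length) },
    async () => {
      while (queue.length) {
        const file = queue.shift(); // synchronous: no two workers grab the same file
        await uploadOne(file);
      }
    }
  );
  await Promise.all(workers);
}
```

Pulling from a shared queue (rather than chunking the file list up front) keeps all workers busy even when individual uploads vary in duration.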
### Other
| Flag | Default | Description |
|------|---------|-------------|
| `--resume` | true | Re-use cached curation decisions |
| `--dry-run` | false | Plan only, no scrape/curate/upload |
| `--verbose` / `-v` | false | Verbose logging |
## Authentication
### Vertex AI (for `--curate`)
```sh
gcloud auth application-default login
export GOOGLE_CLOUD_PROJECT=your-project
export CLOUD_ML_REGION=us-central1   # optional
```

Or pass `--vertex-project` and `--vertex-region` explicitly.
### Google Cloud Storage (for `--gcs-bucket`)
Same ADC. The signed-in account needs `storage.objects.create` on the destination bucket.
## Cost (per 1000 images)
- Scraping: free (rate-limited by source)
- Curation (Gemini 3.1 Flash Lite): ~$0.20
- GCS storage (Standard, us-central): ~$0.02/month/GB
## Library API
```js
import { run } from 'ak-image-scrape';

const result = await run({
  query: 'studio ghibli totoro',
  perQuery: 50,
  curate: true,
  gcsBucket: 'gs://my-bucket/totoro/',
  prefix: 'tt',
  out: './totoro-out',
});

console.log(`${result.count} PNGs in ${result.outDir}`);
```

### Programmatic source list
```js
import { ALL_SOURCES, SOURCES } from 'ak-image-scrape';

console.log(ALL_SOURCES); // ['bing', 'baidu', 'ddgs', 'jjlimm', 'gallerydl']
```

## Adding new sources

See `ADDING_SOURCES.md`.
## Troubleshooting
- **No python3/python interpreter found** — install Python 3.10+ on your `$PATH`.
- **venv install failed** — delete `~/.cache/ak-image-scrape/venv/` (or whatever `AK_IMAGE_SCRAPE_VENV` points to) and re-run; ensure you have build tools (`xcode-select --install` on macOS).
- **jjlimm finds 0 images** — Selenium needs Chrome installed. Check `_logs/jjlimm.log` for ChromeDriver errors. Source DOM rot is expected; see `ADDING_SOURCES.md` for the repair pattern.
- **Vertex permission denied** — verify `gcloud auth application-default print-access-token` works and the account has `roles/aiplatform.user` on the project.
- **GCS 403** — confirm the ADC account has `storage.objectAdmin` on the bucket.
## License
MIT
