# ak-image-scrape
Multi-source image scraper with optional AI curation (Gemini Vision via Vertex AI) and GCS upload. Hybrid Node CLI + bundled Python scrapers.
Point it at any topic, get a clean folder of high-quality reference images.
```sh
ak-image-scrape "studio ghibli totoro" --curate --gcs-bucket gs://my-bucket/totoro/
```

## What it does
```text
search query → 5 parallel scrapers → dedup + min-dim filter
            → [Gemini Vision curation] → PNG normalize + rename
            → [GCS upload]
```

Sources (all run in parallel by default):

- `bing` — icrawler Bing Images
- `baidu` — icrawler Baidu Images
- `ddgs` — DuckDuckGo image search
- `jjlimm` — Selenium-driven Bing scraper (catches images icrawler misses)
- `gallerydl` — gallery-dl for DeviantArt + ArtStation
## Install
```sh
npm install -g ak-image-scrape
```

Requires:

- Node ≥ 18
- Python 3.10+ (auto-detected)
- Chrome (for the Selenium-based `jjlimm` scraper)
- `gcloud auth application-default login` (only if using `--curate` or `--gcs-bucket`)
First run installs a Python venv to `~/.cache/ak-image-scrape/venv/` (~60–120 s). Subsequent runs reuse it. Override the location with `AK_IMAGE_SCRAPE_VENV=/some/path` or `XDG_CACHE_HOME=/some/cache`.
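The cache-directory resolution can be sketched like this. The helper name is hypothetical, and the precedence order (`AK_IMAGE_SCRAPE_VENV` over `XDG_CACHE_HOME`, falling back to `~/.cache`) is an assumption based on the description above:

```javascript
// Hypothetical sketch of the venv-location logic described above.
function venvDir(env, home) {
  if (env.AK_IMAGE_SCRAPE_VENV) return env.AK_IMAGE_SCRAPE_VENV; // explicit override
  const cacheRoot = env.XDG_CACHE_HOME || `${home}/.cache`;      // XDG fallback
  return `${cacheRoot}/ak-image-scrape/venv`;
}
```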
## Quick start
```sh
# Scrape only
ak-image-scrape "fantastic four 90s comic" --per-query 50

# Scrape + AI-curate (requires Vertex AI auth)
ak-image-scrape "fantastic four 90s comic" --curate --prefix ff

# Full pipeline + upload
ak-image-scrape "fantastic four 90s comic" --curate --gcs-bucket gs://my-bucket/ff/
```

## Output layout
```text
<out>/
├── final/              ← normalized PNGs: <prefix>-NNNN.png (this is the result)
├── _pool/              ← survivors of dedup + min-dim, before curation
├── _raw/               ← per-source raw downloads
│   ├── bing/<query-slug>/...
│   ├── ddgs/...
│   └── ...
├── _logs/              ← per-source subprocess stdout/stderr
└── .curate-cache.json  ← Gemini decisions cache (if --curate)
```

## Flags
### Core
| Flag | Default | Description |
|------|---------|-------------|
| `<query>` | required | Primary search query (positional) |
| `--queries-file path` | | Newline-delimited file of additional queries |
| `--out dir` | `./scraped` | Output directory |
| `--per-query N` | 100 | Images per query per source |
| `--min-dim N` | 600 | Minimum width/height (pixels) for kept images |
| `--sources s1,s2,...` | all | Subset of `bing,baidu,ddgs,jjlimm,gallerydl` |
| `--prefix s` | `img` | Output filename prefix → `<prefix>-NNNN.png` |
| `--concurrency N` | 5 | Max parallel scraper subprocesses |
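The `--prefix` naming scheme (`<prefix>-NNNN.png`) can be illustrated with a small sketch; `outputName` is a hypothetical helper, and 4-digit zero padding is an assumption based on the `NNNN` placeholder:

```javascript
// Hypothetical helper: builds <prefix>-NNNN.png names, assuming 4-digit zero padding.
function outputName(prefix, index) {
  return `${prefix}-${String(index).padStart(4, '0')}.png`;
}

console.log(outputName('ff', 12)); // → ff-0012.png
```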
### Curation (requires `--curate`)
| Flag | Default | Description |
|------|---------|-------------|
| `--curate` | false | Enable AI curation pass |
| `--curate-model id` | `gemini-3.1-flash-lite-preview` | Vertex AI Gemini model |
| `--curate-prompt-file path` | | Custom prompt; replaces built-in template |
| `--curate-concurrency N` | 8 | Parallel Gemini classifications |
| `--vertex-project id` | `$GOOGLE_CLOUD_PROJECT` | Vertex AI project |
| `--vertex-region r` | `us-central1` | Vertex AI region |
### GCS upload
| Flag | Default | Description |
|------|---------|-------------|
| `--gcs-bucket gs://b/path/` | | Upload destination. Skipped if absent. |
| `--gcs-concurrency N` | 50 | Parallel uploads |
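A bounded worker pool like the one `--gcs-concurrency` implies can be sketched as below. `uploadAll` and `uploadOne` are hypothetical stand-ins, not the package's API:

```javascript
// Sketch: run at most `concurrency` uploads in flight at once.
async function uploadAll(files, uploadOne, concurrency = 50) {
  const queue = [...files];
  const workers = Array.from(
    { length: Math.min(concurrency, queue.length) },
    async () => {
      while (queue.length) {
        const file = queue.shift(); // synchronous: no two workers grab the same file
        await uploadOne(file);
      }
    }
  );
  await Promise.all(workers);
}
```

Pulling from a shared queue (rather than chunking the file list up front) keeps all workers busy even when individual uploads vary in duration.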
### Other
| Flag | Default | Description |
|------|---------|-------------|
| `--resume` | true | Re-use cached curation decisions |
| `--dry-run` | false | Plan only, no scrape/curate/upload |
| `--verbose` / `-v` | false | Verbose logging |
## Authentication
### Vertex AI (for `--curate`)
```sh
gcloud auth application-default login
export GOOGLE_CLOUD_PROJECT=your-project
export CLOUD_ML_REGION=us-central1   # optional
```

Or pass `--vertex-project` and `--vertex-region` explicitly.
### Google Cloud Storage (for `--gcs-bucket`)
Same ADC. The signed-in account needs `storage.objects.create` on the destination bucket.
## Cost (per 1000 images)
- Scraping: free (rate-limited by source)
- Curation (Gemini 3.1 Flash Lite): ~$0.20
- GCS storage (Standard, us-central): ~$0.02/month/GB
## Library API
```js
import { run } from 'ak-image-scrape';

const result = await run({
  query: 'studio ghibli totoro',
  perQuery: 50,
  curate: true,
  gcsBucket: 'gs://my-bucket/totoro/',
  prefix: 'tt',
  out: './totoro-out',
});

console.log(`${result.count} PNGs in ${result.outDir}`);
```

### Programmatic source list
```js
import { ALL_SOURCES, SOURCES } from 'ak-image-scrape';

console.log(ALL_SOURCES); // ['bing', 'baidu', 'ddgs', 'jjlimm', 'gallerydl']
```

## Adding new sources

See `ADDING_SOURCES.md`.
## Troubleshooting
- **No python3/python interpreter found** — install Python 3.10+ on your `$PATH`.
- **venv install failed** — delete `~/.cache/ak-image-scrape/venv/` (or whatever `AK_IMAGE_SCRAPE_VENV` points to) and re-run; ensure you have build tools (`xcode-select --install` on macOS).
- **jjlimm finds 0 images** — Selenium needs Chrome installed. Check `_logs/jjlimm.log` for ChromeDriver errors. Source DOM rot is expected; see `ADDING_SOURCES.md` for the repair pattern.
- **Vertex permission denied** — verify `gcloud auth application-default print-access-token` works and the account has `roles/aiplatform.user` on the project.
- **GCS 403** — confirm the ADC account has `storage.objectAdmin` on the bucket.
## License
MIT
