mallmaverick-store-scraper
v0.2.0
MCP server + CLI for scraping shopping mall store directories. Hours-first layered pipeline + image classification.
mall-scraper-mcp
Layered scraper for shopping-mall store directories. Works as:
- MCP server: coworkers drive scrapes from Claude Desktop / Claude Code
- CLI: direct command-line use (node src/main.js)
Both share the same v5 pipeline: deterministic hours extraction (JSON-LD → DOM patterns → labeled section → sync-with-mall → focused LLM → external follow), per-page image classification with logo/brand/storefront separation, brand-site fallback for problematic logos.
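The layered hours extraction described above can be pictured as a chain of fallbacks: try the cheapest deterministic source first and only move on when a layer finds nothing. The following is a minimal sketch under that assumption, not the code in hoursPipeline.js. The `extractHours` function, the layer objects, and the `page` fields are all hypothetical names used for illustration.

```javascript
// Hypothetical layer runner: each layer returns hours, or null when it finds nothing.
async function extractHours(page, layers) {
  for (const layer of layers) {
    const hours = await layer.run(page);
    if (hours) {
      // First layer that produces a result wins; later layers are skipped.
      return { hours, source: layer.name };
    }
  }
  return { hours: null, source: null };
}

// Stand-in layers for illustration only; real layers would inspect the loaded page.
const layers = [
  { name: "json-ld", run: async (page) => page.jsonLdHours ?? null },
  { name: "dom-patterns", run: async (page) => page.domHours ?? null },
  { name: "focused-llm", run: async () => null }, // placeholder for a model call
];

extractHours({ domHours: "Mon-Fri 10:00-21:00" }, layers).then((result) => {
  console.log(result.source, result.hours);
});
```

Ordering the layers from deterministic to LLM-based keeps model calls as a last resort, which is presumably why the pipeline is described as hours-first and deterministic.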
How coworkers install it
Once the package is published to npm and the Cloudflare Worker is deployed, every coworker runs one command in their terminal:
claude mcp add mall-scraper \
--env MALL_SCRAPER_PROXY_URL=https://mall-scraper-openai-proxy.YOURSUB.workers.dev \
--env MALL_SCRAPER_TOKEN=YOUR_SHARED_SECRET \
-- npx -y mallmaverick-store-scraper@latest
Then in Claude they say things like:
Scrape https://grasslands.ca/store-directory/, first 10 stores. Save as CSV.
Claude calls the scrape_directory tool, receives the data, and can then
do follow-up analysis (write a CSV, find missing fields, retry specific stores).
What requires no setup on coworker machines
- ❌ No git clone
- ❌ No OpenAI API key (it lives in your Worker)
- ❌ No zip to download or replace on updates
- ✅ npm/npx + Node 18+ (most have this; otherwise nodejs.org)
- ✅ The shared-secret token (you give them)
The first scrape downloads Chromium (~170 MB, one-time, automatic via Puppeteer).
How YOU set it up (one-time)
1. Deploy the Cloudflare Worker (10 min)
See cloudflare-worker/README.md. The short version:
cd cloudflare-worker
npm install
npx wrangler login # browser auth to your Cloudflare account
npx wrangler deploy
npx wrangler secret put OPENAI_API_KEY # paste your real OpenAI key
npx wrangler secret put SHARED_SECRET # paste a long random string
You now have:
- MALL_SCRAPER_PROXY_URL = https://mall-scraper-openai-proxy.YOURSUB.workers.dev
- MALL_SCRAPER_TOKEN = (whatever you put as SHARED_SECRET)
Free tier covers ~300 mall scrapes/day. Cost = whatever your OpenAI bill is (~$0.005/store at gpt-5.4-mini).
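To budget a scrape before running it, the per-store figure above can be multiplied out. This is only a rough sketch: the $0.005 value is the approximation quoted above, actual spend depends on the model and page sizes, and `estimateCostUsd` is a hypothetical helper, not part of the package.

```javascript
// Approximate per-store OpenAI cost, taken from the estimate quoted above.
const COST_PER_STORE_USD = 0.005;

// Returns the estimated cost as a string rounded to cents.
function estimateCostUsd(storeCount, costPerStore = COST_PER_STORE_USD) {
  return (storeCount * costPerStore).toFixed(2);
}

console.log(estimateCostUsd(200)); // a 200-store mall → "1.00"
```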
2. Publish the npm package
# Log in to npm
npm login
# Sanity check
npm pack --dry-run # see exactly what would be published
# First publish
npm publish --access public
If mall-scraper-mcp is taken, edit package.json "name" to something
available, or use a scope like @yourname/mall-scraper-mcp. Scoped packages
are private by default, so publish them with npm publish --access public.
3. Share the install command with coworkers
Send them the one-line claude mcp add command above, with your actual
proxy URL and shared secret pasted in.
How you ship updates
This is the workflow that makes "easy updates" actually easy:
# Make changes
git commit -am "improve hours layer 4 for X site"
# Bump the version (run one of these, not both)
npm version patch # 0.1.0 → 0.1.1 (bug fixes)
npm version minor # 0.1.0 → 0.2.0 (new features)
# Publish
npm publish
Coworkers get the new version automatically on their next Claude session
because the install command uses npx -y mallmaverick-store-scraper@latest, so npx
re-resolves to the latest published version every time.
If you want stricter pinning (say you publish a buggy version and want time
to revert), tell them to use an exact version such as
mallmaverick-store-scraper@X.Y.Z instead of @latest.
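The patch and minor bumps in the workflow above follow semantic versioning. As a rough illustration of what happens to the version string, here is a small sketch. It ignores prerelease tags and the git commit and tag that npm version also creates, and `bumpVersion` is a hypothetical helper rather than anything npm exposes.

```javascript
// Compute the next version string for a given release type.
function bumpVersion(version, release) {
  const [major, minor, patch] = version.split(".").map(Number);
  switch (release) {
    case "patch":
      return `${major}.${minor}.${patch + 1}`;
    case "minor":
      return `${major}.${minor + 1}.0`; // a minor bump resets patch to 0
    case "major":
      return `${major + 1}.0.0`; // a major bump resets minor and patch
    default:
      throw new Error(`unknown release type: ${release}`);
  }
}

console.log(bumpVersion("0.1.0", "patch")); // 0.1.1
console.log(bumpVersion("0.1.0", "minor")); // 0.2.0
```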
Worker updates (less frequent)
cd cloudflare-worker
npx wrangler deploy
Live in seconds. No coworker action needed.
CLI usage (you, or fallback)
cd path/to/mall-scraper-mcp
npm install
echo "OPENAI_API_KEY=sk-..." > .env # or set MALL_SCRAPER_* env vars
./run.sh
CLI prompts for: directory URL, model, max stores, concurrency, threshold,
vision yes/no. Output lands in extracted_stores/.
MCP tools exposed
| Tool | Use when |
|---|---|
| scrape_directory | User wants the full per-store extraction across a directory listing |
| get_store_hours | Debugging — quick hours-only check on a single store URL |
| validate_image_url | A logo isn't loading in the CMS — confirm whether the URL itself is bad |
All three accept JSON inputs documented in their schemas; Claude figures out the args from the conversation.
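For debugging it can help to see roughly what Claude sends over stdio when it invokes one of these tools. MCP tool invocations are JSON-RPC 2.0 requests with the method tools/call. The sketch below builds one such request. The argument names (`url`, `maxStores`) are assumptions for illustration, since the real names live in each tool's schema, and `buildToolCall` is a hypothetical helper.

```javascript
// Build a JSON-RPC 2.0 "tools/call" request in the shape MCP uses.
function buildToolCall(id, toolName, args) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

// Argument names here are hypothetical; check the tool's published schema.
const request = buildToolCall(1, "scrape_directory", {
  url: "https://grasslands.ca/store-directory/",
  maxStores: 10,
});

console.log(JSON.stringify(request, null, 2));
```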
File layout
mall-scraper-mcp/
├── package.json ← bin entry → src/mcp-server.js
├── src/
│ ├── mcp-server.js ← MCP stdio server (entry for `npx mallmaverick-store-scraper`)
│ ├── main.js ← CLI entry
│ ├── openai-proxy.js ← chooses direct OpenAI vs Worker proxy from env
│ ├── browser.js ← Puppeteer wrapper + XHR intercept
│ ├── discovery.js ← directory URL discovery + logo map
│ ├── hoursParser.js ← canonical hours parsing / validation
│ ├── hoursPipeline.js ← 7-layer hours extraction
│ ├── mallContext.js ← mall hours + socials + chrome images detection
│ ├── imageExtraction.js ← logo/brand/storefront classifier
│ ├── brandSiteFallback.js ← brand-site logo when mall has GIF/missing
│ ├── deterministic.js ← phone, socials, website, status flags
│ ├── storeExtractor.js ← LLM extraction for non-deterministic fields
│ ├── retryStrategy.js ← 3-attempt escalating page loads
│ ├── storeModel.js ← 40-field schema + CSV writer (CRLF/BOM)
│ └── output.js ← (legacy, unused by mcp server)
├── cloudflare-worker/
│ ├── worker.js ← OpenAI proxy (30 LOC)
│ ├── wrangler.toml
│ └── README.md
└── test/
└── hoursParser.test.js ← 40+ unit tests
Auth modes
The scraper supports two ways to reach OpenAI; it picks the first that's configured:
- Proxy mode (production / coworker default). MALL_SCRAPER_PROXY_URL + MALL_SCRAPER_TOKEN set → calls go through the Cloudflare Worker, which holds your real OpenAI key.
- Direct mode (your local dev fallback). OPENAI_API_KEY set → calls go straight to api.openai.com. Useful when developing without spinning up the Worker.
If neither is set, the scraper refuses to start with a clear error.
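The selection rule above (proxy first, then direct, otherwise fail) could be sketched as follows. This is an illustration of the rule as described, not the contents of openai-proxy.js. The `resolveAuthMode` name and the shape of the returned object are assumptions.

```javascript
// Pick an auth mode from environment variables, preferring the Worker proxy.
function resolveAuthMode(env) {
  if (env.MALL_SCRAPER_PROXY_URL && env.MALL_SCRAPER_TOKEN) {
    return { mode: "proxy", baseUrl: env.MALL_SCRAPER_PROXY_URL };
  }
  if (env.OPENAI_API_KEY) {
    return { mode: "direct", baseUrl: "https://api.openai.com" };
  }
  // Neither mode is configured: refuse to start with a clear message.
  throw new Error(
    "Set MALL_SCRAPER_PROXY_URL and MALL_SCRAPER_TOKEN, or OPENAI_API_KEY."
  );
}

console.log(resolveAuthMode({ OPENAI_API_KEY: "sk-example" }).mode); // direct
```

Note that in this sketch a proxy URL without a token does not count as configured, so it falls through to direct mode or to the error.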
Troubleshooting
Logo URL returns HTML in coworker's CMS:
Ask Claude to run validate_image_url on the failing URL. This confirms whether
the URL itself returns a real image. If it does, the issue is on the CMS
side (the shopcurrents-style empty property_manager_id case is a known
example).
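One way to tell a real image from an HTML error page is to look at the first bytes of the response body instead of trusting the URL or file extension. The sketch below checks common magic numbers. It is an assumption about how such a check could work, not a description of what validate_image_url does internally, and `sniffImageType` is a hypothetical name. SVG is text-based and is not covered here.

```javascript
// Identify common raster image formats from their leading bytes.
function sniffImageType(bytes) {
  const startsWith = (signature, offset = 0) =>
    signature.every((byte, i) => bytes[offset + i] === byte);

  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return "png";
  if (startsWith([0xff, 0xd8, 0xff])) return "jpeg";
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return "gif"; // "GIF8"
  // WebP: "RIFF" at byte 0 and "WEBP" at byte 8.
  if (
    startsWith([0x52, 0x49, 0x46, 0x46]) &&
    startsWith([0x57, 0x45, 0x42, 0x50], 8)
  ) {
    return "webp";
  }
  return null; // HTML, JSON, or anything else unrecognized
}

console.log(sniffImageType(Buffer.from("<!DOCTYPE html>"))); // null
console.log(sniffImageType(Buffer.from("GIF89a"))); // gif
```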
Coworker gets "unauthorized" from the Worker:
Their MALL_SCRAPER_TOKEN doesn't match the current SHARED_SECRET. Either
update the token on their side or run npx wrangler secret put SHARED_SECRET
to set a matching value.
First scrape takes 2-3 minutes:
Puppeteer is downloading Chrome on first run (~170 MB). Subsequent scrapes are normal speed.
npx mallmaverick-store-scraper not found:
They need Node 18+ on their PATH. Run node --version to check.
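If a coworker is unsure how to read the version output, the Node 18+ requirement can be checked programmatically. This is a small sketch with a hypothetical helper name, `meetsNodeRequirement`. It only compares the major version.

```javascript
// Return true when a Node version string such as "v18.17.0" is new enough.
function meetsNodeRequirement(versionString, minimumMajor = 18) {
  const major = Number.parseInt(
    versionString.replace(/^v/, "").split(".")[0],
    10
  );
  return Number.isInteger(major) && major >= minimumMajor;
}

// process.version holds the running Node version, for example "v20.11.1".
console.log(meetsNodeRequirement(process.version));
```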
License
MIT.
