mallmaverick-store-scraper

v0.2.0

Published

a month ago

MCP server + CLI for scraping shopping mall store directories. Hours-first layered pipeline + image classification.


mallmaverick

mcp claude scraper shopping-mall store-directory puppeteer

mall-scraper-mcp

Layered scraper for shopping-mall store directories. Works as:

MCP server — coworkers drive scrapes from Claude Desktop / Claude Code
CLI — direct command-line use (node src/main.js)

Both share the same v5 pipeline: deterministic hours extraction (JSON-LD → DOM patterns → labeled section → sync-with-mall → focused LLM → external follow), per-page image classification with logo/brand/storefront separation, brand-site fallback for problematic logos.
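
The layered idea can be illustrated with a minimal sketch: try each extraction layer in priority order and stop at the first one that returns a result. This is not the code in hoursPipeline.js. The function name extractHours, the two toy layers, and the page shape are assumptions made for illustration, and the real layers (the LLM one in particular) are presumably asynchronous.

```javascript
// Hypothetical sketch: run layers in priority order, stop at the first hit.
// Layer names echo the README; everything else here is an assumption.
function extractHours(page, layers) {
  for (const layer of layers) {
    const hours = layer.extract(page);
    if (hours) {
      // Record which layer produced the answer, which helps when debugging.
      return { hours, source: layer.name };
    }
  }
  return { hours: null, source: null };
}

// Two toy layers standing in for the real ones.
const jsonLdLayer = {
  name: "json-ld",
  extract: (page) => (page.jsonLd && page.jsonLd.openingHours) || null,
};
const domPatternLayer = {
  name: "dom-patterns",
  extract: (page) => {
    const match = /Hours:\s*(.+)/.exec(page.text || "");
    return match ? match[1].trim() : null;
  },
};

const layers = [jsonLdLayer, domPatternLayer];

// No JSON-LD on this page, so the DOM pattern layer supplies the answer.
const result = extractHours({ text: "Hours: Mon-Fri 10am-9pm" }, layers);
console.log(result);
```

Putting the cheap deterministic layers first means the more expensive fallbacks only run for pages where structured data and DOM patterns both come up empty.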

How coworkers install it

Once the package is published to npm and the Cloudflare Worker is deployed, each coworker runs one command in their terminal:

claude mcp add mall-scraper \
  --env MALL_SCRAPER_PROXY_URL=https://mall-scraper-openai-proxy.YOURSUB.workers.dev \
  --env MALL_SCRAPER_TOKEN=YOUR_SHARED_SECRET \
  -- npx -y mallmaverick-store-scraper@latest

Then in Claude they say things like:

Scrape https://grasslands.ca/store-directory/, first 10 stores. Save as CSV.

Claude calls the scrape_directory tool, gets the data back, and can then do follow-up work (write a CSV, find missing fields, retry specific stores).

What coworkers do and don't need on their machines

❌ No git clone
❌ No OpenAI API key (it lives in your Worker)
❌ No zip to download or replace on updates
✅ npm/npx + Node 18+ (most have this; otherwise nodejs.org)
✅ The shared-secret token (you give them)

The first scrape downloads Chromium (~170 MB, one-time, automatic via Puppeteer).

How YOU set it up (one-time)

1. Deploy the Cloudflare Worker (10 min)

See cloudflare-worker/README.md. The short version:

cd cloudflare-worker
npm install
npx wrangler login          # browser auth to your Cloudflare account
npx wrangler deploy
npx wrangler secret put OPENAI_API_KEY     # paste your real OpenAI key
npx wrangler secret put SHARED_SECRET      # paste a long random string

You now have:

MALL_SCRAPER_PROXY_URL = https://mall-scraper-openai-proxy.YOURSUB.workers.dev
MALL_SCRAPER_TOKEN = (whatever you put as SHARED_SECRET)

Free tier covers ~300 mall scrapes/day. Cost = whatever your OpenAI bill is (~$0.005/store at gpt-5.4-mini).

2. Publish the npm package

# Log in to npm
npm login

# Sanity check
npm pack --dry-run                # see exactly what would be published

# First publish
npm publish --access public

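If the name you want (for example mall-scraper-mcp) is already taken on npm, edit "name" in package.json to something available, or use a scope like @yourname/mall-scraper-mcp. Scoped packages are published as restricted by default, so a public scoped package must be published with npm publish --access public.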

3. Share the install command with coworkers

Send them the one-line claude mcp add command above, with your actual proxy URL and shared secret pasted in.

How you ship updates

This is the workflow that makes "easy updates" actually easy:

# Make changes
git commit -am "improve hours layer 4 for X site"

# Bump the version (run ONE of these, not both)
npm version patch                 # 0.1.0 → 0.1.1   (bug fixes)
npm version minor                 # 0.1.0 → 0.2.0   (new features)

# Publish
npm publish

Coworkers get the new version automatically on their next Claude session because the install command uses npx -y mallmaverick-store-scraper@latest — npx re-resolves to the latest published version every time.

If you want stricter pinning (for example, so that a buggy release cannot reach everyone before you have time to revert it), tell them to use an exact version such as mallmaverick-store-scraper@x.y.z instead of @latest.

Worker updates (less frequent)

cd cloudflare-worker
npx wrangler deploy

Live in seconds. No coworker action needed.

CLI usage (you, or fallback)

cd path/to/mall-scraper-mcp
npm install
echo "OPENAI_API_KEY=sk-..." > .env    # or set MALL_SCRAPER_* env vars
./run.sh

CLI prompts for: directory URL, model, max stores, concurrency, threshold, vision yes/no. Output lands in extracted_stores/.

MCP tools exposed

| Tool | Use when |
|---|---|
| scrape_directory | User wants the full per-store extraction across a directory listing |
| get_store_hours | Debugging: quick hours-only check on a single store URL |
| validate_image_url | A logo isn't loading in the CMS; confirm whether the URL itself is bad |

All three accept JSON inputs documented in their schemas; Claude figures out the args from the conversation.
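
For orientation, here is roughly what a tool invocation looks like on the wire. MCP uses JSON-RPC 2.0 and calls tools through the tools/call method. The argument names url and max_stores below are hypothetical; the real names are whatever the scrape_directory input schema declares.

```javascript
// Builds a JSON-RPC 2.0 "tools/call" request in the shape MCP expects.
// The argument names (url, max_stores) are hypothetical placeholders;
// check the tool's input schema for the real ones.
function buildToolCall(id, toolName, args) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

const request = buildToolCall(1, "scrape_directory", {
  url: "https://grasslands.ca/store-directory/",
  max_stores: 10,
});

// Over the stdio transport, each message is written as one line of JSON.
const wireMessage = JSON.stringify(request);
console.log(wireMessage);
```

Coworkers never write this by hand. Claude builds the request from the conversation, but seeing the shape can help when reading MCP server logs.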

File layout

mall-scraper-mcp/
├── package.json             ← bin entry → src/mcp-server.js
├── src/
│   ├── mcp-server.js        ← MCP stdio server (entry for `npx mallmaverick-store-scraper`)
│   ├── main.js              ← CLI entry
│   ├── openai-proxy.js      ← chooses direct OpenAI vs Worker proxy from env
│   ├── browser.js           ← Puppeteer wrapper + XHR intercept
│   ├── discovery.js         ← directory URL discovery + logo map
│   ├── hoursParser.js       ← canonical hours parsing / validation
│   ├── hoursPipeline.js     ← 7-layer hours extraction
│   ├── mallContext.js       ← mall hours + socials + chrome images detection
│   ├── imageExtraction.js   ← logo/brand/storefront classifier
│   ├── brandSiteFallback.js ← brand-site logo when mall has GIF/missing
│   ├── deterministic.js     ← phone, socials, website, status flags
│   ├── storeExtractor.js    ← LLM extraction for non-deterministic fields
│   ├── retryStrategy.js     ← 3-attempt escalating page loads
│   ├── storeModel.js        ← 40-field schema + CSV writer (CRLF/BOM)
│   └── output.js            ← (legacy, unused by mcp server)
├── cloudflare-worker/
│   ├── worker.js            ← OpenAI proxy (30 LOC)
│   ├── wrangler.toml
│   └── README.md
└── test/
    └── hoursParser.test.js  ← 40+ unit tests
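
The "CSV writer (CRLF/BOM)" note on storeModel.js refers to two details that matter when the file is opened in spreadsheet tools: a UTF-8 byte order mark at the start and CRLF line endings. A minimal sketch of that kind of writer follows. It is not the actual storeModel.js code, and the column names and sample row are made up (the real schema has 40 fields).

```javascript
// Sketch of a CSV serializer with a UTF-8 BOM and CRLF line endings.
// Column names and the sample row are hypothetical.
const BOM = "\uFEFF";

function escapeCell(value) {
  const text = value == null ? "" : String(value);
  // Quote cells that contain a comma, a quote, or a line break,
  // and double any embedded quotes (RFC 4180 style).
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsv(columns, rows) {
  const lines = [columns.map(escapeCell).join(",")];
  for (const row of rows) {
    lines.push(columns.map((col) => escapeCell(row[col])).join(","));
  }
  return BOM + lines.join("\r\n") + "\r\n";
}

const csv = toCsv(
  ["name", "hours"],
  [{ name: 'Cafe "Best" Coffee', hours: "Mon-Fri 10-9, Sat 10-6" }]
);
console.log(JSON.stringify(csv));
```

The BOM is what lets Excel detect UTF-8 so accented store names are not mangled, and missing fields serialize as empty cells rather than the string "undefined".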

Auth modes

The scraper supports two ways to reach OpenAI; it picks the first that's configured:

Proxy mode (production / coworker default). MALL_SCRAPER_PROXY_URL + MALL_SCRAPER_TOKEN set → calls go through the Cloudflare Worker, which holds your real OpenAI key.
Direct mode (your local dev fallback). OPENAI_API_KEY set → calls go straight to api.openai.com. Useful when developing without spinning up the Worker.

If neither is set, the scraper refuses to start with a clear error.
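
The selection logic described above can be sketched as follows. The environment variable names come from this README. The function name resolveAuthMode, the return shape, and the error wording are assumptions, not the contents of openai-proxy.js.

```javascript
// Sketch of the auth mode selection: proxy first, then direct, else fail.
// Function name, return shape, and error text are assumptions.
function resolveAuthMode(env) {
  if (env.MALL_SCRAPER_PROXY_URL && env.MALL_SCRAPER_TOKEN) {
    return {
      mode: "proxy",
      baseUrl: env.MALL_SCRAPER_PROXY_URL,
      token: env.MALL_SCRAPER_TOKEN,
    };
  }
  if (env.OPENAI_API_KEY) {
    return {
      mode: "direct",
      baseUrl: "https://api.openai.com",
      token: env.OPENAI_API_KEY,
    };
  }
  throw new Error(
    "No credentials found: set MALL_SCRAPER_PROXY_URL and MALL_SCRAPER_TOKEN, or OPENAI_API_KEY."
  );
}

// Example with an explicit env object; real code would pass process.env.
const auth = resolveAuthMode({
  MALL_SCRAPER_PROXY_URL: "https://mall-scraper-openai-proxy.YOURSUB.workers.dev",
  MALL_SCRAPER_TOKEN: "YOUR_SHARED_SECRET",
  OPENAI_API_KEY: "sk-...",
});
console.log(auth.mode); // proxy wins when both are configured
```

Checking the proxy pair first means a stray OPENAI_API_KEY in a coworker's shell does not silently bypass the Worker.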

Troubleshooting

Logo URL returns HTML in coworker's CMS: Ask Claude to run validate_image_url on the failing URL to confirm whether the URL itself returns a real image. If it does, the issue is on the CMS side (the shopcurrents-style empty property_manager_id case is a known example).
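
One way to run the same kind of check by hand is to look at the first bytes of the response instead of trusting the URL's extension. The sketch below is not the validate_image_url implementation. The function names are hypothetical, and it only recognizes PNG, JPEG, and GIF signatures.

```javascript
// Sketch: decide whether a response body is really an image by checking
// its leading "magic" bytes. Covers PNG, JPEG, and GIF only.
const SIGNATURES = [
  { type: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47] },
  { type: "image/jpeg", bytes: [0xff, 0xd8, 0xff] },
  { type: "image/gif", bytes: [0x47, 0x49, 0x46, 0x38] },
];

function sniffImageType(body) {
  for (const { type, bytes } of SIGNATURES) {
    if (bytes.every((b, i) => body[i] === b)) return type;
  }
  return null; // e.g. an HTML error page served from a .png URL
}

// Network wrapper using the global fetch available in Node 18+.
// Defined here but not called, so the sketch runs without network access.
async function checkImageUrl(url) {
  const response = await fetch(url);
  const body = new Uint8Array(await response.arrayBuffer());
  return {
    status: response.status,
    contentType: response.headers.get("content-type"),
    sniffedType: sniffImageType(body),
  };
}

const htmlBytes = new TextEncoder().encode("<!doctype html><html></html>");
console.log(sniffImageType(htmlBytes)); // null, because HTML is not an image
```

A null sniffedType alongside a 200 status is the typical signature of a URL that returns an HTML page where an image was expected.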

Coworker gets "unauthorized" from the Worker: Their MALL_SCRAPER_TOKEN doesn't match the Worker's current SHARED_SECRET. Either update the token in their claude mcp add configuration, or run npx wrangler secret put SHARED_SECRET so the Worker matches the token they already have.

First scrape takes 2-3 minutes: Puppeteer is downloading Chromium on first run (~170 MB). Subsequent scrapes run at normal speed.

npx mallmaverick-store-scraper not found: They need Node 18+ on their PATH. Run node --version to check.

License

MIT.