# dory-worker

BullMQ job consumer for the Dory web scraping platform. Runs on any machine with Docker — including a Raspberry Pi. Pulls scraping jobs off a shared Redis queue and executes them by launching dory-core containers locally.

npm: `@nikx/[email protected]`
## Architecture

```
dory-api (Railway)
 └─ enqueues job → BullMQ (Redis)
     └─ dory-worker (your home machine / Pi)
         │  GET  /api/runs/:id/config
         │  POST /api/runs/:id/status  (running / completed / failed)
         │
         ├─ Single-container mode (containerCount = 1)
         │   └─ docker run dory-core:v2
         │       └─ Crawlee in-memory queue
         │
         └─ Distributed mode (containerCount > 1)
             ├─ docker run dory-core:v2 × N
             │   ├─ REDIS_URL=redis://host.docker.internal:6379
             │   ├─ QUEUE_NAME=<runId>        ← job-scoped, isolated
             │   ├─ WORKER_ID=worker-1..N
             │   └─ IDLE_TIMEOUT_SECS=60
             │
             └─ Shared Redis queue (rq:<queueId>:*)
                 ├─ :meta      queue metadata
                 ├─ :requests  all URLs ever added (Hash)
                 ├─ :ordering  Lua-locked sorted set
                 └─ :handled   completed requestIds (Set)
```

## Distributed queue internals
| Concern | Mechanism |
|---------|-----------|
| Deduplication | `SHA-256(uniqueKey).slice(0, 15)` → `requestId`; `HGET :requests` guard before any write |
| Atomic locking | Lua script `LUA_LIST_AND_LOCK` — `ZADD` score = ±`lockExpiresAt`; no two containers claim the same URL |
| Retry | Crawlee increments `retryCount` and re-enqueues until `maxRequestRetries`; once exhausted → `SADD :handled` + `errorMessages` |
| Idle shutdown | `IDLE_TIMEOUT_SECS = min(60, actorTimeoutSecs / 2)`; containers exit cleanly when the queue drains |
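The deduplication row can be made concrete with a short sketch. This is one plausible reading of `SHA-256(uniqueKey).slice(0,15)` (a truncated hex digest), not dory-core's verbatim code, and `toRequestId` is a hypothetical name:

```typescript
import { createHash } from "node:crypto";

// Sketch: derive a deterministic requestId from a request's uniqueKey
// (typically its normalized URL). Assumes a truncated hex digest.
export function toRequestId(uniqueKey: string): string {
  return createHash("sha256").update(uniqueKey).digest("hex").slice(0, 15);
}

// A container would then guard writes with
//   HGET rq:<queueId>:requests <requestId>
// and skip the enqueue if the field already exists, so the same URL
// added by three different workers is stored exactly once.
```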
## Prerequisites

- Node.js ≥ 20
- Docker, with access to the `dory-core:v2` image (build locally or pull from a registry)
- Redis (local container or remote — the same instance used by dory-api)
## Quick Start

### 1. Install

```bash
npm install -g @nikx/dory-worker
```

Or run from source:

```bash
git clone https://github.com/your-org/dory-worker
cd dory-worker
npm install
```

### 2. Configure

```bash
cp .env.example .env
```

Edit `.env`:

```bash
# Required — public URL of dory-api (must be reachable from Docker containers)
API_BASE_URL=https://your-api.railway.app

# Redis — Option A: full URL (recommended for Railway)
REDIS_URL=redis://default:[email protected]:6379

# Redis — Option B: host + port
REDIS_HOST=localhost
REDIS_PORT=6379

# Optional — single-container mode default (overridden per-actor by dory-api)
CONTAINER_COUNT=1

# Distributed mode — Redis the crawling containers share
# Must be reachable from INSIDE Docker containers on this machine
# e.g. redis://host.docker.internal:6379 for a local Redis
CRAWLER_REDIS_URL=redis://host.docker.internal:6379

# Worker concurrency — keep at 1-2 for Raspberry Pi
MAX_CONCURRENT_RUNS=2

# Fallback image if dory-api doesn't return one
DOCKER_IMAGE=dory-core:v2
```

### 3. Run

```bash
# From npm package
dory-worker

# From source
npm run dev

# Built
npm run build && npm start
```

## Environment Variables
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| API_BASE_URL | ✅ | — | dory-api URL (reachable from Docker containers) |
| REDIS_URL | one of | — | Full Redis URL |
| REDIS_HOST | one of | localhost | Redis hostname |
| REDIS_PORT | — | 6379 | Redis port |
| REDIS_PASSWORD | — | — | Redis password |
| CRAWLER_REDIS_URL | distributed | — | Redis for the per-run crawler queue |
| CONTAINER_COUNT | — | 1 | Containers per job (overridden by dory-api per actor) |
| MAX_CONCURRENT_RUNS | — | 2 | Parallel BullMQ jobs |
| WORKER_ID | — | dory-worker-{pid} | Label shown in logs |
| LOG_LEVEL | — | info | debug \| info \| warn \| error |
| DOCKER_IMAGE | — | — | Fallback image if API doesn't return one |
| GCS_BUCKET | — | — | Passed through to containers for result uploads |
| GCP_PROJECT_ID | — | — | Passed through to containers |
| GOOGLE_APPLICATION_CREDENTIALS | — | — | Path to GCP service account JSON file |
| GOOGLE_APPLICATION_CREDENTIALS_JSON | — | — | Full service account JSON string (Railway / CI — written to /tmp/gcp-dory-credentials.json on startup) |
| STORAGE_EMULATOR_HOST | — | — | fake-gcs-server URL (local dev) |
| QUEUE_RUN_EXECUTION | — | run-execution | BullMQ queue name for scraping jobs — must match the API's value |
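To illustrate the Option A / Option B precedence in the table, here is a minimal sketch of the resolution logic. `resolveRedis` is a hypothetical helper, loosely in the spirit of `loadConfig()` in `config.ts`, not its actual code:

```typescript
// Sketch: resolve Redis connection settings from the variables above.
// REDIS_URL (Option A) takes precedence; otherwise fall back to
// REDIS_HOST/REDIS_PORT (Option B) with their documented defaults.
export type RedisSettings =
  | { url: string }
  | { host: string; port: number; password?: string };

export function resolveRedis(
  env: Record<string, string | undefined>,
): RedisSettings {
  if (env.REDIS_URL) {
    // Option A: a full URL wins (recommended for Railway).
    return { url: env.REDIS_URL };
  }
  // Option B: host + port, defaulting to localhost:6379.
  return {
    host: env.REDIS_HOST ?? "localhost",
    port: Number(env.REDIS_PORT ?? 6379),
    password: env.REDIS_PASSWORD,
  };
}
```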
## How a Job Flows

1. dory-api enqueues a BullMQ job `{ runId }` onto the `run-execution` queue (configurable via `QUEUE_RUN_EXECUTION`).
2. Worker picks up the job and calls `GET /api/runs/:id/config` to get `actorConfig`, `dockerImage`, `containerCount`, `memoryLimitMb`, `actorTimeoutSecs`.
3. Worker calls `POST /api/runs/:id/status` → `{ status: "running" }`.
4. Worker calls `docker run` (once for single-container mode, N times for distributed). Each container receives:
   - `ACTOR_CONFIG` — base64-encoded actor/user-input JSON
   - `API_BASE_URL` — so the container can POST status callbacks
   - `CRAWLEE_MEMORY_MBYTES` — from `memoryLimitMb`
   - (distributed only) `REDIS_URL`, `QUEUE_NAME`, `WORKER_ID`, `IDLE_TIMEOUT_SECS`
5. Worker extends the BullMQ lock every 2 minutes while containers run.
6. Worker calls `docker wait` on all containers in parallel and uses the worst exit code.
7. Worker calls `POST /api/runs/:id/status` → `{ status: "completed" | "failed", exitCode }` — only as a fallback, if no HTTP callback arrived.
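The per-container environment in step 4 can be sketched as follows. `buildContainerEnv` is an illustrative name, and only the single-container variables are shown:

```typescript
// Sketch: the env vars injected into each dory-core container.
// Base64-encoding ACTOR_CONFIG lets arbitrary JSON pass safely
// through `docker run -e KEY=VALUE`.
export function buildContainerEnv(opts: {
  actorConfig: unknown;   // actor/user-input JSON from /api/runs/:id/config
  apiBaseUrl: string;     // so the container can POST status callbacks
  memoryLimitMb: number;
}): Record<string, string> {
  return {
    ACTOR_CONFIG: Buffer.from(JSON.stringify(opts.actorConfig)).toString("base64"),
    API_BASE_URL: opts.apiBaseUrl,
    CRAWLEE_MEMORY_MBYTES: String(opts.memoryLimitMb),
  };
}
```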
**containerCount precedence:** dory-api `/config` response > `CONTAINER_COUNT` env var > default `1`.
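That chain can be expressed as a tiny helper; `resolveContainerCount` is a hypothetical name, not necessarily what processor.ts calls it:

```typescript
// Sketch: containerCount precedence — API /config response first,
// then the CONTAINER_COUNT env var, then the default of 1.
export function resolveContainerCount(
  apiValue: number | undefined,   // from GET /api/runs/:id/config
  envValue: string | undefined,   // process.env.CONTAINER_COUNT
): number {
  if (apiValue !== undefined) return apiValue;
  if (envValue !== undefined) return Number(envValue);
  return 1;
}
```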
**localhost rewriting:** `API_BASE_URL` and `CRAWLER_REDIS_URL` values containing `localhost` are automatically rewritten to `host.docker.internal` before being injected into containers.
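A minimal sketch of that rewrite, assuming a plain string replacement (the real code may parse the URL more carefully):

```typescript
// Sketch: point loopback URLs at Docker's host alias so containers
// on this machine can reach services bound on the host.
export function rewriteLocalhost(url: string): string {
  return url.replace(/localhost|127\.0\.0\.1/, "host.docker.internal");
}
```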
## Source Layout

```
src/
  cli.ts               Entry point — loads config, starts BullMQ Worker
  config.ts            WorkerConfig interface + loadConfig() from env vars
  worker.ts            BullMQ Worker setup, concurrency, graceful shutdown
  processor.ts         Core job handler — fetch config, spawn containers, wait
  docker.ts            docker run / docker wait wrappers; DistributedOpts
  logger.ts            Structured logger with log levels
test/
  harness.ts           Standalone test harness — mock API + real worker + Redis inspection
  redis-inspector.ts   Post-run queue inspector — reads rq:* keys, returns metrics
scripts/
  run-all-tests.ts     13-scenario E2E suite runner → writes E2E-TEST-REPORT.md
test-image/
  Dockerfile           Minimal test image used by the harness in CI
```

## Testing
### Run a single scenario
```bash
# Minimal (single container, empty handlers — validates worker lifecycle)
npm test

# Real cheerio crawl (quotes.toscrape.com, 10 pages)
SCENARIO=real-crawl npm test

# Distributed mode (2 containers)
npm run test:distributed

# Distributed, 3 containers, 50 pages
DISTRIBUTED=true CONTAINER_COUNT=3 SCENARIO=dist-large npm test

# Deduplication — triplicate seed URLs
DISTRIBUTED=true SCENARIO=dedup npm test

# Retry on failure — handler throws on page 2
DISTRIBUTED=true SCENARIO=retry-failure npm test

# API failure resilience
SCENARIO=api-error EXPECT_FAILURE=true npm test
```

Valid `SCENARIO` values: `minimal`, `real-crawl`, `large-crawl`, `distributed`, `dist-large`, `dedup`, `retry-failure`, `api-error`, `missing-redis`.
### Run the full 13-scenario suite

```bash
npm run test:all
```

Results are written to `E2E-TEST-REPORT.md`.
### E2E test results (v1.0.3)

| # | Category | Scenario | Result | Duration |
|---|----------|----------|--------|----------|
| T01 | happy-path | Single-container · minimal | ✅ | 6.1s |
| T02 | happy-path | Single-container · 10-page crawl | ✅ | 6.8s |
| T03 | happy-path | Single-container · 50-page crawl | ✅ | 7.3s |
| T04 | distribution | Distributed · 2 containers · 10 pages | ✅ 0% skew | 67.3s |
| T05 | distribution | Distributed · 3 containers · 10 pages | ✅ 10% skew | 68.1s |
| T06 | distribution | Distributed · 2 containers · 50 pages | ✅ 0% skew | 9.1s |
| T07 | distribution | Distributed · 3 containers · 50 pages | ✅ 0% skew | 8.7s |
| T08 | correctness | Deduplication · triplicate seed | ✅ | 66.6s |
| T09 | correctness | Retry on failure · handler throws | ✅ | 67.0s |
| T10 | resilience | API /config returns 500 | ✅ | 3.1s |
| T11 | resilience | Non-existent Docker image | ✅ | 2.5s |
| T12 | resilience | Distributed · missing CRAWLER_REDIS_URL | ✅ | 2.1s |
| T13 | resilience | containerCount precedence | ✅ | 7.1s |

13/13 passed — 321.8s total. See `E2E-TEST-REPORT.md` for full metrics, including per-worker URL counts, deduplication proof, and retry traces.
## Building & Publishing

```bash
npm run build                 # compile src/ → dist/
npm publish --access public
```

The published package exposes `dist/cli.js` as the `dory-worker` binary.
