@bsbofmusic/cdper-runtime-reddit
v1.1.3
Published
Reddit research runtime for cdper: collect, analyze, and export structured Reddit data
Maintainers
Readme
@bsbofmusic/cdper-runtime-reddit
Reddit research toolkit for authenticated CDP browser sessions.
Version: 8.4.0 | Runtime: Node.js ≥18 | Browser: Puppeteer-compatible CDP endpoint
Release posture
This package is usable for maintainer-operated deployments, but it is not a fully generalized product package yet.
What is verified:
- Live collection through a reachable authenticated CDP browser session
- Database writes through
docker exec -i <container> psql ... - Queue-based scheduling and npm bin packaging presence
- A real smoke run on 2026-03-23 for project
piercing_pillow
What is not fully verified:
- Automated tests are not shipped
- Cross-environment install validation beyond the maintainer environment
- Consistent
run_idworkflows across every downstream report/export/dedup path - Watcher / cron behavior as a polished end-user product feature
Do not market this release as “production-ready for all environments”.
Overview
@bsbofmusic/cdp-reddit collects Reddit posts/comments through an authenticated CDP browser session, stores results in PostgreSQL, and supports a research workflow of collection → annotation → export.
Current scope in this package:
- Collection with nested comment traversal and Reddit-native
depth - Queue-based scheduling with configurable CDP concurrency
- Rule-based / semi-automatic JTBD annotation helpers
- CSV / JSON export and data-quality utilities
Important current limitations:
run_idis written by collector / scheduler and schema support exists, but downstream tooling is still not fully normalized aroundrun_id- Some utilities already share
cdp-reddit-config.js; others still use older inline defaults and are not fully productized - Watcher / cron examples are operational recipes, not auto-installed package features
Installation
npm install @bsbofmusic/cdp-reddit puppeteerThis package expects:
- a reachable CDP browser WebSocket endpoint
- a PostgreSQL instance accessible via
docker exec -i <container> psql ... - an existing schema compatible with the shipped migration files
CLI usage
All binaries are exposed through npm bin entries and can also be run with node inside the package directory.
# Register a new project
cdp-reddit-collect \
--register \
--project=my_project \
--label="Dog toy JTBD research" \
--keywords="dog toy,best chew toy" \
--sub=dogtoys \
--threshold=15
# Collect with stored project config
cdp-reddit-collect --project=my_project
# Collect with explicit version / run id
cdp-reddit-collect \
--project=my_project \
--version=v1_research \
--run-id=my_project_20260323_181500_a7f2 \
--keywords='["dog toy","chew toy"]' \
--threshold=15
# Enqueue through the scheduler
cdp-reddit-scheduler --enqueue --project=my_project --version=v1
# Queue status
cdp-reddit-scheduler --status
# JTBD tagging
cdp-reddit-jtbd --project=my_project --min-score=15
cdp-reddit-jtbd-autonomous --project=my_project --batch-size=30
# Export
cdp-reddit-export --project=my_project --format=csv --min-score=10
cdp-reddit-export --project=my_project --format=json --jtbd-only
# Quality / maintenance
cdp-reddit-freshness --project=my_project --threshold=10
cdp-reddit-backfill-post --project=my_project
cdp-reddit-dedup --project=my_project
cdp-reddit-report --project=my_project --stats-onlyIf you prefer direct node execution, run the corresponding *.js file in this directory.
Configuration
Environment variables
| Variable | Default | Description |
|----------|---------|-------------|
| CDP_REDDIT_CDP_WS | none | CDP browser WebSocket endpoint; must be set for real collection |
| CDP_REDDIT_DB_CONTAINER | memos-postgres | PostgreSQL Docker container name |
| CDP_REDDIT_DB_USER | memos | DB username |
| CDP_REDDIT_DB_NAME | memos | Database name |
| CDP_REDDIT_DEFAULT_SUBREDDIT | piercing | Default subreddit when project config is absent |
| CDP_REDDIT_MAX_CDP_SESSIONS | 2 | Scheduler concurrency cap |
| CDP_REDDIT_WORKDIR | package dir | Scheduler working directory |
| CDP_REDDIT_REPORT_FILE | cdp-reddit-report.md | Report append target |
| CDP_REDDIT_WATCHER_STATE_FILE | /tmp/.cdp_watcher_state | Watcher state file |
| CDP_REDDIT_WATCHER_MAX_IDLE_ROUNDS | 2 | Watcher completion threshold |
| CDP_REDDIT_WATCHER_MINUTES | 30 | Watcher idle alert threshold |
| CDP_REDDIT_WATCHER_PROJECTS | empty | Optional comma-separated project list for watcher summary output |
Notes:
cdp-reddit-collect.js,cdp-reddit-scheduler.js, andcdp-reddit-report-gen.jsalready consume shared config fromcdp-reddit-config.js- Other scripts still contain inline defaults and should be reviewed before broader packaging
- Any built-in default CDP endpoint in source should be overridden in real deployments
Database connection
Current implementation shells out to:
docker exec -i <container> psql -U <user> -d <db>This package does not yet provide a direct TCP PostgreSQL client path for all scripts.
Data model
Semantic layers
| Field | Scope | Description |
|-------|-------|-------------|
| project | Research boundary | Data isolation key |
| version | Batch label | Human-readable strategy / keyword-set label per run |
| data_source | Compatibility layer | Currently equals version; stored in DB data_source column |
| run_id | Single execution instance | Unique execution id written by collector / scheduler |
Current run_id status
- ✅
reddit_crawl_log.run_idexists and is written by the collector - ✅
reddit_comments.run_idexists in schema and is written by the collector - ✅
reddit_posts.run_idexists in schema and is written by the collector - ⚠️ Report / export / dedup paths have
--run-idsupport, but maintainers should still verify the exact output path they rely on before external release claims
Key tables
| Table | Status |
|-------|--------|
| reddit_comments | Main collected comments table |
| reddit_posts | Post metadata table |
| crawl_projects | Registered projects and defaults |
| reddit_crawl_log | Per-run execution log; authoritative source for current run_id metrics |
See CHANGELOG_PHASE1.md and migrations/2026-03-23-add-run-id.sql for the current schema transition notes.
Included scripts
| Script | Purpose |
|--------|---------|
| cdp-reddit-collect.js | Main collector |
| cdp-reddit-scheduler.js | Queue / concurrency scheduler |
| cdp-reddit-watcher.sh | Optional watchdog script for local ops use |
| cdp-reddit-keyword.js | Keyword pre-analysis |
| cdp-reddit-jtbd.js | Rule-based JTBD tagging |
| cdp-reddit-jtbd-autonomous.js | Batch JTBD tagging |
| cdp-reddit-export.js | CSV / JSON export |
| cdp-reddit-freshness.js | Freshness detection |
| cdp-reddit-dedup.js | Deduplication / version cleanup |
| cdp-reddit-report-gen.js | Data quality report generator |
| cdp-reddit-backfill-post.js | Historical post metadata backfill |
Smoke checklist
Use this before publishing or tagging a release.
1) Install/package smoke
npm install
npm run smoke:installConfirms the declared npm bin targets exist.
2) Environment smoke
Prepare .env from .env.example and set a real CDP endpoint.
Minimum required values:
CDP_REDDIT_CDP_WSCDP_REDDIT_DB_CONTAINERCDP_REDDIT_DB_USERCDP_REDDIT_DB_NAME
3) Live collector smoke
cd /path/to/cdp-reddit
set -a
source ./.env
set +a
node cdp-reddit-collect.js \
--project=smoke_project \
--version=smoke_v1 \
--run-id=smoke_project_$(date +%Y%m%d_%H%M%S) \
--keywords='["test keyword"]' \
--threshold=1Verify all of the following manually:
- CDP bridge is reachable
- browser is actually ready, not just the bridge process
- collector exits successfully
reddit_crawl_logcontains the expectedrun_id- inserted rows exist in
reddit_commentsand/orreddit_posts
4) Scheduler/report smoke
node cdp-reddit-scheduler.js --status
node cdp-reddit-report-gen.js --project=smoke_project --stats-only5) Optional watcher smoke
CDP_REDDIT_WATCHER_PROJECTS="smoke_project" bash cdp-reddit-watcher.shTreat watcher output as ops assistance, not as release-grade verification by itself.
Release steps
- Confirm package metadata in
package.json - Confirm
.env.examplematches the current env contract - Run the smoke checklist above in the target environment
- Confirm required migrations are present and applied
- Review
CHANGELOG_PHASE1.mdso release notes match current facts - Publish only after repository/homepage fields are filled with real values
Example publish flow:
npm pack
npm publishOnly publish after maintainer review of metadata, access, and release notes.
Operational notes
Watcher
cdp-reddit-watcher.sh is included as an ops helper script. It is shipped as a CLI entry, but it should still be treated as environment-specific operations tooling rather than a polished general-purpose watcher.
Current behavior to know before publishing:
- it reads whole-table counts from
reddit_comments - it can optionally print per-project summaries via
CDP_REDDIT_WATCHER_PROJECTS="proj_a,proj_b" - when no watcher project list is provided, it falls back to whole-table totals only
- it does not prove cron is installed or that a deployment is self-managing
Example:
CDP_REDDIT_WATCHER_PROJECTS="my_project,backup_project" bash cdp-reddit-watcher.shCron metadata
package.json contains cron metadata as suggestions only. They are not proof that cron jobs are installed in the current environment.
Live smoke result
Validated on 2026-03-23 with a real piercing_pillow smoke run:
- CDP bridge reachable but browser initially not ready
- Browser successfully woken via
/control/start?mode=advanced&profile=Default - Collector completed with
run_id = smoke_piercing_pillow_20260323_f6 reddit_crawl_log.comments_new = 3reddit_commentsinserted rows for that run =3
Productization gaps
Before promoting this as a broadly reusable npm package, validate:
- repository / homepage metadata
- whether the default CDP endpoint should be removed from source defaults
- whether watcher behavior should stay shipped, be generalized, or be documented as example ops tooling only
Known limitations
- No automated test suite is shipped yet
- Some scripts still rely on environment-specific assumptions or legacy defaults
- Direct PostgreSQL connectivity is not standardized across scripts
- Release claims should stay limited to maintainer-verified workflows unless broader validation is completed
Files
README.md— package overview, smoke checklist, and release posturepackage.json— npm metadata and CLI entry points.env.example— deploy-time environment templateCHANGELOG_PHASE1.md— current implementation status for Phase 1migrations/2026-03-23-add-run-id.sql— current shipped migration
