reddit-harvest

v0.1.0

Published

16 days ago

Harvest subreddit posts into structured corpus files for product research, with filtering, deduplication, OpenAI analysis, and interactive exploration.


roseandcolumns

reddit snoowrap openai product-research research cli harvest scraper analysis

reddit-harvest

Harvest subreddit posts and comments into structured corpus files for product research, with advanced filtering, deduplication, OpenAI-powered analysis, and an interactive terminal explorer.

Features

📥 Harvest posts from multiple subreddits (hot, new, top, or search)
🔍 Filter by score, comments, date range
🔄 Deduplicate across runs to avoid re-harvesting
📄 Export as plain text or structured JSONL
🤖 Analyze with OpenAI to extract pain points, personas, and product opportunities
🧭 Explore results interactively in your terminal

Installation

npm install -g reddit-harvest

Or with pnpm:

pnpm add -g reddit-harvest

Quick Start

1. Set up credentials

Create a .env file (or copy from env.example):

# Reddit API (required)
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
REDDIT_REFRESH_TOKEN=your_refresh_token
REDDIT_USER_AGENT=reddit-harvest/1.0

# OpenAI (optional, for analysis)
OPENAI_API_KEY=your_openai_key
OPENAI_MODEL=gpt-4o-mini
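
Before the first harvest it can help to confirm that the required variables are actually set. The following is a minimal sketch, not part of reddit-harvest: `missingRedditVars` is a hypothetical helper, and it assumes the variables have already been loaded into `process.env` (for example by a dotenv-style loader).

```javascript
// Variables this README lists as required for harvesting.
const REQUIRED_REDDIT_VARS = [
  'REDDIT_CLIENT_ID',
  'REDDIT_CLIENT_SECRET',
  'REDDIT_REFRESH_TOKEN',
  'REDDIT_USER_AGENT',
];

// Hypothetical helper (not part of reddit-harvest): returns the names of
// required variables that are unset or blank in the given environment object.
function missingRedditVars(env) {
  return REQUIRED_REDDIT_VARS.filter(
    (name) => typeof env[name] !== 'string' || env[name].trim() === ''
  );
}

const missing = missingRedditVars(process.env);
if (missing.length > 0) {
  console.warn(`Missing Reddit credentials: ${missing.join(', ')}`);
}
```

Passing the environment in as an argument keeps the check easy to test with a plain object.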

2. Harvest posts

reddit-harvest harvest --subreddits "startups,Entrepreneur" --limit 50

3. Analyze with OpenAI

reddit-harvest harvest --subreddits "startups" --limit 50 --analyze

4. Explore results

reddit-harvest explore --latest

Commands

`harvest` - Download subreddit content

reddit-harvest harvest --subreddits "startups,SaaS" --listing top --time week --limit 100

Options:

| Flag | Default | Description |
|------|---------|-------------|
| --subreddits | required | Comma-separated list of subreddits |
| --listing | hot | hot, new, or top |
| --time | week | Time range for top: hour, day, week, month, year, all |
| --limit | 25 | Max posts per subreddit |
| --search | - | Search query (uses Reddit search instead of listing) |
| --minScore | - | Skip posts below this score |
| --minComments | - | Skip posts with fewer comments |
| --after | - | Only posts after this date (ISO format) |
| --before | - | Only posts before this date (ISO format) |
| --includeComments | false | Include top-level comments |
| --commentLimit | 50 | Max comments per post |
| --format | txt | Output format: txt or jsonl |
| --dedupe | false | Skip previously harvested posts |
| --analyze | false | Run OpenAI analysis after harvest |
| --quoteFidelity | false | Require supporting quotes for all claims |
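
To make the filter flags concrete, here is a rough sketch of how `--minScore`, `--minComments`, `--after`, and `--before` could combine. It is an illustration only, not the package's implementation: the post fields (`score`, `numComments`, `createdUtc` in epoch seconds) are assumed names, and treating both date bounds as exclusive is also an assumption.

```javascript
// Hypothetical post shape: { score, numComments, createdUtc }, where
// createdUtc is seconds since the Unix epoch. Real field names may differ.
function passesFilters(post, { minScore, minComments, after, before } = {}) {
  const createdMs = post.createdUtc * 1000;
  if (minScore !== undefined && post.score < minScore) return false;
  if (minComments !== undefined && post.numComments < minComments) return false;
  // Assumption: both date bounds are exclusive.
  if (after !== undefined && createdMs <= Date.parse(after)) return false;
  if (before !== undefined && createdMs >= Date.parse(before)) return false;
  return true;
}

const samplePosts = [
  { score: 12, numComments: 4, createdUtc: 1704153600 },
  { score: 2, numComments: 0, createdUtc: 1704153600 },
];
const kept = samplePosts.filter((p) => passesFilters(p, { minScore: 5 }));
console.log(`${kept.length} of ${samplePosts.length} posts kept`);
```

An unset flag places no constraint, which matches the "-" defaults in the table.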

`analyze` - Analyze existing corpus

reddit-harvest analyze --input outputs/corpus.jsonl

Options:

| Flag | Default | Description |
|------|---------|-------------|
| --input | required | Path to corpus file (.txt or .jsonl) |
| --outDir | outputs | Output directory |
| --quoteFidelity | false | Require supporting quotes |
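
A .jsonl corpus holds one JSON object per line, so it is straightforward to inspect outside the CLI. The sketch below parses JSONL text that has already been read from disk; `parseJsonl` is a hypothetical helper, and the `id` and `title` fields in the sample are placeholders rather than the documented corpus schema.

```javascript
// Parse JSONL text: one JSON object per non-empty line.
// Hypothetical helper, not an export of reddit-harvest.
function parseJsonl(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== '')
    .map((line, index) => {
      try {
        return JSON.parse(line);
      } catch (err) {
        // index counts non-empty records, not raw file lines.
        throw new Error(`Invalid JSON in record ${index + 1}: ${err.message}`);
      }
    });
}

// Placeholder records; the real corpus fields may differ.
const sampleText = '{"id":"a1","title":"First"}\n\n{"id":"b2","title":"Second"}\n';
const records = parseJsonl(sampleText);
console.log(`Parsed ${records.length} records`);
```

In practice the text would come from reading the corpus file as UTF-8 with Node's `fs` module.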

`explore` - Interactive browser

reddit-harvest explore --latest

Options:

| Flag | Default | Description |
|------|---------|-------------|
| --dir | outputs | Directory containing analysis files |
| --latest | false | Auto-select most recent analysis |

Output Files

After running with --analyze, you get:

| File | Description |
|------|-------------|
| <timestamp>-r_<subreddit>.txt | Raw corpus (or .jsonl) |
| <timestamp>-analysis.md | Full research synthesis |
| <timestamp>-opportunities.json | Structured product opportunities |

Opportunities JSON structure

[{
  "id": "opp-1",
  "title": "Automated customer discovery tool",
  "targetUser": "Solo founders",
  "problem": "Spending too much time on manual outreach",
  "currentWorkaround": "Cold emails and LinkedIn DMs",
  "proposedSolution": "AI-powered lead qualification",
  "confidence": "medium",
  "supportingQuotes": [{ "text": "I spend 4 hours a day...", "permalink": "..." }],
  "risks": ["Crowded market"],
  "mvpExperiment": "Landing page with email capture"
}]
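
Once the opportunities file is parsed, a short filter can surface the entries worth a closer look. This is a sketch under assumptions: the example above only shows `"confidence": "medium"`, so the `low` and `high` values and their ordering are guesses, and `shortlist` is a hypothetical helper.

```javascript
// Assumed ordering of confidence labels; only "medium" appears in the README.
const CONFIDENCE_RANK = { low: 0, medium: 1, high: 2 };

// Keep opportunities at or above a confidence level that also carry at
// least one supporting quote, and reduce them to a compact summary.
function shortlist(opportunities, minConfidence = 'medium') {
  const threshold = CONFIDENCE_RANK[minConfidence];
  return opportunities
    .filter((opp) => (CONFIDENCE_RANK[opp.confidence] ?? -1) >= threshold)
    .filter((opp) => Array.isArray(opp.supportingQuotes) && opp.supportingQuotes.length > 0)
    .map((opp) => ({ id: opp.id, title: opp.title, quotes: opp.supportingQuotes.length }));
}

const sampleOpportunities = [
  { id: 'opp-1', title: 'Automated customer discovery tool', confidence: 'medium',
    supportingQuotes: [{ text: 'I spend 4 hours a day...', permalink: '...' }] },
  { id: 'opp-2', title: 'Unsupported idea', confidence: 'high', supportingQuotes: [] },
];
console.log(shortlist(sampleOpportunities));
```

Entries with an unrecognized confidence label are dropped rather than guessed at.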

Examples

Full product research workflow

# Harvest with filters and analysis
reddit-harvest harvest \
  --subreddits "startups,Entrepreneur,SaaS" \
  --listing top \
  --time month \
  --limit 100 \
  --minScore 5 \
  --includeComments \
  --format jsonl \
  --dedupe \
  --analyze \
  --quoteFidelity

# Explore the results
reddit-harvest explore --latest

Daily harvesting with deduplication

# First run
reddit-harvest harvest --subreddits "startups" --limit 100 --dedupe --format jsonl

# Later runs skip already-harvested posts
reddit-harvest harvest --subreddits "startups" --limit 100 --dedupe --format jsonl
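
Conceptually, deduplication across runs amounts to remembering which post IDs have been seen. The sketch below illustrates that idea only; how reddit-harvest actually stores its dedupe state is not described here, and both the `id` field and the `dedupePosts` helper are hypothetical.

```javascript
// Hypothetical: each post has a stable `id`, and the caller persists the
// set of seen IDs between runs (for example as a JSON array on disk).
function dedupePosts(posts, seenIds) {
  const fresh = [];
  for (const post of posts) {
    if (seenIds.has(post.id)) continue; // already harvested earlier
    seenIds.add(post.id);
    fresh.push(post);
  }
  return fresh;
}

const seen = new Set();
const firstRun = dedupePosts([{ id: 'p1' }, { id: 'p2' }], seen);
const secondRun = dedupePosts([{ id: 'p2' }, { id: 'p3' }], seen);
console.log(firstRun.length, secondRun.length);
```

Serializing the set with `JSON.stringify([...seen])` is one simple way to carry it from one run to the next.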

Search for specific topics

reddit-harvest harvest \
  --subreddits "startups" \
  --search "finding first customers" \
  --limit 50 \
  --analyze

Programmatic Usage

import {
  createRedditClient,
  harvestSubredditsToFiles,
  analyzeCorpus
} from 'reddit-harvest';

// Harvest
const reddit = createRedditClient();
const result = await harvestSubredditsToFiles({
  reddit,
  subreddits: ['startups'],
  outDir: './outputs',
  limit: 50,
  format: 'jsonl'
});

// Analyze
const analysis = await analyzeCorpus({
  posts: result.allPosts,
  subreddits: ['startups'],
  outDir: './outputs'
});

console.log(analysis.opportunities);

Reddit API Setup

1. Go to Reddit Apps
2. Create a "script" type application
3. Note your client_id and client_secret
4. Generate a refresh token using the OAuth flow

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| REDDIT_CLIENT_ID | Yes | Reddit app client ID |
| REDDIT_CLIENT_SECRET | Yes | Reddit app client secret |
| REDDIT_REFRESH_TOKEN | Yes | OAuth refresh token |
| REDDIT_USER_AGENT | Yes | User agent string |
| OPENAI_API_KEY | For analysis | OpenAI API key |
| OPENAI_MODEL | No | Model to use (default: gpt-4o-mini) |

Notes

Rate limits: Reddit rate-limits API requests. By default the tool waits 1100 ms between requests.
API costs: OpenAI analysis costs money. Use --limit to control corpus size.
PII: Be careful what you store/share from Reddit content.
Reddit ToS: Don't use for spam, harassment, or violating Reddit's terms.
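
For scripts that make their own Reddit requests alongside the CLI, the rate-limit note can be expressed as a small spacing calculation. This is a sketch rather than the package's throttling code: `msUntilNextRequest` is a hypothetical helper, and only the 1100 ms figure comes from this README.

```javascript
const MIN_INTERVAL_MS = 1100; // default delay quoted in the rate-limit note

// Returns how long to wait before the next request so that consecutive
// requests start at least minIntervalMs apart. Times are in milliseconds.
function msUntilNextRequest(lastRequestAt, now, minIntervalMs = MIN_INTERVAL_MS) {
  if (lastRequestAt === null) return 0; // nothing sent yet
  return Math.max(0, lastRequestAt + minIntervalMs - now);
}

const waitMs = msUntilNextRequest(Date.now() - 500, Date.now());
console.log(`Wait about ${waitMs} ms before the next request`);
```

A caller would typically pause for the returned duration (for example with `setTimeout`) and then record the new request time.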

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
