postcrawl

v1.2.0

Published

4 months ago

Node.js/TypeScript SDK for PostCrawl - The Fastest LLM Ready Social Media Crawler

0High
0Medium
0Low

kdcokenny

postcrawl social-media scraping reddit tiktok api sdk typescript nodejs

PostCrawl Node.js SDK

Official Node.js/TypeScript SDK for PostCrawl - The Fastest LLM-Ready Social Media Crawler. Extract and search content from Reddit and TikTok with a simple, type-safe TypeScript interface.

Features

🔍 Search across Reddit and TikTok with advanced filtering
📊 Extract content from social media URLs with optional comments
🚀 Combined search and extract in a single operation
🏷️ Type-safe with full TypeScript support and Zod validation
⚡ Promise-based API with async/await support
🛡️ Comprehensive error handling with detailed exceptions
📈 Rate limiting support with credit tracking
🔄 Automatic retries for network errors
🎯 Platform-specific types for Reddit and TikTok data with strong typing
📝 Rich content formatting with markdown support
🐍 Node.js 18+ with modern ES modules and camelCase naming

Installation

Using npm

npm install postcrawl

Using yarn

yarn add postcrawl

Using pnpm

pnpm add postcrawl

Using bun

bun add postcrawl

Optional: Environment Variables

For loading API keys from .env files:

npm install dotenv
# or
bun add dotenv

Requirements

Node.js 18.0 or higher
PostCrawl API key (Get one for free)

Quick Start

Basic Usage

import { PostCrawlClient } from 'postcrawl';

// Initialize the client with your API key
const pc = new PostCrawlClient({
  apiKey: 'sk_your_api_key_here'
});

// Search for content
const results = await pc.search({
  socialPlatforms: ['reddit'],
  query: 'machine learning',
  results: 10,
  page: 1
});

// Process results
for (const post of results) {
  console.log(`${post.title} - ${post.url}`);
  console.log(`  Date: ${post.date}`);
  console.log(`  Snippet: ${post.snippet.substring(0, 100)}...`);
}

Extract Content

// Extract content from URLs
const posts = await pc.extract({
  urls: [
    'https://reddit.com/r/...',
    'https://tiktok.com/@...'
  ],
  includeComments: true,
  responseMode: 'raw',
  commentFilterConfig: {
    min_score: 10,
    max_depth: 2
  }
});

// Process extracted posts
for (const post of posts) {
  if (post.error) {
    console.error(`Failed to extract ${post.url}: ${post.error}`);
  } else {
    console.log(`Extracted from ${post.source}: ${post.url}`);
  }
}

Search and Extract

const posts = await pc.searchAndExtract({
  socialPlatforms: ['reddit'],
  query: 'search query',
  results: 5,
  page: 1,
  includeComments: true,
  responseMode: 'markdown',
  commentFilterConfig: {
    tier_limits: { "0": 5, "1": 3 },
    preserve_high_quality_threads: true
  }
});

API Reference

Client Initialization

const pc = new PostCrawlClient({
  apiKey: string,           // Required: Your PostCrawl API key (starts with 'sk_')
  timeout?: number,         // Optional: Request timeout in ms (default: 0 = no timeout)
  maxRetries?: number,      // Optional: Max retry attempts (default: 3)
  retryDelay?: number,      // Optional: Delay between retries in ms (default: 1000)
})

Search

const results = await pc.search({
  socialPlatforms: ['reddit', 'tiktok'],
  query: 'your search query',
  results: 10,  // 1-100
  page: 1       // pagination
});

Extract

const posts = await pc.extract({
  urls: ['https://reddit.com/...', 'https://tiktok.com/...'],
  includeComments: true,
  responseMode: 'raw'  // or 'markdown'
});

Search and Extract

const posts = await pc.searchAndExtract({
  socialPlatforms: ['reddit'],
  query: 'search query',
  results: 5,
  page: 1,
  includeComments: true,
  responseMode: 'markdown',
  commentFilterConfig: { ... }
});

Comment Filtering

The commentFilterConfig object allows you to filter comments server-side to reduce data transfer and improve performance:

const posts = await pc.extract({
  urls: ['...'],
  includeComments: true,
  commentFilterConfig: {
    // Limit comments by depth level
    tier_limits: {
      "0": 10, // Max 10 top-level comments
      "1": 5,  // Max 5 replies per comment
      "2": 2   // Max 2 nested replies
    },
    
    // Minimum score/likes threshold
    min_score: 10,
    
    // Minimum quality relative to top comment (0.0-1.0)
    top_comment_percentile: 0.1,
    
    // Maximum depth to traverse
    max_depth: 5,
    
    // Preserve more replies for high-quality threads
    preserve_high_quality_threads: true,
    high_quality_thread_score: 100
  }
});

Examples

Check out the examples/ directory for complete working examples:

search_101.ts - Basic search functionality demo
extract_101.ts - Content extraction demo with Reddit and TikTok
search_and_extract_101.ts - Combined operation demo

Run examples with:

# Using bun (recommended)
bun run examples

# Or run individual examples
bun run example:search
bun run example:extract
bun run example:sne

# Using Node.js with tsx
npx tsx examples/search_101.ts

Response Models

SearchResult

Response from the search endpoint:

title: Title of the search result
url: URL of the search result
snippet: Text snippet from the content
date: Date of the post (e.g., "Dec 28, 2024")
imageUrl: URL of associated image (can be empty string)

ExtractedPost

url: Original URL
source: Platform name ("reddit" or "tiktok")
raw: Raw content data (RedditPost or TiktokPost object) - strongly typed
markdown: Markdown formatted content (when responseMode="markdown")
error: Error message if extraction failed

Working with Platform-Specific Types

The SDK provides type-safe access to platform-specific data:

import { PostCrawlClient, isRedditPost, isTiktokPost } from 'postcrawl';

// Extract content with proper type handling
const posts = await pc.extract({
  urls: ['https://reddit.com/...']
});

for (const post of posts) {
  if (post.error) {
    console.error(`Error: ${post.error}`);
  } else if (isRedditPost(post.raw)) {
    // Access Reddit-specific fields with camelCase
    console.log(`Subreddit: r/${post.raw.subredditName}`);
    console.log(`Score: ${post.raw.score}`);
    console.log(`Title: ${post.raw.title}`);
    console.log(`Upvotes: ${post.raw.upvotes}`);
    console.log(`Created: ${post.raw.createdAt}`);
    if (post.raw.comments) {
      console.log(`Comments: ${post.raw.comments.length}`);
    }
  } else if (isTiktokPost(post.raw)) {
    // Access TikTok-specific fields with camelCase
    console.log(`Username: @${post.raw.username}`);
    console.log(`Likes: ${post.raw.likes}`);
    console.log(`Total Comments: ${post.raw.totalComments}`);
    console.log(`Created: ${post.raw.createdAt}`);
    if (post.raw.hashtags) {
      console.log(`Hashtags: ${post.raw.hashtags.join(', ')}`);
    }
  }
}

Error Handling

import {
  AuthenticationError,      // Invalid API key
  InsufficientCreditsError, // Not enough credits
  RateLimitError,          // Rate limit exceeded
  ValidationError          // Invalid parameters
} from 'postcrawl';

try {
  const results = await pc.search({ ... });
} catch (error) {
  if (error instanceof AuthenticationError) {
    console.error('Invalid API key');
  } else if (error instanceof InsufficientCreditsError) {
    console.error('Insufficient credits:', error.requiredCredits);
  } else if (error instanceof RateLimitError) {
    console.error(`Rate limited. Retry after ${error.retryAfter}s`);
  } else if (error instanceof ValidationError) {
    console.error('Validation errors:', error.details);
  }
}

Development

This project uses Bun as the primary runtime and package manager. See DEVELOPMENT.md for detailed setup and contribution guidelines.

Quick Development Setup

# Clone the repository
git clone https://github.com/post-crawl/node-sdk.git
cd node-sdk

# Install dependencies
bun install

# Run tests
make test

# Run all checks (format, lint, test)
make check

# Build the package
make build

Available Commands

make help         # Show all available commands
make format       # Format code with biome
make lint         # Run linting and type checking
make test         # Run test suite
make check        # Run format, lint, and tests
make build        # Build distribution packages
make examples     # Run all examples

API Key Management

Environment Variables (Recommended)

Store your API key securely in environment variables:

export POSTCRAWL_API_KEY="sk_your_api_key_here"

Or use a .env file:

# .env
POSTCRAWL_API_KEY=sk_your_api_key_here

Then load it in your code:

import { config } from 'dotenv';
import { PostCrawlClient } from 'postcrawl';

config();
const pc = new PostCrawlClient({
  apiKey: process.env.POSTCRAWL_API_KEY!
});

Security Best Practices

Never hardcode API keys in your source code
Add .env to .gitignore to prevent accidental commits
Use environment variables in production
Rotate keys regularly through the PostCrawl dashboard
Set key permissions to limit access to specific operations

Rate Limits & Credits

PostCrawl uses a credit-based system:

Search: ~1 credit per 10 results
Extract: ~1 credit per URL (without comments)
Extract with comments: ~3 credits per URL

Rate limits are tracked automatically:

const pc = new PostCrawlClient({ apiKey: 'sk_...' });
const results = await pc.search({ ... });

console.log(`Rate limit: ${pc.rateLimitInfo.limit}`);
console.log(`Remaining: ${pc.rateLimitInfo.remaining}`);
console.log(`Reset at: ${pc.rateLimitInfo.reset}`);

Support

Documentation: github.com/post-crawl/node-sdk
Issues: github.com/post-crawl/node-sdk/issues
Email: [email protected]

License

MIT License - see LICENSE file for details.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

PostCrawl Node.js SDK

Features

Installation

Using npm

Using yarn

Using pnpm

Using bun

Optional: Environment Variables

Requirements

Quick Start

Basic Usage

Extract Content

Search and Extract

API Reference

Client Initialization

Search

Extract

Search and Extract

Comment Filtering

Examples

Response Models

SearchResult

ExtractedPost

Working with Platform-Specific Types

Error Handling

Development

Quick Development Setup

Available Commands

API Key Management

Environment Variables (Recommended)

Security Best Practices

Rate Limits & Credits

Support

License