@podx/scraper

v2.0.2

πŸ” Advanced Twitter/X scraping, bot detection, and crypto analysis toolkit for PODx

The scraper package provides comprehensive Twitter/X data collection, analysis, and signal generation capabilities for the PODx ecosystem. It includes advanced scraping algorithms, bot detection, sentiment analysis, cryptocurrency token extraction, and trading signal generation.

πŸ“¦ Installation

# Install from workspace
bun add @podx/scraper@workspace:*

# Or install from npm (when published)
bun add @podx/scraper

πŸ—οΈ Architecture

The scraper package is organized into several specialized modules:

packages/scraper/src/
β”œβ”€β”€ scrapers/         # Core scraping functionality
β”‚   β”œβ”€β”€ baseScraper.ts    # Base scraper with authentication
β”‚   β”œβ”€β”€ searchScraper.ts  # Search-based tweet scraping
β”‚   β”œβ”€β”€ commentScraper.ts # Comment/reply scraping
β”‚   └── index.ts          # Scraper exports
β”œβ”€β”€ services/         # Service layer
β”‚   └── index.ts          # Main ScraperService
β”œβ”€β”€ auth/             # Authentication handling
β”œβ”€β”€ analyzers/        # Data analysis modules
β”‚   β”œβ”€β”€ BotDetector.ts    # Bot detection algorithms
β”‚   β”œβ”€β”€ SentimentAnalyzer.ts # Sentiment analysis
β”‚   β”œβ”€β”€ SignalGenerator.ts # Trading signal generation
β”‚   β”œβ”€β”€ TokenExtractor.ts # Cryptocurrency token extraction
β”‚   └── index.ts          # Analyzer exports
β”œβ”€β”€ crypto/           # Cryptocurrency analysis
β”œβ”€β”€ signals/          # Signal processing and generation
β”œβ”€β”€ types/            # TypeScript type definitions
└── index.ts          # Main exports

πŸš€ Quick Start

import { ScraperService, SentimentAnalyzer, TokenExtractor } from '@podx/scraper';

// Initialize scraper service
const scraper = new ScraperService();

// Scrape tweets from a user
const tweets = await scraper.scrapeAccount({
  targetUsername: 'cryptowhale',
  maxTweets: 100,
  progressCallback: (progress) => {
    console.log(`Scraped ${progress.count}/${progress.max} tweets`);
  }
});

// Analyze sentiment
const analyzer = new SentimentAnalyzer();
const sentiment = await analyzer.analyze(tweets);

// Extract cryptocurrency tokens
const extractor = new TokenExtractor();
const tokens = await extractor.extract(tweets);

// Save results
const result = await scraper.saveTweetsToFile(tweets, 'cryptowhale');
console.log(`Saved ${tweets.length} tweets to ${result.filename}`);

πŸ” Authentication

The scraper supports multiple authentication methods for Twitter/X API access:

Environment Variables Setup

# Required credentials
export XSERVE_USERNAME="your_twitter_username"
export XSERVE_PASSWORD="your_twitter_password"
export XSERVE_EMAIL="your_email@example.com"  # Optional, for account recovery

Authentication Flow

import { ScraperService } from '@podx/scraper';

const scraper = new ScraperService();

// Authentication happens automatically on first API call
try {
  const tweets = await scraper.scrapeAccount({
    targetUsername: 'example',
    maxTweets: 10
  });

  console.log('Authentication successful!');
} catch (error) {
  if (error.code === 'AUTHENTICATION_FAILED') {
    console.error('Please check your Twitter credentials');
  }
}
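Because authentication only surfaces at the first API call, it can help to validate the environment up front and fail fast with a clear message. A minimal sketch (the `missingCredentials` helper is illustrative, not part of the package API):

```typescript
// Hypothetical helper: report which required XSERVE_* credentials are unset
// before constructing a ScraperService.
function missingCredentials(env: Record<string, string | undefined>): string[] {
  const required = ['XSERVE_USERNAME', 'XSERVE_PASSWORD'];
  return required.filter((key) => !env[key] || env[key]!.trim() === '');
}

// Fail fast instead of hitting AUTHENTICATION_FAILED mid-scrape.
const unset = missingCredentials(process.env);
if (unset.length > 0) {
  console.error(`Missing credentials: ${unset.join(', ')}`);
}
```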

πŸ“Š Core Scraping Features

Account Scraping

import { ScraperService } from '@podx/scraper';

const scraper = new ScraperService();

// Scrape tweets from a specific account
const tweets = await scraper.scrapeAccount({
  targetUsername: 'VitalikButerin',
  maxTweets: 500,
  progressCallback: (progress) => {
    const percent = Math.round((progress.count / progress.max) * 100);
    console.log(`Progress: ${percent}% (${progress.count}/${progress.max})`);
  }
});

// Process scraped tweets
tweets.forEach(tweet => {
  console.log(`@${tweet.username}: ${tweet.text}`);
  console.log(`Likes: ${tweet.likes}, Retweets: ${tweet.retweets}`);
});

Search-Based Scraping

import { SearchScraper } from '@podx/scraper/scrapers';

// Search for tweets with specific criteria
const searchScraper = new SearchScraper();

const tweets = await searchScraper.search({
  query: 'bitcoin OR ethereum',
  maxTweets: 1000,
  filters: {
    language: 'en',
    dateRange: {
      from: new Date('2024-01-01'),
      to: new Date('2024-01-31')
    },
    minLikes: 10,
    minRetweets: 5
  }
});

Comment/Reply Scraping

import { CommentScraper } from '@podx/scraper/scrapers';

const commentScraper = new CommentScraper();

// Scrape replies to a specific tweet
const replies = await commentScraper.scrapeComments({
  tweetId: '1234567890123456789',
  maxReplies: 200,
  includeNested: true  // Include replies to replies
});

// Analyze conversation threads
const threads = commentScraper.buildConversationThreads(replies);
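The thread-building step can be pictured as grouping each reply under its parent to form nested conversations. A self-contained sketch of that idea (the `buildThreads` helper and its types are illustrative, not the package's actual implementation):

```typescript
interface Reply { id: string; parentId: string | null; text: string; }
interface ThreadNode extends Reply { children: ThreadNode[]; }

// Group flat replies into nested threads: replies with no (known) parent
// become roots; everything else is attached under its parent.
function buildThreads(replies: Reply[]): ThreadNode[] {
  const nodes = new Map<string, ThreadNode>();
  for (const r of replies) nodes.set(r.id, { ...r, children: [] });

  const roots: ThreadNode[] = [];
  for (const node of nodes.values()) {
    const parent = node.parentId ? nodes.get(node.parentId) : undefined;
    if (parent) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}
```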

🧠 Advanced Analysis

Bot Detection

import { BotDetector } from '@podx/scraper/analyzers';

const detector = new BotDetector();

// Analyze tweets for bot-like behavior
const analysis = await detector.analyze(tweets);

analysis.results.forEach(result => {
  console.log(`@${result.username}: ${result.botProbability}% bot probability`);
  console.log(`Reasons: ${result.reasons.join(', ')}`);
});

// Filter out likely bots
const humanTweets = analysis.results
  .filter(result => result.botProbability < 30)
  .map(result => result.tweet);
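The probability itself comes from heuristics over account behavior. As a rough, purely illustrative sketch of the kind of scoring involved (the features, thresholds, and weights here are invented, not the ones BotDetector actually uses):

```typescript
interface AccountStats {
  tweetsPerDay: number;
  followers: number;
  following: number;
  accountAgeDays: number;
}

// Invented heuristic: accumulate weighted penalties for bot-like traits
// and clamp the result to a 0-100 probability.
function botProbability(s: AccountStats): number {
  let score = 0;
  if (s.tweetsPerDay > 50) score += 40;                                 // hyperactive posting
  if (s.following > 0 && s.followers / s.following < 0.1) score += 30;  // follow-spam ratio
  if (s.accountAgeDays < 30) score += 30;                               // very new account
  return Math.min(score, 100);
}
```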

Sentiment Analysis

import { SentimentAnalyzer } from '@podx/scraper/analyzers';

const sentimentAnalyzer = new SentimentAnalyzer();

// Analyze sentiment of tweets
const sentimentResults = await sentimentAnalyzer.analyze(tweets);

sentimentResults.forEach(result => {
  console.log(`Tweet: ${result.text}`);
  console.log(`Sentiment: ${result.sentiment} (${result.confidence}%)`);
  console.log(`Emotions: ${result.emotions.join(', ')}`);
});

// Aggregate sentiment over time
const timeSeries = sentimentAnalyzer.createSentimentTimeSeries(sentimentResults);
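Bucketing per-tweet scores by hour is one plausible way such a time series gets built. A self-contained sketch under assumed shapes (`createSentimentTimeSeries`'s real output format may differ):

```typescript
interface ScoredTweet { createdAt: Date; score: number; }  // score in [-1, 1]

// Average sentiment per hour, keyed by an ISO hour prefix like "2024-01-15T09".
function hourlyAverageSentiment(items: ScoredTweet[]): Map<string, number> {
  const buckets = new Map<string, { sum: number; n: number }>();
  for (const it of items) {
    const key = it.createdAt.toISOString().slice(0, 13);
    const b = buckets.get(key) ?? { sum: 0, n: 0 };
    b.sum += it.score;
    b.n += 1;
    buckets.set(key, b);
  }
  return new Map([...buckets].map(([k, b]) => [k, b.sum / b.n]));
}
```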

Cryptocurrency Token Extraction

import { TokenExtractor } from '@podx/scraper/analyzers';

const tokenExtractor = new TokenExtractor();

// Extract cryptocurrency mentions and addresses
const tokenResults = await tokenExtractor.extract(tweets);

tokenResults.forEach(result => {
  console.log(`Found ${result.tokens.length} tokens in tweet`);
  result.tokens.forEach(token => {
    console.log(`- ${token.symbol}: ${token.address} (${token.blockchain})`);
    console.log(`  Context: ${token.context}`);
    console.log(`  Confidence: ${token.confidence}%`);
  });
});

// Get trending tokens
const trending = tokenExtractor.getTrendingTokens(tokenResults, {
  timeframe: '24h',
  minMentions: 5
});
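At its core, token extraction boils down to pattern-matching cashtags and on-chain addresses in tweet text. A simplified, self-contained sketch (the shipped TokenExtractor presumably layers context and confidence scoring on top of this):

```typescript
// Match $TICKER-style cashtags (2-10 uppercase letters) and EVM-style
// 40-hex-character addresses in raw tweet text.
function extractMentions(text: string): { symbols: string[]; addresses: string[] } {
  const symbols = [...text.matchAll(/\$([A-Z]{2,10})\b/g)].map((m) => m[1]);
  const addresses = [...text.matchAll(/\b0x[a-fA-F0-9]{40}\b/g)].map((m) => m[0]);
  return { symbols, addresses };
}
```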

πŸ“ˆ Signal Generation

Trading Signals

import { SignalGenerator } from '@podx/scraper/analyzers';

const signalGenerator = new SignalGenerator();

// Generate trading signals from tweet analysis
const signals = await signalGenerator.generateSignals({
  tweets,
  sentimentResults,
  tokenResults,
  marketData: {
    btcPrice: 45000,
    ethPrice: 2500
  }
});

signals.forEach(signal => {
  console.log(`Signal: ${signal.type} for ${signal.token}`);
  console.log(`Strength: ${signal.strength}/10`);
  console.log(`Reason: ${signal.reason}`);
  console.log(`Confidence: ${signal.confidence}%`);
  console.log(`Timeframe: ${signal.timeframe}`);
});

// Filter high-confidence signals
const strongSignals = signals.filter(s => s.confidence > 80 && s.strength >= 7);

Market Sentiment Signals

// Generate market sentiment signals
const marketSignals = await signalGenerator.generateMarketSignals({
  tweets,
  sentimentData: sentimentResults,
  tokenData: tokenResults,
  marketContext: {
    overallSentiment: 'bullish',
    fearGreedIndex: 75,
    volume24h: 1250000000
  }
});

marketSignals.forEach(signal => {
  console.log(`Market Signal: ${signal.type}`);
  console.log(`Direction: ${signal.direction}`);
  console.log(`Strength: ${signal.strength}`);
  console.log(`Timeframe: ${signal.timeframe}`);
  console.log(`Rationale: ${signal.rationale}`);
});

πŸ”§ Advanced Configuration

Custom Scraping Options

import { ScraperService } from '@podx/scraper';

// Advanced scraping with custom options
const scraper = new ScraperService();

const tweets = await scraper.scrapeAccount({
  targetUsername: 'crypto_influencer',
  maxTweets: 1000,
  filters: {
    minLikes: 10,
    minRetweets: 5,
    dateRange: {
      from: new Date('2024-01-01'),
      to: new Date('2024-01-31')
    },
    language: 'en',
    excludeReplies: false,
    excludeRetweets: true
  },
  rateLimit: {
    requestsPerMinute: 30,
    delayBetweenRequests: 2000
  },
  retryPolicy: {
    maxRetries: 3,
    backoffMultiplier: 2,
    initialDelay: 1000
  }
});
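The retryPolicy above implies an exponential backoff schedule: each attempt waits initialDelay Γ— backoffMultiplier^(attempt βˆ’ 1). A sketch of that arithmetic (the helper name is illustrative):

```typescript
// Delay before a given retry attempt (1-based) under exponential backoff.
function retryDelay(attempt: number, initialDelay: number, backoffMultiplier: number): number {
  return initialDelay * Math.pow(backoffMultiplier, attempt - 1);
}

// With initialDelay 1000 and multiplier 2:
// attempt 1 β†’ 1000ms, attempt 2 β†’ 2000ms, attempt 3 β†’ 4000ms
```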

Custom Analysis Pipeline

import { 
  SentimentAnalyzer, 
  TokenExtractor, 
  BotDetector,
  SignalGenerator 
} from '@podx/scraper/analyzers';

// Create custom analysis pipeline
class CryptoAnalysisPipeline {
  constructor(
    private sentimentAnalyzer = new SentimentAnalyzer(),
    private tokenExtractor = new TokenExtractor(),
    private botDetector = new BotDetector(),
    private signalGenerator = new SignalGenerator()
  ) {}

  async analyze(tweets: Tweet[]): Promise<AnalysisResult> {
    // Step 1: Filter out bots
    const botAnalysis = await this.botDetector.analyze(tweets);
    const humanTweets = botAnalysis.results
      .filter(r => r.botProbability < 50)
      .map(r => r.tweet);

    // Step 2: Analyze sentiment
    const sentiment = await this.sentimentAnalyzer.analyze(humanTweets);

    // Step 3: Extract tokens
    const tokens = await this.tokenExtractor.extract(humanTweets);

    // Step 4: Generate signals
    const signals = await this.signalGenerator.generateSignals({
      tweets: humanTweets,
      sentimentResults: sentiment,
      tokenResults: tokens
    });

    return {
      originalTweetCount: tweets.length,
      humanTweets: humanTweets.length,
      sentiment,
      tokens,
      signals,
      analysisTimestamp: new Date()
    };
  }
}

// Use the pipeline
const pipeline = new CryptoAnalysisPipeline();
const result = await pipeline.analyze(tweets);

πŸ’Ύ Data Storage and Export

File Storage

import { ScraperService } from '@podx/scraper';

const scraper = new ScraperService();

// Scrape and save to file automatically
const result = await scraper.scrapeAndSave({
  targetUsername: 'cryptopunk',
  maxTweets: 200
});

console.log(`Saved ${result.tweets.length} tweets to ${result.filename}`);

// Custom file naming and organization
const customResult = await scraper.saveTweetsToFile(
  tweets, 
  'custom_username',
  {
    format: 'json',
    compress: true,
    includeMetadata: true,
    splitByDate: true
  }
);

Database Integration

import { DatabaseService } from '@podx/core';
import { ScraperService, SentimentAnalyzer, TokenExtractor } from '@podx/scraper';

const db = new DatabaseService(config.database);
const scraper = new ScraperService();
const analyzer = new SentimentAnalyzer();
const tokenExtractor = new TokenExtractor();

// Scrape and store in database
const tweets = await scraper.scrapeAccount({
  targetUsername: 'defi_pulse',
  maxTweets: 100
});

// Store with analysis
for (const tweet of tweets) {
  const analysis = await analyzer.analyze([tweet]);
  const tokens = await tokenExtractor.extract([tweet]);

  await db.save('analyzed_tweets', {
    ...tweet,
    sentiment: analysis[0]?.sentiment,
    tokens: tokens[0]?.tokens || [],
    analyzedAt: new Date()
  });
}

πŸ“Š Analytics and Reporting

Generate Reports

import { AnalyticsEngine } from '@podx/scraper/analytics';

const analytics = new AnalyticsEngine();

// Generate comprehensive report
const report = await analytics.generateReport({
  tweets,
  sentimentResults,
  tokenResults,
  signals,
  timeframe: {
    from: new Date('2024-01-01'),
    to: new Date('2024-01-31')
  }
});

// Export report
await analytics.exportReport(report, {
  format: 'pdf',
  includeCharts: true,
  includeRawData: false
});

// Get insights
const insights = analytics.extractInsights(report);
console.log('Key Insights:');
insights.forEach(insight => {
  console.log(`- ${insight.category}: ${insight.description}`);
});

Real-time Monitoring

import { RealtimeMonitor } from '@podx/scraper/monitoring';

const monitor = new RealtimeMonitor({
  targetUsernames: ['cryptowhale', 'defi_pulse'],
  keywords: ['bitcoin', 'ethereum', 'defi'],
  updateInterval: 30000  // 30 seconds
});

// Monitor with callbacks
monitor.onTweet((tweet) => {
  console.log(`New tweet from @${tweet.username}: ${tweet.text}`);
});

monitor.onSignal((signal) => {
  console.log(`New signal: ${signal.type} for ${signal.token}`);
  // Send notification, update dashboard, etc.
});

// Start monitoring
await monitor.start();

πŸ”§ API Reference

ScraperService

scrapeAccount(options: ScrapingOptions): Promise<Tweet[]>

Scrapes tweets from a specific Twitter account.

Parameters:

  • targetUsername: string - Twitter username to scrape
  • maxTweets: number - Maximum number of tweets to scrape
  • progressCallback?: (progress: ScrapingProgress) => void - Progress callback

scrapeAndSave(options: ScrapingOptions): Promise<{ tweets: Tweet[]; filename: string }>

Scrapes tweets and saves them to file.

saveTweetsToFile(tweets: Tweet[], username: string, options?: SaveOptions): Promise<{ filename: string }>

Saves tweets to a JSON file and resolves with the filename it was written to.

Analyzers

SentimentAnalyzer.analyze(tweets: Tweet[]): Promise<SentimentResult[]>

Analyzes sentiment of tweets.

TokenExtractor.extract(tweets: Tweet[]): Promise<TokenResult[]>

Extracts cryptocurrency tokens from tweets.

BotDetector.analyze(tweets: Tweet[]): Promise<BotAnalysis>

Detects bot-like behavior in tweets.

SignalGenerator.generateSignals(params: SignalParams): Promise<Signal[]>

Generates trading signals from tweet analysis.

πŸ“‹ Data Types

Tweet

interface Tweet {
  id: string;
  username: string;
  text: string;
  createdAt: Date;
  likes: number;
  retweets: number;
  replies: number;
  isReply: boolean;
  isRetweet: boolean;
  media?: MediaData[];
  hashtags: string[];
  mentions: string[];
  urls: string[];
}

SentimentResult

interface SentimentResult {
  tweet: Tweet;
  sentiment: 'positive' | 'negative' | 'neutral';
  confidence: number;
  emotions: string[];
  score: number;
}

TokenResult

interface TokenResult {
  tweet: Tweet;
  tokens: TokenMention[];
}

interface TokenMention {
  symbol: string;
  address?: string;
  blockchain: string;
  context: string;
  confidence: number;
}

Signal

interface Signal {
  id: string;
  type: 'buy' | 'sell' | 'hold' | 'alert';
  token: string;
  strength: number;  // 1-10
  confidence: number; // 0-100
  reason: string;
  timeframe: string;
  timestamp: Date;
  supportingTweets: Tweet[];
}

πŸ§ͺ Testing

import { describe, test, expect, mock } from 'bun:test';
import { ScraperService } from '@podx/scraper';

describe('ScraperService', () => {
  test('should scrape tweets from account', async () => {
    const scraper = new ScraperService();

    // Mock the scraper
    mock.module('agent-twitter-client', () => ({
      Scraper: class {
        async login() {}
        async getTweets() {
          return [
            {
              id: '1',
              username: 'testuser',
              text: 'Hello world!',
              createdAt: new Date(),
              likes: 10,
              retweets: 5,
              replies: 2
            }
          ];
        }
      }
    }));

    const tweets = await scraper.scrapeAccount({
      targetUsername: 'testuser',
      maxTweets: 1
    });

    expect(tweets).toHaveLength(1);
    expect(tweets[0].username).toBe('testuser');
  });

  test('should handle authentication errors', async () => {
    const scraper = new ScraperService();

    // Mock authentication failure
    mock.module('agent-twitter-client', () => ({
      Scraper: class {
        async login() {
          throw new Error('Invalid credentials');
        }
      }
    }));

    await expect(
      scraper.scrapeAccount({
        targetUsername: 'testuser',
        maxTweets: 1
      })
    ).rejects.toThrow('Invalid credentials');
  });
});

⚑ Performance Optimization

Rate Limiting

// Configure rate limiting to avoid Twitter API limits
const scraper = new ScraperService({
  rateLimit: {
    requestsPerMinute: 30,
    delayBetweenRequests: 2000
  }
});
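One plausible reading of how these two options interact: the requestsPerMinute budget and the delayBetweenRequests floor combine into a single effective gap between calls, whichever is larger. A sketch (the helper is illustrative, not part of the package):

```typescript
// Minimum gap (ms) between requests given a per-minute budget and a hard floor.
function minGapMs(requestsPerMinute: number, delayBetweenRequests: number): number {
  return Math.max(60_000 / requestsPerMinute, delayBetweenRequests);
}

// With 30 req/min and a 2000ms floor, the effective gap is 2000ms.
```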

Caching

// Cache analysis results to improve performance
const cache = new AnalysisCache();

const analyzer = new SentimentAnalyzer({
  cache: cache
});

// Results are cached automatically
const result1 = await analyzer.analyze(tweets);
const result2 = await analyzer.analyze(tweets); // Uses cache
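One way such a cache can key results is by the set of tweet ids being analyzed, so the same batch hits the cache regardless of order. A sketch (AnalysisCache's real keying strategy is not documented; this class is purely illustrative):

```typescript
// Illustrative cache: derive an order-independent key from the tweet ids
// in a batch, and memoize whatever analysis result T is stored under it.
class AnalysisCacheSketch<T> {
  private store = new Map<string, T>();

  private keyFor(ids: string[]): string {
    return ids.slice().sort().join(',');
  }

  get(ids: string[]): T | undefined {
    return this.store.get(this.keyFor(ids));
  }

  set(ids: string[], value: T): void {
    this.store.set(this.keyFor(ids), value);
  }
}
```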

Parallel Processing

// Process multiple accounts in parallel
const usernames = ['user1', 'user2', 'user3'];
const results = await Promise.allSettled(
  usernames.map(username =>
    scraper.scrapeAccount({ targetUsername: username, maxTweets: 100 })
  )
);
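Promise.allSettled fires every request at once, which can collide with rate limits when the list of accounts grows. If you need to cap simultaneous scrapes, a small self-contained limiter can help (this helper is not part of the package):

```typescript
// Run fn over items with at most `limit` in flight at once, preserving order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;  // claim the next index
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```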

πŸ”’ Security Considerations

  • Credential Protection: Never store credentials in code
  • Rate Limiting: Respect Twitter's API limits
  • Data Privacy: Handle user data responsibly
  • Error Handling: Don't expose sensitive information in errors
  • Logging: Be careful with sensitive data in logs

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

πŸ“ License

This package is licensed under the ISC License. See the LICENSE file for details.

πŸ”— Related Packages

πŸ“ž Support

For support and questions: