

Eliza Scraper

A powerful scraping and RAG (Retrieval Augmented Generation) toolkit for collecting, processing, and serving blockchain ecosystem data. Built with Bun, TypeScript, and modern vector search capabilities.

Features

  • 🤖 Automated Twitter data scraping with engagement filtering
  • 📊 Real-time token price and market data tracking via CoinGecko
  • 📑 Documentation site scraping with caching
  • 🔄 Configurable cron jobs for automated data collection
  • 🔍 Vector search with time-decay relevance scoring
  • 🚀 REST API with Swagger documentation

Architecture

RAG Pipeline for Tweet Processing

graph TD
    A[Twitter] --> B[Apify Scraper]
    B --> C[Filters]
    C --> D[Text Processing]
    D --> E[OpenAI Embedding]
    E --> F[Qdrant Vector DB]
    F --> G[Search API]
    
    subgraph "filters"
    C --> H[Min Replies]
    C --> I[Min Retweets]
    C --> J[Search Tags]
    C --> K[Kaito Most Mindshare handles]
    end
    
    subgraph "Vector Search"
    F --> L[Semantic Search]
    end 
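
The combinedScore returned by the search API blends raw vector similarity with how recently a tweet was posted. Below is a minimal TypeScript sketch of one way such time-decay scoring can work, assuming an exponential half-life; decayedScore and halfLifeDays are illustrative names, not part of this package's API:

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Weight a vector-similarity score by an exponential time decay,
// so older tweets rank lower even when semantically similar.
function decayedScore(
  originalScore: number, // cosine similarity from the vector DB
  createdAt: string,     // tweet timestamp, ISO 8601
  halfLifeDays = 7,      // assumed: score halves every 7 days
): number {
  const ageDays = (Date.now() - new Date(createdAt).getTime()) / MS_PER_DAY;
  return originalScore * Math.pow(0.5, ageDays / halfLifeDays);
}

// A 14-day-old tweet with similarity 0.9 decays to 0.9 * 0.25 = 0.225.
console.log(decayedScore(0.9, new Date(Date.now() - 14 * MS_PER_DAY).toISOString()));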

System Architecture

graph TB
    CLI[CLI Interface] --> Services
    API[REST API] --> Services
    
    subgraph "Services"
    Tweets[Tweet Service]
    Tokens[Token Service]
    Docs[Site Service]
    end
    
    subgraph "Storage"
    Qdrant[(Qdrant Vector DB)]
    Postgres[(PostgreSQL)]
    end
    
    Services --> Storage

Scraper Quick Start

  1. Install globally:

npm install -g eliza-scraper

  2. Create a .env file:

# Required APIs
APIFY_API_KEY=your_apify_key
COINGECKO_API_KEY=your_coingecko_key
OPENAI_API_KEY=your_openai_key

# Database
DATABASE_URL=postgresql://user:password@localhost:5432/token_scraper

# Vector DB
QDRANT_URL=http://localhost:6333
QDRANT_COLLECTION_NAME=twitter_embeddings

# Optional
TWITTER_SEARCH_TAGS=berachain,bera
APP_TOPICS=berachain launch,berachain token

  3. Start the required services (Qdrant and PostgreSQL):

docker-compose up -d
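
A minimal docker-compose.yml sketch that matches the .env defaults above (stock Qdrant and Postgres images; adjust the credentials to your DATABASE_URL):

services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: token_scraper
    ports:
      - "5432:5432"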

CLI Commands

Start API Server

elizascraper server --port <number>

Tweet Scraping Service

elizascraper tweets-cron [options]

Options:
--port         Port number (default: 3000)
--tags         Comma-separated search tags; Twitter advanced-search strings also work (see https://github.com/igorbrigadir/twitter-advanced-search)
--start        Start date (YYYY-MM-DD)
--end          End date (YYYY-MM-DD)
--min-replies  Minimum replies threshold (default: 4)
--retweets     Minimum retweets threshold (default: 2)
--max-items    Maximum tweets to fetch (default: 200)
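
For example, a week-long scrape of the default tags with the default thresholds (the dates are illustrative):

elizascraper tweets-cron --tags berachain,bera --start 2024-01-01 --end 2024-01-08 --min-replies 4 --retweets 2 --max-items 200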

Disclaimer: don't run the cron job over too narrow a window. Runs with start and end dates less than 3 days apart that fetch fewer than 50 tweets will quickly get your Apify key banned.

To help prevent this, the same search query run multiple times within an hour is skipped.

Tracking Latest Token Data

elizascraper token-cron [options]

Options:
-p, --port <number> Port to listen on (default: 3008)
-i, --interval <number> Interval in minutes (default: 60)
--currency <string> Currency to track (default: 'usd')
--cg_category <string> CoinGecko category (default: 'berachain-ecosystem')
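
For example, refreshing berachain-ecosystem prices in USD every hour (the documented defaults, spelled out):

elizascraper token-cron -p 3008 -i 60 --currency usd --cg_category berachain-ecosystem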

Documentation Scraping Service

elizascraper site-cron [options]

Options:
-p, --port <number> Port to listen on (default: 3010)
-i, --interval <number> Interval in minutes (default: 60)
-l, --link <string> URL to scrape (default: 'https://docs.berachain.com')
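
For example, re-scraping the Berachain docs every hour:

elizascraper site-cron -p 3010 -i 60 -l https://docs.berachain.com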

API Documentation

Tweets API

GET /tweets/search

Search for tweets based on semantic similarity.

Query Parameters:

  • searchString (required): Text to search for semantically similar tweets

Response Schema:

{
  status: "success" | "error",
  results: {
    tweet: {
      id: string,
      text: string,
      fullText: string,
      author: {
        name: string,
        userName: string,
        profilePicture: string
      },
      createdAt: string,
      retweetCount: number,
      replyCount: number,
      likeCount: number
    },
    combinedScore: number,
    originalScore: number,
    date: string
  }[]
}
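
A minimal TypeScript client sketch; the base URL assumes the API server was started locally with elizascraper server --port 3000 and is not fixed by the package:

// Query /tweets/search and print the highest-scoring matches.
const params = new URLSearchParams({ searchString: "berachain token launch" });
const res = await fetch(`http://localhost:3000/tweets/search?${params}`);
const body = await res.json();

if (body.status === "success") {
  for (const { tweet, combinedScore } of body.results) {
    console.log(combinedScore.toFixed(3), tweet.author.userName, tweet.text);
  }
}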

GET /tweets/random

Get random tweets for the predefined topics (see APP_TOPICS in .env).

Response Schema:

{
  topic: string,
  tweet: {
    tweet: {
      id: string,
      text: string,
      fullText: string,
      author: {
        name: string,
        userName: string,
        profilePicture: string
      },
      createdAt: string,
      retweetCount: number,
      replyCount: number,
      likeCount: number
    },
    combinedScore: number,
    originalScore: number,
    date: string
  }[]
}
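
The scored-tweet entries have the same shape as /tweets/search results, so the same client pattern applies (same local base URL assumption):

const random = await (await fetch("http://localhost:3000/tweets/random")).json();
console.log(random.topic, random.tweet.length);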

Tokens API

GET /tokens

Get token market data.

Query Parameters:

  • symbol (optional): Token symbol to filter results

Response Schema:

{
  status: "success" | "error",
  results: {
    id: string,
    symbol: string,
    name: string,
    image: string,
    currentPrice: number,
    marketCap: number,
    marketCapRank: number,
    fullyDilutedValuation: number,
    totalVolume: number,
    high24h: number,
    low24h: number,
    priceChange24h: number,
    priceChangePercentage24h: number,
    marketCapChange24h: number,
    marketCapChangePercentage24h: number,
    circulatingSupply: number,
    totalSupply: number,
    maxSupply: number,
    ath: number,
    athChangePercentage: number,
    athDate: string,
    atl: number,
    atlChangePercentage: number,
    atlDate: string,
    lastUpdated: string
  }
}
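
For example, fetching market data for one symbol (same local base URL assumption; note that per the schema above, results is a single object):

const res = await fetch("http://localhost:3000/tokens?symbol=bera");
const body = await res.json();
if (body.status === "success") {
  console.log(body.results.name, body.results.currentPrice);
}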

Documentation API

GET /docs/scrape

Scrape and cache documentation from a given URL.

Query Parameters:

  • url (required): URL of the documentation site to scrape

Response Schema:

{
  status: "success" | "error",
  data: {
    title: string,
    last_updated: string,
    total_sections: number,
    sections: {
      topic: string,
      source_url: string,
      overview: string,
      subsections: {
        title: string,
        content: string
      }[]
    }[]
  },
  source: "cache" | "fresh"
} | {
  status: "error",
  message: string
}

Cache Behavior:

  • Documentation is cached for 7 days
  • Returns cached version if available and not expired
  • Automatically refreshes cache if data is older than 7 days
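
A quick way to observe the cache behavior from TypeScript (local base URL assumed; url is the documented query parameter):

const qs = new URLSearchParams({ url: "https://docs.berachain.com" });
const res = await fetch(`http://localhost:3000/docs/scrape?${qs}`);
const body = await res.json();
if (body.status === "success") {
  // "fresh" on the first call, "cache" for the next 7 days
  console.log(body.source, body.data.total_sections);
}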

Error Handling

All routes use a standardized error response format:

{
  status: "error",
  code: string,
  message: string
}
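
Because every route shares this shape, one TypeScript type guard can gate all of them (an illustrative helper, not part of the package):

interface ApiError {
  status: "error";
  code: string;
  message: string;
}

function isApiError(body: unknown): body is ApiError {
  return (
    typeof body === "object" &&
    body !== null &&
    (body as { status?: unknown }).status === "error"
  );
}

// Usage: bail out before touching success-only fields.
// if (isApiError(body)) throw new Error(`${body.code}: ${body.message}`);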

Contributing

Contributions are welcome! Please read our contributing guidelines before submitting pull requests.

License

MIT