youtube-channel-video-scraper

v1.0.0

Published

9 months ago

A library to scrape video data from YouTube channels using Playwright.

0High
0Medium
0Low

YouTube Channel Video Scraper Library

A reusable TypeScript library for scraping latest video data and channel statistics from YouTube channels using Playwright.

Features

Scrapes the latest ~10 non-Shorts videos from specified YouTube channel URLs.
Extracts video details: Title, View Count, Upload Date (relative to absolute), Thumbnail URL, Video ID.
Extracts channel details: Name, Subscriber Count, Total Video Count, Canonical Channel ID.
Handles redirects from username URLs (e.g., /@username/videos) to canonical channel URLs (/channel/UC.../videos).
Uses Playwright for scraping, avoiding the need for official API keys.
Configurable options:
- HTTP/S proxy support.
- Customizable timeouts for page operations.
- Adjustable concurrency level for scraping multiple channels.
Returns data as a structured array of objects, including per-channel error details.

Disclaimers - Important!

YouTube Terms of Service: Scraping YouTube programmatically might violate their Terms of Service. This library is provided for educational purposes only and is not intended for sustained use against YouTube's systems. Use this library responsibly and entirely at your own risk. The authors are not liable for any consequences, such as IP blocking or account suspension.
Scraping Fragility: Web scraping relies on the target website's structure. YouTube frequently updates its layout and code. These changes will inevitably break this scraper without warning. This library requires ongoing maintenance to adapt. Do not use this in production environments.

Prerequisites

Node.js
pnpm

Usage (as a Library)

Install the package from npm:

pnpm add rebers/youtube-channel-video-scraper
# or npm install / yarn add

# Remember to install playwright browsers in your project too!
pnpm exec playwright install --with-deps chromium

Then, import and use the scrapeYouTubeChannels function:

import { scrapeYouTubeChannels, ChannelResult, ScraperOptions } from 'youtube-channel-video-scraper';

async function runScraper() {
  const channelUrls = [
    "https://www.youtube.com/@ChannelName1/videos",
    "https://www.youtube.com/channel/UCxxxxxxxxxxxxxxxxxxxxxx/videos", // Example with channel ID
    "https://www.youtube.com/@AnotherChannel/videos"
    // Add more channel URLs
  ];

  const options: ScraperOptions = {
    // Optional: Specify proxy server
    // proxyUrl: 'http://user:password@your-proxy-server:port',

    // Optional: Set a general timeout for page navigation/waits (default 60000ms)
    // timeout: 45000, 

    // Optional: Set concurrency level (how many channels to scrape in parallel, default 1)
    concurrency: 3, 
  };

  try {
    console.log(`Starting scrape for ${channelUrls.length} channels...`);
    const results: ChannelResult[] = await scrapeYouTubeChannels(channelUrls, options);
    console.log("Scraping finished.");

    results.forEach(result => {
      if (result.error) {
        // Handle channels that failed to scrape
        console.error(`Error scraping ${result.inputUrl}: ${result.error}`);
      } else {
        // Process successful results
        console.log(`Successfully scraped: ${result.name} (${result.channelId})`);
        console.log(` Subscribers: ${result.subscriberCount}, Total Videos: ${result.totalVideos}`);
        console.log(` Found ${result.videos.length} recent videos:`);
        result.videos.forEach(video => {
          console.log(`  - [${video.videoId}] ${video.title} (${video.viewCount} views, Uploaded: ${video.uploadDate})`);
        });
      }
    });


  } catch (error) {
    // Catch potential critical errors during the overall process setup
    console.error("Critical scraping error:", error);
  }
}

runScraper();

Configuration Options (`ScraperOptions`)

proxyUrl (string, optional): URL of an HTTP/S proxy server (e.g., http://user:pass@host:port).
timeout (number, optional): Maximum time in milliseconds for page navigations and waits (default: 60000).
concurrency (number, optional): Number of channels to scrape in parallel (default: 1).

Returned Data Structure (`ChannelResult[]`)

The function returns a Promise resolving to an array, where each element represents a channel processed:

interface VideoResult {
  videoId: string;         // YouTube video ID (e.g., "dQw4w9WgXcQ")
  title: string;           // Video title
  viewCount: number | null;// Parsed view count (null if parsing fails)
  uploadDate: string | null; // Estimated upload date as ISO string (null if parsing fails)
  thumbnailUrl: string;    // URL of the video thumbnail
}

interface ChannelResult {
  inputUrl: string;        // The original URL provided for this channel
  channelId: string | null;// The canonical YouTube channel ID (UC...) (null if not found/error)
  name: string | null;     // Channel name (null if not found/error)
  subscriberCount: number | null; // Parsed subscriber count (null if not found/parsing fails)
  totalVideos: number | null; // Parsed total video count (null if not found/parsing fails)
  videos: VideoResult[];   // Array of the latest non-Shorts video results
  error?: string;          // Error message if scraping failed for this specific channel
}

Error Handling

The main scrapeYouTubeChannels function should be wrapped in try...catch for setup errors.
Individual channel scraping errors (e.g., network issues, parsing failures, blocked requests) are caught internally. The corresponding ChannelResult object for that channel will have the error field populated with a message, and other fields might be null. Check the error field for each ChannelResult in the returned array.

Development

Build: Compile TypeScript to JavaScript in the dist directory:
```
pnpm build
```
Dev Mode: Run the entry point (src/index.ts) using ts-node and automatically restart on file changes using nodemon:
```
pnpm dev
```
(Note: The default src/index.ts might not perform any action when run directly unless modified for CLI testing).

Testing

Run the automated tests using Jest:

pnpm test

Tests are located in src/ alongside the code they test (e.g., index.test.ts). Ensure Playwright browsers are installed before running tests.

License

ISC

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme