@duyquangnvx/story-spider

v2.0.2

Published

6 months ago

A TypeScript library for scraping stories from various Vietnamese websites

duyquangnvx

scraper story vietnamese truyenyy truyenfull rungtruyen xalosach crawler

Story Spider

Story Spider is a TypeScript library for scraping stories from popular Vietnamese story websites such as truyenyy.vn and truyenfull.vn. Its modular adapter architecture makes it straightforward to add support for additional websites.

Features

Collect story information (title, description, author, genre, status, total chapters)
Get chapter lists with details
Retrieve chapter URLs by chapter number
Get chapter content in HTML or text format
Intelligent content cleaning with html-to-text integration
Advanced caching for reduced bandwidth and faster performance
Rate limiting to avoid overloading servers
Extensible adapter system for supporting multiple websites
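
To illustrate the caching feature listed above, here is a minimal sketch of a time-to-live (TTL) cache keyed by URL. It is an assumption-laden illustration, not the library's implementation: the library lists node-cache as a dependency, whereas this sketch uses a plain Map, and the names TtlCache, ttlMs, and now are hypothetical.

```typescript
// Hypothetical helper; not part of story-spider's API.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be tested without real waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A scraper built this way would check the cache before making a network request and store the response afterwards, so repeated requests for the same page within the TTL cost no bandwidth.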

Installation

npm install @duyquangnvx/story-spider

Environment Variables

For authenticated websites (metruyencv.com), set up environment variables:

# Metruyencv authentication
export METRUYENCV_EMAIL="your-email@example.com"
export METRUYENCV_PASSWORD="your-password"

Or create a .env file:

METRUYENCV_EMAIL=your-email@example.com
METRUYENCV_PASSWORD=your-password
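
As a sketch of how an application might verify these variables before starting a long scrape, the helper below reads both values and fails early with a clear message if either is missing. The function and interface names are hypothetical and not part of the library. Note also that Node.js does not read a .env file on its own; the source does not say whether the library loads it, so you may need a loader such as dotenv or Node's --env-file flag (available from Node 20.6).

```typescript
// Hypothetical helper; not part of story-spider's API.
interface MetruyencvCredentials {
  email: string;
  password: string;
}

function readMetruyencvCredentials(
  env: Record<string, string | undefined>,
): MetruyencvCredentials {
  const email = env.METRUYENCV_EMAIL?.trim();
  const password = env.METRUYENCV_PASSWORD;

  // Collect every missing name so the error reports all of them at once.
  const missing: string[] = [];
  if (!email) missing.push('METRUYENCV_EMAIL');
  if (!password) missing.push('METRUYENCV_PASSWORD');

  if (!email || !password) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return { email, password };
}
```

In a Node.js program this would typically be called as readMetruyencvCredentials(process.env) before constructing the spider.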

Usage

Basic Usage

import { StorySpider } from '@duyquangnvx/story-spider';

// Create story spider
const storySpider = new StorySpider({
    rateLimiterOptions: {
      minTime: 1000,
      maxConcurrent: 1,
    }
});

// Default scrapers are registered automatically:
// - TruyenfullScraper, RungtruyenScraper, XalosachScraper, MailinhwpScraper, MetruyencvScraper

// Get story information (works with any supported website)
const storyInfo = await storySpider.scrapeStoryInfo('https://rungtruyen.com/category/tien-hiep/quang-am-chi-ngoai/');
console.log(storyInfo);

// Get chapter list
const chapters = await storySpider.scrapeChapterList('https://rungtruyen.com/category/tien-hiep/quang-am-chi-ngoai/');
console.log(chapters);

// Get chapter content (here, the first chapter in the list)
if (chapters.length > 0) {
    const chapterContent = await storySpider.scrapeChapterContent(chapters[0].url);
    console.log(chapterContent);
}
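
The rateLimiterOptions names (minTime, maxConcurrent) match those of the Bottleneck library, although the source does not say which limiter is used. Assuming Bottleneck-style semantics, where minTime is the minimum number of milliseconds between the starts of consecutive requests, the sketch below estimates a lower bound on how long a crawl will take. The function name is hypothetical.

```typescript
// Hypothetical helper; not part of story-spider's API.
// Assumes minTime is the minimum gap (ms) between consecutive request starts.
function minimumCrawlMs(requestCount: number, minTime: number): number {
  if (requestCount <= 1) return 0; // a single request starts immediately
  // The last request cannot start before (requestCount - 1) gaps have elapsed.
  return (requestCount - 1) * minTime;
}
```

Under that assumption, fetching 500 chapters with minTime set to 1000 cannot finish issuing requests in less than 499,000 ms, a little over eight minutes, regardless of how fast the site responds.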

Creating a Custom Scraper

You can create your own scraper for any website by extending the StoryScraper class:

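import { StoryScraper, StoryInfo, ChapterInfo } from '@duyquangnvx/story-spider';

export class CustomScraper extends StoryScraper {
  // Short, unique identifier for the site (placeholder value)
  getSiteIdentifier(): string {
    return 'custom-site';
  }

  // Domains this scraper is responsible for (placeholder value)
  getSupportedDomains(): string[] {
    return ['example.com'];
  }

  // Return true if this scraper can handle the given URL
  canHandle(url: string): boolean {
    return this.getSupportedDomains().some((domain) => url.includes(domain));
  }

  async scrapeStoryInfo(storyUrl: string): Promise<StoryInfo> {
    // Fetch and parse the story page, then return its metadata
    throw new Error('Not implemented');
  }

  async scrapeChapterList(storyUrl: string): Promise<ChapterInfo[]> {
    // Fetch and parse the chapter index, then return one entry per chapter
    throw new Error('Not implemented');
  }

  async scrapeChapterContent(chapterUrl: string): Promise<string> {
    // Fetch the chapter page and return its cleaned content
    throw new Error('Not implemented');
  }
}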

// Register the scraper
storySpider.registerScraper(new CustomScraper());
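
For canHandle, one reasonable approach is to compare the URL's hostname against the supported domains rather than searching the whole URL string, which avoids false matches on paths or look-alike hosts. The helper below is a minimal sketch of that idea; hostMatches is a hypothetical name, and the source does not show how the built-in scrapers perform this check.

```typescript
// Hypothetical helper; not part of story-spider's API.
function hostMatches(url: string, domains: string[]): boolean {
  let host: string;
  try {
    host = new URL(url).hostname.toLowerCase();
  } catch {
    return false; // not a parseable absolute URL
  }
  // Accept the exact domain or any subdomain of it (for example "www.").
  return domains.some((domain) => host === domain || host.endsWith(`.${domain}`));
}
```

Inside a custom scraper, canHandle could then return hostMatches(url, this.getSupportedDomains()).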

Using Authenticated Scrapers (Metruyencv)

For websites that require authentication like metruyencv.com, simply set environment variables and use StorySpider normally:

# Set environment variables
export METRUYENCV_EMAIL="your-email@example.com"
export METRUYENCV_PASSWORD="your-password"

import { StorySpider } from '@duyquangnvx/story-spider';

// Create story spider (Metruyencv scraper is registered automatically)
const storySpider = new StorySpider({
    rateLimiterOptions: {
      minTime: 2000, // Recommended for authenticated sites
      maxConcurrent: 1,
    }
});

// Use normally - authentication happens automatically
const storyInfo = await storySpider.scrapeStoryInfo('https://metruyencv.com/truyen/example-story/');
const chapters = await storySpider.scrapeChapterList('https://metruyencv.com/truyen/example-story/');

if (chapters.length > 0) {
    const chapterContent = await storySpider.scrapeChapterContent(chapters[0].url);
    console.log(chapterContent);
}

Testing

# Test with specific scrapers
npm run test:truyenfull
npm run test:rungtruyen
npm run test:xalosach
npm run test:mailinhwp

# Test Metruyencv (requires environment variables)
export METRUYENCV_EMAIL="your-email@example.com"
export METRUYENCV_PASSWORD="your-password"
npm run test:story-spider-metruyencv

Supported Websites

Truyenfull (truyenfull.vn, truyenfull.com, truyenfull.vision, truyenfull.vip)
Rungtruyen (rungtruyen.com)
Metruyencv (metruyencv.com) - Requires authentication
Xalosach
Mailinhwp

Additional website scrapers can be implemented separately or contributed to this project.

Dependencies

html-to-text: For advanced HTML content cleaning
axios: For network requests
cheerio: For HTML parsing
winston: For logging
node-cache: For caching
playwright: For browser automation (authenticated scrapers)
crawlee: For advanced web scraping capabilities

License

ISC