story-spider

v1.0.4

Published

6 months ago

A TypeScript library for scraping stories from various Vietnamese websites

quangvd2

scraper story vietnamese truyenyy truyenfull crawler

Story Spider

Story Spider is a TypeScript library for scraping stories from popular Vietnamese story websites such as truyenyy.vn and truyenfull.vn. Its modular architecture makes it easy to add support for additional websites.

Features

Collect story information (title, description, author, genre, status, total chapters)
Get chapter lists with details
Retrieve chapter URLs by chapter number
Get chapter content in HTML or text format
Intelligent content cleaning with html-to-text integration
Advanced caching for reduced bandwidth and faster performance
Rate limiting to avoid overloading servers
Extensible adapter system for supporting multiple websites
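
To illustrate the caching feature, here is a minimal sketch of a time-to-live cache keyed by URL. It is not the library's implementation (the Dependencies section lists node-cache for that); TtlCache and its methods are hypothetical names used only for illustration.

```typescript
// Hypothetical sketch: a tiny TTL cache keyed by URL. Not part of story-spider.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Store a value that stays valid for ttlMs milliseconds after `now`.
  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }

  // Return the cached value, or undefined if it is missing or expired.
  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (now >= entry.expiresAt) {
      this.entries.delete(key); // drop the stale entry
      return undefined;
    }
    return entry.value;
  }
}
```

A scraper built this way would check the cache before fetching a page and store the response afterwards, so repeated requests for the same URL within the TTL cost no bandwidth.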

Installation

npm install story-spider

Usage

Basic Usage

import { StorySpider, TruyenfullScraper } from 'story-spider';

// Create a story spider with rate limiting
const storySpider = new StorySpider({
  rateLimiterOptions: {
    minTime: 1000,
    maxConcurrent: 1,
  },
});

// Register a scraper
storySpider.registerScraper(new TruyenfullScraper());

// Placeholder URLs: they must belong to a site handled by a registered scraper
const storyUrl = 'https://truyenfull.vn/example-story/';
const chapterUrl = 'https://truyenfull.vn/example-story/chuong-1/';

// Get story information
const storyInfo = await storySpider.scapeStoryInfo(storyUrl);
console.log(storyInfo);

// Get chapter list
const chapters = await storySpider.scapeChapterList(storyUrl);
console.log(chapters);

// Get chapter content
const chapterContent = await storySpider.scapeChapterContent(chapterUrl);
console.log(chapterContent);
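
This README does not define the rateLimiterOptions fields. As a rough sketch, assuming minTime is the minimum number of milliseconds between the start of consecutive requests, the spacing could be computed as follows. MinTimeScheduler and throttled are hypothetical names that are not part of story-spider, and maxConcurrent is not modeled here.

```typescript
// Hypothetical sketch of minTime-style request spacing. Not the library's code.
class MinTimeScheduler {
  private nextStart = 0; // earliest time (ms) at which the next request may start
  private minTime: number;

  constructor(minTime: number) {
    this.minTime = minTime;
  }

  // Reserve the next slot and return how many ms the caller must wait from `now`.
  reserve(now: number): number {
    const start = Math.max(now, this.nextStart);
    this.nextStart = start + this.minTime;
    return start - now;
  }
}

// Run `task` once its reserved slot arrives.
async function throttled<T>(
  scheduler: MinTimeScheduler,
  task: () => Promise<T>,
): Promise<T> {
  const waitMs = scheduler.reserve(Date.now());
  if (waitMs > 0) {
    await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
  }
  return task();
}
```

With minTime set to 1000, a request made 100 ms after the previous one would wait 900 ms, while a request made after a long pause would start immediately.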

Creating a Custom Scraper

You can create your own scraper for any website by extending the StoryScraper class:

import { StoryScraper, StoryInfo, ChapterInfo } from 'story-spider';

export class CustomScraper extends StoryScraper {
  getSiteIdentifier(): string {
    // Return a unique identifier for the site
    throw new Error('Not implemented');
  }

  getSupportedDomains(): string[] {
    // Return the domains this scraper supports
    throw new Error('Not implemented');
  }

  canHandle(url: string): boolean {
    // Return true if this scraper can handle the given URL
    throw new Error('Not implemented');
  }

  async scapeStoryInfo(storyUrl: string): Promise<StoryInfo> {
    // Fetch and parse the story information
    throw new Error('Not implemented');
  }

  async scapeChapterList(storyUrl: string): Promise<ChapterInfo[]> {
    // Fetch and parse the chapter list
    throw new Error('Not implemented');
  }

  async scapeChapterContent(chapterUrl: string): Promise<string> {
    // Fetch and return the chapter content
    throw new Error('Not implemented');
  }
}

// Register the scraper
storySpider.registerScraper(new CustomScraper());
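
One possible way to implement canHandle is to compare the URL's hostname against the list returned by getSupportedDomains. The helper below is a self-contained sketch under that assumption; matchesDomain is a hypothetical name, and the README does not say how the built-in scrapers perform this check.

```typescript
// Hypothetical helper: true when `url` is an http(s) URL whose host equals
// one of `domains` or is a subdomain of one of them.
function matchesDomain(url: string, domains: string[]): boolean {
  let parsed: URL;
  try {
    parsed = new URL(url);
  } catch {
    return false; // not a valid absolute URL
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return false;
  }
  const host = parsed.hostname.toLowerCase();
  return domains.some((domain) => host === domain || host.endsWith('.' + domain));
}
```

Inside a custom scraper, canHandle could then return matchesDomain(url, this.getSupportedDomains()). Matching on the parsed hostname rather than on a substring avoids accepting look-alike hosts such as nottruyenfull.vn.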

Supported Websites

The base library provides infrastructure for scrapers. Specific website scrapers can be implemented separately or contributed to this project.
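
To show how an adapter system of this kind can route a URL to the right scraper, here is a minimal registry sketch. It assumes dispatch is based on canHandle, which this README does not confirm; UrlHandler and ScraperRegistry are hypothetical names and not part of story-spider.

```typescript
// Hypothetical sketch of URL-based dispatch. Not the library's implementation.
interface UrlHandler {
  getSiteIdentifier(): string;
  canHandle(url: string): boolean;
}

class ScraperRegistry<S extends UrlHandler> {
  private scrapers: S[] = [];

  // Add a scraper; earlier registrations take priority on overlap.
  register(scraper: S): void {
    this.scrapers.push(scraper);
  }

  // Return the first registered scraper that accepts the URL, or throw.
  find(url: string): S {
    const match = this.scrapers.find((scraper) => scraper.canHandle(url));
    if (match === undefined) {
      throw new Error(`No registered scraper can handle ${url}`);
    }
    return match;
  }
}
```

Under this design, supporting a new website only requires registering another object that implements the interface; the calling code does not change.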

Dependencies

html-to-text: For advanced HTML content cleaning
axios: For network requests
cheerio: For HTML parsing
winston: For logging
node-cache: For caching

License

ISC
