@duyquangnvx/story-spider
v2.0.2
A TypeScript library for scraping stories from various Vietnamese websites
Story Spider
Story Spider is a TypeScript library for scraping stories from popular Vietnamese story websites such as truyenyy.vn and truyenfull.vn. Its modular architecture makes it easy to add support for additional websites.
Features
- Collect story information (title, description, author, genre, status, total chapters)
- Get chapter lists with details
- Retrieve chapter URLs by chapter number
- Get chapter content in HTML or text format
- Content cleaning via html-to-text
- Caching to reduce bandwidth use and speed up repeated requests
- Rate limiting to avoid overloading servers
- Extensible adapter system for supporting multiple websites
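To make the rate-limiting feature above more concrete, here is a minimal sketch of how a minimum gap between request start times could be enforced. It is an illustration only, not story-spider's implementation: `MinTimeScheduler` and `reserve` are hypothetical names, and the sketch assumes that a `minTime`-style setting means "minimum milliseconds between the start of consecutive requests".

```typescript
// Hypothetical helper, not part of story-spider: spaces out request start times.
class MinTimeScheduler {
  private lastStart = -Infinity;
  private minTimeMs: number;

  constructor(minTimeMs: number) {
    this.minTimeMs = minTimeMs;
  }

  // Returns the earliest time (ms) at which a request arriving at `nowMs` may start.
  reserve(nowMs: number): number {
    const start = Math.max(nowMs, this.lastStart + this.minTimeMs);
    this.lastStart = start;
    return start;
  }
}

// Three requests arrive almost at once, a fourth arrives much later.
const scheduler = new MinTimeScheduler(1000);
const starts = [0, 10, 20, 5000].map((arrival) => scheduler.reserve(arrival));
console.log(starts); // [0, 1000, 2000, 5000]
```

Requests that arrive in a burst are pushed back so that each starts at least one second after the previous one, while a request that arrives after a long pause starts immediately.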
Installation
npm install story-spider
Environment Variables
For authenticated websites (metruyencv.com), set up environment variables:
# Metruyencv authentication
export METRUYENCV_EMAIL="your-email@example.com"
export METRUYENCV_PASSWORD="your-password"
Or create a .env file:
METRUYENCV_EMAIL=your-email@example.com
METRUYENCV_PASSWORD=your-password
Usage
Basic Usage
import { StorySpider } from 'story-spider';
// Create story spider
const storySpider = new StorySpider({
  rateLimiterOptions: {
    minTime: 1000,
    maxConcurrent: 1,
  }
});
// Default scrapers are registered automatically:
// - TruyenfullScraper, RungtruyenScraper, XalosachScraper, MailinhwpScraper, MetruyencvScraper
// Get story information (works with any supported website)
const storyInfo = await storySpider.scrapeStoryInfo('https://rungtruyen.com/category/tien-hiep/quang-am-chi-ngoai/');
console.log(storyInfo);
// Get chapter list
const chapters = await storySpider.scrapeChapterList('https://rungtruyen.com/category/tien-hiep/quang-am-chi-ngoai/');
console.log(chapters);
// Get chapter content (here, the first chapter from the list above)
const chapterUrl = chapters[0].url;
const chapterContent = await storySpider.scrapeChapterContent(chapterUrl);
console.log(chapterContent);
Creating a Custom Scraper
You can create your own scraper for any website by extending the StoryScraper class:
import { StoryScraper, StoryInfo, ChapterInfo } from 'story-spider';
export class CustomScraper extends StoryScraper {
  getSiteIdentifier(): string {
    // Return a unique identifier for this site
    throw new Error('Not implemented');
  }
  getSupportedDomains(): string[] {
    // Return the domains this scraper supports
    throw new Error('Not implemented');
  }
  canHandle(url: string): boolean {
    // Return true if this scraper can handle the given URL
    throw new Error('Not implemented');
  }
  async scapeStoryInfo(storyUrl: string): Promise<StoryInfo> {
    // Fetch and return the story information
    throw new Error('Not implemented');
  }
  async scapeChapterList(storyUrl: string): Promise<ChapterInfo[]> {
    // Fetch and return the chapter list
    throw new Error('Not implemented');
  }
  async scapeChapterContent(chapterUrl: string): Promise<string> {
    // Fetch and return the chapter content
    throw new Error('Not implemented');
  }
}
// Register the scraper
storySpider.registerScraper(new CustomScraper());
Using Authenticated Scrapers (Metruyencv)
For websites that require authentication like metruyencv.com, simply set environment variables and use StorySpider normally:
# Set environment variables
export METRUYENCV_EMAIL="your-email@example.com"
export METRUYENCV_PASSWORD="your-password"
import { StorySpider } from 'story-spider';
// Create story spider (Metruyencv scraper is registered automatically)
const storySpider = new StorySpider({
  rateLimiterOptions: {
    minTime: 2000, // Recommended for authenticated sites
    maxConcurrent: 1,
  }
});
// Use normally - authentication happens automatically
const storyInfo = await storySpider.scrapeStoryInfo('https://metruyencv.com/truyen/example-story/');
const chapters = await storySpider.scrapeChapterList('https://metruyencv.com/truyen/example-story/');
if (chapters.length > 0) {
  const chapterContent = await storySpider.scrapeChapterContent(chapters[0].url);
  console.log(chapterContent);
}
Testing
# Test with specific scrapers
npm run test:truyenfull
npm run test:rungtruyen
npm run test:xalosach
npm run test:mailinhwp
# Test Metruyencv (requires environment variables)
export METRUYENCV_EMAIL="your-email@example.com"
export METRUYENCV_PASSWORD="your-password"
npm run test:story-spider-metruyencv
Supported Websites
- Truyenfull (truyenfull.vn, truyenfull.com, truyenfull.vision, truyenfull.vip)
- Rungtruyen (rungtruyen.com)
- Metruyencv (metruyencv.com) - Requires authentication
Additional website scrapers can be implemented separately or contributed to this project.
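As a rough illustration of how a scraper for an additional website might be selected by URL, the sketch below matches a URL's hostname against each registered scraper's supported domains. The method names `getSiteIdentifier`, `getSupportedDomains`, `canHandle`, and `registerScraper` come from this README; everything else (`ScraperLike`, `matchesDomain`, `ExampleScraper`, `ScraperRegistry`, `findScraper`, and the matching rule itself) is a hypothetical stand-in, and the library's real dispatch logic may differ.

```typescript
// Hypothetical interface mirroring the identification methods named in this README.
interface ScraperLike {
  getSiteIdentifier(): string;
  getSupportedDomains(): string[];
  canHandle(url: string): boolean;
}

// Assumed matching rule: exact hostname or any subdomain of a supported domain.
function matchesDomain(url: string, domains: string[]): boolean {
  let hostname: string;
  try {
    hostname = new URL(url).hostname.toLowerCase();
  } catch {
    return false; // Not a parseable absolute URL
  }
  return domains.some((d) => hostname === d || hostname.endsWith("." + d));
}

// Placeholder scraper for an imaginary site.
class ExampleScraper implements ScraperLike {
  getSiteIdentifier(): string {
    return "example";
  }
  getSupportedDomains(): string[] {
    return ["example.com", "example.org"];
  }
  canHandle(url: string): boolean {
    return matchesDomain(url, this.getSupportedDomains());
  }
}

// Hypothetical registry: the first registered scraper that can handle the URL wins.
class ScraperRegistry {
  private scrapers: ScraperLike[] = [];

  registerScraper(scraper: ScraperLike): void {
    this.scrapers.push(scraper);
  }

  findScraper(url: string): ScraperLike | undefined {
    return this.scrapers.find((s) => s.canHandle(url));
  }
}

const registry = new ScraperRegistry();
registry.registerScraper(new ExampleScraper());
const found = registry.findScraper("https://www.example.com/story/1");
console.log(found ? found.getSiteIdentifier() : "no scraper found");
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as `notexample.com` or a supported domain appearing only in the path.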
Dependencies
- html-to-text: For advanced HTML content cleaning
- axios: For network requests
- cheerio: For HTML parsing
- winston: For logging
- node-cache: For caching
- playwright: For browser automation (authenticated scrapers)
- crawlee: For advanced web scraping capabilities
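The list above names node-cache for caching, but this README does not show how it is configured. As a sketch of the general idea only, the hand-rolled cache below keeps a fetched value for a fixed time-to-live and then treats it as missing. `TtlCache` is a hypothetical class written for illustration; it does not use node-cache's API and is not how story-spider stores its data.

```typescript
// Hypothetical TTL cache for illustration; does not use node-cache.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be demonstrated without real waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // Expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }
}

// Simulated clock so the example runs instantly.
let fakeNow = 0;
const cache = new TtlCache<string>(60000, () => fakeNow);
const chapterKey = "https://example.com/chapter-1"; // placeholder URL
cache.set(chapterKey, "<p>chapter text</p>");

const fresh = cache.get(chapterKey); // still within the TTL
fakeNow = 60000;
const stale = cache.get(chapterKey); // TTL elapsed
console.log(fresh, stale);
```

Keying the cache by URL means a repeated request for the same chapter within the time-to-live can be answered without another network round trip.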
License
ISC
