story-scraper
v2.0.0
Story Scraper
A TypeScript library for scraping and parsing stories from various Vietnamese novel websites. Extract story metadata, chapter lists, and chapter content with ease.
Features
- 🔍 Dual Mode: Parse HTML strings OR scrape directly from URLs
- ⚡ Type-safe: Built with TypeScript for full type safety
- 🏗️ Modular: Clean, extensible architecture with base parser and scraper classes
- 🌐 Multiple sites: Built-in support for TruyenFull.vision and TangThuVien
- 🔧 Easy to extend: Simple API to add custom parsers/scrapers for new websites
- 📦 Lightweight: Minimal dependencies (cheerio for parsing, html-to-text for conversion, native fetch for scraping)
- 🌳 Tree-shakeable: Built with tsup for optimal bundle size
- ⚡ Fast: Uses cheerio for efficient HTML parsing and DOM traversal
- 📝 Smart text extraction: Automatic conversion to clean, readable plain text using html-to-text
- ⏱️ Rate limiting: Built-in delay, retry, and progress tracking for scraping
- ✅ Well-tested: Comprehensive test suite
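The delay/retry behavior mentioned above can be pictured with a small sketch. This is an illustration of the pattern only, not the library's internal implementation; `withRetry` is a hypothetical helper:

```typescript
// Sketch of the delay + retry pattern the scraper options describe.
// NOT the library's internal code -- just an illustration of the idea.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  delayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt to avoid hammering the server
      if (attempt < maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

A transient failure is retried up to `maxRetries` times with a fixed pause between attempts; only after the final attempt does the error propagate.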
Supported Websites
- TruyenFull: truyenfull.vision, truyenfull.vn, truyenfull.com
- TangThuVien: tangthuvien.vn, tangthuvien.com, wikidich.com
Installation
```sh
npm install story-scraper
# or
yarn add story-scraper
# or
pnpm add story-scraper
```
Usage
Quick Start: Scraping
The easiest way to get started is using the built-in scrapers:
```typescript
import { TruyenFullScraper } from 'story-scraper';

const scraper = new TruyenFullScraper({
  delay: 1000, // 1 second between requests
  timeout: 30000, // 30 second timeout
  maxRetries: 3, // Retry up to 3 times on failure
});

// Scrape story metadata
const result = await scraper.scrapeMetadata('https://truyenfull.vision/story-url/');
if (result.success) {
  console.log('Title:', result.data.title);
  console.log('Author:', result.data.author);
  console.log('Duration:', result.duration, 'ms');
}

// Scrape a chapter
const chapterResult = await scraper.scrapeChapter('https://truyenfull.vision/story/chapter-1/');

// Scrape complete story (metadata + chapter list)
const storyResult = await scraper.scrapeStory('https://truyenfull.vision/story-url/');

// Scrape multiple chapters with progress tracking
const chaptersResult = await scraper.scrapeChapters(chapterUrls, {
  concurrency: 1,
  onProgress: (progress) => {
    console.log(`Progress: ${progress.percentage}%`);
  },
});

// NEW: Scrape ALL chapters from paginated chapter lists
const allChaptersResult = await scraper.scrapeAllChapterPages(url, {
  maxPages: 100, // Safety limit
  onProgress: (progress) => {
    console.log(`Scraping page ${progress.current}...`);
  },
});
```
Handling Paginated Chapter Lists
Many websites split chapter lists across multiple pages. The library automatically handles pagination:
```typescript
import { TruyenFullScraper } from 'story-scraper';

const scraper = new TruyenFullScraper({ delay: 1000 });

// Get ALL chapters across ALL pages (recommended)
const result = await scraper.scrapeAllChapterPages(storyUrl, {
  maxPages: 100, // Safety limit
  onProgress: (progress) => {
    console.log(`Page ${progress.current} scraped`);
  },
});

if (result.success) {
  console.log(`Total chapters: ${result.data.length}`);
  // result.data contains ALL chapters from ALL pages
}
```
Parsing Only (No HTTP Requests)
If you already have the HTML and just need to parse it:
```typescript
import { TruyenFullParser } from 'story-scraper';

const parser = new TruyenFullParser();

// Parse story metadata
const htmlContent = '...'; // HTML string from the story page
const metadataResult = parser.parseMetadata(htmlContent);
if (metadataResult.success && metadataResult.data) {
  console.log('Title:', metadataResult.data.title);
  console.log('Author:', metadataResult.data.author);
  console.log('Description:', metadataResult.data.description);
}

// Parse chapter list
const chapterListResult = parser.parseChapterList(htmlContent);
if (chapterListResult.success && chapterListResult.data) {
  chapterListResult.data.forEach(chapter => {
    console.log(`Chapter ${chapter.chapterNumber}: ${chapter.title}`);
  });
}

// Parse chapter content
const chapterResult = parser.parseChapter(htmlContent);
if (chapterResult.success && chapterResult.data) {
  console.log('Chapter content:', chapterResult.data.content);
}
```
Using Scraper Registry
Auto-detect and use the right scraper for any supported URL:
```typescript
import { getScraperByURL, getScraper } from 'story-scraper';

// Auto-detect scraper from URL
const scraper = getScraperByURL('https://truyenfull.vision/some-story/');
if (scraper) {
  const result = await scraper.scrapeMetadata(url);
}

// Or get scraper by domain
const scraper2 = getScraper('truyenfull.vision', {
  delay: 500,
  timeout: 20000,
});
```
Using Parser Registry (Parse-only mode)
```typescript
import { getParser, isSupported } from 'story-scraper';

// Check if a domain is supported
if (isSupported('truyenfull.vision')) {
  const parser = getParser('truyenfull.vision');
  if (parser) {
    const result = parser.parseMetadata(htmlContent);
    // ...
  }
}
```
Using with URLs (Parse-only mode)
```typescript
import { getParserByURL } from 'story-scraper';

const url = 'https://truyenfull.vision/some-story/';
const parser = getParserByURL(url);
if (parser) {
  // Use parser to parse content
}
```
Creating Custom Parsers
```typescript
import {
  BaseParser,
  ParserResult,
  StoryMetadata,
  ChapterListItem,
  Chapter,
} from 'story-scraper';

class MyCustomParser extends BaseParser {
  getSupportedDomains(): string[] {
    return ['example.com'];
  }

  parseMetadata(html: string): ParserResult<StoryMetadata> {
    const doc = this.safeParseHTML(html);
    if (!doc) {
      return this.error('Failed to parse HTML');
    }
    // Your custom parsing logic here
    const title = doc.querySelector('.title')?.textContent || '';
    return this.success({
      title,
      author: 'Unknown',
      description: '',
    });
  }

  parseChapterList(html: string): ParserResult<ChapterListItem[]> {
    // Your implementation
  }

  parseChapter(html: string): ParserResult<Chapter> {
    // Your implementation
  }
}

// Register your custom parser
import { defaultRegistry } from 'story-scraper';
defaultRegistry.register(new MyCustomParser());
```
API Reference
Types
StoryMetadata
```typescript
interface StoryMetadata {
  title: string;
  author: string;
  description: string;
  coverImage?: string;
  status?: string;
  genres?: string[];
  tags?: string[];
  rating?: number;
  views?: number;
  lastUpdated?: Date;
}
```
ChapterListItem
```typescript
interface ChapterListItem {
  chapterNumber: number;
  title: string;
  url?: string;
  id?: string;
  publishedAt?: Date;
}
```
Chapter
```typescript
interface Chapter {
  chapterNumber: number;
  title: string;
  content: string; // HTML format
  contentText?: string; // Plain text format (NEW in v1.4.0)
  previousChapter?: {
    chapterNumber: number;
    url?: string;
  };
  nextChapter?: {
    chapterNumber: number;
    url?: string;
  };
  publishedAt?: Date;
}
```
ParserResult<T>
```typescript
interface ParserResult<T> {
  success: boolean;
  data?: T;
  error?: string;
}
```
ScrapeResult<T>
```typescript
interface ScrapeResult<T> {
  success: boolean;
  data?: T;
  error?: string;
  statusCode?: number;
  duration?: number; // Request duration in milliseconds
}
```
ScraperOptions
```typescript
interface ScraperOptions {
  headers?: Record<string, string>;
  timeout?: number; // Request timeout in ms
  delay?: number; // Delay between requests in ms
  maxRetries?: number; // Max retry attempts
  userAgent?: string; // Custom user agent
}
```
BulkScrapeOptions
```typescript
interface BulkScrapeOptions extends ScraperOptions {
  onProgress?: (progress: ScrapeProgress) => void;
  concurrency?: number; // Number of concurrent requests
  continueOnError?: boolean; // Continue scraping on errors
}
```
Classes
BaseParser
Abstract base class for all parsers.
Methods:
- `parseMetadata(html: string): ParserResult<StoryMetadata>`
- `parseChapterList(html: string): ParserResult<ChapterListItem[]>`
- `parseChapter(html: string): ParserResult<Chapter>`
- `getSupportedDomains(): string[]`
- `canParse(domain: string): boolean`
TruyenFullParser
Parser for TruyenFull websites.
TangThuVienParser
Parser for TangThuVien websites.
ParserRegistry
Registry for managing parsers.
Methods:
- `register(parser: BaseParser): void`
- `getParser(domain: string): BaseParser | undefined`
- `getParserByURL(url: string): BaseParser | undefined`
- `isSupported(domain: string): boolean`
- `getSupportedDomains(): string[]`
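Under the hood, a registry of this shape amounts to a lookup from domain to parser. A simplified, self-contained sketch of the pattern (not the library's actual code; `SimpleRegistry` and `DomainParser` are hypothetical names):

```typescript
// Simplified sketch of the registry pattern.
// Not the library's actual implementation.
interface DomainParser {
  getSupportedDomains(): string[];
}

class SimpleRegistry<P extends DomainParser = DomainParser> {
  private parsers: P[] = [];

  register(parser: P): void {
    this.parsers.push(parser);
  }

  getParser(domain: string): P | undefined {
    // First registered parser claiming the domain wins
    return this.parsers.find((p) => p.getSupportedDomains().includes(domain));
  }

  getParserByURL(url: string): P | undefined {
    // Derive the domain from the URL, then delegate to getParser
    return this.getParser(new URL(url).hostname);
  }

  isSupported(domain: string): boolean {
    return this.getParser(domain) !== undefined;
  }
}
```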
BaseScraper
Base scraper class that combines parsing with HTTP fetching.
Methods:
- `scrapeMetadata(url: string): Promise<ScrapeResult<StoryMetadata>>`
- `scrapeChapterList(url: string): Promise<ScrapeResult<ChapterListItem[]>>`
- `scrapeChapter(url: string): Promise<ScrapeResult<Chapter>>`
- `scrapeStory(url: string): Promise<ScrapeResult<{metadata, chapters}>>`
- `scrapeChapters(urls: string[], options?: BulkScrapeOptions): Promise<ScrapeResult<Chapter[]>>`
- `getParser(): BaseParser`
- `setOptions(options: Partial<ScraperOptions>): void`
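The bulk-scraping behavior behind `scrapeChapters` (bounded concurrency plus progress callbacks, as described by `BulkScrapeOptions`) can be sketched as follows; this illustrates the idea only and is not the library's internal code:

```typescript
// Sketch of concurrency-limited bulk work with progress reporting.
// Illustrative only -- the library handles this internally.
interface Progress {
  current: number;
  total: number;
  percentage: number;
}

async function mapWithConcurrency<T, R>(
  items: T[],
  concurrency: number,
  fn: (item: T) => Promise<R>,
  onProgress?: (p: Progress) => void,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  let done = 0;

  async function worker(): Promise<void> {
    // Each worker pulls the next unprocessed item off a shared cursor
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index]);
      done++;
      onProgress?.({
        current: done,
        total: items.length,
        percentage: Math.round((done / items.length) * 100),
      });
    }
  }

  // Spawn at most `concurrency` workers
  const workers = Array.from(
    { length: Math.min(concurrency, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

With `concurrency: 1` this degrades to strictly sequential requests, which is the polite default for scraping.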
TruyenFullScraper
Scraper for TruyenFull websites.
TangThuVienScraper
Scraper for TangThuVien websites.
ScraperRegistry
Registry for managing scrapers.
Methods:
- `register(domains: string[], factory: (options?) => BaseScraper): void`
- `getScraper(domain: string, options?: ScraperOptions): BaseScraper | undefined`
- `getScraperByURL(url: string, options?: ScraperOptions): BaseScraper | undefined`
- `isSupported(domain: string): boolean`
- `getSupportedDomains(): string[]`
Functions
Parser Functions
- `getParser(domain: string): BaseParser | undefined` - Get parser for a domain
- `getParserByURL(url: string): BaseParser | undefined` - Get parser by URL
- `isSupported(domain: string): boolean` - Check if domain is supported
- `getSupportedDomains(): string[]` - Get all supported domains
Scraper Functions
- `getScraper(domain: string, options?: ScraperOptions): BaseScraper | undefined` - Get scraper for a domain
- `getScraperByURL(url: string, options?: ScraperOptions): BaseScraper | undefined` - Get scraper by URL
- `isScrapingSupported(domain: string): boolean` - Check if scraping is supported
- `getScrapingSupportedDomains(): string[]` - Get all domains that support scraping
Examples
The library includes working examples that demonstrate both parsing and scraping:
Run Examples
```sh
# Scraper usage (recommended - shows all scraper features)
npm run example:scraper

# Pagination example (scrape ALL chapters from multi-page lists)
npm run example:pagination

# TruyenFull example (parse-only mode with manual fetching)
npm run example:truyenfull

# TangThuVien example (parse-only mode with manual fetching)
npm run example:tangthuvien
```
These examples will:
- Fetch HTML content from real story websites
- Parse metadata, chapter lists, and chapter content
- Save the parsed data to the `tmp/` directory as JSON and HTML files
Example output files:
- `tmp/truyenfull/metadata.json` - Parsed story metadata
- `tmp/truyenfull/chapters.json` - List of all chapters
- `tmp/truyenfull/chapter-1.json` - Parsed chapter content (JSON)
- `tmp/truyenfull/chapter-1.html` - Formatted chapter content (HTML)
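The save step itself is plain Node file I/O. A minimal sketch of that round trip (`saveJSON`/`loadJSON` are hypothetical helpers, writing under the OS temp dir rather than the examples' `tmp/` layout):

```typescript
import { mkdirSync, writeFileSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

// Persist a parsed result as pretty-printed JSON, mirroring the
// tmp/<site>/<name>.json layout the bundled examples produce.
function saveJSON(
  name: string,
  data: unknown,
  dir: string = join(tmpdir(), "story-scraper-demo"),
): string {
  mkdirSync(dir, { recursive: true });
  const file = join(dir, `${name}.json`);
  writeFileSync(file, JSON.stringify(data, null, 2));
  return file;
}

// Read a saved result back for later processing
function loadJSON<T>(file: string): T {
  return JSON.parse(readFileSync(file, "utf8")) as T;
}
```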
Development
Setup
```sh
# Install dependencies
npm install

# Run type checking
npm run typecheck

# Build the library
npm run build

# Run tests
npm test

# Watch mode for development
npm run dev
```
Project Structure
```
src/
├── types/              # TypeScript interfaces and types
│   └── index.ts
├── parsers/            # Parser implementations
│   ├── base/           # Base parser class
│   ├── truyenfull/     # TruyenFull parser
│   ├── tangthuvien/    # TangThuVien parser
│   ├── ParserRegistry.ts
│   └── index.ts
├── utils/              # Utility functions
│   └── dom.ts
└── index.ts            # Main entry point
```
Contributing
Contributions are welcome! To add support for a new website:
1. Create a new parser class extending `BaseParser`
2. Implement the required methods: `parseMetadata`, `parseChapterList`, `parseChapter`, and `getSupportedDomains`
3. Register the parser in `src/index.ts`
4. Add tests for your parser
5. Submit a pull request
License
ISC
Notes
In parse-only mode, the library only parses HTML content: you fetch the HTML yourself (e.g., using fetch, axios, etc.). This design keeps that path lightweight and gives you full control over how content is fetched; the built-in scrapers handle fetching for you.
Example: Complete Workflow
```typescript
import { getParserByURL } from 'story-scraper';

async function parseStory(url: string) {
  // 1. Fetch HTML (you need to implement this)
  const response = await fetch(url);
  const html = await response.text();

  // 2. Get appropriate parser
  const parser = getParserByURL(url);
  if (!parser) {
    throw new Error('No parser found for this URL');
  }

  // 3. Parse metadata
  const metadataResult = parser.parseMetadata(html);
  if (!metadataResult.success) {
    throw new Error(metadataResult.error);
  }

  // 4. Parse chapter list
  const chaptersResult = parser.parseChapterList(html);
  if (!chaptersResult.success) {
    throw new Error(chaptersResult.error);
  }

  return {
    metadata: metadataResult.data,
    chapters: chaptersResult.data,
  };
}
```