story-scraper
v2.0.0
Story Scraper
A TypeScript library for scraping and parsing stories from various Vietnamese novel websites. Extract story metadata, chapter lists, and chapter content with ease.
Features
- 🔍 Dual Mode: Parse HTML strings OR scrape directly from URLs
- ⚡ Type-safe: Built with TypeScript for full type safety
- 🏗️ Modular: Clean, extensible architecture with base parser and scraper classes
- 🌐 Multiple sites: Built-in support for TruyenFull.vision and TangThuVien
- 🔧 Easy to extend: Simple API to add custom parsers/scrapers for new websites
- 📦 Lightweight: Minimal dependencies (cheerio for parsing, html-to-text for conversion, native fetch for scraping)
- 🌳 Tree-shakeable: Built with tsup for optimal bundle size
- ⚡ Fast: Uses cheerio for efficient HTML parsing and DOM traversal
- 📝 Smart text extraction: Automatic conversion to clean, readable plain text using html-to-text
- ⏱️ Rate limiting: Built-in delay, retry, and progress tracking for scraping
- ✅ Well-tested: Comprehensive test suite
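The delay/retry behavior mentioned above can be pictured with a small sketch. This is an illustration of the pattern only, not the library's internal implementation; `withRetry` is a hypothetical helper:

```typescript
// Sketch of the delay + retry pattern the scraper options describe.
// NOT the library's internal code -- just an illustration of the idea.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  delayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt to avoid hammering the server
      if (attempt < maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

A transient failure is retried up to `maxRetries` times with a fixed pause between attempts; only after the final attempt does the error propagate.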
Supported Websites
- TruyenFull: truyenfull.vision, truyenfull.vn, truyenfull.com
- TangThuVien: tangthuvien.vn, tangthuvien.com, wikidich.com
Installation
```sh
npm install story-scraper
# or
yarn add story-scraper
# or
pnpm add story-scraper
```
Usage
Quick Start: Scraping
The easiest way to get started is using the built-in scrapers:
```typescript
import { TruyenFullScraper } from 'story-scraper';

const scraper = new TruyenFullScraper({
  delay: 1000, // 1 second between requests
  timeout: 30000, // 30 second timeout
  maxRetries: 3, // Retry up to 3 times on failure
});

// Scrape story metadata
const result = await scraper.scrapeMetadata('https://truyenfull.vision/story-url/');
if (result.success) {
  console.log('Title:', result.data.title);
  console.log('Author:', result.data.author);
  console.log('Duration:', result.duration, 'ms');
}

// Scrape a chapter
const chapterResult = await scraper.scrapeChapter('https://truyenfull.vision/story/chapter-1/');

// Scrape complete story (metadata + chapter list)
const storyResult = await scraper.scrapeStory('https://truyenfull.vision/story-url/');

// Scrape multiple chapters with progress tracking
const chaptersResult = await scraper.scrapeChapters(chapterUrls, {
  concurrency: 1,
  onProgress: (progress) => {
    console.log(`Progress: ${progress.percentage}%`);
  },
});

// NEW: Scrape ALL chapters from paginated chapter lists
const allChaptersResult = await scraper.scrapeAllChapterPages(url, {
  maxPages: 100, // Safety limit
  onProgress: (progress) => {
    console.log(`Scraping page ${progress.current}...`);
  },
});
```
Handling Paginated Chapter Lists
Many websites split chapter lists across multiple pages. The library automatically handles pagination:
```typescript
import { TruyenFullScraper } from 'story-scraper';

const scraper = new TruyenFullScraper({ delay: 1000 });

// Get ALL chapters across ALL pages (recommended)
const result = await scraper.scrapeAllChapterPages(storyUrl, {
  maxPages: 100, // Safety limit
  onProgress: (progress) => {
    console.log(`Page ${progress.current} scraped`);
  },
});

if (result.success) {
  console.log(`Total chapters: ${result.data.length}`);
  // result.data contains ALL chapters from ALL pages
}
```
Parsing Only (No HTTP Requests)
If you already have the HTML and just need to parse it:
```typescript
import { TruyenFullParser } from 'story-scraper';

const parser = new TruyenFullParser();

// Parse story metadata
const htmlContent = '...'; // HTML string from the story page
const metadataResult = parser.parseMetadata(htmlContent);
if (metadataResult.success && metadataResult.data) {
  console.log('Title:', metadataResult.data.title);
  console.log('Author:', metadataResult.data.author);
  console.log('Description:', metadataResult.data.description);
}

// Parse chapter list
const chapterListResult = parser.parseChapterList(htmlContent);
if (chapterListResult.success && chapterListResult.data) {
  chapterListResult.data.forEach(chapter => {
    console.log(`Chapter ${chapter.chapterNumber}: ${chapter.title}`);
  });
}

// Parse chapter content
const chapterResult = parser.parseChapter(htmlContent);
if (chapterResult.success && chapterResult.data) {
  console.log('Chapter content:', chapterResult.data.content);
}
```
Using Scraper Registry
Auto-detect and use the right scraper for any supported URL:
```typescript
import { getScraperByURL, getScraper } from 'story-scraper';

// Auto-detect scraper from URL
const scraper = getScraperByURL('https://truyenfull.vision/some-story/');
if (scraper) {
  const result = await scraper.scrapeMetadata(url);
}

// Or get scraper by domain
const scraper2 = getScraper('truyenfull.vision', {
  delay: 500,
  timeout: 20000,
});
```
Using Parser Registry (Parse-only mode)
```typescript
import { getParser, isSupported } from 'story-scraper';

// Check if a domain is supported
if (isSupported('truyenfull.vision')) {
  const parser = getParser('truyenfull.vision');
  if (parser) {
    const result = parser.parseMetadata(htmlContent);
    // ...
  }
}
```
Using with URLs (Parse-only mode)
```typescript
import { getParserByURL } from 'story-scraper';

const url = 'https://truyenfull.vision/some-story/';
const parser = getParserByURL(url);
if (parser) {
  // Use parser to parse content
}
```
Creating Custom Parsers
```typescript
import {
  BaseParser,
  ParserResult,
  StoryMetadata,
  ChapterListItem,
  Chapter,
} from 'story-scraper';

class MyCustomParser extends BaseParser {
  getSupportedDomains(): string[] {
    return ['example.com'];
  }

  parseMetadata(html: string): ParserResult<StoryMetadata> {
    const doc = this.safeParseHTML(html);
    if (!doc) {
      return this.error('Failed to parse HTML');
    }
    // Your custom parsing logic here
    const title = doc.querySelector('.title')?.textContent || '';
    return this.success({
      title,
      author: 'Unknown',
      description: '',
    });
  }

  parseChapterList(html: string): ParserResult<ChapterListItem[]> {
    // Your implementation
  }

  parseChapter(html: string): ParserResult<Chapter> {
    // Your implementation
  }
}

// Register your custom parser
import { defaultRegistry } from 'story-scraper';
defaultRegistry.register(new MyCustomParser());
```
API Reference
Types
StoryMetadata
```typescript
interface StoryMetadata {
  title: string;
  author: string;
  description: string;
  coverImage?: string;
  status?: string;
  genres?: string[];
  tags?: string[];
  rating?: number;
  views?: number;
  lastUpdated?: Date;
}
```
ChapterListItem
```typescript
interface ChapterListItem {
  chapterNumber: number;
  title: string;
  url?: string;
  id?: string;
  publishedAt?: Date;
}
```
Chapter
```typescript
interface Chapter {
  chapterNumber: number;
  title: string;
  content: string; // HTML format
  contentText?: string; // Plain text format (NEW in v1.4.0)
  previousChapter?: {
    chapterNumber: number;
    url?: string;
  };
  nextChapter?: {
    chapterNumber: number;
    url?: string;
  };
  publishedAt?: Date;
}
```
ParserResult<T>
```typescript
interface ParserResult<T> {
  success: boolean;
  data?: T;
  error?: string;
}
```
ScrapeResult<T>
```typescript
interface ScrapeResult<T> {
  success: boolean;
  data?: T;
  error?: string;
  statusCode?: number;
  duration?: number; // Request duration in milliseconds
}
```
ScraperOptions
```typescript
interface ScraperOptions {
  headers?: Record<string, string>;
  timeout?: number; // Request timeout in ms
  delay?: number; // Delay between requests in ms
  maxRetries?: number; // Max retry attempts
  userAgent?: string; // Custom user agent
}
```
BulkScrapeOptions
```typescript
interface BulkScrapeOptions extends ScraperOptions {
  onProgress?: (progress: ScrapeProgress) => void;
  concurrency?: number; // Number of concurrent requests
  continueOnError?: boolean; // Continue scraping on errors
}
```
Classes
BaseParser
Abstract base class for all parsers.
Methods:
- `parseMetadata(html: string): ParserResult<StoryMetadata>`
- `parseChapterList(html: string): ParserResult<ChapterListItem[]>`
- `parseChapter(html: string): ParserResult<Chapter>`
- `getSupportedDomains(): string[]`
- `canParse(domain: string): boolean`
TruyenFullParser
Parser for TruyenFull websites.
TangThuVienParser
Parser for TangThuVien websites.
ParserRegistry
Registry for managing parsers.
Methods:
- `register(parser: BaseParser): void`
- `getParser(domain: string): BaseParser | undefined`
- `getParserByURL(url: string): BaseParser | undefined`
- `isSupported(domain: string): boolean`
- `getSupportedDomains(): string[]`
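Under the hood, a registry of this shape amounts to a lookup from domain to parser. A simplified, self-contained sketch of the pattern (not the library's actual code; `SimpleRegistry` and `DomainParser` are hypothetical names):

```typescript
// Simplified sketch of the registry pattern.
// Not the library's actual implementation.
interface DomainParser {
  getSupportedDomains(): string[];
}

class SimpleRegistry<P extends DomainParser = DomainParser> {
  private parsers: P[] = [];

  register(parser: P): void {
    this.parsers.push(parser);
  }

  getParser(domain: string): P | undefined {
    // First registered parser claiming the domain wins
    return this.parsers.find((p) => p.getSupportedDomains().includes(domain));
  }

  getParserByURL(url: string): P | undefined {
    // Derive the domain from the URL, then delegate to getParser
    return this.getParser(new URL(url).hostname);
  }

  isSupported(domain: string): boolean {
    return this.getParser(domain) !== undefined;
  }
}
```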
BaseScraper
Base scraper class that combines parsing with HTTP fetching.
Methods:
- `scrapeMetadata(url: string): Promise<ScrapeResult<StoryMetadata>>`
- `scrapeChapterList(url: string): Promise<ScrapeResult<ChapterListItem[]>>`
- `scrapeChapter(url: string): Promise<ScrapeResult<Chapter>>`
- `scrapeStory(url: string): Promise<ScrapeResult<{metadata, chapters}>>`
- `scrapeChapters(urls: string[], options?: BulkScrapeOptions): Promise<ScrapeResult<Chapter[]>>`
- `getParser(): BaseParser`
- `setOptions(options: Partial<ScraperOptions>): void`
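The bulk-scraping behavior behind `scrapeChapters` (bounded concurrency plus progress callbacks, as described by `BulkScrapeOptions`) can be sketched as follows; this illustrates the idea only and is not the library's internal code:

```typescript
// Sketch of concurrency-limited bulk work with progress reporting.
// Illustrative only -- the library handles this internally.
interface Progress {
  current: number;
  total: number;
  percentage: number;
}

async function mapWithConcurrency<T, R>(
  items: T[],
  concurrency: number,
  fn: (item: T) => Promise<R>,
  onProgress?: (p: Progress) => void,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  let done = 0;

  async function worker(): Promise<void> {
    // Each worker pulls the next unprocessed item off a shared cursor
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index]);
      done++;
      onProgress?.({
        current: done,
        total: items.length,
        percentage: Math.round((done / items.length) * 100),
      });
    }
  }

  // Spawn at most `concurrency` workers
  const workers = Array.from(
    { length: Math.min(concurrency, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

With `concurrency: 1` this degrades to strictly sequential requests, which is the polite default for scraping.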
TruyenFullScraper
Scraper for TruyenFull websites.
TangThuVienScraper
Scraper for TangThuVien websites.
ScraperRegistry
Registry for managing scrapers.
Methods:
- `register(domains: string[], factory: (options?) => BaseScraper): void`
- `getScraper(domain: string, options?: ScraperOptions): BaseScraper | undefined`
- `getScraperByURL(url: string, options?: ScraperOptions): BaseScraper | undefined`
- `isSupported(domain: string): boolean`
- `getSupportedDomains(): string[]`
Functions
Parser Functions
- `getParser(domain: string): BaseParser | undefined` - Get parser for a domain
- `getParserByURL(url: string): BaseParser | undefined` - Get parser by URL
- `isSupported(domain: string): boolean` - Check if domain is supported
- `getSupportedDomains(): string[]` - Get all supported domains
Scraper Functions
- `getScraper(domain: string, options?: ScraperOptions): BaseScraper | undefined` - Get scraper for a domain
- `getScraperByURL(url: string, options?: ScraperOptions): BaseScraper | undefined` - Get scraper by URL
- `isScrapingSupported(domain: string): boolean` - Check if scraping is supported
- `getScrapingSupportedDomains(): string[]` - Get all domains that support scraping
Examples
The library includes working examples that demonstrate both parsing and scraping:
Run Examples
```sh
# Scraper usage (recommended - shows all scraper features)
npm run example:scraper

# Pagination example (scrape ALL chapters from multi-page lists)
npm run example:pagination

# TruyenFull example (parse-only mode with manual fetching)
npm run example:truyenfull

# TangThuVien example (parse-only mode with manual fetching)
npm run example:tangthuvien
```
These examples will:
- Fetch HTML content from real story websites
- Parse metadata, chapter lists, and chapter content
- Save the parsed data to the `tmp/` directory as JSON and HTML files
Example output files:
- `tmp/truyenfull/metadata.json` - Parsed story metadata
- `tmp/truyenfull/chapters.json` - List of all chapters
- `tmp/truyenfull/chapter-1.json` - Parsed chapter content (JSON)
- `tmp/truyenfull/chapter-1.html` - Formatted chapter content (HTML)
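The save step itself is plain Node file I/O. A minimal sketch of that round trip (`saveJSON`/`loadJSON` are hypothetical helpers, writing under the OS temp dir rather than the examples' `tmp/` layout):

```typescript
import { mkdirSync, writeFileSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

// Persist a parsed result as pretty-printed JSON, mirroring the
// tmp/<site>/<name>.json layout the bundled examples produce.
function saveJSON(
  name: string,
  data: unknown,
  dir: string = join(tmpdir(), "story-scraper-demo"),
): string {
  mkdirSync(dir, { recursive: true });
  const file = join(dir, `${name}.json`);
  writeFileSync(file, JSON.stringify(data, null, 2));
  return file;
}

// Read a saved result back for later processing
function loadJSON<T>(file: string): T {
  return JSON.parse(readFileSync(file, "utf8")) as T;
}
```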
Development
Setup
```sh
# Install dependencies
npm install

# Run type checking
npm run typecheck

# Build the library
npm run build

# Run tests
npm test

# Watch mode for development
npm run dev
```
Project Structure
```
src/
├── types/              # TypeScript interfaces and types
│   └── index.ts
├── parsers/            # Parser implementations
│   ├── base/           # Base parser class
│   ├── truyenfull/     # TruyenFull parser
│   ├── tangthuvien/    # TangThuVien parser
│   ├── ParserRegistry.ts
│   └── index.ts
├── utils/              # Utility functions
│   └── dom.ts
└── index.ts            # Main entry point
```
Contributing
Contributions are welcome! To add support for a new website:
1. Create a new parser class extending `BaseParser`
2. Implement the required methods: `parseMetadata`, `parseChapterList`, `parseChapter`, and `getSupportedDomains`
3. Register the parser in `src/index.ts`
4. Add tests for your parser
5. Submit a pull request
License
ISC
Notes
In parse-only mode, the library only parses HTML content: you fetch the HTML yourself (e.g., using fetch, axios, etc.). This design keeps that path lightweight and gives you full control over how content is fetched; the built-in scrapers handle fetching for you.
Example: Complete Workflow
```typescript
import { getParserByURL } from 'story-scraper';

async function parseStory(url: string) {
  // 1. Fetch HTML (you need to implement this)
  const response = await fetch(url);
  const html = await response.text();

  // 2. Get appropriate parser
  const parser = getParserByURL(url);
  if (!parser) {
    throw new Error('No parser found for this URL');
  }

  // 3. Parse metadata
  const metadataResult = parser.parseMetadata(html);
  if (!metadataResult.success) {
    throw new Error(metadataResult.error);
  }

  // 4. Parse chapter list
  const chaptersResult = parser.parseChapterList(html);
  if (!chaptersResult.success) {
    throw new Error(chaptersResult.error);
  }

  return {
    metadata: metadataResult.data,
    chapters: chaptersResult.data,
  };
}
```