
story-scraper v2.0.0

Story Scraper

A TypeScript library for scraping and parsing stories from various Vietnamese novel websites. Extract story metadata, chapter lists, and chapter content with ease.

Features

  • 🔍 Dual Mode: Parse HTML strings OR scrape directly from URLs
  • ⚡ Type-safe: Built with TypeScript for full type safety
  • 🏗️ Modular: Clean, extensible architecture with base parser and scraper classes
  • 🌐 Multiple sites: Built-in support for TruyenFull and TangThuVien
  • 🔧 Easy to extend: Simple API to add custom parsers/scrapers for new websites
  • 📦 Lightweight: Minimal dependencies (cheerio for parsing, html-to-text for conversion, native fetch for scraping)
  • 🌳 Tree-shakeable: Built with tsup for optimal bundle size
  • ⚡ Fast: Uses cheerio for efficient HTML parsing and DOM traversal
  • 📝 Smart text extraction: Automatic conversion to clean, readable plain text using html-to-text
  • ⏱️ Rate limiting: Built-in delay, retry, and progress tracking for scraping
  • ✅ Well-tested: Comprehensive test suite

Supported Websites

  • TruyenFull: truyenfull.vision, truyenfull.vn, truyenfull.com
  • TangThuVien: tangthuvien.vn, tangthuvien.com, wikidich.com

Installation

npm install story-scraper

or

yarn add story-scraper

or

pnpm add story-scraper

Usage

Quick Start: Scraping

The easiest way to get started is using the built-in scrapers:

import { TruyenFullScraper } from 'story-scraper';

const scraper = new TruyenFullScraper({
  delay: 1000,      // 1 second between requests
  timeout: 30000,   // 30 second timeout
  maxRetries: 3,    // Retry up to 3 times on failure
});

// Scrape story metadata
const result = await scraper.scrapeMetadata('https://truyenfull.vision/story-url/');

if (result.success) {
  console.log('Title:', result.data.title);
  console.log('Author:', result.data.author);
  console.log('Duration:', result.duration, 'ms');
}

// Scrape a chapter
const chapterResult = await scraper.scrapeChapter('https://truyenfull.vision/story/chapter-1/');

// Scrape complete story (metadata + chapter list)
const storyResult = await scraper.scrapeStory('https://truyenfull.vision/story-url/');

// Scrape multiple chapters with progress tracking
const chaptersResult = await scraper.scrapeChapters(chapterUrls, {
  concurrency: 1,
  onProgress: (progress) => {
    console.log(`Progress: ${progress.percentage}%`);
  },
});

// **NEW**: Scrape ALL chapters from paginated chapter lists
const allChaptersResult = await scraper.scrapeAllChapterPages(url, {
  maxPages: 100,  // Safety limit
  onProgress: (progress) => {
    console.log(`Scraping page ${progress.current}...`);
  },
});

Handling Paginated Chapter Lists

Many websites split chapter lists across multiple pages. The library automatically handles pagination:

import { TruyenFullScraper } from 'story-scraper';

const scraper = new TruyenFullScraper({ delay: 1000 });

// Get ALL chapters across ALL pages (recommended)
const result = await scraper.scrapeAllChapterPages(storyUrl, {
  maxPages: 100,           // Safety limit
  onProgress: (progress) => {
    console.log(`Page ${progress.current} scraped`);
  },
});

if (result.success) {
  console.log(`Total chapters: ${result.data.length}`);
  // result.data contains ALL chapters from ALL pages
}

Parsing Only (No HTTP Requests)

If you already have the HTML and just need to parse it:

import { TruyenFullParser } from 'story-scraper';

const parser = new TruyenFullParser();

// Parse story metadata
const htmlContent = '...'; // HTML string from the story page
const metadataResult = parser.parseMetadata(htmlContent);

if (metadataResult.success && metadataResult.data) {
  console.log('Title:', metadataResult.data.title);
  console.log('Author:', metadataResult.data.author);
  console.log('Description:', metadataResult.data.description);
}

// Parse chapter list
const chapterListResult = parser.parseChapterList(htmlContent);

if (chapterListResult.success && chapterListResult.data) {
  chapterListResult.data.forEach(chapter => {
    console.log(`Chapter ${chapter.chapterNumber}: ${chapter.title}`);
  });
}

// Parse chapter content
const chapterResult = parser.parseChapter(htmlContent);

if (chapterResult.success && chapterResult.data) {
  console.log('Chapter content:', chapterResult.data.content);
}

Using Scraper Registry

Auto-detect and use the right scraper for any supported URL:

import { getScraperByURL, getScraper } from 'story-scraper';

// Auto-detect scraper from URL
const url = 'https://truyenfull.vision/some-story/';
const scraper = getScraperByURL(url);

if (scraper) {
  const result = await scraper.scrapeMetadata(url);
}

// Or get scraper by domain
const scraper2 = getScraper('truyenfull.vision', {
  delay: 500,
  timeout: 20000,
});

Using Parser Registry (Parse-only mode)

import { getParser, isSupported } from 'story-scraper';

// Check if a domain is supported
if (isSupported('truyenfull.vision')) {
  const parser = getParser('truyenfull.vision');

  if (parser) {
    const result = parser.parseMetadata(htmlContent);
    // ...
  }
}

Using with URLs (Parse-only mode)

import { getParserByURL } from 'story-scraper';

const url = 'https://truyenfull.vision/some-story/';
const parser = getParserByURL(url);

if (parser) {
  // Use parser to parse content
}

Creating Custom Parsers

import { BaseParser, ParserResult, StoryMetadata, ChapterListItem, Chapter } from 'story-scraper';

class MyCustomParser extends BaseParser {
  getSupportedDomains(): string[] {
    return ['example.com'];
  }

  parseMetadata(html: string): ParserResult<StoryMetadata> {
    const doc = this.safeParseHTML(html);
    if (!doc) {
      return this.error('Failed to parse HTML');
    }

    // Your custom parsing logic here
    const title = doc.querySelector('.title')?.textContent || '';

    return this.success({
      title,
      author: 'Unknown',
      description: '',
    });
  }

  parseChapterList(html: string): ParserResult<ChapterListItem[]> {
    // Your implementation
    return this.error('Chapter list parsing not implemented');
  }

  parseChapter(html: string): ParserResult<Chapter> {
    // Your implementation
    return this.error('Chapter parsing not implemented');
  }
}

// Register your custom parser
import { defaultRegistry } from 'story-scraper';
defaultRegistry.register(new MyCustomParser());

API Reference

Types

StoryMetadata

interface StoryMetadata {
  title: string;
  author: string;
  description: string;
  coverImage?: string;
  status?: string;
  genres?: string[];
  tags?: string[];
  rating?: number;
  views?: number;
  lastUpdated?: Date;
}

ChapterListItem

interface ChapterListItem {
  chapterNumber: number;
  title: string;
  url?: string;
  id?: string;
  publishedAt?: Date;
}

Chapter

interface Chapter {
  chapterNumber: number;
  title: string;
  content: string;          // HTML format
  contentText?: string;      // Plain text format (NEW in v1.4.0)
  previousChapter?: {
    chapterNumber: number;
    url?: string;
  };
  nextChapter?: {
    chapterNumber: number;
    url?: string;
  };
  publishedAt?: Date;
}
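Since contentText was only added in v1.4.0, data parsed with older versions may not have it. A minimal fallback sketch (the stripTags regex here is a naive illustration; the library itself uses html-to-text for this conversion):

```typescript
// Subset of the Chapter interface relevant to text extraction.
interface ChapterLike {
  content: string;       // HTML
  contentText?: string;  // plain text, when available (v1.4.0+)
}

// Prefer the library-provided plain text; otherwise strip tags naively.
function extractPlainText(chapter: ChapterLike): string {
  if (chapter.contentText !== undefined) return chapter.contentText;
  // Naive fallback: drop tags and collapse whitespace.
  return chapter.content.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}
```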

ParserResult<T>

interface ParserResult<T> {
  success: boolean;
  data?: T;
  error?: string;
}
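Because success is a plain boolean rather than a discriminated union, TypeScript cannot narrow data automatically. A small helper (hypothetical, not part of the library) can centralize the check so callers get a non-optional value:

```typescript
// ParserResult<T> as declared above.
interface ParserResult<T> {
  success: boolean;
  data?: T;
  error?: string;
}

// Hypothetical helper: throw on failure, return the data otherwise.
function unwrapResult<T>(result: ParserResult<T>): T {
  if (!result.success || result.data === undefined) {
    throw new Error(result.error ?? 'Unknown parser error');
  }
  return result.data;
}

// Usage sketch: const metadata = unwrapResult(parser.parseMetadata(html));
```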

ScrapeResult<T>

interface ScrapeResult<T> {
  success: boolean;
  data?: T;
  error?: string;
  statusCode?: number;
  duration?: number;  // Request duration in milliseconds
}

ScraperOptions

interface ScraperOptions {
  headers?: Record<string, string>;
  timeout?: number;          // Request timeout in ms
  delay?: number;            // Delay between requests in ms
  maxRetries?: number;       // Max retry attempts
  userAgent?: string;        // Custom user agent
}
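How delay and maxRetries typically interact can be sketched as a retry loop around the fetch. This is illustrative only, assuming a maxRetries that counts additional attempts; the library's actual retry logic may differ:

```typescript
type FetchFn = (url: string) => Promise<string>;

interface RetryOptions {
  delay?: number;       // ms to wait between attempts
  maxRetries?: number;  // extra attempts after the first failure
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry a fetch up to maxRetries times, sleeping `delay` ms between attempts.
async function fetchWithRetry(
  url: string,
  fetchFn: FetchFn,
  { delay = 0, maxRetries = 3 }: RetryOptions = {},
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fetchFn(url);
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries && delay > 0) await sleep(delay);
    }
  }
  throw lastError;
}
```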

BulkScrapeOptions

interface BulkScrapeOptions extends ScraperOptions {
  onProgress?: (progress: ScrapeProgress) => void;
  concurrency?: number;      // Number of concurrent requests
  continueOnError?: boolean; // Continue scraping on errors
}
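The usage examples above read progress.current and progress.percentage from the onProgress callback; the exact ScrapeProgress shape is an assumption here. A sketch of how such a progress value could be built:

```typescript
// Assumed shape, inferred from the onProgress usage examples above.
interface ScrapeProgress {
  current: number;     // items completed so far
  total: number;       // total items to scrape
  percentage: number;  // 0-100, rounded
}

// Hypothetical helper: derive a ScrapeProgress for an onProgress callback.
function makeProgress(current: number, total: number): ScrapeProgress {
  const percentage = total > 0 ? Math.round((current / total) * 100) : 0;
  return { current, total, percentage };
}
```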

Classes

BaseParser

Abstract base class for all parsers.

Methods:

  • parseMetadata(html: string): ParserResult<StoryMetadata>
  • parseChapterList(html: string): ParserResult<ChapterListItem[]>
  • parseChapter(html: string): ParserResult<Chapter>
  • getSupportedDomains(): string[]
  • canParse(domain: string): boolean

TruyenFullParser

Parser for TruyenFull websites.

TangThuVienParser

Parser for TangThuVien websites.

ParserRegistry

Registry for managing parsers.

Methods:

  • register(parser: BaseParser): void
  • getParser(domain: string): BaseParser | undefined
  • getParserByURL(url: string): BaseParser | undefined
  • isSupported(domain: string): boolean
  • getSupportedDomains(): string[]
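The registry pattern behind these methods can be sketched as a Map keyed by domain, with URL lookups going through the URL's hostname. This mini registry is illustrative only, not the library's actual source:

```typescript
interface DomainParser {
  getSupportedDomains(): string[];
}

class MiniRegistry<P extends DomainParser> {
  private parsers = new Map<string, P>();

  register(parser: P): void {
    for (const domain of parser.getSupportedDomains()) {
      this.parsers.set(domain, parser);
    }
  }

  getParser(domain: string): P | undefined {
    return this.parsers.get(domain);
  }

  getParserByURL(url: string): P | undefined {
    try {
      // Strip a leading "www." so both hostname forms resolve the same way.
      const host = new URL(url).hostname.replace(/^www\./, '');
      return this.parsers.get(host);
    } catch {
      return undefined; // invalid URL
    }
  }

  isSupported(domain: string): boolean {
    return this.parsers.has(domain);
  }

  getSupportedDomains(): string[] {
    return [...this.parsers.keys()];
  }
}
```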

BaseScraper

Base scraper class that combines parsing with HTTP fetching.

Methods:

  • scrapeMetadata(url: string): Promise<ScrapeResult<StoryMetadata>>
  • scrapeChapterList(url: string): Promise<ScrapeResult<ChapterListItem[]>>
  • scrapeChapter(url: string): Promise<ScrapeResult<Chapter>>
  • scrapeStory(url: string): Promise<ScrapeResult<{ metadata: StoryMetadata; chapters: ChapterListItem[] }>>
  • scrapeChapters(urls: string[], options?: BulkScrapeOptions): Promise<ScrapeResult<Chapter[]>>
  • getParser(): BaseParser
  • setOptions(options: Partial<ScraperOptions>): void

TruyenFullScraper

Scraper for TruyenFull websites.

TangThuVienScraper

Scraper for TangThuVien websites.

ScraperRegistry

Registry for managing scrapers.

Methods:

  • register(domains: string[], factory: (options?: ScraperOptions) => BaseScraper): void
  • getScraper(domain: string, options?: ScraperOptions): BaseScraper | undefined
  • getScraperByURL(url: string, options?: ScraperOptions): BaseScraper | undefined
  • isSupported(domain: string): boolean
  • getSupportedDomains(): string[]

Functions

Parser Functions

  • getParser(domain: string): BaseParser | undefined - Get parser for a domain
  • getParserByURL(url: string): BaseParser | undefined - Get parser by URL
  • isSupported(domain: string): boolean - Check if domain is supported
  • getSupportedDomains(): string[] - Get all supported domains

Scraper Functions

  • getScraper(domain: string, options?: ScraperOptions): BaseScraper | undefined - Get scraper for a domain
  • getScraperByURL(url: string, options?: ScraperOptions): BaseScraper | undefined - Get scraper by URL
  • isScrapingSupported(domain: string): boolean - Check if scraping is supported
  • getScrapingSupportedDomains(): string[] - Get all domains that support scraping

Examples

The library includes working examples that demonstrate both parsing and scraping:

Run Examples

# Scraper usage (recommended - shows all scraper features)
npm run example:scraper

# Pagination example (scrape ALL chapters from multi-page lists)
npm run example:pagination

# TruyenFull example (parse-only mode with manual fetching)
npm run example:truyenfull

# TangThuVien example (parse-only mode with manual fetching)
npm run example:tangthuvien

These examples will:

  1. Fetch HTML content from real story websites
  2. Parse metadata, chapter lists, and chapter content
  3. Save the parsed data to tmp/ directory as JSON and HTML files

Example output files:

  • tmp/truyenfull/metadata.json - Parsed story metadata
  • tmp/truyenfull/chapters.json - List of all chapters
  • tmp/truyenfull/chapter-1.json - Parsed chapter content (JSON)
  • tmp/truyenfull/chapter-1.html - Formatted chapter content (HTML)

Development

Setup

# Install dependencies
npm install

# Run type checking
npm run typecheck

# Build the library
npm run build

# Run tests
npm test

# Watch mode for development
npm run dev

Project Structure

src/
├── types/              # TypeScript interfaces and types
│   └── index.ts
├── parsers/            # Parser implementations
│   ├── base/          # Base parser class
│   ├── truyenfull/    # TruyenFull parser
│   ├── tangthuvien/   # TangThuVien parser
│   ├── ParserRegistry.ts
│   └── index.ts
├── utils/             # Utility functions
│   └── dom.ts
└── index.ts           # Main entry point

Contributing

Contributions are welcome! To add support for a new website:

  1. Create a new parser class extending BaseParser
  2. Implement the required methods: parseMetadata, parseChapterList, parseChapter, and getSupportedDomains
  3. Register the parser in src/index.ts
  4. Add tests for your parser
  5. Submit a pull request

License

ISC

Notes

The built-in scrapers fetch pages for you, but every parser also works in parse-only mode: fetch the HTML yourself (e.g., using fetch, axios, etc.) and pass the string to the parser. This design keeps the parsing layer lightweight and gives you full control over how you fetch content.

Example: Complete Workflow

import { getParserByURL } from 'story-scraper';

async function parseStory(url: string) {
  // 1. Fetch HTML (you need to implement this)
  const response = await fetch(url);
  const html = await response.text();

  // 2. Get appropriate parser
  const parser = getParserByURL(url);
  if (!parser) {
    throw new Error('No parser found for this URL');
  }

  // 3. Parse metadata
  const metadataResult = parser.parseMetadata(html);
  if (!metadataResult.success) {
    throw new Error(metadataResult.error);
  }

  // 4. Parse chapter list
  const chaptersResult = parser.parseChapterList(html);
  if (!chaptersResult.success) {
    throw new Error(chaptersResult.error);
  }

  return {
    metadata: metadataResult.data,
    chapters: chaptersResult.data,
  };
}