subtexty

v0.1.0

Published

7 months ago

Extract clean plain-text from subtitle files

bytesnack114

subtitle caption text-extraction utility cli vtt srt ttml sbv json3

Subtexty

Extract clean plain-text from subtitle files with intelligent deduplication and multi-format support.

Overview

Subtexty is a lightweight, open-source CLI tool and TypeScript library that extracts clean, deduplicated plain-text from subtitle files. It strips styling tags and timing metadata and removes redundant content while preserving the original text flow.

Features

🎯 Smart Text Extraction: Removes timing, positioning, and style tags while preserving content
🔄 Intelligent Deduplication: Eliminates redundant lines and prefix duplicates
🌐 Multi-Format Support: WebVTT (.vtt), SRT (.srt), TTML (.ttml/.xml), SBV (.sbv), JSON3 (.json/.json3)
🔤 Encoding Handling: UTF-8 by default with fallback encoding detection and manual override support
📝 Dual Interface: Both CLI tool and programmatic library
⚡ Performance: Stream processing for memory efficiency
🧪 Well Tested: 80%+ test coverage with comprehensive test suite

Installation

NPM (Global CLI)

npm install -g subtexty

NPM (Project Dependency)

npm install subtexty

Quick Start

CLI Usage

# Extract text to stdout
subtexty input.vtt

# Save to file
subtexty input.srt -o clean-text.txt

# Specify encoding
subtexty input.vtt --encoding utf-8

Library Usage

import { extractText } from 'subtexty';

// Basic extraction
const cleanText = await extractText('subtitles.vtt');
console.log(cleanText);

// With options (a distinct name avoids redeclaring `cleanText`)
const cleanTextWithOptions = await extractText('subtitles.srt', {
  encoding: 'utf-8'
});

CLI Reference

Basic Usage

subtexty [options] <input-file>

Arguments

input-file - Subtitle file to process (required)

Options

-v, --version - Display version number
-o, --output <file> - Output file (default: stdout)
--encoding <encoding> - File encoding (default: utf-8)
-h, --help - Display help for command

Examples

# Basic text extraction
subtexty movie-subtitles.vtt

# Multiple file processing with output
subtexty episode1.srt -o episode1-text.txt
subtexty episode2.srt -o episode2-text.txt

# Handle different encodings
subtexty foreign-film.srt --encoding latin1

# Pipe to other tools
subtexty subtitles.vtt | wc -w  # Word count
subtexty subtitles.vtt | grep "keyword"  # Search

Exit Codes

0 - Success
1 - File error (not found, permissions, etc.)
2 - Parsing error (invalid format, corrupted data)
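
The exit codes above make it possible to script around the CLI from Node. The following is a minimal sketch, assuming `subtexty` is installed globally and available on the PATH; `describeExitCode` and `runSubtexty` are hypothetical helper names and not part of the package.

```typescript
import { spawnSync } from "node:child_process";

// Maps the documented exit codes to a readable label.
// Illustrative helper only; not exported by subtexty.
function describeExitCode(code: number | null): string {
  switch (code) {
    case 0:
      return "success";
    case 1:
      return "file error";
    case 2:
      return "parsing error";
    default:
      // Covers undocumented codes and `null` (process did not exit normally).
      return "unknown";
  }
}

// Runs the CLI on one file and reports how it exited.
// Assumption: the `subtexty` binary is on the PATH.
function runSubtexty(inputFile: string): { text: string; status: string } {
  const result = spawnSync("subtexty", [inputFile], { encoding: "utf-8" });
  return {
    text: result.stdout ?? "",
    status: describeExitCode(result.status),
  };
}
```

A caller could then branch on `status` to decide whether to retry with a different `--encoding` or skip the file.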

Library API

`extractText(filePath, options?)`

Extracts clean text from a subtitle file.

Parameters:

filePath (string) - Path to the subtitle file
options (object, optional) - Extraction options
- encoding (string) - File encoding (default: utf-8)

Returns:

Promise<string> - Clean extracted text

Example:

import { extractText } from 'subtexty';

try {
  const text = await extractText('./subtitles.vtt');
  console.log(text);
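} catch (error) {
  // `error` is `unknown` under strict TypeScript settings, so narrow it first
  const message = error instanceof Error ? error.message : String(error);
  console.error('Extraction failed:', message);
}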

Error Handling

import { extractText, isSubtextyError } from 'subtexty';

try {
  const text = await extractText('file.vtt', { encoding: 'utf-8' });
  // Process text...
} catch (error) {
  if (isSubtextyError(error)) {
    // Handle specific subtexty errors
    switch (error.code) {
      case 'FILE_NOT_FOUND':
        console.error('Subtitle file does not exist');
        break;
      case 'UNSUPPORTED_FORMAT':
        console.error('File format not supported');
        break;
      case 'FILE_NOT_READABLE':
        console.error('Cannot read the file');
        break;
      default:
        console.error('Extraction error:', error.message);
    }
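  } else {
    // Not a subtexty error: narrow `unknown` before reading `.message`
    console.error('Unexpected error:', error instanceof Error ? error.message : String(error));
  }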
}

Supported Formats

| Format | Extensions | Description |
|--------|------------|-------------|
| WebVTT | .vtt | Web Video Text Tracks |
| SRT | .srt | SubRip Subtitle |
| TTML | .ttml, .xml | Timed Text Markup Language |
| SBV | .sbv | YouTube SBV format |
| JSON3 | .json, .json3 | JSON-based subtitle format |
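
The extension-to-format mapping in the table can be expressed as a small lookup. This is only a sketch: how subtexty actually selects a parser is not described here, and `SubtitleFormat`, `FORMAT_BY_EXTENSION`, and `detectFormat` are hypothetical names introduced for illustration.

```typescript
import { extname } from "node:path";

// Format names taken from the table above.
type SubtitleFormat = "WebVTT" | "SRT" | "TTML" | "SBV" | "JSON3";

// Assumption: the format is chosen purely from the file extension.
const FORMAT_BY_EXTENSION: Record<string, SubtitleFormat> = {
  ".vtt": "WebVTT",
  ".srt": "SRT",
  ".ttml": "TTML",
  ".xml": "TTML",
  ".sbv": "SBV",
  ".json": "JSON3",
  ".json3": "JSON3",
};

// Returns the format for a path, or undefined for unsupported extensions.
function detectFormat(filePath: string): SubtitleFormat | undefined {
  // Lowercase the extension so that "MOVIE.VTT" is treated like "movie.vtt".
  return FORMAT_BY_EXTENSION[extname(filePath).toLowerCase()];
}
```

Checking a path with a helper like this before calling `extractText` is one way to filter a directory listing down to files worth processing.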

Text Processing Features

Tag Removal

Removes HTML, XML, and styling tags:

Input:  <b>Bold text</b> and <i>italic</i>
Output: Bold text and italic
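
For readers curious how this kind of cleanup can work, here is a minimal regex-based sketch. It is not subtexty's actual implementation, and `stripTags` is a hypothetical name; a simple pattern like this does not handle a literal `>` inside an attribute value.

```typescript
// Removes anything that looks like an HTML/XML tag: "<", any run of
// characters other than ">", then ">".
// Illustrative only; not the library's real tag handling.
function stripTags(line: string): string {
  return line.replace(/<[^>]*>/g, "");
}
```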

Entity Conversion

Converts HTML entities:

Input:  Tom &amp; Jerry say &quot;Hello&quot;
Output: Tom & Jerry say "Hello"
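
Entity conversion of this kind can be sketched with a single replace pass. The entity table below is a deliberately small assumption covering only the most common names plus numeric references; `NAMED_ENTITIES` and `decodeEntities` are hypothetical names, and the library's real coverage may differ.

```typescript
// Small, illustrative subset of named entities (an assumption, not the
// library's actual table).
const NAMED_ENTITIES: Record<string, string> = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
};

// Decodes named (&amp;), decimal (&#39;) and hex (&#x27;) references in one
// pass, so already-decoded output is never decoded a second time.
function decodeEntities(text: string): string {
  return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match: string, body: string) => {
    if (body.startsWith("#")) {
      const isHex = body.startsWith("#x");
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are left as-is.
    return NAMED_ENTITIES[body] ?? match;
  });
}
```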

Smart Deduplication

Removes redundant content intelligently:

Exact Duplicates:

Input:  Same line
        Same line
        Different line
Output: Same line
        Different line

Prefix Removal:

Input:  I love coding
        I love coding with TypeScript
        Amazing results
Output: I love coding with TypeScript
        Amazing results
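
Both rules can be illustrated with one pass over the lines. This sketch assumes that only adjacent lines are compared, which matches the examples above but may not match the library's exact behavior; `dedupeLines` is a hypothetical name.

```typescript
// Drops a line when it repeats the previously kept line, and replaces the
// previously kept line when the new line extends it (prefix removal).
// Assumption: only neighbouring lines are compared.
function dedupeLines(lines: string[]): string[] {
  const kept: string[] = [];
  for (const line of lines) {
    const previous = kept[kept.length - 1];
    if (previous === undefined) {
      kept.push(line); // first line is always kept
    } else if (line === previous) {
      continue; // exact duplicate of the previous line
    } else if (line.startsWith(previous)) {
      kept[kept.length - 1] = line; // previous was a prefix; keep the longer line
    } else {
      kept.push(line);
    }
  }
  return kept;
}
```

Prefix duplicates of this shape are common in auto-generated captions, where each cue repeats the previous cue and appends a few words.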

Whitespace Normalization

Cleans up spacing issues:

Input:  Multiple   spaces    and	tabs
Output: Multiple spaces and tabs
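
A normalization step like this can be sketched in one line. `normalizeWhitespace` is a hypothetical name, and trimming the ends of the line is an added assumption not shown in the example above.

```typescript
// Collapses every run of whitespace (spaces, tabs, newlines) into a single
// space and trims the ends. Illustrative only.
function normalizeWhitespace(line: string): string {
  return line.replace(/\s+/g, " ").trim();
}
```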

Development

Prerequisites

Node.js ≥14.0.0
pnpm (recommended) or npm

Installation

git clone https://github.com/bytesnack114/subtexty.git
cd subtexty
pnpm install

Development Scripts

# Development
pnpm dev input.vtt              # Run CLI in development mode
pnpm build                      # Build TypeScript
pnpm clean                      # Clean build artifacts

# Testing
pnpm test                       # Run test suite
pnpm test:watch                 # Watch mode testing
pnpm test:coverage              # Coverage report

# Code Quality
pnpm lint                       # Run ESLint
pnpm lint:fix                   # Fix linting issues

Project Structure

subtexty/
├── src/
│   ├── cli.ts              # CLI interface
│   ├── constants.ts        # Application constants
│   ├── errors.ts           # Custom error classes
│   ├── index.ts            # Library entry point
│   ├── validation.ts       # Input validation
│   ├── cli/                # CLI-specific modules
│   ├── parsers/            # Format-specific parsers
│   ├── types/              # TypeScript definitions
│   ├── utils/              # Text cleaning utilities
│   └── __tests__/          # Test suite
├── coverage/               # Coverage report (generated by `pnpm test:coverage`)
├── dist/                   # Built files (generated by `pnpm build`)
└── example/                # Example input files

Contributing

Quick Contribution Steps

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Make changes and add tests
Run tests with coverage: pnpm test:coverage
Commit changes: git commit -m 'Add amazing feature'
Push to branch: git push origin feature/amazing-feature
Open a Pull Request

Testing

Subtexty's test suite covers more than 80% of the code:

# Run all tests
pnpm test

# Generate coverage report
pnpm test:coverage

# View coverage report
open coverage/lcov-report/index.html

Test Categories

Unit Tests: Individual component testing
Integration Tests: End-to-end workflow testing
Parser Tests: Format-specific parsing validation
CLI Tests: Command-line interface testing

Performance

Memory Efficient: Stream processing for large files
Fast Processing: Optimized text cleaning pipeline
Minimal Dependencies: Only essential packages included

Troubleshooting

Common Issues

File Not Found Error

Error: Input file not found: subtitle.vtt

Solution: Check file path and permissions

Unsupported Format

Error: Unsupported file format: .txt

Solution: Use a supported subtitle format (.vtt, .srt, .ttml, .xml, .sbv, .json, .json3)

Encoding Issues

# Specify encoding manually
subtexty file.srt --encoding latin1
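
When the right encoding is not known in advance, one possible approach is to try strict UTF-8 first and fall back to latin1 only if the bytes are not valid UTF-8. How subtexty's own fallback detection works is not documented here, so the sketch below is an assumption rather than the library's method; `decodeWithFallback` is a hypothetical helper name.

```typescript
// Hypothetical helper: decode raw subtitle bytes, preferring strict UTF-8.
function decodeWithFallback(bytes: Uint8Array): string {
  try {
    // `fatal: true` makes the decoder throw on malformed UTF-8 instead of
    // silently inserting replacement characters.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // latin1 maps every byte to exactly one character, so this cannot fail.
    return Buffer.from(bytes).toString("latin1");
  }
}
```

If the fallback branch is taken for a file, passing `--encoding latin1` to the CLI for that file is a reasonable next step.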

Permission Errors

# Check file permissions
ls -la subtitle-file.vtt
chmod +r subtitle-file.vtt

License

MIT License - see LICENSE.md file for details.

Support

🐛 Bug Reports: GitHub Issues
📧 Email: [email protected]
