subtexty
v0.1.0
Extract clean plain-text from subtitle files
Subtexty
Extract clean plain-text from subtitle files with intelligent deduplication and format support.
Overview
Subtexty is a lightweight, open-source CLI tool and TypeScript library that extracts clean, deduplicated plain text from subtitle files. It strips styling tags and timing metadata and removes redundant content while preserving the original text flow.
Features
- 🎯 Smart Text Extraction: Removes timing, positioning, and style tags while preserving content
- 🔄 Intelligent Deduplication: Eliminates redundant lines and prefix duplicates
- 🌐 Multi-Format Support: WebVTT (.vtt), SRT (.srt), TTML (.ttml/.xml), SBV (.sbv), JSON3 (.json/.json3)
- 🔤 Encoding Handling: UTF-8 by default with fallback encoding detection and manual override support
- 📝 Dual Interface: Both CLI tool and programmatic library
- ⚡ Performance: Stream processing for memory efficiency
- 🧪 Well Tested: 80%+ test coverage with comprehensive test suite
Installation
NPM (Global CLI)
npm install -g subtexty
NPM (Project Dependency)
npm install subtexty
Quick Start
CLI Usage
# Extract text to stdout
subtexty input.vtt
# Save to file
subtexty input.srt -o clean-text.txt
# Specify encoding
subtexty input.vtt --encoding utf-8
Library Usage
import { extractText } from 'subtexty';
// Basic extraction
const cleanText = await extractText('subtitles.vtt');
console.log(cleanText);
// With options (a distinct name avoids redeclaring `cleanText`)
const cleanSrtText = await extractText('subtitles.srt', {
  encoding: 'utf-8'
});
CLI Reference
Basic Usage
subtexty [options] <input-file>
Arguments
- `input-file` - Subtitle file to process (required)
Options
- `-v, --version` - Display version number
- `-o, --output <file>` - Output file (default: stdout)
- `--encoding <encoding>` - File encoding (default: utf-8)
- `-h, --help` - Display help for command
Examples
# Basic text extraction
subtexty movie-subtitles.vtt
# Multiple file processing with output
subtexty episode1.srt -o episode1-text.txt
subtexty episode2.srt -o episode2-text.txt
# Handle different encodings
subtexty foreign-film.srt --encoding latin1
# Pipe to other tools
subtexty subtitles.vtt | wc -w # Word count
subtexty subtitles.vtt | grep "keyword" # Search
Exit Codes
- `0` - Success
- `1` - File error (not found, permissions, etc.)
- `2` - Parsing error (invalid format, corrupted data)
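A script that shells out to the CLI might branch on these codes. The sketch below is only an illustration: `ExitCategory` and `classifyExitCode` are hypothetical names, not part of subtexty, and any code other than 0, 1, or 2 is treated as unknown.

```typescript
// Hypothetical helper, not part of subtexty: maps the documented exit codes
// to a category a calling script can branch on.
type ExitCategory = "success" | "file-error" | "parse-error" | "unknown";

function classifyExitCode(code: number | null): ExitCategory {
  switch (code) {
    case 0:
      return "success"; // extraction completed
    case 1:
      return "file-error"; // not found, permissions, etc.
    case 2:
      return "parse-error"; // invalid format, corrupted data
    default:
      return "unknown"; // undocumented code, or null when no code is available
  }
}
```

The `number | null` parameter is chosen so the helper can take the `status` field returned by Node's `child_process.spawnSync` directly, which is `null` when the process was terminated by a signal.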
Library API
extractText(filePath, options?)
Extracts clean text from a subtitle file.
Parameters:
- `filePath` (string) - Path to the subtitle file
- `options` (object, optional) - Extraction options
  - `encoding` (string) - File encoding (default: utf-8)
Returns:
- `Promise<string>` - Clean extracted text
Example:
import { extractText } from 'subtexty';
try {
  const text = await extractText('./subtitles.vtt');
  console.log(text);
} catch (error) {
  console.error('Extraction failed:', error.message);
}
Error Handling
import { extractText, isSubtextyError } from 'subtexty';
try {
  const text = await extractText('file.vtt', { encoding: 'utf-8' });
  // Process text...
} catch (error) {
  if (isSubtextyError(error)) {
    // Handle specific subtexty errors
    switch (error.code) {
      case 'FILE_NOT_FOUND':
        console.error('Subtitle file does not exist');
        break;
      case 'UNSUPPORTED_FORMAT':
        console.error('File format not supported');
        break;
      case 'FILE_NOT_READABLE':
        console.error('Cannot read the file');
        break;
      default:
        console.error('Extraction error:', error.message);
    }
  } else {
    console.error('Unexpected error:', error.message);
  }
}
Supported Formats
| Format | Extensions | Description |
|--------|------------|-------------|
| WebVTT | .vtt | Web Video Text Tracks |
| SRT | .srt | SubRip Subtitle |
| TTML | .ttml, .xml | Timed Text Markup Language |
| SBV | .sbv | YouTube SBV format |
| JSON3 | .json, .json3 | JSON-based subtitle format |
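The table above can be read as an extension-to-format lookup. The sketch below is one way to express it. It assumes detection is driven by the file extension alone, which the README does not confirm, and `SubtitleFormat`, `FORMAT_BY_EXTENSION`, and `detectFormat` are hypothetical names rather than subtexty exports.

```typescript
// Hypothetical lookup built from the Supported Formats table; not subtexty's API.
type SubtitleFormat = "webvtt" | "srt" | "ttml" | "sbv" | "json3";

const FORMAT_BY_EXTENSION: Record<string, SubtitleFormat> = {
  ".vtt": "webvtt",
  ".srt": "srt",
  ".ttml": "ttml",
  ".xml": "ttml",
  ".sbv": "sbv",
  ".json": "json3",
  ".json3": "json3",
};

/** Returns the format implied by the file extension, or undefined if unsupported. */
function detectFormat(filePath: string): SubtitleFormat | undefined {
  // Keep only the file name so dots in directory names are ignored.
  const lastSeparator = Math.max(filePath.lastIndexOf("/"), filePath.lastIndexOf("\\"));
  const name = filePath.slice(lastSeparator + 1);
  const dot = name.lastIndexOf(".");
  if (dot <= 0) return undefined; // no extension, or a dotfile such as ".srt"
  return FORMAT_BY_EXTENSION[name.slice(dot).toLowerCase()];
}
```

Lowercasing the extension lets `MOVIE.VTT` and `movie.vtt` resolve to the same format. A caller could turn an `undefined` result into the "Unsupported file format" error shown under Troubleshooting.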
Text Processing Features
Tag Removal
Removes HTML, XML, and styling tags:
Input: <b>Bold text</b> and <i>italic</i>
Output: Bold text and italic
Entity Conversion
Converts HTML entities:
Input:  Tom &amp; Jerry say &quot;Hello&quot;
Output: Tom & Jerry say "Hello"
Smart Deduplication
Removes redundant content intelligently:
Exact Duplicates:
Input:  Same line
        Same line
        Different line
Output: Same line
        Different line
Prefix Removal:
Input:  I love coding
        I love coding with TypeScript
        Amazing results
Output: I love coding with TypeScript
        Amazing results
Whitespace Normalization
Cleans up spacing issues:
Input:  Multiple    spaces   and    tabs
Output: Multiple spaces and tabs
Development
Prerequisites
- Node.js ≥14.0.0
- pnpm (recommended) or npm
Installation
git clone https://github.com/bytesnack114/subtexty.git
cd subtexty
pnpm install
Development Scripts
# Development
pnpm dev input.vtt # Run CLI in development mode
pnpm build # Build TypeScript
pnpm clean # Clean build artifacts
# Testing
pnpm test # Run test suite
pnpm test:watch # Watch mode testing
pnpm test:coverage # Coverage report
# Code Quality
pnpm lint # Run ESLint
pnpm lint:fix # Fix linting issues
Project Structure
subtexty/
├── src/
│ ├── cli.ts # CLI interface
│ ├── constants.ts # Application constants
│ ├── errors.ts # Custom error classes
│ ├── index.ts # Library entry point
│ ├── validation.ts # Input validation
│ ├── cli/ # CLI-specific modules
│ ├── parsers/ # Format-specific parsers
│ ├── types/ # TypeScript definitions
│ ├── utils/ # Text cleaning utilities
│ └── __tests__/ # Test suite
├── coverage/ # Coverage report (generated by `pnpm test:coverage`)
├── dist/ # Built files (generated by `pnpm build`)
└── example/ # Example input files
Contributing
Quick Contribution Steps
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make changes and add tests
- Run tests with coverage: `pnpm test:coverage`
- Commit changes: `git commit -m 'Add amazing feature'`
- Push to branch: `git push origin feature/amazing-feature`
- Open a Pull Request
Testing
Subtexty's test suite covers 80%+ of the code base:
# Run all tests
pnpm test
# Generate coverage report
pnpm test:coverage
# View coverage report
open coverage/lcov-report/index.html
Test Categories
- Unit Tests: Individual component testing
- Integration Tests: End-to-end workflow testing
- Parser Tests: Format-specific parsing validation
- CLI Tests: Command-line interface testing
Performance
- Memory Efficient: Stream processing for large files
- Fast Processing: Optimized text cleaning pipeline
- Minimal Dependencies: Only essential packages included
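To make the streaming idea concrete, here is a minimal sketch of a cleaning pipeline that keeps only one pending line in memory. It is not subtexty's implementation. `cleanLine` and `dedupe` are hypothetical names, the entity table covers only a handful of common entities, and the sketch assumes that only consecutive lines are compared for duplicates and prefixes.

```typescript
// Hypothetical sketch; none of these names are part of subtexty's public API.
const ENTITIES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
  "&nbsp;": " ",
};

/** Strip tags, decode a few common entities, and collapse runs of whitespace. */
function cleanLine(raw: string): string {
  return raw
    .replace(/<[^>]*>/g, "") // drop HTML/XML/styling tags
    .replace(/&(?:amp|lt|gt|quot|#39|nbsp);/g, (m) => ENTITIES[m] ?? m) // single decoding pass
    .replace(/\s+/g, " ")
    .trim();
}

/**
 * Lazily yields cleaned lines, dropping a line when the next one repeats it
 * or extends it (assumption: only consecutive lines are compared).
 */
function* dedupe(lines: Iterable<string>): Generator<string> {
  let pending: string | undefined;
  for (const raw of lines) {
    const line = cleanLine(raw);
    if (line === "") continue; // skip lines that were only markup or whitespace
    if (pending !== undefined && !line.startsWith(pending)) {
      yield pending; // the new line is unrelated, so the pending one is final
    }
    pending = line; // either the first line, a repeat, or a longer version
  }
  if (pending !== undefined) yield pending;
}
```

Because `dedupe` is a generator, it can consume lines as they arrive rather than loading the whole file. An asynchronous variant over a file stream would follow the same shape with `for await`.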
Troubleshooting
Common Issues
File Not Found Error
Error: Input file not found: subtitle.vtt
Solution: Check the file path and permissions
Unsupported Format
Error: Unsupported file format: .txt
Solution: Use a supported subtitle format (.vtt, .srt, .ttml, .xml, .sbv, .json, .json3)
Encoding Issues
# Specify encoding manually
subtexty file.srt --encoding latin1
Permission Errors
# Check file permissions
ls -la subtitle-file.vtt
chmod +r subtitle-file.vtt
License
MIT License - see LICENSE.md file for details.
Support
- 🐛 Bug Reports: GitHub Issues
- 📧 Email: [email protected]
