# paragrafs
A lightweight TypeScript library designed to reconstruct paragraphs from AI transcriptions. It helps format unstructured text with appropriate paragraph breaks, handles timestamps for transcripts, and optimizes for readability.
## Features
- Segment reconstruction – marks filler words, hints, and time gaps to create natural paragraph boundaries and merges overly short segments back into their predecessors.【F:src/transcript.ts†L40-L204】【F:src/transcript.ts†L236-L300】
- Timestamped formatting – produces human-friendly transcripts with optional custom formatting callbacks and automatic timestamp rendering.【F:src/transcript.ts†L212-L300】
- Ground-truth alignment – synchronizes AI-generated tokens with human-edited text, interpolating timings for missing words and removing unknown tokens when applying the ground truth.【F:src/utils/transcriptUtils.ts†L1-L226】【F:src/transcript.ts†L328-L395】
- Selection helpers – exposes utilities to find tokens for string queries or cursor selections, enabling rich text editors to jump to precise timestamps.【F:src/transcript.ts†L424-L493】
- Hint system (Arabic-first) – robust multi-word hint matching using normalization (diacritics/punctuation tolerant), plus hard boundary insertion via `ALWAYS_BREAK`.【F:src/utils/textUtils.ts†L59-L156】【F:src/transcript.ts†L40-L121】
- Auto-hint generation – mines frequent repeated phrases from `Token[]` or `Segment[]` and returns sorted hint candidates for Arabic-heavy transcripts.【F:src/utils/hints.ts†L303-L379】
- Utility toolkit – includes helpers for timestamp formatting, punctuation detection (including Arabic punctuation), ground-truth tokenization, and normalization utilities.【F:src/utils/textUtils.ts†L4-L185】
- Bun-native toolchain – powered by the upstream `tsdown` CLI for bundling and Biome for linting, so the same commands run locally and in CI without any custom wrappers.【F:package.json†L7-L41】【F:tsdown.config.ts†L1-L9】【F:biome.json†L1-L16】
## Breaking changes (recent)
- Hints are normalized by default: `createHints(...)` now uses Arabic-first normalization for matching and mining. If you relied on exact string matching, update your expectations and/or pass explicit normalization options.【F:src/utils/textUtils.ts†L121-L156】
- `ALWAYS_BREAK` is a true hard boundary: segments/lines after an `ALWAYS_BREAK` must not be merged into previous segments.【F:src/transcript.ts†L95-L167】【F:src/transcript.ts†L173-L211】
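To see why normalized matching changes behavior, here is a minimal standalone sketch (not the library's code) of the Arabic-first idea: stripping diacritics and unifying alef variants so that vocalized and plain spellings compare equal.

```ts
// Illustrative sketch only; the library's normalizeTokenText supports more options.
const DIACRITICS = /[\u064B-\u0652\u0670]/g; // tanween, harakat, sukun, dagger alef
const ALEF_VARIANTS = /[\u0622\u0623\u0625]/g; // آ أ إ -> ا

function normalizeArabic(text: string): string {
    return text
        .replace(DIACRITICS, '')
        .replace(ALEF_VARIANTS, '\u0627')
        .replace(/[،؛؟.!,;?]/g, '') // tolerate Arabic and Latin punctuation
        .trim();
}

// With normalization on (the new default), these now match:
console.log(normalizeArabic('أَحْسَنَ') === normalizeArabic('احسن')); // true
```

If you need the old exact-match behavior, pass explicit normalization options to `createHints` instead of relying on the default.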
## Installation

```bash
npm install paragrafs
# or
pnpm install paragrafs
# or
yarn add paragrafs
# or
bun add paragrafs
```

## Usage
### Basic Example
```ts
import { estimateSegmentFromToken, mapSegmentsIntoFormattedSegments } from 'paragrafs';

// Example token from transcription
const token = {
    start: 0,
    end: 5,
    text: 'This is a sample text. It should be properly segmented.',
};

// Estimate a segment with word-level tokens
const segment = estimateSegmentFromToken(token);

// Combine and format segments
const formattedSegments = mapSegmentsIntoFormattedSegments([segment]);
console.log(formattedSegments[0].text);
// Output: "This is a sample text. It should be properly segmented."
```

### Working with Transcriptions
```ts
import {
    markAndCombineSegments,
    mapSegmentsIntoFormattedSegments,
    formatSegmentsToTimestampedTranscript,
} from 'paragrafs';

// Example transcription segments
const segments = [
    {
        start: 0,
        end: 6.5,
        text: 'The quick brown fox!',
        tokens: [
            { start: 0, end: 1, text: 'The' },
            { start: 1, end: 2, text: 'quick' },
            { start: 2, end: 3, text: 'brown' },
            { start: 3, end: 6.5, text: 'fox!' },
        ],
    },
    {
        start: 8,
        end: 13,
        text: 'Jumps right over the',
        tokens: [
            { start: 8, end: 9, text: 'Jumps' },
            { start: 9, end: 10, text: 'right' },
            { start: 10, end: 11, text: 'over' },
            { start: 12, end: 13, text: 'the' },
        ],
    },
];

// Options for segment formatting
const options = {
    fillers: ['uh', 'umm', 'hmmm'],
    gapThreshold: 3,
    maxSecondsPerSegment: 12,
    minWordsPerSegment: 3,
};

// Process the segments
const combinedSegments = markAndCombineSegments(segments, options);
const formattedSegments = mapSegmentsIntoFormattedSegments(combinedSegments);

// Get a timestamped transcript
const transcript = formatSegmentsToTimestampedTranscript(combinedSegments, 10);
console.log(transcript);
// Output:
// 0:00: The quick brown fox!
// 0:08: Jumps right over the
```

### Aligning AI Tokens to Human-Edited Text
```ts
import { updateSegmentWithGroundTruth } from 'paragrafs';

const rawSegment = {
    start: 0,
    end: 10,
    text: 'The Buick crown flock jumps right over the crazy dog.',
    tokens: [
        /* AI-generated word timestamps */
    ],
};

const aligned = updateSegmentWithGroundTruth(rawSegment, 'The quick brown fox jumps right over the lazy dog.');
console.log(aligned.tokens);
// Each token now matches the ground-truth words exactly,
// with missing words interpolated where needed.
```

### Auto-generate hint candidates (Arabic-first)
Use this when you have a corpus of tokens/segments and want to discover repeated phrases like "احسن الله اليكم".
```ts
import { createHints, generateHintsFromTokens, markTokensWithDividers } from 'paragrafs';

const tokens = [
    { start: 0, end: 1, text: 'أَحْسَنَ' },
    { start: 1, end: 2, text: 'الله' },
    { start: 2, end: 3, text: 'إليكم،' },
    // ... repeated in the stream ...
];

const mined = generateHintsFromTokens(tokens, {
    minN: 2,
    maxN: 4,
    minCount: 2,
    dedupe: 'closed',
    normalization: { normalizeAlef: true },
});

// Turn mined phrases into matching hints
const hints = createHints({ normalizeAlef: true }, ...mined.slice(0, 25).map((h) => h.phrase));
const marked = markTokensWithDividers(tokens, { fillers: [], gapThreshold: 999, hints });
```

## Commands
- `bun run build` – compiles the library with the official tsdown pipeline configured in `tsdown.config.ts`.【F:package.json†L33-L41】【F:tsdown.config.ts†L1-L9】
- `bun run lint` – runs Biome’s formatter and linter against the repository root.【F:package.json†L33-L41】【F:biome.json†L1-L16】
- `bun test` – executes the Bun test suite.
- `bun test --coverage` – runs tests with coverage reporting (useful for refactors of segmentation/matching logic).
## Demo app (Svelte + Vite)
This repo includes a minimal static demo app in `demo/` that exercises the major exported functions with configurable JSON/text inputs. It’s intended to be deployed to paragrafs.surge.sh.
Live demo: paragrafs.surge.sh
- Install: `bun run demo:install`
- Dev: `bun run demo:dev`
- Build: `bun run demo:build`
- Deploy to Surge: `bun run demo:deploy`
Notes:
- The demo depends on the local package via `file:..`, so `demo:build` runs `bun run build` first to ensure `dist/` exists.
- Deploy target folder is `demo/dist`.
## API Reference

### Transcript builders
- `estimateSegmentFromToken(token: Token): Segment` – splits multi-word tokens into per-word timings so they can participate in downstream processing.【F:src/transcript.ts†L15-L39】
- `markTokensWithDividers(tokens: Token[], options: MarkTokensWithDividersOptions): MarkedToken[]` – inserts divider markers based on fillers, hints, punctuation, and timing gaps.【F:src/transcript.ts†L44-L121】
- `groupMarkedTokensIntoSegments(markedTokens: MarkedToken[], maxSecondsPerSegment: number): MarkedSegment[]` – chunks marked tokens into bounded-length segments.【F:src/transcript.ts†L123-L171】
- `mergeShortSegmentsWithPrevious(segments: MarkedSegment[], minWordsPerSegment: number): MarkedSegment[]` – merges segments that contain fewer than the required word count into their predecessors.【F:src/transcript.ts†L173-L211】
- `cleanupIsolatedTokens(markedTokens: MarkedToken[]): MarkedToken[]` – removes redundant divider markers that would isolate a single token on a line.【F:src/transcript.ts†L314-L326】
- `markAndCombineSegments(segments: Segment[], options): MarkedSegment[]` – convenience pipeline that flattens tokens, marks dividers, groups, and merges short runs in one call.【F:src/transcript.ts†L302-L326】
- `mapSegmentsIntoFormattedSegments(segments: MarkedSegment[], maxSecondsPerLine?: number): Segment[]` – flattens marked segments into readable text while respecting optional line duration caps.【F:src/transcript.ts†L236-L300】
- `formatSegmentsToTimestampedTranscript(segments: MarkedSegment[], maxSecondsPerLine: number, formatTokens?: (buffer: Token) => string): string` – emits newline-separated transcript lines with timestamps or a custom formatter.【F:src/transcript.ts†L204-L234】
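The `formatTokens` callback receives a buffered `Token` per output line, so a custom formatter is just a function from `Token` to `string`. Below is a hypothetical formatter (not part of the library) that renders bracketed start/end ranges instead of the default `m:ss:` prefix; it is self-contained so you can see exactly what the callback shape implies.

```ts
type Token = { start: number; end: number; text: string };

// Hypothetical custom formatter for formatSegmentsToTimestampedTranscript:
// renders "[m:ss - m:ss] text" instead of the default "m:ss: text".
function bracketFormatter(buffer: Token): string {
    const fmt = (s: number) =>
        `${Math.floor(s / 60)}:${String(Math.floor(s % 60)).padStart(2, '0')}`;
    return `[${fmt(buffer.start)} - ${fmt(buffer.end)}] ${buffer.text}`;
}

console.log(bracketFormatter({ start: 0, end: 6.5, text: 'The quick brown fox!' }));
// → "[0:00 - 0:06] The quick brown fox!"
```

You would pass `bracketFormatter` as the third argument of `formatSegmentsToTimestampedTranscript`.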
### Ground-truth alignment
- `updateSegmentWithGroundTruth(segment: Segment, groundTruth: string): GroundedSegment` – applies LCS-based alignment to replace tokens with the ground-truth words while flagging unmatched entries.【F:src/transcript.ts†L328-L359】
- `applyGroundTruthToSegment(segment: Segment, groundTruth: string): Segment` – wraps `updateSegmentWithGroundTruth` and filters unknown tokens for production-ready output.【F:src/transcript.ts†L361-L395】
- `mergeSegments(segments: Segment[], delimiter?: string): Segment` – concatenates sequential segments into one continuous block, preserving timing.【F:src/transcript.ts†L397-L411】
- `splitSegment(segment: Segment, splitTime: number): Segment[]` – divides a segment into two at a specific timestamp.【F:src/transcript.ts†L413-L448】
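The timing interpolation mentioned above can be pictured with a small standalone sketch (assumed behavior, not the library's exact algorithm): ground-truth words with no matching AI token are spread evenly across the gap between the surrounding matched tokens.

```ts
type Token = { start: number; end: number; text: string };

// Evenly distribute timings for `words` across [gapStart, gapEnd]
// (a simplified stand-in for the library's interpolation step).
function interpolateTokens(words: string[], gapStart: number, gapEnd: number): Token[] {
    const step = (gapEnd - gapStart) / words.length;
    return words.map((text, i) => ({
        start: gapStart + i * step,
        end: gapStart + (i + 1) * step,
        text,
    }));
}

console.log(interpolateTokens(['brown', 'fox'], 2, 4));
// → [{ start: 2, end: 3, text: 'brown' }, { start: 3, end: 4, text: 'fox' }]
```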
### Editor helpers
- `getFirstMatchingToken(tokens: Token[], query: string): Token | null` – scans for the first occurrence of a hint sequence produced by `createHints`.【F:src/transcript.ts†L450-L493】
- `getFirstTokenForSelection(segment: Segment, selectionStart: number, selectionEnd: number): Token | null` – maps character selections within `segment.text` back to the corresponding timed token.【F:src/transcript.ts†L495-L546】
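The selection-to-token mapping can be sketched as follows (illustrative only; it assumes `segment.text` is the token texts joined with single spaces, which the library does not necessarily guarantee):

```ts
type Token = { start: number; end: number; text: string };

// Return the first token whose character span overlaps [selectionStart, selectionEnd).
function firstTokenForSelection(tokens: Token[], selectionStart: number, selectionEnd: number): Token | null {
    let offset = 0;
    for (const token of tokens) {
        const tokenEnd = offset + token.text.length;
        if (selectionStart < tokenEnd && selectionEnd > offset) {
            return token;
        }
        offset = tokenEnd + 1; // account for the joining space
    }
    return null;
}

const tokens = [
    { start: 0, end: 1, text: 'The' },
    { start: 1, end: 2, text: 'quick' },
];
// Selecting characters 4..9 ("quick") resolves to the second token:
console.log(firstTokenForSelection(tokens, 4, 9)?.start); // → 1
```

An editor can then seek the audio player to the returned token's `start` time.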
### Utility functions
- `createHints(first: ArabicNormalizationOptions | string, ...rest: string[]): Hints` – creates normalized hints for robust Arabic matching (diacritics/punctuation tolerant).【F:src/utils/textUtils.ts†L121-L156】
- `formatSecondsToTimestamp(seconds: number): string` – renders numeric durations into `m:ss` or `h:mm:ss` strings.【F:src/utils/textUtils.ts†L14-L33】
- `isEndingWithPunctuation(text: string): boolean` – checks for trailing punctuation, including Arabic variants.【F:src/utils/textUtils.ts†L4-L12】
- `tokenizeGroundTruth(groundTruth: string): string[]` – tokenizes human transcripts while attaching punctuation to the preceding word.【F:src/utils/textUtils.ts†L75-L112】
- `normalizeTokenText(text: string, options?: ArabicNormalizationOptions): string` – Arabic-first normalization used by hint matching and hint mining.【F:src/utils/textUtils.ts†L59-L103】
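The `m:ss` / `h:mm:ss` rendering can be sketched in a few lines (assumed behavior consistent with the `0:00` / `0:08` outputs in the usage examples, not the library's exact source):

```ts
// Sketch of formatSecondsToTimestamp-style rendering:
// under an hour -> "m:ss", an hour or more -> "h:mm:ss".
function toTimestamp(seconds: number): string {
    const s = Math.floor(seconds % 60);
    const m = Math.floor(seconds / 60) % 60;
    const h = Math.floor(seconds / 3600);
    const ss = String(s).padStart(2, '0');
    return h > 0 ? `${h}:${String(m).padStart(2, '0')}:${ss}` : `${m}:${ss}`;
}

console.log(toTimestamp(8));    // "0:08"
console.log(toTimestamp(3725)); // "1:02:05"
```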
### Auto-hint generation
- `generateHintsFromTokens(tokens: Token[], options?: GenerateHintsOptions): GeneratedHint[]` – mines frequent n-grams from a token stream and returns candidates sorted by count/length.【F:src/utils/hints.ts†L303-L331】
- `generateHintsFromSegments(segments: Segment[], options?: GenerateHintsOptions): GeneratedHint[]` – mines frequent n-grams from segments; by default phrases do not cross segment boundaries.【F:src/utils/hints.ts†L333-L379】
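The core n-gram mining idea behind these functions can be sketched as (illustrative only; the library additionally normalizes text, supports `minN`..`maxN` ranges, and deduplicates sub-phrases):

```ts
// Count n-grams of size n over a word list and keep those seen >= minCount times.
function mineNgrams(words: string[], n: number, minCount: number): Map<string, number> {
    const counts = new Map<string, number>();
    for (let i = 0; i + n <= words.length; i++) {
        const phrase = words.slice(i, i + n).join(' ');
        counts.set(phrase, (counts.get(phrase) ?? 0) + 1);
    }
    for (const [phrase, count] of counts) {
        if (count < minCount) counts.delete(phrase);
    }
    return counts;
}

const words = ['احسن', 'الله', 'اليكم', 'و', 'احسن', 'الله', 'اليكم'];
console.log([...mineNgrams(words, 3, 2).keys()]); // → ['احسن الله اليكم']
```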
### Types
```ts
type Token = {
    start: number;
    end: number;
    text: string;
};

type Segment = Token & {
    tokens: Token[];
};

type MarkedToken = Token | typeof SEGMENT_BREAK | typeof ALWAYS_BREAK;

type MarkedSegment = {
    start: number;
    end: number;
    tokens: MarkedToken[];
};

type GroundedToken = Token & { isUnknown?: boolean };

type GroundedSegment = Omit<Segment, 'tokens'> & { tokens: GroundedToken[] };
```

## Use Cases
- Transcript Formatting: Convert raw transcriptions into readable text
- Subtitle Generation: Create properly formatted subtitles from audio transcriptions
- Document Reconstruction: Rebuild properly formatted documents from extracted text
## Contributing
Contributions are welcome! Please make sure your contributions adhere to the coding standards and are accompanied by relevant tests.
To get started:
- Fork the repository
- Install dependencies: `bun install` (requires Bun)
- Make your changes
- Run linting: `bun run lint`
- Build the package: `bun run build`
- Run tests: `bun test --coverage`
- Submit a pull request
## License
paragrafs is released under the MIT License. See the LICENSE.MD file for more details.
## Author
Ragaeeb Haq
Built with TypeScript and Bun. Uses ESM module format.
