# paragrafs
A lightweight TypeScript library designed to reconstruct paragraphs from AI transcriptions. It helps format unstructured text with appropriate paragraph breaks, handles timestamps for transcripts, and optimizes for readability.
## Features
- Segment reconstruction – marks filler words, hints, and time gaps to create natural paragraph boundaries and merges overly short segments back into their predecessors.【F:src/transcript.ts†L40-L204】【F:src/transcript.ts†L236-L300】
- Timestamped formatting – produces human-friendly transcripts with optional custom formatting callbacks and automatic timestamp rendering.【F:src/transcript.ts†L212-L300】
- Ground-truth alignment – synchronizes AI-generated tokens with human-edited text, interpolating timings for missing words and removing unknown tokens when applying the ground truth.【F:src/utils/transcriptUtils.ts†L1-L226】【F:src/transcript.ts†L328-L395】
- Selection helpers – exposes utilities to find tokens for string queries or cursor selections, enabling rich text editors to jump to precise timestamps.【F:src/transcript.ts†L424-L493】
- Hint system (Arabic-first) – robust multi-word hint matching using normalization (diacritics/punctuation tolerant), plus hard boundary insertion via `ALWAYS_BREAK`.【F:src/utils/textUtils.ts†L59-L156】【F:src/transcript.ts†L40-L121】
- Auto-hint generation – mines frequent repeated phrases from `Token[]` or `Segment[]` and returns sorted hint candidates for Arabic-heavy transcripts.【F:src/utils/hints.ts†L303-L379】
- Utility toolkit – includes helpers for timestamp formatting, punctuation detection (including Arabic punctuation), ground-truth tokenization, and normalization utilities.【F:src/utils/textUtils.ts†L4-L185】
- Bun-native toolchain – powered by the upstream `tsdown` CLI for bundling and Biome for linting, so the same commands run locally and in CI without any custom wrappers.【F:package.json†L7-L41】【F:tsdown.config.ts†L1-L9】【F:biome.json†L1-L16】
## Breaking changes (recent)
- Hints are normalized by default: `createHints(...)` now uses Arabic-first normalization for matching and mining. If you relied on exact string matching, update your expectations and/or pass explicit normalization options.【F:src/utils/textUtils.ts†L121-L156】
- `ALWAYS_BREAK` is a true hard boundary: segments/lines after an `ALWAYS_BREAK` must not be merged into previous segments.【F:src/transcript.ts†L95-L167】【F:src/transcript.ts†L173-L211】
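To see why normalized matching changes behavior, here is a minimal standalone sketch (not the library's code) of the Arabic-first idea: stripping diacritics and unifying alef variants so that vocalized and plain spellings compare equal.

```ts
// Illustrative sketch only; the library's normalizeTokenText supports more options.
const DIACRITICS = /[\u064B-\u0652\u0670]/g; // tanween, harakat, sukun, dagger alef
const ALEF_VARIANTS = /[\u0622\u0623\u0625]/g; // آ أ إ -> ا

function normalizeArabic(text: string): string {
    return text
        .replace(DIACRITICS, '')
        .replace(ALEF_VARIANTS, '\u0627')
        .replace(/[،؛؟.!,;?]/g, '') // tolerate Arabic and Latin punctuation
        .trim();
}

// With normalization on (the new default), these now match:
console.log(normalizeArabic('أَحْسَنَ') === normalizeArabic('احسن')); // true
```

If you need the old exact-match behavior, pass explicit normalization options to `createHints` instead of relying on the default.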
## Installation

```bash
npm install paragrafs
# or
pnpm install paragrafs
# or
yarn add paragrafs
# or
bun add paragrafs
```

## Usage
### Basic Example
```ts
import { estimateSegmentFromToken, mapSegmentsIntoFormattedSegments } from 'paragrafs';

// Example token from transcription
const token = {
    start: 0,
    end: 5,
    text: 'This is a sample text. It should be properly segmented.',
};

// Estimate a segment with word-level tokens
const segment = estimateSegmentFromToken(token);

// Combine and format segments
const formattedSegments = mapSegmentsIntoFormattedSegments([segment]);
console.log(formattedSegments[0].text);
// Output: "This is a sample text. It should be properly segmented."
```

### Working with Transcriptions
```ts
import {
    markAndCombineSegments,
    mapSegmentsIntoFormattedSegments,
    formatSegmentsToTimestampedTranscript,
} from 'paragrafs';

// Example transcription segments
const segments = [
    {
        start: 0,
        end: 6.5,
        text: 'The quick brown fox!',
        tokens: [
            { start: 0, end: 1, text: 'The' },
            { start: 1, end: 2, text: 'quick' },
            { start: 2, end: 3, text: 'brown' },
            { start: 3, end: 6.5, text: 'fox!' },
        ],
    },
    {
        start: 8,
        end: 13,
        text: 'Jumps right over the',
        tokens: [
            { start: 8, end: 9, text: 'Jumps' },
            { start: 9, end: 10, text: 'right' },
            { start: 10, end: 11, text: 'over' },
            { start: 12, end: 13, text: 'the' },
        ],
    },
];

// Options for segment formatting
const options = {
    fillers: ['uh', 'umm', 'hmmm'],
    gapThreshold: 3,
    maxSecondsPerSegment: 12,
    minWordsPerSegment: 3,
};

// Process the segments
const combinedSegments = markAndCombineSegments(segments, options);
const formattedSegments = mapSegmentsIntoFormattedSegments(combinedSegments);

// Get a timestamped transcript
const transcript = formatSegmentsToTimestampedTranscript(combinedSegments, 10);
console.log(transcript);
// Output:
// 0:00: The quick brown fox!
// 0:08: Jumps right over the
```

### Aligning AI Tokens to Human-Edited Text
```ts
import { updateSegmentWithGroundTruth } from 'paragrafs';

const rawSegment = {
    start: 0,
    end: 10,
    text: 'The Buick crown flock jumps right over the crazy dog.',
    tokens: [
        /* AI-generated word timestamps */
    ],
};

const aligned = updateSegmentWithGroundTruth(rawSegment, 'The quick brown fox jumps right over the lazy dog.');
console.log(aligned.tokens);
// Each token now matches the ground-truth words exactly,
// with missing words interpolated where needed.
```

### Auto-generate hint candidates (Arabic-first)
Use this when you have a corpus of tokens/segments and want to discover repeated phrases like "احسن الله اليكم".
```ts
import { createHints, generateHintsFromTokens, markTokensWithDividers } from 'paragrafs';

const tokens = [
    { start: 0, end: 1, text: 'أَحْسَنَ' },
    { start: 1, end: 2, text: 'الله' },
    { start: 2, end: 3, text: 'إليكم،' },
    // ... repeated in the stream ...
];

const mined = generateHintsFromTokens(tokens, {
    minN: 2,
    maxN: 4,
    minCount: 2,
    dedupe: 'closed',
    normalization: { normalizeAlef: true },
});

// Turn mined phrases into matching hints
const hints = createHints({ normalizeAlef: true }, ...mined.slice(0, 25).map((h) => h.phrase));
const marked = markTokensWithDividers(tokens, { fillers: [], gapThreshold: 999, hints });
```

## Commands
- `bun run build` – compiles the library with the official tsdown pipeline configured in `tsdown.config.ts`.【F:package.json†L33-L41】【F:tsdown.config.ts†L1-L9】
- `bun run lint` – runs Biome’s formatter and linter against the repository root.【F:package.json†L33-L41】【F:biome.json†L1-L16】
- `bun test` – executes the Bun test suite.
- `bun test --coverage` – runs tests with coverage reporting (useful for refactors of segmentation/matching logic).
## Demo app (Svelte + Vite)
This repo includes a minimal static demo app in `demo/` that exercises the major exported functions with configurable JSON/text inputs. It’s intended to be deployed to paragrafs.surge.sh.
Live demo: paragrafs.surge.sh
- Install: `bun run demo:install`
- Dev: `bun run demo:dev`
- Build: `bun run demo:build`
- Deploy to Surge: `bun run demo:deploy`
Notes:
- The demo depends on the local package via `file:..`, so `demo:build` runs `bun run build` first to ensure `dist/` exists.
- Deploy target folder is `demo/dist`.
## API Reference

### Transcript builders
- `estimateSegmentFromToken(token: Token): Segment` – splits multi-word tokens into per-word timings so they can participate in downstream processing.【F:src/transcript.ts†L15-L39】
- `markTokensWithDividers(tokens: Token[], options: MarkTokensWithDividersOptions): MarkedToken[]` – inserts divider markers based on fillers, hints, punctuation, and timing gaps.【F:src/transcript.ts†L44-L121】
- `groupMarkedTokensIntoSegments(markedTokens: MarkedToken[], maxSecondsPerSegment: number): MarkedSegment[]` – chunks marked tokens into bounded-length segments.【F:src/transcript.ts†L123-L171】
- `mergeShortSegmentsWithPrevious(segments: MarkedSegment[], minWordsPerSegment: number): MarkedSegment[]` – merges segments that contain fewer than the required word count into their predecessors.【F:src/transcript.ts†L173-L211】
- `cleanupIsolatedTokens(markedTokens: MarkedToken[]): MarkedToken[]` – removes redundant divider markers that would isolate a single token on a line.【F:src/transcript.ts†L314-L326】
- `markAndCombineSegments(segments: Segment[], options): MarkedSegment[]` – convenience pipeline that flattens tokens, marks dividers, groups, and merges short runs in one call.【F:src/transcript.ts†L302-L326】
- `mapSegmentsIntoFormattedSegments(segments: MarkedSegment[], maxSecondsPerLine?: number): Segment[]` – flattens marked segments into readable text while respecting optional line duration caps.【F:src/transcript.ts†L236-L300】
- `formatSegmentsToTimestampedTranscript(segments: MarkedSegment[], maxSecondsPerLine: number, formatTokens?: (buffer: Token) => string): string` – emits newline-separated transcript lines with timestamps or a custom formatter.【F:src/transcript.ts†L204-L234】
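The `formatTokens` callback receives a buffered `Token` per output line, so a custom formatter is just a function from `Token` to `string`. Below is a hypothetical formatter (not part of the library) that renders bracketed start/end ranges instead of the default `m:ss:` prefix; it is self-contained so you can see exactly what the callback shape implies.

```ts
type Token = { start: number; end: number; text: string };

// Hypothetical custom formatter for formatSegmentsToTimestampedTranscript:
// renders "[m:ss - m:ss] text" instead of the default "m:ss: text".
function bracketFormatter(buffer: Token): string {
    const fmt = (s: number) =>
        `${Math.floor(s / 60)}:${String(Math.floor(s % 60)).padStart(2, '0')}`;
    return `[${fmt(buffer.start)} - ${fmt(buffer.end)}] ${buffer.text}`;
}

console.log(bracketFormatter({ start: 0, end: 6.5, text: 'The quick brown fox!' }));
// → "[0:00 - 0:06] The quick brown fox!"
```

You would pass `bracketFormatter` as the third argument of `formatSegmentsToTimestampedTranscript`.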
### Ground-truth alignment
- `updateSegmentWithGroundTruth(segment: Segment, groundTruth: string): GroundedSegment` – applies LCS-based alignment to replace tokens with the ground-truth words while flagging unmatched entries.【F:src/transcript.ts†L328-L359】
- `applyGroundTruthToSegment(segment: Segment, groundTruth: string): Segment` – wraps `updateSegmentWithGroundTruth` and filters unknown tokens for production-ready output.【F:src/transcript.ts†L361-L395】
- `mergeSegments(segments: Segment[], delimiter?: string): Segment` – concatenates sequential segments into one continuous block, preserving timing.【F:src/transcript.ts†L397-L411】
- `splitSegment(segment: Segment, splitTime: number): Segment[]` – divides a segment into two at a specific timestamp.【F:src/transcript.ts†L413-L448】
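The timing interpolation mentioned above can be pictured with a small standalone sketch (assumed behavior, not the library's exact algorithm): ground-truth words with no matching AI token are spread evenly across the gap between the surrounding matched tokens.

```ts
type Token = { start: number; end: number; text: string };

// Evenly distribute timings for `words` across [gapStart, gapEnd]
// (a simplified stand-in for the library's interpolation step).
function interpolateTokens(words: string[], gapStart: number, gapEnd: number): Token[] {
    const step = (gapEnd - gapStart) / words.length;
    return words.map((text, i) => ({
        start: gapStart + i * step,
        end: gapStart + (i + 1) * step,
        text,
    }));
}

console.log(interpolateTokens(['brown', 'fox'], 2, 4));
// → [{ start: 2, end: 3, text: 'brown' }, { start: 3, end: 4, text: 'fox' }]
```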
### Editor helpers
- `getFirstMatchingToken(tokens: Token[], query: string): Token | null` – scans for the first occurrence of a hint sequence produced by `createHints`.【F:src/transcript.ts†L450-L493】
- `getFirstTokenForSelection(segment: Segment, selectionStart: number, selectionEnd: number): Token | null` – maps character selections within `segment.text` back to the corresponding timed token.【F:src/transcript.ts†L495-L546】
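The selection-to-token mapping can be sketched as follows (illustrative only; it assumes `segment.text` is the token texts joined with single spaces, which the library does not necessarily guarantee):

```ts
type Token = { start: number; end: number; text: string };

// Return the first token whose character span overlaps [selectionStart, selectionEnd).
function firstTokenForSelection(tokens: Token[], selectionStart: number, selectionEnd: number): Token | null {
    let offset = 0;
    for (const token of tokens) {
        const tokenEnd = offset + token.text.length;
        if (selectionStart < tokenEnd && selectionEnd > offset) {
            return token;
        }
        offset = tokenEnd + 1; // account for the joining space
    }
    return null;
}

const tokens = [
    { start: 0, end: 1, text: 'The' },
    { start: 1, end: 2, text: 'quick' },
];
// Selecting characters 4..9 ("quick") resolves to the second token:
console.log(firstTokenForSelection(tokens, 4, 9)?.start); // → 1
```

An editor can then seek the audio player to the returned token's `start` time.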
### Utility functions
- `createHints(first: ArabicNormalizationOptions | string, ...rest: string[]): Hints` – creates normalized hints for robust Arabic matching (diacritics/punctuation tolerant).【F:src/utils/textUtils.ts†L121-L156】
- `formatSecondsToTimestamp(seconds: number): string` – renders numeric durations into `m:ss` or `h:mm:ss` strings.【F:src/utils/textUtils.ts†L14-L33】
- `isEndingWithPunctuation(text: string): boolean` – checks for trailing punctuation, including Arabic variants.【F:src/utils/textUtils.ts†L4-L12】
- `tokenizeGroundTruth(groundTruth: string): string[]` – tokenizes human transcripts while attaching punctuation to the preceding word.【F:src/utils/textUtils.ts†L75-L112】
- `normalizeTokenText(text: string, options?: ArabicNormalizationOptions): string` – Arabic-first normalization used by hint matching and hint mining.【F:src/utils/textUtils.ts†L59-L103】
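The `m:ss` / `h:mm:ss` rendering can be sketched in a few lines (assumed behavior consistent with the `0:00` / `0:08` outputs in the usage examples, not the library's exact source):

```ts
// Sketch of formatSecondsToTimestamp-style rendering:
// under an hour -> "m:ss", an hour or more -> "h:mm:ss".
function toTimestamp(seconds: number): string {
    const s = Math.floor(seconds % 60);
    const m = Math.floor(seconds / 60) % 60;
    const h = Math.floor(seconds / 3600);
    const ss = String(s).padStart(2, '0');
    return h > 0 ? `${h}:${String(m).padStart(2, '0')}:${ss}` : `${m}:${ss}`;
}

console.log(toTimestamp(8));    // "0:08"
console.log(toTimestamp(3725)); // "1:02:05"
```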
### Auto-hint generation
- `generateHintsFromTokens(tokens: Token[], options?: GenerateHintsOptions): GeneratedHint[]` – mines frequent n-grams from a token stream and returns candidates sorted by count/length.【F:src/utils/hints.ts†L303-L331】
- `generateHintsFromSegments(segments: Segment[], options?: GenerateHintsOptions): GeneratedHint[]` – mines frequent n-grams from segments; by default phrases do not cross segment boundaries.【F:src/utils/hints.ts†L333-L379】
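The core n-gram mining idea behind these functions can be sketched as (illustrative only; the library additionally normalizes text, supports `minN`..`maxN` ranges, and deduplicates sub-phrases):

```ts
// Count n-grams of size n over a word list and keep those seen >= minCount times.
function mineNgrams(words: string[], n: number, minCount: number): Map<string, number> {
    const counts = new Map<string, number>();
    for (let i = 0; i + n <= words.length; i++) {
        const phrase = words.slice(i, i + n).join(' ');
        counts.set(phrase, (counts.get(phrase) ?? 0) + 1);
    }
    for (const [phrase, count] of counts) {
        if (count < minCount) counts.delete(phrase);
    }
    return counts;
}

const words = ['احسن', 'الله', 'اليكم', 'و', 'احسن', 'الله', 'اليكم'];
console.log([...mineNgrams(words, 3, 2).keys()]); // → ['احسن الله اليكم']
```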
### Types
```ts
type Token = {
    start: number;
    end: number;
    text: string;
};

type Segment = Token & {
    tokens: Token[];
};

type MarkedToken = Token | typeof SEGMENT_BREAK | typeof ALWAYS_BREAK;

type MarkedSegment = {
    start: number;
    end: number;
    tokens: MarkedToken[];
};

type GroundedToken = Token & { isUnknown?: boolean };

type GroundedSegment = Omit<Segment, 'tokens'> & { tokens: GroundedToken[] };
```

## Use Cases
- Transcript Formatting: Convert raw transcriptions into readable text
- Subtitle Generation: Create properly formatted subtitles from audio transcriptions
- Document Reconstruction: Rebuild properly formatted documents from extracted text
## Contributing
Contributions are welcome! Please make sure your contributions adhere to the coding standards and are accompanied by relevant tests.
To get started:
- Fork the repository
- Install dependencies: `bun install` (requires Bun)
- Make your changes
- Run linting: `bun run lint`
- Build the package: `bun run build`
- Run tests: `bun test --coverage`
- Submit a pull request
## License
paragrafs is released under the MIT License. See the LICENSE.MD file for more details.
## Author
Ragaeeb Haq
Built with TypeScript and Bun. Uses ESM module format.
