paragrafs

A lightweight TypeScript library designed to reconstruct paragraphs from AI transcriptions. It helps format unstructured text with appropriate paragraph breaks, handles timestamps for transcripts, and optimizes for readability.

Features

  • Segment reconstruction – marks filler words, hints, and time gaps to create natural paragraph boundaries and merges overly short segments back into their predecessors.【F:src/transcript.ts†L40-L204】【F:src/transcript.ts†L236-L300】
  • Timestamped formatting – produces human-friendly transcripts with optional custom formatting callbacks and automatic timestamp rendering.【F:src/transcript.ts†L212-L300】
  • Ground-truth alignment – synchronizes AI-generated tokens with human-edited text, interpolating timings for missing words and removing unknown tokens when applying the ground truth.【F:src/utils/transcriptUtils.ts†L1-L226】【F:src/transcript.ts†L328-L395】
  • Selection helpers – exposes utilities to find tokens for string queries or cursor selections, enabling rich text editors to jump to precise timestamps.【F:src/transcript.ts†L424-L493】
  • Hint system (Arabic-first) – robust multi-word hint matching using normalization (diacritics/punctuation tolerant), plus hard boundary insertion via ALWAYS_BREAK.【F:src/utils/textUtils.ts†L59-L156】【F:src/transcript.ts†L40-L121】
  • Auto-hint generation – mines frequent repeated phrases from Token[] or Segment[] and returns sorted hint candidates for Arabic-heavy transcripts.【F:src/utils/hints.ts†L303-L379】
  • Utility toolkit – includes helpers for timestamp formatting, punctuation detection (including Arabic punctuation), ground-truth tokenization, and normalization utilities.【F:src/utils/textUtils.ts†L4-L185】
  • Bun-native toolchain – powered by the upstream tsdown CLI for bundling and Biome for linting, so the same commands run locally and in CI without any custom wrappers.【F:package.json†L7-L41】【F:tsdown.config.ts†L1-L9】【F:biome.json†L1-L16】

Breaking changes (recent)

  • Hints are normalized by default: createHints(...) now uses Arabic-first normalization for matching and mining. If you relied on exact string matching, update your expectations and/or pass explicit normalization options.【F:src/utils/textUtils.ts†L121-L156】
  • ALWAYS_BREAK is a true hard boundary: segments and lines that follow an ALWAYS_BREAK are never merged into preceding segments.【F:src/transcript.ts†L95-L167】【F:src/transcript.ts†L173-L211】
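
A minimal sketch of the new behavior (assuming ALWAYS_BREAK is exported, as the Types section below suggests; the options shape follows the Usage examples):

import { ALWAYS_BREAK, createHints, markTokensWithDividers } from 'paragrafs';

// Matching is normalization-aware by default: the plain-text hint below
// still matches diacritized tokens. Pass explicit options to control this.
const hints = createHints({ normalizeAlef: true }, 'احسن الله اليكم');

const marked = markTokensWithDividers(
    [
        { start: 0, end: 1, text: 'أَحْسَنَ' },
        { start: 1, end: 2, text: 'الله' },
        { start: 2, end: 3, text: 'إليكم،' },
    ],
    { fillers: [], gapThreshold: 999, hints },
);

// Any ALWAYS_BREAK divider in the result is a hard boundary: the tokens
// after it will never be folded back by mergeShortSegmentsWithPrevious.
console.log(marked.includes(ALWAYS_BREAK));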

Installation

npm install paragrafs

or

pnpm add paragrafs

or

yarn add paragrafs

or

bun add paragrafs

Usage

Basic Example

import { estimateSegmentFromToken, mapSegmentsIntoFormattedSegments } from 'paragrafs';

// Example token from transcription
const token = {
    start: 0,
    end: 5,
    text: 'This is a sample text. It should be properly segmented.',
};

// Estimate segment with word-level tokens
const segment = estimateSegmentFromToken(token);

// Combine and format segments
const formattedSegments = mapSegmentsIntoFormattedSegments([segment]);

console.log(formattedSegments[0].text);
// Output: "This is a sample text. It should be properly segmented."

Working with Transcriptions

import {
    markAndCombineSegments,
    mapSegmentsIntoFormattedSegments,
    formatSegmentsToTimestampedTranscript,
} from 'paragrafs';

// Example transcription segments
const segments = [
    {
        start: 0,
        end: 6.5,
        text: 'The quick brown fox!',
        tokens: [
            { start: 0, end: 1, text: 'The' },
            { start: 1, end: 2, text: 'quick' },
            { start: 2, end: 3, text: 'brown' },
            { start: 3, end: 6.5, text: 'fox!' },
        ],
    },
    {
        start: 8,
        end: 13,
        text: 'Jumps right over the',
        tokens: [
            { start: 8, end: 9, text: 'Jumps' },
            { start: 9, end: 10, text: 'right' },
            { start: 10, end: 11, text: 'over' },
            { start: 12, end: 13, text: 'the' },
        ],
    },
];

// Options for segment formatting
const options = {
    fillers: ['uh', 'umm', 'hmmm'],
    gapThreshold: 3,
    maxSecondsPerSegment: 12,
    minWordsPerSegment: 3,
};

// Process the segments
const combinedSegments = markAndCombineSegments(segments, options);
const formattedSegments = mapSegmentsIntoFormattedSegments(combinedSegments);

// Get timestamped transcript
const transcript = formatSegmentsToTimestampedTranscript(combinedSegments, 10);

console.log(transcript);
// Output:
// 0:00: The quick brown fox!
// 0:08: Jumps right over the

Aligning AI Tokens to Human-Edited Text

import { updateSegmentWithGroundTruth } from 'paragrafs';

const rawSegment = {
    start: 0,
    end: 10,
    text: 'The Buick crown flock jumps right over the crazy dog.',
    tokens: [
        /* AI-generated word timestamps */
    ],
};

const aligned = updateSegmentWithGroundTruth(rawSegment, 'The quick brown fox jumps right over the lazy dog.');
console.log(aligned.tokens);
// Each token now matches the ground-truth words exactly,
// with missing words interpolated where needed.
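
When you need production-ready output instead, applyGroundTruthToSegment wraps the same alignment and drops tokens that have no ground-truth counterpart (a sketch reusing rawSegment from above):

import { applyGroundTruthToSegment } from 'paragrafs';

const clean = applyGroundTruthToSegment(rawSegment, 'The quick brown fox jumps right over the lazy dog.');
// clean.tokens holds only tokens aligned to ground-truth words;
// unmatched entries are removed rather than flagged as unknown.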

Auto-generate hint candidates (Arabic-first)

Use this when you have a corpus of tokens/segments and want to discover repeated phrases like "احسن الله اليكم".

import { createHints, generateHintsFromTokens, markTokensWithDividers } from 'paragrafs';

const tokens = [
    { start: 0, end: 1, text: 'أَحْسَنَ' },
    { start: 1, end: 2, text: 'الله' },
    { start: 2, end: 3, text: 'إليكم،' },
    // ... repeated in the stream ...
];

const mined = generateHintsFromTokens(tokens, {
    minN: 2,
    maxN: 4,
    minCount: 2,
    dedupe: 'closed',
    normalization: { normalizeAlef: true },
});

// Turn mined phrases into matching hints
const hints = createHints({ normalizeAlef: true }, ...mined.slice(0, 25).map((h) => h.phrase));

const marked = markTokensWithDividers(tokens, { fillers: [], gapThreshold: 999, hints });

Commands

  • bun run build – compiles the library with the official tsdown pipeline configured in tsdown.config.ts.【F:package.json†L33-L41】【F:tsdown.config.ts†L1-L9】
  • bun run lint – runs Biome’s formatter and linter against the repository root.【F:package.json†L33-L41】【F:biome.json†L1-L16】
  • bun test – executes the Bun test suite.
  • bun test --coverage – runs tests with coverage reporting (useful for refactors of segmentation/matching logic).

Demo app (Svelte + Vite)

This repo includes a minimal static demo app in demo/ that exercises the major exported functions with configurable JSON/text inputs. It’s intended to be deployed to paragrafs.surge.sh.

  • Live demo: paragrafs.surge.sh

  • Install: bun run demo:install

  • Dev: bun run demo:dev

  • Build: bun run demo:build

  • Deploy to Surge: bun run demo:deploy

Notes:

  • The demo depends on the local package via file:.., so demo:build runs bun run build first to ensure dist/ exists.
  • Deploy target folder is demo/dist.

API Reference

Transcript builders

  • estimateSegmentFromToken(token: Token): Segment – splits multi-word tokens into per-word timings so they can participate in downstream processing.【F:src/transcript.ts†L15-L39】
  • markTokensWithDividers(tokens: Token[], options: MarkTokensWithDividersOptions): MarkedToken[] – inserts divider markers based on fillers, hints, punctuation, and timing gaps.【F:src/transcript.ts†L44-L121】
  • groupMarkedTokensIntoSegments(markedTokens: MarkedToken[], maxSecondsPerSegment: number): MarkedSegment[] – chunks marked tokens into bounded-length segments.【F:src/transcript.ts†L123-L171】
  • mergeShortSegmentsWithPrevious(segments: MarkedSegment[], minWordsPerSegment: number): MarkedSegment[] – merges segments that contain fewer than the required word count into their predecessors.【F:src/transcript.ts†L173-L211】
  • cleanupIsolatedTokens(markedTokens: MarkedToken[]): MarkedToken[] – removes redundant divider markers that would isolate a single token on a line.【F:src/transcript.ts†L314-L326】
  • markAndCombineSegments(segments: Segment[], options): MarkedSegment[] – convenience pipeline that flattens tokens, marks dividers, groups, and merges short runs in one call.【F:src/transcript.ts†L302-L326】
  • mapSegmentsIntoFormattedSegments(segments: MarkedSegment[], maxSecondsPerLine?: number): Segment[] – flattens marked segments into readable text while respecting optional line duration caps.【F:src/transcript.ts†L236-L300】
  • formatSegmentsToTimestampedTranscript(segments: MarkedSegment[], maxSecondsPerLine: number, formatTokens?: (buffer: Token) => string): string – emits newline separated transcript lines with timestamps or a custom formatter.【F:src/transcript.ts†L204-L234】
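
markAndCombineSegments is a convenience over the individual builders; a rough sketch of the manual equivalent, reusing the segments and options objects from the Usage section (the library's internal order may differ in detail):

import {
    groupMarkedTokensIntoSegments,
    markTokensWithDividers,
    mergeShortSegmentsWithPrevious,
} from 'paragrafs';

// 1. Flatten each segment's word-level tokens into one stream.
const tokens = segments.flatMap((segment) => segment.tokens);

// 2. Mark divider positions at fillers, punctuation, and timing gaps.
const marked = markTokensWithDividers(tokens, {
    fillers: options.fillers,
    gapThreshold: options.gapThreshold,
});

// 3. Chunk the marked stream into segments of bounded duration.
const grouped = groupMarkedTokensIntoSegments(marked, options.maxSecondsPerSegment);

// 4. Fold segments below the minimum word count into their predecessors.
const combined = mergeShortSegmentsWithPrevious(grouped, options.minWordsPerSegment);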

Ground-truth alignment

  • updateSegmentWithGroundTruth(segment: Segment, groundTruth: string): GroundedSegment – applies LCS-based alignment to replace tokens with the ground-truth words while flagging unmatched entries.【F:src/transcript.ts†L328-L359】
  • applyGroundTruthToSegment(segment: Segment, groundTruth: string): Segment – wraps updateSegmentWithGroundTruth and filters unknown tokens for production-ready output.【F:src/transcript.ts†L361-L395】
  • mergeSegments(segments: Segment[], delimiter?: string): Segment – concatenates sequential segments into one continuous block, preserving timing.【F:src/transcript.ts†L397-L411】
  • splitSegment(segment: Segment, splitTime: number): Segment[] – divides a segment into two at a specific timestamp.【F:src/transcript.ts†L413-L448】
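
A short sketch of how the two compose, reusing the segments array from the Usage section (the delimiter argument is optional):

import { mergeSegments, splitSegment } from 'paragrafs';

// Concatenate both Usage segments into one block spanning 0–13 s...
const merged = mergeSegments(segments, ' ');

// ...then cut it back into two segments at the 8-second gap.
const [firstHalf, secondHalf] = splitSegment(merged, 8);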

Editor helpers

  • getFirstMatchingToken(tokens: Token[], query: string): Token | null – scans for the first occurrence of a hint sequence produced by createHints.【F:src/transcript.ts†L450-L493】
  • getFirstTokenForSelection(segment: Segment, selectionStart: number, selectionEnd: number): Token | null – maps character selections within segment.text back to the corresponding timed token.【F:src/transcript.ts†L495-L546】
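
A sketch of how an editor might wire these up; segment is any Segment with word-level tokens, and player.seekTo is a placeholder for your own playback API:

import { getFirstMatchingToken, getFirstTokenForSelection } from 'paragrafs';

// Jump playback to the first occurrence of a phrase.
const match = getFirstMatchingToken(segment.tokens, 'brown fox');
if (match) {
    player.seekTo(match.start); // hypothetical player API
}

// Map a character selection inside segment.text (here characters 4–9)
// back to the timed token that contains it.
const selected = getFirstTokenForSelection(segment, 4, 9);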

Utility functions

  • createHints(first: ArabicNormalizationOptions | string, ...rest: string[]): Hints – creates normalized hints for robust Arabic matching (diacritics/punctuation tolerant).【F:src/utils/textUtils.ts†L121-L156】
  • formatSecondsToTimestamp(seconds: number): string – renders numeric durations into m:ss or h:mm:ss strings.【F:src/utils/textUtils.ts†L14-L33】
  • isEndingWithPunctuation(text: string): boolean – checks for trailing punctuation, including Arabic variants.【F:src/utils/textUtils.ts†L4-L12】
  • tokenizeGroundTruth(groundTruth: string): string[] – tokenizes human transcripts while attaching punctuation to the preceding word.【F:src/utils/textUtils.ts†L75-L112】
  • normalizeTokenText(text: string, options?: ArabicNormalizationOptions): string – Arabic-first normalization used by hint matching and hint mining.【F:src/utils/textUtils.ts†L59-L103】
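
Illustrative calls; the results in the comments follow the documented behavior but are assumptions, not captured output:

import {
    formatSecondsToTimestamp,
    isEndingWithPunctuation,
    normalizeTokenText,
    tokenizeGroundTruth,
} from 'paragrafs';

formatSecondsToTimestamp(75); // '1:15'
formatSecondsToTimestamp(3725); // '1:02:05'
isEndingWithPunctuation('انتهى؟'); // true – the Arabic question mark counts
tokenizeGroundTruth('Hello, world!'); // ['Hello,', 'world!'] – punctuation attaches to the preceding word
normalizeTokenText('أَحْسَنَ', { normalizeAlef: true }); // 'احسن'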

Auto-hint generation

  • generateHintsFromTokens(tokens: Token[], options?: GenerateHintsOptions): GeneratedHint[] – mines frequent n-grams from a token stream and returns candidates sorted by count/length.【F:src/utils/hints.ts†L303-L331】
  • generateHintsFromSegments(segments: Segment[], options?: GenerateHintsOptions): GeneratedHint[] – mines frequent n-grams from segments; by default phrases do not cross segment boundaries.【F:src/utils/hints.ts†L333-L379】

Types

type Token = {
    start: number;
    end: number;
    text: string;
};

type Segment = Token & {
    tokens: Token[];
};

type MarkedToken = Token | typeof SEGMENT_BREAK | typeof ALWAYS_BREAK;

type MarkedSegment = {
    start: number;
    end: number;
    tokens: MarkedToken[];
};

type GroundedToken = Token & { isUnknown?: boolean };

type GroundedSegment = Omit<Segment, 'tokens'> & { tokens: GroundedToken[] };

Use Cases

  • Transcript Formatting: Convert raw transcriptions into readable text
  • Subtitle Generation: Create properly formatted subtitles from audio transcriptions (see the sketch after this list)
  • Document Reconstruction: Rebuild properly formatted documents from extracted text
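
For the subtitle use case, a sketch that caps line duration and builds simple numbered cues (the cue loop is illustrative glue, not a library feature; segments and options are as in the Usage section):

import {
    formatSecondsToTimestamp,
    mapSegmentsIntoFormattedSegments,
    markAndCombineSegments,
} from 'paragrafs';

const combined = markAndCombineSegments(segments, options);

// Cap each rendered line at roughly 6 seconds so cues stay readable.
const lines = mapSegmentsIntoFormattedSegments(combined, 6);

// Build numbered, timestamped cue strings from the timed segments.
const cues = lines.map(
    (line, i) =>
        `${i + 1}\n${formatSecondsToTimestamp(line.start)} --> ${formatSecondsToTimestamp(line.end)}\n${line.text}`,
);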

Contributing

Contributions are welcome! Please make sure your contributions adhere to the coding standards and are accompanied by relevant tests.

To get started:

  1. Fork the repository
  2. Install dependencies: bun install (requires Bun)
  3. Make your changes
  4. Run linting: bun run lint
  5. Build the package: bun run build
  6. Run tests: bun test --coverage
  7. Submit a pull request

License

paragrafs is released under the MIT License. See the LICENSE.MD file for more details.

Author

Ragaeeb Haq


Built with TypeScript and Bun. Uses ESM module format.