stt-post-processor
Post-processing pipeline for Speech-to-Text output — correction, alignment, miscue detection, and oral reading fluency analysis.
Works with any STT engine (Google Cloud STT, OpenAI Whisper, Azure Speech, Deepgram, etc.). Just convert your STT output to SttWord[] and feed it in.
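The only input shape the pipeline needs is word text plus start/end timestamps in seconds. As an inferred sketch of the exported SttWord type (field names taken from the examples below, so treat it as illustrative rather than authoritative):
// Inferred shape of SttWord; the package exports the actual type
type SttWord = {
  word: string;  // recognized word text
  start: number; // start time in seconds
  end: number;   // end time in seconds
};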
Install
pnpm add stt-post-processor
# or
npm install stt-post-processor
Quick Start
import { analyze } from "stt-post-processor";
// Your STT output — any engine works, just map to this shape
const sttWords = [
{ word: "The", start: 0.0, end: 0.3 },
{ word: "cat", start: 0.3, end: 0.6 },
{ word: "set", start: 0.6, end: 0.9 }, // STT heard "set" instead of "sat"
{ word: "on", start: 0.9, end: 1.1 },
{ word: "the", start: 1.1, end: 1.3 },
{ word: "mat", start: 1.3, end: 1.6 },
];
const passage = "The cat sat on the mat";
const result = await analyze(sttWords, passage);
console.log(result.oralFluencyScore); // 83.3
console.log(result.classificationLevel); // "INSTRUCTIONAL"
console.log(result.wordsPerMinute); // 225
console.log(result.miscues); // [{ miscueType: "MISPRONUNCIATION", ... }]
What It Does
The pipeline runs 7 processing layers in sequence:
- Passage-guided correction — Needleman-Wunsch global alignment corrects STT noise when words are similar to the passage
- Edit-distance correction — Fixes single-character typos with unambiguous candidates
- Word alignment — Aligns spoken words against the expected passage using Needleman-Wunsch
- Phonetic correction — Uses the CMU Pronouncing Dictionary (134k words) to catch homophones ("these"→"this", "their"→"there")
- Miscue detection — Classifies errors: omissions, substitutions, mispronunciations, reversals, transpositions, insertions, repetitions, self-corrections
- Behavior detection — Identifies word-by-word reading and punctuation dismissal
- Score computation — Calculates WPM, accuracy, oral fluency score, and classification level
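If you would rather drive individual layers yourself instead of calling analyze, the wiring looks roughly like the sketch below. The argument lists for correctWithPassage, alignWords, and detectMiscues are assumptions based on the layer descriptions above, not the documented signatures, so check the exported types before relying on them.
import { correctWithPassage, alignWords, detectMiscues } from "stt-post-processor";
// Sketch only: argument lists are assumed, and sttWords/passage are the
// same variables as in the Quick Start example above.
const corrected = correctWithPassage(sttWords, passage); // layer 1: passage-guided correction
const aligned = alignWords(corrected, passage);          // layer 3: Needleman-Wunsch alignment
const miscues = detectMiscues(aligned);                  // layer 5: classify omissions, substitutions, etc.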
API
analyze(sttWords, passageText, options?)
The main entry point. Runs the full pipeline and returns an OralFluencyAnalysis.
const result = await analyze(sttWords, passage, {
language: "en", // default: "en", also supports "fil"/"tl" for Tagalog
similarityThreshold: 0.55, // default: 0.55, threshold for passage correction
});
Returns:
{
transcript: string; // Raw transcript from STT words
wordsPerMinute: number; // Reading speed
accuracy: number; // Percentage of exact matches
totalWords: number; // Words in the passage
totalMiscues: number; // Non-self-corrected miscues
duration: number; // Reading duration in seconds
oralFluencyScore: number; // (totalWords - miscues) / totalWords × 100
classificationLevel: "INDEPENDENT" | "INSTRUCTIONAL" | "FRUSTRATION";
miscues: MiscueResult[];
behaviors: BehaviorResult[];
alignedWords: AlignedWord[];
}
Individual Functions
Every layer is exported individually so you can use just the pieces you need:
import {
// Correction
correctWithPassage,
postCorrectTranscription,
phoneticPostCorrection,
// Alignment
alignWords,
// Miscue detection
detectMiscues,
detectRepetitions,
detectSelfCorrections,
detectTranspositions,
// Behavior detection
detectBehaviors,
// Utilities
normalizeWord,
similarityRatio,
editDistance,
soundsSimilar,
initPhoneticDict,
// Types
type SttWord,
type AlignedWord,
type MiscueResult,
type BehaviorResult,
type OralFluencyAnalysis,
} from "stt-post-processor";
Using with Google Cloud STT
import { analyze } from "stt-post-processor";
// Map Google's protobuf word info to SttWord
const sttWords = googleResult.results.flatMap((r) =>
r.alternatives[0].words.map((w) => ({
word: w.word,
start: Number(w.startOffset?.seconds ?? 0) + (w.startOffset?.nanos ?? 0) / 1e9,
end: Number(w.endOffset?.seconds ?? 0) + (w.endOffset?.nanos ?? 0) / 1e9,
}))
);
const result = await analyze(sttWords, passageText);
Using with OpenAI Whisper
import { analyze } from "stt-post-processor";
// Whisper's word-level timestamps map directly
const sttWords = whisperResult.words.map((w) => ({
word: w.word,
start: w.start,
end: w.end,
}));
const result = await analyze(sttWords, passageText);
Adding Pitch Analysis (Monotonous Reading)
Pitch analysis requires audio buffers, so it stays in your app. Pass the pitch coefficient of variation to detectBehaviors:
import { detectBehaviors, alignWords } from "stt-post-processor";
const pitchCoV = analyzePitchInYourApp(audioBuffer); // your pitch analysis
// alignedWords and passageWords are assumed to come from your alignment step (see alignWords)
const behaviors = detectBehaviors(alignedWords, passageWords, pitchCoV);
Classification Levels
| Score  | Level         |
|--------|---------------|
| ≥ 97%  | INDEPENDENT   |
| 90–96% | INSTRUCTIONAL |
| < 90%  | FRUSTRATION   |
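If you need the thresholds outside the pipeline, here is a tiny hypothetical helper restating the table (the library already computes classificationLevel for you, so this is purely illustrative):
// Hypothetical helper: restates the thresholds in the table above.
type Level = "INDEPENDENT" | "INSTRUCTIONAL" | "FRUSTRATION";

function classifyLevel(score: number): Level {
  if (score >= 97) return "INDEPENDENT";
  if (score >= 90) return "INSTRUCTIONAL";
  return "FRUSTRATION";
}

classifyLevel(97.5); // "INDEPENDENT"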
Miscue Types
| Type             | Description                                          |
|------------------|------------------------------------------------------|
| OMISSION         | Student skipped a word                               |
| MISPRONUNCIATION | Similar but not exact (similarity ≥ 0.5)             |
| SUBSTITUTION     | Different word entirely (similarity < 0.5)           |
| REVERSAL         | Letters reversed ("was" → "saw")                     |
| TRANSPOSITION    | Adjacent words swapped ("the cat" → "cat the")       |
| INSERTION        | Student added a word not in the passage              |
| REPETITION       | Student repeated a word or phrase                    |
| SELF_CORRECTION  | Student corrected themselves (not counted as error)  |
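Each entry in result.miscues carries a miscueType with one of the values above (as seen in the Quick Start output), so filtering or tallying by type is straightforward:
// Tally miscues by type from an analyze() result
const counts: Record<string, number> = {};
for (const m of result.miscues) {
  counts[m.miscueType] = (counts[m.miscueType] ?? 0) + 1;
}
console.log(counts); // e.g. { MISPRONUNCIATION: 1 }

// Or pull out a single type
const omissions = result.miscues.filter((m) => m.miscueType === "OMISSION");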
License
MIT
