spoken-number-normalizer
v1.0.0
A high-performance, matcher-based normalization engine for converting spoken or written number expressions into structured numeric values — with **first-class support for the Indian numbering system**.
Designed for Speech-to-Text (STT) pipelines, streaming workloads, and large-scale data processing.
📚 Table of Contents
- Introduction
- Features
- Installation
- Quick Start
- API Reference
- Streaming Usage
- Architecture Overview
- Why Not Regex?
- Examples
- Roadmap
- Troubleshooting
- License
Introduction
spoken-number-normalizer is a deterministic normalization engine built specifically for converting spoken numeric expressions into structured numeric values, with native support for Indian units such as lakh and crore.
Unlike regex-heavy solutions, this library is designed to be stream-safe, confidence-aware, and resilient to noisy STT output—making it suitable for production voice systems.
✨ Features
Indian Number System Support
- Native handling of units like lakh and crore
- Example: "one crore two lakh five" → 10200005
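The arithmetic behind this example can be written out explicitly. The snippet below is only a worked illustration of the standard Indian place values (1 lakh = 100,000 and 1 crore = 10,000,000); the constant names are chosen for the example and are not taken from the library.

```typescript
// Standard Indian place values (illustrative constants, not library exports).
const LAKH = 100000;    // 1,00,000
const CRORE = 10000000; // 1,00,00,000

// "one crore two lakh five" splits into three additive groups.
const exampleValue = 1 * CRORE + 2 * LAKH + 5;

console.log(exampleValue); // 10200005
```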
Matcher-Based Architecture
- Deterministic matchers instead of brittle regular expressions
- Better handling of nested and long numeric expressions
Confidence Scoring
- Each normalization returns a confidence score
- Useful for rejecting or flagging low-quality STT transcripts
Performance First
- Tree-shakable ESM and CJS builds
- Zero runtime dependencies
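The confidence score described under Confidence Scoring can drive a simple accept-or-review gate. The sketch below assumes a result object shaped like the `NormalizationResult` interface documented in this README; the `decide` helper, the `Decision` type, and the 0.9 default threshold are hypothetical choices for illustration, not part of the library.

```typescript
// Mirrors the NormalizationResult interface documented in this README.
interface NormalizationResult {
  input: string;
  output?: number;
  confidence: number;
  error?: { code: string; message: string };
}

type Decision =
  | { kind: "accept"; value: number }
  | { kind: "review"; reason: string };

// Hypothetical helper: the 0.9 default is an arbitrary illustrative threshold.
function decide(result: NormalizationResult, minConfidence = 0.9): Decision {
  if (result.error) {
    return { kind: "review", reason: `${result.error.code}: ${result.error.message}` };
  }
  if (result.output === undefined) {
    return { kind: "review", reason: "no numeric output" };
  }
  if (result.confidence < minConfidence) {
    return { kind: "review", reason: `low confidence (${result.confidence})` };
  }
  return { kind: "accept", value: result.output };
}

// Sample values copied from the Quick Start output in this README.
const sample: NormalizationResult = {
  input: "one crore two lakh five",
  output: 10200005,
  confidence: 0.97,
};
console.log(decide(sample)); // accepted: 0.97 clears the 0.9 threshold
```

In a real pipeline the `result` argument would come from `normalizeNumber(...)`, and the threshold would be tuned against transcripts from the actual STT system.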
📦 Installation
npm install spoken-number-normalizer
🚀 Quick Start
JavaScript
import { normalizeNumber } from "spoken-number-normalizer";

const result = normalizeNumber("one crore two lakh five");
console.log(result);
/*
{
  output: 10200005,
  confidence: 0.97
}
*/
🧠 API
normalizeNumber(input: string)
Normalizes a spoken number string into a structured object.
Interface
interface NormalizationResult {
  input: string;
  output?: number;
  confidence: number;
  error?: {
    code: string;
    message: string;
  };
}
🧩 Architecture Overview
The engine follows a linear pipeline to ensure speed and predictability:
1. Unicode Normalization: Standardizes character encoding.
2. Text Cleanup: Removes fillers and linguistic noise.
3. Scanner: Tokenizes the string using a dictionary-based approach.
4. Matcher Engine
   - Number Matcher (active)
   - Currency Matcher (in development)
5. Best Match Selection: Weights competing parses based on confidence.
6. Structured Output: Returns the final numeric value.
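The stages listed above can be sketched roughly as follows. This is a toy illustration, not the library's source: the function names (`cleanup`, `scan`, `matchNumber`, `normalizeSketch`), the word dictionaries, and the filler list are all assumptions made for the example. Best match selection and confidence scoring are left out because only one matcher is sketched, and the matcher does no grammar validation.

```typescript
// A scanned token is a plain value word, "hundred", or a scale word.
type Token =
  | { kind: "value"; value: number }
  | { kind: "hundred" }
  | { kind: "multiplier"; value: number };

// Illustrative dictionaries (assumed, not the library's actual tables).
const VALUES = new Map<string, number>(Object.entries({
  zero: 0, one: 1, two: 2, three: 3, four: 4, five: 5, six: 6, seven: 7,
  eight: 8, nine: 9, ten: 10, eleven: 11, twelve: 12, thirteen: 13,
  fourteen: 14, fifteen: 15, sixteen: 16, seventeen: 17, eighteen: 18,
  nineteen: 19, twenty: 20, thirty: 30, forty: 40, fifty: 50, sixty: 60,
  seventy: 70, eighty: 80, ninety: 90,
}));
const MULTIPLIERS = new Map<string, number>(Object.entries({
  thousand: 1000, lakh: 100000, crore: 10000000,
}));
const FILLERS = new Set(["and", "uh", "um"]); // assumed filler list

// Stages 1-2: Unicode normalization, then cleanup into bare lowercase words.
function cleanup(input: string): string[] {
  return input
    .normalize("NFKC")
    .toLowerCase()
    .replace(/[^a-z\s]/g, " ") // hyphens and punctuation become spaces
    .split(/\s+/)
    .filter((word) => word !== "" && !FILLERS.has(word));
}

// Stage 3: dictionary-based scanner. Returns null on any unknown word.
function scan(words: string[]): Token[] | null {
  const tokens: Token[] = [];
  for (const word of words) {
    const value = VALUES.get(word);
    const multiplier = MULTIPLIERS.get(word);
    if (value !== undefined) tokens.push({ kind: "value", value });
    else if (word === "hundred") tokens.push({ kind: "hundred" });
    else if (multiplier !== undefined) tokens.push({ kind: "multiplier", value: multiplier });
    else return null;
  }
  return tokens;
}

// Stage 4: number matcher. Builds a group, then closes it at each scale word.
function matchNumber(tokens: Token[]): number | null {
  if (tokens.length === 0) return null;
  let total = 0; // groups already closed by thousand / lakh / crore
  let group = 0; // group currently being built
  for (const token of tokens) {
    if (token.kind === "value") {
      group += token.value;
    } else if (token.kind === "hundred") {
      group = (group === 0 ? 1 : group) * 100;
    } else {
      total += (group === 0 ? 1 : group) * token.value;
      group = 0;
    }
  }
  return total + group;
}

// Stage 6: structured output, reduced here to a number or null.
function normalizeSketch(input: string): number | null {
  const tokens = scan(cleanup(input));
  return tokens === null ? null : matchNumber(tokens);
}

console.log(normalizeSketch("one crore two lakh five")); // 10200005
console.log(normalizeSketch("One hundred two crore"));   // 1020000000
```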
🤔 Why not regex?
Regex-based normalization solutions often:
- Break on long or complex numeric strings
- Fail to express grammar or handle nested units (e.g. "one hundred two crore")
- Are not stream-safe, requiring the full string to be present in memory
- Cannot provide confidence scores for noisy STT data
This library uses deterministic matchers, making it robust enough for production-grade voice applications.
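To make the stream-safety point concrete, the sketch below consumes a transcript chunk by chunk, keeping only two running sums and the unfinished last word instead of the full string. It is a hypothetical illustration: `StreamingNumber`, `write`, `end`, and the tiny dictionaries are names invented for this example, and the library's own streaming interface may look quite different.

```typescript
// Tiny illustrative dictionaries; a real scanner would cover the full vocabulary.
const STREAM_VALUES = new Map<string, number>(Object.entries({
  one: 1, two: 2, three: 3, four: 4, five: 5,
}));
const STREAM_SCALES = new Map<string, number>(Object.entries({
  thousand: 1000, lakh: 100000, crore: 10000000,
}));

// Hypothetical class, not part of the library's documented API.
class StreamingNumber {
  private total = 0;    // groups already closed by a scale word
  private group = 0;    // group currently being spoken
  private partial = ""; // tail of a word that was split across two chunks

  // Accept the next chunk; only the unfinished last word is buffered.
  write(chunk: string): void {
    const pieces = (this.partial + chunk).toLowerCase().split(/\s+/);
    this.partial = pieces.pop() ?? "";
    for (const word of pieces) this.consume(word);
  }

  // Flush the buffered word and return the accumulated value.
  end(): number {
    this.consume(this.partial);
    this.partial = "";
    return this.total + this.group;
  }

  private consume(word: string): void {
    const value = STREAM_VALUES.get(word);
    const scale = STREAM_SCALES.get(word);
    if (value !== undefined) {
      this.group += value;
    } else if (word === "hundred") {
      this.group = (this.group === 0 ? 1 : this.group) * 100;
    } else if (scale !== undefined) {
      this.total += (this.group === 0 ? 1 : this.group) * scale;
      this.group = 0;
    }
    // Unknown or empty words are silently ignored in this toy version.
  }
}

// Chunk boundaries deliberately fall in the middle of "hundred".
const stream = new StreamingNumber();
for (const chunk of ["one hund", "red two", " crore"]) stream.write(chunk);
console.log(stream.end()); // 1020000000
```

The nested expression "one hundred two crore" works because "one hundred two" is accumulated as a single group (102) before the scale word multiplies it.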
🛣 Roadmap
- [ ] Currency normalization (INR, USD, etc.)
- [ ] Date and time expression parsing
- [ ] Percentage handling
- [ ] STT filler removal ("uh", "actually", "yaani")
📜 License
ISC
