spoken-number-normalizer

v1.0.0

Published

6 months ago

r0han44

speech-to-text normalization number-parser indian-numbers nlp streaming

spoken-number-normalizer

A high-performance, matcher-based normalization engine for converting spoken or written number expressions into structured numeric values — with first-class support for the Indian numbering system.

Designed for Speech-to-Text (STT) pipelines, streaming workloads, and large-scale data processing.

Introduction

spoken-number-normalizer is a deterministic normalization engine built specifically for converting spoken numeric expressions into structured numeric values, with native support for Indian units such as lakh and crore.

Unlike regex-heavy solutions, this library is designed to be stream-safe, confidence-aware, and resilient to noisy STT output—making it suitable for production voice systems.

✨ Features

Indian Number System Support
- Native handling of units like lakh and crore
- Example: one crore two lakh five → 10200005

Matcher-Based Architecture
- Deterministic matchers instead of brittle regular expressions
- Better handling of nested and long numeric expressions

Confidence Scoring
- Each normalization returns a confidence score
- Useful for rejecting or flagging low-quality STT transcripts

Performance First
- Tree-shakable ESM and CJS builds
- Zero runtime dependencies
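
To make the lakh and crore arithmetic concrete, the sketch below composes a value from already-parsed (count, unit) pairs. It is an illustration only, not the library's internals: `UNITS` and `composeIndian` are hypothetical names introduced for this example.

```javascript
// Illustrative only: not part of spoken-number-normalizer.
// Multipliers for the units used in this sketch (hypothetical table).
const UNITS = {
  thousand: 1e3,
  lakh: 1e5, // 1,00,000
  crore: 1e7, // 1,00,00,000
};

// Each part is [count, unit]; a null unit means plain ones.
function composeIndian(parts) {
  return parts.reduce(
    (total, [count, unit]) => total + count * (unit ? UNITS[unit] : 1),
    0
  );
}

// "one crore two lakh five" = 1 × 10^7 + 2 × 10^5 + 5
const value = composeIndian([[1, "crore"], [2, "lakh"], [5, null]]);
console.log(value); // 10200005
```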

📦 Installation

npm install spoken-number-normalizer

🚀 Quick Start

JavaScript

```javascript
import { normalizeNumber } from "spoken-number-normalizer";

const result = normalizeNumber("one crore two lakh five");

console.log(result);
/*
{
  output: 10200005,
  confidence: 0.97
}
*/
```

🧠 API

`normalizeNumber(input: string)`

Normalizes a spoken or written number string and returns a `NormalizationResult` object.

Interface

```ts
interface NormalizationResult {
  input: string;
  output?: number;
  confidence: number;
  error?: {
    code: string;
    message: string;
  };
}
```
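
Because `output` and `error` are both optional, callers will likely want to branch on them before trusting a value. The sketch below shows one way to triage a result-shaped object. It rests on assumptions that the interface does not confirm: that a failed normalization omits `output` and sets `error`, and that 0.8 is a sensible cutoff. `triage`, `MIN_CONFIDENCE`, the sample objects, and the `NO_MATCH` code are all made up for illustration.

```javascript
// Sketch: decide what to do with a NormalizationResult-shaped object.
// Assumption: a failed normalization omits `output` and sets `error`.
// MIN_CONFIDENCE is an arbitrary threshold chosen for this example.
const MIN_CONFIDENCE = 0.8;

function triage(result, minConfidence = MIN_CONFIDENCE) {
  if (result.error || typeof result.output !== "number") {
    const reason = result.error ? result.error.code : "NO_OUTPUT";
    return { status: "rejected", reason };
  }
  if (result.confidence < minConfidence) {
    return { status: "needs_review", value: result.output };
  }
  return { status: "accepted", value: result.output };
}

// Hand-written sample objects, not real library output.
const ok = { input: "one crore two lakh five", output: 10200005, confidence: 0.97 };
const noisy = { input: "one crore uh lakh", output: 10000000, confidence: 0.41 };
const failed = {
  input: "hello",
  confidence: 0,
  error: { code: "NO_MATCH", message: "no number found" }, // hypothetical code
};

console.log(triage(ok).status); // accepted
console.log(triage(noisy).status); // needs_review
console.log(triage(failed).status); // rejected
```

In a voice pipeline, the `needs_review` branch is where a transcript could be flagged for confirmation rather than silently accepted.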

🧩 Architecture Overview

The engine follows a linear pipeline to ensure speed and predictability:

1. Unicode Normalization: standardizes the Unicode representation of the input text.
2. Text Cleanup: removes fillers and linguistic noise.
3. Scanner: tokenizes the string using a dictionary-based approach.
4. Matcher Engine: runs the available matchers over the tokens.
   - Number Matcher (active)
   - Currency Matcher (in development)
5. Best Match Selection: ranks competing parses by confidence and keeps the best one.
6. Structured Output: returns the final numeric value.
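
The first three stages can be sketched as plain functions chained in order. This is a rough illustration under stated assumptions, not the engine's code: the stage function names, the `FILLERS` and `DICTIONARY` word lists, and the choice of the NFKC normalization form are all hypothetical.

```javascript
// Sketch of the first three pipeline stages; names and word lists are hypothetical.
const FILLERS = new Set(["uh", "um"]);
const DICTIONARY = new Set(["one", "two", "five", "hundred", "lakh", "crore"]);

// Stage 1: Unicode normalization. NFKC folds compatibility forms
// (for example full-width letters) into their plain equivalents.
const unicodeNormalize = (text) => text.normalize("NFKC").toLowerCase();

// Stage 2: text cleanup. Replace punctuation with spaces, then drop fillers.
// (A small regex is used here only for punctuation, not for number grammar.)
const cleanup = (text) =>
  text
    .replace(/[^\p{L}\p{N}\s]/gu, " ")
    .split(/\s+/)
    .filter((word) => word && !FILLERS.has(word))
    .join(" ");

// Stage 3: scanner. Keep only tokens that appear in the dictionary.
const scan = (text) => text.split(" ").filter((word) => DICTIONARY.has(word));

// Run the stages left to right, feeding each output into the next stage.
const pipeline = (input) =>
  [unicodeNormalize, cleanup, scan].reduce((acc, stage) => stage(acc), input);

const tokens = pipeline("Uh, ONE crore, two lakh five!");
console.log(tokens); // → ["one", "crore", "two", "lakh", "five"]
```

Keeping each stage a pure function of its input is what makes a linear pipeline like this predictable: any stage can be tested or swapped in isolation.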

🤔 Why not regex?

Regex-based normalization solutions often:

- Break on long or complex numeric strings
- Fail to express grammar or handle nested units (e.g. "one hundred two crore")
- Are not stream-safe, requiring the full string to be present in memory
- Cannot provide confidence scores for noisy STT data

This library uses deterministic matchers, making it robust enough for production-grade voice applications.
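
As a rough illustration of the nested-unit and streaming points, here is a minimal token-at-a-time accumulator. It is a sketch under simplifying assumptions, not the library's matcher: `IncrementalNumberMatcher` and its word tables are hypothetical, it covers only a handful of words, and it neither validates grammar nor produces a confidence score.

```javascript
// Sketch of a token-at-a-time matcher; not the library's actual matcher.
const SMALL = new Map([
  ["one", 1], ["two", 2], ["three", 3], ["four", 4], ["five", 5],
  ["six", 6], ["seven", 7], ["eight", 8], ["nine", 9], ["ten", 10],
]);
const BIG = new Map([["thousand", 1e3], ["lakh", 1e5], ["crore", 1e7]]);

class IncrementalNumberMatcher {
  constructor() {
    this.total = 0; // value of unit groups that are already closed
    this.current = 0; // value of the group still being read
  }

  // Feed one token; returns false if the token is not a known number word.
  push(token) {
    if (SMALL.has(token)) {
      this.current += SMALL.get(token);
    } else if (token === "hundred") {
      // A bare "hundred" is treated as "one hundred".
      this.current = (this.current || 1) * 100;
    } else if (BIG.has(token)) {
      // Close the current group by scaling it with the unit.
      this.total += (this.current || 1) * BIG.get(token);
      this.current = 0;
    } else {
      return false;
    }
    return true;
  }

  // The value so far is available after every token.
  get value() {
    return this.total + this.current;
  }
}

const matcher = new IncrementalNumberMatcher();
for (const token of ["one", "hundred", "two", "crore"]) {
  matcher.push(token);
}
console.log(matcher.value); // 1020000000
```

Only two numbers of state are carried between tokens, so input can arrive piece by piece without buffering the full string.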

🛣 Roadmap

- [ ] Currency normalization (INR, USD, etc.)
- [ ] Date and time expression parsing
- [ ] Percentage handling
- [ ] STT filler removal ("uh", "actually", "yaani")

📜 License

ISC

