@mailwoman/normalize
v4.11.0
Stage 1 of the runtime pipeline — deterministic input preprocessing (Unicode NFC, punctuation, whitespace, abbreviation). Pure functions, no ML.
Stage 1 of the Mailwoman runtime pipeline — deterministic input preprocessing.
Pure-function text normalization that prepares free-text address strings for
downstream parsing stages. Every transform produces a load-bearing offsetMap
so downstream stages can map normalized-string spans back to raw-string character
offsets.
import { normalize } from "@mailwoman/normalize"
const result = normalize("123 Main St.")
// result.normalized → "123 Main St."
// result.offsetMap → maps each normalized char back to raw

What it does
| Transform | Purpose |
| ----------------------------- | --------------------------------------------------------------------------- |
| NFC normalization | Unicode canonical composition |
| Punctuation normalization | Smart-quotes → straight, fullwidth → ASCII, elision/apostrophe preservation |
| Whitespace collapse | Multi-space, tab, non-breaking → single space; leading/trailing trim |
| Abbreviation expansion | Opt-in — "St." → "Street", "Ave" → "Avenue" etc. |
| CJK normalization | CJK-specific whitespace and punctuation handling |
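
To make the offset tracking concrete, here is a minimal sketch of a whitespace-collapse transform that records where each output character came from. It is not the package's implementation: the names `SketchResult` and `sketchCollapseWhitespace` are hypothetical, and it assumes an offset map can be modelled as an array in which entry `i` holds the raw-string index (in UTF-16 code units) of normalized character `i`. The real `NormalizedInput` and `OffsetMap` types may look different.

```typescript
// Hypothetical shape for illustration; the real NormalizedInput may differ.
interface SketchResult {
  normalized: string
  // offsetMap[i] = index in the raw string of the character at normalized[i]
  offsetMap: number[]
}

// Collapse runs of whitespace (space, tab, NBSP, ...) to one space and trim
// both ends, recording the raw index of every character that survives.
function sketchCollapseWhitespace(raw: string): SketchResult {
  const chars: string[] = []
  const offsetMap: number[] = []
  let runStart = -1 // raw index where the current whitespace run began

  for (let i = 0; i < raw.length; i++) {
    const ch = raw.charAt(i)
    if (/\s/.test(ch)) {
      if (runStart === -1) runStart = i
      continue
    }
    // Emit a single space for the preceding run, unless it was leading whitespace.
    if (runStart !== -1 && chars.length > 0) {
      chars.push(" ")
      offsetMap.push(runStart)
    }
    runStart = -1
    chars.push(ch)
    offsetMap.push(i)
  }
  // A trailing run is never emitted, which gives the trailing trim.
  return { normalized: chars.join(""), offsetMap }
}

const demo = sketchCollapseWhitespace("  123\t Main\u00A0 St. ")
console.log(demo.normalized) // "123 Main St."
console.log(demo.offsetMap) // [2, 3, 4, 5, 7, 8, 9, 10, 11, 13, 14, 15]
```

In this sketch the collapsed space points at the first character of the run it replaced. That choice is arbitrary; what matters is that every normalized position has exactly one raw position.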
API
// Full normalization pipeline (NFC → punctuation → whitespace)
normalize(input: string, opts?: NormalizeOpts): NormalizedInput
// Individual transforms (if you need only one)
applyNfc(input: string): NormalizedInput
applyPunctuation(input: string): NormalizedInput
collapseWhitespace(input: string): NormalizedInput
expandAbbreviations(input: string, opts?: ExpandOpts): NormalizedInput
applyCjkNormalization(input: string): CjkResult
// Offset map utilities
composeMaps(inner: OffsetMap, outer: OffsetMap): OffsetMap
identityMap(length: number): OffsetMap

Pipeline position
raw string → normalize → query-shape → locale-gate → kind-classifier → phrase-grouper → ...

Stage 1 in the Staged Pipeline Contract. No runtime dependencies.
Design
- Pure functions, no side effects, no ML. The output is byte-for-byte deterministic for the same input.
- offsetMap is load-bearing. Every transform tracks how normalized positions map back to raw input positions. This is essential for the parser to report spans in the original string.
- Configurable via NormalizeOpts: toggle expandAbbreviations, normalizeCase, and cjk.
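
The following sketch shows one way chained transforms can still report spans in the original string. It is an illustration under stated assumptions, not the package's code: the `sketch*` names are hypothetical stand-ins, an offset map is assumed to be a plain array whose entry `i` is the source index of output character `i`, and the argument order (inner maps intermediate to raw, outer maps final to intermediate) is a guess based on the `composeMaps(inner, outer)` signature.

```typescript
// Assumption: map[i] is the source index of output character i.
type SketchOffsetMap = number[]

// Identity map: a transform that changed nothing maps position i to i.
function sketchIdentityMap(length: number): SketchOffsetMap {
  return Array.from({ length }, (_, i) => i)
}

// inner: intermediate -> raw; outer: final -> intermediate.
// The result maps final positions straight to raw positions.
function sketchComposeMaps(inner: SketchOffsetMap, outer: SketchOffsetMap): SketchOffsetMap {
  return outer.map((mid) => {
    const rawIndex = inner[mid]
    if (rawIndex === undefined) throw new RangeError(`outer points outside inner: ${mid}`)
    return rawIndex
  })
}

// Translate a half-open span [start, end) in the final string into a raw-string span.
function sketchToRawSpan(map: SketchOffsetMap, start: number, end: number): [number, number] {
  if (start < 0 || end > map.length || start >= end) throw new RangeError("invalid span")
  return [map[start]!, map[end - 1]! + 1]
}

const raw = "  Main  St"
// Step A trims leading spaces: "Main  St"
const stepA: SketchOffsetMap = [2, 3, 4, 5, 6, 7, 8, 9]
// Step B collapses the double space: "Main St"
const stepB: SketchOffsetMap = [0, 1, 2, 3, 4, 6, 7]

const composed = sketchComposeMaps(stepA, stepB) // [2, 3, 4, 5, 6, 8, 9]
const [rawStart, rawEnd] = sketchToRawSpan(composed, 5, 7) // span of "St" in "Main St"
console.log(raw.slice(rawStart, rawEnd)) // "St"
```

Composing with an identity map leaves the other map unchanged, which is why an identity map is a natural starting value when folding a list of transforms.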
Related
- @mailwoman/query-shape: Stage 1.5, structural priors that consume the normalized output
- Staged Pipeline Contract
- Tokenization concepts
