@mera-vansh/ms-ltd
v2.2.1
Zero-dependency TypeScript NLP engine for multilingual Indian-language applications
ms-ltd
Multilingual Semantic Language Toolkit — Deterministic
Zero-dependency TypeScript NLP engine for multilingual Indian-language applications. Provides TF-IDF retrieval, emotion/tone classification, Unicode script detection, language identification, Sanskrit grammar tools, and a ~2200-entry curated lexicon across 18 Indian languages — all without any external runtime dependencies.
Why ms-ltd?
Most NLP libraries require heavy models, Python environments, or large dependency trees. ms-ltd is different:
- Zero runtime dependencies — ships nothing but its own code
- Deterministic — no ML models; every classification is explainable and auditable
- Multilingual by design — built specifically for the 18 languages listed under Supported Languages
- Works offline — no API calls, no network, no cloud
- Tree-shakeable — import only what you need; full TypeScript types included
Supported Languages
| Code | Language | Script | Code | Language | Script |
|------|-----------|------------|------|-----------|------------|
| en | English | Latin | pa | Punjabi | Gurmukhi |
| hi | Hindi | Devanagari | or | Odia | Odia |
| mr | Marathi | Devanagari | sa | Sanskrit | Devanagari |
| ne | Nepali | Devanagari | kok| Konkani | Devanagari |
| ma | Maithili | Devanagari | ta | Tamil | Tamil |
| bn | Bengali | Bengali | te | Telugu | Telugu |
| as | Assamese | Bengali | ml | Malayalam | Malayalam |
| gu | Gujarati | Gujarati | kn | Kannada | Kannada |
| ur | Urdu | Arabic | sd | Sindhi | Arabic |
Installation
npm install @mera-vansh/ms-ltd
# or
yarn add @mera-vansh/ms-ltd
# or
pnpm add @mera-vansh/ms-ltd
Requirements: Node.js ≥ 22
Quick Start
import { LTD } from "@mera-vansh/ms-ltd";
// 1. Create engine
const ltd = new LTD();
// 2. Ingest your knowledge base
ltd.ingest([
{ id: "g1", text: "भारद्वाज गोत्र के बारे में जानकारी", metadata: { topic: "gotra" } },
{ id: "g2", text: "Bharadwaj gotra pravara rishis", metadata: { topic: "gotra" } },
{ id: "r1", text: "माता पिता का रिश्ता पारिवारिक संबंध", metadata: { topic: "family" } },
{ id: "r2", text: "mother father relation family tree", metadata: { topic: "family" } },
]);
// 3. Query
const result = ltd.call("मेरा गोत्र भारद्वाज है, पिताजी का नाम राम है");
console.log(result);
// {
// input: "मेरा गोत्र भारद्वाज है, पिताजी का नाम राम है",
// lang: "hi",
// script: "Devanagari",
// emotion: "NEUTRAL",
// tone: "NEUTRAL",
// candidates: [{ id: "g1", score: 0.82, metadata: { topic: "gotra" } }, ...],
// confidence: 0.82
// }
Core Concepts
The LTD Pipeline
Every call to ltd.call(input) runs a six-stage pipeline:
Raw text
→ [1] Normalise NFKC + lowercase + zero-width strip
→ [2] Detect Script Unicode block counting (10 scripts)
→ [3] Detect Language Whole-word seed-keyword disambiguation (18 languages)
→ [4] Retrieve TF-IDF cosine similarity against ingested docs
→ [5] Emotion Deterministic keyword + regex rules
→ [6] Tone Deterministic honorific + regex rules
→ LTDResponse
Documents (Knowledge Base)
You train the engine by ingesting IngestDocument objects. Each document has:
- id: unique string key (used for feedback, removal, and lookup)
- text: the content to vectorise and search
- metadata: arbitrary key/value pairs returned with retrieval results
Retrieval (TF-IDF + Cosine Similarity)
Documents are represented as sparse TF-IDF vectors. Queries are vectorised using the same vocabulary and ranked by cosine similarity weighted by each document's feedback weight.
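To make the ranking step concrete, here is a minimal standalone sketch of cosine similarity over sparse vectors, multiplied by a per-document weight. It is not the package's internal code: `sketchCosine`, `rankDocs`, and `WeightedDoc` are hypothetical names, and the toy vectors use raw counts instead of fitted TF-IDF scores.

```typescript
// Illustrative sketch only: not the library's internal code.
type SparseVector = Map<string, number>;

// Cosine similarity between two sparse vectors; returns 0 when either is empty.
function sketchCosine(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((value, term) => {
    normA += value * value;
    const other = b.get(term);
    if (other !== undefined) dot += value * other;
  });
  b.forEach((value) => {
    normB += value * value;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical document record: a vector plus its feedback weight.
interface WeightedDoc {
  id: string;
  vector: SparseVector;
  weight: number;
}

// Rank documents by cosine similarity multiplied by feedback weight.
function rankDocs(
  query: SparseVector,
  docs: WeightedDoc[],
  topK: number
): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: sketchCosine(query, d.vector) * d.weight }))
    .filter((c) => c.score > 0)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const demoQuery: SparseVector = new Map([["gotra", 1]]);
const demoDocs: WeightedDoc[] = [
  { id: "a", vector: new Map([["gotra", 1], ["family", 1]]), weight: 1.0 },
  { id: "b", vector: new Map([["gotra", 1], ["family", 1]]), weight: 1.1 },
  { id: "c", vector: new Map([["mata", 1]]), weight: 1.0 },
];
const ranked = rankDocs(demoQuery, demoDocs, 5);
```

In this toy data, documents "a" and "b" have identical vectors, so the higher weight alone places "b" first; "c" shares no term with the query and is dropped.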
Feedback Loop
After a retrieval, you can signal whether a result was helpful:
ltd.feedback("g1", "positive"); // boost: weight × 1.1, max 3.0
ltd.feedback("g1", "negative"); // penalise: weight × 0.9, min 0.1
API Reference
new LTD(options?)
const ltd = new LTD({ defaultTopK: 5 });
| Option | Type | Default | Description |
|---|---|---|---|
| defaultTopK | number | 5 | Default number of candidates returned |
ltd.ingest(docs)
Indexes a batch of documents. Fits the TF-IDF vocabulary on the entire batch so IDF scores are globally calibrated. Prefer one large ingest() call over many small ones.
ltd.ingest([
{ id: "q1", text: "namaste greetings hello", metadata: { lang: "en" } },
{ id: "q2", text: "नमस्ते प्रणाम", metadata: { lang: "hi" } },
{ id: "q3", text: "வணக்கம் நன்றி", metadata: { lang: "ta" } },
]);
Calling ingest() again with different documents adds to the store (it does not replace previous entries). Entries with duplicate IDs are overwritten.
ltd.add(doc)
Adds a single document without re-fitting the vocabulary. Useful for incremental additions after initial ingestion. Terms not in the fitted vocabulary are silently ignored.
ltd.add({ id: "q4", text: "bonjour", metadata: { lang: "fr" } });
ltd.call(input, targetLang?, topK?)
Runs the full NLP pipeline.
const res = ltd.call("guruji ka ashirwad chahiye", undefined, 3);
// res.tone === "REVERENTIAL"
// res.emotion === "REVERENCE"
// res.lang === "hi" (or null for mixed script)
Returns LTDResponse:
interface LTDResponse {
input: string; // normalised input
lang: LangCode | null; // detected language
script: Script; // dominant script
emotion: Emotion; // emotional register
tone: Tone; // formality register
candidates: LTDCandidate[]; // ranked retrieval results
confidence: number; // top candidate score [0, 1]
}
ltd.suggest(input, maxResults?)
Searches the built-in ~2200-entry LEXICON for entries whose text or romanized form starts with the given prefix. Uses a lazy-built LexiconTrie (memoized on first call per instance).
import { LTD } from "@mera-vansh/ms-ltd";
import type { LexiconSuggestion } from "@mera-vansh/ms-ltd";
const ltd = new LTD();
// Devanagari prefix
const hits = ltd.suggest("नमस्ते");
// → [{ entry: { text: "नमस्ते", lang: "hi", romanized: "namaste", gloss: "hello", category: "salutation" }, matchedPrefix: "नमस्ते" }]
// IAST romanized prefix (diacritics supported)
const kinship = ltd.suggest("dādā", 3);
// → up to 3 results matching paternal-grandfather entries across languages
// Geography via IAST
const rivers = ltd.suggest("gaṃgā");
// → Ganga entries (Hindi, Sanskrit, etc.)
// Limit results
const few: LexiconSuggestion[] = ltd.suggest("न", 5);
// Returns empty array for no match (never null/undefined)
ltd.suggest("xyz_no_match"); // → []
| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | — | Devanagari, IAST-romanized, or plain-ASCII prefix |
| maxResults | number | 10 | Maximum suggestions returned |
Returns LexiconSuggestion[]:
interface LexiconSuggestion {
entry: LexiconEntry; // the matched lexicon entry
matchedPrefix: string; // the lowercased prefix that matched
}
ltd.feedback(id, signal)
Adjusts a document's retrieval weight based on user feedback.
ltd.feedback("q1", "positive"); // weight × 1.1, capped at 3.0
ltd.feedback("q1", "negative"); // weight × 0.9, floored at 0.1
ltd.feedback("q1", "neutral"); // no change
ltd.export() / ltd.import(state)
Persist and restore the full engine state. The snapshot is fully JSON-serialisable.
// Save to MongoDB / Redis / disk
const snapshot = ltd.export();
await db.collection("brain").replaceOne({ _id: "v1" }, snapshot, { upsert: true });
// Restore in a new process
const saved = await db.collection("brain").findOne({ _id: "v1" });
const ltd2 = new LTD();
ltd2.import(saved);
ltd.reset()
Clears all documents and resets the vectoriser.
ltd.reset();
ltd.storeSize(); // → 0
ltd.storeSize() / ltd.hasDocument(id)
ltd.storeSize(); // number of indexed documents
ltd.hasDocument("q1"); // → true / false
Emotion Detection
detectEmotion(text) classifies text into one of six emotional registers in strict priority order:
| Priority | Emotion | Triggers |
|---|---|---|
| 1 | REVERENCE | namaste, pranam, jai, श्री, ওম, ੴ, ನಮಸ್ಕಾರ, हरे कृष्ण, राम राम, ॐ नमः शिवाय, and 20+ cross-script equivalents |
| 2 | JOY | good, great, धन्यवाद, நன்றி, ধন্যবাদ, 😊🎉 |
| 3 | GRIEF | died, मृत्यु, மரணம், మరణం, مرحوم, 😢💔 |
| 4 | ANGER | wrong, error, गलत, தவறு, ভুল, غلط, 😠😡 |
| 5 | CONFUSION | confused, समझ नहीं, புரியவில்லை, అర్థం కాలేదు, ?? |
| 6 | NEUTRAL | (fallback — no rule matched) |
import { detectEmotion } from "@mera-vansh/ms-ltd";
detectEmotion("नमस्ते, आपका स्वागत है"); // → "REVERENCE"
detectEmotion("हरे कृष्ण"); // → "REVERENCE"
detectEmotion("ॐ नमः शिवाय"); // → "REVERENCE"
detectEmotion("bahut acha kiya!"); // → "JOY"
detectEmotion("wrong answer!!"); // → "ANGER"
detectEmotion("what do you mean??"); // → "CONFUSION"
detectEmotion("my gotra is Bharadwaj"); // → "NEUTRAL"
Hindu devotional invocations (added in Sprint 2)
The following high-frequency devotional phrases are recognised as REVERENCE keywords:
| Phrase | Transliteration |
|--------|----------------|
| हरे कृष्ण | Hare Krishna |
| राम राम | Ram Ram |
| जय श्री राम | Jai Shri Ram |
| हर हर महादेव | Har Har Mahadev |
| जय माता दी | Jai Mata Di |
| ॐ नमः शिवाय | Om Namah Shivay |
| जय जय | Jai Jai |
| ॐ | Om |
These take REVERENCE priority, so co-occurring joy or anger keywords are overridden.
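The override behaviour follows from evaluating rules in priority order and returning on the first hit. The sketch below shows that pattern in isolation. The rule list, keywords, and the names `SKETCH_RULES` and `sketchDetectEmotion` are hypothetical and much smaller than the shipped EMOTION_RULES; plain substring matching is an assumption made for brevity.

```typescript
// Illustrative sketch: rule list and keywords are hypothetical, not EMOTION_RULES.
type SketchEmotion = "REVERENCE" | "JOY" | "ANGER" | "NEUTRAL";

interface SketchRule {
  emotion: SketchEmotion;
  keywords: string[];
}

// Rules are listed in priority order; the first rule with a hit wins.
const SKETCH_RULES: SketchRule[] = [
  { emotion: "REVERENCE", keywords: ["हरे कृष्ण", "राम राम", "namaste"] },
  { emotion: "JOY", keywords: ["great", "धन्यवाद"] },
  { emotion: "ANGER", keywords: ["wrong", "गलत"] },
];

function sketchDetectEmotion(text: string): SketchEmotion {
  const haystack = text.toLowerCase();
  for (const rule of SKETCH_RULES) {
    // Assumed matching strategy: simple substring containment.
    if (rule.keywords.some((k) => haystack.includes(k))) return rule.emotion;
  }
  return "NEUTRAL"; // fallback when no rule matched
}

const mixedSignal = sketchDetectEmotion("हरे कृष्ण, what a great day");
```

Here `mixedSignal` is "REVERENCE" even though "great" would match the JOY rule, because the REVERENCE rule is checked first.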
Using EMOTION_RULES directly
import { EMOTION_RULES } from "@mera-vansh/ms-ltd";
// Inspect or extend the rule set
EMOTION_RULES.forEach(rule => {
console.log(rule.emotion, "—", rule.keywords.length, "keywords");
});
Tone Detection
detectTone(text) classifies formality/register in strict priority order:
| Priority | Tone | Triggers |
|---|---|---|
| 1 | REVERENTIAL | param pujya, guruji, swami, माता श्री, ஐயா, అయ్యా |
| 2 | FORMAL | Dr., Mr., aap, आप, shriman, full proper names |
| 3 | URGENT | abhi, jaldi, immediately, asap, right now, !! |
| 4 | CURIOUS | ?, why, how, what, kyun, kaise, batao |
| 5 | INFORMAL | tu, tum, yaar, dost, bhai, lol, haha |
| 6 | NEUTRAL | (fallback — no rule matched) |
import { detectTone } from "@mera-vansh/ms-ltd";
detectTone("Param Pujya Guruji ka ashirwad"); // → "REVERENTIAL"
detectTone("Dr. Sharma please help"); // → "FORMAL"
detectTone("abhi batao, urgent!"); // → "URGENT"
detectTone("gotra kya hota hai?"); // → "CURIOUS"
detectTone("yaar bata na"); // → "INFORMAL"
detectTone("my gotra is Bharadwaj"); // → "NEUTRAL"
Priority in action
When multiple rules fire, the highest-priority rule wins:
detectTone("aap ko pujya mata shri pranam");
// "aap" → FORMAL, "pujya mata shri" → REVERENTIAL
// → "REVERENTIAL" (priority 1 beats priority 2)
detectTone("Dr. Sharma abhi aao");
// "Dr." → FORMAL, "abhi" → URGENT
// → "FORMAL" (priority 2 beats priority 3)
Script & Language Detection
ScriptDetector.detectScript(text)
Identifies the dominant Unicode writing system:
import { ScriptDetector } from "@mera-vansh/ms-ltd";
ScriptDetector.detectScript("नमस्ते"); // → "Devanagari"
ScriptDetector.detectScript("hello world"); // → "Latin"
ScriptDetector.detectScript("வணக்கம்"); // → "Tamil"
ScriptDetector.detectScript("hello नमस्ते"); // → "Mixed"
ScriptDetector.detectScript("123 !@#"); // → "Unknown"
Returns one of: Devanagari | Tamil | Telugu | Bengali | Gujarati | Odia | Malayalam | Kannada | Gurmukhi | Arabic | Latin | Mixed | Unknown
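As a rough illustration of Unicode block counting, the sketch below tallies code points per script and derives a label from the tally. It covers only three scripts, and the decision policy (any second script yields "Mixed") is an assumption; the names `SCRIPT_RANGES`, `countScripts`, and `sketchDetectScript` are hypothetical and not part of the package.

```typescript
// Illustrative sketch: three scripts only; ranges are the Unicode block bounds.
type KnownScript = "Devanagari" | "Tamil" | "Latin";
type SketchScript = KnownScript | "Mixed" | "Unknown";

const SCRIPT_RANGES: { script: KnownScript; from: number; to: number }[] = [
  { script: "Devanagari", from: 0x0900, to: 0x097f },
  { script: "Tamil", from: 0x0b80, to: 0x0bff },
  { script: "Latin", from: 0x0041, to: 0x005a }, // A-Z
  { script: "Latin", from: 0x0061, to: 0x007a }, // a-z
];

// Count how many code points fall into each known script range.
function countScripts(text: string): Map<KnownScript, number> {
  const counts = new Map<KnownScript, number>();
  Array.from(text).forEach((ch) => {
    const cp = ch.codePointAt(0);
    if (cp === undefined) return;
    const hit = SCRIPT_RANGES.find((r) => cp >= r.from && cp <= r.to);
    if (hit) counts.set(hit.script, (counts.get(hit.script) ?? 0) + 1);
  });
  return counts;
}

// Assumed policy: no script found is "Unknown", more than one is "Mixed".
function sketchDetectScript(text: string): SketchScript {
  const present = Array.from(countScripts(text).keys());
  if (present.length === 0) return "Unknown";
  if (present.length > 1) return "Mixed";
  return present[0] ?? "Unknown";
}

const devanagariCount = countScripts("नमस्ते").get("Devanagari");
```

Digits, spaces, and punctuation fall outside every range, which is why an input such as "123 !@#" produces no counts at all in this sketch.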
ScriptDetector.detectLanguage(text)
Narrows the script to a specific language using whole-word seed-keyword disambiguation. The text is split into tokens on whitespace and punctuation, and seed keywords are compared against complete tokens, never as substrings, so that languages sharing the Devanagari script (Hindi, Nepali, Marathi, and Sanskrit) are disambiguated correctly.
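The difference between whole-token and substring matching can be seen in a small standalone sketch. The seed lists below are tiny and hypothetical (the package's real seeds are not reproduced here), and `sketchTokens` and `sketchDetectLanguage` are illustrative names only.

```typescript
// Illustrative sketch: seed lists are tiny and hypothetical, not the library's.
const SKETCH_SEEDS: { lang: string; words: string[] }[] = [
  { lang: "hi", words: ["है", "मेरा"] },
  { lang: "mr", words: ["आहे", "माझ्या"] },
  { lang: "ne", words: ["छ", "मेरो"] },
];

// Split on whitespace and common punctuation (including the danda "।").
function sketchTokens(text: string): string[] {
  return text.split(/[\s.,!?;:।]+/).filter((t) => t.length > 0);
}

// Count seeds that appear as whole tokens; highest count wins, null if no hits.
function sketchDetectLanguage(text: string): string | null {
  const tokens = sketchTokens(text);
  let best: string | null = null;
  let bestHits = 0;
  for (const entry of SKETCH_SEEDS) {
    const hits = tokens.filter((t) => entry.words.indexOf(t) !== -1).length;
    if (hits > bestHits) {
      best = entry.lang;
      bestHits = hits;
    }
  }
  return best;
}

// "अच्छा" contains the character "छ", but it is not the standalone token "छ".
const hindiSentence = sketchDetectLanguage("यह अच्छा है");
const substringWouldHit = "अच्छा".indexOf("छ") !== -1;
```

With substring matching, the Nepali seed "छ" would also fire inside the Hindi word "अच्छा"; comparing whole tokens avoids that false hit.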
ScriptDetector.detectLanguage("मेरा गोत्र भारद्वाज है"); // → "hi"
ScriptDetector.detectLanguage("माझ्या घरी आहे"); // → "mr"
ScriptDetector.detectLanguage("मेरो नाम के छ"); // → "ne"
ScriptDetector.detectLanguage("भवति करोति"); // → "sa" (Sanskrit)
ScriptDetector.detectLanguage("মোৰ গোত্ৰ কি"); // → "as" (Assamese, not Bengali)
ScriptDetector.detectLanguage("hello मेरा"); // → null (Mixed script)
ScriptDetector.detectMixedScripts(text, threshold?)
Returns all scripts present above a share threshold (default 10%):
ScriptDetector.detectMixedScripts("hello नमस्ते world गोत्र");
// → ["Devanagari", "Latin"]
ScriptDetector.detectMixedScripts("hello नमस्ते world गोत्र test", 0.5);
// → ["Devanagari"] (Latin < 50% with high threshold)
LEXICON
The built-in LEXICON contains ~2200 curated entries across 7 semantic categories covering 18 languages, with IAST romanizations and English glosses.
import { LEXICON } from "@mera-vansh/ms-ltd";
import type { LexiconCategory, LexiconEntry } from "@mera-vansh/ms-ltd";
Categories
| Category | Description | Example entries |
|---|---|---|
| salutation | Greetings and farewells | नमस्ते, வணக்கம், నమస్కారం |
| kinship | Family relations (24 roles × 18 langs) | पिता, माता, दादा, नानी |
| emotion_rasa | The 9 Sanskrit rasas | SHRINGAR, KARUNA, VEERA, SHANTA… |
| geography | Sacred rivers, pilgrimage sites | गंगा (gaṃgā), काशी (kāśī) |
| literature | Epics and canonical texts | रामायण (rāmāyaṇa), महाभारत |
| time | Vikrama Samvat months, days, tithis | चैत्र (caitra), सोमवार |
| number | Numerals in Devanagari, Tamil, etc. | ०१२, ௦௧௨ |
Accessing entries
// All salutations
const salutations = LEXICON.salutation;
// → [{ text: "नमस्ते", lang: "hi", romanized: "namaste", gloss: "hello", category: "salutation" }, ...]
// Spot-check: Hindi father
LEXICON.kinship.find(e => e.subcategory === "father" && e.lang === "hi");
// → { text: "पिता", lang: "hi", romanized: "pitā", gloss: "father", category: "kinship", subcategory: "father" }
// All 9 rasas
const rasas = LEXICON.emotion_rasa
.filter(e => e.lang === "sa")
.map(e => e.subcategory);
// → ["SHRINGAR", "HASYA", "KARUNA", "RAUDRA", "VEERA", "BHAYANAK", "BIBHATSA", "ADBHUTA", "SHANTA"]
LexiconEntry shape
interface LexiconEntry {
text: string; // native script form
lang: LangCode; // BCP-47 language code
romanized: string; // IAST transliteration
gloss: string; // English definition
category: LexiconCategory; // one of the 7 categories above
subcategory?: string; // e.g. "father", "SHRINGAR", "samvat_month"
}
LexiconTrie
LexiconTrie is a Unicode-safe BFS prefix trie for fast autocomplete over any set of LexiconEntry objects.
import { LexiconTrie, LEXICON } from "@mera-vansh/ms-ltd";
import type { LexiconEntry } from "@mera-vansh/ms-ltd";
const trie = new LexiconTrie();
// Build from full LEXICON (index by both native text and IAST romanized)
for (const cat of Object.values(LEXICON)) {
for (const entry of cat) {
trie.insert(entry.text, entry);
trie.insert(entry.romanized, entry);
}
}
// Prefix search
const results = trie.suggest("नम", 5);
// → up to 5 LexiconSuggestion objects for entries starting with "नम"
// Trie size
console.log(trie.size); // number of entries indexed
// Case-insensitive prefix (suggest() lowercases the prefix)
trie.suggest("NAMASTE"); // same as trie.suggest("namaste")
Empty / invalid input
trie.insert("", entry); // silently ignored — empty keys are rejected
trie.suggest(""); // → []
trie.suggest("xyz_none"); // → []
BFS ordering
Results are returned in breadth-first order (shallower / shorter matches first), so exact matches bubble to the top of the result list.
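The following standalone sketch shows why a breadth-first walk surfaces shorter keys first. It stores plain string values instead of LexiconEntry objects, and `SketchTrie` is a hypothetical class written for illustration, not the package's LexiconTrie.

```typescript
// Illustrative sketch of a prefix trie with breadth-first suggestion order.
class SketchTrieNode {
  children = new Map<string, SketchTrieNode>();
  values: string[] = [];
}

class SketchTrie {
  private root = new SketchTrieNode();

  insert(key: string, value: string): void {
    if (key.length === 0) return; // empty keys are ignored
    let node = this.root;
    // Array.from iterates by code point, so surrogate pairs stay intact.
    for (const ch of Array.from(key.toLowerCase())) {
      let next = node.children.get(ch);
      if (!next) {
        next = new SketchTrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.values.push(value);
  }

  suggest(prefix: string, maxResults = 10): string[] {
    if (prefix.length === 0) return [];
    let node = this.root;
    for (const ch of Array.from(prefix.toLowerCase())) {
      const next = node.children.get(ch);
      if (!next) return []; // prefix not present
      node = next;
    }
    // Breadth-first walk: shallower nodes (shorter keys) are visited first.
    const out: string[] = [];
    const queue: SketchTrieNode[] = [node];
    while (queue.length > 0 && out.length < maxResults) {
      const current = queue.shift() as SketchTrieNode;
      for (const v of current.values) {
        if (out.length < maxResults) out.push(v);
      }
      current.children.forEach((child) => queue.push(child));
    }
    return out;
  }
}

const sketchTrie = new SketchTrie();
sketchTrie.insert("namaste", "A");
sketchTrie.insert("nam", "B");
sketchTrie.insert("namaskar", "C");
const bfsOrder = sketchTrie.suggest("NAM");
```

The exact match "nam" (value "B") comes first, then the 7-letter key, then the 8-letter key, regardless of insertion order.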
Grammar Tools
Transliterator
Transliteration from IAST romanization to 9 Indic scripts, plus the reverse direction from Devanagari to IAST.
import { Transliterator } from "@mera-vansh/ms-ltd";
Transliterator.iastToScript(iast, scheme)
Converts an IAST string to the target Indic script. Each phoneme maps to its standalone glyph (vowels output independent vowel characters, not matra vowel signs).
Transliterator.iastToScript("k", "Devanagari"); // → "क"
Transliterator.iastToScript("kh", "Devanagari"); // → "ख"
Transliterator.iastToScript("ā", "Devanagari"); // → "आ" (standalone vowel)
Transliterator.iastToScript("rāma", "Devanagari"); // → "रआमअ" (standalone vowels, not matras)
Transliterator.iastToScript("k", "Bengali"); // → "ক"
Transliterator.iastToScript("k", "Tamil"); // → "க" (Tamil is lossy — voiced/aspirated → same glyph)
Transliterator.iastToScript("kh", "Tamil"); // → "க" (same as k)
Transliterator.iastToScript("k", "Telugu"); // → "క"
Transliterator.iastToScript("k", "Gujarati"); // → "ક"
Transliterator.iastToScript("k", "Gurmukhi"); // → "ਕ"
Transliterator.iastToScript("k", "Odia"); // → "କ"
Transliterator.iastToScript("k", "Kannada"); // → "ಕ"
Transliterator.iastToScript("k", "Malayalam"); // → "ക"
// Non-IAST characters pass through unchanged
Transliterator.iastToScript("rāma 123", "Devanagari"); // → "रआमअ 123"
Transliterator.iastToScript("rāma!", "Devanagari"); // → "रआमअ!"
Supported schemes: Devanagari | Bengali | Tamil | Telugu | Gujarati | Gurmukhi | Odia | Kannada | Malayalam
Note: The output uses standalone vowel characters (e.g., अ, आ), not Devanagari matra vowel signs (U+093E–U+094C). This is intentional for phoneme-level mapping. For fully-formed syllabic Devanagari (consonant + matra), a syllabification pass is needed.
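One possible shape for such a syllabification pass is sketched below. It is not part of the package: `joinSyllables` is a hypothetical helper that handles only the ten simple vowels listed in the table, replaces a consonant followed by a standalone vowel with consonant plus matra, and adds a virama when no vowel follows.

```typescript
// Illustrative sketch: a minimal post-processing pass, not part of the package.
const VIRAMA = "\u094D";

// Standalone vowel to dependent vowel sign (matra). "अ" maps to "" because a
// consonant letter already carries the inherent vowel a.
const VOWEL_TO_MATRA: Record<string, string> = {
  "\u0905": "",       // अ (inherent a)
  "\u0906": "\u093E", // आ to ा
  "\u0907": "\u093F", // इ to ि
  "\u0908": "\u0940", // ई to ी
  "\u0909": "\u0941", // उ to ु
  "\u090A": "\u0942", // ऊ to ू
  "\u090F": "\u0947", // ए to े
  "\u0910": "\u0948", // ऐ to ै
  "\u0913": "\u094B", // ओ to ो
  "\u0914": "\u094C", // औ to ौ
};

// Devanagari consonant letters occupy U+0915 (क) through U+0939 (ह).
function isDevanagariConsonant(ch: string): boolean {
  const cp = ch.codePointAt(0);
  return cp !== undefined && cp >= 0x0915 && cp <= 0x0939;
}

function joinSyllables(text: string): string {
  const chars = Array.from(text);
  let out = "";
  let i = 0;
  while (i < chars.length) {
    const ch = chars[i] as string;
    const following = chars[i + 1];
    if (isDevanagariConsonant(ch)) {
      const matra = following !== undefined ? VOWEL_TO_MATRA[following] : undefined;
      if (matra !== undefined) {
        out += ch + matra; // consonant + vowel sign (or bare consonant for a)
        i += 2;
      } else {
        out += ch + VIRAMA; // no vowel follows: suppress the inherent a
        i += 1;
      }
    } else {
      out += ch; // vowels after non-consonants, signs, digits, punctuation
      i += 1;
    }
  }
  return out;
}

const joined = joinSyllables("\u0930\u0906\u092E\u0905"); // input "रआमअ"
```

Applied to the phoneme-level output "रआमअ", this sketch yields "राम". Vocalic ṛ, conjunct shaping, and anusvāra placement are left out and would need additional rules.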
Transliterator.devanagariToIast(text)
Maps standalone Devanagari characters to IAST. Characters not in the mapping (matras, conjuncts) pass through unchanged.
Transliterator.devanagariToIast("क"); // → "k"
Transliterator.devanagariToIast("आ"); // → "ā"
Transliterator.devanagariToIast("म"); // → "m"
Transliterator.devanagariToIast("ा"); // → "ा" (U+093E matra: passthrough, not IAST ā)
Transliterator.isIAST(text)
Returns true if the string contains any IAST diacritic character.
Transliterator.isIAST("rāma"); // → true (ā)
Transliterator.isIAST("saṃskṛta"); // → true (ṃ, ṛ)
Transliterator.isIAST("rama"); // → false (no diacritics)
Transliterator.isIAST("नमस्ते"); // → false (Devanagari, no IAST diacritics)
VibhaktiEngine
Sanskrit nominal inflection across all 8 vibhaktis (cases) and 3 vacanas (numbers) for 6 stem classes.
import { VibhaktiEngine } from "@mera-vansh/ms-ltd";
import type { Vibhakti, Vacana, StemClass, InflectedForm } from "@mera-vansh/ms-ltd";
Stem classes
| Class | Characteristic | Example | Note |
|---|---|---|---|
| a_m | -a masculine | rām → rāmaḥ | Pass stem WITHOUT thematic -a |
| aa_f | -ā feminine | sīt → sītā | Pass stem WITHOUT final -ā |
| i_m | -i masculine | kav → kaviḥ | Pass stem WITHOUT final -i |
| ii_f | -ī feminine | nad → nadī | Pass stem WITHOUT final -ī |
| u_m | -u masculine | bandh → bandhuḥ | Pass stem WITHOUT final -u |
| cons | consonant-final | rāj → rāj | Nominative sg has empty suffix |
VibhaktiEngine.inflect(stem, stemClass, vibhakti, vacana)
VibhaktiEngine.inflect("rām", "a_m", 1, "sg");
// → { form: "rāmaḥ", vibhakti: 1, vacana: "sg", linga: "m", kāraka: "kartā" }
VibhaktiEngine.inflect("sīt", "aa_f", 3, "sg");
// → { form: "sītayā", vibhakti: 3, vacana: "sg", linga: "f", kāraka: "karaṇa" }
VibhaktiEngine.inflect("kav", "i_m", 4, "sg");
// → { form: "kavaye", vibhakti: 4, vacana: "sg", linga: "m", kāraka: "sampradāna" }
Vibhakti numbers 1–8 correspond to: Nominative (kartā), Accusative (karma), Instrumental (karaṇa), Dative (sampradāna), Ablative (apādāna), Genitive (sambandha), Locative (adhikaraṇa), Vocative (sambodhan).
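The stem-plus-ending idea behind the stem-class table can be sketched for a single case: singular forms of -a masculine stems. The endings below follow the standard deva paradigm; the function name `sketchInflectSg` and the ASCII field name `karaka` are hypothetical, and sandhi (such as n becoming ṇ after r) is deliberately ignored.

```typescript
// Illustrative sketch: singular endings for -a masculine stems only.
// Index 0 is vibhakti 1 (nominative), index 7 is vibhakti 8 (vocative).
const A_M_SINGULAR: string[] = ["aḥ", "am", "ena", "āya", "āt", "asya", "e", "a"];

// Labels in the same order as the vibhakti list above.
const KARAKA_LABELS: string[] = [
  "kartā", "karma", "karaṇa", "sampradāna",
  "apādāna", "sambandha", "adhikaraṇa", "sambodhan",
];

interface SketchForm {
  form: string;
  vibhakti: number;
  karaka: string;
}

// The stem is passed WITHOUT the thematic -a, matching the table's convention.
function sketchInflectSg(stem: string, vibhakti: number): SketchForm {
  if (!Number.isInteger(vibhakti) || vibhakti < 1 || vibhakti > 8) {
    throw new RangeError("vibhakti must be an integer from 1 to 8");
  }
  const ending = A_M_SINGULAR[vibhakti - 1] as string;
  const karaka = KARAKA_LABELS[vibhakti - 1] as string;
  return { form: stem + ending, vibhakti, karaka };
}

const instrumental = sketchInflectSg("dev", 3);
```

For the stem "dev" this gives "devena" in the instrumental. A stem containing r, such as "rām", needs the retroflex form "rāmeṇa", which this sketch does not produce.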
VibhaktiEngine.paradigm(stem, stemClass)
Returns all 24 forms (8 vibhaktis × 3 vacanas):
const paradigm = VibhaktiEngine.paradigm("rām", "a_m");
paradigm.length; // → 24
paradigm[0]!.form; // → "rāmaḥ" (Nom sg)
paradigm[0]!.kāraka; // → "kartā"
GenderAgreement
Adjective agreement for Hindi, Marathi, and Sanskrit, plus honorific pronoun detection for Hindi, Nepali, Marathi, and Sanskrit.
import { GenderAgreement } from "@mera-vansh/ms-ltd";
GenderAgreement.agreeAdjective(adjStem, linga, vacana, lang)
Inflects an adjective to agree with its noun in gender, number, and language.
// Hindi / Marathi (lang: "hi" | "mr")
GenderAgreement.agreeAdjective("acchā", "m", "sg", "hi"); // → "acchā"
GenderAgreement.agreeAdjective("acchā", "f", "sg", "hi"); // → "acchī"
GenderAgreement.agreeAdjective("acchā", "m", "pl", "hi"); // → "acche"
// Sanskrit (lang: "sa")
GenderAgreement.agreeAdjective("sundarā", "m", "sg", "sa"); // → "sundaraḥ"
GenderAgreement.agreeAdjective("sundarā", "f", "sg", "sa"); // → "sundarā"
GenderAgreement.agreeAdjective("sundarā", "n", "sg", "sa"); // → "sundaram"
// Unsupported language — returns stem unchanged
GenderAgreement.agreeAdjective("acchā", "m", "sg", "ta"); // → "acchā"
GenderAgreement.isHonorificPronoun(word, lang)
Returns true if the word is the honorific 2nd-person pronoun in the given language.
GenderAgreement.isHonorificPronoun("आप", "hi"); // → true (Hindi)
GenderAgreement.isHonorificPronoun("aap", "hi"); // → true (romanized)
GenderAgreement.isHonorificPronoun("तुम", "hi"); // → false
GenderAgreement.isHonorificPronoun("तपाईं", "ne"); // → true (Nepali)
GenderAgreement.isHonorificPronoun("आपण", "mr"); // → true (Marathi)
GenderAgreement.isHonorificPronoun("भवान्", "sa"); // → true (Sanskrit)
GenderAgreement.isHonorificPronoun("आप", "ta"); // → false (unsupported lang)
SovReorder
Heuristic English SVO → SOV word-order reordering, useful for generating Hindi-style training prompts from English sentences.
import { SovReorder } from "@mera-vansh/ms-ltd";
SovReorder.reorder(text)
Moves the auxiliary verb and any trailing negations to the end of the sentence (SOV order). Operates only on ASCII-Latin input; non-Latin scripts and mixed-script text are returned unchanged.
SovReorder.reorder("Ram is studying"); // → "Ram studying is"
SovReorder.reorder("She is eating rice"); // → "She eating rice is"
SovReorder.reorder("I have finished work"); // → "I finished work have"
SovReorder.reorder("They will go home"); // → "They go home will"
SovReorder.reorder("She is not eating"); // → "She eating is not"
SovReorder.reorder("They did not finish"); // → "They finish did not"
// Short sentences / no auxiliary → unchanged
SovReorder.reorder("Ram"); // → "Ram"
SovReorder.reorder("I am happy"); // → "I am happy" ("am" is not in AUXILIARIES)
// Non-Latin → unchanged
SovReorder.reorder("मेरा नाम राम है"); // → "मेरा नाम राम है"
Supported auxiliaries: is, are, was, were, be, been, being, have, has, had, do, does, did, will, would, shall, should, may, might, can, could, must, need, dare.
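A heuristic of this kind can be sketched in a few lines. The version below is an assumption about how such a reorder might work, not the package's implementation: it moves the first auxiliary it finds (and a directly following "not") to the end, and returns non-ASCII input untouched. `SKETCH_AUX` and `sketchReorder` are hypothetical names.

```typescript
// Illustrative sketch of the heuristic; the real implementation may differ.
const SKETCH_AUX = new Set<string>([
  "is", "are", "was", "were", "be", "been", "being", "have", "has", "had",
  "do", "does", "did", "will", "would", "shall", "should", "may", "might",
  "can", "could", "must", "need", "dare",
]);

function sketchReorder(text: string): string {
  // Only handle ASCII input; anything else is returned unchanged.
  if (!/^[\x00-\x7F]*$/.test(text)) return text;
  const words = text.split(" ").filter((w) => w.length > 0);
  const auxIndex = words.findIndex((w) => SKETCH_AUX.has(w.toLowerCase()));
  if (auxIndex === -1 || words.length < 2) return text;

  const aux = words[auxIndex] as string;
  const next = words[auxIndex + 1];
  // A "not" directly after the auxiliary travels with it.
  const negation = next !== undefined && next.toLowerCase() === "not" ? next : undefined;
  const skip = negation === undefined ? 1 : 2;

  const rest = words.slice(0, auxIndex).concat(words.slice(auxIndex + skip));
  const tail = negation === undefined ? [aux] : [aux, negation];
  return rest.concat(tail).join(" ");
}

const reordered = sketchReorder("She is not eating");
```

Because "am" is absent from the set, "I am happy" passes through this sketch unchanged, in line with the examples above.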
Note: "am" is intentionally excluded from the auxiliaries set. It also occurs as a prefix of other common words, so leaving it out prevents false reorders.
Tokenizer
import { Tokenizer } from "@mera-vansh/ms-ltd";
// Normalise: NFKC + lowercase + strip zero-width chars
Tokenizer.normalize("NAMASTE\u200C"); // → "namaste"
// Tokenise: Unicode-aware word extraction (handles all scripts)
Tokenizer.tokenize("मेरा गोत्र भारद्वाज है");
// → ["मेरा", "गोत्र", "भारद्वाज", "है"]
Tokenizer.tokenize("மகிழ்ச்சி நன்றி");
// → ["மகிழ்ச்சி", "நன்றி"]
// Stem: lightweight suffix stripping
Tokenizer.stem("जाता"); // → "जा" (Hindi -ता rule)
Tokenizer.stem("running"); // → "runn" (English -ing rule)
Tokenizer.stem("trees"); // → "tree" (English -s rule)
// Full pipeline: tokenise + stem
Tokenizer.tokenizeAndStem("running trees called");
// → ["runn", "tree", "call"]
Hindi stemming rules (longest-first): -ाना, -ता, -ते, -ती, -ना, -ने, -कर
English stemming rules (longest-first): -tion (len≥7), -ing (len≥6), -ed (len≥5), -s (len≥5)
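The normalisation step and the longest-first rule lists can be sketched as follows. The rule tables are copied from the lists above; everything else is an assumption, including the helper names `sketchNormalize` and `sketchStem`, the choice to try English rules before Hindi ones, and the guard that never strips an entire word.

```typescript
// Illustrative sketch: rule tables copied from the lists above; details assumed.
function sketchNormalize(text: string): string {
  // NFKC, lowercase, then strip zero-width characters (ZWSP, ZWNJ, ZWJ, BOM).
  return text.normalize("NFKC").toLowerCase().replace(/[\u200B-\u200D\uFEFF]/g, "");
}

// English rules, longest suffix first, each with a minimum word length.
const EN_RULES: { suffix: string; minLen: number }[] = [
  { suffix: "tion", minLen: 7 },
  { suffix: "ing", minLen: 6 },
  { suffix: "ed", minLen: 5 },
  { suffix: "s", minLen: 5 },
];

// Hindi suffixes, longest first. The first entry is -ाना written with escapes
// because it starts with a combining vowel sign.
const HI_SUFFIXES: string[] = ["\u093E\u0928\u093E", "ता", "ते", "ती", "ना", "ने", "कर"];

function sketchStem(word: string): string {
  for (const rule of EN_RULES) {
    if (word.length >= rule.minLen && word.endsWith(rule.suffix)) {
      return word.slice(0, word.length - rule.suffix.length);
    }
  }
  for (const suffix of HI_SUFFIXES) {
    // Assumption: never strip the whole word.
    if (word.length > suffix.length && word.endsWith(suffix)) {
      return word.slice(0, word.length - suffix.length);
    }
  }
  return word;
}

const stems = ["running", "trees", "called"].map(sketchStem);
```

The first matching rule wins, which is why the order of each list matters: "called" is caught by -ed before the -s rule is ever tried.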
TF-IDF Vectorizer
Use TFIDFVectorizer directly when you need vector representations outside of the full LTD pipeline.
import { TFIDFVectorizer, cosineSimilarity } from "@mera-vansh/ms-ltd";
const v = new TFIDFVectorizer();
// Fit on corpus
v.fit([
"gotra bharadwaj family",
"gotra kashyap family",
"mata pita relation"
]);
// Transform a query
const queryVec = v.transform("gotra bharadwaj");
// Transform a document
const docVec = v.transform("gotra kashyap family");
// Compute similarity
const sim = cosineSimilarity(queryVec, docVec);
console.log(sim); // 0.0 – 1.0
// IDF inspection
v.getIDF("gotra"); // lower IDF (appears in 2/3 docs)
v.getIDF("bharadwaj"); // higher IDF (appears in 1/3 docs)
v.getIDF("xyz"); // null (OOV)
// Serialise
const state = v.exportState();
// { vocab: [[term, idx], ...], idf: [[term, score], ...], docCount: 3 }
const v2 = new TFIDFVectorizer();
v2.importState(state); // restore without re-fitting
MemoryStore
MemoryStore is the retrieval layer used internally by LTD. Use it directly for custom retrieval pipelines.
import { MemoryStore } from "@mera-vansh/ms-ltd";
const store = new MemoryStore();
store.ingest([
{ id: "d1", text: "gotra bharadwaj", metadata: { type: "gotra" } },
{ id: "d2", text: "mata pita relation", metadata: { type: "family" } },
]);
// Retrieve
const results = store.retrieve("gotra family", 3);
// results[0].id === "d1", results[0].score > 0
// Add incrementally
store.add({ id: "d3", text: "gotra kashyap" });
// Feedback
store.adjustWeight("d1", "positive"); // weight → 1.1
store.adjustWeight("d2", "negative"); // weight → 0.9
// Inspect
store.size(); // → 3
store.has("d1"); // → true
// Remove
store.remove("d3"); // → true
// Persist
const snapshot = store.export();
const json = JSON.stringify(snapshot);
// Restore
const store2 = new MemoryStore();
store2.import(JSON.parse(json));
Persistence & Serialisation
All state (vocabulary, document vectors, weights) is serialisable to plain JSON. This enables integration with any database.
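A Map does not survive JSON.stringify on its own, which is why Map-based state is typically exported as entry arrays. The sketch below shows that round trip in isolation; `SketchSnapshot`, `toSnapshot`, and `fromSnapshot` are hypothetical names, and the snapshot shape is a simplified stand-in for the real exported state.

```typescript
// Illustrative sketch: how a Map-based state can round-trip through JSON.
type SketchVector = Map<string, number>;

interface SketchSnapshot {
  idf: [string, number][];
  docCount: number;
}

// Convert the Map to an array of [key, value] pairs before serialising.
function toSnapshot(idf: SketchVector, docCount: number): SketchSnapshot {
  return { idf: Array.from(idf.entries()), docCount };
}

// Rebuild the Map from the entry array after parsing.
function fromSnapshot(snapshot: SketchSnapshot): SketchVector {
  return new Map(snapshot.idf);
}

const idfBefore: SketchVector = new Map([["gotra", 1.29], ["bharadwaj", 1.69]]);

// Serialising the Map directly loses its contents.
const naiveJson = JSON.stringify(idfBefore);

// Serialising the entry-array snapshot keeps them.
const wire = JSON.stringify(toSnapshot(idfBefore, 3));
const idfAfter = fromSnapshot(JSON.parse(wire) as SketchSnapshot);
```

Here `naiveJson` is the string "{}", while `idfAfter` holds the same two entries as `idfBefore`.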
MongoDB example
import { LTD } from "@mera-vansh/ms-ltd";
import { MongoClient } from "mongodb";
const client = new MongoClient(process.env.MONGODB_URI!);
const col = client.db("myapp").collection("ltd_brain");
// Save
const ltd = new LTD();
ltd.ingest(myDocs);
await col.replaceOne({ _id: "v1" }, { _id: "v1", ...ltd.export() }, { upsert: true });
// Load
const saved = await col.findOne({ _id: "v1" });
const ltd2 = new LTD();
if (saved) ltd2.import(saved);
File system example
import { writeFileSync, readFileSync } from "fs";
// Save
writeFileSync("brain.json", JSON.stringify(ltd.export(), null, 2));
// Load
const ltd2 = new LTD();
ltd2.import(JSON.parse(readFileSync("brain.json", "utf8")));
Recipes
Multilingual FAQ bot
import { LTD } from "@mera-vansh/ms-ltd";
const bot = new LTD({ defaultTopK: 3 });
bot.ingest([
{ id: "faq-gotra-en", text: "What is gotra? gotra is a clan lineage system", metadata: { answer: "Gotra is a patrilineal clan system in Hindu tradition." } },
{ id: "faq-gotra-hi", text: "गोत्र क्या है गोत्र वंश परंपरा", metadata: { answer: "गोत्र एक पितृवंशीय कुल परंपरा है।" } },
{ id: "faq-nakshatra", text: "nakshatra birth star lunar mansion", metadata: { answer: "Nakshatra is the lunar mansion at the time of birth." } },
]);
function ask(userInput: string) {
const res = bot.call(userInput);
if (res.confidence < 0.1) {
return "Sorry, I don't have information on that yet.";
}
const top = res.candidates[0]!;
// Positive feedback on use
bot.feedback(top.id, "positive");
return top.metadata["answer"] as string;
}
ask("gotra kya hota hai?"); // likely matches faq-gotra-en via the shared Latin token "gotra"; the Devanagari entry needs a Devanagari query
ask("what is a nakshatra?"); // returns the English nakshatra answer
Emotion-aware response routing
import { LTD } from "@mera-vansh/ms-ltd";
const ltd = new LTD();
ltd.ingest(myKnowledgeBase);
function handleMessage(userText: string) {
const { emotion, tone, candidates, confidence } = ltd.call(userText);
if (emotion === "GRIEF") {
return { type: "condolence", message: "I'm so sorry for your loss." };
}
if (emotion === "CONFUSION") {
return { type: "clarify", message: "Let me explain that more clearly." };
}
if (tone === "URGENT") {
return { type: "priority", answer: candidates[0] };
}
return { type: "standard", answer: candidates[0], confidence };
}
Lexicon-powered autocomplete
import { LTD } from "@mera-vansh/ms-ltd";
const ltd = new LTD();
// Suggest as user types (Devanagari)
ltd.suggest("नमस्", 5).map(r => r.entry.text);
// → ["नमस्ते", "नमस्कार", ...]
// Suggest from IAST romanization
ltd.suggest("rāmā", 3).map(r => ({
text: r.entry.text,
gloss: r.entry.gloss,
lang: r.entry.lang,
}));
// → [{ text: "रामायण", gloss: "the Ramayana", lang: "hi" }, ...]
// Suggest kinship terms
ltd.suggest("dādā").map(r => r.entry.gloss);
// → ["paternal grandfather", ...]
Sanskrit inflection pipeline
import { VibhaktiEngine, GenderAgreement, Transliterator } from "@mera-vansh/ms-ltd";
// Inflect a noun
const nom = VibhaktiEngine.inflect("rām", "a_m", 1, "sg");
console.log(nom.form); // → "rāmaḥ"
console.log(nom.kāraka); // → "kartā"
// Generate a full paradigm
const forms = VibhaktiEngine.paradigm("sīt", "aa_f");
forms.forEach(f => console.log(`${f.vibhakti}/${f.vacana}: ${f.form}`));
// Agree an adjective
GenderAgreement.agreeAdjective("sundarā", "f", "sg", "sa"); // → "sundarā"
// Transliterate to Devanagari
Transliterator.iastToScript("rāmaḥ", "Devanagari"); // → "रआमअः"
Script-aware input routing
import { ScriptDetector, detectTone } from "@mera-vansh/ms-ltd";
function classifyInput(text: string) {
const script = ScriptDetector.detectScript(text);
const lang = ScriptDetector.detectLanguage(text);
const tone = detectTone(text);
return { script, lang, tone };
}
classifyInput("aap ka gotra kya hai?");
// → { script: "Latin", lang: "en", tone: "FORMAL" }
classifyInput("आप का गोत्र क्या है?");
// → { script: "Devanagari", lang: "hi", tone: "FORMAL" }
Building a domain classifier
import { LTD } from "@mera-vansh/ms-ltd";
const classifier = new LTD();
// Seed each domain with representative phrases
classifier.ingest([
{ id: "astro-1", text: "nakshatra rashi horoscope kundali", metadata: { domain: "astrology" } },
{ id: "astro-2", text: "नक्षत्र राशि कुंडली ज्योतिष", metadata: { domain: "astrology" } },
{ id: "gotra-1", text: "gotra pravara rishi lineage clan", metadata: { domain: "gotra" } },
{ id: "gotra-2", text: "गोत्र प्रवर ऋषि वंश कुल", metadata: { domain: "gotra" } },
{ id: "ritl-1", text: "vivah puja samskara ritual ceremony", metadata: { domain: "ritual" } },
]);
function classify(userInput: string) {
const { candidates, confidence } = classifier.call(userInput);
if (confidence < 0.15) return "unknown";
return candidates[0]?.metadata["domain"] ?? "unknown";
}
classify("my nakshatra is Rohini"); // → "astrology"
classify("bharadwaj gotra mein vivah"); // → "gotra" or "ritual"
classify("random unrelated words xyz"); // → "unknown"
TypeScript Types
import type {
// Core types
LTDOptions,
LTDResponse,
LTDCandidate,
LTDState,
// Document types
IngestDocument,
VectorEntry,
FeedbackSignal, // "positive" | "negative" | "neutral"
// Classification types
Emotion, // "REVERENCE" | "JOY" | "GRIEF" | "ANGER" | "CONFUSION" | "NEUTRAL"
Tone, // "REVERENTIAL" | "FORMAL" | "URGENT" | "CURIOUS" | "INFORMAL" | "NEUTRAL"
Script, // "Devanagari" | "Tamil" | ... | "Mixed" | "Unknown"
LangCode, // "en" | "hi" | "mr" | "bn" | ... (18 codes)
// Rule types (for custom extensions)
EmotionRule,
ToneRule,
// Lexicon types
LexiconEntry, // { text, lang, romanized, gloss, category, subcategory? }
LexiconCategory, // "salutation" | "kinship" | "emotion_rasa" | "geography" | "literature" | "time" | "number"
LexiconSuggestion, // { entry: LexiconEntry, matchedPrefix: string }
// Grammar types
Vibhakti, // 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 (Sanskrit cases)
Vacana, // "sg" | "du" | "pl"
Linga, // "m" | "f" | "n"
StemClass, // "a_m" | "aa_f" | "i_m" | "ii_f" | "u_m" | "cons"
InflectedForm, // { form, vibhakti, vacana, linga, kāraka }
TranslitScheme, // "Devanagari" | "Bengali" | "Tamil" | "Telugu" | ...
// Vector type
SparseVector, // Map<string, number>
} from "@mera-vansh/ms-ltd";
Design Principles
| Principle | Detail |
|---|---|
| Zero dependencies | No runtime npm packages; only @types/node as devDependency |
| Deterministic | Every output is a direct function of rules; no sampling, no randomness |
| Multilingual | Unicode-first; works correctly with all Indic vowel signs (Mc/Mn marks) |
| Sparse representation | Map<string, number> — only non-zero terms stored |
| Feedback-weighted | Retrieval scores multiplied by per-document weights [0.1 – 3.0] |
| JSON-safe state | All Map instances serialised as [key, value][] arrays |
| Stateless utilities | Tokenizer, ScriptDetector, Transliterator, VibhaktiEngine, GenderAgreement, SovReorder are fully static — no instantiation needed |
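The feedback-weighted row in the table above implies a multiplicative update that is clamped to the [0.1, 3.0] range. A minimal sketch of such an update follows; `nextWeight` and `SketchSignal` are hypothetical names, and the factors 1.1 and 0.9 are taken from the feedback documentation in this README.

```typescript
// Illustrative sketch: multiplicative update with clamping to [0.1, 3.0].
type SketchSignal = "positive" | "negative" | "neutral";

const MIN_WEIGHT = 0.1;
const MAX_WEIGHT = 3.0;

function nextWeight(current: number, signal: SketchSignal): number {
  const factor = signal === "positive" ? 1.1 : signal === "negative" ? 0.9 : 1.0;
  return Math.min(MAX_WEIGHT, Math.max(MIN_WEIGHT, current * factor));
}

// Twelve consecutive positive signals starting from 1.0 reach the cap,
// because 1.1 to the power of 12 is already above 3.0.
let boosted = 1.0;
for (let i = 0; i < 12; i++) {
  boosted = nextWeight(boosted, "positive");
}
```

After the loop `boosted` is exactly 3.0, and further positive signals leave it there. The same clamp keeps repeated negative signals from pushing a weight below 0.1.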
IDF Formula
Uses the scikit-learn smooth IDF variant to avoid zero-IDF for universal terms:
IDF(t) = log((N + 1) / (df(t) + 1)) + 1
Where N = corpus size and df(t) = number of documents containing term t. Because df(t) ≤ N, the logarithm is never negative, so the +1 offset ensures IDF ≥ 1.0 for all terms.
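A quick worked example of the formula, assuming the natural logarithm (as scikit-learn uses) and the three-document corpus from the TF-IDF Vectorizer section. `smoothIdf` is a hypothetical standalone helper, not a package export.

```typescript
// Smooth IDF as given above, using the natural logarithm (an assumption).
function smoothIdf(corpusSize: number, docFrequency: number): number {
  return Math.log((corpusSize + 1) / (docFrequency + 1)) + 1;
}

const idfCommon = smoothIdf(3, 2);    // term in 2 of 3 documents: ln(4/3) + 1
const idfRare = smoothIdf(3, 1);      // term in 1 of 3 documents: ln(2) + 1
const idfUniversal = smoothIdf(3, 3); // term in every document: ln(1) + 1
```

This gives roughly 1.288 for the common term, 1.693 for the rare term, and exactly 1 for a term present in every document, so rarer terms score higher and no term drops below 1.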
Known Limitations
- Stemming is suffix-stripping only (not full morphological analysis). Hindi and English are supported; other languages are tokenised but not stemmed.
- Language detection is heuristic (seed-word based). Short inputs or texts without seed words may fall back to the script's default language.
- Regex word boundaries (\b) are ASCII-only in standard JavaScript. Tone/emotion patterns targeting Devanagari-only text use keyword substring matching, not boundary-anchored regex.
- Mixed-script inputs (script === "Mixed") return lang === null.
- iastToScript outputs standalone vowel characters, not matra vowel signs. For properly-joined syllabic Devanagari, a syllabification pass over the output is required.
- devanagariToIast handles standalone Devanagari characters only; matra vowel signs (U+093E–U+094C) pass through unchanged.
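The word-boundary limitation can be reproduced in a few lines. The sketch below also shows one possible workaround based on Unicode property lookarounds. The wholeWord helper is hypothetical, and nothing here implies that ms-ltd uses it; per the list above, the library's Devanagari patterns rely on substring matching.

```typescript
// \b only treats [A-Za-z0-9_] as word characters, so it never fires
// between two Devanagari characters or between Devanagari and a space.
const asciiBoundary = /\bराम\b/;

// Hypothetical helper: emulate a word boundary with Unicode property
// lookarounds (letters, combining marks, and digits count as "word" characters).
function wholeWord(keyword: string): RegExp {
  const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(
    `(?<![\\p{L}\\p{M}\\p{N}])${escaped}(?![\\p{L}\\p{M}\\p{N}])`,
    "u"
  );
}

const ram = wholeWord("राम");

// false: there is no ASCII word character on either side of the keyword
const boundaryHit = asciiBoundary.test("जय राम");
// true: plain substring matching also fires inside a longer word
const substringHit = "रामायण".includes("राम");
// true: the keyword stands alone after the space
const wholeWordHit = ram.test("जय राम");
// false: the keyword is followed by the vowel sign ा (a combining mark)
const wholeWordMiss = ram.test("रामायण");
```

Including \p{M} in the lookarounds matters for Indic scripts, because vowel signs are combining marks rather than letters.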
Development
This package is part of the mera-vansh monorepo.
# Build
pnpm build
# Type-check
pnpm type-check
# Lint
pnpm lint
# Test
pnpm test
# Test with coverage
pnpm test:coverage
# Dry-run publish check (verify no src/ leaked)
pnpm pack --dry-run
Testing
# Run full Vitest suite (~1179 test cases)
pnpm test
# Watch mode
pnpm test:watch
# Coverage report
pnpm test:coverage
# Legacy integration tests (112 assertions)
pnpm test:legacy
Test files live in test/, one file per module:
test/
├── math.similarity.test.ts cosineSimilarity
├── math.vectorizer.test.ts TFIDFVectorizer
├── nlp.tokenizer.test.ts Tokenizer
├── nlp.detector.test.ts ScriptDetector (+ Sprint 1 whole-word tests)
├── nlp.trie.test.ts LexiconTrie
├── grammar.transliterator.test.ts Transliterator
├── grammar.vibhakti.test.ts VibhaktiEngine
├── grammar.agreement.test.ts GenderAgreement
├── grammar.sovreorder.test.ts SovReorder
├── rules.emotion.test.ts detectEmotion + EMOTION_RULES (+ Sprint 2 devotional keywords)
├── rules.tone.test.ts detectTone + TONE_RULES
├── rules.lexicon.test.ts LEXICON
├── storage.memorystore.test.ts MemoryStore
├── engine.ltd.test.ts LTD (end-to-end)
├── engine.suggest.test.ts LTD.suggest()
└── test-ltd.ts Legacy integration runner
Contributing
See CONTRIBUTING.md at the monorepo root.
- Fork and clone
- pnpm install from the monorepo root
- pnpm --filter @mera-vansh/ms-ltd test to run the test suite
- pnpm --filter @mera-vansh/ms-ltd type-check to run the type checker
All rules (emotion-rules.ts, tone-rules.ts) and seed keywords (detector.ts) are plain TypeScript arrays, so they can be extended without touching core logic.
Changelog
2.0.0 (2026-03-24)
- Deterministic NLP pipeline replacing probabilistic Naive Bayes
- TF-IDF vectoriser with sklearn smooth IDF
- Multilingual tokeniser with \p{L}\p{N}\p{M} Unicode regex
- Script detection across 10 Unicode blocks
Sprint 1 — Language detection hardening
- Whole-word tokenization in ScriptDetector.detectLanguage(): seeds are now matched as complete tokens (no false substring matches)
- Devanagari disambiguation: Hindi / Marathi / Nepali / Sanskrit / Maithili / Konkani correctly separated
- Bengali / Assamese and Urdu / Sindhi disambiguation
- Removed "के" from Hindi seeds (shared with Nepali interrogative)
Sprint 2 — Devotional REVERENCE keywords
- Added 8 Hindu devotional invocations to the REVERENCE emotion: हरे कृष्ण, राम राम, जय श्री राम, हर हर महादेव, जय माता दी, ॐ नमः शिवाय, जय जय, ॐ
Sprint 6 — Sanskrit grammar tools
- Transliterator: IAST ↔ Indic script mapping (9 scripts, phoneme-level)
- VibhaktiEngine: Sanskrit nominal inflection (8 cases × 3 numbers = 24 forms per paradigm, across 6 stem classes)
- GenderAgreement: Hindi/Marathi/Sanskrit adjective agreement + honorific pronoun detection
- SovReorder: English SVO → SOV word-order reordering
Sprint 7 — Lexicon + autocomplete
- LEXICON: ~2200 curated entries across 7 categories (salutation, kinship, emotion_rasa, geography, literature, time, number) in 18 languages with IAST romanizations
- LexiconTrie: Unicode BFS prefix trie for autocomplete over any LexiconEntry set
- LTD.suggest(): lazily-built trie search over the built-in LEXICON
Test suite
- 7 new test files covering all Sprint 6/7 additions
- 2 existing test files augmented (detector + emotion)
- Total: ~1179 Vitest test cases
License
GPL-3.0 © Mera Vansh — dwivna
See LICENSE for the full license text.
