nepali-nlp-pro-max
v1.0.0
nepali-nlp-pro-max
नेपाली एनएलपी – pro-max edition
Nepali natural language processing utilities. Devanagari normalizer, sentence and word tokenizer, curated stopwords (220+), light stemmer, script detection, and embedded number-word extraction.
✨ Highlights
- 🇳🇵 Nepali-first – every function is tuned for Devanagari and the realities of Nepali text (poorna virama, ZWNJ pollution, mixed-script content)
- 🧹 Production-grade normalizer – NFC + ZWNJ/ZWJ stripping + whitespace collapse, the three things every Nepali pipeline rewrites
- ✂️ Tokenizers – sentences (।/॥/?/!), words, and Unicode-safe character iteration
- 🛑 Curated stopwords – 220+ pronouns, particles, postpositions, auxiliaries; extendable / overridable
- 🪓 Light stemmer – strips case markers (-ले, -को, -मा, -लाई, -बाट, …) and plural -हरू with safety guards
- 🔍 Script detection – isDevanagari, containsDevanagari, mixedScriptRatio for routing decisions
- 🔢 Number-word extraction – find २ लाख ५ हजार inside arbitrary text and convert to BigInt
- 📦 Zero deps · ESM + CJS · TypeScript-first · tree-shakeable
📦 Install
npm install nepali-nlp-pro-max
pnpm add nepali-nlp-pro-max
yarn add nepali-nlp-pro-max
bun add nepali-nlp-pro-max

⚡ Quick Start
import {
normalize,
tokenizeSentences,
tokenizeWords,
removeStopwords,
stem,
isDevanagari,
detectScript,
extractNumbers,
} from "nepali-nlp-pro-max";
const text = "म नेपालमा बस्छु। तपाईंलाई कस्तो छ?";
normalize(text); // ZWNJ-stripped, NFC, whitespace-collapsed
tokenizeSentences(text);
// ["म नेपालमा बस्छु।", "तपाईंलाई कस्तो छ?"]
tokenizeWords("म नेपालमा बस्छु।");
// ["म", "नेपालमा", "बस्छु"]
removeStopwords(["म", "नेपालमा", "बस्छु"]);
// ["नेपालमा", "बस्छु"]
stem("नेपालमा"); // "नेपाल"
stem("किताबहरू"); // "किताब"
stem("मानिसहरूले"); // "मानिस"
isDevanagari("नेपाल"); // true
detectScript("Hi नमस्ते"); // "mixed"
extractNumbers("मलाई २ लाख ५ हजार चाहिन्छ");
// [{ value: 205000n, raw: "२ लाख ५ हजार", start: 6, end: 18 }]

🧠 Mental Model
| Rule | Why |
|---|---|
| Always normalize first. ZWNJ/ZWJ + decomposed forms break exact-match search and indexing. | One normalize() call up-front avoids dozens of false misses. |
| Sentence boundary = । / ॥ / ? / ! followed by whitespace or EOS. | Latin-style . is too noisy in mixed text. |
| Stemmer is suffix-strip, not full morphology. | A real Nepali morphological analyser is research-grade. The light stemmer covers the 90% case (case markers + plural). |
| Stopwords are a starting point, not gospel. | Pass extra / exclude per app тАФ news search wants different filtering than chat moderation. |
| Number extraction is greedy. | extractNumbers walks the text and captures the longest valid run starting at each position. |
🧰 Full API
| Function | Description |
|---|---|
| normalize(text, opts?) | NFC + strip ZWNJ/ZWJ + collapse whitespace |
| stripZeroWidth(text) | Remove ZWNJ (U+200C) / ZWJ (U+200D) only |
| toNFC(text) | Apply Unicode NFC only |
NormalizeOptions: { stripZeroWidth?, collapseWhitespace?, nfc? }
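The three normalization steps compose in a fixed order: NFC first, then zero-width stripping, then whitespace collapse. A minimal standalone sketch of that contract (a hypothetical re-implementation, not the package's actual code) could look like:

```typescript
// Hypothetical sketch of the normalize() contract described above.
const ZERO_WIDTH = /[\u200C\u200D]/g; // ZWNJ (U+200C), ZWJ (U+200D)

function normalizeSketch(text: string): string {
  return text
    .normalize("NFC")         // compose any decomposed Devanagari forms
    .replace(ZERO_WIDTH, "")  // strip invisible joiners that break exact-match search
    .replace(/\s+/g, " ")     // collapse runs of whitespace
    .trim();
}

// "नेपाल" typed with a stray ZWNJ plus a doubled space:
normalizeSketch("नेपा\u200Cल  राम्रो"); // "नेपाल राम्रो"
```

The order matters: stripping joiners before NFC would still work here, but collapsing whitespace must come last so that removed characters cannot leave doubled spaces behind.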
| Function | Description |
|---|---|
| isDevanagari(s) | Every non-whitespace, non-punct char is Devanagari |
| containsDevanagari(s) | At least one Devanagari char |
| containsLatin(s) | At least one Latin (A-Z / a-z) char |
| detectScript(s) | "devanagari" \| "latin" \| "mixed" \| "none" |
| mixedScriptRatio(s) | Devanagari fraction over (Devanagari + Latin) chars |
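The script-detection contract in the table above can be sketched with two Unicode ranges: the Devanagari block (U+0900–U+097F) and basic Latin letters. The function names below are hypothetical stand-ins, not the package's exports:

```typescript
// Hypothetical sketch of detectScript / mixedScriptRatio.
const DEVANAGARI = /[\u0900-\u097F]/;
const LATIN = /[A-Za-z]/;

function detectScriptSketch(s: string): "devanagari" | "latin" | "mixed" | "none" {
  const hasDev = DEVANAGARI.test(s);
  const hasLat = LATIN.test(s);
  if (hasDev && hasLat) return "mixed";
  if (hasDev) return "devanagari";
  if (hasLat) return "latin";
  return "none";
}

function mixedScriptRatioSketch(s: string): number {
  let dev = 0;
  let lat = 0;
  for (const ch of s) {           // iterates code points, not UTF-16 units
    if (DEVANAGARI.test(ch)) dev++;
    else if (LATIN.test(ch)) lat++;
  }
  return dev + lat === 0 ? 0 : dev / (dev + lat);
}

detectScriptSketch("Hi नमस्ते");     // "mixed"
mixedScriptRatioSketch("Hi नमस्ते"); // 0.75 (6 Devanagari / 8 counted chars)
```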
| Function | Description |
|---|---|
| tokenizeSentences(text) | Split on ред / рее / ? / ! |
| tokenizeWords(text) | Split on whitespace + Devanagari/Latin punct |
| tokenizeCharacters(text) | Iterate Unicode code points |
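The splitting rules in the table above can be sketched with two regexes: a lookbehind split after sentence-final marks, and a separator class for words. This is an illustrative stand-in under those assumptions, not the library's actual implementation:

```typescript
// Hypothetical sketch of the sentence/word splitting rules.
function tokenizeSentencesSketch(text: string): string[] {
  // Split after । ॥ ? ! when followed by whitespace; EOS ends the last sentence.
  return text
    .split(/(?<=[।॥?!])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

function tokenizeWordsSketch(text: string): string[] {
  // Whitespace plus common Devanagari/Latin punctuation act as separators.
  return text.split(/[\s।॥?!,.;:"'()]+/).filter((w) => w.length > 0);
}

tokenizeSentencesSketch("म नेपालमा बस्छु। तपाईंलाई कस्तो छ?");
// ["म नेपालमा बस्छु।", "तपाईंलाई कस्तो छ?"]
```

Note the lookbehind keeps the terminator attached to its sentence, which matches the Quick Start output where "बस्छु।" retains its danda.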
| Function / Constant | Description |
|---|---|
| STOPWORDS | Bundled ReadonlySet<string> (220+ entries) |
| isStopword(word, opts?) | Membership check with extend/exclude support |
| removeStopwords(tokens, opts?) | Filter tokens against the active set |
StopwordOptions: { stopwords?, extra?, exclude? }
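The extra/exclude semantics layer on top of the base set: extra entries join it, exclude entries are rescued from it, and filtering runs against the result. A sketch under those assumptions (the tiny base set here is a stand-in for the bundled 220+ entries):

```typescript
// Hypothetical sketch of StopwordOptions layering.
const BASE_STOPWORDS: ReadonlySet<string> = new Set(["म", "र", "तर", "छ"]);

interface StopwordOptionsSketch {
  extra?: string[];   // add app-specific stopwords
  exclude?: string[]; // rescue words the app wants to keep
}

function removeStopwordsSketch(tokens: string[], opts: StopwordOptionsSketch = {}): string[] {
  const active = new Set(BASE_STOPWORDS);
  for (const w of opts.extra ?? []) active.add(w);
  for (const w of opts.exclude ?? []) active.delete(w);
  return tokens.filter((t) => !active.has(t));
}

removeStopwordsSketch(["म", "नेपालमा", "तर", "बस्छु"], { exclude: ["तर"] });
// ["नेपालमा", "तर", "बस्छु"] – "म" is filtered, "तर" survives
```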
| Function / Constant | Description |
|---|---|
| CASE_MARKERS | Default suffix list, longest-first |
| stem(word, opts?) | Strip one matching suffix |
| stemAll(tokens, opts?) | Batch stem |
StemOptions: { suffixes?, minResidue? } – minResidue defaults to 2 (avoids over-stripping single-syllable roots).
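Longest-first matching plus a minResidue guard is the whole trick. A hypothetical sketch (the suffix list here is a small sample, and compound entries like "हरूले" stand in for however the real CASE_MARKERS handles plural-plus-case words such as "मानिसहरूले"):

```typescript
// Hypothetical sketch of suffix-strip stemming with a minResidue guard.
const SUFFIXES = ["हरूले", "हरूको", "हरू", "लाई", "बाट", "ले", "को", "मा"]; // longest first
const MIN_RESIDUE = 2;

function stemSketch(word: string): string {
  for (const suffix of SUFFIXES) {
    if (word.endsWith(suffix)) {
      const residue = word.slice(0, word.length - suffix.length);
      // Guard: never leave fewer than MIN_RESIDUE code points behind.
      if ([...residue].length >= MIN_RESIDUE) return residue;
    }
  }
  return word; // no suffix matched, or stripping would over-shorten the root
}

stemSketch("मानिसहरूले"); // "मानिस"
stemSketch("नेपालमा");    // "नेपाल"
stemSketch("मा");         // "मा" – guard prevents stripping to an empty root
```

Ordering the list longest-first matters: checking "ले" before "हरूले" would leave "मानिसहरू" half-stemmed.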
| Function | Description |
|---|---|
| parseNumberWord(s) | Single number-word string → bigint \| null |
| extractNumbers(text) | Find every embedded number run, return NumberMatch[] |
| NUMBER_WORDS | The 0-99 + scale words map (Devanagari → BigInt) |
NumberMatch: { value: bigint, raw: string, start: number, end: number }
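The parse works like classic number-word evaluation: a unit sets the pending multiplier, a scale word (हजार, लाख, करोड, अर्ब, खर्ब) multiplies it into the running total. A sketch under those assumptions (the unit table below is a tiny stand-in for the real 0-99 NUMBER_WORDS map, and `parseNumberWordSketch` is a hypothetical name):

```typescript
// Hypothetical sketch of Nepali number-word parsing with BigInt.
const UNITS: Record<string, bigint> = { "१": 1n, "२": 2n, "५": 5n, "२५": 25n };
const SCALES: Record<string, bigint> = {
  "हजार": 1_000n,
  "लाख": 100_000n,
  "करोड": 10_000_000n,
  "अर्ब": 1_000_000_000n,
  "खर्ब": 100_000_000_000n,
};

function parseNumberWordSketch(s: string): bigint | null {
  let total = 0n;   // completed scale groups
  let current = 0n; // pending multiplier awaiting a scale word
  for (const token of s.trim().split(/\s+/)) {
    if (token in UNITS) {
      current = UNITS[token];
    } else if (token in SCALES) {
      total += (current === 0n ? 1n : current) * SCALES[token];
      current = 0n;
    } else {
      return null; // not a number word
    }
  }
  return total + current;
}

parseNumberWordSketch("२ लाख ५ हजार");   // 205000n
parseNumberWordSketch("१ खर्ब २५ अर्ब"); // 125000000000n
```

BigInt is the right carrier here: खर्ब-scale values already exceed Number.MAX_SAFE_INTEGER territory once a few groups stack up.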
🎯 Recipes
Pre-process for Elasticsearch / OpenSearch indexing
import { normalize, tokenizeWords, removeStopwords, stemAll } from "nepali-nlp-pro-max";
function indexable(text: string): string[] {
const cleaned = normalize(text);
const words = tokenizeWords(cleaned);
const content = removeStopwords(words);
return stemAll(content);
}
indexable("म नेपालमा बस्छु। नेपाल राम्रो छ।");
// ["नेपाल", "बस्", "नेपाल", "राम्र"]

Route mixed-script content
import { detectScript, mixedScriptRatio } from "nepali-nlp-pro-max";
function pickPipeline(text: string): "ne" | "en" | "both" {
const script = detectScript(text);
if (script === "devanagari") return "ne";
if (script === "latin") return "en";
if (mixedScriptRatio(text) > 0.5) return "ne";
return "both";
}

Extract amounts from news articles
import { extractNumbers } from "nepali-nlp-pro-max";
const article = "बजेटमा सरकारले शिक्षाका लागि १ खर्ब २५ अर्ब छुट्याएको छ।";
const matches = extractNumbers(article);
// [{ value: 125000000000n, raw: "१ खर्ब २५ अर्ब", ... }]

App-specific stopword tuning
import { removeStopwords } from "nepali-nlp-pro-max";
// News search wants "र" out (it's noisy) but keeps "तर" (signals contrast)
removeStopwords(tokens, { exclude: ["तर"] });
// Chat moderation: add domain stopwords
removeStopwords(tokens, { extra: ["हुन्छ", "ठिकै", "हजुर"] });

🤝 Contributing
PRs welcome. Common contributions:
- More stopwords for specific domains (news, legal, technical)
- Additional case-marker variants found in dialect / older text
- Sentence-boundary edge cases (decimals, abbreviations in mixed text)
- Bug fixes, type improvements
npm install
npm test
npm run typecheck
npm run build

📜 License
MIT © 2026 l3lackcurtains
Made with ❤️ for the Nepali developer community.
नेपाली डेभलपर समुदायको लागि बनाइएको।
