nepali-nlp-pro-max
v1.0.0
nepali-nlp-pro-max
नेपाली एनएलपी – pro-max edition
Nepali natural language processing utilities. Devanagari normalizer, sentence and word tokenizer, curated stopwords (220+), light stemmer, script detection, and embedded number-word extraction.
✨ Highlights
- 🇳🇵 Nepali-first – every function is tuned for Devanagari and the realities of Nepali text (poorna virama, ZWNJ pollution, mixed-script content)
- 🧹 Production-grade normalizer – NFC + ZWNJ/ZWJ stripping + whitespace collapse, the three things every Nepali pipeline rewrites
- ✂️ Tokenizers – sentences (।/॥/?/!), words, and Unicode-safe character iteration
- 🛑 Curated stopwords – 220+ pronouns, particles, postpositions, auxiliaries; extendable / overridable
- 🪓 Light stemmer – strips case markers (-ले, -को, -मा, -लाई, -बाट, …) and plural -हरू with safety guards
- 🔍 Script detection – isDevanagari, containsDevanagari, mixedScriptRatio for routing decisions
- 🔢 Number-word extraction – find २ लाख ५ हजार inside arbitrary text and convert to BigInt
- 📦 Zero deps · ESM + CJS · TypeScript-first · tree-shakeable
📦 Install
npm install nepali-nlp-pro-max
pnpm add nepali-nlp-pro-max
yarn add nepali-nlp-pro-max
bun add nepali-nlp-pro-max

⚡ Quick Start
import {
normalize,
tokenizeSentences,
tokenizeWords,
removeStopwords,
stem,
isDevanagari,
detectScript,
extractNumbers,
} from "nepali-nlp-pro-max";
const text = "म नेपालमा बस्छु। तपाईंलाई कस्तो छ?";
normalize(text); // ZWNJ-stripped, NFC, whitespace-collapsed
tokenizeSentences(text);
// ["म नेपालमा बस्छु।", "तपाईंलाई कस्तो छ?"]
tokenizeWords("म नेपालमा बस्छु।");
// ["म", "नेपालमा", "बस्छु"]
removeStopwords(["म", "नेपालमा", "बस्छु"]);
// ["नेपालमा", "बस्छु"]
stem("नेपालमा"); // "नेपाल"
stem("किताबहरू"); // "किताब"
stem("मानिसहरूले"); // "मानिस"
isDevanagari("नेपाल"); // true
detectScript("Hi नमस्ते"); // "mixed"
extractNumbers("मलाई २ लाख ५ हजार चाहिन्छ");
// [{ value: 205000n, raw: "२ लाख ५ हजार", start: 6, end: 18 }]

🧠 Mental Model
| Rule | Why |
|---|---|
| Always normalize first. ZWNJ/ZWJ + decomposed forms break exact-match search and indexing. | One normalize() call up-front avoids dozens of false misses. |
| Sentence boundary = । / ॥ / ? / ! followed by whitespace or EOS. | Latin-style . is too noisy in mixed text. |
| Stemmer is suffix-strip, not full morphology. | A real Nepali morphological analyser is research-grade. The light stemmer covers the 90% case (case markers + plural). |
| Stopwords are a starting point, not gospel. | Pass extra / exclude per app тАФ news search wants different filtering than chat moderation. |
| Number extraction is greedy. | extractNumbers walks the text and captures the longest valid run starting at each position. |
🧰 Full API
| Function | Description |
|---|---|
| normalize(text, opts?) | NFC + strip ZWNJ/ZWJ + collapse whitespace |
| stripZeroWidth(text) | Remove ZWNJ (U+200C) / ZWJ (U+200D) only |
| toNFC(text) | Apply Unicode NFC only |
NormalizeOptions: { stripZeroWidth?, collapseWhitespace?, nfc? }
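The three normalization steps compose in a fixed order: NFC first, then zero-width stripping, then whitespace collapse. A minimal standalone sketch of that contract (a hypothetical re-implementation, not the package's actual code) could look like:

```typescript
// Hypothetical sketch of the normalize() contract described above.
const ZERO_WIDTH = /[\u200C\u200D]/g; // ZWNJ (U+200C), ZWJ (U+200D)

function normalizeSketch(text: string): string {
  return text
    .normalize("NFC")         // compose any decomposed Devanagari forms
    .replace(ZERO_WIDTH, "")  // strip invisible joiners that break exact-match search
    .replace(/\s+/g, " ")     // collapse runs of whitespace
    .trim();
}

// "नेपाल" typed with a stray ZWNJ plus a doubled space:
normalizeSketch("नेपा\u200Cल  राम्रो"); // "नेपाल राम्रो"
```

The order matters: stripping joiners before NFC would still work here, but collapsing whitespace must come last so that removed characters cannot leave doubled spaces behind.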
| Function | Description |
|---|---|
| isDevanagari(s) | Every non-whitespace, non-punct char is Devanagari |
| containsDevanagari(s) | At least one Devanagari char |
| containsLatin(s) | At least one Latin (A-Z / a-z) char |
| detectScript(s) | "devanagari" \| "latin" \| "mixed" \| "none" |
| mixedScriptRatio(s) | Devanagari fraction over (Devanagari + Latin) chars |
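The script-detection contract in the table above can be sketched with two Unicode ranges: the Devanagari block (U+0900–U+097F) and basic Latin letters. The function names below are hypothetical stand-ins, not the package's exports:

```typescript
// Hypothetical sketch of detectScript / mixedScriptRatio.
const DEVANAGARI = /[\u0900-\u097F]/;
const LATIN = /[A-Za-z]/;

function detectScriptSketch(s: string): "devanagari" | "latin" | "mixed" | "none" {
  const hasDev = DEVANAGARI.test(s);
  const hasLat = LATIN.test(s);
  if (hasDev && hasLat) return "mixed";
  if (hasDev) return "devanagari";
  if (hasLat) return "latin";
  return "none";
}

function mixedScriptRatioSketch(s: string): number {
  let dev = 0;
  let lat = 0;
  for (const ch of s) {           // iterates code points, not UTF-16 units
    if (DEVANAGARI.test(ch)) dev++;
    else if (LATIN.test(ch)) lat++;
  }
  return dev + lat === 0 ? 0 : dev / (dev + lat);
}

detectScriptSketch("Hi नमस्ते");     // "mixed"
mixedScriptRatioSketch("Hi नमस्ते"); // 0.75 (6 Devanagari / 8 counted chars)
```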
| Function | Description |
|---|---|
| tokenizeSentences(text) | Split on ред / рее / ? / ! |
| tokenizeWords(text) | Split on whitespace + Devanagari/Latin punct |
| tokenizeCharacters(text) | Iterate Unicode code points |
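The splitting rules in the table above can be sketched with two regexes: a lookbehind split after sentence-final marks, and a separator class for words. This is an illustrative stand-in under those assumptions, not the library's actual implementation:

```typescript
// Hypothetical sketch of the sentence/word splitting rules.
function tokenizeSentencesSketch(text: string): string[] {
  // Split after । ॥ ? ! when followed by whitespace; EOS ends the last sentence.
  return text
    .split(/(?<=[।॥?!])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

function tokenizeWordsSketch(text: string): string[] {
  // Whitespace plus common Devanagari/Latin punctuation act as separators.
  return text.split(/[\s।॥?!,.;:"'()]+/).filter((w) => w.length > 0);
}

tokenizeSentencesSketch("म नेपालमा बस्छु। तपाईंलाई कस्तो छ?");
// ["म नेपालमा बस्छु।", "तपाईंलाई कस्तो छ?"]
```

Note the lookbehind keeps the terminator attached to its sentence, which matches the Quick Start output where "बस्छु।" retains its danda.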
| Function / Constant | Description |
|---|---|
| STOPWORDS | Bundled ReadonlySet<string> (220+ entries) |
| isStopword(word, opts?) | Membership check with extend/exclude support |
| removeStopwords(tokens, opts?) | Filter tokens against the active set |
StopwordOptions: { stopwords?, extra?, exclude? }
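The extra/exclude semantics layer on top of the base set: extra entries join it, exclude entries are rescued from it, and filtering runs against the result. A sketch under those assumptions (the tiny base set here is a stand-in for the bundled 220+ entries):

```typescript
// Hypothetical sketch of StopwordOptions layering.
const BASE_STOPWORDS: ReadonlySet<string> = new Set(["म", "र", "तर", "छ"]);

interface StopwordOptionsSketch {
  extra?: string[];   // add app-specific stopwords
  exclude?: string[]; // rescue words the app wants to keep
}

function removeStopwordsSketch(tokens: string[], opts: StopwordOptionsSketch = {}): string[] {
  const active = new Set(BASE_STOPWORDS);
  for (const w of opts.extra ?? []) active.add(w);
  for (const w of opts.exclude ?? []) active.delete(w);
  return tokens.filter((t) => !active.has(t));
}

removeStopwordsSketch(["म", "नेपालमा", "तर", "बस्छु"], { exclude: ["तर"] });
// ["नेपालमा", "तर", "बस्छु"] – "म" is filtered, "तर" survives
```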
| Function / Constant | Description |
|---|---|
| CASE_MARKERS | Default suffix list, longest-first |
| stem(word, opts?) | Strip one matching suffix |
| stemAll(tokens, opts?) | Batch stem |
StemOptions: { suffixes?, minResidue? } – minResidue defaults to 2 (avoids over-stripping single-syllable roots).
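Longest-first matching plus a minResidue guard is the whole trick. A hypothetical sketch (the suffix list here is a small sample, and compound entries like "हरूले" stand in for however the real CASE_MARKERS handles plural-plus-case words such as "मानिसहरूले"):

```typescript
// Hypothetical sketch of suffix-strip stemming with a minResidue guard.
const SUFFIXES = ["हरूले", "हरूको", "हरू", "लाई", "बाट", "ले", "को", "मा"]; // longest first
const MIN_RESIDUE = 2;

function stemSketch(word: string): string {
  for (const suffix of SUFFIXES) {
    if (word.endsWith(suffix)) {
      const residue = word.slice(0, word.length - suffix.length);
      // Guard: never leave fewer than MIN_RESIDUE code points behind.
      if ([...residue].length >= MIN_RESIDUE) return residue;
    }
  }
  return word; // no suffix matched, or stripping would over-shorten the root
}

stemSketch("मानिसहरूले"); // "मानिस"
stemSketch("नेपालमा");    // "नेपाल"
stemSketch("मा");         // "मा" – guard prevents stripping to an empty root
```

Ordering the list longest-first matters: checking "ले" before "हरूले" would leave "मानिसहरू" half-stemmed.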
| Function | Description |
|---|---|
| parseNumberWord(s) | Single number-word string → bigint \| null |
| extractNumbers(text) | Find every embedded number run, return NumberMatch[] |
| NUMBER_WORDS | The 0-99 + scale words map (Devanagari → BigInt) |
NumberMatch: { value: bigint, raw: string, start: number, end: number }
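The parse works like classic number-word evaluation: a unit sets the pending multiplier, a scale word (हजार, लाख, करोड, अर्ब, खर्ब) multiplies it into the running total. A sketch under those assumptions (the unit table below is a tiny stand-in for the real 0-99 NUMBER_WORDS map, and `parseNumberWordSketch` is a hypothetical name):

```typescript
// Hypothetical sketch of Nepali number-word parsing with BigInt.
const UNITS: Record<string, bigint> = { "१": 1n, "२": 2n, "५": 5n, "२५": 25n };
const SCALES: Record<string, bigint> = {
  "हजार": 1_000n,
  "लाख": 100_000n,
  "करोड": 10_000_000n,
  "अर्ब": 1_000_000_000n,
  "खर्ब": 100_000_000_000n,
};

function parseNumberWordSketch(s: string): bigint | null {
  let total = 0n;   // completed scale groups
  let current = 0n; // pending multiplier awaiting a scale word
  for (const token of s.trim().split(/\s+/)) {
    if (token in UNITS) {
      current = UNITS[token];
    } else if (token in SCALES) {
      total += (current === 0n ? 1n : current) * SCALES[token];
      current = 0n;
    } else {
      return null; // not a number word
    }
  }
  return total + current;
}

parseNumberWordSketch("२ लाख ५ हजार");   // 205000n
parseNumberWordSketch("१ खर्ब २५ अर्ब"); // 125000000000n
```

BigInt is the right carrier here: खर्ब-scale values already exceed Number.MAX_SAFE_INTEGER territory once a few groups stack up.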
🎯 Recipes
Pre-process for Elasticsearch / OpenSearch indexing
import { normalize, tokenizeWords, removeStopwords, stemAll } from "nepali-nlp-pro-max";
function indexable(text: string): string[] {
const cleaned = normalize(text);
const words = tokenizeWords(cleaned);
const content = removeStopwords(words);
return stemAll(content);
}
indexable("म नेपालमा बस्छु। नेपाल राम्रो छ।");
// ["नेपाल", "बस्", "नेपाल", "राम्र"]

Route mixed-script content
import { detectScript, mixedScriptRatio } from "nepali-nlp-pro-max";
function pickPipeline(text: string): "ne" | "en" | "both" {
const script = detectScript(text);
if (script === "devanagari") return "ne";
if (script === "latin") return "en";
if (mixedScriptRatio(text) > 0.5) return "ne";
return "both";
}

Extract amounts from news articles
import { extractNumbers } from "nepali-nlp-pro-max";
const article = "बजेटमा सरकारले शिक्षाका लागि १ खर्ब २५ अर्ब छुट्याएको छ।";
const matches = extractNumbers(article);
// [{ value: 125000000000n, raw: "१ खर्ब २५ अर्ब", ... }]

App-specific stopword tuning
import { removeStopwords } from "nepali-nlp-pro-max";
// News search wants "र" out (it's noisy) but keeps "तर" (signals contrast)
removeStopwords(tokens, { exclude: ["तर"] });
// Chat moderation: add domain stopwords
removeStopwords(tokens, { extra: ["हुन्छ", "ठिकै", "हजुर"] });

🤝 Contributing
PRs welcome. Common contributions:
- More stopwords for specific domains (news, legal, technical)
- Additional case-marker variants found in dialect / older text
- Sentence-boundary edge cases (decimals, abbreviations in mixed text)
- Bug fixes, type improvements
npm install
npm test
npm run typecheck
npm run build

📜 License
MIT © 2026 l3lackcurtains
Made with ❤️ for the Nepali developer community.
नेपाली डेभलपर समुदायको लागि बनाइएको।
