naive-bayes-language-detector

v1.3.0

Published

5 days ago

Naive Bayes - NGram language detector

a-aochoa

language-detection naive-bayes nlp text-classification typescript

Language Detector

A Naive Bayes language detector optimized for short, informal text like SMS and chat messages. Built with TypeScript and powered by TF-IDF vectorization and Gaussian Naive Bayes classification.

Supported Languages

| Code | Language   | Flag   | Slang Terms | Regional Support                                                     |
| ---- | ---------- | ------ | ----------- | -------------------------------------------------------------------- |
| en   | English    | 🇺🇸🇬🇧   | ~1,300+     | US, UK, Gen-Z, Gaming, AAVE, texting                                 |
| es   | Spanish    | 🇪🇸🇲🇽   | ~1,800+     | Mexico, Spain, Argentina, Colombia, Venezuela, Chile, Caribbean      |
| fr   | French     | 🇫🇷     | ~300+       | Standard French, SMS abbreviations (mdr, ptdr, slt, tkt)             |
| it   | Italian    | 🇮🇹     | ~350+       | Standard Italian, regional variants (cmq, tvb, xke)                  |
| pt   | Portuguese | 🇧🇷🇵🇹   | ~500+       | Brazilian (pt-BR) & European (pt-PT) Portuguese                      |
| de   | German     | 🇩🇪🇦🇹🇨🇭 | ~400+       | Standard German, Austrian, Swiss German, youth slang (Jugendsprache) |

Total: ~4,600+ slang terms across all languages for improved informal text detection.

Features

✅ Optimized for short text: Works well with SMS and chat messages (1-50 words)
✅ Handles informal language: Supports slang, abbreviations, and texting patterns
✅ Multi-language support: 6 languages with regional variations
✅ Language filtering: Restrict detection to specific languages; text outside the allowed set is flagged by low confidence ("neither" detection)
✅ Fast inference: <5ms per detection, suitable for real-time applications
✅ TypeScript support: Full type definitions included
✅ Slang dictionary fallback: Dictionary lookup helps resolve short or ambiguous inputs
✅ Zero dependencies at runtime: Lightweight and self-contained

Installation

npm install naive-bayes-language-detector

Quick Start

import { getDetector } from 'naive-bayes-language-detector';

// Load the pre-trained model
const detector = getDetector(
   './node_modules/naive-bayes-language-detector/dist/models/language-model.json',
);

// Detect language
const result = detector.detect('Hola, ¿cómo estás?');
console.log(result);
// {
//   language: 'es',
//   confidence: 0.95,
//   isReliable: true,
//   probabilities: { es: 0.95, en: 0.01, fr: 0.02, it: 0.01, pt: 0.005, de: 0.005 },
//   source: 'ml'
// }

// Batch detection
const results = detector.detectBatch(['hello', 'hola', 'bonjour', 'ciao', 'oi']);

API Reference

`getDetector(modelPath: string): LanguageDetector`

Get or create a singleton detector instance.

const detector = getDetector('./models/language-model.json');

`LanguageDetector.detect(text: string): DetectionResult`

Detect the language of a single text.

interface DetectionResult {
   language: string; // Detected language code ('en', 'es', 'fr', 'it', 'pt', 'de')
   confidence: number; // Confidence score (0-1)
   isReliable: boolean; // True if confidence > 0.7
   probabilities?: Record<string, number>; // Probability per language
   source?: 'ml' | 'slang' | 'slang-override' | 'combined';
}
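
As a rough illustration of how a caller might consume this shape, the sketch below copies the interface locally and adds two helpers. `DetectionResultSketch`, `pickLanguage`, and `runnerUp` are hypothetical names that are not part of the package; only the field names come from the interface above.

```typescript
// Local copy of the DetectionResult shape documented above.
interface DetectionResultSketch {
   language: string;
   confidence: number;
   isReliable: boolean;
   probabilities?: Record<string, number>;
   source?: 'ml' | 'slang' | 'slang-override' | 'combined';
}

// Hypothetical helper: trust reliable results, otherwise use a fallback code.
function pickLanguage(result: DetectionResultSketch, fallback: string): string {
   return result.isReliable ? result.language : fallback;
}

// Hypothetical helper: second most probable language, if probabilities exist.
function runnerUp(result: DetectionResultSketch): string | undefined {
   if (!result.probabilities) return undefined;
   const probs = result.probabilities;
   const ranked = Object.keys(probs)
      .filter((code) => code !== result.language)
      .sort((a, b) => probs[b] - probs[a]);
   return ranked[0];
}

// Sample values taken from the Quick Start output.
const sampleResult: DetectionResultSketch = {
   language: 'es',
   confidence: 0.95,
   isReliable: true,
   probabilities: { es: 0.95, en: 0.01, fr: 0.02, it: 0.01, pt: 0.005, de: 0.005 },
   source: 'ml',
};

console.log(pickLanguage(sampleResult, 'en')); // 'es'
console.log(runnerUp(sampleResult)); // 'fr'
```

Because `probabilities` and `source` are optional, callers should guard against their absence, as `runnerUp` does.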

`LanguageDetector.detectBatch(texts: string[]): DetectionResult[]`

Detect languages for multiple texts efficiently.

`LanguageDetector.setAllowedLanguages(languages, options?): this`

Restrict detection to a subset of the supported languages, for example when your application only handles English and Spanish.

// Only detect English or Spanish
detector.setAllowedLanguages(['en', 'es']);

// With fast mode for better performance (skips "neither" detection)
detector.setAllowedLanguages(['en', 'es'], { fastMode: true });

Options:

| Option   | Type    | Default | Description                                                                                      |
| -------- | ------- | ------- | ------------------------------------------------------------------------------------------------ |
| fastMode | boolean | false   | When true, only computes probabilities for allowed languages (faster but no "neither" detection) |

Behavior:

| Scenario                             | fastMode: false (default)         | fastMode: true                    |
| ------------------------------------ | --------------------------------- | --------------------------------- |
| Text matches allowed language        | High confidence, isReliable: true | High confidence, isReliable: true |
| Text doesn't match allowed languages | Low confidence, isReliable: false | High confidence (re-normalized)   |

detector.setAllowedLanguages(['en', 'es']);

// Spanish text - detected correctly
detector.detect('Hola amigo');
// { language: 'es', confidence: 0.95, isReliable: true }

// French text with en/es filter - "neither" case
detector.detect('Bonjour!');
// { language: 'en', confidence: 0.12, isReliable: false }
// Low confidence indicates text doesn't really match allowed languages
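
The difference between the two modes can be pictured as a choice about re-normalization. The sketch below is one plausible reading of the behavior table, not the package's actual code: `restrictToAllowed` and the sample distribution are made up for illustration, and the 0.7 threshold is taken from the `isReliable` comment in `DetectionResult`. It assumes `allowed` is non-empty.

```typescript
type Probabilities = Record<string, number>;

// Pick the most probable language among `allowed`.
// fastMode = true: re-normalize over the allowed languages only, so the winner
//   looks confident even when the text belongs to another language.
// fastMode = false: keep the winner's share of the full distribution, so text
//   in a non-allowed language yields a low confidence ("neither").
function restrictToAllowed(
   probabilities: Probabilities,
   allowed: string[],
   fastMode: boolean,
): { language: string; confidence: number; isReliable: boolean } {
   let best = allowed[0];
   let allowedMass = 0;
   for (const code of allowed) {
      const p = probabilities[code] ?? 0;
      allowedMass += p;
      if (p > (probabilities[best] ?? 0)) best = code;
   }
   const raw = probabilities[best] ?? 0;
   const confidence = fastMode && allowedMass > 0 ? raw / allowedMass : raw;
   return { language: best, confidence, isReliable: confidence > 0.7 };
}

// Made-up distribution for a French input such as 'Bonjour!'.
const frenchLike: Probabilities = { fr: 0.8, en: 0.12, es: 0.03, it: 0.03, pt: 0.01, de: 0.01 };

const strictResult = restrictToAllowed(frenchLike, ['en', 'es'], false);
const fastResult = restrictToAllowed(frenchLike, ['en', 'es'], true);
console.log(strictResult); // 'en' with confidence 0.12, not reliable
console.log(fastResult); // 'en' with confidence about 0.8 (0.12 / 0.15), reliable
```

In other words, fast mode trades the "neither" signal for speed, so it is best suited to inputs already known to be in one of the allowed languages.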

`LanguageDetector.clearAllowedLanguages(): this`

Remove language restrictions and detect all supported languages again.

detector.clearAllowedLanguages();
// Now detects all 6 languages

`LanguageDetector.allowedLanguages: string[] | null`

Get the currently allowed languages. Returns null if all languages are allowed.

`LanguageDetector.fastMode: boolean`

Get whether fast mode is enabled.

`resetDetector(): void`

Reset the singleton instance (useful for testing).
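
For readers unfamiliar with the get-or-create pattern, here is a minimal sketch of how a module-level singleton with a reset hook can work. `SketchDetector`, `getSketchDetector`, and `resetSketchDetector` are stand-ins invented for this example. In particular, whether the real `getDetector` ignores a different `modelPath` on later calls is an assumption of the sketch, not something stated above.

```typescript
// Stand-in for the real detector class; only the constructor argument matters here.
class SketchDetector {
   constructor(public readonly modelPath: string) {}
}

let sharedInstance: SketchDetector | null = null;

// Create the detector on first use, then keep returning the same object.
function getSketchDetector(modelPath: string): SketchDetector {
   if (sharedInstance === null) {
      sharedInstance = new SketchDetector(modelPath);
   }
   return sharedInstance;
}

// Drop the cached instance so the next call builds a fresh one.
function resetSketchDetector(): void {
   sharedInstance = null;
}

const first = getSketchDetector('./models/language-model.json');
const second = getSketchDetector('./models/other-model.json');
console.log(first === second); // true: the second path is ignored in this sketch

resetSketchDetector();
const third = getSketchDetector('./models/other-model.json');
console.log(first === third); // false: the reset forced a new instance
```

In a test suite, calling the reset function before each test keeps state such as allowed languages from leaking between cases.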

How It Works

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Input Text    │ ──▶ │ Text Normalizer │ ──▶ │TF-IDF Vectorizer│
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Detection      │ ◀── │ Slang Detection │ ◀── │ Naive Bayes     │
│  Result         │     │ (fallback)      │     │ Classifier      │
└─────────────────┘     └─────────────────┘     └─────────────────┘

TF-IDF Vectorizer

Converts text to numerical vectors using character n-grams (2-5 characters).

import { TfidfVectorizer } from 'naive-bayes-language-detector';

const vectorizer = new TfidfVectorizer({
   minN: 2, // Minimum n-gram size
   maxN: 5, // Maximum n-gram size
   maxFeatures: 5000, // Vocabulary limit
});

vectorizer.fit(trainingTexts);
const vector = vectorizer.transform('hello world');
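
To make the idea concrete, the following sketch extracts character n-grams and weights them with a smoothed TF-IDF formula, `tf * (ln((1 + N) / (1 + df)) + 1)`. The function names and the exact formula are assumptions chosen for illustration; the package's own vectorizer may differ in weighting, in vocabulary pruning (`maxFeatures` is not modeled here), and in output format.

```typescript
// Extract all character n-grams with lengths from minN to maxN.
function charNgrams(text: string, minN: number, maxN: number): string[] {
   const grams: string[] = [];
   for (let n = minN; n <= maxN; n++) {
      for (let i = 0; i + n <= text.length; i++) {
         grams.push(text.slice(i, i + n));
      }
   }
   return grams;
}

// Count in how many documents each n-gram appears (document frequency).
function fitDocumentFrequencies(docs: string[], minN: number, maxN: number): Map<string, number> {
   const df = new Map<string, number>();
   for (const doc of docs) {
      const seen = new Set<string>();
      for (const gram of charNgrams(doc, minN, maxN)) {
         if (seen.has(gram)) continue;
         seen.add(gram);
         df.set(gram, (df.get(gram) ?? 0) + 1);
      }
   }
   return df;
}

// Smoothed TF-IDF weights for one text; n-grams unseen during fitting are skipped.
function tfidfWeights(
   text: string,
   df: Map<string, number>,
   totalDocs: number,
   minN: number,
   maxN: number,
): Map<string, number> {
   const grams = charNgrams(text, minN, maxN);
   const counts = new Map<string, number>();
   for (const gram of grams) {
      counts.set(gram, (counts.get(gram) ?? 0) + 1);
   }
   const weights = new Map<string, number>();
   counts.forEach((count, gram) => {
      const docFreq = df.get(gram);
      if (docFreq === undefined) return;
      const idf = Math.log((1 + totalDocs) / (1 + docFreq)) + 1;
      weights.set(gram, (count / grams.length) * idf);
   });
   return weights;
}

const corpus = ['hola', 'hello', 'halo'];
const docFreqs = fitDocumentFrequencies(corpus, 2, 2);
const helloWeights = tfidfWeights('hello', docFreqs, corpus.length, 2, 2);

console.log(charNgrams('hola', 2, 3)); // [ 'ho', 'ol', 'la', 'hol', 'ola' ]
// 'he' occurs in one document and 'lo' in two, so 'he' receives the larger weight.
console.log(helloWeights.get('he'), helloWeights.get('lo'));
```

Character n-grams suit short, informal text because they capture spelling patterns even when whole words are abbreviated or misspelled.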

Naive Bayes Classifier

Gaussian Naive Bayes classifier for language prediction.

import { NaiveBayesClassifier } from 'naive-bayes-language-detector';

const classifier = new NaiveBayesClassifier();
classifier.fit(vectors, labels);
const prediction = classifier.predict(vector);
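
For intuition, a Gaussian Naive Bayes model stores a prior plus a per-feature mean and variance for each label, and predicts the label with the highest log prior plus summed Gaussian log-densities. The sketch below shows that textbook procedure on toy two-dimensional vectors. `fitGaussianNB`, `predictGaussianNB`, and the fixed `epsilon` variance smoothing are assumptions for this example and do not describe the internals of `NaiveBayesClassifier`.

```typescript
interface ClassStats {
   prior: number; // fraction of training rows with this label
   means: number[]; // per-feature mean
   variances: number[]; // per-feature variance plus smoothing
}

// Estimate a prior plus per-feature mean and variance for every label.
function fitGaussianNB(vectors: number[][], labels: string[], epsilon = 1e-9): Record<string, ClassStats> {
   const grouped: Record<string, number[][]> = {};
   labels.forEach((label, i) => {
      (grouped[label] = grouped[label] || []).push(vectors[i]);
   });
   const model: Record<string, ClassStats> = {};
   for (const label of Object.keys(grouped)) {
      const rows = grouped[label];
      const dims = rows[0].length;
      const means: number[] = [];
      const variances: number[] = [];
      for (let d = 0; d < dims; d++) {
         const mean = rows.reduce((sum, row) => sum + row[d], 0) / rows.length;
         const variance = rows.reduce((sum, row) => sum + (row[d] - mean) ** 2, 0) / rows.length;
         means.push(mean);
         variances.push(variance + epsilon); // avoid division by zero
      }
      model[label] = { prior: rows.length / vectors.length, means, variances };
   }
   return model;
}

// Score each label with log prior + sum of Gaussian log-densities; return the best.
function predictGaussianNB(model: Record<string, ClassStats>, vector: number[]): string {
   let bestLabel = '';
   let bestScore = -Infinity;
   for (const label of Object.keys(model)) {
      const { prior, means, variances } = model[label];
      let score = Math.log(prior);
      vector.forEach((x, d) => {
         score += -0.5 * Math.log(2 * Math.PI * variances[d]) - (x - means[d]) ** 2 / (2 * variances[d]);
      });
      if (score > bestScore) {
         bestScore = score;
         bestLabel = label;
      }
   }
   return bestLabel;
}

// Toy data: feature 0 is high for 'es', feature 1 is high for 'en'.
const toyVectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]];
const toyLabels = ['es', 'es', 'en', 'en'];
const toyModel = fitGaussianNB(toyVectors, toyLabels);

console.log(predictGaussianNB(toyModel, [0.7, 0.3])); // 'es'
console.log(predictGaussianNB(toyModel, [0.3, 0.7])); // 'en'
```

Working in log space avoids numeric underflow when thousands of per-feature densities are multiplied together.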

Slang Detection

For short/ambiguous texts, the detector uses comprehensive slang dictionaries:

| Language   | Examples                                        |
| ---------- | ----------------------------------------------- |
| English    | lol, bruh, ngl, fr, lowkey, bussin, innit, mate |
| Spanish    | wey, neta, chido, parce, bacano, po, cachai     |
| French     | mdr, ptdr, slt, tkt, jsp, bcp, cv               |
| Italian    | cmq, tvb, xke, nn, qlc, grz                     |
| Portuguese | kkk, blz, vlw, tmj, mano, bora, fixe            |
| German     | digga, krass, geil, oida, leiwand, hdl, vllt    |
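
A dictionary fallback of this kind can be as simple as counting token hits per language. The sketch below uses a subset of the example terms from the table above. `slangSample` and `detectBySlang` are hypothetical, and the tie-handling rule (return `null` on a tie) is an assumption rather than documented behavior.

```typescript
// Tiny dictionaries built from a subset of the example terms in the table above.
const slangSample: Record<string, string[]> = {
   en: ['lol', 'bruh', 'ngl', 'lowkey', 'bussin', 'innit', 'mate'],
   es: ['wey', 'neta', 'chido', 'parce', 'bacano', 'cachai'],
   fr: ['mdr', 'ptdr', 'slt', 'tkt', 'jsp', 'bcp'],
   it: ['cmq', 'tvb', 'xke', 'qlc', 'grz'],
   pt: ['kkk', 'blz', 'vlw', 'tmj', 'bora', 'fixe'],
   de: ['digga', 'krass', 'geil', 'oida', 'leiwand', 'hdl', 'vllt'],
};

// Count slang hits per language and return the language with the most hits,
// or null when nothing matches or the top languages tie.
function detectBySlang(text: string): { language: string; matches: number } | null {
   const tokens = text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
   let best: { language: string; matches: number } | null = null;
   let tie = false;
   for (const language of Object.keys(slangSample)) {
      const terms = slangSample[language];
      const matches = tokens.filter((t) => terms.indexOf(t) !== -1).length;
      if (matches === 0) continue;
      if (best === null || matches > best.matches) {
         best = { language, matches };
         tie = false;
      } else if (matches === best.matches) {
         tie = true;
      }
   }
   return tie ? null : best;
}

console.log(detectBySlang('mdr tkt')); // { language: 'fr', matches: 2 }
console.log(detectBySlang('ok')); // null
```

The sketch splits on whitespace only, so a token with attached punctuation such as `mdr!` would not match; normalizing the text first would address that.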

Training Your Own Model

1. Download Training Data

npm run download-data

Downloads data from multiple sources:

| Source          | Description                      | Link           |
| --------------- | -------------------------------- | -------------- |
| Tatoeba         | Community-sourced sentence pairs | tatoeba.org    |
| OpenSubtitles   | Movie and TV subtitles           | opus.nlpl.eu   |
| Leipzig Corpora | Web and news text                | uni-leipzig.de |
| TED2020         | TED talk transcripts             | opus.nlpl.eu   |
| QED             | Educational content              | opus.nlpl.eu   |
| Ubuntu          | Technical support dialogues      | opus.nlpl.eu   |

2. Prepare Data

npm run prepare-data

Processes raw data, filters by length, and removes duplicates.
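
The length filter and de-duplication can be sketched as follows. The `LabeledSentence` shape, the `prepareSentences` name, the word-count bounds, and the case-insensitive duplicate rule are all assumptions for illustration; the actual script may use different criteria.

```typescript
interface LabeledSentence {
   text: string;
   language: string;
}

// Keep sentences whose word count is within [minWords, maxWords] and drop
// case-insensitive duplicates within the same language.
function prepareSentences(rows: LabeledSentence[], minWords: number, maxWords: number): LabeledSentence[] {
   const seen = new Set<string>();
   const kept: LabeledSentence[] = [];
   for (const row of rows) {
      const text = row.text.trim();
      const wordCount = text.length === 0 ? 0 : text.split(/\s+/).length;
      if (wordCount < minWords || wordCount > maxWords) continue;
      // Key on language + text so identical strings in two languages both survive.
      const key = row.language + '\u0000' + text.toLowerCase();
      if (seen.has(key)) continue;
      seen.add(key);
      kept.push({ text, language: row.language });
   }
   return kept;
}

const rawRows: LabeledSentence[] = [
   { text: 'hola amigo', language: 'es' },
   { text: 'Hola amigo ', language: 'es' }, // duplicate after trimming and lowercasing
   { text: '', language: 'es' }, // too short
   { text: 'hello there', language: 'en' },
];

console.log(prepareSentences(rawRows, 1, 50).length); // 2
```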

3. Train Model

npm run train

Trains a TF-IDF + Naive Bayes model using batch processing and saves to models/language-model.json.

4. Evaluate Model

npm run evaluate

Runs the model against 959 test cases and reports accuracy.

Interactive mode:

npm run evaluate -- -i
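
An accuracy report of the kind the evaluation step produces can be computed as in the sketch below, which also breaks results down per language. `EvalCase`, `evaluateCases`, and the toy predictor are hypothetical; in practice the predictor would be something like `(text) => detector.detect(text).language`.

```typescript
interface EvalCase {
   text: string;
   expected: string;
}

interface LanguageTally {
   correct: number;
   total: number;
}

// Run `predict` on every case and report overall and per-language accuracy.
function evaluateCases(
   cases: EvalCase[],
   predict: (text: string) => string,
): { accuracy: number; perLanguage: Record<string, LanguageTally> } {
   const perLanguage: Record<string, LanguageTally> = {};
   let correct = 0;
   for (const testCase of cases) {
      const tally = perLanguage[testCase.expected] || (perLanguage[testCase.expected] = { correct: 0, total: 0 });
      tally.total += 1;
      if (predict(testCase.text) === testCase.expected) {
         tally.correct += 1;
         correct += 1;
      }
   }
   return { accuracy: cases.length === 0 ? 0 : correct / cases.length, perLanguage };
}

// Toy predictor standing in for a real detector.
const toyPredict = (text: string): string => (text.indexOf('hola') !== -1 ? 'es' : 'en');

const evalReport = evaluateCases(
   [
      { text: 'hola amigo', expected: 'es' },
      { text: 'hello friend', expected: 'en' },
      { text: 'bonjour', expected: 'fr' },
      { text: 'que tal', expected: 'es' },
   ],
   toyPredict,
);

console.log(evalReport.accuracy); // 0.5
console.log(evalReport.perLanguage); // es: 1 of 2, en: 1 of 1, fr: 0 of 1
```

A per-language breakdown helps reveal whether a high overall score hides weak results for a language with few test cases.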

Text Normalization

import { normalizeText, augmentText } from 'naive-bayes-language-detector';

// Normalize text (lowercase, remove URLs, emails, phone numbers)
const normalized = normalizeText('Hello World! https://example.com');
// 'hello world'

// Augment for training (creates variations with abbreviations)
const variations = augmentText('porque no vienes', 'es');
// ['porque no vienes', 'xq no vienes', ...]
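
A normalizer with the behavior described in the comment above could be built from a handful of regular expressions. The sketch below is an approximation under assumed patterns: `normalizeSketch` is a hypothetical name, and the URL, email, phone, and punctuation rules are illustrative guesses rather than the rules used by `normalizeText`.

```typescript
// Lowercase, strip URLs, emails, and phone-like digit runs, drop common
// punctuation, then collapse whitespace.
function normalizeSketch(text: string): string {
   return text
      .toLowerCase()
      .replace(/https?:\/\/\S+|www\.\S+/g, ' ') // URLs
      .replace(/\S+@\S+\.\S+/g, ' ') // email addresses
      .replace(/\+?\d[\d\s().-]{6,}\d/g, ' ') // phone-like sequences of 8+ characters
      .replace(/[!?¿¡.,;:"()\[\]{}]/g, ' ') // common punctuation
      .replace(/\s+/g, ' ')
      .trim();
}

console.log(normalizeSketch('Hello World! https://example.com')); // 'hello world'
console.log(normalizeSketch('Llámame al +34 600 123 456 o escribe a ana@example.com'));
// 'llámame al o escribe a'
```

Accented letters are preserved on purpose, since they carry strong language signal for character n-gram features.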

Project Structure

language-detector/
├── src/                    # TypeScript source
│   ├── index.ts           # Main exports
│   ├── types/             # Type definitions
│   ├── utils/             # Utilities (normalization, n-grams, slang)
│   └── inference/         # ML components (vectorizer, classifier, detector)
├── dist/                  # Compiled JavaScript (CommonJS)
├── test/                  # Mocha + Chai test files
├── scripts/               # Training and evaluation scripts
├── models/                # Pre-trained model
│   └── language-model.json
└── data/                  # Training data (not included in npm package)

Type Exports

import type {
   LanguageCode, // 'en' | 'es' | 'fr' | 'it' | 'pt' | 'de'
   DetectionResult,
   DetectionSource,
   SlangDetectionResult,
   PredictionResult,
   VectorizerOptions,
   VectorizerData,
   ClassifierData,
   ModelData,
   AllowedLanguagesOptions, // Options for setAllowedLanguages()
} from 'naive-bayes-language-detector';

Development

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run tests
npm test

# Run tests with coverage
npm run coverage

# Lint code
npm run lint
npm run lint:fix

# Training workflow
npm run download-data
npm run prepare-data
npm run train
npm run evaluate

Tech Stack

| Technology         | Purpose               |
| ------------------ | --------------------- |
| TypeScript         | Type-safe development |
| Node.js            | Runtime environment   |
| Mocha + Chai       | Testing framework     |
| ESLint + Prettier  | Code quality          |
| Husky              | Git hooks             |
| Airbnb Style Guide | Code style            |

Git Hooks

This project uses Husky for Git hooks:

pre-commit: Runs lint-staged to lint and format staged .ts files

# Hooks are automatically installed when you run npm install
npm install

Requirements

Node.js >= 20

Performance

| Metric         | Value                 |
| -------------- | --------------------- |
| Inference time | <5ms per text         |
| Model size     | ~1.7MB (JSON)         |
| Accuracy       | 100% (959 test cases) |
| Memory usage   | ~50MB loaded          |

License

ISC

Contributing

Contributions are welcome! Please ensure:

All tests pass (npm test)
Code is linted (npm run lint)
New features include tests

Maintainers

@aaochoa

Made with ❤️ for the messaging community
