naive-bayes-language-detector
v1.3.0
Naive Bayes - NGram language detector
Language Detector
A Naive Bayes language detector optimized for short, informal text like SMS and chat messages. Built with TypeScript and powered by TF-IDF vectorization and Gaussian Naive Bayes classification.
Supported Languages
| Code | Language | Flag | Slang Terms | Regional Support |
| ---- | ---------- | ------ | ----------- | --------------------------------------------------------------------------------------------------------- |
| en | English | 🇺🇸🇬🇧 | ~1,300+ | US, UK, Gen-Z, Gaming, AAVE, texting |
| es | Spanish | 🇪🇸🇲🇽 | ~1,800+ | Mexico, Spain, Argentina, Colombia, Venezuela, Chile, Caribbean |
| fr | French | 🇫🇷 | ~300+ | Standard French, SMS abbreviations (mdr, ptdr, slt, tkt) |
| it | Italian | 🇮🇹 | ~350+ | Standard Italian, regional variants (cmq, tvb, xke) |
| pt | Portuguese | 🇧🇷🇵🇹 | ~500+ | Brazilian (pt-BR) & European (pt-PT) Portuguese |
| de | German | 🇩🇪🇦🇹🇨🇭 | ~400+ | Standard German, Austrian, Swiss German, youth slang (Jugendsprache) |
Total: ~4,600+ slang terms across all languages for improved informal text detection.
Features
- ✅ Optimized for short text: Works well with SMS and chat messages (1-50 words)
- ✅ Handles informal language: Supports slang, abbreviations, and texting patterns
- ✅ Multi-language support: 6 languages with regional variations
- ✅ Language filtering: Restrict detection to specific languages; text that matches none of them is reported with low confidence ("neither" detection)
- ✅ Fast inference: <5ms per detection, suitable for real-time applications
- ✅ TypeScript support: Full type definitions included
- ✅ Slang dictionary fallback: Comprehensive detection for ambiguous cases
- ✅ Zero dependencies at runtime: Lightweight and self-contained
Installation
npm install naive-bayes-language-detector
Quick Start
import { getDetector } from 'naive-bayes-language-detector';
// Load the pre-trained model
const detector = getDetector(
  './node_modules/naive-bayes-language-detector/dist/models/language-model.json',
);
// Detect language
const result = detector.detect('Hola, ¿cómo estás?');
console.log(result);
// {
//   language: 'es',
//   confidence: 0.95,
//   isReliable: true,
//   probabilities: { es: 0.95, en: 0.01, fr: 0.02, it: 0.01, pt: 0.005, de: 0.005 },
//   source: 'ml'
// }
// Batch detection
const results = detector.detectBatch(['hello', 'hola', 'bonjour', 'ciao', 'oi']);
API Reference
getDetector(modelPath: string): LanguageDetector
Get or create a singleton detector instance.
const detector = getDetector('./models/language-model.json');
LanguageDetector.detect(text: string): DetectionResult
Detect the language of a single text.
interface DetectionResult {
  language: string; // Detected language code ('en', 'es', 'fr', 'it', 'pt', 'de')
  confidence: number; // Confidence score (0-1)
  isReliable: boolean; // True if confidence > 0.7
  probabilities?: Record<string, number>; // Probability per language
  source?: 'ml' | 'slang' | 'slang-override' | 'combined';
}
LanguageDetector.detectBatch(texts: string[]): DetectionResult[]
Detect languages for multiple texts efficiently.
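As a rough illustration of consuming batch output, the sketch below tallies reliable detections per language and records which inputs were flagged as unreliable. `DetectionResultLike`, `summarizeBatch`, and the sample values are hypothetical; they only mirror the documented result shape and are not part of the package.

```typescript
// Local copy of the documented result shape so the sketch runs standalone.
interface DetectionResultLike {
  language: string;
  confidence: number;
  isReliable: boolean;
}

// Hypothetical helper: count reliable detections per language and collect
// the indexes of inputs whose detection was not reliable.
function summarizeBatch(results: DetectionResultLike[]) {
  const counts: Record<string, number> = {};
  const unreliable: number[] = [];
  results.forEach((r, i) => {
    if (r.isReliable) {
      counts[r.language] = (counts[r.language] ?? 0) + 1;
    } else {
      unreliable.push(i);
    }
  });
  return { counts, unreliable };
}

// Hand-written values standing in for the output of detector.detectBatch(...).
const sampleResults: DetectionResultLike[] = [
  { language: 'en', confidence: 0.97, isReliable: true },
  { language: 'es', confidence: 0.91, isReliable: true },
  { language: 'en', confidence: 0.42, isReliable: false },
];

const batchSummary = summarizeBatch(sampleResults);
console.log(batchSummary); // { counts: { en: 1, es: 1 }, unreliable: [ 2 ] }
```

Keeping the indexes of unreliable inputs makes it easy to route those messages to a fallback, such as a default language or manual review.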
LanguageDetector.setAllowedLanguages(languages, options?): this
Restrict detection to specific languages only. Useful when you only care about certain languages.
// Only detect English or Spanish
detector.setAllowedLanguages(['en', 'es']);
// With fast mode for better performance (skips "neither" detection)
detector.setAllowedLanguages(['en', 'es'], { fastMode: true });
Options:
| Option | Type | Default | Description |
| ---------- | --------- | ------- | -------------------------------------------------------------------------------------------------- |
| fastMode | boolean | false | When true, only computes probabilities for allowed languages (faster but no "neither" detection) |
Behavior:
| Scenario | fastMode: false (default) | fastMode: true |
| -------- | --------------------------- | ---------------- |
| Text matches allowed language | High confidence, isReliable: true | High confidence, isReliable: true |
| Text doesn't match allowed languages | Low confidence, isReliable: false | High confidence (re-normalized) |
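The difference between the two columns comes down to re-normalization. The sketch below shows the arithmetic on made-up scores; `restrictProbabilities` is a hypothetical helper, and the real detector may compute its scores differently (in fast mode it reportedly never evaluates the other languages at all).

```typescript
type Probabilities = Record<string, number>;

// Hypothetical helper: keep only the allowed languages, optionally
// rescaling them so that they sum to 1.
function restrictProbabilities(
  all: Probabilities,
  allowed: string[],
  renormalize: boolean,
): Probabilities {
  const kept: Probabilities = {};
  let total = 0;
  for (const lang of allowed) {
    const p = all[lang] ?? 0;
    kept[lang] = p;
    total += p;
  }
  if (!renormalize || total === 0) return kept;
  for (const lang of allowed) kept[lang] /= total;
  return kept;
}

// Made-up scores for a French message; not output from the real model.
const fullScores: Probabilities = { fr: 0.8, en: 0.12, es: 0.08 };

// Without re-normalization the best allowed language keeps its low score,
// which is what lets a caller recognise the "neither" case.
const withNeither = restrictProbabilities(fullScores, ['en', 'es'], false);

// With re-normalization the same language looks confident (about 0.6 here).
const fastLike = restrictProbabilities(fullScores, ['en', 'es'], true);

console.log(withNeither.en, fastLike.en);
```

This is why a high confidence in fast mode says nothing about whether the text belongs to one of the allowed languages in the first place.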
detector.setAllowedLanguages(['en', 'es']);
// Spanish text - detected correctly
detector.detect('Hola amigo');
// { language: 'es', confidence: 0.95, isReliable: true }
// French text with en/es filter - "neither" case
detector.detect('Bonjour!');
// { language: 'en', confidence: 0.12, isReliable: false }
// Low confidence indicates text doesn't really match allowed languages
LanguageDetector.clearAllowedLanguages(): this
Remove language restrictions and detect all supported languages again.
detector.clearAllowedLanguages();
// Now detects all 6 languages
LanguageDetector.allowedLanguages: string[] | null
Get the currently allowed languages. Returns null if all languages are allowed.
LanguageDetector.fastMode: boolean
Get whether fast mode is enabled.
resetDetector(): void
Reset the singleton instance (useful for testing).
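For readers unfamiliar with the pattern, a get-or-create singleton with a reset hook can be sketched as follows. `DemoDetector`, `getDemoDetector`, and `resetDemoDetector` are stand-ins invented for this example. How the real `getDetector` behaves when called again with a different path is not documented here, so the sketch simply keeps the first instance.

```typescript
// Stand-in class; the real LanguageDetector loads a model from modelPath.
class DemoDetector {
  readonly modelPath: string;

  constructor(modelPath: string) {
    this.modelPath = modelPath;
  }
}

// Module-level cache holding the single shared instance.
let cachedDetector: DemoDetector | null = null;

// Create the instance on first use, then keep returning the same one.
function getDemoDetector(modelPath: string): DemoDetector {
  if (cachedDetector === null) {
    cachedDetector = new DemoDetector(modelPath);
  }
  return cachedDetector;
}

// Drop the cached instance so the next call builds a fresh one.
function resetDemoDetector(): void {
  cachedDetector = null;
}

const firstDetector = getDemoDetector('./models/language-model.json');
const secondDetector = getDemoDetector('./models/language-model.json');
console.log(firstDetector === secondDetector); // true

resetDemoDetector();
const thirdDetector = getDemoDetector('./models/language-model.json');
console.log(thirdDetector === firstDetector); // false
```

In a test suite, calling the reset function in a setup or teardown hook keeps allowed-language filters set by one test from leaking into the next.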
How It Works
Architecture
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Input Text    │ ──▶ │ Text Normalizer │ ──▶ │TF-IDF Vectorizer│
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Detection    │ ◀── │ Slang Detection │ ◀── │   Naive Bayes   │
│     Result      │     │   (fallback)    │     │   Classifier    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
TF-IDF Vectorizer
Converts text to numerical vectors using character n-grams (2-5 characters).
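To make "character n-grams" concrete, here is a minimal sketch of the extraction and counting step. `charNgrams` and `termFrequencies` are hypothetical helpers written for this example. The package's own vectorizer may pad, weight, or prune n-grams differently, and the IDF weighting is omitted.

```typescript
// Extract every character n-gram with a length from minN to maxN (inclusive).
function charNgrams(text: string, minN = 2, maxN = 5): string[] {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  const grams: string[] = [];
  for (let n = minN; n <= maxN; n++) {
    for (let i = 0; i + n <= chars.length; i++) {
      grams.push(chars.slice(i, i + n).join(''));
    }
  }
  return grams;
}

// Count occurrences of each n-gram (the raw term-frequency part of TF-IDF).
function termFrequencies(grams: string[]): Map<string, number> {
  const tf = new Map<string, number>();
  for (const g of grams) {
    tf.set(g, (tf.get(g) ?? 0) + 1);
  }
  return tf;
}

const demoGrams = charNgrams('hola', 2, 3);
console.log(demoGrams); // [ 'ho', 'ol', 'la', 'hol', 'ola' ]

const demoCounts = termFrequencies(charNgrams('aaa', 2, 2));
console.log(demoCounts.get('aa')); // 2
```

Character n-grams suit short, informal text because they still produce features for misspellings and abbreviations that a word-level vocabulary would miss.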
import { TfidfVectorizer } from 'naive-bayes-language-detector';
const vectorizer = new TfidfVectorizer({
  minN: 2, // Minimum n-gram size
  maxN: 5, // Maximum n-gram size
  maxFeatures: 5000, // Vocabulary limit
});
vectorizer.fit(trainingTexts);
const vector = vectorizer.transform('hello world');
Naive Bayes Classifier
Gaussian Naive Bayes classifier for language prediction.
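For intuition, the textbook form of Gaussian Naive Bayes can be sketched as below: each class stores a per-feature mean and variance, and prediction picks the class with the highest log prior plus summed Gaussian log-likelihoods. `fitGaussianNB`, `predictGaussianNB`, the smoothing constant, and the toy vectors are assumptions made for this example, not the package's implementation.

```typescript
interface GaussianClassStats {
  logPrior: number;
  mean: number[];
  variance: number[];
}

// Estimate a per-class prior, mean, and variance for every feature.
function fitGaussianNB(
  vectors: number[][],
  labels: string[],
  smoothing = 1e-9, // assumed constant that keeps variances above zero
): Map<string, GaussianClassStats> {
  const byClass = new Map<string, number[][]>();
  vectors.forEach((v, i) => {
    const rows = byClass.get(labels[i]) ?? [];
    rows.push(v);
    byClass.set(labels[i], rows);
  });

  const model = new Map<string, GaussianClassStats>();
  byClass.forEach((rows, label) => {
    const dims = rows[0].length;
    const mean = new Array<number>(dims).fill(0);
    const variance = new Array<number>(dims).fill(0);
    rows.forEach((row) => row.forEach((x, d) => { mean[d] += x / rows.length; }));
    rows.forEach((row) => row.forEach((x, d) => {
      variance[d] += (x - mean[d]) ** 2 / rows.length;
    }));
    model.set(label, {
      logPrior: Math.log(rows.length / vectors.length),
      mean,
      variance: variance.map((v) => v + smoothing),
    });
  });
  return model;
}

// Return the label with the highest log posterior (up to a shared constant).
function predictGaussianNB(model: Map<string, GaussianClassStats>, x: number[]): string {
  let best = '';
  let bestScore = -Infinity;
  model.forEach((s, label) => {
    let score = s.logPrior;
    x.forEach((xd, d) => {
      score += -0.5 * Math.log(2 * Math.PI * s.variance[d])
        - (xd - s.mean[d]) ** 2 / (2 * s.variance[d]);
    });
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  });
  return best;
}

// Toy two-feature vectors; real inputs would be TF-IDF vectors.
const nbModel = fitGaussianNB(
  [[1.0, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 1.0]],
  ['en', 'en', 'es', 'es'],
);
console.log(predictGaussianNB(nbModel, [0.9, 0.15])); // 'en'
```

Working in log space avoids numeric underflow when the likelihoods of thousands of features are multiplied together.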
import { NaiveBayesClassifier } from 'naive-bayes-language-detector';
const classifier = new NaiveBayesClassifier();
classifier.fit(vectors, labels);
const prediction = classifier.predict(vector);
Slang Detection
For short/ambiguous texts, the detector uses comprehensive slang dictionaries:
| Language | Examples |
| ---------- | ----------------------------------------------- |
| English | lol, bruh, ngl, fr, lowkey, bussin, innit, mate |
| Spanish | wey, neta, chido, parce, bacano, po, cachai |
| French | mdr, ptdr, slt, tkt, jsp, bcp, cv |
| Italian | cmq, tvb, xke, nn, qlc, grz |
| Portuguese | kkk, blz, vlw, tmj, mano, bora, fixe |
| German | digga, krass, geil, oida, leiwand, hdl, vllt |
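A dictionary fallback of this kind can be sketched as a simple hit count per language. The dictionaries below contain only a few of the examples from the table, and `detectBySlang` is a hypothetical helper. How the real detector weighs slang hits against the classifier output is not shown here.

```typescript
// Tiny illustrative dictionaries built from the examples in the table above.
const demoSlang: Record<string, string[]> = {
  en: ['lol', 'bruh', 'ngl', 'lowkey'],
  es: ['wey', 'neta', 'chido', 'parce'],
  fr: ['mdr', 'ptdr', 'slt', 'tkt'],
};

interface SlangHit {
  language: string;
  hits: number;
}

// Hypothetical fallback: return the language with the most slang hits,
// or null when no token matches any dictionary.
function detectBySlang(text: string): SlangHit | null {
  const tokens = text
    .toLowerCase()
    .split(/\s+/)
    .map((t) => t.replace(/^[.,!?¡¿;:"'()]+|[.,!?¡¿;:"'()]+$/g, ''))
    .filter((t) => t.length > 0);

  let best: SlangHit | null = null;
  for (const language of Object.keys(demoSlang)) {
    const terms = new Set(demoSlang[language]);
    const hits = tokens.filter((t) => terms.has(t)).length;
    if (hits > 0 && (best === null || hits > best.hits)) {
      best = { language, hits };
    }
  }
  return best;
}

console.log(detectBySlang('mdr tkt!')); // { language: 'fr', hits: 2 }
console.log(detectBySlang('12345')); // null
```

Returning null when nothing matches lets the caller fall back to the classifier result instead of guessing.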
Training Your Own Model
1. Download Training Data
npm run download-data
Downloads data from multiple sources:
| Source | Description | Link |
| --------------- | -------------------------------- | -------------- |
| Tatoeba | Community-sourced sentence pairs | tatoeba.org |
| OpenSubtitles | Movie and TV subtitles | opus.nlpl.eu |
| Leipzig Corpora | Web and news text | uni-leipzig.de |
| TED2020 | TED talk transcripts | opus.nlpl.eu |
| QED | Educational content | opus.nlpl.eu |
| Ubuntu | Technical support dialogues | opus.nlpl.eu |
2. Prepare Data
npm run prepare-data
Processes raw data, filters by length, and removes duplicates.
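The length filter and de-duplication could look roughly like the sketch below. `LabeledSample`, `prepareSamples`, and the length bounds are assumptions chosen for illustration. The actual script's thresholds and duplicate key are not documented here.

```typescript
interface LabeledSample {
  text: string;
  language: string;
}

// Keep samples within the length bounds and drop case-insensitive
// duplicates within the same language.
function prepareSamples(
  raw: LabeledSample[],
  minChars = 2, // assumed lower bound
  maxChars = 200, // assumed upper bound
): LabeledSample[] {
  const seen = new Set<string>();
  const out: LabeledSample[] = [];
  for (const sample of raw) {
    const text = sample.text.trim();
    if (text.length < minChars || text.length > maxChars) continue;
    const key = `${sample.language}\u0000${text.toLowerCase()}`;
    if (seen.has(key)) continue;
    seen.add(key);
    out.push({ text, language: sample.language });
  }
  return out;
}

const preparedSamples = prepareSamples([
  { text: 'Hola amigo', language: 'es' },
  { text: '  hola amigo ', language: 'es' }, // duplicate after trim + lowercase
  { text: 'x', language: 'en' }, // too short
  { text: 'hello there', language: 'en' },
]);
console.log(preparedSamples.map((s) => s.text)); // [ 'Hola amigo', 'hello there' ]
```

Including the language in the duplicate key keeps strings that are valid in two languages (for example "no") in both training sets.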
3. Train Model
npm run train
Trains a TF-IDF + Naive Bayes model using batch processing and saves to models/language-model.json.
4. Evaluate Model
npm run evaluate
Runs the model against 959 test cases and reports accuracy.
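An accuracy report of this kind amounts to comparing predictions with expected labels, overall and per language. The sketch below uses a hypothetical `evaluateAccuracy` helper and a deliberately naive stub in place of the real detector, so the numbers it prints say nothing about the shipped model.

```typescript
interface EvalCase {
  text: string;
  expected: string;
}

interface LanguageTally {
  correct: number;
  total: number;
}

// Compare predictions with expected labels, overall and per language.
function evaluateAccuracy(cases: EvalCase[], detect: (text: string) => string) {
  const perLanguage: Record<string, LanguageTally> = {};
  let correct = 0;
  for (const c of cases) {
    if (!perLanguage[c.expected]) {
      perLanguage[c.expected] = { correct: 0, total: 0 };
    }
    perLanguage[c.expected].total += 1;
    if (detect(c.text) === c.expected) {
      perLanguage[c.expected].correct += 1;
      correct += 1;
    }
  }
  const accuracy = cases.length > 0 ? correct / cases.length : 0;
  return { accuracy, perLanguage };
}

// Naive stub used only to exercise the helper; it knows two words.
const stubDetect = (text: string): string => (text.includes('hola') ? 'es' : 'en');

const evalReport = evaluateAccuracy(
  [
    { text: 'hola', expected: 'es' },
    { text: 'hello', expected: 'en' },
    { text: 'bonjour', expected: 'fr' },
  ],
  stubDetect,
);
console.log(evalReport.accuracy.toFixed(2)); // '0.67'
console.log(evalReport.perLanguage.fr); // { correct: 0, total: 1 }
```

A per-language breakdown is worth keeping next to the headline figure, since one weak language can hide behind a high overall score.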
Interactive mode:
npm run evaluate -- -i
Text Normalization
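As a rough idea of what a normalizer of this kind does (lowercasing, then stripping URLs, email addresses, and phone numbers), consider the sketch below. `normalizeSketch` and its regular expressions are assumptions written for this example. The exported `normalizeText` may apply different or additional rules.

```typescript
// Hypothetical normalizer: lowercase, strip URLs, emails, phone-like digit
// runs and basic punctuation, then collapse whitespace.
function normalizeSketch(text: string): string {
  return text
    .toLowerCase()
    .replace(/https?:\/\/\S+/g, ' ') // URLs
    .replace(/\S+@\S+\.\S+/g, ' ') // email addresses
    .replace(/\+?\d[\d\s().-]{6,}\d/g, ' ') // phone-like digit runs
    .replace(/[!?.,;:]+/g, ' ') // basic punctuation
    .replace(/\s+/g, ' ')
    .trim();
}

console.log(normalizeSketch('Hello World! https://example.com'));
// 'hello world'
console.log(normalizeSketch('Call me at +1 (555) 123-4567 or me@example.com'));
// 'call me at or'
```

The order of the steps matters here: URLs and emails are removed before punctuation, otherwise their dots and colons would be split into stray tokens first.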
import { normalizeText, augmentText } from 'naive-bayes-language-detector';
// Normalize text (lowercase, remove URLs, emails, phone numbers)
const normalized = normalizeText('Hello World! https://example.com');
// 'hello world'
// Augment for training (creates variations with abbreviations)
const variations = augmentText('porque no vienes', 'es');
// ['porque no vienes', 'xq no vienes', ...]
Project Structure
language-detector/
├── src/                 # TypeScript source
│   ├── index.ts         # Main exports
│   ├── types/           # Type definitions
│   ├── utils/           # Utilities (normalization, n-grams, slang)
│   └── inference/       # ML components (vectorizer, classifier, detector)
├── dist/                # Compiled JavaScript (CommonJS)
├── test/                # Mocha + Chai test files
├── scripts/             # Training and evaluation scripts
├── models/              # Pre-trained model
│   └── language-model.json
└── data/                # Training data (not included in npm package)
Type Exports
import type {
  LanguageCode, // 'en' | 'es' | 'fr' | 'it' | 'pt' | 'de'
  DetectionResult,
  DetectionSource,
  SlangDetectionResult,
  PredictionResult,
  VectorizerOptions,
  VectorizerData,
  ClassifierData,
  ModelData,
  AllowedLanguagesOptions, // Options for setAllowedLanguages()
} from 'naive-bayes-language-detector';
Development
# Install dependencies
npm install
# Build TypeScript
npm run build
# Run tests
npm test
# Run tests with coverage
npm run coverage
# Lint code
npm run lint
npm run lint:fix
# Training workflow
npm run download-data
npm run prepare-data
npm run train
npm run evaluate
Tech Stack
| Technology | Purpose |
| ------------------ | --------------------- |
| TypeScript | Type-safe development |
| Node.js | Runtime environment |
| Mocha + Chai | Testing framework |
| ESLint + Prettier | Code quality |
| Husky | Git hooks |
| Airbnb Style Guide | Code style |
Git Hooks
This project uses Husky for Git hooks:
- pre-commit: Runs lint-staged to lint and format staged .ts files
# Hooks are automatically installed when you run npm install
npm install
Requirements
- Node.js >= 20
Performance
| Metric | Value |
| -------------- | --------------------- |
| Inference time | <5ms per text |
| Model size | ~1.7MB (JSON) |
| Accuracy | 100% (959 test cases) |
| Memory usage | ~50MB loaded |
License
Contributing
Contributions are welcome! Please ensure:
- All tests pass (npm test)
- Code is linted (npm run lint)
- New features include tests
Made with ❤️ for the messaging community
