naive-bayes-language-detector

v1.3.0

Naive Bayes - NGram language detector

Language Detector

A Naive Bayes language detector optimized for short, informal text like SMS and chat messages. Built with TypeScript and powered by TF-IDF vectorization and Gaussian Naive Bayes classification.

Supported Languages

| Code | Language   | Flag   | Slang Terms | Regional Support                                                     |
| ---- | ---------- | ------ | ----------- | -------------------------------------------------------------------- |
| en   | English    | 🇺🇸🇬🇧   | ~1,300+     | US, UK, Gen-Z, Gaming, AAVE, texting                                 |
| es   | Spanish    | 🇪🇸🇲🇽   | ~1,800+     | Mexico, Spain, Argentina, Colombia, Venezuela, Chile, Caribbean      |
| fr   | French     | 🇫🇷     | ~300+       | Standard French, SMS abbreviations (mdr, ptdr, slt, tkt)             |
| it   | Italian    | 🇮🇹     | ~350+       | Standard Italian, regional variants (cmq, tvb, xke)                  |
| pt   | Portuguese | 🇧🇷🇵🇹   | ~500+       | Brazilian (pt-BR) & European (pt-PT) Portuguese                      |
| de   | German     | 🇩🇪🇦🇹🇨🇭 | ~400+       | Standard German, Austrian, Swiss German, youth slang (Jugendsprache) |

Total: ~4,600+ slang terms across all languages for improved informal text detection.

Features

  • Optimized for short text: Works well with SMS and chat messages (1-50 words)
  • Handles informal language: Supports slang, abbreviations, and texting patterns
  • Multi-language support: 6 languages with regional variations
  • Language filtering: Restrict detection to specific languages; by default, text that matches none of them is reported with low confidence ("neither" detection)
  • Fast inference: <5ms per detection, suitable for real-time applications
  • TypeScript support: Full type definitions included
  • Slang dictionary fallback: Comprehensive detection for ambiguous cases
  • Zero dependencies at runtime: Lightweight and self-contained

Installation

npm install naive-bayes-language-detector

Quick Start

import { getDetector } from 'naive-bayes-language-detector';

// Load the pre-trained model
const detector = getDetector(
   './node_modules/naive-bayes-language-detector/dist/models/language-model.json',
);

// Detect language
const result = detector.detect('Hola, ¿cómo estás?');
console.log(result);
// {
//   language: 'es',
//   confidence: 0.95,
//   isReliable: true,
//   probabilities: { es: 0.95, en: 0.01, fr: 0.02, it: 0.01, pt: 0.005, de: 0.005 },
//   source: 'ml'
// }

// Batch detection
const results = detector.detectBatch(['hello', 'hola', 'bonjour', 'ciao', 'oi']);

API Reference

getDetector(modelPath: string): LanguageDetector

Get or create a singleton detector instance.

const detector = getDetector('./models/language-model.json');

LanguageDetector.detect(text: string): DetectionResult

Detect the language of a single text.

interface DetectionResult {
   language: string; // Detected language code ('en', 'es', 'fr', 'it', 'pt', 'de')
   confidence: number; // Confidence score (0-1)
   isReliable: boolean; // True if confidence > 0.7
   probabilities?: Record<string, number>; // Probability per language
   source?: 'ml' | 'slang' | 'slang-override' | 'combined';
}
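
As a usage sketch, the isReliable flag can gate a fallback. The interface is re-declared locally so the snippet stands alone; resolveLanguage and its fallback parameter are hypothetical helpers for illustration, not part of the package.

```typescript
// Local copy of the documented result shape, so the sketch is self-contained.
interface DetectionResult {
   language: string;
   confidence: number;
   isReliable: boolean;
   probabilities?: Record<string, number>;
   source?: 'ml' | 'slang' | 'slang-override' | 'combined';
}

// Hypothetical helper: trust the detection only when it is flagged reliable.
function resolveLanguage(result: DetectionResult, fallback: string): string {
   return result.isReliable ? result.language : fallback;
}

const reliable: DetectionResult = { language: 'es', confidence: 0.95, isReliable: true };
const unreliable: DetectionResult = { language: 'en', confidence: 0.12, isReliable: false };

console.log(resolveLanguage(reliable, 'und')); // es
console.log(resolveLanguage(unreliable, 'und')); // und
```

Here 'und' (the ISO 639 code for "undetermined") is just one possible fallback value; any sentinel your application understands would do.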

LanguageDetector.detectBatch(texts: string[]): DetectionResult[]

Detect languages for multiple texts efficiently.

LanguageDetector.setAllowedLanguages(languages, options?): this

Restrict detection to specific languages only. Useful when you only care about certain languages.

// Only detect English or Spanish
detector.setAllowedLanguages(['en', 'es']);

// With fast mode for better performance (skips "neither" detection)
detector.setAllowedLanguages(['en', 'es'], { fastMode: true });

Options:

| Option   | Type    | Default | Description                                                                                      |
| -------- | ------- | ------- | ------------------------------------------------------------------------------------------------ |
| fastMode | boolean | false   | When true, only computes probabilities for allowed languages (faster but no "neither" detection) |

Behavior:

| Scenario                             | fastMode: false (default)         | fastMode: true                    |
| ------------------------------------ | --------------------------------- | --------------------------------- |
| Text matches allowed language        | High confidence, isReliable: true | High confidence, isReliable: true |
| Text doesn't match allowed languages | Low confidence, isReliable: false | High confidence (re-normalized)   |

detector.setAllowedLanguages(['en', 'es']);

// Spanish text - detected correctly
detector.detect('Hola amigo');
// { language: 'es', confidence: 0.95, isReliable: true }

// French text with en/es filter - "neither" case
detector.detect('Bonjour!');
// { language: 'en', confidence: 0.12, isReliable: false }
// Low confidence indicates text doesn't really match allowed languages
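
The difference between the two modes can be illustrated with a small stand-alone sketch. This is one reading of the behavior table, not the package's internal code: renormalize, pickBest, and the probability values are all made up for illustration.

```typescript
type Probabilities = Record<string, number>;

// Keep only the allowed languages and rescale them so they sum to 1.
function renormalize(probs: Probabilities, allowed: string[]): Probabilities {
   let total = 0;
   for (const lang of allowed) total += probs[lang] ?? 0;
   const out: Probabilities = {};
   for (const lang of allowed) out[lang] = total > 0 ? (probs[lang] ?? 0) / total : 0;
   return out;
}

// Pick the allowed language with the highest probability.
function pickBest(probs: Probabilities, allowed: string[]): { language: string; confidence: number } {
   let language = allowed[0] ?? '';
   for (const lang of allowed) {
      if ((probs[lang] ?? 0) > (probs[language] ?? 0)) language = lang;
   }
   return { language, confidence: probs[language] ?? 0 };
}

// Made-up distribution for a French text; not real model output.
const full: Probabilities = { en: 0.12, es: 0.05, fr: 0.75, it: 0.04, pt: 0.02, de: 0.02 };
const allowed = ['en', 'es'];

const normal = pickBest(full, allowed); // like fastMode: false
const fast = pickBest(renormalize(full, allowed), allowed); // like fastMode: true

console.log(normal.language, normal.confidence.toFixed(2)); // en 0.12
console.log(fast.language, fast.confidence.toFixed(2)); // en 0.71
```

With the full distribution kept, French's share of the probability mass stays visible and the winning allowed language scores low. Once the mass is re-normalized over only the allowed languages, that signal is gone, which is why fast mode cannot flag the "neither" case.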

LanguageDetector.clearAllowedLanguages(): this

Remove language restrictions and detect all supported languages again.

detector.clearAllowedLanguages();
// Now detects all 6 languages

LanguageDetector.allowedLanguages: string[] | null

Get the currently allowed languages. Returns null if all languages are allowed.

LanguageDetector.fastMode: boolean

Indicates whether fast mode is enabled.

resetDetector(): void

Reset the singleton instance (useful for testing).
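
For readers unfamiliar with the pattern, a get-or-create singleton with a reset hook might look roughly like the sketch below. SketchDetector, getDetectorSketch, and resetDetectorSketch are hypothetical stand-ins. In particular, this sketch ignores the path on later calls, which is an assumption and may not match what the package does.

```typescript
// Stand-in for the real detector; it only remembers which model it was built from.
class SketchDetector {
   readonly modelPath: string;

   constructor(modelPath: string) {
      this.modelPath = modelPath;
   }
}

let instance: SketchDetector | null = null;

// Create the detector on first use, then keep returning the same object.
function getDetectorSketch(modelPath: string): SketchDetector {
   if (instance === null) instance = new SketchDetector(modelPath);
   return instance;
}

// Drop the cached instance so the next call builds a fresh one.
function resetDetectorSketch(): void {
   instance = null;
}

const first = getDetectorSketch('./models/language-model.json');
const second = getDetectorSketch('./models/language-model.json');
console.log(first === second); // true

resetDetectorSketch();
console.log(getDetectorSketch('./models/language-model.json') === first); // false
```

Resetting between test cases keeps one test's cached model or language filter from leaking into the next.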

How It Works

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Input Text    │ ──▶ │ Text Normalizer │ ──▶ │TF-IDF Vectorizer│
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Detection      │ ◀── │ Slang Detection │ ◀── │ Naive Bayes     │
│  Result         │     │ (fallback)      │     │ Classifier      │
└─────────────────┘     └─────────────────┘     └─────────────────┘

TF-IDF Vectorizer

Converts text to numerical vectors using character n-grams (2-5 characters).

import { TfidfVectorizer } from 'naive-bayes-language-detector';

const vectorizer = new TfidfVectorizer({
   minN: 2, // Minimum n-gram size
   maxN: 5, // Maximum n-gram size
   maxFeatures: 5000, // Vocabulary limit
});

vectorizer.fit(trainingTexts);
const vector = vectorizer.transform('hello world');
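
To make the idea concrete, here is a minimal stand-alone sketch of character n-gram extraction and TF-IDF weighting. It does not use the package's TfidfVectorizer. The function names are hypothetical, the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 is an assumed variant, and the maxFeatures vocabulary limit is left out.

```typescript
// All character n-grams of length minN..maxN, in order of appearance.
function charNgrams(text: string, minN: number, maxN: number): string[] {
   const grams: string[] = [];
   for (let n = minN; n <= maxN; n++) {
      for (let i = 0; i + n <= text.length; i++) {
         grams.push(text.slice(i, i + n));
      }
   }
   return grams;
}

// Smoothed inverse document frequency per n-gram (assumed formula, see above).
function idfWeights(corpus: string[], minN: number, maxN: number): Map<string, number> {
   const df = new Map<string, number>();
   for (const doc of corpus) {
      const unique = new Set(charNgrams(doc, minN, maxN));
      unique.forEach((g) => df.set(g, (df.get(g) ?? 0) + 1));
   }
   const weights = new Map<string, number>();
   df.forEach((count, g) => weights.set(g, Math.log((1 + corpus.length) / (1 + count)) + 1));
   return weights;
}

// Sparse TF-IDF vector for one text; n-grams unseen during fitting are dropped.
function tfidfVector(text: string, weights: Map<string, number>, minN: number, maxN: number): Map<string, number> {
   const vector = new Map<string, number>();
   for (const g of charNgrams(text, minN, maxN)) {
      const w = weights.get(g);
      if (w !== undefined) vector.set(g, (vector.get(g) ?? 0) + w);
   }
   return vector;
}

console.log(charNgrams('hola', 2, 3)); // [ 'ho', 'ol', 'la', 'hol', 'ola' ]

const weights = idfWeights(['hola', 'hello', 'hotel'], 2, 2);
// 'ho' occurs in two documents and 'la' in one, so 'la' gets the larger weight.
console.log((weights.get('la') ?? 0) > (weights.get('ho') ?? 0)); // true
```

Character n-grams suit short, informal text because they still produce features for misspelled or abbreviated words that a word-level vocabulary would miss.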

Naive Bayes Classifier

Gaussian Naive Bayes classifier for language prediction.

import { NaiveBayesClassifier } from 'naive-bayes-language-detector';

const classifier = new NaiveBayesClassifier();
classifier.fit(vectors, labels);
const prediction = classifier.predict(vector);
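
The following stand-alone sketch shows the general Gaussian Naive Bayes technique on dense vectors: fit a mean and variance per feature and class, then pick the class with the highest log-posterior. It is not the package's NaiveBayesClassifier. The function names, the dense number[] representation, and the epsilon variance smoothing are assumptions made for illustration.

```typescript
interface ClassStats {
   prior: number; // fraction of training samples in this class
   mean: number[]; // per-feature mean
   variance: number[]; // per-feature variance, plus smoothing
}

// Estimate prior, mean, and variance for every class.
function fitGaussianNB(X: number[][], y: string[], epsilon = 1e-9): Record<string, ClassStats> {
   const model: Record<string, ClassStats> = {};
   const labels = y.filter((label, i) => y.indexOf(label) === i); // unique labels
   for (const label of labels) {
      const rows = X.filter((_, i) => y[i] === label);
      const mean: number[] = [];
      const variance: number[] = [];
      for (let d = 0; d < rows[0].length; d++) {
         const m = rows.reduce((sum, r) => sum + r[d], 0) / rows.length;
         const v = rows.reduce((sum, r) => sum + (r[d] - m) ** 2, 0) / rows.length;
         mean.push(m);
         variance.push(v + epsilon); // avoid division by zero
      }
      model[label] = { prior: rows.length / X.length, mean, variance };
   }
   return model;
}

// Return the label with the highest log-posterior for x.
function predictGaussianNB(model: Record<string, ClassStats>, x: number[]): string {
   let best = '';
   let bestScore = -Infinity;
   for (const label of Object.keys(model)) {
      const { prior, mean, variance } = model[label];
      let score = Math.log(prior);
      for (let d = 0; d < x.length; d++) {
         score += -0.5 * Math.log(2 * Math.PI * variance[d]) - (x[d] - mean[d]) ** 2 / (2 * variance[d]);
      }
      if (score > bestScore) {
         bestScore = score;
         best = label;
      }
   }
   return best;
}

// Toy two-feature data; real inputs would be TF-IDF vectors.
const model = fitGaussianNB([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]], ['en', 'en', 'es', 'es']);
console.log(predictGaussianNB(model, [0.95, 0.15])); // en
console.log(predictGaussianNB(model, [0.15, 0.9])); // es
```

Working in log space turns the product of per-feature likelihoods into a sum, which avoids numeric underflow when vectors have thousands of features.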

Slang Detection

For short/ambiguous texts, the detector uses comprehensive slang dictionaries:

| Language   | Examples                                        |
| ---------- | ----------------------------------------------- |
| English    | lol, bruh, ngl, fr, lowkey, bussin, innit, mate |
| Spanish    | wey, neta, chido, parce, bacano, po, cachai     |
| French     | mdr, ptdr, slt, tkt, jsp, bcp, cv               |
| Italian    | cmq, tvb, xke, nn, qlc, grz                     |
| Portuguese | kkk, blz, vlw, tmj, mano, bora, fixe            |
| German     | digga, krass, geil, oida, leiwand, hdl, vllt    |
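
A dictionary fallback of this kind can be sketched as a simple token vote. The SLANG table below holds only a few of the example terms listed above, and slangVote with its hit-count scoring is a hypothetical illustration. How the package actually scores slang matches and combines them with the ML result is not described here.

```typescript
// Tiny excerpt of example terms; the real dictionaries are far larger.
const SLANG: Record<string, string[]> = {
   en: ['lol', 'bruh', 'ngl'],
   es: ['wey', 'neta', 'chido'],
   fr: ['mdr', 'ptdr', 'slt', 'tkt'],
   it: ['cmq', 'tvb', 'xke'],
   pt: ['kkk', 'blz', 'vlw'],
   de: ['digga', 'krass', 'geil'],
};

// Count slang hits per language and return the language with the most, if any.
function slangVote(text: string): { language: string; hits: number } | null {
   const tokens = text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
   let best: { language: string; hits: number } | null = null;
   for (const language of Object.keys(SLANG)) {
      const hits = tokens.filter((t) => SLANG[language].indexOf(t) !== -1).length;
      if (hits > 0 && (best === null || hits > best.hits)) best = { language, hits };
   }
   return best;
}

console.log(slangVote('mdr tkt')); // { language: 'fr', hits: 2 }
console.log(slangVote('good morning')); // null
```

A null result means the dictionaries have no opinion, in which case a caller would presumably keep the classifier's answer.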

Training Your Own Model

1. Download Training Data

npm run download-data

Downloads data from multiple sources:

| Source          | Description                      | Link           |
| --------------- | -------------------------------- | -------------- |
| Tatoeba         | Community-sourced sentence pairs | tatoeba.org    |
| OpenSubtitles   | Movie and TV subtitles           | opus.nlpl.eu   |
| Leipzig Corpora | Web and news text                | uni-leipzig.de |
| TED2020         | TED talk transcripts             | opus.nlpl.eu   |
| QED             | Educational content              | opus.nlpl.eu   |
| Ubuntu          | Technical support dialogues      | opus.nlpl.eu   |

2. Prepare Data

npm run prepare-data

Processes raw data, filters by length, and removes duplicates.
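
As a rough illustration of this step, the sketch below filters samples by length and drops duplicates. The Sample shape, the prepareSamples name, the length thresholds, and the case-insensitive duplicate check are assumptions for the example. The actual prepare-data script may apply different rules.

```typescript
interface Sample {
   text: string;
   language: string;
}

// Keep samples whose trimmed length is within bounds, one per (language, text) pair.
function prepareSamples(samples: Sample[], minLen: number, maxLen: number): Sample[] {
   const seen = new Set<string>();
   const out: Sample[] = [];
   for (const s of samples) {
      const text = s.text.trim();
      if (text.length < minLen || text.length > maxLen) continue;
      const key = s.language + '\u0000' + text.toLowerCase(); // duplicates ignore case
      if (seen.has(key)) continue;
      seen.add(key);
      out.push({ text, language: s.language });
   }
   return out;
}

const cleaned = prepareSamples(
   [
      { text: 'Hola', language: 'es' },
      { text: 'hola ', language: 'es' }, // duplicate after trimming and lowercasing
      { text: 'a', language: 'en' }, // too short
      { text: 'Hello there', language: 'en' },
   ],
   2,
   50,
);
console.log(cleaned.map((s) => s.text)); // [ 'Hola', 'Hello there' ]
```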

3. Train Model

npm run train

Trains a TF-IDF + Naive Bayes model using batch processing and saves it to models/language-model.json.

4. Evaluate Model

npm run evaluate

Runs the model against 959 test cases and reports accuracy.

Interactive mode:

npm run evaluate -- -i
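
The accuracy figure is simply the share of test cases whose detected language matches the expected one. The sketch below shows that calculation. TestCase, accuracy, and the stub detector are hypothetical, and the detect function is passed in so the snippet does not depend on the package.

```typescript
interface TestCase {
   text: string;
   expected: string;
}

// Fraction of cases where the detector's answer equals the expected language.
function accuracy(cases: TestCase[], detect: (text: string) => string): number {
   if (cases.length === 0) return 0;
   const correct = cases.filter((c) => detect(c.text) === c.expected).length;
   return correct / cases.length;
}

// Stub detector for the example only: "hola" means Spanish, everything else English.
const stubDetect = (text: string): string => (text.indexOf('hola') !== -1 ? 'es' : 'en');

const score = accuracy(
   [
      { text: 'hola amigo', expected: 'es' },
      { text: 'hello friend', expected: 'en' },
      { text: 'bonjour', expected: 'fr' },
   ],
   stubDetect,
);
console.log(score.toFixed(2)); // 0.67
```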

Text Normalization

import { normalizeText, augmentText } from 'naive-bayes-language-detector';

// Normalize text (lowercase, remove URLs, emails, phone numbers)
const normalized = normalizeText('Hello World! https://example.com');
// 'hello world'

// Augment for training (creates variations with abbreviations)
const variations = augmentText('porque no vienes', 'es');
// ['porque no vienes', 'xq no vienes', ...]
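
A normalizer along these lines can be approximated with a chain of regular expressions. The sketch below is not the package's normalizeText. The function name and every pattern in it are assumptions, chosen only so that the documented example input yields the documented output.

```typescript
// Hypothetical approximation of the normalization steps described above.
function normalizeSketch(text: string): string {
   return text
      .toLowerCase()
      .replace(/https?:\/\/\S+/g, ' ') // URLs
      .replace(/\S+@\S+\.\S+/g, ' ') // email addresses
      .replace(/\+?\d[\d\s().-]{6,}\d/g, ' ') // phone-like digit runs
      .replace(/[!?.,;:"()¿¡]+/g, ' ') // common punctuation
      .replace(/\s+/g, ' ') // collapse whitespace
      .trim();
}

console.log(normalizeSketch('Hello World! https://example.com')); // hello world
console.log(normalizeSketch('Call +1 (555) 123-4567 ok')); // call ok
```

Order matters here: URLs and email addresses are removed before punctuation, since stripping dots and colons first would break them into fragments that no longer match.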

Project Structure

language-detector/
├── src/                    # TypeScript source
│   ├── index.ts           # Main exports
│   ├── types/             # Type definitions
│   ├── utils/             # Utilities (normalization, n-grams, slang)
│   └── inference/         # ML components (vectorizer, classifier, detector)
├── dist/                  # Compiled JavaScript (CommonJS)
├── test/                  # Mocha + Chai test files
├── scripts/               # Training and evaluation scripts
├── models/                # Pre-trained model
│   └── language-model.json
└── data/                  # Training data (not included in npm package)

Type Exports

import type {
   LanguageCode, // 'en' | 'es' | 'fr' | 'it' | 'pt' | 'de'
   DetectionResult,
   DetectionSource,
   SlangDetectionResult,
   PredictionResult,
   VectorizerOptions,
   VectorizerData,
   ClassifierData,
   ModelData,
   AllowedLanguagesOptions, // Options for setAllowedLanguages()
} from 'naive-bayes-language-detector';

Development

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run tests
npm test

# Run tests with coverage
npm run coverage

# Lint code
npm run lint
npm run lint:fix

# Training workflow
npm run download-data
npm run prepare-data
npm run train
npm run evaluate

Tech Stack

| Technology         | Purpose               |
| ------------------ | --------------------- |
| TypeScript         | Type-safe development |
| Node.js            | Runtime environment   |
| Mocha + Chai       | Testing framework     |
| ESLint + Prettier  | Code quality          |
| Husky              | Git hooks             |
| Airbnb Style Guide | Code style            |

Git Hooks

This project uses Husky for Git hooks:

  • pre-commit: Runs lint-staged to lint and format staged .ts files

# Hooks are automatically installed when you run npm install
npm install

Requirements

Performance

| Metric         | Value                 |
| -------------- | --------------------- |
| Inference time | <5ms per text         |
| Model size     | ~1.7MB (JSON)         |
| Accuracy       | 100% (959 test cases) |
| Memory usage   | ~50MB loaded          |

License

ISC

Contributing

Contributions are welcome! Please ensure:

  1. All tests pass (npm test)
  2. Code is linted (npm run lint)
  3. New features include tests

Maintainers


Made with ❤️ for the messaging community