en-es-detector

v1.0.0

Published

22 days ago

Production-ready English/Spanish language detector using hybrid ML, Heuristics, and Bloom Filter approach.

Downloads

102

Vulnerabilities: 0 High, 0 Medium, 0 Low

shashanklandge

language-detection english spanish bloom-filter nlp

Production-Ready English / Spanish Language Detector

A high-performance, hybrid language detection library optimized for telling English and Spanish text apart. It is designed to handle:

CamelCase Technical Jargon (e.g., HelloWorld, AdjustableBanner)
Short Fragments & UI Labels
Mixed Content

It combines Bloom Filters (fast dictionary lookup), Linguistic Heuristics (stopword and pattern matching), and an N-Gram ML Model to stay accurate on inputs where general-purpose detection libraries often fail.
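To make the "fast dictionary lookup" idea concrete, here is a minimal Bloom filter sketch. It is not the package's implementation: the `BloomFilter` class, the FNV-1a hash, the double-hashing scheme, and the sizes are all assumptions chosen for illustration.

```javascript
// Illustrative Bloom filter. NOT the package's implementation: the hash
// function, sizes, and class name are assumptions made for this sketch.
class BloomFilter {
  constructor(bitCount, hashCount) {
    this.bitCount = bitCount;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(bitCount / 8));
  }

  // Seeded 32-bit FNV-1a hash over UTF-16 code units.
  static hash(text, seed) {
    let h = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < text.length; i++) {
      h ^= text.charCodeAt(i);
      h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h;
  }

  // Derive hashCount bit positions with double hashing: h1 + i * h2.
  positions(word) {
    const h1 = BloomFilter.hash(word, 0);
    const h2 = BloomFilter.hash(word, 0x9e3779b9) | 1; // force an odd step
    const result = [];
    for (let i = 0; i < this.hashCount; i++) {
      result.push(((h1 + Math.imul(i, h2)) >>> 0) % this.bitCount);
    }
    return result;
  }

  add(word) {
    for (const p of this.positions(word)) {
      this.bits[p >> 3] |= 1 << (p & 7);
    }
  }

  // True means "possibly present"; false means "definitely absent".
  has(word) {
    return this.positions(word).every(
      (p) => (this.bits[p >> 3] & (1 << (p & 7))) !== 0
    );
  }
}

const englishWords = new BloomFilter(1024, 3);
['hello', 'world', 'banner'].forEach((w) => englishWords.add(w));

console.log(englishWords.has('hello')); // true
console.log(englishWords.has('hola'));  // usually false; false positives are possible
```

The trade-off is memory for certainty: a word that was added is always found, while an unknown word is occasionally reported as present.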

Installation

npm install en-es-detector

Usage

Basic Detection

The main export detect runs the full pipeline (Dictionary + Heuristics + ML).

import { detect } from 'en-es-detector';

const result = detect("HelloWorld");
console.log(result);
// Output: { lang: 'en', confidence: 0.99, method: 'dictionary_strict' }
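A caller will often want to act on the result only when the detector is confident. The sketch below assumes the result shape shown above (`lang`, `confidence`, `method`); the `pickLanguage` helper, its `0.8` threshold, and the `'unknown'` fallback are hypothetical and not part of the package.

```javascript
// Hypothetical helper, not part of en-es-detector: returns the detected
// language only when confidence reaches a caller-chosen threshold.
function pickLanguage(result, minConfidence = 0.8, fallback = 'unknown') {
  if (!result || typeof result.confidence !== 'number') return fallback;
  return result.confidence >= minConfidence ? result.lang : fallback;
}

// Sample object copied from the documented output of detect("HelloWorld").
const sample = { lang: 'en', confidence: 0.99, method: 'dictionary_strict' };
console.log(pickLanguage(sample)); // 'en'

// Made-up low-confidence result, used only to show the fallback path.
console.log(pickLanguage({ lang: 'es', confidence: 0.55 })); // 'unknown'
```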

Low-Level ML Detection

If you want to bypass the dictionary/heuristic layer and use only the N-Gram model (faster but less accurate for technical jargon), use detectML.

import { detectML } from 'en-es-detector';

const result = detectML("HelloWorld");
// Likely less confident or incorrect for jargon
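For intuition about what an N-Gram hashing model computes, here is a rough sketch of character-trigram feature hashing followed by a logistic score. It is an assumption-laden illustration, not the package's model: FNV-1a stands in for MurmurHash3 to keep the code short, the bucket count is tiny, and the weights are untrained zeros.

```javascript
// Illustrative n-gram hashing model. NOT the package's model: bucket count,
// hash function (FNV-1a instead of MurmurHash3), and weights are assumptions.
const BUCKETS = 16;

// 32-bit FNV-1a hash over UTF-16 code units.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Count character trigrams, hashing each one into a fixed number of buckets.
function trigramFeatures(text) {
  const padded = ` ${text.toLowerCase()} `;
  const counts = new Float64Array(BUCKETS);
  for (let i = 0; i + 3 <= padded.length; i++) {
    counts[fnv1a(padded.slice(i, i + 3)) % BUCKETS] += 1;
  }
  return counts;
}

// Logistic regression: sigmoid(bias + weights . features).
function spanishProbability(text, weights, bias) {
  const x = trigramFeatures(text);
  let z = bias;
  for (let i = 0; i < BUCKETS; i++) z += weights[i] * x[i];
  return 1 / (1 + Math.exp(-z));
}

const untrainedWeights = new Float64Array(BUCKETS); // all zeros
console.log(spanishProbability('hola', untrainedWeights, 0)); // 0.5
```

Hashing keeps the feature vector at a fixed size regardless of vocabulary, which is one reason this kind of model is cheap to ship and run.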

Advanced Configuration

You can override the default sensitivity thresholds by passing an options object as the second argument to detect().

Example

import { detect } from 'en-es-detector';

const customOptions = {
    // Make it stricter: 90% of words must be English
    MIN_ENGLISH_RATIO_STRICT: 0.9,
    // Trust the dictionary less
    CONF_DICT_STRICT: 0.95,
    // Confidence to assign when a Spanish stopword is found
    CONF_SPANISH_STOPWORD: 1.0
};

detect("HelloWorld", customOptions);

const result = detect("SomeAmbiguousText", customOptions);
console.log(result);
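Since a mistyped threshold can silently change results, it may help to sanity-check an options object before passing it to `detect()`. The sketch below assumes that every override is a ratio or confidence in the range 0 to 1, which the source does not state explicitly; `validateOptions` is a hypothetical helper, not a package export.

```javascript
// Hypothetical pre-flight check, not part of en-es-detector.
// Assumption: every option is a ratio or confidence between 0 and 1.
function validateOptions(options) {
  const problems = [];
  for (const [key, value] of Object.entries(options)) {
    if (typeof value !== 'number' || Number.isNaN(value)) {
      problems.push(`${key} must be a number`);
    } else if (value < 0 || value > 1) {
      problems.push(`${key} must be between 0 and 1, got ${value}`);
    }
  }
  return problems;
}

console.log(validateOptions({ MIN_ENGLISH_RATIO_STRICT: 0.9, CONF_DICT_STRICT: 0.95 }));
// []
console.log(validateOptions({ CONF_SPANISH_STOPWORD: 100 }));
// [ 'CONF_SPANISH_STOPWORD must be between 0 and 1, got 100' ]
```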


## How It Works

The detector uses a multi-stage pipeline:

1.  **Normalization**: Splits CamelCase (`ItemCarousel` → `Item Carousel`) and kebab-case.
2.  **Dictionary Check**: Checks tokens against a **Bloom Filter** containing ~155k English words.
3.  **Heuristics**: Scans for high-precision Spanish stopwords (e.g., `de`, `la`, `que`) and patterns (e.g., `cion`, `ñ`).
4.  **ML Inference**: Runs a pre-trained N-Gram hashing model (MurmurHash3 + Logistic Regression weights) as a fallback.
5.  **Ensemble Decision**: Combines all signals. For example, if the ML model predicts "Spanish" but the Dictionary sees 100% English words, the detector correctly overrides it to "English".
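The stages above can be sketched in a few lines. This is a toy model under stated assumptions, not the package's code: a `Set` with six words stands in for the Bloom filter, the rule order is a guess, and `normalize` and `decide` are hypothetical names. The ML step is represented only by an `mlGuess` argument.

```javascript
// Toy pipeline. NOT the package's implementation: word lists, rule order,
// and function names are assumptions made for illustration.
const ENGLISH_WORDS = new Set(['item', 'carousel', 'hello', 'world', 'adjustable', 'banner']);
const SPANISH_STOPWORDS = new Set(['de', 'la', 'que']);

// Stage 1: split CamelCase and kebab-case into lowercase tokens.
function normalize(text) {
  return text
    .replace(/(\p{Ll}|\d)(\p{Lu})/gu, '$1 $2') // ItemCarousel -> Item Carousel
    .replace(/[-_]+/g, ' ')                    // item-carousel -> item carousel
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean);
}

// Stages 2, 3 and 5: dictionary check, Spanish heuristics, then fall back
// to whatever the ML model guessed.
function decide(text, mlGuess) {
  const tokens = normalize(text);
  if (tokens.length === 0) return mlGuess;

  const known = tokens.filter((t) => ENGLISH_WORDS.has(t)).length;
  if (known === tokens.length) return 'en'; // 100% English words override ML

  const spanish = tokens.some(
    (t) => SPANISH_STOPWORDS.has(t) || /cion|ñ/.test(t)
  );
  if (spanish) return 'es';

  return mlGuess;
}

console.log(normalize('ItemCarousel'));          // [ 'item', 'carousel' ]
console.log(decide('ItemCarousel', 'es'));       // 'en'
console.log(decide('la-casa-de-papel', 'en'));   // 'es'
```

The second call shows the override described in step 5: the ML guess says Spanish, but every token is a known English word, so the result is English.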

## License

ISC
