en-es-detector
v1.0.0
Production-ready English/Spanish language detector using a hybrid approach: an ML model, heuristics, and a Bloom filter.
# en-es-detector
Production-Ready English / Spanish Language Detector
A high-performance, hybrid language detection library optimized for distinguishing English from Spanish text. It is designed to handle:
- CamelCase technical jargon (e.g., `HelloWorld`, `AdjustableBanner`)
- Short fragments and UI labels
- Mixed content
It combines Bloom Filters (fast dictionary lookup), Linguistic Heuristics (stopword/pattern matching), and an N-Gram ML Model to achieve high accuracy where standard libraries fail.
## Installation

    npm install en-es-detector

## Usage
### Basic Detection

The main export, `detect`, runs the full pipeline (Dictionary + Heuristics + ML).

    import { detect } from 'en-es-detector';

    const result = detect("HelloWorld");
    console.log(result);
    // Output: { lang: 'en', confidence: 0.99, method: 'dictionary_strict' }

### Low-Level ML Detection
If you want to bypass the dictionary/heuristic layer and use only the N-Gram model (faster but less accurate for technical jargon), use `detectML`.

    import { detectML } from 'en-es-detector';

    const result = detectML("HelloWorld");
    // Likely less confident or incorrect for jargon

## Advanced Configuration
You can override the default sensitivity thresholds by passing an options object as the second argument to detect().
| Option | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| MIN_ENGLISH_RATIO_STRICT | Number | 0.8 | Minimum ratio of valid English words required to trigger a "Strict Dictionary" match (High Confidence). |
| MIN_ENGLISH_RATIO_LOOSE | Number | 0.6 | Minimum ratio of English words to trigger a fallback "Loose Dictionary" match if no Spanish stopwords exist. |
| MAX_ENGLISH_RATIO_FOR_SPANISH | Number | 0.3 | Maximum allowed ratio of English words when strong Spanish stopwords are present. Prevents false positives. |
| CONF_DICT_STRICT | Number | 0.99 | Confidence score assigned when strict dictionary criteria are met. |
| CONF_DICT_LOOSE | Number | 0.85 | Confidence score assigned when loose dictionary criteria are met. |
| CONF_SPANISH_STOPWORD | Number | 0.95 | Confidence score boost for texts containing high-frequency Spanish stopwords. |
| MIN_ML_CONFIDENCE_TO_OVERRIDE_DICT | Number | 0.95 | If the ML model is this confident in its prediction, it can override a "Loose Dictionary" match to prevent errors. |
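
To make the interplay of these thresholds concrete, here is a minimal sketch of how such options might be merged with the defaults and applied. It is an illustration only: the function `decide`, the shape of its `signals` argument, the order of the checks, and every method label other than `dictionary_strict` are assumptions, and the library's actual decision logic may differ.

```javascript
// Defaults copied from the options table above.
const DEFAULTS = {
  MIN_ENGLISH_RATIO_STRICT: 0.8,
  MIN_ENGLISH_RATIO_LOOSE: 0.6,
  MAX_ENGLISH_RATIO_FOR_SPANISH: 0.3,
  CONF_DICT_STRICT: 0.99,
  CONF_DICT_LOOSE: 0.85,
  CONF_SPANISH_STOPWORD: 0.95,
  MIN_ML_CONFIDENCE_TO_OVERRIDE_DICT: 0.95,
};

// Hypothetical decision function, NOT the library's implementation.
// signals = { englishRatio, hasSpanishStopword, ml: { lang, confidence } }
function decide(signals, options = {}) {
  const o = { ...DEFAULTS, ...options }; // caller-supplied values win
  const { englishRatio, hasSpanishStopword, ml } = signals;

  // A Spanish stopword plus few English words: treat as Spanish.
  if (hasSpanishStopword && englishRatio <= o.MAX_ENGLISH_RATIO_FOR_SPANISH) {
    return { lang: 'es', confidence: o.CONF_SPANISH_STOPWORD, method: 'spanish_stopword' };
  }
  // Enough dictionary hits: strict English match.
  if (englishRatio >= o.MIN_ENGLISH_RATIO_STRICT) {
    return { lang: 'en', confidence: o.CONF_DICT_STRICT, method: 'dictionary_strict' };
  }
  // Loose match only without Spanish stopwords; a very confident
  // ML prediction for another language may override it.
  if (!hasSpanishStopword && englishRatio >= o.MIN_ENGLISH_RATIO_LOOSE) {
    if (ml.lang !== 'en' && ml.confidence >= o.MIN_ML_CONFIDENCE_TO_OVERRIDE_DICT) {
      return { lang: ml.lang, confidence: ml.confidence, method: 'ml_override' };
    }
    return { lang: 'en', confidence: o.CONF_DICT_LOOSE, method: 'dictionary_loose' };
  }
  // Nothing conclusive: fall back to the ML prediction.
  return { lang: ml.lang, confidence: ml.confidence, method: 'ml' };
}

const signals = {
  englishRatio: 0.85,
  hasSpanishStopword: false,
  ml: { lang: 'en', confidence: 0.7 },
};
console.log(decide(signals).method); // dictionary_strict
console.log(decide(signals, { MIN_ENGLISH_RATIO_STRICT: 0.9 }).method); // dictionary_loose
```

With the default thresholds, an English ratio of 0.85 clears the strict bar. Raising `MIN_ENGLISH_RATIO_STRICT` to 0.9 pushes the same input down to the loose rule, which is the kind of effect the options are meant to give you.
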
### Example

    import { detect } from 'en-es-detector';

    const customOptions = {
      // Make it stricter: 90% of words must be English
      MIN_ENGLISH_RATIO_STRICT: 0.9,
      // Trust the dictionary less
      CONF_DICT_STRICT: 0.95,
      // Confidence to assign when a Spanish stopword is found
      CONF_SPANISH_STOPWORD: 1.0,
    };

    detect("HelloWorld", customOptions);
    const result = detect("SomeAmbiguousText", customOptions);

## How It Works
The detector uses a multi-stage pipeline:
1. **Normalization**: Splits CamelCase (`ItemCarousel` → `Item Carousel`) and kebab-case.
2. **Dictionary Check**: Checks tokens against a **Bloom Filter** containing ~155k English words.
3. **Heuristics**: Scans for high-precision Spanish stopwords (e.g., `de`, `la`, `que`) and patterns (e.g., `cion`, `ñ`).
4. **ML Inference**: Runs a pre-trained N-Gram hashing model (MurmurHash3 + Logistic Regression weights) as a fallback.
5. **Ensemble Decision**: Combines all signals. For example, if the ML model predicts "Spanish" but the Dictionary sees 100% English words, the detector correctly overrides it to "English".
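
The first three stages can be sketched roughly as follows. This is a simplified illustration, not the library's code: the names `splitIdentifier`, `BloomFilter`, `englishRatio`, and `hasSpanishSignal` are hypothetical, the hash is an FNV-1a-style function with double hashing (an assumption about the dictionary stage, chosen for brevity), and the four-word list stands in for the ~155k-word dictionary.

```javascript
// Stage 1 (sketch): split CamelCase, kebab-case, and snake_case into lowercase tokens.
function splitIdentifier(text) {
  return text
    .replace(/([\p{Ll}\d])(\p{Lu})/gu, '$1 $2') // ItemCarousel -> Item Carousel
    .split(/[\s\-_]+/)
    .filter(Boolean)
    .map((t) => t.toLowerCase());
}

// FNV-1a-style 32-bit hash over UTF-16 code units, with a seed.
function fnv1a(str, seed) {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Stage 2 (sketch): a Bloom filter answers "possibly present" or "definitely absent".
class BloomFilter {
  constructor(bits = 1 << 16, hashes = 4) {
    this.bits = bits;
    this.hashes = hashes;
    this.data = new Uint8Array(Math.ceil(bits / 8));
  }
  // Double hashing: index_i = (h1 + i * h2) mod bits.
  *indexes(word) {
    const h1 = fnv1a(word, 0);
    const h2 = (fnv1a(word, 0x9e3779b9) | 1) >>> 0; // force an odd, unsigned step
    for (let i = 0; i < this.hashes; i++) {
      yield (h1 + i * h2) % this.bits;
    }
  }
  add(word) {
    for (const i of this.indexes(word)) this.data[i >> 3] |= 1 << (i & 7);
  }
  // May return a false positive, never a false negative.
  has(word) {
    for (const i of this.indexes(word)) {
      if ((this.data[i >> 3] & (1 << (i & 7))) === 0) return false;
    }
    return true;
  }
}

// Fraction of tokens that the filter reports as English words.
function englishRatio(tokens, filter) {
  if (tokens.length === 0) return 0;
  return tokens.filter((t) => filter.has(t)).length / tokens.length;
}

// Stage 3 (sketch): stopwords and patterns taken from the examples in step 3.
const SPANISH_STOPWORDS = new Set(['de', 'la', 'que']);
function hasSpanishSignal(tokens) {
  return tokens.some((t) => SPANISH_STOPWORDS.has(t) || /ñ|cion/.test(t));
}

const english = new BloomFilter();
['hello', 'world', 'item', 'carousel'].forEach((w) => english.add(w));

const tokens = splitIdentifier('HelloWorld');
console.log(tokens, englishRatio(tokens, english), hasSpanishSignal(tokens));
```

A Bloom filter trades a small false-positive rate for a compact, fixed-size bit array, which is presumably why it suits shipping a large word list inside a package. The ratio it produces is the kind of signal the thresholds in Advanced Configuration operate on.
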
## License
ISC