english-validator

v2.0.2

Published

5 days ago

Detect whether a sentence is English or non-English. Returns true/false with high accuracy using dictionary lookup and trigram analysis.


nuvayutech

english language-detection non-english nlp text-analysis language detector validator is-english sentence-detection typescript

Features

Dictionary-powered — 274k+ English word dictionary for accurate word-level checks
Trigram analysis — uses franc as a secondary signal for statistical language detection
Lightweight API — single function call, returns a boolean
Configurable — adjustable thresholds, minimum word length, number handling
Built-in caching — LRU-style memoization for fast repeated lookups
TypeScript support — ships with full type declarations and JSDoc
ESM & CJS — works with import and require (zero runtime dependencies)

Installation

npm install english-validator

Quick Start

ESM (React, Next.js, Vite, modern Node.js)

import { isEnglish, detectNonEnglishText } from "english-validator";

isEnglish("The quick brown fox jumps over the lazy dog");
// => true

isEnglish("Ceci est une phrase en français");
// => false

// Or use the inverse API:
detectNonEnglishText("Das ist ein deutscher Satz");
// => true  (it IS non-English)

detectNonEnglishText("Hello, how are you?");
// => false (it is NOT non-English)

CommonJS (Node.js)

const { isEnglish, detectNonEnglishText } = require("english-validator");

console.log(isEnglish("Hello world")); // true

TypeScript

The package ships with full type declarations. Import types directly:

import {
  isEnglish,
  detectNonEnglishText,
  matchesDocumentPattern,
  clearLanguageDetectorCaches,
} from "english-validator";
import type { DetectionOptions } from "english-validator";

// Use DetectionOptions for custom configuration
const options: DetectionOptions = {
  englishThreshold: 0.7,
  minWordLength: 3,
  allowNumbers: false,
};

const result: boolean = isEnglish("Check this text", options);

API

`isEnglish(text, options?)`

Returns true if the text is English, false otherwise.

| Parameter | Type | Description |
| --- | --- | --- |
| text | string \| null \| undefined | Text to analyse. Returns true for empty/null/undefined. |
| options | DetectionOptions | Optional configuration (see below) |

isEnglish("Hello world");          // true
isEnglish("Bonjour le monde");     // false
isEnglish("", { englishThreshold: 0.5 }); // true (empty)

`detectNonEnglishText(text, options?)`

Returns true if the text is non-English, false if English. Inverse of isEnglish.

detectNonEnglishText("Das ist Deutsch");   // true
detectNonEnglishText("This is English");   // false

`matchesDocumentPattern(text)`

Returns true if the text matches document ID patterns like AEM01-WI-DSU06-SD01.

matchesDocumentPattern("AEM01-WI-DSU06-SD01"); // true
matchesDocumentPattern("Hello world");          // false
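The exact pattern the library matches is not documented here. As a rough sketch only, the check below assumes a document ID is two or more hyphen-separated segments of uppercase letters followed by optional digits. The names `DOCUMENT_ID` and `looksLikeDocumentId` are hypothetical and not part of the package.

```typescript
// Assumed shape (not the library's actual regex): segments such as "AEM01" or "WI",
// joined by hyphens, with at least one hyphen in the whole string.
const DOCUMENT_ID = /^[A-Z]+\d*(?:-[A-Z]+\d*)+$/;

// Hypothetical stand-in for matchesDocumentPattern.
function looksLikeDocumentId(text: string): boolean {
  return DOCUMENT_ID.test(text.trim());
}

console.log(looksLikeDocumentId("AEM01-WI-DSU06-SD01")); // true
console.log(looksLikeDocumentId("Hello world"));         // false
console.log(looksLikeDocumentId("NATO"));                // false (no hyphenated segments)
```

Anchoring the expression with `^` and `$` keeps ordinary sentences that merely contain an ID from matching as a whole.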

`clearLanguageDetectorCaches()`

Clears the internal LRU memoization caches. Call this in long-running applications to free memory or to reset state between independent detection sessions.

clearLanguageDetectorCaches(); // frees all cached results

`DetectionOptions`

Configuration object accepted by isEnglish and detectNonEnglishText:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| englishThreshold | number | 0.8 | Ratio of English words needed to classify as English (0.0–1.0) |
| minWordLength | number | 2 | Words shorter than this are skipped during analysis |
| allowNumbers | boolean | true | Treat standalone numbers as valid English tokens |
| allowAbbreviations | boolean | true | Treat uppercase abbreviations (e.g. NATO, FBI) as valid English tokens |
| customPatterns | RegExp[] | — | Regex patterns to strip from text before validation |
| excludeWords | string[] | — | Words to remove from text before validation (case-insensitive, whole-word) |

Note: Short texts (4 words or fewer) automatically use a relaxed threshold of 0.6, regardless of the configured englishThreshold, so that short English fragments are not misclassified as non-English.
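The short-text rule can be pictured as a small threshold selector. This is a minimal sketch based only on the note above: `effectiveThreshold` is a hypothetical helper, and the whitespace split is an assumption about tokenization that the library may not share.

```typescript
const SHORT_TEXT_MAX_WORDS = 4;   // "4 words or fewer", per the note
const SHORT_TEXT_THRESHOLD = 0.6; // relaxed threshold, per the note
const DEFAULT_THRESHOLD = 0.8;    // documented default for englishThreshold

// Hypothetical helper: picks the threshold that would apply to a given text.
function effectiveThreshold(text: string, configured: number = DEFAULT_THRESHOLD): number {
  // Assumption: words are separated by whitespace.
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0);
  return words.length <= SHORT_TEXT_MAX_WORDS ? SHORT_TEXT_THRESHOLD : configured;
}

console.log(effectiveThreshold("Order is ready"));                         // 0.6
console.log(effectiveThreshold("The quick brown fox jumps over the dog")); // 0.8
console.log(effectiveThreshold("one two three four five", 0.95));          // 0.95
```

Under this reading, a custom englishThreshold only takes effect once the text has five or more words.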

Quick Examples

import { isEnglish } from "english-validator";

// englishThreshold — lower it to allow mixed-language text
isEnglish("Hello mundo friend", { englishThreshold: 0.5 });       // true (50%+ English)

// minWordLength — skip words shorter than 3 characters ("I", "am", "a", "of") during analysis
isEnglish("I am a big fan of this", { minWordLength: 3 });         // true

// allowNumbers — treat "42" as a valid English token (default: true)
isEnglish("Order 42 is ready", { allowNumbers: true });            // true

// allowAbbreviations — treat "NATO", "FBI" as valid (default: true)
isEnglish("NATO signed the agreement", { allowAbbreviations: true }); // true

// customPatterns — strip JIRA IDs before validation
isEnglish("Fix bug PROJ-1234 in login flow", {
  customPatterns: [/[A-Z]+-\d+/g],
});                                                                 // true

// excludeWords — remove brand names / jargon before validation
isEnglish("Deploy Kubernetes pods and monitor dashboards", {
  excludeWords: ["Kubernetes"],
});                                                                 // true

Usage Examples

Custom Patterns — Strip Unwanted Tokens

Use customPatterns to remove regex-matched tokens (e.g. JIRA ticket IDs, codes) before validation:

import { isEnglish } from "english-validator";

// JIRA ticket IDs would normally fail the dictionary check
isEnglish("Fix bug PROJ-1234 in login flow", {
  customPatterns: [/PROJ-\d+/g],
});
// => true

// Multiple patterns
isEnglish("REF:ABC123 the system is operational CODE:XY99", {
  customPatterns: [/REF:\w+/g, /CODE:\w+/g],
});
// => true
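To make the stripping step concrete, here is a rough sketch of how a list of patterns might be removed from text before validation. `stripPatterns` is a hypothetical helper, not a function exported by the package, and the library's own preprocessing may differ.

```typescript
// Hypothetical preprocessing helper, for illustration only.
function stripPatterns(text: string, patterns: RegExp[]): string {
  let result = text;
  for (const pattern of patterns) {
    // Force the global flag so every occurrence is removed, not just the first.
    const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
    result = result.replace(new RegExp(pattern.source, flags), " ");
  }
  // Collapse the whitespace left behind by removed tokens.
  return result.replace(/\s+/g, " ").trim();
}

console.log(stripPatterns("REF:ABC123 the system is operational CODE:XY99", [/REF:\w+/g, /CODE:\w+/g]));
// "the system is operational"
console.log(stripPatterns("Fix bug PROJ-1234 in login flow", [/PROJ-\d+/]));
// "Fix bug in login flow"
```

Replacing each match with a space rather than an empty string avoids gluing the neighbouring words together.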

Exclude Words — Remove Known Non-Dictionary Terms

Use excludeWords to drop specific words (brand names, internal jargon) before validation:

import { isEnglish } from "english-validator";

// "Kubernetes" and "Grafana" aren't in the dictionary
isEnglish("Deploy Kubernetes pods and monitor with Grafana dashboards", {
  excludeWords: ["Kubernetes", "Grafana"],
});
// => true

// Case-insensitive and whole-word only
isEnglish("The ACME widget is working fine", {
  excludeWords: ["acme"],
});
// => true  ("acme" removed, remaining text is English)
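Case-insensitive, whole-word removal is commonly done with a word-boundary regex. The sketch below shows one way to do it; `escapeRegExp` and `removeWords` are hypothetical names, and this is not necessarily how the package implements excludeWords.

```typescript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: drops the listed words wherever they appear as whole words.
function removeWords(text: string, words: string[]): string {
  if (words.length === 0) return text;
  const alternatives = words.map(escapeRegExp).join("|");
  // \b restricts matches to whole words; the "i" flag ignores case.
  const pattern = new RegExp(`\\b(?:${alternatives})\\b`, "gi");
  return text.replace(pattern, " ").replace(/\s+/g, " ").trim();
}

console.log(removeWords("The ACME widget is working fine", ["acme"]));
// "The widget is working fine"
console.log(removeWords("Acmeville is not affected", ["acme"]));
// "Acmeville is not affected"
```

Note that `\b` in JavaScript treats only ASCII letters, digits, and underscore as word characters, so words with accented letters would need a different boundary check.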

Combining Options

import { isEnglish } from "english-validator";
import type { DetectionOptions } from "english-validator";

const opts: DetectionOptions = {
  customPatterns: [/TKT-\d+/g],
  excludeWords: ["Datadog", "Terraform"],
  englishThreshold: 0.7,
  allowAbbreviations: true,
};

isEnglish("TKT-5678 Deploy Terraform stack monitored by Datadog", opts);
// => true

React Component

import { isEnglish } from "english-validator";

function LanguageCheck({ text }: { text: string }) {
  return (
    <div>
      {isEnglish(text) ? "✅ English" : "❌ Not English"}
    </div>
  );
}

Node.js API Middleware

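import express from "express";
import { detectNonEnglishText } from "english-validator";

const app = express();
app.use(express.json()); // parse JSON bodies so req.body.text is available

app.post("/api/comment", (req, res) => {
  if (detectNonEnglishText(req.body.text)) {
    return res.status(400).json({ error: "Only English text is accepted" });
  }
  // proceed...
});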

Custom Threshold

import { isEnglish } from "english-validator";
import type { DetectionOptions } from "english-validator";

// More lenient — allows mixed-language text
const lenient: DetectionOptions = { englishThreshold: 0.5 };
isEnglish("Hello mundo", lenient); // true (50%+ English)

// Stricter — requires almost all words to be English
const strict: DetectionOptions = { englishThreshold: 0.95 };
isEnglish("Hello mundo", strict);  // false

Use Cases

Chatbots & Virtual Assistants — validate that user messages are in English before routing to an English-only NLP pipeline or LLM
Content Moderation — reject or flag non-English submissions in forums, comment sections, or review platforms
Form Validation — ensure text fields (feedback, support tickets, descriptions) contain English input
Data Pipelines & ETL — filter English-only records from multilingual datasets during ingestion
CMS & Publishing — gate content uploads to English-only workflows
Search Indexing — tag or partition documents by language before indexing
Email / Notification Filtering — detect and route non-English inbound messages
API Gateways — enforce English-only payloads at the middleware layer

How It Works

Preprocessing — strips document IDs, geographical terms, special characters, user-supplied customPatterns, and excludeWords
Dictionary lookup — each word is checked against a 274k+ English word set
Non-English screening — detects characters that do not occur in standard English spelling (ä, ö, ü, ñ, etc.), word suffixes (-keit, -ción, -zione), and function words (le, la, der, die, das)
Contraction resolution — splits contractions on apostrophes (e.g. don't → don) and rechecks the base word against the dictionary
English ratio — calculates the percentage of recognized English words
Trigram fallback — if the ratio is below the threshold, franc provides a statistical language classification as a tiebreaker
Result — returns a boolean
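The dictionary lookup, contraction handling, and ratio steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the package's code: `DICTIONARY` is a tiny stand-in for the 274k+ word set, `isKnownWord` and `englishRatio` are hypothetical names, the tokenizer is a guess, and the franc fallback is not reproduced.

```typescript
// Stand-in dictionary; the real package uses a much larger word set.
const DICTIONARY = new Set(["the", "system", "is", "working", "hello", "friend", "don", "do"]);

function isKnownWord(word: string): boolean {
  const lower = word.toLowerCase();
  if (DICTIONARY.has(lower)) return true;
  // Contraction handling: recheck the part before the apostrophe ("don't" -> "don").
  const base = lower.split("'")[0] ?? "";
  return base.length > 0 && DICTIONARY.has(base);
}

function englishRatio(text: string, minWordLength = 2): number {
  // Assumed tokenizer: runs of Latin letters and apostrophes.
  const tokens = text.match(/[A-Za-zÀ-ÿ']+/g) ?? [];
  const words = tokens.filter((w) => w.length >= minWordLength);
  if (words.length === 0) return 1; // nothing to judge, mirroring "empty text is English"
  return words.filter(isKnownWord).length / words.length;
}

console.log(englishRatio("The system is working")); // 1
console.log(englishRatio("Hello mundo friend"));    // 2 of 3 words recognised
console.log(isKnownWord("Don't"));                  // true, via the base word "don"
```

The resulting ratio would then be compared with the applicable threshold, with the trigram classifier consulted only when the ratio falls short.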

Supported Non-English Language Detection

The library detects non-English text across multiple language families using three complementary techniques: character analysis, suffix matching, and vocabulary/function-word detection.

| Language | Characters | Suffixes | Vocabulary / Function Words |
|---|---|---|---|
| German | ä ö ü ß | -keit, -schaft | und, oder, aber, wenn, weil, dass, nicht, kein · der, die, das, den, dem, ein, eine |
| French | é è ê ë à â ç ù û ÿ æ œ | -eur | est, sont, être, avoir, faire, quand, où, pourquoi · le, la, les, du, des, dans, avec |
| Spanish | ñ á í ó ú ¡ ¿ | -ción | que, como, porque, pero, cuando, donde, este, esta · el, los, las, del, al, con, sin, por |
| Italian | ì ò | -zione | sono, essere, avere, fare, dire, come, quando, dove · il, lo, gli |
| Dutch | — | -baar, -lijk | maar, want, omdat, hoewel, terwijl, dus · het, een, op, aan, voor, met, door |
| Portuguese | ção | -agem | eu, tu, ele, ela, nós, isto, isso, aquilo · os, dos, das, nos, nas, um, uma |
| Turkish | ş ğ ı | — | ben, sen, biz, siz, onlar, bana, sana, benim, senin |
| Scandinavian | å ø æ | — | jeg, mig, min, mit, dig, din, han, hun, den, det, denne, dette |
| Polish | ł ń ś ź ż ą ć ę | — | (character-level detection) |
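The three techniques in the table can be combined into a simple signal collector. The sketch below uses only a small subset of the table's data, and `nonEnglishSignals` is a hypothetical function; how the library actually weighs these signals against the dictionary ratio is not specified here.

```typescript
// Small subset of the table above, for illustration only.
const NON_ENGLISH_CHARS = /[äöüßñçéèêşğıåøłńśźż]/;
const NON_ENGLISH_SUFFIXES = ["keit", "schaft", "ción", "zione", "lijk"];
const FUNCTION_WORDS = new Set(["und", "der", "das", "le", "les", "el", "los", "il", "het"]);

type Signal = "character" | "suffix" | "function-word";

// Hypothetical helper: reports which kinds of non-English evidence appear in the text.
function nonEnglishSignals(text: string): Signal[] {
  const lower = text.toLowerCase();
  const signals = new Set<Signal>();
  if (NON_ENGLISH_CHARS.test(lower)) signals.add("character");
  for (const raw of lower.split(/\s+/)) {
    const word = raw.replace(/[.,;:!?¡¿"()]/g, ""); // drop surrounding punctuation
    if (word.length === 0) continue;
    if (FUNCTION_WORDS.has(word)) signals.add("function-word");
    // Require a stem before the suffix so the suffix on its own is not flagged.
    if (NON_ENGLISH_SUFFIXES.some((s) => word.length > s.length && word.endsWith(s))) {
      signals.add("suffix");
    }
  }
  return Array.from(signals);
}

console.log(nonEnglishSignals("Die Möglichkeit der Freundschaft"));
// ["character", "suffix", "function-word"]
console.log(nonEnglishSignals("The quick brown fox")); // []
```

Several function words in the table (for example die, den, min) are also valid English words, which is presumably why these signals complement the dictionary check rather than replace it.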

Performance

| Aspect | Detail |
|---|---|
| Dictionary lookups | O(1) via Set (274k+ entries) |
| Word cache | LRU with 5,000 entry limit |
| Franc cache | LRU with 1,000 entry limit |
| Regex patterns | Precompiled at module load — zero runtime compilation |
| Geographical patterns | Built once from dictionary data at module initialisation |
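For readers unfamiliar with bounded LRU caches, the sketch below shows one common way to build one on top of a `Map`, which iterates in insertion order. `LruCache` is a hypothetical class written for illustration; the package's internal cache implementation is not documented here and may differ.

```typescript
// Illustrative LRU cache with a fixed entry limit.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.limit) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  clear(): void {
    this.entries.clear();
  }

  get size(): number {
    return this.entries.size;
  }
}

const cache = new LruCache<string, boolean>(2);
cache.set("hello", true);
cache.set("mundo", false);
cache.get("hello");       // touch "hello" so "mundo" becomes the oldest entry
cache.set("world", true); // exceeds the limit and evicts "mundo"
console.log(cache.get("mundo")); // undefined
console.log(cache.size);         // 2
```

A `clear()` method like the one here is the kind of operation clearLanguageDetectorCaches() would trigger on each internal cache.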

Running Tests

npm test

Contributing

Fork the repo
Create a feature branch (git checkout -b feat/my-feature)
Commit your changes (git commit -am 'Add my feature')
Push to the branch (git push origin feat/my-feature)
Open a Pull Request
