language-detector-web

v2.0.0

Published

7 months ago

Efficient Language Detection for Multilingual Documents


maximesobrier

language detection nlp multilingual documents

Efficient Language Detection for Multilingual Documents

LanguageDetector is a TypeScript library designed to detect the languages used in web pages. It supports 102 languages. A research paper with more information will be published soon.

This library has no dependencies. It has been tested server-side with Node.js and should also work in the browser.

Example

import LanguageDetector from 'language-detector-web';

const detector = new LanguageDetector();

const languages = detector.getSupportedLanguages();
console.log(languages); // ["af", "am", "ar", "as", "az", "be", "bg", "bn", "br", "bs", …]

const results = detector.getLanguages('This is an English text.'); // ['en']
console.log(`The main languages are ${results.join(' ')}.`); // The main languages are en.

Installation

npm install language-detector-web

Usage

Importing the Class

import LanguageDetector from 'language-detector-web';

Creating an Instance

const detector = new LanguageDetector();

Methods

LanguageDetector(mergeResults?, mergeDatasets?, skipSimilar?)

Creates an instance of LanguageDetector.

mergeResults: Merges languages written in different alphabets (Simplified and Traditional Chinese, Bengali and Romanized Bengali, etc.). Example: { 'zh': ['zhs', 'zht'], 'bn': ['bnr'], 'hi': ['hir'] }
mergeDatasets: Merges special datasets with a language. Example: { 'code': 'en', 'misc': 'en' }
skipSimilar: Skips similar languages (for the top result only). Defaults to false.
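
To illustrate the shape of a mergeResults map, here is a minimal sketch of folding variant scores into a parent language code. The helper `mergeVariantScores` and the `Scores` and `MergeMap` types are hypothetical and not part of language-detector-web. How the library actually combines variant scores is not documented here, so the sketch assumes the highest score among the variants wins.

```typescript
// Hypothetical helper for illustration; not part of language-detector-web.
type Scores = Record<string, number>;
type MergeMap = Record<string, string[]>;

function mergeVariantScores(scores: Scores, mergeMap: MergeMap): Scores {
  // Work on a copy so the caller's object is left untouched.
  const merged: Scores = { ...scores };
  for (const [target, variants] of Object.entries(mergeMap)) {
    for (const variant of variants) {
      if (!(variant in merged)) continue;
      // Assumption: keep the best score among the target and its variants.
      merged[target] =
        target in merged
          ? Math.max(merged[target], merged[variant])
          : merged[variant];
      delete merged[variant];
    }
  }
  return merged;
}

const mergedScores = mergeVariantScores(
  { zhs: 40, zht: 55, en: 10 },
  { zh: ['zhs', 'zht'] }
);
console.log(mergedScores); // { en: 10, zh: 55 }
```

A different combination rule (a sum, for example) would change only the line marked as an assumption.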

getSupportedLanguages()

Returns the list of supported languages as ISO 639-1 codes: en (English), fr (French), nl (Dutch), etc.

getLanguagesWithScores(rawText)

Returns the score for each language supported:

{ 'en': 25.6, 'zh': -136.0, 'nl': 0, ...}

Scores can be 0 or negative. This library was designed and tested with the visible text of a web page, without any HTML content. This function cleans up the text first: emojis are removed, etc. Scores will likely increase with the length of the page.

getLanguages(rawText, minimumRatio?)

Returns the most likely language(s) used in the page, ordered from highest score to lowest score.

minimumRatio: minimum ratio to the highest score required for a language to be included, from 0.0 to 1.0. Defaults to 0.8.

If the language with the highest score has a value of 100, only languages with a score of 80 (0.8 ratio) or more are returned.
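
The ratio rule above can be sketched as a small standalone function. The helper `filterByRatio` and the `Scores` type are hypothetical and not part of language-detector-web. The handling of a non-positive top score is an assumption, since the behavior in that case is not described here.

```typescript
// Hypothetical helper for illustration; not part of language-detector-web.
type Scores = Record<string, number>;

function filterByRatio(scores: Scores, minimumRatio = 0.8): string[] {
  // Rank language codes from highest to lowest score.
  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0 || ranked[0][1] <= 0) {
    // Assumption: without a positive top score, report no language.
    return [];
  }
  // Keep every language whose score reaches the ratio of the top score.
  const threshold = ranked[0][1] * minimumRatio;
  return ranked
    .filter(([, score]) => score >= threshold)
    .map(([code]) => code);
}

const example: Scores = { en: 100, nl: 85, fr: 79, zh: -136 };
console.log(filterByRatio(example)); // ['en', 'nl']
console.log(filterByRatio(example, 0.5)); // ['en', 'nl', 'fr']
```

With the default ratio of 0.8 the threshold is 80, so fr (79) is excluded. Lowering the ratio to 0.5 moves the threshold to 50 and includes it.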

Configuration

The list of supported languages and their attributes (top letters and words) is contained in languages.json. This library is built with the top 10,000 words and letters for each language. Other datasets are available on GitHub: top 1k, 2k, 5k, 10k and 20k. See the research paper (coming soon) for the performance of each dataset.

Tests

To run tests, use this command:

npm test

Most test files were created with automated translation tools. Since the validity of their content has not been verified, test files that fail (.txt.failed extension) have been disabled. To force these test files to be used, run this command:

FORCE_ALL_TESTS=true npm test

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

License

This project is licensed under the MIT License.
