
synonym-optimizer v5.3.0

Finds the text which has the least number of repetitions

Downloads: 648

Readme

synonym-optimizer

Gives a score to a string depending on the variety of the synonyms used.

For instance, let's compare "The coffee is good. I love that coffee." with "The coffee is good. I love that beverage." The second alternative is better because a synonym is used for coffee, so this module will give it a better score.

The lower the score, the better.

Fully supported languages are French, German, English, Italian and Spanish.

What it does / How it works:

  • single words are extracted with a tokenizer, wink-tokenizer
  • words are lowercased
  • stopwords are removed
    • for fully supported languages, a default stopword list is included, which you can customize
    • for all other languages, no default list is included, but you can provide a custom stopword list
  • for fully supported languages, words are stemmed using snowball-stemmer (for all other languages, no stemming is done)
  • when the same word appears multiple times, it raises the score depending on the distance between the two occurrences (if the occurrences are close together, the score is raised a lot)
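The pipeline above can be sketched roughly as follows. This is a toy illustration, not the module's actual implementation: the stopword set, regex tokenizer and proximity weighting are simplified stand-ins, and stemming is omitted.

```javascript
// Toy stopword list (the real module ships per-language lists).
const STOPWORDS = new Set(['the', 'is', 'that', 'i']);

function score(text) {
  // 1. tokenize (crude regex split; the real module uses wink-tokenizer)
  // 2. lowercase
  const tokens = text.toLowerCase().match(/[a-z]+/g) || [];
  // 3. remove stopwords (stemming omitted in this sketch)
  const words = tokens.filter((w) => !STOPWORDS.has(w));
  // 4. each repetition raises the score; closer occurrences weigh more
  let total = 0;
  const lastSeen = new Map();
  words.forEach((word, i) => {
    if (lastSeen.has(word)) {
      total += 1 / (i - lastSeen.get(word)); // closer => larger penalty
    }
    lastSeen.set(word, i);
  });
  return total;
}

console.log(score('The coffee is good. I love that coffee.'));   // 0.333…
console.log(score('The coffee is good. I love that beverage.')); // 0
```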

Designed primarily to test the output of an NLG (Natural Language Generation) system.

The stemmer is not perfect. For instance, in Italian, cameriere and cameriera have the same stem (camerier), while camerieri and cameriera have different ones (camer and camerier).

Installation

npm install synonym-optimizer

Usage

const synOptimizer = require('synonym-optimizer');

const alts = [
  'The coffee is good. I love that coffee.',
  'The coffee is good. I love that beverage.'
];

/*
The coffee is good. I love that coffee.: 0.5
The coffee is good. I love that beverage.: 0
*/
alts.forEach((alt) => {
  const score = synOptimizer.scoreAlternative('en_US', alt, null, null, null, null);
  console.log(`${alt}: ${score}`);
});

The main function is scoreAlternative. It takes a string and returns its score. Arguments are:

  • lang (string, mandatory): the language.
    • fully supported languages are fr_FR, en_US, de_DE, it_IT and es_ES
    • with any other language (for instance Dutch, nl_NL), stemming is disabled and stopwords are not removed
  • alternative (string, mandatory): the string to score
  • stopWordsToAdd (string[], optional): list of stopwords to add to the standard stopword list
  • stopWordsToRemove (string[], optional): list of stopwords to remove from the standard stopword list
  • stopWordsOverride (string[], optional): replaces the standard stopword list
  • identicals (string[][], optional): list of words that should be considered identical, for instance [['phone', 'cellphone', 'smartphone']]

You can also use the getBest function. Most arguments are exactly the same, but instead of alternative it takes alternatives (string[]). The output number is not a score but the index of the best alternative.

The tokenizer is wink-tokenizer. It works with many languages (English, French, German, Hindi, Sanskrit, Marathi, etc.) but not with Asian languages; therefore the module will not work properly with Japanese, Chinese, etc.

Adding new languages (for developers / maintainers)

  • check for the existence of a stopwords module: stopwords-*
  • check for a stemmer in the snowball-stemmer collection (or plug in another stemmer)
  • plug everything in and add tests
  • find a proper tokenizer if wink-tokenizer does not work

Misc

The build writes the stopwords as an AsciiDoc file in the rosaenlg-doc module.

Dependencies and licences

  • wink-tokenizer to tokenize sentences in multiple languages (MIT).
  • stopwords-en/de/fr/it/es for the standard stopword lists per language (MIT).
  • snowball-stemmer to stem words per language (MIT).