uw-wordmapbooster
v1.0.3
Published
Adds an AI boost to wordmap. Wordmap is a tool for helping word alignments between a source and target language.
wordmapbooster
WordMapBooster is an extension of WordMap. It adds a JavaScript tree-based model for establishing the relevance of alignment statistics.
boostwordmapwithfs is a Node project which allows WordMapBooster to be run against USFM content in a Node environment, free of the memory restrictions of a browser.
This tool is demoed with Alignment Transferer. Netlify demo.
Installation
npm
npm install wordmapbooster
yarn
yarn add wordmapbooster
Data Format
Interactions with the WordMapBooster API use the data classes established by WordMap.
Terms
Data is added as corpus data and as alignment data.
Corpus data is bulk data which includes aligned and unaligned data in verse pairs between source and target languages. This gives statistical context for the Alignment data to compute against.
Alignment data is n-gram to n-gram pairs between source and target languages for established alignments. This provides the data for the model to train on.
Classes
The following classes are needed when interacting with the WordMapBooster API.
Corpus data is composed as an array of source Tokens and target Tokens per verse.
Alignment data is an array of Alignment objects.
Alignment object: An Alignment object is constructed from a source and a target Ngram. The point of an Ngram is that a word in the source language can be related to multiple words in the target language, or vice versa.
Ngram: An Ngram is constructed from an array of one or more Tokens.
Token: A Token is constructed from a dict, with the most important information being:
  text: The word that the token represents.
  morph: The morphology code for the specific token.
The rest of the information is added by the library when the Token is used.
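The relationship between these classes can be sketched with simplified stand-ins. The interfaces below (SketchToken, SketchNgram, SketchAlignment) and the helper functions are hypothetical plain-object shapes for illustration only; real code should construct the Token, Ngram, and Alignment classes provided by WordMap. The morph value is a placeholder, not a real morphology code.

```typescript
// Hypothetical stand-ins for illustration; these are NOT the WordMap classes.
interface SketchToken {
  text: string;   // the word that the token represents
  morph?: string; // the morphology code for this token, if known
}

// An n-gram groups one or more tokens.
interface SketchNgram {
  tokens: SketchToken[];
}

// An alignment pairs a source n-gram with a target n-gram.
interface SketchAlignment {
  source: SketchNgram;
  target: SketchNgram;
}

function makeSketchNgram(tokens: SketchToken[]): SketchNgram {
  if (tokens.length === 0) {
    throw new Error("An n-gram needs at least one token");
  }
  return { tokens: tokens };
}

// One source word aligned to two target words.
// "N" is a placeholder morphology code (assumption).
const sketchSource = makeSketchNgram([{ text: "λόγος", morph: "N" }]);
const sketchTarget = makeSketchNgram([{ text: "the" }, { text: "word" }]);
const sketchAlignment: SketchAlignment = { source: sketchSource, target: sketchTarget };

// Render an alignment as "source words -> target words".
function describeAlignment(a: SketchAlignment): string {
  const joinTokens = (n: SketchNgram) => n.tokens.map((t) => t.text).join(" ");
  return joinTokens(a.source) + " -> " + joinTokens(a.target);
}

console.log(describeAlignment(sketchAlignment));
```

The one-to-two pairing shows why alignments are expressed between n-grams rather than between single tokens.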
API
Create an instance of one of the following models:
JLBoostWordMap: JLBoostWordMap does not take morphology into consideration when making wordmap suggestions.
MorphJLBoostWordMap: MorphJLBoostWordMap does take morphology into consideration when making wordmap suggestions.
The object takes the following arguments:
train_steps: The number of iterations that the model trains before being able to make predictions. The default is 1000.
learning_rate: How fast the model converges on the information. The default is 0.7.
tree_depth: The number of splits in each tree. One tree is generated on each successful training step. The default is 5.
The WordMap arguments are also supported. The following were used in testing:
targetNgramLength: This limits the permutation search space which the WordMap Engine grinds through. The value used in testing is 5.
warnings: The value used in testing is false.
forceOccurrenceOrder: Forces suggestions to preserve the order of word occurrences. The value used in testing is false.
targetNgramLength: 5, warnings: false, forceOccurrenceOrder: false
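The arguments above can be gathered into a single options object. The sketch below is one way to keep the documented defaults and the testing values in one place; the BoostOptions interface and the buildBoostOptions helper are hypothetical names, and the assumption that the model constructor accepts such an object should be checked against the test code.

```typescript
// Hypothetical option shape assembled from the argument names in this README.
interface BoostOptions {
  train_steps: number;
  learning_rate: number;
  tree_depth: number;
  targetNgramLength?: number;
  warnings?: boolean;
  forceOccurrenceOrder?: boolean;
}

// Defaults as documented above.
const documentedDefaults = { train_steps: 1000, learning_rate: 0.7, tree_depth: 5 };

// Merge caller overrides onto the documented defaults.
function buildBoostOptions(overrides: Partial<BoostOptions>): BoostOptions {
  return { ...documentedDefaults, ...overrides };
}

// The WordMap values which were used in testing.
const testingOptions = buildBoostOptions({
  targetNgramLength: 5,
  warnings: false,
  forceOccurrenceOrder: false,
});

console.log(testingOptions);
// Assumption: the result would then be handed to a model constructor,
// for example JLBoostWordMap or MorphJLBoostWordMap.
```

Spelling out the defaults is optional, since the library applies them itself; it is done here only to make the effective settings visible.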
Pass in the corpus data. This is done with appendKeyedCorpusTokens, a single function call which adds all the source and target text. The arguments are:
corpusSourceText: A dictionary from a string verse identifier to the array of source Tokens. The format of the string identifier can be anything, as long as it matches the keys of the second argument. In the case of the NT the source text is Greek. In the case of the OT, it is mostly Hebrew.
corpusTargetText: A dictionary keyed by the same string verse identifiers to the array of target Tokens. In the case of Greek to English, this is the English.
The corpus data is passed in for both the already aligned and not yet aligned data.
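Because both dictionaries must share the same verse identifiers, it can help to check the keys before handing the data to the model. The following is a minimal sketch under stated assumptions: the token shape, the whitespace tokenizer, the "TIT 1:1" identifier format, the abbreviated sample verses, and the unmatchedVerseIds helper are all illustrative and not part of the library. Real code would build WordMap Tokens instead.

```typescript
// Hypothetical simplified token shape; real code would use WordMap Tokens.
type SketchVerseTokens = { text: string }[];
type SketchKeyedCorpus = { [verseId: string]: SketchVerseTokens };

// Naive whitespace tokenizer, for illustration only.
function tokenizeVerse(verse: string): SketchVerseTokens {
  return verse
    .split(/\s+/)
    .filter((w) => w.length > 0)
    .map((w) => ({ text: w }));
}

// Abbreviated sample verses; any identifier format works if both sides agree.
const corpusSourceText: SketchKeyedCorpus = {
  "TIT 1:1": tokenizeVerse("Παῦλος δοῦλος Θεοῦ"),
  "TIT 1:4": tokenizeVerse("Τίτῳ γνησίῳ τέκνῳ"),
};
const corpusTargetText: SketchKeyedCorpus = {
  "TIT 1:1": tokenizeVerse("Paul a servant of God"),
  "TIT 1:4": tokenizeVerse("To Titus a true son"),
};

// Return the verse identifiers present in one dictionary but not the other.
function unmatchedVerseIds(source: SketchKeyedCorpus, target: SketchKeyedCorpus): string[] {
  const sourceOnly = Object.keys(source).filter((id) => !(id in target));
  const targetOnly = Object.keys(target).filter((id) => !(id in source));
  return sourceOnly.concat(targetOnly);
}

console.log(unmatchedVerseIds(corpusSourceText, corpusTargetText));
```

An empty result means every verse has both a source and a target entry, which is the shape appendKeyedCorpusTokens is described as expecting.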
Add in the alignment data. This is done with add_alignments_2 or add_alignments_4. add_alignments_4 is more computationally intensive but does better with less data, such as only Titus; the benefits wash out with a larger book such as Matthew. The arguments are:
sourceText: The same data which is passed as the first argument to appendKeyedCorpusTokens.
targetText: The same data which is passed as the second argument to appendKeyedCorpusTokens.
alignments: A dictionary keyed by the same string verse identifiers to arrays of Alignments. These are the manually completed alignments which are used as training data for the model to make predictions with.
This function is async and should be called with await, or the returned promise should otherwise be handled. The promise will resolve once the model has completed training and predictions can be performed.
Optionally the model can be serialized. This makes it possible to train the model in a browser worker thread which is what is done in the Alignment Transferer demo of this module. This also makes it possible to save and restore the model from disk or to train the model in the cloud and then ship the model to the browser just for inference.
save: Converts the trained model into a JSON-able set of arrays and dictionaries.
AbstractWordMapWrapper.load: Converts the saved set of arrays and dictionaries back into the model. The saved data records which model type was saved, so the right model is reproduced.
AbstractWordMapWrapper.async_load: The same as load, but an async version which uses setTimeout to yield the CPU to keep from freezing everything during a long load.
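The idea of saved data that records its own model type can be illustrated with a tagged JSON round trip. This is a minimal sketch of the general technique, not the library's implementation: the classes PlainSketchModel and MorphSketchModel, the type field, the weights array, and loadSketchModel are all hypothetical names.

```typescript
// Hypothetical sketch of type-tagged serialization; NOT the library's code.
interface SavedSketch {
  type: string;      // records which model class produced the data
  weights: number[]; // stand-in for the trained state
}

interface SketchModel {
  weights: number[];
  save(): SavedSketch;
}

class PlainSketchModel implements SketchModel {
  weights: number[];
  constructor(weights: number[]) {
    this.weights = weights;
  }
  save(): SavedSketch {
    return { type: "plain", weights: this.weights };
  }
}

class MorphSketchModel implements SketchModel {
  weights: number[];
  constructor(weights: number[]) {
    this.weights = weights;
  }
  save(): SavedSketch {
    return { type: "morph", weights: this.weights };
  }
}

// Rebuild the right class by inspecting the tag stored in the saved data.
function loadSketchModel(data: SavedSketch): SketchModel {
  switch (data.type) {
    case "plain":
      return new PlainSketchModel(data.weights);
    case "morph":
      return new MorphSketchModel(data.weights);
    default:
      throw new Error("Unknown model type: " + data.type);
  }
}

// Round trip through a JSON string, as when saving to disk or
// posting the model from a worker thread.
const savedJson = JSON.stringify(new MorphSketchModel([0.25, 0.5]).save());
const restoredModel = loadSketchModel(JSON.parse(savedJson));
console.log(restoredModel instanceof MorphSketchModel, restoredModel.weights);
```

Because the output of save is plain arrays and dictionaries, it survives JSON.stringify, which is what makes the worker-thread and save-to-disk scenarios described above possible.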
Predictions are run on the model to get suggestions. This is done by running the predict method. The first argument is an array of Tokens for the sourceSentence. The second argument is an array of Tokens for the targetSentence. Technically you can pass in strings instead, but then you can't pass in morphological information on the Tokens. The remaining arguments are:
maxSuggestions: The result is an array of suggested alignments; this sets how many you want.
manuallyAligned: An array of Alignments which should be respected when constructing the suggestions. This way, if the aligner is half done aligning the verse, the tool will not make suggestions which are incompatible with the alignments already completed.
Once the suggestions have been returned, you can access the Alignments by calling getPredictions to get the Predictions and then calling alignment to get the Alignment out of each Prediction. The confidence of each individual Alignment is accessed with getScore("confidence") on the Prediction. Please follow the object documentation or reference the following examples to see how this is further broken down into ngrams and tokens.
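Walking a suggestion can be sketched as follows. The interfaces are hypothetical minimal stand-ins which mirror only the accessors named above, and the stub data lets the sketch run without the library. Two assumptions are made: alignment is modeled as a property (check the WordMap object documentation for whether it is a property or a method), and the 0.5 threshold is an arbitrary illustrative value.

```typescript
// Hypothetical minimal interfaces mirroring the accessors named above.
interface SketchPrediction {
  // Simplified stand-in for an Alignment (assumed to be a property here).
  alignment: { source: string; target: string };
  getScore(key: string): number;
}
interface SketchSuggestion {
  getPredictions(): SketchPrediction[];
}

// Build a stub prediction so the sketch runs without the library.
function stubPrediction(source: string, target: string, confidence: number): SketchPrediction {
  return {
    alignment: { source: source, target: target },
    getScore: (key: string) => (key === "confidence" ? confidence : 0),
  };
}

// Keep only the alignments whose confidence reaches the threshold.
function confidentAlignments(suggestion: SketchSuggestion, threshold: number) {
  return suggestion
    .getPredictions()
    .filter((p) => p.getScore("confidence") >= threshold)
    .map((p) => p.alignment);
}

// Stub suggestion with one strong and one weak prediction.
const stubSuggestion: SketchSuggestion = {
  getPredictions: () => [
    stubPrediction("λόγος", "word", 0.9),
    stubPrediction("καὶ", "and", 0.4),
  ],
};

console.log(confidentAlignments(stubSuggestion, 0.5));
```

Filtering by confidence is only one possible use; an application could just as well show every prediction and use the score for ordering or highlighting.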
Examples
Examples of using this module can be found in three places.
The test code: This module is tested with jest tests. Please see the JLBoostWordMap test code for examples of constructing the object, feeding it data, and training it.
Alignment Transferer: The React demo application developed along with this code has a background thread which trains this model. The background thread's implementation can be seen in that project.
boostwordmapwithfs: There is also a Node project which is used to test this module. This project is not limited by the memory constraints found in a web browser. It does not have an interface built for it, but expects the source to be modified for individual tests. run_alignment_tests.ts is the main entry point of this project.
Publishing
npm
npm i --legacy-peer-deps && npm run build && npm publish