wordmap-usfm

v0.5.4

Published

15 days ago

A utility for injecting alignment data into USFM files

Downloads

0High
0Medium
0Low

wordMAP-usfm

This library provides utilities for injecting alignment data from WordMAP into USFM3 files.

Usage

From the command line:

npm i wordmap-usfm -g
wordmap-usfm --help

As a module:

npm i wordmap-usfm

import {align} from 'wordmap-usfm';
...
const alignedUSFM = align(alignmentData, usfmData);

Input

Alignment JSON Data Structure

The intent is to have a single file as input that allows full round trip conversion to USFM 3 without any loss. One of the resources is intended to be the source/primary text, the second resource is the one that is the target language of the USFM file. The target language is what would typically be shown in the USFM file without any alignment data.

In theory the data structure is extensible to allow for other metadata per word or token and potentially more than two languages although that may not lend itself well to USFM.

Top level attributes

The top level attributes of the data structure are conformsTo:String, metadata:Object, and segments:Array.

{
  "conformsTo": "alignment-0.1",
  "metadata": {...},
  "segments": [...]
}

conformsTo

The conformsTo attribute specifies which version of the spec was used for the generation of the alignment file. Over time we plan to make changes to the alignment specification and are using Semantic Versioning starting with version 0.1 for this release.

"conformsTo": "alignment-0.1",

metadata

The top level attribute metadata stores information about the content stored in the file.

"metadata": {
    "modified": "1524293704",
    "resources": {...}
  },

metadata.modified

The metadata.modified attribute is a unix timestamp of the last modification of the file. Any time an edit is made to the content of the file this timestamp should be updated so that users of the file can keep up with the latest version of the data.

"modified": "1524293704"

metadata.resources

The metadata.resources attribute is an object whose keys are the names of resources and the values are the metadata describing the resources. These keys/names are used later in the segments section of the file to specify which resource the respective content belongs to. The expected attributes for each resource are languageCode:String, name:String, and version:String. This information will be used in generating the headers of the USFM3 file output.

NOTES:

One text will be specified as the language of the USFM file and the other will be aligned to it as USFM3 milestones.
One of the resource's text of each segment will be used as the raw USFM string for the verse for USFM3 generation.
The tokens in the corresponding segment of the other resource will be aligned to the tokens found in the raw string of the first.

    "resources": {
      "r0": {
        "languageCode": "el-x-koine",
        "name": "UGNT",
        "version": "0.1"
      },
      "r1": {
        "languageCode": "en",
        "name": "ULT",
        "version": "9"
      }

segments

The segments attribute is an array of individual segments of the resources grouped together at the aligned segment level.

  "segments": [
    {
      "resources": {...},
      "alignments": [...]
    },
    {
      "resources": {...},
      "alignments": [...]
    },
    ...
  ]

segments[n].resources

The segments[n].resources attribute is an object of which the keys correspond to the keys in the metadata.resources. In the example below, r0 and r1 are the resource keys.

The values of the resource keys are an object whose attributes are text:String, tokens:Array, and metadata:Object.

The text attribute holds the raw string of the segment.
The tokens attribute is an array of individual tokens as strings.
- Later spec revisions will include tokens represented as data objects.
The metadata attribute is an object that holds data about the segment.
- Currently only requires a contextId attribute that identifies where the segment belongs, such as the verse identifier.
- Having metadata.contextId at each resource allows for alignments to exist between different versification systems.

      "resources": {
        "r0": {
          "text": "Βίβλος γενέσεως Ἰησοῦ Χριστοῦ υἱοῦ Δαυὶδ υἱοῦ Ἀβραάμ.",
          "tokens": ["Βίβλος", "γενέσεως", "Ἰησοῦ", "Χριστοῦ", "υἱοῦ", "Δαυὶδ", "υἱοῦ", "Ἀβραάμ"],
          "metadata": {
            "contextId": "MAT001001"
          }
        },
        "r1": {
          "text": "The book of the genealogy of Jesus Christ, son of David, son of Abraham:",
          "tokens": ["The", "book", "of", "the", "genealogy", "of", "Jesus", "Christ", "son", "of", "David", "son", "of", "Abraham"],
          "metadata": {
            "contextId": "MAT001001"
          }
        },
      }

segments[n].alignments

The segments[n].alignments attribute is an array of individual alignments between tokens of the resources at the same level.

Each alignment is an object with the attributes of score:Float, verified:Boolean, and the [key]:Object that correspond with the resources.

The score attribute holds the confidence of this specific alignment generated by the alignment tool.
The verified attribute holds the boolean of whether or not the alignment was generated or approved by a human.
The remaining attributes hold an array of indexes that correspond to their string counterparts in the segments[n].resources[key].tokens in the respective key of the array.

The example below shows an alignment of the above tokens. Note that alignments to null can be represented as not being present at all. Optionally they can be represented as indexes on one side and an empty array on the other.

  "alignments": [
    {
      "score": 0.516905944153279,
      "r0": [0],
      "r1": [0, 1],
      "verified": false
    },
    {
      "score": 0.5363691430931895,
      "r0": [1],
      "r1": [3, 4],
      "verified": false
    },
    {
      "score": 0.5372550334365089,
      "r0": [2],
      "r1": [6],
      "verified": false
    },
    {
      "score": 0.4762634342491979,
      "r0": [3],
      "r1": [7],
      "verified": false
    },
    {
      "score": 0.46762244230161903,
      "r0": [4],
      "r1": [8, 9],
      "verified": false
    },
    {
      "score": 0.5253404588058129,
      "r0": [5],
      "r1": [10],
      "verified": false
    },
    {
      "r0": [6],
      "r1": [11, 12],
      "verified": true
    },
    {
      "r0": [7],
      "r1": [13],
      "verified": true
    }
  ]

The example below is fabricated to show many to many, many to one, one to many, one to one, none to many, one to none, one to many verified, and many to one verified in a respective order. The non-verified are machine aligned and verified are human aligned or confirmed.

  "alignments": [
    {
      "score": 0.516905944153279,
      "r0": [0, 1],
      "r1": [1, 4],
      "verified": false
    },
    {
      "score": 0.5363691430931895,
      "r0": [1, 2],
      "r1": [4],
      "verified": false
    },
    {
      "score": 0.5372550334365089,
      "r0": [2],
      "r1": [6, 7],
      "verified": false
    },
    {
      "score": 0.4762634342491979,
      "r0": [3],
      "r1": [7],
      "verified": false
    },
    {
      "score": 0.46762244230161903,
      "r0": [],
      "r1": [8, 9],
      "verified": false
    },
    {
      "score": 0.5253404588058129,
      "r0": [5],
      "r1": [],
      "verified": true
    },
    {
      "r0": [6],
      "r1": [10, 12],
      "verified": true
    },
    {
      "r0": [7, 9],
      "r1": [13],
      "verified": true
    }
  ]

Roadmap

Support extracting alignment data from USFM3. This will be useful when importing usfm into tC.
Support alignments that span verses
Support alignments that span chapters