twl-generator

v1.4.13

Published

11 days ago

Generate term-to-article lists from unfoldingWord en_tw archive for Bible books. Works in both Node.js (CLI) and React.js (browser) environments.

Downloads

465

TWL Generator

A Node.js library and CLI tool for generating Translation Word Links (TWL) TSV files from Door43 USFM data and Translation Words (TW) metadata. This tool intelligently matches biblical terms with their corresponding Translation Words articles using Strong's numbers, morphological analysis, and contextual matching.

Installation

Global CLI Installation

npm install -g twl-generator

Library Installation

npm install twl-generator

Usage

Command Line Interface

Generate TWL for a specific book:

twl-generator --book rut
# Creates: rut.twl.tsv and rut.no-match.twl.tsv

Generate TWL for all books:

twl-generator --all --out-dir ./output
# Creates TWL files for all 66 biblical books

Specify custom output location:

twl-generator --book mat --out matthew.twl.tsv

Enable advanced verb conjugation matching:

twl-generator --book jhn --use-compromise
# Uses compromise.js for better verb form detection

CLI Options

--book <code>: Book code (e.g., gen, exo, mat, mrk, jhn, etc.)
--all: Generate TWL files for all biblical books
--out <file>: Specify output file path
--out-dir <dir>: Output directory (for --all option)
--use-compromise: Enable advanced morphological analysis using compromise.js

Library Usage

Basic Usage

import { generateTwlByBook } from 'twl-generator';

// Generate TWL for Ruth
const result = await generateTwlByBook('rut');
console.log(result.matchedTsv);    // Main TWL output
console.log(result.noMatchTsv);    // Unmatched entries for analysis

With Advanced Options

import { generateTwlByBook } from 'twl-generator';

// Use advanced morphological analysis
const result = await generateTwlByBook('jhn', { 
  useCompromise: true  // Enable compromise.js for better verb matching
});

// Save to files
import fs from 'fs/promises';
await fs.writeFile('john.twl.tsv', result.matchedTsv);
await fs.writeFile('john.no-match.tsv', result.noMatchTsv);

Integration Example

import { generateTwlByBook } from 'twl-generator';

async function processBibleBook(bookCode) {
  try {
    const { matchedTsv, noMatchTsv } = await generateTwlByBook(bookCode);
    
    // Process the TSV data
    const lines = matchedTsv.split('\n');
    const header = lines[0];
    const rows = lines.slice(1).filter(Boolean);
    
    console.log(`Generated ${rows.length} TWL entries for ${bookCode.toUpperCase()}`);
    
    // Further processing...
    return { success: true, entries: rows.length };
  } catch (error) {
    console.error(`Failed to process ${bookCode}:`, error);
    return { success: false, error: error.message };
  }
}

How It Works

The TWL Generator uses a sophisticated multi-stage process to create Translation Word Links:

1. Data Sources

Original Language USFM: Hebrew (hbo_uhb) and Greek (el-x-koine_ugnt) texts from Door43
English Bible: unfoldingWord Literal Text (en_ult) for context matching
Translation Words: Local tw_strongs_list.json containing Strong's mappings and term definitions
Strong's Numbers: Links between original language words and semantic concepts

2. Processing Pipeline

Stage 1: Extract Strong's Data

Parses USFM \w tags to extract Strong's numbers from original language texts
Builds initial TSV with Reference, Strong's ID, and surface words
Handles multi-word phrases that share Strong's number sequences

Stage 2: Generate English Context

Uses tsv-quote-converters to find corresponding English text (GLQuote) in ULT
Adds GLQuote and GLOccurrence columns for contextual matching
Converts to OrigWords/Occurrence format for processing

Stage 3: Intelligent Article Selection

For each Strong's number and its English context, the system:

Prioritizes candidate articles based on:
- Articles whose slug appears in the GLQuote text
- Article type preference: kt/ (key terms) → names/ → other/
- Alphabetical sorting within each category
Performs 4-stage matching (best match wins):
- Stage 1: Case-sensitive word boundary matching
- Stage 2: Case-insensitive word boundary matching
- Stage 3: Case-sensitive substring matching
- Stage 4: Case-insensitive morphological variants
Morphological analysis includes:
- Pluralization (dog → dogs, man → men)
- Verb conjugation (-ing, -ed forms)
- Irregular verb forms (go → went, see → saw)
- Optional advanced analysis with compromise.js

Stage 4: Quality Assurance

Generates disambiguation info when multiple articles could match
Marks entries as "Variant of" when morphological variants are used
Creates separate files for matched and unmatched entries
Provides detailed statistics and sample unmatched entries

3. Output Format

The generated TSV contains these columns:

| Column | Description | |--------|-------------| | Reference | Chapter:verse (e.g., "1:1") | | ID | Random 4-character ID starting with letter | | Tags | "keyterm", "name", or empty based on article type | | OrigWords | The matched word(s) from the text | | Occurrence | Which occurrence of this word in the verse | | TWLink | Link to Translation Words article (rc://*/tw/dict/bible/...) | | GLQuote | English text context from ULT |
| GLOccurrence | Occurrence number in English context | | Strongs | Original Strong's number | | Variant of | Original term if morphological variant was used | | Disambiguation | List of other possible articles |

4. Matching Examples

Reference   OrigWords   GLQuote              TWLink                      Variant of
1:17        grace       grace and truth      rc://*/tw/dict/bible/kt/grace
1:17        gracious    gracious God         rc://*/tw/dict/bible/kt/grace   grace
2:3         men         wise men came        rc://*/tw/dict/bible/other/man
2:3         wisdom      with great wisdom    rc://*/tw/dict/bible/kt/wise    wise

Development

Prerequisites

Node.js 18+ (uses native fetch)
Git access to Door43 repositories

Setup

git clone https://github.com/unfoldingWord/node-twl-generator.git
cd node-twl-generator
npm install

Testing

# Test single book generation
npm test

# Test specific book
npm run cli -- --book rut

# Test with advanced morphology
npm run cli -- --book jhn --use-compromise

Local Development

# Run CLI locally
node src/cli.js --book gen --out test-output.tsv

# Test library integration
node -e "import('./src/index.js').then(m => m.generateTwlByBook('rut').then(console.log))"

Project Structure

src/
├── cli.js                    # Command line interface
├── index.js                  # Main library exports
├── common/
│   └── books.js             # Bible book metadata
└── utils/
    ├── twl-matcher.js       # Term matching algorithms (legacy)
    ├── zipProcessor.js      # TW archive processing (legacy)
    └── usfm-alignment-remover.js  # USFM parsing (legacy)
tw_strongs_list.json         # Translation Words database

Data Files

`tw_strongs_list.json`

This file contains the core mapping between Strong's numbers and Translation Words articles:

{
  "kt/god": {
    "article": {
      "terms": ["God", "god", "deity", "divine"]
    },
    "strongs": [
      ["H430"],     // Single Strong's number
      ["H410"],
      ["G2316", "G2318"]  // Multiple Strong's for compound concepts
    ]
  }
}

Contributing

We welcome contributions! Here's how you can help:

Reporting Issues

Missing matches: If legitimate biblical terms aren't being matched
False positives: If non-terms are being incorrectly matched
Performance issues: Slow processing or memory problems
Data quality: Incorrect Strong's mappings or term definitions

Enhancement Ideas

Better morphological analysis: Improve verb conjugation and irregular forms
Multi-language support: Extend beyond English GLQuotes
Contextual disambiguation: Use surrounding words for better article selection
Performance optimization: Faster processing for large corpora

Development Workflow

Fork the repository
Create a feature branch: git checkout -b feature-name
Make your changes with tests
Run the test suite: npm test
Submit a pull request with detailed description

Testing Your Changes

# Test various scenarios
npm run cli -- --book psa --use-compromise  # Large book with advanced features
npm run cli -- --book phm                   # Short book for quick testing
npm run cli -- --book rev                   # Symbolic language testing

Browser Compatibility

While primarily designed for Node.js, core functionality works in modern browsers:

// React/Browser usage example
import { generateTwlByBook } from 'twl-generator';

const MyComponent = () => {
  const [tsvData, setTsvData] = useState(null);
  
  const generateTWL = async () => {
    try {
      const result = await generateTwlByBook('mat');
      setTsvData(result.matchedTsv);
    } catch (error) {
      console.error('TWL generation failed:', error);
    }
  };
  
  return (
    <div>
      <button onClick={generateTWL}>Generate TWL for Matthew</button>
      {tsvData && <pre>{tsvData}</pre>}
    </div>
  );
};

Performance

Typical processing times:

Short books (Philemon, 2-3 John): < 5 seconds
Medium books (Ruth, Ephesians): 5-15 seconds
Large books (Psalms, Matthew): 30-60 seconds
All books: 15-30 minutes depending on network speed

Memory usage scales with book size, typically 50-200MB peak.

License

MIT License - see LICENSE file for details.

Support

Issues: https://github.com/unfoldingWord/node-twl-generator/issues
Discussions: https://github.com/unfoldingWord/node-twl-generator/discussions
Documentation: https://github.com/unfoldingWord/node-twl-generator/wiki

Related Projects

tsv-quote-converters - GLQuote generation
compromise - Advanced morphological analysis
Door43 Content - Source biblical texts and resources