shlesha

v0.5.4

Published

3 months ago

High-performance extensible transliteration library with hub-and-spoke architecture


skmnktl

Shlesha - Schema-Driven Transliteration Library

A transliteration library for Sanskrit and Indic scripts built on a schema-driven architecture. Converters are generated and optimized at compile time, and additional schemas can be loaded at runtime.

Quick Start

Setup command:

./scripts/quick-start.sh

This sets up the Rust environment, Python bindings, and WASM support, then runs all tests.

For detailed setup instructions, see DEVELOPER_SETUP.md.

Documentation: See DOCUMENTATION_INDEX.md for guides and references.

Architecture Features

Schema-generated converters with compile-time optimization
Zero runtime overhead from code generation
Token-based conversion system for memory efficiency

Schema-Based Architecture

Compile-Time Code Generation

Converters are generated at compile-time from declarative schemas:

# schemas/slp1.yaml - Generates optimized SLP1 converter
metadata:
  name: "slp1"
  script_type: "roman"
  description: "Sanskrit Library Phonetic Basic"

target: "iso15919"

mappings:
  vowels:
    "A": "ā"
    "I": "ī" 
    "U": "ū"
    # ... more mappings

# schemas/bengali.yaml - Generates optimized Bengali converter  
metadata:
  name: "bengali"
  script_type: "brahmic"
  description: "Bengali/Bangla script"

mappings:
  vowels:
    "অ": "अ"    # Bengali A → Devanagari A
    "আ": "आ"    # Bengali AA → Devanagari AA
    # ... more mappings

Build-Time Optimization

The build system automatically generates highly optimized converters:

# Build output showing schema processing
warning: Processing YAML schemas...
warning: Generating optimized converters with Handlebars templates...
warning: Created 18 schema-generated converters with O(1) lookups
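As a rough illustration of what a schema-generated lookup could look like, the sketch below hand-writes the kind of function a build step might emit for the three SLP1 vowel mappings in the schema excerpt above. The names `slp1_vowel_to_iso` and `convert` are hypothetical and are not taken from Shlesha's generated code, which is produced from Handlebars templates and may be structured differently.

```rust
/// Hypothetical stand-in for generated code: maps one SLP1 vowel to ISO-15919.
/// Only the three mappings shown in the schema excerpt are included.
fn slp1_vowel_to_iso(c: char) -> Option<&'static str> {
    match c {
        'A' => Some("ā"),
        'I' => Some("ī"),
        'U' => Some("ū"),
        _ => None,
    }
}

/// Converts a string character by character, passing unmapped characters through.
fn convert(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match slp1_vowel_to_iso(c) {
            Some(mapped) => out.push_str(mapped),
            None => out.push(c),
        }
    }
    out
}

fn main() {
    println!("{}", convert("rAma")); // prints "rāma"
}
```

A `match` on `char` compiles to a branch or jump table, so each lookup is constant time without building a map at startup.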

Hub-and-Spoke Architecture

Multi-Hub Design

Devanagari Hub: Central format for Indic scripts (तमिल → देवनागरी → गुजराती)
ISO-15919 Hub: Central format for romanization schemes (ITRANS → ISO → IAST)
Cross-Hub Conversion: Seamless Indic ↔ Roman via both hubs
Direct Conversion: Bypass hubs when possible for maximum performance

Routing

The system determines the conversion path:

// Direct passthrough - zero conversion cost
transliterator.transliterate("धर्म", "devanagari", "devanagari")?; // instant

// Single hub - one conversion 
transliterator.transliterate("धर्म", "devanagari", "iso")?; // deva→iso

// Cross-hub - optimized path
transliterator.transliterate("dharma", "itrans", "bengali")?; // itrans→iso→deva→bengali
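The path selection shown above can be sketched as a small planning function. This is a minimal sketch, assuming each script belongs to exactly one family and each family has one hub. `Family`, `family_of`, `hub_of`, and `plan_route` are hypothetical names, not part of Shlesha's API, and the classifier covers only a few script names from this README.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Family {
    Indic,
    Roman,
}

/// Hypothetical classifier covering only a few script names.
fn family_of(script: &str) -> Option<Family> {
    match script {
        "devanagari" | "bengali" | "gujarati" | "tamil" => Some(Family::Indic),
        "iso" | "itrans" | "iast" | "slp1" => Some(Family::Roman),
        _ => None,
    }
}

/// Each family converts through its own hub script.
fn hub_of(family: Family) -> &'static str {
    match family {
        Family::Indic => "devanagari",
        Family::Roman => "iso",
    }
}

/// Returns the scripts visited, starting at `from` and ending at `to`.
fn plan_route(from: &str, to: &str) -> Option<Vec<String>> {
    let src = family_of(from)?;
    let dst = family_of(to)?;
    let mut path = vec![from.to_string()];
    if from == to {
        return Some(path); // passthrough: nothing to convert
    }
    // Visit the source hub, the target hub, then the target, skipping repeats.
    for stop in [hub_of(src), hub_of(dst), to] {
        if path.last().map(String::as_str) != Some(stop) {
            path.push(stop.to_string());
        }
    }
    Some(path)
}

fn main() {
    for (from, to) in [("devanagari", "devanagari"), ("devanagari", "iso"), ("itrans", "bengali")] {
        match plan_route(from, to) {
            Some(path) => println!("{}", path.join(" -> ")),
            None => println!("unknown script in {} -> {}", from, to),
        }
    }
}
```

Skipping repeated stops is what collapses a same-family conversion to a single hub hop and a hub-to-hub conversion to one step.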

Supported Scripts

Indic Scripts (Schema-Generated)

Devanagari (devanagari, deva) - Sanskrit, Hindi, Marathi
Bengali (bengali, bn) - Bengali/Bangla script
Tamil (tamil, ta) - Tamil script
Telugu (telugu, te) - Telugu script
Gujarati (gujarati, gu) - Gujarati script
Kannada (kannada, kn) - Kannada script
Malayalam (malayalam, ml) - Malayalam script
Odia (odia, od) - Odia/Oriya script
Gurmukhi (gurmukhi, pa) - Punjabi script
Sinhala (sinhala, si) - Sinhala script
Sharada (sharada, shrd) - Historical script of Kashmir, crucial for Vedic manuscripts
Tibetan (tibetan, tibt, bo) - Used in the transmission of Buddhist Sanskrit texts
Thai (thai, th) - Used for Pali and Sanskrit Buddhist texts

Romanization Schemes (Schema-Generated)

ISO-15919 (iso15919, iso) - International standard
ITRANS (itrans) - Indian languages TRANSliteration
SLP1 (slp1) - Sanskrit Library Phonetic Basic
Harvard-Kyoto (harvard_kyoto, hk) - ASCII-based scheme
Velthuis (velthuis) - TeX-compatible scheme
WX (wx) - ASCII-based notation

Hand-Coded Scripts

IAST (iast) - International Alphabet of Sanskrit Transliteration
Kolkata (kolkata) - Regional romanization scheme
Grantha (grantha) - Classical Sanskrit script
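The names in parentheses above are the identifiers passed to `transliterate`, and several scripts accept a short alias. The sketch below shows one way such aliases could be resolved to a canonical name. `canonical_name` is a hypothetical helper, not a Shlesha function, and it covers only a subset of the aliases listed.

```rust
/// Hypothetical helper: resolves a script identifier or alias to its long name.
/// Covers only some of the scripts listed in this README.
fn canonical_name(alias: &str) -> Option<&'static str> {
    let name = match alias {
        "devanagari" | "deva" => "devanagari",
        "bengali" | "bn" => "bengali",
        "tamil" | "ta" => "tamil",
        "telugu" | "te" => "telugu",
        "gujarati" | "gu" => "gujarati",
        "iso15919" | "iso" => "iso15919",
        "harvard_kyoto" | "hk" => "harvard_kyoto",
        _ => return None,
    };
    Some(name)
}

fn main() {
    for alias in ["deva", "hk", "klingon"] {
        match canonical_name(alias) {
            Some(name) => println!("{} -> {}", alias, name),
            None => println!("{} is not a known script", alias),
        }
    }
}
```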

Usage Examples

Rust Library

use shlesha::Shlesha;

let transliterator = Shlesha::new();

// High-performance cross-script conversion
let result = transliterator.transliterate("धर्म", "devanagari", "gujarati")?;
println!("{}", result); // "ધર્મ"

// Roman to Indic conversion (the input uses IAST diacritics, so the source is "iast")
let result = transliterator.transliterate("dharmakṣetra", "iast", "tamil")?;
println!("{}", result); // "தர்மக்ஷேத்ர"

// Schema-generated converters in action (SLP1 writes dh as "D" and ṣ as "z")
let result = transliterator.transliterate("Darmakzetra", "slp1", "iast")?;
println!("{}", result); // "dharmakṣetra"

Python Bindings (PyO3)

import shlesha

# Create transliterator with all schema-generated converters
transliterator = shlesha.Shlesha()

# Fast schema-based conversion
result = transliterator.transliterate("ধর্ম", "bengali", "telugu")
print(result)  # "ధర్మ"

# Conversion with metadata tracking (unknown characters are reported, not dropped)
result = transliterator.transliterate_with_metadata("धर्मkr", "devanagari", "iast")
print(f"Output: {result.output}")  # "dharmakr"
print(f"Unknown tokens: {len(result.metadata.unknown_tokens)}")

# Runtime extensibility
scripts = shlesha.get_supported_scripts()
print(f"Supports {len(scripts)} scripts: {scripts}")

Command Line Interface

# Schema-generated high-performance conversion (SLP1 input: "D" = dh, "z" = ṣ)
shlesha transliterate --from slp1 --to devanagari "Darmakzetra"
# Output: धर्मक्षेत्र

# Cross-script conversion via dual hubs  
shlesha transliterate --from itrans --to tamil "dharma"
# Output: தர்ம

# List all schema-generated + hand-coded scripts
shlesha scripts
# Output: bengali, devanagari, gujarati, harvard_kyoto, iast, iso15919, itrans, ...

WebAssembly (Browser/Node.js)

import init, { WasmShlesha } from './pkg/shlesha.js';

async function demo() {
    await init();
    const transliterator = new WasmShlesha();
    
    // Schema-generated converter performance in browser
    const result = transliterator.transliterate("કર્મ", "gujarati", "devanagari");
    console.log(result); // "कर्म"
    
    // Runtime script discovery
    const scripts = transliterator.listSupportedScripts();
    console.log(`${scripts.length} scripts available`);
}

Runtime Schema Loading

Shlesha supports runtime schema loading across all APIs to add custom scripts without recompilation.

Rust API

use shlesha::Shlesha;

let mut transliterator = Shlesha::new();

// Load custom schema from YAML content
let custom_schema = r#"
metadata:
  name: "my_custom_script"
  script_type: "roman"
  has_implicit_a: false
  description: "My custom transliteration scheme"

target: "iso15919"

mappings:
  vowels:
    "a": "a"
    "e": "ē"
  consonants:
    "k": "k"
    "t": "ṭ"
"#;

// Load the schema at runtime
transliterator.load_schema_from_string(custom_schema, "my_custom_script")?;

// Use immediately without recompilation
let result = transliterator.transliterate("kate", "my_custom_script", "devanagari")?;
println!("{}", result); // "कटे" (k + a, then ṭ + ē)

// Schema management
let info = transliterator.get_schema_info("my_custom_script").unwrap();
println!("Loaded {} with {} mappings", info.name, info.mapping_count);

Python API

import shlesha

transliterator = shlesha.Shlesha()

# Load schema from YAML string
yaml_content = """
metadata:
  name: "custom_script"
  script_type: "roman"
  has_implicit_a: false
  description: "Custom transliteration"

target: "iso15919"

mappings:
  vowels:
    "a": "a"
  consonants:
    "k": "k"
"""

# Runtime loading
transliterator.load_schema_from_string(yaml_content, "custom_script")

# Immediate usage
result = transliterator.transliterate("ka", "custom_script", "devanagari")
print(result)  # "क"

# Schema info
info = transliterator.get_schema_info("custom_script")
print(f"Script: {info['name']}, Mappings: {info['mapping_count']}")

# Schema management
transliterator.remove_schema("custom_script")
transliterator.clear_runtime_schemas()

JavaScript/WASM API

import init, { WasmShlesha } from './pkg/shlesha.js';

async function loadCustomScript() {
    await init();
    const transliterator = new WasmShlesha();
    
    // Define custom schema
    const yamlContent = `
metadata:
  name: "custom_script"
  script_type: "roman"
  has_implicit_a: false
  description: "Custom script"

target: "iso15919"

mappings:
  vowels:
    "a": "a"
  consonants:
    "k": "k"
`;
    
    // Load at runtime
    transliterator.loadSchemaFromString(yamlContent, "custom_script");
    
    // Use immediately
    const result = transliterator.transliterate("ka", "custom_script", "devanagari");
    console.log(result); // "क"
    
    // Get schema information
    const info = transliterator.getSchemaInfo("custom_script");
    console.log(`Name: ${info.name}, Mappings: ${info.mapping_count}`);
}

Key Runtime Features

✅ Load from YAML strings - No file system required
✅ Load from file paths - For development workflows
✅ Schema validation - Automatic error checking
✅ Hot reloading - Add/remove schemas dynamically
✅ Schema introspection - Get metadata about loaded schemas
✅ Memory management - Clear schemas when done
✅ Cross-platform - Identical API across Rust, Python, WASM
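The load, inspect, remove, and clear operations listed above amount to a keyed registry of schemas. The sketch below is a minimal in-memory model of that idea, assuming a schema has already been parsed into a name and a flat mapping table. `Schema`, `RuntimeRegistry`, and their methods are hypothetical and do not reflect Shlesha's internal types. YAML parsing is left out so the example depends only on the standard library.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a parsed schema.
#[derive(Debug, Clone)]
struct Schema {
    name: String,
    mappings: HashMap<String, String>,
}

/// Builds a schema from (source, target) pairs; a convenience for the example.
fn schema(name: &str, pairs: &[(&str, &str)]) -> Schema {
    Schema {
        name: name.to_string(),
        mappings: pairs.iter().map(|(s, t)| (s.to_string(), t.to_string())).collect(),
    }
}

#[derive(Default)]
struct RuntimeRegistry {
    schemas: HashMap<String, Schema>,
}

impl RuntimeRegistry {
    /// Validates and stores a schema; rejects empty names and empty tables.
    fn load(&mut self, schema: Schema) -> Result<(), String> {
        if schema.name.is_empty() {
            return Err("schema name is empty".to_string());
        }
        if schema.mappings.is_empty() {
            return Err(format!("schema '{}' has no mappings", schema.name));
        }
        self.schemas.insert(schema.name.clone(), schema);
        Ok(())
    }

    /// Introspection: number of mappings in a loaded schema.
    fn mapping_count(&self, name: &str) -> Option<usize> {
        self.schemas.get(name).map(|s| s.mappings.len())
    }

    /// Removes one schema; returns whether it was present.
    fn remove(&mut self, name: &str) -> bool {
        self.schemas.remove(name).is_some()
    }

    /// Drops every runtime-loaded schema.
    fn clear(&mut self) {
        self.schemas.clear();
    }
}

fn main() {
    let mut registry = RuntimeRegistry::default();
    registry.load(schema("custom_script", &[("a", "a"), ("k", "k")])).unwrap();
    println!("{:?}", registry.mapping_count("custom_script"));
    registry.clear();
    println!("{:?}", registry.mapping_count("custom_script"));
}
```

Loading a schema under a name that already exists replaces the previous entry, which is one simple way to model hot reloading.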

Use Cases

Development & Testing

// Test schema variations quickly
transliterator.load_schema_from_string(variant_a, "test_a")?;
transliterator.load_schema_from_string(variant_b, "test_b")?;
// Compare results immediately

Dynamic Applications

# User uploads custom transliteration scheme
user_schema = request.files['schema'].read().decode('utf-8')
transliterator.load_schema_from_string(user_schema, user_id)
# Use immediately in application

Configuration-Driven Systems

// Load schemas from configuration
config.schemas.forEach(schema => {
    transliterator.loadSchemaFromString(schema.content, schema.name);
});

Performance & Benchmarks

Performance Analysis

Shlesha uses a hub-and-spoke architecture with schema-generated converters, trading some performance for extensibility compared to direct conversion approaches.

Performance Characteristics

Competitive with other transliteration libraries
Schema-generated converters match hand-coded performance
Optimized for both short and long text processing

Architecture Trade-offs

| Aspect | Shlesha | Vidyut |
|--------|---------|--------|
| Performance | Hub-based | Direct conversion |
| Extensibility | Runtime schemas | Compile-time only |
| Script Support | 15+ (easily expandable) | Limited |
| Architecture | Hub-and-spoke | Direct conversion |
| Bindings | Rust/Python/WASM/CLI | Rust only |
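One way to see the extensibility side of this trade-off is to count converters. With direct conversion every ordered pair of scripts needs its own converter, while a hub design needs only one converter into the hub and one out of it per script. The sketch below computes both counts under a simplified single-hub model. The function names are illustrative, and the model ignores the second hub described earlier.

```rust
/// Ordered source-to-target pairs among `n` scripts, excluding same-script pairs.
fn direct_pairs(n: u64) -> u64 {
    n * n.saturating_sub(1)
}

/// Spoke converters in a single-hub model: one to the hub and one from the hub
/// for each non-hub script.
fn hub_converters(n: u64) -> u64 {
    2 * n.saturating_sub(1)
}

fn main() {
    for n in [5u64, 15, 22] {
        println!(
            "{} scripts: {} direct pairs, {} hub converters",
            n,
            direct_pairs(n),
            hub_converters(n)
        );
    }
}
```

For 15 scripts this gives 210 ordered pairs against 28 spoke converters, so adding a script costs two converters in the hub model rather than one per existing script in each direction.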

Schema-Driven Development

Adding New Scripts

To add a new script, describe it in a schema file and rebuild:

# schemas/new_script.yaml
metadata:
  name: "NewScript"
  description: "Description of the script"
  unicode_block: "NewScript"
  has_implicit_vowels: true

mappings:
  vowels:
    - source: "𑀅"  # New script character
      target: "अ"   # Devanagari equivalent
    # ... add more mappings

# Rebuild to include new script
cargo build
# New script automatically available!

Template-Based Generation

Converters are generated using Handlebars templates for consistency:

{{!-- templates/indic_converter.hbs --}}
/// {{metadata.description}} converter generated from schema
pub struct {{pascal_case metadata.name}}Converter {
    {{snake_case metadata.name}}_to_deva_map: HashMap<char, char>,
    deva_to_{{snake_case metadata.name}}_map: HashMap<char, char>,
}

impl {{pascal_case metadata.name}}Converter {
    pub fn new() -> Self {
        // Generated O(1) lookup tables
        let mut {{snake_case metadata.name}}_to_deva = HashMap::new();
        {{#each character_mappings}}
        {{snake_case ../metadata.name}}_to_deva.insert('{{this.source}}', '{{this.target}}');
        {{/each}}
        // ... template continues
    }
}

Quality Assurance

Test Suite

127 tests covering all functionality
Schema-generated converter tests for all 14 generated converters
Performance regression tests ensuring schema-generated converters match hand-coded speed
Cross-script conversion matrix testing all 210+ pairs
Unknown character handling

Build System Validation

# Run the library unit tests, including the schema-generated converters
cargo test --lib

# Verify all conversions work
cargo test comprehensive_bidirectional_tests

# Performance benchmarks
cargo run --example shlesha_vs_vidyut_benchmark

Build Configuration & Features

Schema Processing Features

# Default: Schema-generated + hand-coded converters
cargo build

# Development mode with schema recompilation
cargo build --features "schema-dev"

# Minimal build (hand-coded only)
cargo build --no-default-features --features "hand-coded-only"

# All features (Python + WASM + CLI)
cargo build --features "python,wasm,cli"

Runtime Extensibility

let mut transliterator = Shlesha::new();

// Load an additional schema from a file path at runtime
transliterator.load_schema("path/to/new_script.yaml")?;

// Schema registry access
let scripts = transliterator.list_supported_scripts();
println!("Dynamically loaded: {:?}", scripts);

Advanced Features

Metadata Collection

// Track unknown characters and conversion details
let result = transliterator.transliterate_with_metadata("धर्मkr", "devanagari", "iast")?;

if let Some(metadata) = result.metadata {
    println!("Conversion: {} → {}", metadata.source_script, metadata.target_script);
    for unknown in metadata.unknown_tokens {
        println!("Unknown '{}' at position {}", unknown.token, unknown.position);
    }
}
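The idea behind metadata collection can be sketched without the library: convert what is known, pass the rest through unchanged, and record where each unknown character occurred. This is a minimal sketch using the two Bengali vowel mappings from the schema excerpt earlier. `UnknownToken`, `bengali_vowel_to_deva`, and `convert_with_metadata` are hypothetical, and reporting positions as character indices (not byte offsets) is an assumption, not documented Shlesha behavior.

```rust
#[derive(Debug, PartialEq)]
struct UnknownToken {
    token: char,
    position: usize, // character index in the input (assumption)
}

/// Hypothetical lookup with only two mappings from the Bengali schema excerpt.
fn bengali_vowel_to_deva(c: char) -> Option<char> {
    match c {
        'অ' => Some('अ'),
        'আ' => Some('आ'),
        _ => None,
    }
}

/// Converts known characters and records every character it could not map.
fn convert_with_metadata(input: &str) -> (String, Vec<UnknownToken>) {
    let mut output = String::new();
    let mut unknown = Vec::new();
    for (position, c) in input.chars().enumerate() {
        match bengali_vowel_to_deva(c) {
            Some(mapped) => output.push(mapped),
            None => {
                output.push(c); // pass the character through unchanged
                unknown.push(UnknownToken { token: c, position });
            }
        }
    }
    (output, unknown)
}

fn main() {
    let (output, unknown) = convert_with_metadata("অxআ");
    println!("{}", output);
    for u in &unknown {
        println!("Unknown '{}' at position {}", u.token, u.position);
    }
}
```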

Script Characteristics

// Schema-aware script properties
let registry = ScriptConverterRegistry::default();

// Indic scripts have implicit vowels
assert!(registry.script_has_implicit_vowels("bengali").unwrap());
assert!(registry.script_has_implicit_vowels("devanagari").unwrap());

// Roman schemes don't
assert!(!registry.script_has_implicit_vowels("itrans").unwrap());
assert!(!registry.script_has_implicit_vowels("slp1").unwrap());
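Why the implicit-vowel flag matters can be shown with a toy roman-to-Devanagari converter. In Devanagari a bare consonant letter already carries the vowel "a", so a roman "a" after a consonant emits nothing, other vowels emit a dependent vowel sign, and a consonant with no following vowel needs a virama. The sketch below handles three consonants and two vowels. All names in it are hypothetical and it is not Shlesha's algorithm.

```rust
const VIRAMA: char = '\u{094D}'; // suppresses the implicit "a"

/// Tiny consonant table: k, r, m.
fn consonant(c: char) -> Option<char> {
    match c {
        'k' => Some('\u{0915}'), // क
        'r' => Some('\u{0930}'), // र
        'm' => Some('\u{092E}'), // म
        _ => None,
    }
}

/// Returns (independent vowel, dependent vowel sign); "a" has no sign.
fn vowel(c: char) -> Option<(char, Option<char>)> {
    match c {
        'a' => Some(('\u{0905}', None)),             // अ, implicit after a consonant
        'i' => Some(('\u{0907}', Some('\u{093F}'))), // इ, sign ि
        _ => None,
    }
}

fn roman_to_deva(input: &str) -> String {
    let mut out = String::new();
    let mut pending = false; // last output was a consonant still awaiting a vowel
    for c in input.chars() {
        if let Some(cons) = consonant(c) {
            if pending {
                out.push(VIRAMA); // consonant cluster: kill the implicit vowel
            }
            out.push(cons);
            pending = true;
        } else if let Some((independent, sign)) = vowel(c) {
            if pending {
                if let Some(s) = sign {
                    out.push(s);
                }
            } else {
                out.push(independent);
            }
            pending = false;
        } else {
            if pending {
                out.push(VIRAMA);
            }
            out.push(c);
            pending = false;
        }
    }
    if pending {
        out.push(VIRAMA); // word-final consonant without a vowel
    }
    out
}

fn main() {
    for word in ["ka", "k", "karma", "kim"] {
        println!("{} -> {}", word, roman_to_deva(word));
    }
}
```

A roman-to-roman conversion needs none of this state, which is why a converter has to know whether each side of a conversion has implicit vowels.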

Hub Processing Control

// Fine-grained control over conversion paths
let hub = Hub::new();

// Direct hub operations
let iso_text = hub.deva_to_iso("धर्म")?;  // Devanagari → ISO
let deva_text = hub.iso_to_deva("dharma")?;  // ISO → Devanagari

// Cross-hub conversion with metadata
let result = hub.deva_to_iso_with_metadata("धर्म")?;

Documentation

Architecture Guide - Deep dive into hub-and-spoke design
Schema Reference - Complete schema format documentation
Performance Guide - Optimization techniques and benchmarks
API Reference - Complete function and type reference
Developer Setup - Development environment setup
Release System - Automated release workflow overview
Deployment Guide - Complete deployment and environment setup
crates.io RC Support - Release candidate publishing guide
Security Setup - Token management and environment security
Contributing Guide - Guidelines for contributors

Quick Reference

# Generate documentation
cargo doc --open

# Run the bundled examples
cargo run --example shlesha_vs_vidyut_benchmark
cargo run --example roman_allocation_analysis  

# Performance testing
cargo bench

Releases

Shlesha uses an automated release system for publishing to package registries:

Quick Release

# Guided release process
./scripts/release.sh

Package Installation

# Python (PyPI)
pip install shlesha

# WASM (npm)  
npm install shlesha-wasm

# Rust (crates.io)
cargo add shlesha

See DEPLOYMENT.md for complete release documentation.

Contributing

Contributions are welcome. The schema-driven architecture simplifies adding new scripts:

Add Schema: Create TOML/YAML mapping file
Test: Run test suite to verify
Benchmark: Ensure performance maintained
Submit: Open PR with schema and tests

See CONTRIBUTING.md for detailed guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Unicode Consortium for Indic script standards
ISO-15919 for romanization standardization
Sanskrit Library for SLP1 encoding schemes
Vidyut Project for performance benchmarking standards
Rust Community for excellent tools (PyO3, wasm-pack, handlebars)