@nerdbond/text

v0.0.2

Published

2 years ago

lancejpollard

Overview

This library aims to convert text in all kinds of writing systems to a consistent and stable ASCII encoding, which can then be further processed into a more readable form. It should be able to take text that mixes various scripts, isolate the parts, and romanize each of them as well as possible.

Many languages use the same script in slightly different ways. For example, Vietnamese uses the Latin alphabet with many specialized diacritics, as does Chinese Pinyin. The Arabic script is used in various forms for Standard Arabic, Persian, and Urdu, among others. So given an arbitrary chunk of text whose language we don't know, the library can only make a rough guess (for example, that it is Latin or Arabic script, without knowing whether it is Vietnamese, Finnish, Icelandic, etc.).

When we know the language of the text, such as Icelandic, we can write a custom handler that transliterates it as well as we can. So there are two entry points:

Unknown text
Known text

If we know the type of text and the system it's written in, we can potentially add a parser for it. Otherwise the library falls back to a more generic parser, such as the Latin parser.
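The fallback behavior described above can be sketched roughly as follows. This is only an illustration, not the library's code: `Transliterator`, `genericLatin`, `handlers`, and `transliterate` are hypothetical names, and the generic path here simply strips combining diacritics, which is much cruder than the ASCII encoding this library targets.

```typescript
// Hypothetical sketch: none of these names come from @nerdbond/text.
type Transliterator = (input: string) => string

// Generic Latin fallback: decompose, then drop combining diacritical marks.
const genericLatin: Transliterator = (input) =>
  input.normalize('NFD').replace(/[\u0300-\u036f]/g, '')

// Language-specific handlers, keyed by an assumed language name.
const handlers = new Map<string, Transliterator>([
  [
    'icelandic',
    // Letters such as þ and ð have no canonical decomposition,
    // so they need explicit mappings before the generic step.
    (input) =>
      genericLatin(
        input
          .replace(/þ/g, 'th')
          .replace(/Þ/g, 'Th')
          .replace(/ð/g, 'd')
          .replace(/Ð/g, 'D')
          .replace(/æ/g, 'ae')
          .replace(/Æ/g, 'Ae'),
      ),
  ],
])

// Use the specific handler when the language is known, else fall back.
function transliterate(input: string, language?: string): string {
  const handler = language !== undefined ? handlers.get(language) : undefined
  return (handler ?? genericLatin)(input)
}
```

With the language given, `transliterate('Þór', 'icelandic')` can map `Þ` explicitly. Without it, the generic path leaves `Þ` untouched because it only knows how to remove combining marks.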

Some languages have very good transliteration capabilities, such as those written in the many Indic scripts that are used for just one or a few languages (Tamil, Thai, or Sinhala, for example). These languages can be transliterated fairly well. But given Yoruba or Vietnamese, without knowing it's one of those languages, we won't be able to get very close in terms of pronunciation automatically. You need to tell the library to use those specific parsers.

Installation

pnpm add @nerdbond/text
yarn add @nerdbond/text
npm i @nerdbond/text

Usage

You can use this library to process text in a few steps:

Convert written text in various languages to ASCII chat text (seed chat text).
Convert that ASCII chat text to diacritic-rich chat text (rose chat text).
Or convert the ASCII text to simplified chat text (bird chat text), which loses the pronunciation factors but makes it easy on the eyes.

import text from '@nerdbond/text'

text.tibetan.make('འཁངས') // => khaq

Make it more human-readable:

import text from '@nerdbond/text'
import chat from '@nerdbond/chat'

chat.read(text.tibetan.make('འཁངས')) // => khang

Find out what script some text is from:

import text from '@nerdbond/text'

text.find('कल्पना') // => { form: 'devanagari', rank: 1 }
text.rank('कल्पना') // gives back more than one language if apparent.
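As a rough illustration of how script detection like this can work, the sketch below counts code points per script using a few coarse Unicode block ranges. It is not the library's implementation: `scriptOf`, `rankScripts`, and `findScript` are hypothetical names, the range table is far from exhaustive, and treating `rank` as the share of recognized characters is an assumption about what that field means.

```typescript
// Hypothetical sketch, not the implementation behind text.find or text.rank.
interface ScriptRange {
  form: string
  from: number
  to: number
}

interface ScriptGuess {
  form: string
  rank: number // assumed meaning: share of recognized characters, 0 to 1
}

// Coarse block ranges; the Latin range also covers some punctuation and symbols.
const scriptRanges: ScriptRange[] = [
  { form: 'latin', from: 0x0041, to: 0x024f },
  { form: 'arabic', from: 0x0600, to: 0x06ff },
  { form: 'devanagari', from: 0x0900, to: 0x097f },
  { form: 'tibetan', from: 0x0f00, to: 0x0fff },
]

function scriptOf(codePoint: number): string | undefined {
  const hit = scriptRanges.find((r) => codePoint >= r.from && codePoint <= r.to)
  return hit === undefined ? undefined : hit.form
}

// Count code points per script and sort by share, largest first.
function rankScripts(text: string): ScriptGuess[] {
  const counts = new Map<string, number>()
  let total = 0
  for (const char of Array.from(text)) {
    const form = scriptOf(char.codePointAt(0) ?? 0)
    if (form === undefined) continue // skip spaces, digits, unknown scripts
    counts.set(form, (counts.get(form) ?? 0) + 1)
    total += 1
  }
  return Array.from(counts.entries())
    .map(([form, count]) => ({ form, rank: count / total }))
    .sort((a, b) => b.rank - a.rank)
}

// The single best guess, or undefined when nothing was recognized.
function findScript(text: string): ScriptGuess | undefined {
  return rankScripts(text)[0]
}
```

Note that this only identifies the script. Telling Vietnamese from Finnish within Latin text would need language-level evidence that a range table cannot provide.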

TODO

Take mixed-script text and transliterate it as well as possible.

import text from '@nerdbond/text'

text.make('कल्पनाའཁངས')
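One possible first step toward this TODO is to split the input into runs of a single script, so that each run can be handed to its own handler. The sketch below is an assumption about how that might look, not a planned design: `blockOf`, `splitByScript`, and the `Run` type are hypothetical, and only two scripts are recognized.

```typescript
// Hypothetical sketch: splitting mixed-script text into per-script runs.
interface Run {
  form: string
  text: string
}

// Classify one character by coarse Unicode block (only two scripts here).
function blockOf(char: string): string {
  const codePoint = char.codePointAt(0) ?? 0
  if (codePoint >= 0x0900 && codePoint <= 0x097f) return 'devanagari'
  if (codePoint >= 0x0f00 && codePoint <= 0x0fff) return 'tibetan'
  return 'other'
}

// Group consecutive characters that share a script into one run.
function splitByScript(input: string): Run[] {
  const runs: Run[] = []
  for (const char of Array.from(input)) {
    const form = blockOf(char)
    const last = runs[runs.length - 1]
    if (last !== undefined && last.form === form) {
      last.text += char
    } else {
      runs.push({ form, text: char })
    }
  }
  return runs
}

const runs = splitByScript('कल्पनाའཁངས')
// Two runs: one Devanagari, one Tibetan.
console.log(runs.map((run) => `${run.form}: ${run.text}`))
```

Each run could then be passed to a script-specific call such as `text.tibetan.make`, with the results joined back together in order.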

License

MIT

NerdBond

This is being developed by the folks at NerdBond, a California-based project for helping humanity master information and computation. NerdBond started in the winter of 2008 as a spark of an idea, became a company 10 years later in the winter of 2018, and is now a seed of a project just beginning its development phases. It is entirely bootstrapped by working full time and running Etsy and Amazon shops. Also find us on Facebook, Twitter, and LinkedIn. Check out our other GitHub projects as well!
