hollerith

v12.0.1

Published

5 months ago

vectorial indices (VDXs) that allow for arbitrary many interstitial elements with a binary sortable representation

0High
0Medium
0Low

loveencounterflow

vectorial indices vdx

Hollerith

Table of Contents generated with DocToc

Hollerith

Hollerith

vectorial indices ('vindices' or VDXs) that allow for arbitrary many interstitial elements with a binary sortable representation

Compact Encoding

Motivation

| +A | +B | +C | | −A | -------: | ---: | ---: | --- | ---: | +001 | 1 | o1 | | +002 | 2 | o2 | | +003 | 3 | o3 | | +004 | 4 | o4 | | +005 | 5 | o5 | | +006 | 6 | o6 | | +007 | 7 | o7 | | +008 | 8 | o8 | | +009 | 9 | o9 | | +010 | 10 | p10 | | +011 | 11 | p11 | | +012 | 12 | p12 | | +013 | 13 | p13 | | +014 | 14 | p14 | | +015 | 15 | p15 | | ... | ... | ... | | ... | +985 | 985 | q985 | | +986 | 986 | q986 | | +987 | 987 | q987 | | +988 | 988 | q988 | | +989 | 989 | q989 | | +990 | 990 | q990 | | +991 | 991 | q991 | | +992 | 992 | q992 | | +993 | 993 | q993 | | +994 | 994 | q994 | | +995 | 995 | q995 | | +996 | 996 | q996 | | +997 | 997 | q997 | | +998 | 998 | q998 | | +999 | 999 | q999 | | −B | −C | | ---: | ---: | | 1000 + −999 = 1 | 001 | k001 | | 1000 + −998 = 2 | 002 | k002 | | 1000 + −997 = 3 | 003 | k003 | | 1000 + −996 = 4 | 004 | k004 | | 1000 + −995 = 5 | 005 | k005 | | 1000 + −994 = 6 | 006 | k006 | | 1000 + −993 = 7 | 007 | k007 | | 1000 + −992 = 8 | 008 | k008 | | 1000 + −991 = 9 | 009 | k009 | | 1000 + −990 = 10 | 010 | k010 | | 1000 + −989 = 11 | 011 | k011 | | 1000 + −988 = 12 | 012 | k012 | | 1000 + −987 = 13 | 013 | k013 | | 1000 + −986 = 14 | 014 | k014 | | 1000 + −985 = 15 | 015 | k015 | ... | ... | ... | | 1000 + −015 = 985 | 85 | l85 | | 1000 + −014 = 986 | 86 | l86 | | 1000 + −013 = 987 | 87 | l87 | | 1000 + −012 = 988 | 88 | l88 | | 1000 + −011 = 989 | 89 | l89 | | 1000 + −010 = 990 | 0 | m0 | | 1000 + −009 = 991 | 1 | m1 | | 1000 + −008 = 992 | 2 | m2 | | 1000 + −007 = 993 | 3 | m3 | | 1000 + −006 = 994 | 4 | m4 | | 1000 + −005 = 995 | 5 | m5 | | 1000 + −004 = 996 | 6 | m6 | | 1000 + −003 = 997 | 7 | m7 | | 1000 + −002 = 998 | 8 | m8 | | 1000 + −001 = 999 | 9 | m9 |

For positive numbers greater than zero as in +A:
- +B: Remove leading zeroes, if any.
- +C: Prepend a magnifier corresponding to the number of digits; longer numbers get a lexicographically bigger magnifier.
For negative numbers greater than zero:
- −A: Add _max_integer plus one; this 'lifts' −999 to +1
- −B: Do two things:
  - pad with leading zeroes up to the length of _max_integer and
  - remove leading nines.
- −C: ; Now absolutely small negative integers have few digits again while absolutely large ones have many digits again. Prepend a magnifier corresponding to the number of digits; numbers with more digits get a prefix that lexicographically precedes ones with fewer digits.

+000 0 n

Invariants

blank: ' ' # separator used in magnifiers and uniliterals
alphabet: '0123456789' # digits; length of alphabet is the base
magnifiers: 'ABC XYZ' #
uniliterals: 'EFGHIJKLM N OPQRSTUVW' # negative uniliterals, blank, zero uniliteral, blank, positive uniliterals
dimension: 3 # number of indices supported
zpuns: 'NOPQRSTUVW' # DERIVED # zero and positive uniliteral numbers
nuns: 'EFGHIJKLM' # DERIVED # negative uniliteral numbers
_max_integer: +999 # DERIVED
_min_integer: -999 # DERIVED
zpun_max: +9 # DERIVED # biggest number representable as uniliteral
nun_min: -9 # DERIVED # smallest number representable as uniliteral
base: 10 # DERIVED from length of alphabet
nmag_chrs_reversed: 'ABC' # DERIVED from first part of magnifiers
pmag_chrs: 'XYZ' # DERIVED from latter part of magnifiers
pmag: ' XYZ' # DERIVED from magnifiers # positive 'magnifier' for 1 to 3 positive digits
nmag: ' CBA' # DERIVED from magnifiers # negative 'magnifier' for 1 to 3 negative digits
leading_niners_re: /// ^ (?: 9 )*? (?= . $ ) ///gv # DERIVED # 'negative leader', discardable leading highest digits ('niners') of lifted negative numbers
_max_digits_per_idx: 3 # DERIVED # maximum number of digits of a single index in this system; must be at least 1, at most number of positive magnifiers
no codepoint is repeated
only codepoints between U+0000 and U+10ffff are supported;
- NOTE this needs re-writing string index access to array index access
all codepoints must be be given in monotonically ascending order, both individually and when concatenated as alphabet + nmag_chrs_reversed + nuns + zpuns (with blanks elided)

To Do

Digits Needed to Represent an 'All-9s Number' Less Than Max Safe Integer

| from base | to base | max_niners |
| -------:  | ------: | ---------: |
|        2  |         |       52   |
|        3  |         |       33   |
|        4  |         |       26   |
|        5  |         |       22   |
|        6  |         |       20   |
|        7  |         |       18   |
|        8  |         |       17   |
|        9  |         |       16   |
|       10  |     11  |       15   |
|       12  |     13  |       14   |
|       14  |     16  |       13   |
|       17  |     21  |       12   |
|       22  |     28  |       11   |
|       29  |     39  |       10   |
|       40  |     59  |        9   |
|       60  |     98  |        8   |
|       99  |    128  |        7   |

Ex. for digits 01234 i.e. base 5 you can write out at most 22 'Niners' (in the sense of 'highest single-digit number in that base'; that's a 4) before getting an integer that is greater than Number.MAX_SAFE_INTEGER; adding a 23rd digit 4 will cross that limit.

The largest base supported by JS (in parseInt( s, base ) and Number::toString( base )) is 36 which has z for its 'niner' (biggest digit). As per the above table, you can have integers of up to 10 zs, so zzzzzzzzzz (3,656,158,440,062,975) is the largest 'all-Niners' safe integer in that system. All bases from 29 up to and including 39 have that same 10-digit limit.

Bases 14, 15, and 16 all have a limit of 15 'Niners', so their biggest safe 'all-Niners' are ddddddddddddddd, eeeeeeeeeeeeeee, and fffffffffffffff, respectively.

These numbers are calculated with the following formulas / functions:

Logarithm to a given base of a given number n:

log_to_base = ( n, base ) -> ( Math.log n ) / ( Math.log base )

Number of digits required to write a given number n in a positional system with a given base:
```
get_required_digits = ( n, base ) -> Math.ceil log_to_base n, base
```
Maximum number of highest-value digits (i.e. base - 1) to write a number that does not exceed a given number n:
```
get_max_niner_digit_count  = ( n, base ) -> ( required_digits n, base ) - 1
```

Why not VarInts, LEB128?

Using https://github.com/joeltg/big-varint, we can easily see that, with Signed BigInt VarInts, negative numbers get interspersed as odd byte values in between the positive integers, which get represented as even byte values such that, for the numbers -3 to +3, you get the single-byte encodings [ 5, 3, 1, 0, 2, 4, 6, ], in that order. This entails that encoded values will be sorted according to their absolute value, not their signed value, with each negative number (like -3) directly preceding its positive counterpart (+3). This runs counter the properties we need for Vectorial Indexes.

Other than that, VarInt is also an explicitly binary encoding in the sense that it uses sequences of bytes with intentional disregard for how those bytes could be turned into manageable chunks of printable text. That's great for some situations, but if you want to have textual (a.k.a. 'human-readable') input and output, there's always some extra encoding-decoding step necessary in addition to the necessity to encode and decode between index arrays and their sortable version.

{ encode,
  decode,     } = ( require 'big-varint' ).signed
#...........................................................................................................
numbers = [ -3 .. +3 ]
varints = ( encode ( BigInt n ) for n in numbers )
#...........................................................................................................
numbers.sort ( a, b ) ->
  return +1 if a > b
  return -1 if a < b
  return  0
#...........................................................................................................
varints.sort ( a, b ) ->
  return +1 if a[ 0 ] > b[ 0 ]
  return -1 if a[ 0 ] < b[ 0 ]
  return  0

| numbers | varints |
| ------: | ------: |
|      -3 |      0n |
|      -2 |     -1n |
|      -1 |      1n |
|       0 |     -2n |
|      +1 |      2n |
|      +2 |     -3n |
|      +3 |      3n |

Using LEB128 is similarly unsuited; sorting the LEB128 encodings of the BigInt sequence [ -127n, -3, -2, -1, 0, 1, 2, 3, 127n ] yields [ 0n, 1n, 2n, 3n, -3n, -2n, -1n, -127n, 127n, ], which is not conducive to VDX sorting as described here.

Configuration

[—] validate that _max_digits_per_idx and _max_integer, when both set, don't contradict each other
[—] devise terminology to distinguish uniliteral for zero (constants_10mvp2: N) and digit for zero (constants_10mvp2: 0, intermittingly call _digit_zero)
- [—] digit nine ('novenary digit'): 'nova'
- [—] digit zero: 'nought'
- [—] uniliteral zero: 'cipher'
- [—] add settings to retrieve both
[—] _max_idx_width is derived as max_idx_digits + 1; dimension is the (maximum) count of indexes per vector(ial index); _sortkey_width is number of code units (i.e. string length, not number of characters) of all sortkeys produced by Hollerith::encode_vdx(), derived as _max_idx_width * dimension
[—] Hollerith::encode() resolves to encode_idx() or encode_vdx(), depending on whether input is number or list
- [—] Hollerith::encode_idx() encodes a single integer between inclusive cfg._min_integer and cfg._max_integer without zero-padding; the maximum length of the returned value is given by _max_idx_width (one uniliteral / magnifier plus max_idx_digits)
- [—] Hollerith::encode_vdx()
- [—] depending on use case, there's a case to be made for _idx_width both to exclude and to include the magnifier, so constants_10mvp2._idx_width is 3 (because its longest representable numbers are -999 and +999) and 4 (because its longest representable numbers are written out as A000 and Z999, respectively).

Other

[—] support codepoints beyond U+ffff; this needs re-writing string index access to frozen array index access
[—] in compile_sortkey_lexer, fix derivation of nuns_letters, puns_letters, nmag_letters, pmag_letters, zero_letters to work for non-BMP codepoints
[—] in Hollerith_typespace.magnifiers, nmag_bare_reversed and pmag_bare are derived as x.split @[CFG].blank where a RegEx compiled from @[CFG].blank (roughly as new RegExp "(?:#{RegExp.escape @[CFG].blank})+", 'v') should be used
[—] remove restriction that negative and positive magnifiers must have same length
[—] replace Hollerith_typespace.incremental_text(), Hollerith_typespace.decremental_text() with Hollerith_typespace.incremental(), Hollerith_typespace.decremental() that
- accept texts (to be turned into lists) and lists-of-characters
- skip over Hollerith_typespace[CFG].blanks
[—] avoid padding and reversing of character lists, rather, use appropriate indexes
[—] avoid having to declare magnifiers that are entirely covered by uniliterals
[—] re-implement unstable-anybase-brics.coffee#encode(), unstable-anybase-brics.coffee#decode() to avoid building intermediate values
[—] re-name Hollerith_typespace[CFG].alphabet to digits, which frees alphabet to signify the complete, ordered, blank-separated list of characters used to define a given Hollerith number format
[—] (-> Bric-A-Brac) implement a character analyzer that, given a string (or list, or set) of characters, returns a set of (RegEx-compatible) Unicode attributes that is common to all the characters in the input
[—] implement a way to declare alphabets with Unicode ranges. A range is declared as [a-z], with a lower and a higher single codepoint embedded in between [, -, ] (there's no clash with - being used literally in an alphabet because - (U+002d) comes before [ (U+005b), and the only character between [ and ] (U+005d) is \\ (U+005c), thus XYZ[\\-]]^_ can only be taken to mean XYZ\\]^_, XYZ[[-]]^_ can only be taken to mean XYZ[\\]^_, and so on).
[—] implement a way to declare RegExes...
- [—] ... that should apply to all declared Hollerith letters and cause errors where not matched
- [—] ... that cause non-conformant letters from the declared sets are silently skipped (such that a declaration of '[\x00-\x7f]' in conjunction with { skip_unless: /\p{L}/v, }) gives the set of US 7bit ASCII letters, how many there may ever be
[—] since integers beyond Number.MAX_SAFE_INTEGER can not be reliably used to count in steps of one, both the inclusion of _max_integer and _min_integer in a Hollerith numbering alphabet and the derivation of _max_integer and _min_integer from the base and the declared magnifiers must take this limit into account
[—] Allow composite VDXs, especially allow high-capacity encoder for first index, low-capacity encoder for trailing digits (as those are used for revisions)
- a typical setup would have a cardinals-only codec allowing big positive line numbers in the first position; since cardinals-only codecs start at zero, insertions before the first line are still possible, line number zero being used for anything going to the top of a document

Is Done

[+] why are there two zeroes? necessary?
[+] clarify, rename _max_digits_per_idx -> max_idx_digits
[+] regularize settings (CFG) keys:
- keys that start with a letter are user-settable
- keys that start with an _ underscore are treated as implementation details; unless mentioned in the documentation, setting them has undefined effects (normally they will just be overwritten)
- so are keys starting with an _ underscore, which are considered 'pro' or 'advanced' level ('know what you're doing')
- keys that start with a $ dollar sign are not to be set by users but are to be derived or otherwise set by the app; when they are encountered while compiling the CFG object, they cause an error

Don't

[🚫] implement a way to declare alphabets with Unicode ranges as in a..z or [a..z], e.g. { digits: 'A..DF..Z', } is equivalent to uppercase ASCII A thru Z with E excluded (this still leaves room for doubt how to interpret &...5: is it /[[&-.]5]/v i.e. "&'()*+,-.5" or is it /[&[.-5]]/v i.e. "&./012345"?)

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

Hollerith

Hollerith

Compact Encoding

Motivation

Invariants

See Also

To Do

Digits Needed to Represent an 'All-9s Number' Less Than Max Safe Integer

Why not VarInts, LEB128?

Configuration

Other

Is Done

Don't