hollerith
v12.0.1
Published
vectorial indices (VDXs) that allow for arbitrary many interstitial elements with a binary sortable representation
Downloads
29
Maintainers
Readme
Hollerith
Table of Contents generated with DocToc
Hollerith
vectorial indices ('vindices' or VDXs) that allow for arbitrary many interstitial elements with a binary sortable representation
Compact Encoding
Motivation
| +A | +B | +C | | −A | −B | −C | | -------: | ---: | ---: | --- | ---: | ---: | ---: | | +001 | 1 | o1 | | 1000 + −999 = 1 | 001 | k001 | | +002 | 2 | o2 | | 1000 + −998 = 2 | 002 | k002 | | +003 | 3 | o3 | | 1000 + −997 = 3 | 003 | k003 | | +004 | 4 | o4 | | 1000 + −996 = 4 | 004 | k004 | | +005 | 5 | o5 | | 1000 + −995 = 5 | 005 | k005 | | +006 | 6 | o6 | | 1000 + −994 = 6 | 006 | k006 | | +007 | 7 | o7 | | 1000 + −993 = 7 | 007 | k007 | | +008 | 8 | o8 | | 1000 + −992 = 8 | 008 | k008 | | +009 | 9 | o9 | | 1000 + −991 = 9 | 009 | k009 | | +010 | 10 | p10 | | 1000 + −990 = 10 | 010 | k010 | | +011 | 11 | p11 | | 1000 + −989 = 11 | 011 | k011 | | +012 | 12 | p12 | | 1000 + −988 = 12 | 012 | k012 | | +013 | 13 | p13 | | 1000 + −987 = 13 | 013 | k013 | | +014 | 14 | p14 | | 1000 + −986 = 14 | 014 | k014 | | +015 | 15 | p15 | | 1000 + −985 = 15 | 015 | k015 | | ... | ... | ... | | ... ... | ... | ... | | +985 | 985 | q985 | | 1000 + −015 = 985 | 85 | l85 | | +986 | 986 | q986 | | 1000 + −014 = 986 | 86 | l86 | | +987 | 987 | q987 | | 1000 + −013 = 987 | 87 | l87 | | +988 | 988 | q988 | | 1000 + −012 = 988 | 88 | l88 | | +989 | 989 | q989 | | 1000 + −011 = 989 | 89 | l89 | | +990 | 990 | q990 | | 1000 + −010 = 990 | 0 | m0 | | +991 | 991 | q991 | | 1000 + −009 = 991 | 1 | m1 | | +992 | 992 | q992 | | 1000 + −008 = 992 | 2 | m2 | | +993 | 993 | q993 | | 1000 + −007 = 993 | 3 | m3 | | +994 | 994 | q994 | | 1000 + −006 = 994 | 4 | m4 | | +995 | 995 | q995 | | 1000 + −005 = 995 | 5 | m5 | | +996 | 996 | q996 | | 1000 + −004 = 996 | 6 | m6 | | +997 | 997 | q997 | | 1000 + −003 = 997 | 7 | m7 | | +998 | 998 | q998 | | 1000 + −002 = 998 | 8 | m8 | | +999 | 999 | q999 | | 1000 + −001 = 999 | 9 | m9 |
For positive numbers greater than zero as in +A:
- +B: Remove leading zeroes, if any.
- +C: Prepend a magnifier corresponding to the number of digits; longer numbers get a lexicographically bigger magnifier.
For negative numbers greater than zero:
- −A: Add
_max_integerplus one; this 'lifts' −999 to +1 - −B: Do two things:
- pad with leading zeroes up to the length of
_max_integerand - remove leading nines.
- pad with leading zeroes up to the length of
- −C: ; Now absolutely small negative integers have few digits again while absolutely large ones have many digits again. Prepend a magnifier corresponding to the number of digits; numbers with more digits get a prefix that lexicographically precedes ones with fewer digits.
- −A: Add
+000 0 n
Invariants
blank:' '# separator used inmagnifiersanduniliteralsalphabet:'0123456789'# digits; length ofalphabetis thebasemagnifiers:'ABC XYZ'#uniliterals:'EFGHIJKLM N OPQRSTUVW'# negative uniliterals, blank, zero uniliteral, blank, positive uniliteralsdimension:3# number of indices supportedzpuns:'NOPQRSTUVW'# DERIVED # zero and positive uniliteral numbersnuns:'EFGHIJKLM'# DERIVED # negative uniliteral numbers_max_integer:+999# DERIVED_min_integer:-999# DERIVEDzpun_max:+9# DERIVED # biggest number representable as uniliteralnun_min:-9# DERIVED # smallest number representable as uniliteralbase:10# DERIVED from length of alphabetnmag_chrs_reversed:'ABC'# DERIVED from first part ofmagnifierspmag_chrs:'XYZ'# DERIVED from latter part ofmagnifierspmag:' XYZ'# DERIVED from magnifiers # positive 'magnifier' for 1 to 3 positive digitsnmag:' CBA'# DERIVED from magnifiers # negative 'magnifier' for 1 to 3 negative digitsleading_niners_re:/// ^ (?: 9 )*? (?= . $ ) ///gv# DERIVED # 'negative leader', discardable leading highest digits ('niners') of lifted negative numbers_max_digits_per_idx:3# DERIVED # maximum number of digits of a single index in this system; must be at least 1, at most number of positive magnifiersno codepoint is repeated
only codepoints between U+0000 and U+10ffff are supported;
- NOTE this needs re-writing string index access to array index access
all codepoints must be be given in monotonically ascending order, both individually and when concatenated as
alphabet + nmag_chrs_reversed + nuns + zpuns(withblanks elided)
See Also
inspired by & thx to https://stately.cloud/blog/encoding-sortable-binary-database-keys/
To Do
Digits Needed to Represent an 'All-9s Number' Less Than Max Safe Integer
| from base | to base | max_niners |
| -------: | ------: | ---------: |
| 2 | | 52 |
| 3 | | 33 |
| 4 | | 26 |
| 5 | | 22 |
| 6 | | 20 |
| 7 | | 18 |
| 8 | | 17 |
| 9 | | 16 |
| 10 | 11 | 15 |
| 12 | 13 | 14 |
| 14 | 16 | 13 |
| 17 | 21 | 12 |
| 22 | 28 | 11 |
| 29 | 39 | 10 |
| 40 | 59 | 9 |
| 60 | 98 | 8 |
| 99 | 128 | 7 |Ex. for digits 01234 i.e. base 5 you can write out at most 22 'Niners' (in the sense of 'highest
single-digit number in that base'; that's a 4) before getting an integer that is greater than
Number.MAX_SAFE_INTEGER; adding a 23rd digit 4 will cross that limit.
The largest base supported by JS (in parseInt( s, base ) and Number::toString( base )) is 36 which has
z for its 'niner' (biggest digit). As per the above table, you can have integers of up to 10 zs, so
zzzzzzzzzz (3,656,158,440,062,975) is the largest 'all-Niners' safe integer in that system. All bases from
29 up to and including 39 have that same 10-digit limit.
Bases 14, 15, and 16 all have a limit of 15 'Niners', so their biggest safe 'all-Niners' are
ddddddddddddddd, eeeeeeeeeeeeeee, and fffffffffffffff, respectively.
These numbers are calculated with the following formulas / functions:
Logarithm to a given
baseof a given numbern:log_to_base = ( n, base ) -> ( Math.log n ) / ( Math.log base )Number of digits required to write a given number
nin a positional system with a givenbase:get_required_digits = ( n, base ) -> Math.ceil log_to_base n, baseMaximum number of highest-value digits (i.e.
base - 1) to write a number that does not exceed a given numbern:get_max_niner_digit_count = ( n, base ) -> ( required_digits n, base ) - 1
Why not VarInts, LEB128?
Using https://github.com/joeltg/big-varint, we can easily see that, with Signed BigInt VarInts, negative
numbers get interspersed as odd byte values in between the positive integers, which get represented as even
byte values such that, for the numbers -3 to +3, you get the single-byte encodings [ 5, 3, 1, 0, 2, 4,
6, ], in that order. This entails that encoded values will be sorted according to their absolute value, not
their signed value, with each negative number (like -3) directly preceding its positive counterpart
(+3). This runs counter the properties we need for Vectorial Indexes.
Other than that, VarInt is also an explicitly binary encoding in the sense that it uses sequences of bytes with intentional disregard for how those bytes could be turned into manageable chunks of printable text. That's great for some situations, but if you want to have textual (a.k.a. 'human-readable') input and output, there's always some extra encoding-decoding step necessary in addition to the necessity to encode and decode between index arrays and their sortable version.
{ encode,
decode, } = ( require 'big-varint' ).signed
#...........................................................................................................
numbers = [ -3 .. +3 ]
varints = ( encode ( BigInt n ) for n in numbers )
#...........................................................................................................
numbers.sort ( a, b ) ->
return +1 if a > b
return -1 if a < b
return 0
#...........................................................................................................
varints.sort ( a, b ) ->
return +1 if a[ 0 ] > b[ 0 ]
return -1 if a[ 0 ] < b[ 0 ]
return 0| numbers | varints |
| ------: | ------: |
| -3 | 0n |
| -2 | -1n |
| -1 | 1n |
| 0 | -2n |
| +1 | 2n |
| +2 | -3n |
| +3 | 3n |Using LEB128 is similarly unsuited; sorting the LEB128 encodings of the BigInt sequence [ -127n, -3,
-2, -1, 0, 1, 2, 3, 127n ] yields [ 0n, 1n, 2n, 3n, -3n, -2n, -1n, -127n, 127n, ], which is not conducive
to VDX sorting as described here.
Configuration
[—]validate that_max_digits_per_idxand_max_integer, when both set, don't contradict each other[—]devise terminology to distinguish uniliteral for zero (constants_10mvp2:N) and digit for zero (constants_10mvp2:0, intermittingly call_digit_zero)[—]digit nine ('novenary digit'): 'nova'[—]digit zero: 'nought'[—]uniliteral zero: 'cipher'[—]add settings to retrieve both
[—]_max_idx_widthis derived asmax_idx_digits + 1;dimensionis the (maximum) count of indexes per vector(ial index);_sortkey_widthis number of code units (i.e. string length, not number of characters) of all sortkeys produced byHollerith::encode_vdx(), derived as_max_idx_width * dimension[—]Hollerith::encode()resolves toencode_idx()orencode_vdx(), depending on whether input is number or list[—]Hollerith::encode_idx()encodes a single integer between inclusivecfg._min_integerandcfg._max_integerwithout zero-padding; the maximum length of the returned value is given by_max_idx_width(one uniliteral / magnifier plusmax_idx_digits)[—]Hollerith::encode_vdx()[—]depending on use case, there's a case to be made for_idx_widthboth to exclude and to include the magnifier, soconstants_10mvp2._idx_widthis3(because its longest representable numbers are-999and+999) and4(because its longest representable numbers are written out asA000andZ999, respectively).
Other
[—]support codepoints beyondU+ffff; this needs re-writing string index access to frozen array index access[—]incompile_sortkey_lexer, fix derivation ofnuns_letters,puns_letters,nmag_letters,pmag_letters,zero_lettersto work for non-BMP codepoints[—]inHollerith_typespace.magnifiers,nmag_bare_reversedandpmag_bareare derived asx.split @[CFG].blankwhere a RegEx compiled from@[CFG].blank(roughly asnew RegExp "(?:#{RegExp.escape @[CFG].blank})+", 'v') should be used[—]remove restriction that negative and positive magnifiers must have same length[—]replaceHollerith_typespace.incremental_text(),Hollerith_typespace.decremental_text()withHollerith_typespace.incremental(),Hollerith_typespace.decremental()that- accept texts (to be turned into lists) and lists-of-characters
- skip over
Hollerith_typespace[CFG].blanks
[—]avoid padding and reversing of character lists, rather, use appropriate indexes[—]avoid having to declare magnifiers that are entirely covered by uniliterals[—]re-implementunstable-anybase-brics.coffee#encode(),unstable-anybase-brics.coffee#decode()to avoid building intermediate values[—]re-nameHollerith_typespace[CFG].alphabettodigits, which freesalphabetto signify the complete, ordered,blank-separated list of characters used to define a given Hollerith number format[—](-> Bric-A-Brac) implement a character analyzer that, given a string (or list, or set) of characters, returns a set of (RegEx-compatible) Unicode attributes that is common to all the characters in the input[—]implement a way to declare alphabets with Unicode ranges. A range is declared as[a-z], with a lower and a higher single codepoint embedded in between[,-,](there's no clash with-being used literally in an alphabet because-(U+002d) comes before[(U+005b), and the only character between[and](U+005d) is\\(U+005c), thusXYZ[\\-]]^_can only be taken to meanXYZ\\]^_,XYZ[[-]]^_can only be taken to meanXYZ[\\]^_, and so on).[—]implement a way to declare RegExes...[—]... that should apply to all declared Hollerith letters and cause errors where not matched[—]... that cause non-conformant letters from the declared sets are silently skipped (such that a declaration of'[\x00-\x7f]'in conjunction with{ skip_unless: /\p{L}/v, }) gives the set of US 7bit ASCII letters, how many there may ever be
[—]since integers beyondNumber.MAX_SAFE_INTEGERcan not be reliably used to count in steps of one, both the inclusion of_max_integerand_min_integerin a Hollerith numbering alphabet and the derivation of_max_integerand_min_integerfrom the base and the declared magnifiers must take this limit into account[—]Allow composite VDXs, especially allow high-capacity encoder for first index, low-capacity encoder for trailing digits (as those are used for revisions)- a typical setup would have a cardinals-only codec allowing big positive line numbers in the first position; since cardinals-only codecs start at zero, insertions before the first line are still possible, line number zero being used for anything going to the top of a document
Is Done
[+]why are there two zeroes? necessary?[+]clarify, rename_max_digits_per_idx->max_idx_digits[+]regularize settings (CFG) keys:- keys that start with a letter are user-settable
- keys that start with an
_underscore are treated as implementation details; unless mentioned in the documentation, setting them has undefined effects (normally they will just be overwritten) - so are keys starting with an
_underscore, which are considered 'pro' or 'advanced' level ('know what you're doing') - keys that start with a
$dollar sign are not to be set by users but are to be derived or otherwise set by the app; when they are encountered while compiling the CFG object, they cause an error
Don't
[🚫]implement a way to declare alphabets with Unicode ranges as ina..zor[a..z], e.g.{ digits: 'A..DF..Z', }is equivalent to uppercase ASCIIAthruZwithEexcluded (this still leaves room for doubt how to interpret&...5: is it/[[&-.]5]/vi.e."&'()*+,-.5"or is it/[&[.-5]]/vi.e."&./012345"?)
