# InterLex
## Grammar API
- `Grammar::scan: ( source ) ->`: iterate over all lexemes that result from matching tokens against the `source` text.
- `Grammar::scan_to_list: ( source ) ->`: same as `Grammar::scan()` but returns a list of lexemes instead of a generator.
- `Grammar::scan_first: ( source ) ->`: returns the first user-level lexeme yielded by `Grammar::scan source`. Handy for testing token matchers. **Note**: always runs the entire scan to ensure that any state is what it would be with `scan()` and `scan_to_list()`, but returns only the first user-level lexeme.
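A minimal usage sketch (in CoffeeScript), assuming a `grammar` object whose levels and tokens have already been declared; the construction of the grammar is not shown here:

```coffee
# Assume `grammar` is an InterLex Grammar with suitable levels and tokens:
for lexeme from grammar.scan 'some source text'
  console.log lexeme.name, lexeme.hit, lexeme.start, lexeme.stop

lexemes = grammar.scan_to_list 'some source text'  # same lexemes, as a list
first   = grammar.scan_first   'some source text'  # first user-level lexeme only
```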
## Token Declarations
- `name` and `fit` arguments may be given as the first two positional arguments to `Level::new_token()`; `name` or `name, fit` may be followed by an object `cfg` with further settings; positional `name`, `fit` will silently shadow settings in `cfg` (the `cfg` object will not be changed in any way).
- `name` (`null`): must be a string giving the name of the token. Names must be unique to the level.
- `level` (set by `Level.new_token()`):
- `grammar` (set by `Level.new_token()`):
- `fit` (`null`): what to match at the current position of the source; see Token Matchers.
- `jump` (`null`): which level to jump to when the token matches; see Token Jumps.
- `emit` (`true`): when set to `false`, lexemes produced by the token will not be emitted; this is intended mainly for zero-length-matching jump tokens that don't consume any text and are only there to start or finish a given level, something that will also be indicated by jump signals. Tokens that are declared with `{ emit: false, }` are called 'ghost' tokens.
- `merge` (`false`):
  - When set to `true`, contiguous lexemes resulting from this token will be merged into a single one. Simplest example: a token declared as `{ name: 'number', fit: /[0-9]/, }` will match single Arabic digits, so `Grammar::scan_first '953'` will return a lexeme `{ name: 'number', hit: '9', }`. With `{ ..., merge: true, }`, though, the same `Grammar::scan_first '953'` will now return a lexeme `{ name: 'number', hit: '953', }` because the contiguous stretch of digits is merged into a single lexeme (see the sketch after this list).
  - Other than `true` or `false`, one can also explicitly declare the data merge strategy to be used:
    - When set to `'list'`, all values in the `data` property of each merged lexeme will be turned into a list of values; this is also the default strategy used for `merge: true`.
    - When set to `'assign'`, `Object.assign()` will be used to set key / value pairs from all the merged lexemes in the order of their appearance; this should be faster than `merge: 'list'` but also means that only the last occurrence of each named data item will be preserved in the merged lexeme.
    - When `merge` is set to a function, that function will be called with an object `{ merged, lexemes, }` where `merged` is a clone (independent copy) of the first matched lexeme and `lexemes` is a list of all matched lexemes; inside the function, one can manipulate `merged.data` to one's liking. Observe that `merged.hit` will already be set to the concatenation of all lexeme `hit`s and `merged.stop` has been set to the last match's `stop` property.
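A hedged sketch of these declaration forms; the `level` object and its construction are assumed, and the token names are illustrative:

```coffee
# Assume `level` is an InterLex Level instance.
# Positional name and fit, followed by a cfg object with further settings:
level.new_token 'number', /[0-9]/, { merge: true, }
# With merge: true, Grammar::scan_first '953' yields a single lexeme
# { name: 'number', hit: '953', } instead of one lexeme per digit.

# merge may also be a function that controls how the merged lexeme's data
# is put together; merged.hit and merged.stop are already set at this point:
merge_letters = ({ merged, lexemes, }) ->
  merged.data.count = lexemes.length
  return null
level.new_token 'letter', /[a-z]/, { merge: merge_letters, }
```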
## Token Matchers
Token matchers—defined using the `fit` key in token declarations—should preferably be constructed using the InterLex `rx''` regex tag function, but one can also use plain JS RegExp literals. All regular expressions will be silently 'upgraded' by removing illegal and adding necessary flags, so for example declaring `{ fit: /A[a-z]+/, }` will actually result in the regular expression `/A[a-z]+/dvy` being used.
Note that in order to use some features of the 3rd gen Unicode support afforded by the `v` flag, one will need to explicitly set that flag on regex literals to make the JS parser accept the JS source file; for many use cases just using a 'naked' regex literal should be fine—although those who do so will miss out on the many finer points of using slevithan/regex.
### Zero-Length Matches
- Zero-length matches are also called 'empty matches'.
- They are only rejected when they actually happen to occur while scanning a text, because it is not feasible (at this point at least) to recognize all possible empty matches by static analysis of regular expressions.
- Empty matches are allowed only when the respective token has a declared jump; in that scenario, empty matches are typically the result of a `fit` with a regex lookahead. For example, one could declare a token `{ name: 'before_digits', fit: /(?=[0-9])/, jump: 'number', }` that instructs the lexer to jump to level `number` right before a digit `[0-9]`; in that other level `number`, another token declared as `{ name: 'digits', fit: /[0-9]+/i, }` can then pick up right at the same spot. Without the empty match and lookahead, one would inevitably wind up with the stretch of digits split up between two lexemes that might even belong to two different levels (see the sketch below).
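A hedged sketch of the two-level setup just described; how the `gnd` and `number` levels are obtained is assumed, and the `after_digits` jump-back token is a hypothetical addition:

```coffee
# Assume `gnd` and `number` are two Levels of the same Grammar.
# Zero-length jump token: consumes nothing, only looks ahead, then jumps:
gnd.new_token 'before_digits', /(?=[0-9])/, { jump: 'number', }
# In the target level, an ordinary token consumes the whole run of digits:
number.new_token 'digits', /[0-9]+/
# Hypothetical zero-length jump back once no more digits follow:
number.new_token 'after_digits', /(?![0-9])/, { jump: '..', }
```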
## Using the `interlex.rx''` Regex Tag Function
The InterLex `rx''` regex tag function is based on the `regex''` tag function from slevithan/regex. It is intended to be used mainly to define a `fit` for InterLex Tokens; it converts a string into a regular expression object, written `rx"[abc]+"` in CoffeeScript or `` rx`[abc]+` `` in JavaScript. Either of the above results in the regular expression `/[abc]+/dvy` with the `d`, `v` and `y` flags implicitly set, for which see below.
In contradistinction to the original `regex''` tag function provided by slevithan/regex, the `rx''` tag function offers the capability to set additional flags by using JS dotted accessor syntax, which is a fancy way to say that when e.g. you have `fit: rx"[abc]"` to match any of the letters 'a', 'b', 'c' in a given source, then in order to set the case-insensitivity flag `i` you can write `rx.i"[abc]"` to match any of 'a', 'b', 'c', 'A', 'B', or 'C'.
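A short hedged illustration of the tag function and the dotted flag syntax; the exact import form is an assumption:

```coffee
# Assumption: rx is exported by the interlex package under this name.
{ rx } = require 'interlex'

fit_a = rx"[abc]+"    # results in /[abc]+/dvy (d, v, y implicitly set)
fit_b = rx.i"[abc]+"  # dotted accessor adds the i flag
```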
### Producing a Regex Tag Function with `new_regex_tag()`
Using `rx2 = new_regex_tag flags`, it's possible to produce a new regex tag function with a customized set of JS Regular Expression flags.
The default regex tag function exported by interlex, `rx''`, inherits extended support for flags and features some additional implicitly-set flags:
- The `x` flag is always implicitly set; it (quote) "makes whitespace insignificant and adds support for line comments (starting with `#`), allowing you to freely format your regexes for readability"; this flag is emulated by slevithan/regex.
- The `v` flag is always implicitly set; it enables advanced and improved Unicode support and overcomes some difficulties with the older `u` flag.
- The `u` flag is always implicitly excluded by the implicit, always-on presence of `v`.
- The `d` flag is always implicitly set; it enables indices for matched groups.
- The `y` flag is always implicitly set; it enables 'sticky' behavior, which is essential for a lexer.
- This leaves the user to set one or more of the following flags: `g`, `i`, `m`, `s`.
- Repeated flags are ignored, so `rx2 = new_regex_tag 'dvy'` just gives a new tag function whose behavior is identical to the default tag function, `rx''`, and, say, `new_regex_tag 'iiiii'` is no different from `new_regex_tag 'i'`. There's no way to unset a flag.
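A hedged sketch of producing and using a customized tag function as described above; the import form is an assumption:

```coffee
# Assumption: new_regex_tag is exported by the interlex package under this name.
{ new_regex_tag } = require 'interlex'

rx_i  = new_regex_tag 'i'    # like rx'' but with the i flag always set
fit   = rx_i"[abc]+"         # case-insensitive version of rx"[abc]+"
rx_eq = new_regex_tag 'dvy'  # repeats default flags only: identical to rx''
```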
## Token Jumps
- Jumps are declared with the property name `jump` and a so-called 'jump spec', which is a string that contains (1) either (a) another level's name, or (b) the symbolic string `..` to indicate 'jump back from the current level, return to the previous level'; and (2) an optional 'carry mark'.
- Jumps can be either 'staying' or 'carrying', depending on whether the lexeme produced from the token will 'stay' in the token's level or be 'carried' along to the new level (the jump's target). Non-carrying jumps have no explicit marker; carrying jumps are indicated by an `!` exclamation mark behind the jump target, so for example a token declaration in level `gnd` that contains `{ jump: 'otherlevel!', }` indicates that the resulting lexeme's `level` will be `otherlevel`, not `gnd` (see the sketch below).
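A hedged sketch of staying vs. carrying jumps, using illustrative level and token names; the level objects and their construction are assumed:

```coffee
# Assume `gnd` and `string` are Levels of the same Grammar.
# Staying jump: the quote lexeme stays in level gnd, scanning continues in string:
gnd.new_token 'string_start', /"/, { jump: 'string', }
# Carrying jump: the quote lexeme itself is reported as belonging to string:
# gnd.new_token 'string_start', /"/, { jump: 'string!', }
# Jump back to the previous level:
string.new_token 'string_stop', /"/, { jump: '..', }
```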
## To Be Written
### Overview
- A `Grammar` has one or more `Level`s.
- A `Level` has one or more `Token`s.
- A `Token` defines a `fit` and can jump into a level or back.
- `Lexeme`s are produced by a `Token` when `Token::fit` matches the source text at the current position.
- `Token`s have `fit`s, `Lexeme`s have `hit`s; `fit`s produce `hit`s.
- `fit`s are also called 'matchers', `hit`s are also called 'matches'.
### Five Scanner Constraints
All scans must be exhaustive, compact, contiguous, bijective and monotonous, meaning:
- **Exhaustive** (a.k.a. "no leftovers"): each position in a source from the first to the last codepoint must be covered by some lexeme; more specifically—because dense coverage already falls out from the requirements of compactness and contiguity—the first (non-signal) lexeme must cover the source position with index `0`, and the last (non-signal) lexeme must cover the source position with index `source.length - 1` (a.k.a. `source.at -1`).
- **Compact** (a.k.a. "no gaps"): excluding the first and the last lexeme in a given scan, for each lexeme `kₙ` there must be a directly preceding lexeme `kₙ₋₁` that covers the codepoint at `kₙ.start - 1` and another directly following lexeme `kₙ₊₁` that covers the codepoint at `kₙ.stop`.
- **Contiguous** (a.k.a. "no scatter"): each lexeme represents a single part (called the `Lexeme::hit`) of the source; it's not possible for a single lexeme to cover one part of the source here and another, separate part of the source over there.
- **Bijective** (a.k.a. "one on one"): after an errorless scan, each codepoint of the source will be associated with exactly one lexeme; no codepoint can belong to two or more lexemes at any time (although when using zero-length matchers, the spaces between codepoints that come right before and after a given lexeme may belong to other lexemes).
- **Monotonous** (a.k.a. "no going back"):
  - The `start` positions of all lexemes will be emitted in a monotonously increasing order, meaning that for all consecutive lexemes `kₙ`, `kₙ₊₁` it must hold that `kₙ.start < kₙ₊₁.start`.
  - This condition is loosened to `kₙ.start ≤ kₙ₊₁.start` for tokens that declare a jump to another level; these, and only these, tokens are allowed to match the empty string. This means jump tokens can be formulated as regex lookaheads; non-jump tokens can of course do lookaheads, too, but they must also advance the current position of the scan.
  - From the above (and under the assumption that neither the tokens nor the source will be modified during a scan, except for state) there follows a "no two visits" corollary: for any given source text, each level of a correctly constructed lexer will only be asked to match a given position in the source at most one time—otherwise the lexer would return the same sequence of lexemes over and over again without ever stopping.
Together these principles impose strong constraints on what a well-formed scan can be and lead to some practically interesting invariants; for example, the concatenation of all `Lexeme::hit`s from all lexemes (and signals) resulting from a scan of a given source is equal to the source, i.e. `source == [ ( hit for hit from Grammar.scan source )..., ].join ''`, as spelled out below.
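A hedged spelling-out of that invariant as a runtime check; the construction of `grammar` and a suitable `source` string are assumed:

```coffee
# Assume `grammar` is a Grammar instance and `source` a string:
hits = ( lexeme.hit for lexeme from grammar.scan source )
throw new Error 'scan not exhaustive' unless hits.join( '' ) is source
```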
The grammar is required to (and will) emit error signals in all situations where any of the above constraints is violated.
### Signals: Implementation Note
Signals are just lexemes emitted by the scanner (i.e. the grammar). Internally they are formed the same way that user lexemes are formed (namely from a token, a source, and a position); they have the same fields as user lexemes, and can for many purposes run through the same processing pipeline as user lexemes.
### Errors Always Emitted
Even with `Grammar::cfg.emit_signals` set to `false`, `Grammar::scan()` will still emit error signals; only start, stop and jump signals will be suppressed.
### Do Not Use Star Quantifiers
Do not use the star quantifier (`*`) in regexes unless you know what you're doing; e.g. `/[a-z]*/` will match (with a zero-length hit) even without there being any ASCII letters.
### Always Unicode, Except
- Since all regex matchers (regex `fit`s) have the `v` flag set, `fit: /./` will always match Unicode codepoints (not Unicode code units);
- however, JavaScript string indices continue to index Unicode code units (not Unicode codepoints);
- hence, `Lexeme::start`, `::stop`, `::length` and `::pos` all index the source in terms of Unicode code units, although the matches (`hit`s) are constructed by looking at Unicode codepoints.
- Yes, this is annoying.
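A short illustration (in CoffeeScript) of the code unit / codepoint distinction referred to above; this is plain JS string and regex behavior, not InterLex-specific:

```coffee
# '🈯' is a single Unicode codepoint but two UTF-16 code units:
console.log 'a🈯z'.length                                     # 4 (code units)
console.log Array.from( 'a🈯z' ).length                       # 3 (codepoints)
# With the v flag, `.` matches a whole codepoint (two code units here):
console.log ( new RegExp '.', 'v' ).exec( '🈯' )[ 0 ].length  # 2
```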
### Errors as Exceptions or Signals
- Errors are either thrown as JS exceptions or emitted as error signals.
- Error signals are defined in a system level called `$error`.
- Implemented:
  - `$error.loop`, emitted when a grammar tries to scan a portion of a source more than once;
  - `$error.earlystop`, emitted when scanning has stopped but the end of the source has not been reached;
  - `Grammar.cfg.loop_errors: {'emit'|'throw'}` controls whether to emit `$error.loop` or throw an exception.
### Lexeme Status Properties
- `Lexeme::is_system`: true for lexemes in levels `$signal`, `$error`
- `Lexeme::is_error`: true for lexemes in level `$error`
- `Lexeme::is_signal`: true for lexemes in level `$signal`
- `Lexeme::is_user`: true for lexemes that are not system lexemes
### Data Absorption, Data Reset
### Continuation: Linking, Linewise Scanning and EOL Suppletion
- There are two scanning modes, 'non-linking' (the default) and 'linking', the latter indicated by `Grammar.cfg.linking: true`. We therefore speak of a '(non-)linking grammar' and '(un)linked scanning'.
- In both modes, the first call to `Grammar::scan source_1` will cause a `$signal.start` signal to be issued. When scanning of `source_1` is complete, a `$signal.stop` is emitted when the grammar is non-linking, and a `$signal.pause` in the case of a linking grammar.
- When subsequently a new `source_2` text is scanned with `Grammar::scan source_2`, non-linking grammars will issue `$signal.start`, but linking grammars will issue `$signal.resume`.
- The major difference between the two modes is:
  - Non-linking grammars will always start at the configured start level, which remains constant.
  - Linking grammars will pick up at the level that the previous call to `scan()` left them with, if any, and fall back to the configured start level otherwise.
  - Linking grammars will not issue `$signal.stop` until they're explicitly asked to; the easiest way to do that is to call `Grammar::scan null` with a 'null source' (see the sketch after this list).
- Therefore, in linking grammars, calling `g.scan source_1; g.scan source_2` with two separate sources is (almost) the same as calling `g.scan source_1 + source_2` once with the concatenated sources (except that line numbers and start/stop indices will differ; also, it's of course not unusual to have lexers that react differently to `'abc'` and `'def'` in isolation than to their concatenation `'abcdef'`).
- A detail to keep in mind when working with `jump` signals is that generally no jumps will be signaled between two scans by linking grammars; non-linking grammars will emit a `{ fqname: '$signal.jump', data: { target: null, }, }` signal near the end and a `{ fqname: '$signal.jump', data: { target: 'A', }, }` signal near the start of a new scan (assuming `A` is the grammar's start level).
- `Grammar::cfg.supply_eol` (default: `false`) can be set to a string which will be 'supplied' to the `source` before the scan starts. When `supply_eol` is set to a string, that string will be appended to each source passed into `Grammar::scan()`. One can set `supply_eol` to `true` for Linux-style line endings (a `'\n'`).
  - This means the source the user passes to `Grammar::scan()` and the `source` property of lexemes may differ by whatever `supply_eol` is set to.
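A hedged sketch of linked, chunk-by-chunk scanning as described in this list; the construction of the linking grammar `g` is assumed:

```coffee
# Assume `g` is a linking grammar (Grammar.cfg.linking: true) whose tokens
# are declared such that they may match across chunk boundaries:
lexemes = []
for chunk in [ 'first li', 'ne\nsecond line\n', ]
  lexemes.push lexeme for lexeme from g.scan chunk
# Explicitly finish: scanning a null source makes the grammar emit $signal.stop:
lexemes.push lexeme for lexeme from g.scan null
```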
> **Note** EOL suppletion sadly also implies that when a file with different or mixed line endings is scanned for lines and `U+000a` is then supplied where missing, the counts the grammar keeps may differ from the counts one would have to use when accessing the file through low-level file APIs such as NodeJS `node:fs` `read()`—but, then again, the `position` argument required by those represents bytes, while the positions that InterLex works with represent Unicode (UTF-16) code units, which are a very different beast already.

In case the above turns out to be a real-world concern, one can always opt to leave `Grammar::cfg.supply_eol` set to the empty string and supply lines with the actual line ending characters preserved; the token matchers would then also have to be adjusted to deal with different line endings, as the case may be. But, all told, one can then be reasonably sure that the counts derived by piecing together all the `hit`s up to a given point (and applying e.g. NodeJS `Buffer.byteLength()`) will in fact represent valid, matching offsets into the file of origin.

When scanning in continuous mode, lexemes that may cross line boundaries must, when suppletion is used, be explicitly formulated so as to accept those line boundaries; an alternative to this is to not supply synthetic line endings but
- insert appropriate characters between scans, or
- use a cast generator to issue e.g. a lexeme representing some whitespace when EOL is encountered.
### Lexeme Casting
- One can declare a `cast` property on the token, the level, or the grammar; they will be checked in this order.
- `cast`—below called 'the cast method'—must be either a normal function or a generator function (i.e. a function that contains `yield`).
- Only the first `cast` property out of `token.cast`, `level.cast` or `grammar.cast` will be used.
- The cast method will be called with a lexeme proxy that reflects all the properties of the current lexeme, plus a property `lexeme` that returns the current lexeme instance itself. This sounds complicated but enables users to write something along the lines of `cast: ({ hit, source, lexeme, }) -> ...`, so it's possible both to conveniently destructure the properties of the current lexeme and to get hold of the current lexeme itself—the latter being handy when one wants to emit additional lexemes next to the current one.
- When the cast method is a normal function (without `yield`), its return value will be silently discarded; all one can do is modify properties. Observe that the way JavaScript's by-value / by-reference semantics work, re-assigning a destructured property will (of course) not modify that property on the lexeme; for that you'll need the actual lexeme to do e.g. `lexeme.hit = 'occupied!'` (not wise but works). One can however modify destructured mutable properties, and this will commonly be the `data` dictionary.
- There's no way from within a non-generator cast method to prevent a lexeme from being issued or to issue additional lexemes; for that one needs to use a generator cast method.
- The call signature of a generator cast method is the same as that of a non-generator cast method. Generator cast methods are more flexible because they allow one to `yield` additional lexemes and to choose not to issue the current lexeme as seen fit. For example, this generator cast method:

  ```coffee
  cast = ({ hit, start, source, new_lexeme, lexeme, }) ->
    unless hit is 'c'
      yield lexeme
      return null
    yield new_lexeme 'error.nolikedis', start, source, { letter: hit, }
    return null
  ```

  will issue a new `nolikedis` lexeme defined on the grammar's user-defined `error` level when the `hit` of the current lexeme happens to be a lowercase letter `c`; otherwise, it `yield`s the current lexeme untouched.
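For comparison, a hedged sketch of a non-generator cast method that only massages the `data` dictionary; the token declaration follows the style shown under Token Declarations, and it is assumed here that named capture groups end up in the lexeme's `data`:

```coffee
# Convert the captured digits to a number in place; the return value of a
# non-generator cast method is discarded, so mutation is the only effect:
cast_integer = ({ data, }) ->
  data.value = parseInt data.value, 10
  return null
level.new_token 'integer', /(?<value>[0-9]+)/, { cast: cast_integer, }
```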
## To Do
- [—] can we replace `Level::new_token()` with a shorter API name?
- [—] allow to declare tokens when calling `Level::new_token()`
- [—] include indices for groups:

  ```coffee
  match = 'a🈯z'.match /^(?<head>[a-z]+)(?<other>[^a-z]+)(?<tail>[a-z]+)/d
  { match.groups..., }
  { match.indices.groups..., }
  ```

- [—] rename result of `new_regex_tag` to reflect use of flags
- [—] allow for `token.fit`:
  - [—] functions?
    - must accept `( start, text, { token, level, grammar, } )`
    - must return `null` or a lexeme
  - [+] strings
- [—] allow different metrics (code units, code points, graphemes) to determine `lexeme.length` and which lexeme gets returned for `Level::match_longest_at()`; compare:

  ```coffee
  help 'Ωilxt_416', Array.from 'a🈯z'
  help 'Ωilxt_417', 'a🈯z'.split /(.)/u
  help 'Ωilxt_418', 'a🈯z'.split( /(.)/v )
  help 'Ωilxt_419', 'a🈯z'.split( /(.)/d )
  help 'Ωilxt_420', match = 'a🈯z'.match /^(?<head>[a-z]+)(?<other>[^a-z]+)(?<tail>[a-z]+)/d
  help 'Ωilxt_421', { match.groups..., }
  help 'Ωilxt_422', { match.indices.groups..., }
  # help 'Ωilxt_423', rx"."
  # help 'Ωilxt_424', rx/./
  ```

- [—] implement API to test whether lexing has finished
- [—] write tests to ensure all of the Five Scanner Constraints hold:
  - [+] exhaustiveness
  - [—] compactness
  - [—] contiguity
  - [—] bijection
  - [—] monotony
  - [+] loop detection
- [—] the above are ?better named 'scanning' or 'lexeme' constraints
- [—] based on the Five Scanner Constraints, can we set an upper limit to the number of steps necessary to scan a given source with a known (upper limit of the) number of codepoints? Does it otherwise fall out from the implemented algorithm that we can never enter an infinite loop when scanning?
- [—] implement proper type handling with ClearType
- [—] unify handling of `cfg`; should it always / never become a property of the instance?
- [–] should `Level`s and `Token`s get props `is_error`, `is_signal`, `is_system`, `is_user`?
- [—] implement a `select()` method somewhere that formalizes matching against lexemes
- [—] implement API to check whether errors occurred
- [—] implement `lexeme.terminate` to cause scanning to stop
- [—] when a token declares `emit: false` but matches some material, there will be holes in the scan that violate parts of the Five Scanner Constraints; how to deal with that?
- [—] reconsider zero-length matches: when a token has matched the empty string, accept that token and—if the token has no jump—try the next token on the same level, if any; stop with the current level at the first token that either matches material, or has a jump, or both. This should become the new default strategy (next to `first` and `longest`)
- [—] implement: `Grammar::cfg.before_scan`, `Grammar::cfg.after_scan`
- [—] test, fix `reset_errors`
- [—] consider to rename 'levels' -> 'tracks', 'jumps' -> 'switches'
- [—] `Grammar::cfg.absorb_data: false`; when set to `true`, copies all `lexeme.data`
- [—] should we use a `Map` instead of a POD for `data`?
- [—] allow empty matches for empty inputs
- [—] implement `merge: 'text'` / `merge: 'join'` to concatenate all lexeme data property values
- [—] the `$error` name is fixed, but provide a setting to recognize error lexemes, default being a match of `/^error(_.*)?|(.*_)?error$/` (in fact `regex"^(?>error(?>_.*)?|(?>.*_)?error)$"` using atomic groups) against the token or the level name
## To Do: The Concept of "Coarse Parsing"
- [—] https://github.com/oils-for-unix/oils.vim/blob/main/doc/algorithms.md calls what InterLex does "Coarse Parsing":
  > - Editors like Vim and TextMate use the model of regexes + a context stack.
  > - With this model, we can recognize YSH "lexer modes", and produce an accurate highlighter. Coarse does not mean inaccurate!
- additional comments by the same author on lobste.rs:
  > andyc, 6 days ago
  >
  > I’m working on in this style, and calling it “coarse lexing” and “coarse parsing”!
  >
  > You can also call it “segmentation”, since it solves the “composed languages” problem, which Tree-sitter has issues with
  >
  > The idea is:
  >
  > - Recognize string literals and comments first. Discard them
  > - Now you know that `()[]{}` are real code delimiters, not strings or comments.
  > - And you can have different languages/parsing algorithms within the delimiters, e.g. grammar-based or operator precedence.
## To Do: Levels must obey Topological Ordering Constraints
- [—] disallow forward jumps to a 'lower' level, i.o.w. enforce topological order of levels: no circuits, just jump-to and jump-back. One reason for this: when you write a grammar with several levels and you jump from level A to level B and then at some point you again jump forward from B to A, it's almost certainly an error, and jumping back from B was the intention.
  - we could implement levels as being nested inside other levels to form a tree, so a level would be a list of 'token_or_sublevel' elements; however, this has the disadvantage that levels could no longer be recycled to be used from different levels (such as wanting to recognize an integer literal both from a double- and a single-quote-string level). It would also violate the "flat is better than nested" principle.
  - we still want to check that no circular situations occur; in loose terms, it's OK to have both `in A: gosub C; in C: return;` and `in B: gosub C; in C: return;` (go from different levels to a third one which then returns to the caller); it's not OK to have `in A: goto C; in C: goto A`, directly or indirectly
  - observe that when we use `gosub` and `return` as analogies, what we really use, namely `jump: C` (jump to `C`) and `jump: '..'` (jump back), differs in that when you jump back, scanning will continue with the first token on the previous level, not the previously used token. This is different from `a; gosub C; b;` where you would expect to continue with `b` (not `a`) when `C` returns.
  - it's OK to have unconnected / isolated levels (which may or may not be used by calls to `new_lexeme` inside of `cast` methods)
  - so we have both re-use of sublevels and isolated levels, which are additional reasons to not view levels as a tree structure.
  - rather, we view levels as akin to a flat list of functions that may call each other ('gosub'); as with functions, it's not allowed to 'goto' some point in the middle of a level; additionally, we forbid direct and indirect recursion (implied by topological ordering plus isolates).
## To Do: No Arbitrary Chunking
We make no promise that a grammar will yield a collection of lexemes that remain identical across arbitrary chunkings of the source (save for their positions).
- For one thing, we have no way of making, say, a regex that matches some kind of numerical character reference (NCR) like `&#x1311f;` (𓄟), for example `/&#x[0-9a-f]+;/i`, match across calls to `Grammar::scan()`; so when a source like `"the NCR &#x1311f; represents 𓄟 *ms*"` gets inadvertently split into `"the NCR &#x13"` and `"11f; represents 𓄟 *ms*"`, we have no way to re-write `/&#x[0-9a-f]+;/i` programmatically to match first the former half of the NCR, then the latter half of it.
- It is possible to implement recognition of constructs that are impervious to arbitrary chunking, but that has to be done manually (and obviously in that case our NCR recognizer would have to operate on individual code points, as in `'&'`, `'#'`, `'x'`, `/[0-9a-f]+/i`, `';'`).
- [—] Come to think of it: is it promising to implement 'staggering' (or 'sticky'??) levels where, in order to proceed on that level, each step has to match in order of declaration? This would necessitate picking up at the successor of the very token that was considered last in the most recent scan, so in the above example, continue with matching `'x'` when the last scan had recognized `'#'`. In any event, such levels should allow zero matches to account for optional parts. Other name: 'compound', `true` for levels whose tokens step through the (possibly empty) components of a construct.
- To conclude, when writing a grammar one should know which kinds of chunking one wants to allow and be aware of chunkings that are disallowed; in a good lot of cases, a sensible choice will be to scan lines of text and allow some, but most of the time not all, tokens and levels to cross line boundaries. Typically, one will allow block comments and string literals to extend across lines but restrict more 'compact' stuff like the NCRs discussed above to single lines.
## To Do: Reserved Characters, Catchall Tokens
- [—] implement `reserved` characters:
  - [—] allow lexemes to announce 'reserved' / 'forbidden' / 'active' characters (such as `<`, which signals the start of an HTML tag) that can later be used to formulate a fallback pattern to capture otherwise unmatched text portions
  - [—] at any point, allow to construct a pattern that only matches reserved characters and a pattern that matches anything except reserved characters
- [—] modify behavior of catchall and reserved:
  - [—] catchall and reserved are 'declared', not 'added', meaning they will be created implicitly when `_finalize()` is called
  - [—] catchall and reserved always come last (in this order)
- [—] documentation (DRAFT): Reserved and Catchall Lexemes
  - Each lexeme can announce so-called 'reserved' characters or words; these are for now restricted to strings and lists of strings, but could support regexes in the future as well. The idea is to collect those characters and character sequences that are 'triggers' for a given lexeme and, when the mode has been defined, to automatically construct two lexemes that will capture:
    - all the remaining sequences of non-reserved characters; this is called a catchall lexeme (whose default TID is set to `$catchall` unless overridden by a `lxid` setting). The catchall lexeme's function lies in explicitly capturing any part of the input that has not been covered by any other lexeme higher up in the chain of patterns, thereby avoiding a more unhelpful `$error` token that would just say 'no match at position so-and-so' and terminate lexing.
    - all the remaining reserved characters (default TID: `$reserved`); these could conceivably be used to produce a list of fishy parts in the source, and / or to highlight such places in the output, or, if one feels so inclined, to terminate parsing with an error message. For example, when one wants to translate Markdown-like markup syntax to HTML, one could decide that double stars start and end bold type (`<strong>...</strong>`), that a single asterisk used at the start of a line indicates an unordered list item (`<ul>...<li>...</ul>`), and that these characters are considered illegal in any other position except inside code stretches and when escaped with a backslash. Such a mechanism can help to uncover problems with the source text instead of just glancing over dubious markup and 'just doing something', possibly leading to subtle errors.
## Is Done
- [+] implement chunk numbering with CFG settings `{ counter_name: 'line_nr', counter_start: 1, }`
- [+] ensure that no tokens with non-sticky matchers are instantiated
- [+] when using a regex for `token.matcher`, should we (1) update flags or (2) reject regexes without required flags?—See what slevithan/regex does with the `v` flag (throws `Error: Implicit flags v/u/x/n cannot be explicitly added`)
- [+] during matching, ensure that when a lexeme was produced it did consume text
- [+] extend sanity checks for matcher regex:
  - [+] must not have `g`?
  - [+] must have `d`
  - [+] must have `v`, not `u`?
- [+] is it possible and useful to allow regular lexemes that take up zero space, akin to special lexemes (to be written) that indicate start, end, change of level?
- [+] documentation: "empty matches are allowed only as intermediate results (with strategy `longest`) or when the respective token declares a jump". Note that disallowing empty jumps in internal, intermediate results is somewhat misleading because it, too, does not by any means catch all cases where any of the declared matches could have conceivably remained empty. Which is to say we should either confine ourselves to doing runtime sanity checks (something we can certainly do, with ease) and accept that we have no way to discover matchers that could potentially return empty matches; or, else, we implement watertight up-front checks such that only regular expressions that can never yield an empty match are allowable (something we can almost certainly not do, like at all). Conceptually, no matter the acceptance strategy (`first` or `longest`), there are always some matchers that could have matched a zero-length string (which we cannot know for sure unless we use some non-existing static analysis technique) but that were either not applied (because some other token's matcher came earlier) or applied and discarded (because some other token's matcher gave a longer match).
- [+] documentation: jumps can be either 'sticky' or 'carrying' depending on whether the lexeme produced from the respective token will 'stick' to the token's level or be 'carried' along to the new level (the jump's target). Sticky jumps have no explicit marker as they represent the 'default of least surprise' (a lexeme looks almost like the token it was produced by, after all); carrying jumps are indicated by an `!` exclamation mark behind the jump target, so for example a token declaration in level `gnd` that contains `{ jump: 'otherlevel!', }` indicates that the resulting lexeme's `level` will be `otherlevel`, not `gnd`.
- [+] lexemes now contain both `lx.jump.jump_spec` and `lx.jump_spec` (in addition to `lx.token.jump.jump_spec`); it should conceivably only be `lx.jump.spec`
- [+] implement setting to simplify jumps such that any series of jumps starting with `from_level: 'a'` and ending with `to_level: b` without any intervening non-`jump` signals is simplified to a single `jump` from `a` to `b`
- [+] rename `$system` to `$signal`
- [+] aggregate CFG settings such that the resulting version has the final results; e.g. a constellation of `{ emit_signals: false, simplify_jumps: true, }` can be aggregated as `{ emit_signals: false, simplify_jumps: false, }`
- [+] rename `simplify_jumps` -> `merge_jumps`
- [+] implement `Lexeme::pos` property to represent `lnr`, `start`, `stop` as text
- [+] implement lexeme consolidation / simplification merging where all contiguous lexemes with the same `fqname` are aggregated into a single lexeme (ex. `{ name: 'text', matcher: rx.i"\\[0-9]|[a-z\s]+", }` will issue tokens for hits `'R'`, `'\\2'`, `'D'`, `'\\2'` when scanning `'R\\2D\\2'`; simplification will reduce these four lexemes to a single lexeme)
  - [+] clarify how to treat entries in `Lexeme::data` when merging
    - [+] at first implement using `Object.assign()`
    - [+] later maybe allow custom function
- [+] unify `Lexeme::groups`, `Lexeme::data`
- [+] rename `Token::matcher` to `.fit()`?
- [+] option to throw or emit error in case lexing is unfinished
- [+] allow empty matches provided the token defines a jump
- [+] ensure `$signal.stop` is emitted in any case (barring exceptions)
- [+] emit `$signal.error` upon premature end-of-scan
- [+] consider to rename `Grammar::walk_lexemes()` to `Grammar::scan()`
- [+] implement infinite loop prevention in levels; each level keeps track of positions and complains when re-visiting a position (which violates the "no two visits" principle)
  - [+] notify all levels when scanning starts to reset any state
- [+] implement `discardable`, `ghost` tokens, especially for zero-length jumpers? Could use CFG setting `Token::emit`, in line with `Grammar::emit_signals`
- [+] simplify jump signals to `data: { to: Level::name, }` / `data: { target: Level::name, }`
- [+] stop scanning when encountering an error signal
- [+] among the Five Scanner Constraints, monotony can be defined in a stricter way, namely, that you can only enter a level at a given position once; the second time you enter a given level (by moving forwards or backwards), the current position (`lexeme.start`) must be at least `1` greater than your previous entry point
- [+] take error signals out of `$signal` and put them into a new `$error` level? That would entail that instead of, say, `{ fqname: '$signal.error', data: { kind: 'earlystop', ..., } }` we would get `{ fqname: '$error.earlystop', data: { ..., } }`, which is more suitable for processing
- [+] these internal settings / convenience values should be systematized and maybe collected in a common place; some of the regex flag values can and should be derived from more basic settings: `_jsid_re`, `_jump_spec_back`, `_jump_spec_re`, `_regex_flag_lower_re`, `_regex_flag_upper_re`, `_regex_flags_re`, `_default_flags_set`, `_disallowed_flags_set`, `_normalize_new_flags`
- [+] an infinite loop can result from zero-length matches and ensuing forth-and-back jumps, which should be recognized and terminated
- [+] `Lexeme`s could gain properties:
  - [+] `::is_system` for lexemes in levels starting with `$`
  - [+] `::is_error` for lexemes in level `$error`
  - [+] `::is_signal` for lexemes in level `$signal`
  - [+] `::is_user` for lexemes in user-defined levels
- [+] implement option to turn exceptions into error signals:
  - [+] `Grammar.cfg.loop_errors: {'emit'|'throw'}`
  - [+] `Grammar.cfg.earlystop_errors: {'emit'|'throw'}`
- [+] `Grammar::cfg.reset_lnr: false`; when set to `true`, `lnr` will be held constant to the value of `Grammar:cfg.lnr`
- [+] `Grammar::cfg.reset_data: false`; when set to `true`, all enumerable properties of `data` will be removed before each scan and then get re-assigned the enumerable properties of `Grammar::cfg.data`, with the provision that functions are going to be called (in the context of the grammar) and their results used
- [+] rename `Grammar::clear_errors()` -> `Grammar::reset_errors()`
- [+] `Grammar::cfg.reset_errors: false`; when set to `true`, `Grammar::state.errors` will be cleared before each scan
- [+] disallow non-unique token names
- [+] build lexer for EffString specs
- [+] implement callbacks to e.g. cast data items to target values (as in `data: { count: '4', }` -> `data: { count: 4, }`)
- [+] document `cast` setting for `Grammar::`, `Level::`, `Token::`
- [+] allow `cast()` to be either a `function` or a `generator` function; in the latter case, `cast()` is responsible to actually `yield` the current lexeme where intended; it can `yield` any number of additional or replacement lexemes as seen fit, or nothing with `yield return null`.
  - [+] generator functions replace `lexeme.emit()`
- [+] related: can we provide a way for users to issue error signals? create own error tokens?
- [+] if possible, implement a way to signal from a token's `cast` function that the token is invalid or that another token should be chosen; if that were possible, one could use somewhat more generic matchers for stuff like, say, float literals and then simply test whether `parseFloat()` in the `cast` function returns a `float` or `NaN`; in the latter case, one could either
  - reject the current token and continue, from the current position, to match tokens from the same(?) level that come after the current one; or
  - issue an error signal (probably the simpler option).
- [+] `Grammar.cfg.linking`: need setting for `Grammar:cfg:continuation` (either `'staccato'`, `'legato'`, `'lines'`); `continuation: 'lines'` implies setting:
  - [+] `Grammar::cfg.reset_stack` cannot be `true` if the grammar or any level or any token is `sticky`
  - [+] implement 'continuation', i.e. the ability of the lexer to stay on the same level across scans, meaning that when scanning line by line, constructs (such as HTML tags) can extend across line boundaries
  - [+] `supply_eol` to `'\n'`
- [+] move `fqname` formation to token, use API
- [+] allow positional arguments to `Level::new_token()`: `( name, fit, cfg )`
## Don't
- [—] 'wrap' tokenizing so that handling of line splitting, line numbering is isolated yet transparent
- [—] use syntax as before (`[level,..]` vs `level[,]..`)
- [—] implement 'inclusive' ('progressive'), 'exclusive' ('regressive') jumps
  - [—] allow to configure grammar defaults for fore- and backward jumps
- [—] bundle `start`, `stop` and later `lnr` &c under `position`?
- [—] can we put the functionalities of `Grammar::_scan_1b_merge_jumps()` and `Grammar::_scan_3_merge()` into a single method?
- [—] (replaced by option to replace exceptions with error signals) allow the lexer to stop 'silently' when a non-jump token matched the empty string? Add token declaration field `allow_empty_end`? Better name?
- [–] flatten `jump` property?
- [—] what should the `action` of a merged jump be?
- [—] LATER implement:
  - [—] `Grammar::reset: ({ lnr: 1, data: null, }) ->`
    - [+] `reset_lnr: ( lnr = 1 ) ->`
    - [+] `reset_data: ( data = null ) ->`
  - [—] `Grammar::cfg.reset_on_scan.lnr`: `false` or integer
    - [—] default to `false` to skip resetting
    - [—] or an integer
    - [—] use ClearType to implement as type
  - [—] `Grammar::cfg.reset_on_scan.data`: `false` or pod
    - [—] default to `false` to skip resetting
    - [—] or a template object (functions will be called); use `{}` to reset to empty
    - [—] use ClearType to implement as type
- [—] ? allow `cast` to be an object whose keys are functions that will be applied to properties of `Lexeme::data`; ex.: `{ fit: /(?<num>[0-9]+):(?<den>[0-9]+)/, cast: { num: parseInt, den: parseInt, }, }`
- [—] consider to use terms 'legato', 'staccato'
- [—] better name than `sticky` for tokens that may extend across `scan()` calls?
- [—] implement:
  - [—] `Grammar::cfg.sticky`
  - [—] `Level::cfg.sticky`
  - [—] `Token::cfg.sticky` may be needed to distinguish 're-scan from first token in previous level' from 're-scan from this token in previous level'; the problem in the latter case is do we want it at all, and would we still have to try the other tokens in that previous level even when they come before the sticky one? Then the implementation could 'rotate' the tokens so the sticky one comes first, then the ones behind it, then the ones before it, until the level is exhausted; the implementation of `Level` should use `Symbol.iterator` to implement that. But the question remains whether it's necessary or at least advantageous to have several entry points to a level, and how to judge which tokens should be sticky and which ones non-sticky (maybe the level should be sticky)
- [—] (replaced by `$signal.pause`, `$signal.resume`, `Grammar::scan null`) legato scanning mode necessitates issuing an 'end of text' signal that is either of a higher order than the existing `$signal.stop` or replaces it
  - as long as `Grammar::scan()` is called individually for each chunk of text, that can only be triggered explicitly by the user; an alternative would be a new method `Grammar::scan_all()` (`scan_lines()`, `scan_chunks()`) that exhausts an iterator over chunks of source
- [—] implement an `offset` (other names: `delta`, `dx`; `cud` for Code Unit Delta; `dcu` for Delta Code Units) that must be added to `start` to get the index position in the concatenated sources, call it `abspos := offset + start`
- [—] should `Lexeme::pos` represent
  - only `abspos`, replacing local `start`, or
  - `offset` separately, or
  - only the local position (no change required)?
