@millennial-sky/lex-with-regexps

v0.1.3

Published

3 years ago

A small library the provides a simple API to tokenize strings using regular expressions

0High
0Medium
0Low

millennial-sky

Lexing with Regular Expressions

This is a small (~50 lines of code) library the provides a simple API to tokenize strings using regular expressions.

Overview

Here is an example that we will use throughout the documentation

import {lex} from "@millennial-sky/lex-with-regexps"

lex("one 42", {
  id: /[a-zA-Z_][a-zA-Z0-9_]*/,
  num: /\d+/,
  ws: /\s+/,
})

lex will return an array of tokens, which are objects of this form

{
  kind: "id",
  text: "one",
  loc: {index: 0, line: 1, column: 1}
}

The focus of the library is minimalism and ease of use in articles that have to parse small languages, not on performance (which shouldn't be too bad anyway, but we go over the input more than once to compute the location information).

The library is written in literate JavaScript, using @millennial-sky/literate.

Implementation

Given some string and an index, the lexer will need to find which (if any) of the regular expressions matches at that index.

Merging the regular expressions

We could loop through each regexp and try it, but it's better to let the JavaScript RegExp engine do the heavy lifting.

The idea is to build a regular expression that uses named capture groups to differentiate between the token kinds

const log = console.log
log(
  /(?<id>[a-zA-Z_][a-zA-Z0-9_]*)|(?<num>\d+)|(?<ws>\s+)/sy.exec("one 42")
)
//>  [
//>    'one',
//>    'one',
//>    undefined,
//>    undefined,
//>    index: 0,
//>    input: 'one 42',
//>    groups: [Object: null prototype] { id: 'one', num: undefined, ws: undefined }
//>  ]

The y flag (called "sticky") will allow us to decide at which index the match should begin. The s flag ("dotAll") makes the . match newlines.

The function that merges the regular expressions is this

const buildLexerRegExp = (regExps) => {
  let buf = []
  for (let [tokenKind, regexp] of Object.entries(regExps)) {
    buf.push(`(?<${tokenKind}>${regexp.source})`)
  }
  return new RegExp(buf.join("|"), "sy")
}

So in our example

const regExps = {
  id: /[a-zA-Z_][a-zA-Z0-9_]*/,
  num: /\d+/,
  ws: /\s+/,
}

const lexerRegExp = buildLexerRegExp(regExps)
log(lexerRegExp)
//>  /(?<id>[a-zA-Z_][a-zA-Z0-9_]*)|(?<num>\d+)|(?<ws>\s+)/sy

We could have used regular capture groups instead of the named ones, but that would have been less robust: the user would have had to remember to use non-capture groups (?:...) in their regexps to avoid bugs. On the other hand users have no reason to use named capture groups.

Finding the next token

Now that we have a single regular expression, we can use it to find the next token

const matchNext = (index, str, lexerRegExp) => {
  // Setting `lastIndex` (while using the sticky flag "y") is how you tell RegExp#exec 
  // that the match should start exactly at that index
  lexerRegExp.lastIndex = index
  const match = lexerRegExp.exec(str)
  if (!match) return null
  
  const tokenKind = Object.keys(match.groups).find((k) => match.groups[k])
  
  return {
    kind: tokenKind,
    text: match[0]
  }
}

log(matchNext(0, "one 42", lexerRegExp))
log(matchNext(3, "one 42", lexerRegExp))
log(matchNext(4, "one 42", lexerRegExp))
log(matchNext(0, "+", lexerRegExp))
//>  { kind: 'id', text: 'one' }
//>  { kind: 'ws', text: ' ' }
//>  { kind: 'num', text: '42' }
//>  null

The source location and its advancement

We want to add location information to the tokens, i.e. the index, line and column where the tokens begin.

The location of the first token is always

{index: 0, line: 1, column: 1}

After we match a token we can use its text to get the starting location of the next token. The rules are easy:

the index increases by the length of the token text
the line increases by the number of newlines in the token text
the column
- if the token text contains a newline, it is the number of characters after the last newline
- otherwise it increases by the length of the token text

const nextLocation = (loc, text) => {
  const index = loc.index + text.length
  
  const line = loc.line + (text.match(/\n/g)?.length ?? 0)
  
  const lastNewlineIndex = text.lastIndexOf("\n")
  const column = (lastNewlineIndex >= 0) ?
    text.length - lastNewlineIndex :
    loc.column + text.length

  return {index, line, column}
}

We try it out on a few made up token strings

log(nextLocation({index: 0, line: 1, column: 1}, "aaa \n bb"))
log(nextLocation({index: 0, line: 1, column: 1}, "aaa \n  \n bb"))
log(nextLocation({index: 0, line: 1, column: 1}, "aa"))
//>  { index: 8, line: 2, column: 4 }
//>  { index: 11, line: 3, column: 4 }
//>  { index: 2, line: 1, column: 3 }

Errors

If the lexer regexp doesn't match we want to throw a specialized Error containing all the relevant information

export class SyntaxError extends Error {
  constructor(loc, src, msg) {
    const line = src.split("\n")[loc.line - 1]
    const marker = " ".repeat(loc.column - 1) + "^"
    super(`${loc.line}:${loc.column}: ${msg}\n${line}\n${marker}`)
    this.loc = loc
    this.src = src
  }
}

For example

try {
  throw new SyntaxError({index: 3, line: 1, column: 4}, "one 42", "Made up error")
} catch (e) {
  log(e.message)
}
//>  1:4: Made up error
//>  one 42
//>     ^

Putting it all together

We can now write the function that does the lexing

export const lex = (src, regExps) => {
  const lexerRegExp = buildLexerRegExp(regExps)
  const tokens = []
  let loc = {index: 0, line: 1, column: 1}
  while (loc.index < src.length) {
    const match = matchNext(loc.index, src, lexerRegExp)
    if (match == null) throw new SyntaxError(loc, src, "Unknown token")
    tokens.push({...match, loc})
    loc = nextLocation(loc, match.text)
  }
  return tokens
}

Lets try it out

log(lex("one 42", regExps))
//>  [
//>    { kind: 'id', text: 'one', loc: { index: 0, line: 1, column: 1 } },
//>    { kind: 'ws', text: ' ', loc: { index: 3, line: 1, column: 4 } },
//>    { kind: 'num', text: '42', loc: { index: 4, line: 1, column: 5 } }
//>  ]

while a failure looks like this

try {
  lex("one + 42", regExps)
} catch (e) {
  log(e.message)
}
//>  1:5: Unknown token
//>  one + 42
//>      ^

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme