baa-lexer

v0.3.1

Published

a year ago

![](img/baa-sheep-lemmling.svg)

knappi

Original image by lemmling on OpenClipArt.org

Baa!

Baa is a highly-optimised tokenizer/lexer written in TypeScript. It is inspired by moo, but completely rewritten.

It accepts most of moo's configurations, but lacks some features.

- No support for arrays of keywords.
- No support for rules that are arrays of rule definitions.
- No support for regular expressions with the unicode flag.
- Fewer dynamic checks (e.g. it silently drops all provided regex flags).

Advantages:

- Compiles to a reusable, concurrency-safe lexer instead of creating an iterable object directly (see "Usage").
- Different token format.
- Slightly faster than moo in most of the benchmarks below (and at least not much slower in the rest).
- About 2.2 kB in size.
- Strong typings, including state names and token types.
- Understandable code.

Note: This was mostly an exercise for me to practice test-driven development and think about architecture a bit. In the end, I tried to optimize speed and build size. I don't think it makes a lot of difference whether you use moo or baa. moo is more popular and may be better supported in the long run. I will use baa in handlebars-ng though.

Installation

Install baa-lexer with

npm install baa-lexer

Usage

The examples/ directory shows you how to use baa. One of the simpler examples is this:

import { baa } from "baa-lexer";

const lexer = baa({
  main: {
    A: "a",
    FALLBACK: { fallback: true },
    B: "b",
  },
});

for (const token of lexer.lex("a b")) {
  console.log(token);
}

This will print the following tokens:

{ type: 'A', original: 'a', value: 'a', start: { line: 1, column: 0 }, end: { line: 1, column: 1 } }
{ type: 'FALLBACK', original: ' ', value: ' ', start: { line: 1, column: 1 }, end: { line: 1, column: 2 } }
{ type: 'B', original: 'b', value: 'b', start: { line: 1, column: 2 }, end: { line: 1, column: 3 } }

For a complete list of rules, have a look at the tests.

Using types

If you create a type

interface Typings {
  tokenType: "my" | "token" | "types";
  stateName: "my" | "state" | "names";
}

and pass it as a generic type argument to the baa function, you will get auto-completion for token types and state names within the configuration, as well as for the "type" field of the created tokens. The following screenshot highlights all places that are type-checked and auto-completed.
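In code form, the idea of driving both the configuration and the tokens from one typings interface can be sketched roughly as follows. This is a self-contained illustration only: `LexerTypes`, `ConfigFor`, `TokenFor`, and `describeToken` are hypothetical names invented for this sketch and are not baa's actual type definitions.

```typescript
// Hypothetical stand-ins for illustration; baa's real types may differ.
interface LexerTypes {
  tokenType: string;
  stateName: string;
}

interface Typings {
  tokenType: "A" | "B" | "FALLBACK";
  stateName: "main";
}

// Every state name must appear as a key, and every rule name
// must be one of the declared token types.
type ConfigFor<T extends LexerTypes> = {
  [State in T["stateName"]]: Partial<
    Record<T["tokenType"], string | { fallback: true }>
  >;
};

// The "type" field of a token is narrowed to the declared token types.
interface TokenFor<T extends LexerTypes> {
  type: T["tokenType"];
  value: string;
}

const config: ConfigFor<Typings> = {
  main: {
    A: "a",
    FALLBACK: { fallback: true },
    B: "b",
    // C: "c", // would not compile: "C" is not in Typings["tokenType"]
  },
};

// The compiler checks that this switch covers every token type.
function describeToken(token: TokenFor<Typings>): string {
  switch (token.type) {
    case "A":
      return "letter a";
    case "B":
      return "letter b";
    case "FALLBACK":
      return "unmatched text";
  }
}
```

Because the token type is a union of string literals, a misspelled rule name or an unhandled case is reported at compile time rather than at runtime.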

Benchmarks

See performance/ for the exact tests and run them yourself with

yarn perf

These are the results, but be aware that results may vary a lot:

 BENCH  Summary

  moo - performance/moo-baa.bench.ts > moo-baa test: './tests/abab.ts' (+0)
    1.07x faster than baa

  baa - performance/moo-baa.bench.ts > moo-baa test: './tests/fallback.ts' (+0)
    1.19x faster than moo

  baa - performance/moo-baa.bench.ts > moo-baa test: './tests/handlears-ng.ts' (+0)
    1.50x faster than moo

  baa - performance/moo-baa.bench.ts > moo-baa test: './tests/handlears-ng.ts' (1)
    1.25x faster than moo

  baa - performance/moo-baa.bench.ts > moo-baa test: './tests/handlears-ng.ts' (2)
    1.19x faster than moo

  baa - performance/moo-baa.bench.ts > moo-baa test: './tests/json-regex.ts' (+0)
    1.15x faster than moo

  moo - performance/moo-baa.bench.ts > moo-baa test: './tests/json.ts' (+0)
    1.04x faster than baa

Readable / Extendable code

What bothered me most about moo was that it is just one large JavaScript file, and it took me a long while to understand all the optimizations they implemented.

I tried to take a modular approach. Basically, the whole program is divided into:

- The Lexer: Responsible for creating an IterableIterator of tokens, which then manages state transitions. Uses the TokenFactory to create the actual tokens.
- The Matcher: Finds the next token match. There are different strategies:
  - RegexMatcher: Creates a large regex to find the next match.
  - StickySingleCharMatcher: Uses an array to map char-codes to rules. Can only find single-char tokens, but this can be done much faster than with a regex.
- The StateProcessor: Uses the Matcher to find the next match and interleaves matches for fallback and error rules.
- The TokenFactory: Keeps track of the current location and creates tokens from matches.
- The mooAdapter: Takes a moo-config and combines all those components so that they do what they should.
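The two matcher strategies can be illustrated with a small, self-contained sketch. The interface and class names below (`Match`, `Matcher`, `SingleCharMatcher`, `CombinedRegexMatcher`) are assumptions made for this example and do not reproduce baa's internal code; the sketch also assumes that rule regexes contain no capturing groups of their own and, like baa, ignores any flags on them.

```typescript
// Illustrative only: names and shapes are assumptions, not baa's internals.
interface Match {
  rule: string;
  text: string;
  offset: number;
}

interface Matcher {
  // Returns the match starting exactly at `offset`, or null if none starts there.
  matchAt(input: string, offset: number): Match | null;
}

// Array lookup keyed by char code: one index operation per position,
// without involving the regex engine.
class SingleCharMatcher implements Matcher {
  private readonly ruleByCharCode: (string | undefined)[] = [];

  constructor(rules: Record<string, string>) {
    for (const [rule, char] of Object.entries(rules)) {
      if (char.length !== 1) {
        throw new Error(`Rule "${rule}" must match exactly one character`);
      }
      this.ruleByCharCode[char.charCodeAt(0)] = rule;
    }
  }

  matchAt(input: string, offset: number): Match | null {
    const rule = this.ruleByCharCode[input.charCodeAt(offset)];
    return rule === undefined
      ? null
      : { rule, text: input.charAt(offset), offset };
  }
}

// One combined sticky regex: each rule becomes a capture group, and the
// group that took part in the match identifies the rule.
class CombinedRegexMatcher implements Matcher {
  private readonly names: string[];
  private readonly regex: RegExp;

  constructor(rules: Record<string, RegExp>) {
    const entries = Object.entries(rules);
    this.names = entries.map(([name]) => name);
    const source = entries.map(([, regex]) => `(${regex.source})`).join("|");
    this.regex = new RegExp(source, "y");
  }

  matchAt(input: string, offset: number): Match | null {
    this.regex.lastIndex = offset; // sticky: match only at this position
    const result = this.regex.exec(input);
    if (result === null) return null;
    for (let i = 0; i < this.names.length; i++) {
      const name = this.names[i];
      if (name !== undefined && result[i + 1] !== undefined) {
        return { rule: name, text: result[0], offset };
      }
    }
    return null;
  }
}
```

Both classes satisfy the same `Matcher` interface, so a caller can pick the cheaper array lookup whenever every rule in a state is a single character and fall back to the combined regex otherwise.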

Advanced usage

You do not have to use the mooAdapter, though: most of the internal components are exposed, so you can use them yourself. You can create a StateProcessor and pass your own Matcher instance to it, or you can write a completely new StateProcessor with completely custom logic.
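To give a feeling for what plugging a custom matcher into a processing loop could look like, here is a minimal, self-contained sketch. It does not use baa's exported classes: `Match`, `FindNext`, `processState`, and `findAOrB` are hypothetical names, and the real interfaces may look quite different. The loop interleaves fallback matches for any text that the matcher skips over, and it assumes the matcher never returns an empty match.

```typescript
// Hypothetical shapes for illustration; baa's exported interfaces may differ.
interface Match {
  rule: string;
  text: string;
  offset: number;
}

// Finds the first match at or after `from`, or null if there is none.
type FindNext = (input: string, from: number) => Match | null;

// Emits the matcher's matches and fills every gap between them
// with a match for the fallback rule.
function* processState(
  input: string,
  findNext: FindNext,
  fallbackRule: string,
): Generator<Match> {
  let offset = 0;
  while (offset < input.length) {
    const match = findNext(input, offset);
    const matchStart = match === null ? input.length : match.offset;
    if (matchStart > offset) {
      yield {
        rule: fallbackRule,
        text: input.slice(offset, matchStart),
        offset,
      };
    }
    if (match === null) return;
    yield match;
    // Assumes non-empty matches; an empty match would not advance the offset.
    offset = match.offset + match.text.length;
  }
}

// A custom matcher without any regex: it looks for the literals "a" and "b".
const findAOrB: FindNext = (input, from) => {
  for (let i = from; i < input.length; i++) {
    const char = input.charAt(i);
    if (char === "a") return { rule: "A", text: char, offset: i };
    if (char === "b") return { rule: "B", text: char, offset: i };
  }
  return null;
};

const ruleNames = Array.from(processState("a b", findAOrB, "FALLBACK")).map(
  (match) => match.rule,
);
console.log(ruleNames); // [ 'A', 'FALLBACK', 'B' ]
```

The sequence of rule names matches the token types of the "a b" example in the "Usage" section, although this sketch produces bare matches rather than baa's full tokens with line and column information.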

The program could also be extended to allow a custom TokenFactory, applying the token format that you need (but I won't do this unless somebody needs it).
