sever-tokens v0.1.5

A moo compatible tokenizer/lexer generator, sacrificing some performance for features.

Sever

Sever is a mostly-optimized tokenizer/lexer generator. It is designed to support back-references, capture groups, and modern language features while maintaining compatibility with most of the 🐄 moo API.

Why not just use moo?

Moo is a fantastic lexer, and I have personally been using it for years. With that come some caveats, especially around which features of RegExp objects can be used without complaints. Parsing languages such as Lua, where back-references are necessary to capture multiline strings easily, is one example. Additionally, capture groups make value transformations simpler.

This results in a few trade-offs:

  1. Convenience features take priority over speed as the primary optimization metric.
  2. Legacy runtime environments are not supported.
  3. Undocumented moo features, or ones that can easily be replaced with ES6 operations, are not a firm requirement.

Usage

Basic installation is as follows: npm install sever-tokens for a Node-like environment; alternatively, you can include sever.js in a script tag directly.

const sever = require('sever-tokens'); // or: import sever from 'sever-tokens'

const lexer = sever.compile({
    WhiteSpace:  { match: /\s+/, lineBreaks: true },
    Number: { match: /\d+\.\d*|\d*\.\d+|\d+/, value: (v) => Number(v) },
    Keyword: {
        match: [
            "and",   "break", "do",       "else",  "elseif", "end",
            "false", "for",   "function", "goto",  "if",     "in",
            "local", "nil",   "not",      "or",    "repeat", "return",
            "then",  "true",  "until",    "while",
        ],
        wordBound: true
    },
    Tuple: { match: /(\w+),(\w+),(\w+)/, value: (token, ...groups) => groups },
    Identifier: /\w+/,
    LeftParen: '(',
    RightParen: ')',
    String: { match: /\[(?<smark>.*?)\[(?<string>(?:(?!\]\k<smark>\])\\?.)*)\]\k<smark>\]/, value: (token, groups) => groups.string },
    LuaComment: /--\[(?<cmark>.*?)\[(?<content>(?:(?!\]\k<cmark>\])\\?.)*)\]\k<cmark>\]/
});

Finally, feed a source string to the returned Tokenizer.

lexer.reset('Hello world');
lexer.next(); // { type: 'Identifier', value: 'Hello', ... }
lexer.next(); // { type: 'Identifier', value: 'world', ... }

API

sever.compile(tokens)

Generates a lexer. tokens must be an object whose keys are token type names and whose values are the matching options for each type. **Order of matches determines priority.** Returns a Tokenizer.

sever.compile({
    keyword: ["and", "xor", "or", "not"],
    identifier: /\w+/,
    number: /[0-9]+/,
    semicolon: ";",
    ws: { match: /\s+/, discard: true }
})

Matching values can be literal strings, RegExp objects, arrays of either, or optioned token objects (see Optioned Token below).
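
For instance, a single .compile() call can mix all of these forms (token names here are purely illustrative):

const lexer = sever.compile({
    LeftBrace: '{',                        // literal string
    Number: /[0-9]+/,                      // RegExp
    Boolean: ['true', 'false'],            // array of literal strings
    WS: { match: /\s+/, discard: true }    // optioned token (see Optioned Token below)
});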

sever.states(states)

Generates a stateful lexer. This works much like .compile(), except that you define multiple named groups of tokens and switch between them statefully via the push, pop, and next options.

sever.states({
    main: {
        enter: { match: '"', push: "string", discard: true },
        whitespace: { match: /\s+/, discard: true },
        identifier: /\w+/,
        number: /[0-9]+/
    },
    string: {
        exit: { match: '"', pop: 1, discard: true },
        escape: /\\./,
        character: /./
    }
})
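
A rough usage sketch for the lexer above, assuming the first state listed (main) is the initial one and that discarded tokens are skipped by .next():

lexer.reset('say "hi" 42');
lexer.next(); // { type: 'identifier', value: 'say', ... }
lexer.next(); // { type: 'character', value: 'h', ... }   (the opening quote pushed the string state and was discarded)
lexer.next(); // { type: 'character', value: 'i', ... }
lexer.next(); // { type: 'number', value: '42', ... }     (the closing quote popped back and was discarded)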

sever.error

This is a default catch-all error matcher that can be inserted into any token group. It matches an arbitrary run of characters of a similar type from the feed.

const lexer = sever.compile({
    Error: sever.error
});

lexer.reset("BadToken!");
lexer.next(); // { type: 'Error', value: 'BadToken', error: true, ... }

class Tokenizer

Tokenizer.reset(source, state?)

Initializes the tokenizer, or restores it to a previous state.

source is the string to be tokenized; state is an optional parameter (returned from .save()).

Tokenizer.next()

Returns the next matched token from the provided source.
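
Given a compiled lexer, and assuming .next() follows moo's convention of returning undefined once the source is exhausted, a simple drain loop might look like:

lexer.reset('one two three');
let token;
while ((token = lexer.next()) !== undefined) {
    console.log(token.type, token.value);
}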

Token Format

  • type: the name of the group, as passed to compile.
  • text: the string that was matched.
  • value: the string that was matched, transformed by your value function (if any).
  • offset: the number of bytes from the start of the buffer where the match starts.
  • lineBreaks: the number of line breaks found in the match. (Always zero if this rule does not have lineBreaks)
  • line: the line number of the beginning of the match, starting from 1.
  • col: the column where the match begins, starting from 1.
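
As an illustration, tokenizing a two-line input with the lexer from the Usage section would populate these fields roughly as follows (values shown are illustrative):

lexer.reset('foo\nbar');
lexer.next(); // { type: 'Identifier', text: 'foo', value: 'foo', offset: 0, lineBreaks: 0, line: 1, col: 1 }
lexer.next(); // { type: 'WhiteSpace', text: '\n', value: '\n', offset: 3, lineBreaks: 1, line: 1, col: 4 }
lexer.next(); // { type: 'Identifier', text: 'bar', value: 'bar', offset: 4, lineBreaks: 0, line: 2, col: 1 }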

Tokenizer.*[Symbol.iterator]

Provides iterator support.

const lexer = sever.compile({...});

lexer.reset(source); // source: the string to tokenize
for (const token of lexer) {
    console.log(token);
}

Tokenizer.save()

Preserves the current state of the lexer. The returned value may be passed to .reset(...) as its second argument to rewind to that point.
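
A brief sketch of checkpointing and rewinding with .save() and .reset() (token names here are arbitrary):

const lexer = sever.compile({
    WS: { match: /\s+/, discard: true },
    word: /\w+/
});

lexer.reset('alpha beta gamma');
lexer.next();                                  // { type: 'word', value: 'alpha', ... }
const checkpoint = lexer.save();               // snapshot taken after 'alpha'
lexer.next();                                  // { type: 'word', value: 'beta', ... }
lexer.reset('alpha beta gamma', checkpoint);   // rewind to the checkpoint
lexer.next();                                  // { type: 'word', value: 'beta', ... } again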

Tokenizer.has(token)

Returns whether or not the tokenizer can match the given token type name.
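
For example, given the keyword/identifier lexer from the .compile() example above:

lexer.has('keyword');      // true
lexer.has('punctuation');  // false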

Tokenizer.formatError(token)

Returns a formatted message describing the token. Useful for printing debug information to the console.
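
For example, called on a hypothetical error token returned by .next(), it might print a message like the one shown below:

console.log(lexer.formatError(badToken));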

Error: BadOperator at line 5 col 15:

This is a?failure
         ^

Optioned Token

Some tokens require a little more complex matching than a basic regular expression or literal string.

Token Options

  • .match (required): Matching pattern for token
  • .value: Value transformation function
  • .error: Token should be treated as a failure
  • .push: Push state to stack, and swap context
  • .pop: Restore previously pushed context from stack
  • .next: Set context to new state
  • .discard: Match but discard token
  • .wordBound: Token is word bound
  • .lineBreaks: Token may contain linebreaks

match

This is the matching pattern for the token. This will accept: Literal Strings, RegExp objects, and Arrays of Literal Strings or RegExp objects.

const complexMatches = {
    whitespace: { match: /\s+/ },
    number: { match: /[0-9]*\.[0-9]+|[0-9]+\.[0-9]*|[0-9]+/ },
    keywords: { match: [ 'yes', 'no', 'maybe' ] }
};

Optioned tokens cannot be nested inside one another; doing so will result in an error.

error

This defines the token as a failure case: tokenization stops after it is matched, and an error flag is set on the returned token object.
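
A minimal sketch of marking a single token type as a failure case (the pattern and names are illustrative):

const lexer = sever.compile({
    WS: { match: /\s+/, discard: true },
    Identifier: /\w+/,
    BadOperator: { match: /[?]/, error: true }  // tokenization stops here and the token carries error: true
});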

discard

This flags a token as matched but unused. It is helpful for things such as whitespace, which must be consumed but is generally not needed by your parser.

wordBound

This defines the match as being word-bound. This is especially helpful for keywords, as you no longer need to sort the values by length.

const lexer = sever.compile({
    ...
    keywords: {
        match: [
            "and",   "break", "do",       "else",  "elseif", "end",
            "false", "for",   "function", "goto",  "if",     "in",
            "local", "nil",   "not",      "or",    "repeat", "return",
            "then",  "true",  "until",    "while",
        ],
        wordBound: true
    },
    ...
});

lineBreaks

This specifies that the matched token contains line breaks. If the value is a number, the token is treated as a fixed number of line breaks.

{ match: /\n\r?|\r\n?/, lineBreaks: 1 }

Any other truthy value simply specifies that the token may contain line breaks, and a second pass must be taken to determine the location after the match. **This case should be avoided, as it has a significant performance impact on matching this token type.**

{ match: /[\s]+/, lineBreaks: true }

value

This provides a value transformation function for the token. It operates under three different calling conventions, depending on whether you are matching a RegExp and whether that RegExp contains capture groups. Regardless of the mode, the provided function is called with the match (and, optionally, the capture groups), and should return a value to be placed in the returned token as the .value property.

Named capture group value transformation

If you provided a regular expression with named groups, the 2nd argument of your value transformation will be an object containing all of your capture groups by name.

{ match: /'(?<string>[^']*)'/, value: (match, groups) => (groups.string) }

Numbered capture groups

If you provided a regular expression with ordered capture groups, they will be provided as a spread of arguments. If no capture groups are supplied, or a literal string is matched, only the match will be passed to the transformation function.

{ match: /(\w+)\.(\w+)/, value: (match, ident, property) => ({ ident, property }) }

States

States provide a method for dynamically changing the pool of matchable tokens available to the lexer on the fly. This is useful for context-sensitive parsing, where you may need to be aware of something that appeared earlier and could affect later parsing, such as strings. State changes can be done as a direct swap or as stack operations.

const lexer = sever.states({
    main: {
        'string': { match: '"', next: 'string' },
        'WS': /\s+/,
        'identifier': /\w+/
    },
    string: {
        'end': { match: '"', next: 'main', discard: true },
        'escaped': { match: /\\./, value: (v) => JSON.parse(`"${v}"`) },
        'char': /./
    }
});

Additionally, context changes can be done on a stack basis, in the case of nested operations. The stack only moves one level at a time, regardless of what value is provided to pop.

const common = {
    open_words: { match: '{', push: 'words', discard: true },
    open_numbers: { match: '[', push: 'numbers', discard: true },
    close_words: { match: '}', pop: 1, discard: true },
    close_numbers: { match: ']', pop: 1, discard: true },
    WS: { match:/\s+/, discard: true },
};

const lexer = sever.states({
    words: {
        ... common,
        word: /[a-z]+/
    },
    numbers: {
        ... common,
        number: /[0-9]+/
    }
});
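
A rough usage sketch for the stack-based lexer above, assuming the first state listed (words) is the initial one and that discarded tokens are skipped:

lexer.reset('alpha [ 12 34 ] beta');
lexer.next(); // { type: 'word', value: 'alpha', ... }
lexer.next(); // { type: 'number', value: '12', ... }   ('[' pushed the numbers state and was discarded)
lexer.next(); // { type: 'number', value: '34', ... }
lexer.next(); // { type: 'word', value: 'beta', ... }    (']' popped back to words)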