
lexr v1.0.4 · Published

Lexical analyzer built in JavaScript

Downloads: 97

lexr


lexr is a lightweight tokenizer built in JavaScript, meant to be more modern and cleaner than its C ancestor.

Goals

lexr is compartmentalized so that it can work on its own, but it is meant to be used hand in hand with grammr once that project is finished.
lexr and grammr are an effort to rethink how the traditional flex and bison workflow fits together and to move development to a more modern process.

Both projects are being developed for use in the rework of ivi, a project that aims to visualize code for the purpose of teaching introductory programming.

Features

The current lexical analyzer has built-in support for JavaScript, with plans to extend to other languages.
If you do not see your language supported, or would simply like to use custom tokens, that is possible as well.

What is currently supported:

  • Using built-in or custom tokens
  • Adding tokens to the tokenizer, one by one or as a set
  • Error detection
  • Functions on token recognition
  • Overriding the error token name
  • Removing tokens from the token set
  • Ignoring tokens in the output, one by one or as a set
  • Un-ignoring tokens in the token set
  • Custom output for tokens
  • Removing custom output for tokens

Built-In Language Support

  • JavaScript
  • JSON

Usage

The entire library revolves around a Tokenizer class.
First, import the library:

let lexr = require('lexr');

To use a built-in language, initialize the tokenizer like so:

let tokenizer = new lexr.Tokenizer("Javascript");

If you would like to use fully custom tokens, simply initialize with an empty string:

let tokenizer = new lexr.Tokenizer("");

If you have selected a built-in language, you will not be able to add or remove tokens until you disable strict mode for tokens.
To do so, call the disableStrict() function on the tokenizer instance.

Tokens

Once you have done so, or if you are working with a fully custom tokenizer, you can add tokens in two ways:

// Add a single token
// Arguments: tokenName, RegExp pattern
tokenizer.addToken("L_PAREN", /\(/);

// Add multiple tokens
// Must be in the form of a Javascript Object
const tokens = {
  L_BRACE : /{/,
  R_BRACE : new RegExp('}'),
};
tokenizer.addTokenSet(tokens);

You can also remove pre-existing tokens if you are using a custom language or have disabled strict mode.

tokenizer.removeToken("L_PAREN");

White Space and New Lines

The tokenizer can ignore whitespace and newlines; call the corresponding methods on the instance.
Example:

let x = new lexr.Tokenizer("Json");
x.ignoreWhiteSpace();
x.ignoreNewLine();

Functions

If you would like to run functions when tokens are recognized, you can add them as a set or individually by calling the appropriate add function.

// Add functions through set
let funs = {
  WHITESPACE  : function() { whitespaceCount += 1 },
  IDENTIFIER  : function() { idNum += 1 }
}
tokenizer.addFunctionSet(funs);

// Add single function
tokenizer.addFunction("NEW_LINE", function() { lineCount += 1 });

// Remove function
tokenizer.removeFunction("IDENTIFIER");

Function Usage

lexr differs slightly from flex in how functions are handled.
Your functions should not use any information taken from the current token, since you have access to that information after tokenization.
This keeps the functions being executed smaller and cleaner.

Example Function Usage

An example of code using these functions is as follows:

let whitespaceCount = 0;
let newLineCount = 0;
let funs = {
  WHITESPACE  : function() { whitespaceCount += 1 },
  NEW_LINE    : function() { newLineCount += 1 }
}
tokenizer.addFunctionSet(funs);
let input = `var a = 4;
             var b = 3;`;
tokenizer.tokenize(input);

Function Scoping Oddities

Since functions are contained within an object in the tokenizer, scoping can get a bit iffy.
The example above works, but the suggested usage is to create a functions.js in order to:

  • map all of your tokens to functions
  • declare any variables you need outside of the function object
  • export the function object, as well as any variables you want access to, to the proper files
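Under that structure, a minimal functions.js sketch could look like the following (the file, variable, and accessor names here are illustrative, not part of lexr's API):

```javascript
// functions.js (hypothetical): declare the counters outside the function
// object so every callback closes over the same variables, then export
// both the function object and an accessor for the counters.
let lineCount = 0;
let idNum = 0;

const funs = {
  NEW_LINE:   function () { lineCount += 1; },
  IDENTIFIER: function () { idNum += 1; },
};

// Accessor so other files can read the live counter values.
function getCounts() {
  return { lineCount: lineCount, idNum: idNum };
}

module.exports = { funs, getCounts };
```

index.js can then require this module and hand funs to tokenizer.addFunctionSet(funs), while still being able to read the counters afterwards via getCounts().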

yytext Function Alternative

Instead of using yytext within your functions, the suggested approach is to analyze the output after tokenization.
An example of grabbing all identifier names and inserting them into, say, a symbol table would look like:

let input = `var a = 4;
             var b = 3;`;
let output = tokenizer.tokenize(input);

let symbolTable = {};
for (let i = 0; i < output.length; i++) {
  if (output[i].token === "IDENTIFIER") {
    symbolTable[output[i].value] = undefined;
  }    
}

Error Token

By default, the token name used when an unrecognized token is detected is ERROR. If you would like to change the name, call setErrTok like so:

tokenizer.setErrTok("DIFF_ERROR");

Ignore Tokens

You can also stop certain tokens from appearing in the output, either by calling addIgnore:

tokenizer.addIgnore("WHITESPACE");

Or by adding an entire set, through either an array or an object:

let ignore = ["WHITESPACE", "VAR"];
tokenizer.addIgnoreSet(ignore);

// Or through an object which allows true or false
let ignore2 = {
  "WHITESPACE"  : true,
  "VAR"         : false,
};
tokenizer.addIgnoreSet(ignore2);

If you would like to un-ignore tokens programmatically, just call the unIgnore method:

tokenizer.unIgnore("WHITESPACE");

Custom Output

You can make your output more verbose by adding a customOut field to the output object.
As with other operations, you can add a set of tokens or a single token, as well as remove them.

// Add a set of custom outputs
let customOut = {
  "WHITESPACE"  : 2,
  "VAR"         : 'declaration',
}
tokenizer.addCustomOutSet(customOut);

// Add a single custom output
tokenizer.addCustomOut("SEMI_COLON", 111);

// Remove a custom out
tokenizer.removeCustomOut("VAR");

A sample output object would then look like:

{ token: 'WHITESPACE', value: ' ', customOut: 2 }

Tokenization

Finally, to tokenize your input code, simply call the tokenizer's tokenize method.

let output = tokenizer.tokenize(aString);

Output

In its current form, the output from tokenize(aString) is a list of objects, each with two properties:
token, the name of the token captured, and value, the text that matched it.
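Since value holds the exact text that matched, concatenating the values of an output list reconstructs the input, assuming no tokens are being ignored. A small hypothetical helper illustrates the shape:

```javascript
// detokenize: rebuild source text from a tokenize() result, where each
// entry has the shape { token: <name>, value: <matched text> }.
function detokenize(output) {
  return output.map(function (entry) { return entry.value; }).join('');
}

// A hand-written sample in the same shape lexr produces:
const sample = [
  { token: 'VAR', value: 'var' },
  { token: 'WHITESPACE', value: ' ' },
  { token: 'IDENTIFIER', value: 'a' },
  { token: 'SEMI_COLON', value: ';' },
];

console.log(detokenize(sample)); // prints "var a;"
```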

Examples

Sample Program

let lexr = require('lexr');
let tokenizer = new lexr.Tokenizer("Javascript");
let input = "var a = null;";
let output = tokenizer.tokenize(input);
console.log(output);

Output would then be

[ { token: 'VAR', value: 'var' },
  { token: 'WHITESPACE', value: ' ' },
  { token: 'IDENTIFIER', value: 'a' },
  { token: 'WHITESPACE', value: ' ' },
  { token: 'ASSIGN', value: '=' },
  { token: 'WHITESPACE', value: ' ' },
  { token: 'NULL_LIT', value: 'null' },
  { token: 'SEMI_COLON', value: ';' } ]

Sample Program with Token Errors

let lexr = require('lexr');
let tokenizer = new lexr.Tokenizer("");
tokenizer.addToken("PLUS", /\+/);
tokenizer.setErrTok("DIFF_ERROR");
let input = "5+5;";
let output = tokenizer.tokenize(input);
console.log(output);

Output would then be

[ { token: 'DIFF_ERROR', value: '5' },
  { token: 'PLUS', value: '+' },
  { token: 'DIFF_ERROR', value: '5;' } ]

Suggested Workflow

If you are not using a built-in language, I suggest separating each part of the tokenization into its own file.
If you are using a complex language where the regexes become very large, move the building up of those regexes into another file and export only the final regex to your token object.

Since the Tokenizer can take in sets of information, it is easiest to separate everything and use exports between files.

Sample project structure

+-- src/
| +-- index.js
| +-- functions/
|   +-- functions.js
| +-- tokens/
|   +-- tokens.js
|   +-- regexPatterns.js
| +-- ignore/
|   +-- ignoreTokens.js
| +-- customOut/
|   +-- customOutput.js
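As a sketch of the regex-building idea, regexPatterns.js could compose small named pieces and export only the finished RegExp (the names and the number pattern here are illustrative, not part of lexr):

```javascript
// regexPatterns.js (hypothetical): build a large pattern from named
// pieces and export only the final RegExp for tokens.js to use.
const DIGIT = '[0-9]';
const FRACTION = '(?:\\.' + DIGIT + '+)?';
const EXPONENT = '(?:[eE][+-]?' + DIGIT + '+)?';

// A JSON-style number literal: optional sign, integer part, optional
// fraction and exponent.
const NUMBER = new RegExp('-?' + DIGIT + '+' + FRACTION + EXPONENT);

module.exports = { NUMBER };
```

tokens.js would then require('./regexPatterns') and include NUMBER in the object it passes to addTokenSet.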

Future Features

If any good features are missing, feel free to open an issue with a feature request.