@tamim.jabr/tokenizer

v1.0.7

Published

4 years ago

It is a package to help you detect tokens in strings based on the grammar you choose

tamim.jabr

Tokenizer

A package that detects tokens in a string according to a grammar you choose, either one of the built-in grammars or one you define yourself.

How to install it?

npm i @tamim.jabr/tokenizer

How to import it?

import tokenizer from '@tamim.jabr/tokenizer'

How to use it?

Create a tokenizer by passing a grammar object and the string to tokenize to the Tokenizer constructor:

import tokenizer from '@tamim.jabr/tokenizer'

const {
  Tokenizer,
  Grammar,
  WordAndDotGrammar,
  ArithmeticGrammar,
  ExclamationGrammar,
  MaximalMunchGrammar
} = tokenizer

const grammar = new WordAndDotGrammar()

const newTokenizer = new Tokenizer(grammar, 'hello World .')

newTokenizer.getActiveToken()
// expected : { tokenType: 'WORD', tokenValue: 'hello' }

newTokenizer.moveActiveTokenToNext()
newTokenizer.getActiveToken()
// expected : { tokenType: 'WORD', tokenValue: 'World' }

newTokenizer.moveActiveTokenToNext()
newTokenizer.getActiveToken()
// expected : { tokenType: 'DOT', tokenValue: '.' }
newTokenizer.hasNext()
// expected : true

newTokenizer.moveActiveTokenToNext()
newTokenizer.getActiveToken()
// expected : { tokenType: 'END', tokenValue: '' }
newTokenizer.hasNext()
// expected : false
newTokenizer.moveActiveTokenToNext()
// expected : Error: Invalid index for active token
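The walk-through above steps through the tokens one call at a time. As a minimal sketch, the same traversal can be written as a loop. The helper name collectTokens is hypothetical and not part of the package; it relies only on the getActiveToken(), hasNext() and moveActiveTokenToNext() methods shown above.

```javascript
// Hypothetical helper, not part of @tamim.jabr/tokenizer.
// Collects every token from the current active token up to and including END.
// `tokenizer` is any object that provides the three documented methods.
function collectTokens (tokenizer) {
  const tokens = [tokenizer.getActiveToken()]
  // hasNext() stays true until the active token is END,
  // so the loop stops before the "Invalid index" error can occur.
  while (tokenizer.hasNext()) {
    tokenizer.moveActiveTokenToNext()
    tokens.push(tokenizer.getActiveToken())
  }
  return tokens
}

// Assumed usage with the package:
// collectTokens(new Tokenizer(new WordAndDotGrammar(), 'hello World .'))
```

If the input contains a part the grammar cannot match, the loop would presumably stop with the lexical error described under Lexical Errors, so callers may want to wrap it in try/catch.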

Available Grammars:

Word and dot grammar: detects words (including words that contain å, ä, ö, Å, Ä or Ö) and dots using the following rules:

 [
    {
      tokenType: 'WORD',
      regex: /^[\w|åäöÅÄÖ]+/
    },
    {
      tokenType: 'DOT',
      regex: /^\./
    }
  ]

const wordAndDotGrammar = new WordAndDotGrammar()
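Two details of the WORD pattern are easy to miss, and the snippet below checks them with plain JavaScript regular expressions (no package code involved). The variable names are only for illustration.

```javascript
// The WORD regex exactly as listed in the grammar above.
const wordRegex = /^[\w|åäöÅÄÖ]+/

// \w alone covers only [A-Za-z0-9_], so the Swedish letters must be listed explicitly.
const plainWord = /^\w+/.exec('får')[0] // 'f'
const swedishWord = wordRegex.exec('får')[0] // 'får'

// Inside [...] the | is a literal character, not alternation,
// so a pipe is accepted as part of a WORD.
const withPipe = wordRegex.exec('a|b')[0] // 'a|b'

// Digits and underscores belong to \w as well.
const withDigits = wordRegex.exec('abc_123 rest')[0] // 'abc_123'
```

If pipes should not count as word characters in your own grammar, a class such as [\wåäöÅÄÖ] without the | would be one way to express that.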

Arithmetic grammar: detects the elements of arithmetic expressions (numbers, operators and parentheses) using the following rules:

 [
    {
      tokenType: 'NUMBER',
      regex: /^[0-9]+(\.([0-9])+)?/
    },
    {
      tokenType: 'ADD',
      regex: /^[+]/
    },
    {
      tokenType: 'MUL',
      regex: /^[*]/
    },
    {
      tokenType: 'DIV',
      regex: /^[/]/
    },
    {
      tokenType: 'SUB',
      regex: /^[-]/
    },
    {
      tokenType: 'LEFT_PARENTHESES',
      regex: /^[\(]/
    },
    {
      tokenType: 'RIGHT_PARENTHESES',
      regex: /^[\)]/
    }
  ]

const arithmeticGrammar = new ArithmeticGrammar()

Maximal munch grammar: detects numbers and distinguishes between floats and integers using the following rules:

[
    {
      tokenType: 'INTEGER',
      regex: /^[0-9]+/
    },
    {
      tokenType: 'FLOAT',
      regex: /^[0-9]+\.[0-9]+/
    }
  ]

const maximalMunchGrammar = new MaximalMunchGrammar()
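"Maximal munch" usually means that when several rules match, the longest match wins. The README does not show how the package implements this, so the following is only a sketch of the general technique using the two rules above; the function name longestMatch is hypothetical.

```javascript
const rules = [
  { tokenType: 'INTEGER', regex: /^[0-9]+/ },
  { tokenType: 'FLOAT', regex: /^[0-9]+\.[0-9]+/ }
]

// Try every rule against the start of `text` and keep the longest match.
// Returns null when no rule matches.
function longestMatch (rules, text) {
  let best = null
  for (const { tokenType, regex } of rules) {
    const match = regex.exec(text)
    if (match && (best === null || match[0].length > best.tokenValue.length)) {
      best = { tokenType, tokenValue: match[0] }
    }
  }
  return best
}

longestMatch(rules, '3.14') // { tokenType: 'FLOAT', tokenValue: '3.14' }
longestMatch(rules, '42') // { tokenType: 'INTEGER', tokenValue: '42' }
longestMatch(rules, '7.') // { tokenType: 'INTEGER', tokenValue: '7' }
```

Without the longest-match rule, '3.14' would be split after the INTEGER '3', because INTEGER comes first in the list and already matches the start of the string.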

Exclamation grammar: detects words (including words that contain å, ä, ö, Å, Ä or Ö) and exclamation marks using the following rules:

 [
    {
      tokenType: 'WORD',
      regex: /^[\w|åäöÅÄÖ]+/
    },
    {
      tokenType: 'EXCLAMATION',
      regex: /^\!/
    }
  ]

const exclamationGrammar = new ExclamationGrammar()

Use your own grammar:

import tokenizer from '@tamim.jabr/tokenizer'

const {
  Tokenizer,
  Grammar
} = tokenizer


const regexAndTypesList = [
  {
    tokenType: 'WORD_WITHOUT_NUMBERS',
    regex: /^[a-zA-Z]+/
  },
  {
    tokenType: 'DOLLAR_SIGN',
    regex: /^\$+/
  }
] 

const ownGrammar = new Grammar(regexAndTypesList)

const testString = '$ test string'
const newTokenizer = new Tokenizer(ownGrammar, testString)

newTokenizer.getActiveToken()
// expected : { tokenType: 'DOLLAR_SIGN', tokenValue: '$' }
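Every rule in the built-in grammars and in the custom grammar above starts with ^. The README does not explain why; a reasonable assumption is that the tokenizer matches each rule against the beginning of the remaining input. The snippet below uses plain JavaScript regular expressions (no package code) to show what the anchor changes.

```javascript
const input = '$ test string'

// Anchored: only matches if the input starts with a letter.
const anchoredMatch = /^[a-zA-Z]+/.exec(input) // null, the input starts with '$'

// Unanchored: searches the whole string and skips over '$ '.
const unanchoredMatch = /[a-zA-Z]+/.exec(input)
const skippedTo = unanchoredMatch.index // 2
const skippedValue = unanchoredMatch[0] // 'test'

// The + in the DOLLAR_SIGN rule groups consecutive dollar signs into one token value.
const dollars = /^\$+/.exec('$$ x')[0] // '$$'
```

Under that assumption, leaving out ^ in a custom rule could let a match start in the middle of the input, so anchoring each rule as the examples do is the safer choice.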

Public Interface (methods to use):

getActiveToken(): returns the current active token
moveActiveTokenToNext(): moves the active token to the next token if one exists; calling it when the active token is "END" throws "Invalid index for active token"
moveActiveTokenToPrevious(): moves the active token to the previous token if one exists
hasNext(): returns a boolean indicating whether there is a token after the current active token. Note that the token of type "END" is a valid token, so hasNext() returns true when the active token is the one just before "END"

Lexical Errors:

A lexical error occurs when the string you pass to the tokenizer contains a part that none of the regexes in the chosen grammar match. The error is thrown when you first try to move the active token onto the unmatched part, or when you call getActiveToken() and the unmatched part is the first token.

import tokenizer from '@tamim.jabr/tokenizer'

const { Tokenizer, ArithmeticGrammar } = tokenizer

const arithmeticGrammar = new ArithmeticGrammar()
const newTokenizer = new Tokenizer(arithmeticGrammar, '4 hej + 2')
newTokenizer.moveActiveTokenToNext()
// expected: LexicalError: No lexical element matches "hej + 2"
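Because the error is thrown from an ordinary method call, it can be handled with try/catch. The wrapper below is a minimal sketch; the name tryMoveToNext and the shape of its return value are hypothetical, and it relies only on the documented moveActiveTokenToNext() and getActiveToken() methods.

```javascript
// Hypothetical helper, not part of @tamim.jabr/tokenizer.
// Moves to the next token and reports a failure instead of throwing.
function tryMoveToNext (tokenizer) {
  try {
    tokenizer.moveActiveTokenToNext()
    return { ok: true, token: tokenizer.getActiveToken() }
  } catch (error) {
    // Catches any error, including "Invalid index for active token".
    return { ok: false, message: error.message }
  }
}

// Assumed usage with the package:
// const result = tryMoveToNext(new Tokenizer(new ArithmeticGrammar(), '4 hej + 2'))
// if (!result.ok) console.log(result.message)
```

The README does not document how to tell a lexical error apart from other errors, so this sketch does not try to; it simply passes the message along.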
