@david-heineback/tokenizer v1.1.5

Tokenizer

Description:

A small package to help you create tokens from a string.

How to install:

npm i @david-heineback/tokenizer

How to use:

Import Tokenizer and Grammars:

import tokenizer from '@david-heineback/tokenizer'
const { Tokenizer, Grammar } = tokenizer

Tokenizer

Tokenizer is a class that takes a grammar and a string as arguments and provides methods to inspect the active token and to step through the array of tokens.

const grammarType = new Grammar.WordGrammar() // use your desired grammar from the list of provided grammars, or create your own by following the instructions in the Create your own grammar section below.
const token = new Tokenizer(grammarType, 'String to tokenize')

Provided grammars

  • WordGrammar - Finds words and dots.
    • WORDS (Including åäö)
    • DOT (.)
  • ArithmeticGrammar - Finds arithmetic operations.
    • NUMBERS
    • ADD (+)
    • MULTIPLY (*)
    • SUBTRACT (-)
    • DIVIDE (/)
    • LEFT PARENTHESIS (
    • RIGHT PARENTHESIS )
    • EQUALS =
  • MaximalMunchGrammar - Finds floats and integers.
    • FLOAT
    • INTEGER
  • ExclamationGrammar - Finds words and exclamation marks.
    • WORDS (Excluding åäö)
    • EXCLAMATION (!)
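
As a quick illustration, here is a minimal sketch of the ArithmeticGrammar in use. The token object shape follows the expected output shown in the Example section; the exact token type strings are assumptions based on the list above.

import tokenizer from '@david-heineback/tokenizer'
const { Tokenizer, Grammar } = tokenizer

const arithmeticGrammar = new Grammar.ArithmeticGrammar()
const token = new Tokenizer(arithmeticGrammar, '3 + 4')

token.getActiveToken()
// assumed: { tokenType: 'NUMBERS', ..., tokenValue: '3' }
token.setActiveTokenToNext()
token.getActiveToken()
// assumed: { tokenType: 'ADD', ..., tokenValue: '+' }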

Create your own grammar

Create your own grammar either by creating a new Grammar and calling setRegexExpressionsWithTypes on it, or by creating your own class that extends the Grammar class and adds objects with tokenType and regex to the this.#regexWithTypes array. See the examples below.

Note that it is very important to put ^ at the beginning of each regex to ensure that it works with the tokenizer.
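
The anchor matters because the tokenizer repeatedly matches against the remaining input, and an unanchored regex is allowed to skip ahead of the current position. A small plain-JavaScript comparison, independent of this package:

const rest = 'foo. bar'
/\./.exec(rest)    // matches '.', even though 'foo' comes first
/^\./.exec(rest)   // null – no dot at the current position
/^\w+/.exec(rest)  // matches 'foo', the token at the front of the input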

Use the setRegexExpressionsWithTypes method directly on a Grammar instance.

import tokenizer from '@david-heineback/tokenizer'

const { Grammar } = tokenizer
const newGrammar = new Grammar.Grammar()

const grammarArray = [
  {
    tokenType: 'YOUR TOKEN TYPE',
    regex: /^[\w|åäöÅÄÖ]+/g
  },
  {
    tokenType: 'DOT',
    regex: /^\./g
  }
]
newGrammar.setRegexExpressionsWithTypes(grammarArray)
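
The configured grammar can then be passed to the Tokenizer like any of the provided grammars. A sketch, assuming the same constructor signature as in the Tokenizer section above:

const { Tokenizer } = tokenizer
const token = new Tokenizer(newGrammar, 'Hello world.')
token.getActiveToken()
// assumed: { tokenType: 'YOUR TOKEN TYPE', regex: /^[\w|åäöÅÄÖ]+/g, tokenValue: 'Hello' }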

Create your own grammar as a class by extending the Grammar class.

import Grammar from './Grammar.js'
export default class YourGrammar extends Grammar {
  #regexWithTypes

  constructor () {
    super()
    this.#regexWithTypes = [
      {
        tokenType: 'YOUR TOKEN TYPE',
        regex: /^[\w|åäöÅÄÖ]+/g
      },
      {
        tokenType: 'DOT',
        regex: /^\./g
      }
    ]
    this.setRegexExpressionsWithTypes(this.#regexWithTypes)
  }
}
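
Instantiating the subclass works like any other grammar. A sketch, assuming YourGrammar is exported from a local YourGrammar.js as in the example above:

import tokenizer from '@david-heineback/tokenizer'
import YourGrammar from './YourGrammar.js'

const { Tokenizer } = tokenizer
const token = new Tokenizer(new YourGrammar(), 'Hello world.')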

Methods:

Get the token at the current index.
  • getActiveToken() returns the currently active token as an object with the token type, the regex used to match, and the token value. Throws a LexicalError if trying to access an index that contains a token with a lexical error.
Set the active token to the next index.
  • setActiveTokenToNext() Throws a LexicalError if trying to access an index that contains a token with a lexical error.
Set the active token to the previous index.
  • setActiveTokenToPrevious() Throws an IndexError if trying to access an index before index 0.
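
Because these methods throw, calls can be wrapped in try/catch when the input may contain invalid tokens. A sketch; the error class names follow the list above, and inspecting err.message is an assumption about how the errors are exposed:

try {
  token.setActiveTokenToNext()
  console.log(token.getActiveToken())
} catch (err) {
  // assumed LexicalError: the token at this index could not be matched
  console.error(err.message)
}

try {
  token.setActiveTokenToPrevious()
} catch (err) {
  // assumed IndexError: already at index 0
  console.error(err.message)
}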

Reuse the same Tokenizer object:

Set a new string to tokenize by passing a new string.
  • token.setNewStringToTokenize('New string to tokenize')
Set a new lexical grammar by passing a new grammar and a string to tokenize.
  • setNewLexicalGrammar(arithmeticGrammar, 'string to tokenize')
Count the number of tokens. (The count includes invalid tokens!)
  • countTokens()

Example


import tokenizer from '@david-heineback/tokenizer'

const { Tokenizer, Grammar } = tokenizer

const wordGrammar = new Grammar.WordGrammar()
const arithmeticGrammar = new Grammar.ArithmeticGrammar()
const maximalMunchGrammar = new Grammar.MaximalMunchGrammar()
const exclamationGrammar = new Grammar.ExclamationGrammar()

// create a new Tokenizer
const token = new Tokenizer(wordGrammar, 'Hello World')

// Get the token at the current index.
token.getActiveToken()
// expected : { tokenType: 'WORD', regex: /^[\w|åäöÅÄÖ]+/g, tokenValue: 'Hello' }

// Set the active token to the next index.
token.setActiveTokenToNext()
token.getActiveToken()
// expected : { tokenType: 'WORD', regex: /^[\w|åäöÅÄÖ]+/g, tokenValue: 'World' }


// Set the active token to the previous index.
token.setActiveTokenToPrevious()
token.getActiveToken()
// expected : { tokenType: 'WORD', regex: /^[\w|åäöÅÄÖ]+/g, tokenValue: 'Hello' }

// Set a new string to tokenize by passing a new string.
token.setNewStringToTokenize('New string to tokenize')

// Set a new lexical grammar by passing a new grammar and a new string to tokenize with it.
token.setNewLexicalGrammar(arithmeticGrammar, 'string to tokenize')

// Count the number of tokens. (The count includes invalid tokens!)
token.countTokens()
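
Combining countTokens() with setActiveTokenToNext() gives a way to walk the whole token array. A sketch, assuming the tokenizer starts at index 0 and that no token in the string is invalid:

// Walk through all tokens in order.
const total = token.countTokens()
for (let i = 0; i < total; i++) {
  console.log(token.getActiveToken())
  if (i < total - 1) token.setActiveTokenToNext()
}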