npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

@millennial-sky/lex-with-regexps

v0.1.3

Published

A small library the provides a simple API to tokenize strings using regular expressions

Readme

Lexing with Regular Expressions

This is a small (~50 lines of code) library the provides a simple API to tokenize strings using regular expressions.

Overview

Here is an example that we will use throughout the documentation

import {lex} from "@millennial-sky/lex-with-regexps"

lex("one 42", {
  id: /[a-zA-Z_][a-zA-Z0-9_]*/,
  num: /\d+/,
  ws: /\s+/,
})

lex will return an array of tokens, which are objects of this form

{
  kind: "id",
  text: "one",
  loc: {index: 0, line: 1, column: 1}
}

The focus of the library is minimalism and ease of use in articles that have to parse small languages, not on performance (which shouldn't be too bad anyway, but we go over the input more than once to compute the location information).

The library is written in literate JavaScript, using @millennial-sky/literate.

Implementation

Given some string and an index, the lexer will need to find which (if any) of the regular expressions matches at that index.

Merging the regular expressions

We could loop through each regexp and try it, but it's better to let the JavaScript RegExp engine do the heavy lifting.

The idea is to build a regular expression that uses named capture groups to differentiate between the token kinds

const log = console.log
log(
  /(?<id>[a-zA-Z_][a-zA-Z0-9_]*)|(?<num>\d+)|(?<ws>\s+)/sy.exec("one 42")
)
//>  [
//>    'one',
//>    'one',
//>    undefined,
//>    undefined,
//>    index: 0,
//>    input: 'one 42',
//>    groups: [Object: null prototype] { id: 'one', num: undefined, ws: undefined }
//>  ]

The y flag (called "sticky") will allow us to decide at which index the match should begin. The s flag ("dotAll") makes the . match newlines.

The function that merges the regular expressions is this

const buildLexerRegExp = (regExps) => {
  let buf = []
  for (let [tokenKind, regexp] of Object.entries(regExps)) {
    buf.push(`(?<${tokenKind}>${regexp.source})`)
  }
  return new RegExp(buf.join("|"), "sy")
}

So in our example

const regExps = {
  id: /[a-zA-Z_][a-zA-Z0-9_]*/,
  num: /\d+/,
  ws: /\s+/,
}

const lexerRegExp = buildLexerRegExp(regExps)
log(lexerRegExp)
//>  /(?<id>[a-zA-Z_][a-zA-Z0-9_]*)|(?<num>\d+)|(?<ws>\s+)/sy

We could have used regular capture groups instead of the named ones, but that would have been less robust: the user would have had to remember to use non-capture groups (?:...) in their regexps to avoid bugs. On the other hand users have no reason to use named capture groups.

Finding the next token

Now that we have a single regular expression, we can use it to find the next token

const matchNext = (index, str, lexerRegExp) => {
  // Setting `lastIndex` (while using the sticky flag "y") is how you tell RegExp#exec 
  // that the match should start exactly at that index
  lexerRegExp.lastIndex = index
  const match = lexerRegExp.exec(str)
  if (!match) return null
  
  const tokenKind = Object.keys(match.groups).find((k) => match.groups[k])
  
  return {
    kind: tokenKind,
    text: match[0]
  }
}

so

log(matchNext(0, "one 42", lexerRegExp))
log(matchNext(3, "one 42", lexerRegExp))
log(matchNext(4, "one 42", lexerRegExp))
log(matchNext(0, "+", lexerRegExp))
//>  { kind: 'id', text: 'one' }
//>  { kind: 'ws', text: ' ' }
//>  { kind: 'num', text: '42' }
//>  null

The source location and its advancement

We want to add location information to the tokens, i.e. the index, line and column where the tokens begin.

The location of the first token is always

{index: 0, line: 1, column: 1}

After we match a token we can use its text to get the starting location of the next token. The rules are easy:

  • the index increases by the length of the token text
  • the line increases by the number of newlines in the token text
  • the column
    • if the token text contains a newline, it is the number of characters after the last newline
    • otherwise it increases by the length of the token text
const nextLocation = (loc, text) => {
  const index = loc.index + text.length
  
  const line = loc.line + (text.match(/\n/g)?.length ?? 0)
  
  const lastNewlineIndex = text.lastIndexOf("\n")
  const column = (lastNewlineIndex >= 0) ?
    text.length - lastNewlineIndex :
    loc.column + text.length

  return {index, line, column}
}

We try it out on a few made up token strings

log(nextLocation({index: 0, line: 1, column: 1}, "aaa \n bb"))
log(nextLocation({index: 0, line: 1, column: 1}, "aaa \n  \n bb"))
log(nextLocation({index: 0, line: 1, column: 1}, "aa"))
//>  { index: 8, line: 2, column: 4 }
//>  { index: 11, line: 3, column: 4 }
//>  { index: 2, line: 1, column: 3 }

Errors

If the lexer regexp doesn't match we want to throw a specialized Error containing all the relevant information

export class SyntaxError extends Error {
  constructor(loc, src, msg) {
    const line = src.split("\n")[loc.line - 1]
    const marker = " ".repeat(loc.column - 1) + "^"
    super(`${loc.line}:${loc.column}: ${msg}\n${line}\n${marker}`)
    this.loc = loc
    this.src = src
  }
}

For example

try {
  throw new SyntaxError({index: 3, line: 1, column: 4}, "one 42", "Made up error")
} catch (e) {
  log(e.message)
}
//>  1:4: Made up error
//>  one 42
//>     ^

Putting it all together

We can now write the function that does the lexing

export const lex = (src, regExps) => {
  const lexerRegExp = buildLexerRegExp(regExps)
  const tokens = []
  let loc = {index: 0, line: 1, column: 1}
  while (loc.index < src.length) {
    const match = matchNext(loc.index, src, lexerRegExp)
    if (match == null) throw new SyntaxError(loc, src, "Unknown token")
    tokens.push({...match, loc})
    loc = nextLocation(loc, match.text)
  }
  return tokens
}

Lets try it out

log(lex("one 42", regExps))
//>  [
//>    { kind: 'id', text: 'one', loc: { index: 0, line: 1, column: 1 } },
//>    { kind: 'ws', text: ' ', loc: { index: 3, line: 1, column: 4 } },
//>    { kind: 'num', text: '42', loc: { index: 4, line: 1, column: 5 } }
//>  ]

while a failure looks like this

try {
  lex("one + 42", regExps)
} catch (e) {
  log(e.message)
}
//>  1:5: Unknown token
//>  one + 42
//>      ^