@gruhn/regex-utils

v2.11.0

Published

a month ago

TypeScript library for regex equivalence, intersection, complement and other utilities that go beyond string matching.

0High
0Medium
0Low

Regex Utils

Zero-dependency TypeScript library for regex utilities that go beyond string matching. These are surprisingly hard to come by for any programming language. ✨

Documentation
Online demos:
- RegExp Equivalence Checker
- Random Password Generator

API Overview 🚀

🔗 Set-style operations:
- .and(...) - Compute intersection of two regex.
- .not() - Compute the complement of a regex.
- .without(...) - Compute the difference of two regex.
✅ Set-style predicates:
- .isEquivalent(...) - Check whether two regex match the same strings.
- .isSubsetOf(...)
- .isSupersetOf(...)
- .isDisjointFrom(...)
- .isEmpty() - Check whether a regex matches no strings.
📜 Generate strings:
- .sample(...) - Generate random strings matching a regex.
- .enumerate() - Exhaustively enumerate strings matching a regex.
🔧 Miscellaneous:
- .size() - Count the number of strings that a regex matches.
- .derivative(...) - Compute a Brzozowski derivative of a regex.
and others...

Installation 📦

npm install @gruhn/regex-utils

import { RB } from '@gruhn/regex-utils'

Syntax Support

| Feature | Support | Examples | |---------|---------|-------------| | Quantifiers | ✅ | a*, a+, a{3,10}, a? | | Lazy Quantifiers | ✅ | a*?, a+?, a{3,10}?, a?? | | Alternation | ✅ | a\|b | | Character classes | ✅ | ., \w, [a-zA-Z], ... | | Escaping | ✅ | \$, \., ... | | (Non-)capturing groups | ✅ | (?:...), (...) | | Start/end anchors | ⚠️1 | ^, $ | | Lookahead | ⚠️2 | (?=...), (?!...) | | Lookbehind | ⚠️2 | (?<=...), (?<!...) | | Word boundary | ❌ | \b, \B | | Unicode property escapes | ❌ | \p{...}, \P{...} | | Backreferences | ❌ | \1 \2 ... | | dotAll flag | ✅ | /.../s, (?s:...) | | global flag | ✅ | /.../g | | hasIndices flag | ✅ | /.../d | | ignoreCase flag | ❌ | /.../i (?i:...) | | multiline flag | ❌ | /.../m (?m:...) | | unicode flag | ❌ | /.../u | | unicodeSets flag | ❌ | /.../v | | sticky flag | ❌ | /.../y |

Some complex patterns are not supported like anchors inside quantifiers (^a)+ or anchors inside lookaheads (?=^a).
Not supported are nested lookaheads/lookbehinds like (?=a(?=b)) and lookaheads/lookbehinds combinations like (?=a)b(?<=c).

An UnsupportedSyntaxError is thrown when unsupported patterns are detected. The library SHOULD ALWAYS either throw an error or respect the regex specification exactly. Please report a bug if the library silently uses a faulty interpretation.

Handling syntax-related errors:

import { RB, ParseError, UnsupportedSyntaxError } from '@gruhn/regex-utils'

try {
  RB(/^[a-z]*$/)
} catch (error) {
  if (error instanceof SyntaxError) {
    // Invalid regex syntax! Native error, not emitted by this library.
    // E.g. this will also throw a `SyntaxError`: new RegExp(')')
  } else if (error instanceof ParseError) {
    // The regex syntax is valid but the internal parser could not handle it.
    // If this happens it's a bug in this library.
  } else if (error instanceof UnsupportedSyntaxError) {
    // Regex syntax is valid but not supported by this library.
  }
}

Example use cases 💡

Generate test data from regex 📜

Generate 5 random email addresses:

const email = RB(/^[a-z]+@[a-z]+\.[a-z]{2,3}$/)
for (const str of email.sample().take(5)) {
  console.log(str)
}

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

Generate 5 random email addresses, which have exactly 20 characters:

const emailLength20 = email.and(/^.{20}$/)
for (const str of emailLength20.sample().take(5)) {
  console.log(str)
}

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

Refactor regex then check equivalence 🔄

ONLINE DEMO

Say we found this incredibly complicated regex somewhere in the codebase:

const oldRegex = /^a|b$/

This can be simplified, right?

const newRegex = /^[ab]$/

But to double-check we can use .isEquivalent to verify that the new version matches exactly the same strings as the old version. That is, whether oldRegex.test(str) === newRegex.test(str) for every possible input string:

RB(oldRegex).isEquivalent(newRegex) // false

Looks like we made some mistake. We can generate counterexamples using .without(...) and .sample(...). First, we derive new regex that match exactly what newRegex matches but not oldRegex and vice versa:

const onlyNew = RB(newRegex).without(oldRegex)
const onlyOld = RB(oldRegex).without(newRegex)

onlyNew turns out to be empty (onlyNew.isEmpty() === true) but onlyOld has some matches:

for (const str of onlyOld.sample().take(5)) {
  console.log(str)
}

aaba
aa
aba
bab
aababa

Why does oldRegex match all these strings with multiple characters? Shouldn't it only match "a" or "b" like newRegex? Turns out we thought that oldRegex is the same as ^(a|b)$ but in reality it's the same as (^a)|(b$).

Comment regex using complement 💬

How do you write a regex that matches HTML comments like:

<!-- This is a comment -->

A straightforward attempt would be:

<!--.*-->

The problem is that .* also matches the end marker -->, so this is also a match:

<!-- This is a comment --> and this shouldn't be part of it -->

We need to specify that the inner part can be any string that does not contain -->. With .not() (aka. regex complement) this is easy:

import { RB } from '@gruhn/regex-utils'

const commentStart = RB('<!--')
const commentInner = RB(/^.*-->.*$/).not()
const commentEnd = RB('-->')

const comment = commentStart.concat(commentInner).concat(commentEnd)

With .toRegExp() we can convert back to a native JavaScript regex:

comment.toRegExp()

/^<!--(---*[^->]|-?[^-])*---*>$/

Password regex using intersections 🔐

ONLINE DEMO

It's difficult to write a single regex for multiple independent constraints. For example, to specify a valid password. But with regex intersections it's very natural:

import { RB } from '@gruhn/regex-utils'

const passwordRegex = RB(/^[a-zA-Z0-9]{12,32}$/) // 12-32 alphanumeric characters
  .and(/[0-9]/) // contains a number
  .and(/[A-Z]/) // contains an upper case letter
  .and(/[a-z]/) // contains a lower case letter

We can convert this back to a native JavaScript RegExp with:

passwordRegex.toRegExp()

[!NOTE] The output RegExp can be very large.

We can also use other utilities like .size() to determine how many potential passwords match this regex:

console.log(passwordRegex.size())

2301586451429392354821768871006991487961066695735482449920n

With .sample() we can generate some of these matches:

for (const str of passwordRegex.sample().take(10)) {
  console.log(str)
}

NEWJIAXQISWT0Wwm
lxoegadrzeynezkmtfcIBzzQ9e
ypzvhvtwpWk4u6
MSZXXKIKEKWKXLQ8HQ7Ds
BCBSFBSMNOLKlgQN5L
8950244600709IW1pg
UOTQBLVOTZQWFSAJYBXZNQBEeom0l
520302447164378435bv4dp4ysC
71073970686490eY2Jt4
afgpnxqwUK5B

Solve Advent Of Code 2023 - Day 12 🎄

In the coding puzzle Advent Of Code 2023 - Day 12 you are given pairs of string patterns. An example pair is .??..??...?##. and 1,1,3. Both patterns describe a class of strings and the task is to count the number of strings that match both patterns.

In the first pattern, . and # stand for the literal characters "dot" and "hash". The ? stands for either . or #. This can be written as a regular expression:

for # we simply write #
for . we write o (since . is a reserved symbol in regular expressions)
for ? we write (o|#)

So the pattern .??..??...?##. would be written as:

const firstRegex = /^o(o|#)(o|#)oo(o|#)(o|#)ooo(o|#)##o$/

In the second pattern, each digit stands for a sequence of # separated by at least one o. This can also be written as a regular expression:

For a digit like 3 we write #{3}.
Between digits we write o+.
Additionally, arbitrary many o are allowed at the start and end, so we add o* at the start and end.

Thus, 1,1,3 would be written as:

const secondRegex = /^o*#{1}o+#{1}o+#{3}o*$/

To solve the task and find the number of strings that match both regex, we can use .and(...) and .size() from regex-utils. .and(...) computes the intersection of two regular expressions. That is, it creates a new regex which exactly matches the strings matched by both input regex.

const intersection = RB(firstRegex).and(secondRegex)

With .size() we can then determine the number of matched strings:

console.log(intersection.size())

4n

While at it, we can also try .enumerate() to list all these matches:

for (const str of intersection.enumerate()) {
  console.log(str)
}

oo#ooo#ooo###o
o#oooo#ooo###o
oo#oo#oooo###o
o#ooo#oooo###o

For a full solution checkout: ./benchmark/aoc2023-day12.ts.

References 📖

Heavily informed by these papers:

https://www.khoury.northeastern.edu/home/turon/re-deriv.pdf
https://courses.grainger.illinois.edu/cs374/fa2017/extra_notes/01_nfa_to_reg.pdf