@davidflanagan/tsrex

v0.1.0

Published

a month ago

RE2 regular expressions for TypeScript: linear-time pattern matching, immune from ReDos.

Downloads

0High
0Medium
0Low

davidflanagan

TypeScript RE2 RegExp RegEx

tsrex

A TypeScript RE2 regular expression engine. Ported from Go 1.26.2's regexp package with a Lazy DFA for fast boolean matching.

[!WARNING] This port of the Go regexp package to TypeScript was created (including most of this README file) almost entirely by Claude Code in a vibe coding experiment. Because the Go code has such a robust test suite (which has also been ported) I have reasonable confidence that the code works correctly. Benchmarks (which are not yet ready for release) suggest that this package is generally faster than re2js and also produces smaller bundles. I'm releasing this early version even though I have not yet reviewed all of the generated code because I believe it may be generally useful to the JavaScript ecosystem.

[!NOTE] I created this port during my time at Buf Technologies, Inc. (https://buf.build). Buf open-sourced the code under the Apache 2.0 license by publishing this commit: https://github.com/bufbuild/buf-examples/commit/22e0e81c3204a2631cc73d911d87a7adafe3ba39 This repo is effectively a fork of the work I did while at Buf. The NOTICE file preserves Buf's copyright in the work.

Features

RE2 semantics: linear-time matching, no catastrophic backtracking; should be immune to ReDoS attacks
String or Uint8Array: Rex accepts either a JS string (UTF-16 code-unit offsets) or a Uint8Array (UTF-8 byte offsets)
Native inputs throughout: string calls never encode to UTF-8; byte calls never decode
TinyRex for minimal package size: users who only need boolean matching on strings can import TinyRex to minimize bundle size.
Idiomatic TypeScript: rich Match objects with lazy fields, iterator support, well-known Symbol.* integration
Zero runtime dependencies

Quick Start

import { Rex } from "@davidflanagan/tsrex";

const re = Rex.compile("(\\w+)@(\\w+)");

// String input — UTF-16 code-unit offsets
re.test("user@host");                     // true
re.match("email: user@host")?.text;       // "user@host"
re.match("email: user@host")?.start;      // 7
re.matchAll("a@b c@d").map((m) => m.text); // ["a@b", "c@d"]
re.replace("a@b c@d", "X");               // "X c@d"   (first match)
re.replace("a@b c@d", "X", -1);           // "X X"     (all matches)
re.replace("a@b", (m) => `${m.capture(1)}=${m.capture(0)}`); // "b=a"
Rex.compile(",\\s*").split("a, b, c");    // ["a", "b", "c"]

// Byte input — UTF-8 byte offsets
const buf = new TextEncoder().encode("email: user@host");
re.test(buf);                             // true
re.match(buf)?.text;                      // Uint8Array of bytes "user@host"
re.match(buf)?.start;                     // 7
re.replace(buf, new TextEncoder().encode("X"));
re.split(buf);                            // Uint8Array[]

The same compiled Rex serves both input types: pass whichever you have and you get matches back in the same units. Positions (match.start, match.end, match.captureRange(i)) are UTF-16 code-unit offsets for string inputs (compatible with s.substring(start, end), s.charAt(pos), s.length) and UTF-8 byte offsets for Uint8Array inputs (matching Go's regexp on []byte).

If you only want boolean pattern matching against strings (i.e. no Uint8Array inputs and no match position or capturing group outputs) you can use TinyRex instead of Rex to significantly decrease your bundle size:

import { TinyRex } from "@davidflanagan/tsrex/tiny";

const re = TinyRex.compile("\\b\\w+@\\w+\\b");
re.test("contact: user@host");  // true
re.test("no email here");       // false

API

Rex satisfies both Matcher<string> and Matcher<Uint8Array> via overloaded methods, so code written against Matcher<T> for either T works unchanged.

Factory and compile options

Rex.compile(pattern: string, options?: CompileOptions): Rex

type CompileOptions = {
  posix?: boolean;           // POSIX ERE — disables Perl extensions
  longest?: boolean;         // leftmost-longest (POSIX) semantics
  caseInsensitive?: boolean; // like (?i)
  multiline?: boolean;       // like (?m) — ^/$ match line boundaries
  dotAll?: boolean;          // like (?s) — . matches newline
  swapGreedy?: boolean;      // like (?U) — flip greedy/non-greedy
  literal?: boolean;         // treat pattern as a plain string
  global?: boolean;          // "pretend g flag" — makes the Rex compatible
                             // with String.prototype.replaceAll / matchAll
};

Rex.compile throws ParseError (a subclass of the built-in SyntaxError) on invalid syntax.

Methods

// Boolean
re.test(input): boolean

// First leftmost match (or null)
re.match(input): Match<T> | null

// All non-overlapping matches. Omit limit for unlimited; limit === 0 returns [].
re.matchAll(input, limit?): Match<T>[]
re.matches(input, limit?): IterableIterator<Match<T>>  // to iterate over matches

// Replace. `replacement` is either a literal T (no $-expansion) or a
// callback that receives each Match and returns its replacement.
// Omit limit for unlimited; limit === 0 is a no-op.
re.replace(src, replacement, limit?): T

// Split. limit < 0 or omitted = all pieces; limit === 0 returns null;
// limit > 0 returns at most `limit` pieces.
re.split(input, limit?): T[] | null

// Metadata
re.captureNames: readonly (string | null)[]  // one entry per capturing
                                             // group, null if unnamed
re.flags: string                             // "" or "g" (see `global` option)
re.toString(): string                        // original pattern

Well-known symbols

Rex implements Symbol.match, Symbol.matchAll, Symbol.search, Symbol.split, and Symbol.replace, so native String.prototype methods delegate to it:

const re = Rex.compile("\\d+");
"abc 42".match(re);                    // Match with .text === "42"
"abc 42".search(re);                   // 4
"a,b,c".split(Rex.compile(","));       // ["a", "b", "c"]
"1 22 333".replace(re, "X");           // "X 22 333" (first match)

// String.prototype.replaceAll and matchAll require flags === "g",
// so compile with { global: true }:
const reg = Rex.compile("\\d+", { global: true });
"1 22 333".replaceAll(reg, "X");       // "X X X"
[..."1 22 333".matchAll(reg)].map((m) => m.text); // ["1", "22", "333"]

The `Match<T>` object

class Match<T> {
  readonly input: T;
  readonly start: number;     // overall match start
  readonly end: number;       // overall match end (half-open)
  readonly text: T;           // overall match text (lazy)

  // group(g): historical semantics — g=0 is the overall match,
  //   g=1..N are the capturing groups; strings look up named groups.
  group(g: number | string): T | undefined;
  groupRange(g: number | string): readonly [number, number] | undefined;

  // capture(c): zero-indexed over capturing groups — c=0 is the first
  //   capture. Strings look up named groups (same as group(name)).
  capture(c: number | string): T | undefined;
  captureRange(c: number | string): readonly [number, number] | undefined;

  // Aggregate views — lazy.
  readonly captures: readonly (T | undefined)[];            // zero-indexed
  readonly namedCaptures: Readonly<Record<string, T | undefined>>;
}

undefined means the group didn't participate in the match (for example, the untaken branch of an alternation), or the name/index didn't resolve.

Offsets

Passing a string returns UTF-16 code-unit positions — the same units JS uses for s.substring(...), s.length, etc.
Passing a Uint8Array returns UTF-8 byte positions, matching Go's regexp on []byte.

If you hold a JS string and need byte offsets, encode it yourself and call with the resulting Uint8Array. If you have byte data and want code-unit offsets, decode first and call with the resulting string. The library does not translate between them.

Supported Syntax

Full RE2/Perl syntax including:

Character classes: [abc], [a-z], [[:alpha:]], \d, \w, \s
Repetition: *, +, ?, {n}, {n,}, {n,m}, non-greedy *?, +?, ??
Grouping: (...), (?:...), (?P<name>...)
Anchors: ^, $, \A, \z, \b, \B
Inline flags: (?i), (?m), (?s), (?-m) (also settable via CompileOptions)
Escape sequences: \n, \t, \x{FF}, \077, etc.

Unicode property classes (\p{Lu}, \p{Greek}, etc.) work out of the box — see below.

Unicode Properties

Unicode character classes (\p{...} / \P{...}) work with no setup:

import { Rex } from "@davidflanagan/tsrex";

Rex.compile("\\p{Greek}+").match("hello αβγ world")?.text;       // "αβγ"
Rex.compile("\\p{Uppercase_Letter}+").match("ABC def")?.text;    // "ABC"
Rex.compile("\\p{Script=Devanagari}+").match("नमस्ते")?.text;       // "नमस्ते"

Under the hood there are two paths, chosen automatically per property:

Registered tables (tree-shakeable). Import a category, script, or binary property as a side effect to pre-load its range + fold tables:
```
import "@davidflanagan/tsrex/unicode16/categories/Lu";
import "@davidflanagan/tsrex/unicode16/scripts/Greek";
import "@davidflanagan/tsrex/unicode16/properties/Alphabetic";
import "@davidflanagan/tsrex/unicode16/all";            // or load everything
```
Registered tables are generated from @unicode/unicode-16.0.0 (a dev-only dependency) and encoded with that package's compact base64 RLE format, decoded lazily on first use. Every canonical alias is pre-registered, so importing .../categories/Lu also makes \p{Uppercase_Letter} resolve. A sibling fold table ships alongside so (?i)\p{...} folds without walking the fold orbit at parse time.
Runtime extraction (fallback). If a property isn't registered, tsrex extracts its range table from the host JS engine's Unicode database on first use (via a native \p{...}-with-/u probe) and caches it for subsequent compiles. First-use extraction is ~30–50 ms per property; subsequent uses are free. The extracted table tracks whatever Unicode version the host engine ships, and (?i) folding happens on demand via the generic fold path.

For latency-sensitive startup where you know which properties you'll need but don't want to import them, warm the extraction cache up front:

Rex.preloadUnicode(["Lu", "Ll", "Greek", "Han"]);

The built-ins \p{Any}, \p{ASCII}, and \p{Assigned} need neither an import nor extraction. Long aliases (\p{Uppercase_Letter} for \p{Lu}) and case-insensitive names work on both paths. References to properties neither registered nor recognized by the host engine cause Rex.compile to throw a ParseError.

A handful of very recent script names (Beria_Erfe, Katakana_Or_Hiragana, Sidetic, Tai_Yo, Tolong_Siki) exist in the Unicode property-value aliases database but haven't been added to @unicode/unicode-16.0.0's data tables yet; these aren't pre-registered. They still resolve via the extraction fallback if your JS engine recognizes them.

Architecture

The engine selects the fastest execution strategy automatically:

Literal fast path: pure literal patterns use a single bytesIndex / String.indexOf.
Lazy DFA: cached NFA state-set transitions for boolean matching (test). Can't report positions or captures — only yes/no.
OnePass: single-pass DFA for patterns whose alternations are unambiguous by rune. Handles captures.
Backtracker / NFA: general-purpose, capture-capable fallbacks.

Input is consumed via a RuneGetter<P> abstraction with two concrete implementations — a string getter that reads via charCodeAt / codePointAt (UTF-16 surrogate-aware) and a bytes getter that reads via direct indexing and UTF-8 decoding. Every engine (DFA, OnePass, Backtracker, NFA) talks to the getter rather than to a raw input, so one engine implementation serves both input shapes.

A prefilter extracts required literal substrings from the AST and rejects non-matching inputs before engaging any automaton.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme