@davidflanagan/tsrex
v0.1.0
Published
RE2 regular expressions for TypeScript: linear-time pattern matching, immune from ReDos.
Downloads
26
Maintainers
Readme
tsrex
A TypeScript RE2 regular expression engine. Ported from Go 1.26.2's regexp
package with a Lazy DFA for fast boolean matching.
[!WARNING] This port of the Go regexp package to TypeScript was created (including most of this README file) almost entirely by Claude Code in a vibe coding experiment. Because the Go code has such a robust test suite (which has also been ported) I have reasonable confidence that the code works correctly. Benchmarks (which are not yet ready for release) suggest that this package is generally faster than re2js and also produces smaller bundles. I'm releasing this early version even though I have not yet reviewed all of the generated code because I believe it may be generally useful to the JavaScript ecosystem.
[!NOTE] I created this port during my time at Buf Technologies, Inc. (https://buf.build). Buf open-sourced the code under the Apache 2.0 license by publishing this commit: https://github.com/bufbuild/buf-examples/commit/22e0e81c3204a2631cc73d911d87a7adafe3ba39 This repo is effectively a fork of the work I did while at Buf. The NOTICE file preserves Buf's copyright in the work.
Features
- RE2 semantics: linear-time matching, no catastrophic backtracking; should be immune to ReDoS attacks
- String or Uint8Array:
Rexaccepts either a JS string (UTF-16 code-unit offsets) or aUint8Array(UTF-8 byte offsets) - Native inputs throughout: string calls never encode to UTF-8; byte calls never decode
- TinyRex for minimal package size: users who only need boolean matching
on strings can import
TinyRexto minimize bundle size. - Idiomatic TypeScript: rich
Matchobjects with lazy fields, iterator support, well-knownSymbol.*integration - Zero runtime dependencies
Quick Start
import { Rex } from "@davidflanagan/tsrex";
const re = Rex.compile("(\\w+)@(\\w+)");
// String input — UTF-16 code-unit offsets
re.test("user@host"); // true
re.match("email: user@host")?.text; // "user@host"
re.match("email: user@host")?.start; // 7
re.matchAll("a@b c@d").map((m) => m.text); // ["a@b", "c@d"]
re.replace("a@b c@d", "X"); // "X c@d" (first match)
re.replace("a@b c@d", "X", -1); // "X X" (all matches)
re.replace("a@b", (m) => `${m.capture(1)}=${m.capture(0)}`); // "b=a"
Rex.compile(",\\s*").split("a, b, c"); // ["a", "b", "c"]
// Byte input — UTF-8 byte offsets
const buf = new TextEncoder().encode("email: user@host");
re.test(buf); // true
re.match(buf)?.text; // Uint8Array of bytes "user@host"
re.match(buf)?.start; // 7
re.replace(buf, new TextEncoder().encode("X"));
re.split(buf); // Uint8Array[]The same compiled Rex serves both input types: pass whichever you have and
you get matches back in the same units. Positions (match.start, match.end,
match.captureRange(i)) are UTF-16 code-unit offsets for string inputs
(compatible with s.substring(start, end), s.charAt(pos), s.length) and
UTF-8 byte offsets for Uint8Array inputs (matching Go's regexp on
[]byte).
If you only want boolean pattern matching against strings (i.e. no Uint8Array
inputs and no match position or capturing group outputs) you can use TinyRex
instead of Rex to significantly decrease your bundle size:
import { TinyRex } from "@davidflanagan/tsrex/tiny";
const re = TinyRex.compile("\\b\\w+@\\w+\\b");
re.test("contact: user@host"); // true
re.test("no email here"); // falseAPI
Rex satisfies both Matcher<string> and Matcher<Uint8Array> via overloaded methods, so code written against Matcher<T> for either T works unchanged.
Factory and compile options
Rex.compile(pattern: string, options?: CompileOptions): Rex
type CompileOptions = {
posix?: boolean; // POSIX ERE — disables Perl extensions
longest?: boolean; // leftmost-longest (POSIX) semantics
caseInsensitive?: boolean; // like (?i)
multiline?: boolean; // like (?m) — ^/$ match line boundaries
dotAll?: boolean; // like (?s) — . matches newline
swapGreedy?: boolean; // like (?U) — flip greedy/non-greedy
literal?: boolean; // treat pattern as a plain string
global?: boolean; // "pretend g flag" — makes the Rex compatible
// with String.prototype.replaceAll / matchAll
};Rex.compile throws ParseError (a subclass of the built-in SyntaxError) on invalid syntax.
Methods
// Boolean
re.test(input): boolean
// First leftmost match (or null)
re.match(input): Match<T> | null
// All non-overlapping matches. Omit limit for unlimited; limit === 0 returns [].
re.matchAll(input, limit?): Match<T>[]
re.matches(input, limit?): IterableIterator<Match<T>> // to iterate over matches
// Replace. `replacement` is either a literal T (no $-expansion) or a
// callback that receives each Match and returns its replacement.
// Omit limit for unlimited; limit === 0 is a no-op.
re.replace(src, replacement, limit?): T
// Split. limit < 0 or omitted = all pieces; limit === 0 returns null;
// limit > 0 returns at most `limit` pieces.
re.split(input, limit?): T[] | null
// Metadata
re.captureNames: readonly (string | null)[] // one entry per capturing
// group, null if unnamed
re.flags: string // "" or "g" (see `global` option)
re.toString(): string // original patternWell-known symbols
Rex implements Symbol.match, Symbol.matchAll, Symbol.search, Symbol.split, and Symbol.replace, so native String.prototype methods delegate to it:
const re = Rex.compile("\\d+");
"abc 42".match(re); // Match with .text === "42"
"abc 42".search(re); // 4
"a,b,c".split(Rex.compile(",")); // ["a", "b", "c"]
"1 22 333".replace(re, "X"); // "X 22 333" (first match)
// String.prototype.replaceAll and matchAll require flags === "g",
// so compile with { global: true }:
const reg = Rex.compile("\\d+", { global: true });
"1 22 333".replaceAll(reg, "X"); // "X X X"
[..."1 22 333".matchAll(reg)].map((m) => m.text); // ["1", "22", "333"]The Match<T> object
class Match<T> {
readonly input: T;
readonly start: number; // overall match start
readonly end: number; // overall match end (half-open)
readonly text: T; // overall match text (lazy)
// group(g): historical semantics — g=0 is the overall match,
// g=1..N are the capturing groups; strings look up named groups.
group(g: number | string): T | undefined;
groupRange(g: number | string): readonly [number, number] | undefined;
// capture(c): zero-indexed over capturing groups — c=0 is the first
// capture. Strings look up named groups (same as group(name)).
capture(c: number | string): T | undefined;
captureRange(c: number | string): readonly [number, number] | undefined;
// Aggregate views — lazy.
readonly captures: readonly (T | undefined)[]; // zero-indexed
readonly namedCaptures: Readonly<Record<string, T | undefined>>;
}undefined means the group didn't participate in the match (for example, the untaken branch of an alternation), or the name/index didn't resolve.
Offsets
- Passing a string returns UTF-16 code-unit positions — the same units JS uses for
s.substring(...),s.length, etc. - Passing a
Uint8Arrayreturns UTF-8 byte positions, matching Go'sregexpon[]byte.
If you hold a JS string and need byte offsets, encode it yourself and call with the resulting Uint8Array. If you have byte data and want code-unit offsets, decode first and call with the resulting string. The library does not translate between them.
Supported Syntax
Full RE2/Perl syntax including:
- Character classes:
[abc],[a-z],[[:alpha:]],\d,\w,\s - Repetition:
*,+,?,{n},{n,},{n,m}, non-greedy*?,+?,?? - Grouping:
(...),(?:...),(?P<name>...) - Anchors:
^,$,\A,\z,\b,\B - Inline flags:
(?i),(?m),(?s),(?-m)(also settable viaCompileOptions) - Escape sequences:
\n,\t,\x{FF},\077, etc.
Unicode property classes (\p{Lu}, \p{Greek}, etc.) work out of the box — see below.
Unicode Properties
Unicode character classes (\p{...} / \P{...}) work with no setup:
import { Rex } from "@davidflanagan/tsrex";
Rex.compile("\\p{Greek}+").match("hello αβγ world")?.text; // "αβγ"
Rex.compile("\\p{Uppercase_Letter}+").match("ABC def")?.text; // "ABC"
Rex.compile("\\p{Script=Devanagari}+").match("नमस्ते")?.text; // "नमस्ते"Under the hood there are two paths, chosen automatically per property:
Registered tables (tree-shakeable). Import a category, script, or binary property as a side effect to pre-load its range + fold tables:
import "@davidflanagan/tsrex/unicode16/categories/Lu"; import "@davidflanagan/tsrex/unicode16/scripts/Greek"; import "@davidflanagan/tsrex/unicode16/properties/Alphabetic"; import "@davidflanagan/tsrex/unicode16/all"; // or load everythingRegistered tables are generated from
@unicode/unicode-16.0.0(a dev-only dependency) and encoded with that package's compact base64 RLE format, decoded lazily on first use. Every canonical alias is pre-registered, so importing.../categories/Lualso makes\p{Uppercase_Letter}resolve. A sibling fold table ships alongside so(?i)\p{...}folds without walking the fold orbit at parse time.Runtime extraction (fallback). If a property isn't registered, tsrex extracts its range table from the host JS engine's Unicode database on first use (via a native
\p{...}-with-/u probe) and caches it for subsequent compiles. First-use extraction is ~30–50 ms per property; subsequent uses are free. The extracted table tracks whatever Unicode version the host engine ships, and (?i) folding happens on demand via the generic fold path.
For latency-sensitive startup where you know which properties you'll need but don't want to import them, warm the extraction cache up front:
Rex.preloadUnicode(["Lu", "Ll", "Greek", "Han"]);The built-ins \p{Any}, \p{ASCII}, and \p{Assigned} need neither an
import nor extraction. Long aliases (\p{Uppercase_Letter} for \p{Lu})
and case-insensitive names work on both paths. References to properties
neither registered nor recognized by the host engine cause Rex.compile
to throw a ParseError.
A handful of very recent script names (Beria_Erfe, Katakana_Or_Hiragana,
Sidetic, Tai_Yo, Tolong_Siki) exist in the Unicode property-value
aliases database but haven't been added to @unicode/unicode-16.0.0's
data tables yet; these aren't pre-registered. They still resolve via the
extraction fallback if your JS engine recognizes them.
Architecture
The engine selects the fastest execution strategy automatically:
- Literal fast path: pure literal patterns use a single
bytesIndex/String.indexOf. - Lazy DFA: cached NFA state-set transitions for boolean matching (
test). Can't report positions or captures — only yes/no. - OnePass: single-pass DFA for patterns whose alternations are unambiguous by rune. Handles captures.
- Backtracker / NFA: general-purpose, capture-capable fallbacks.
Input is consumed via a RuneGetter<P> abstraction with two concrete implementations — a string getter that reads via charCodeAt / codePointAt (UTF-16 surrogate-aware) and a bytes getter that reads via direct indexing and UTF-8 decoding. Every engine (DFA, OnePass, Backtracker, NFA) talks to the getter rather than to a raw input, so one engine implementation serves both input shapes.
A prefilter extracts required literal substrings from the AST and rejects non-matching inputs before engaging any automaton.
