ez-regex-patterns

v1.0.2

Published

5 hours ago

Readable, composable patterns that compile to native RegExp.

oosukeren

regex regexp pattern dsl composable template-literal

Regex isn't unreadable because of its primitives — [\w-]+ is terse and fine. It's unreadable because it has no naming and no composition: one undifferentiated expression, no sub-parts you can name and reuse, capture groups far from their meaning.

This library adds exactly that and nothing else. You write fragments in a small, grammar-like language, name them as ordinary variables, and compose them by interpolation. The result compiles — once, at definition time — to a real regex.

Install

npm install ez-regex-patterns

Usage

import { pattern } from "ez-regex-patterns";

const alpha = pattern`'a'..'z' | 'A'..'Z'`;
const digit = pattern`'0'..'9'`;
const word  = pattern`${alpha} | ${digit} | '_' | '-'`;
const ident = pattern`${word}+`;

ident.source;   // "[a-zA-Z0-9_\\-]+"

A Pattern is usable anywhere a RegExp is — it carries the native surface (exec, lastIndex, test, source, flags, and the Symbol.replace/match/split hooks) and drives the String methods directly:

ident.test(input);
ident.exec(input);                         // set `ident.lastIndex = i` first to scan from a cursor
input.replace(ident, (m) => m.toUpperCase());

// The DSL also adds a parts-returning matcher:
ident.match(input);                        // { text, index, parts } | undefined
ident.matchAll(input);                     // every match, with parts

Patterns are flag-agnostic; ask for a flag only where you need it. .global, .ignoreCase, and .sticky each return a copy with that flag set. The copies share the underlying node and differ only in their flags, and the accessors compose:

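// `grammar` is exported next to `pattern`; `tag`, `scanner`, and `fix` stand in for your own patterns and replacer.
const stmt = grammar`r = start 'use ' word+ ';'`.r;
input.replace(stmt.global, fix);           // every occurrence
tag.ignoreCase.test('<SCRIPT>');           // case-insensitive
scanner.sticky;                            // anchors each match at lastIndex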

.toRegExp(flags) is still there for an explicit copy under a one-off flag set.
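
For reference, this is what the three flags mean on the native side. The sketch below uses only the built-in RegExp constructor and the source string shown for ident above; it does not call the library, so it illustrates the flag semantics rather than the library's own accessors.

```javascript
// Native RegExp only; the source string is the one shown for `ident` above.
const source = "[a-zA-Z0-9_\\-]+";
const text = "foo bar-baz";
const upper = (m) => m.toUpperCase();

// No flags: String.prototype.replace rewrites only the first match.
const firstOnly = text.replace(new RegExp(source), upper);       // "FOO bar-baz"

// g: every occurrence is rewritten.
const everyMatch = text.replace(new RegExp(source, "g"), upper); // "FOO BAR-BAZ"

// i: case-insensitive matching.
const sawTag = new RegExp("<script>", "i").test("<SCRIPT>");     // true

// y: the match must begin exactly at lastIndex, with no scanning ahead.
const sticky = new RegExp(source, "y");
sticky.lastIndex = 3;                  // index 3 is the space
const atSpace = sticky.exec(text);     // null
sticky.lastIndex = 4;                  // index 4 is the "b" of "bar-baz"
const atWord = sticky.exec(text)[0];   // "bar-baz"
```

Sticky is the flag to reach for when writing a tokenizer: a failed match tells you the cursor position is wrong instead of silently skipping ahead.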

The emitter does the two things hand-written regex makes you do by eye:

Merges an alternation of single chars / ranges into one class — 'a'..'z' | 'A'..'Z' | '_' becomes [a-zA-Z_], not (?:[a-z]|[A-Z]|_).
Parenthesizes only where precedence demands it — ${word}+ becomes [a-zA-Z0-9_\-]+, and a quantified alternation is auto-grouped, so the + never silently binds to the last branch.
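
Both points can be checked against the native engine. The following sketch uses hand-written regex literals (illustrative shapes, not output captured from the library) to show the precedence trap and to confirm that the merged class accepts the same characters as the unmerged alternation.

```javascript
// Precedence: in a hand-written alternation the + binds only to the last branch.
const ungrouped = /ab|cd+/;      // "ab", or "c" followed by one or more "d"
const grouped = /(?:ab|cd)+/;    // one or more of "ab" or "cd"

const ungroupedMatch = ungrouped.exec("cdcdab")[0]; // "cd"
const groupedMatch = grouped.exec("cdcdab")[0];     // "cdcdab"

// Class merging: one class and an alternation of classes accept the same characters.
const merged = /^[a-zA-Z_]$/;
const unmerged = /^(?:[a-z]|[A-Z]|_)$/;
const samples = ["a", "Z", "_", "5", "-"];
const sameAnswers = samples.every((ch) => merged.test(ch) === unmerged.test(ch)); // true
```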

Grammar

| Form | Meaning | Example |
| --- | --- | --- |
| 'x' | literal | '=', '.' |
| 'a'..'z' | character range | '0'..'9' |
| char(202A) | character by Unicode codepoint | char(00A0) |
| char(a)..char(b) | codepoint range | char(00C0)..char(00D6) |
| a b | sequence (juxtaposition) | ${sigil} ${ident} |
| a \| b | alternation | '.' \| '#' \| ':' |
| ( … ) | grouping | ('=' ${ident})? |
| a+ a* a? | quantifiers | ${word}+ |
| a{n} a{n..} a{n..m} | bounded repetition | ${word}{1..6} |
| !a | negate a char / class | !digit, !('a' \| '_') |
| before x after x | lookahead / lookbehind (zero-width) | before ')' |
| same name | backreference to a captured rule | quote ... same quote |
| until x | lazy run up to (not eating) x | until '</script>' |
| ${fragment} | splice another pattern as an atom | ${word} |
| // … | line comment (stripped) | |

Whitespace between tokens is insignificant, so patterns can be laid out and annotated — the free-spacing mode regex literals never had.

char — codepoints and ranges

char(HEX) names a single character by its Unicode codepoint, emitting \uXXXX (BMP) or \u{…} (astral) — the readable way to put an otherwise-invisible character (a bidi control, a zero-width space) into a pattern. char(a)..char(b) is a codepoint range. Both are class-safe (they merge into […] and negate like a literal); an astral codepoint forces the engine's unicode (u) flag on automatically.

grammar`r = char(202A)`.r.source;                  // "\\u202a"
grammar`r = char(00C0)..char(00D6)`.r.source;      // "[\\u00c0-\\u00d6]"
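
As a quick check of what those emitted escapes match, here is a sketch with native regex literals only. The sample strings are illustrative, and the astral example (U+1F600) is an assumption chosen to show why the u flag is needed; it does not come from the library's documentation.

```javascript
// BMP codepoints work with a plain \uXXXX escape.
const bidi = /\u202a/;              // LEFT-TO-RIGHT EMBEDDING, invisible in most editors
const latin = /[\u00c0-\u00d6]/;    // U+00C0 through U+00D6

const hasBidi = bidi.test("abc\u202adef");   // true
const accented = latin.test("\u00c9");       // true, U+00C9 is inside the range
const plain = latin.test("E");               // false

// Astral codepoints need the \u{...} form, which is only recognized under the u flag.
const astral = /\u{1F600}/u;
const hasEmoji = astral.test("\u{1F600}");   // true
const unitCount = "\u{1F600}".length;        // 2, the character is a surrogate pair
```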

Bounded quantifiers

Beyond + * ?, {…} gives explicit counts and reuses .. for the range. Counts are decimal, unlike the hex codepoints that char takes:

grammar`r = ${word}{3}`.r.source;     // "…{3}"      exactly 3
grammar`r = ${word}{2..}`.r.source;   // "…{2,}"     2 or more
grammar`r = ${word}{2..5}`.r.source;  // "…{2,5}"    2 to 5
grammar`r = ${word}{..5}`.r.source;   // "…{0,5}"    up to 5
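
The native forms on the right behave as follows. This is a sketch using the built-in RegExp only, with the character class from the ident example standing in for the elided "…"; the anchors are added here purely to make the counts observable.

```javascript
const word = "[a-zA-Z0-9_\\-]";

const exactly3 = new RegExp(`^${word}{3}$`);
const atLeast2 = new RegExp(`^${word}{2,}$`);
const from2to5 = new RegExp(`^${word}{2,5}$`);
const upTo5 = new RegExp(`^${word}{0,5}$`);

const results = {
  exact: [exactly3.test("abc"), exactly3.test("abcd")],      // [true, false]
  open: [atLeast2.test("a"), atLeast2.test("abcdefgh")],      // [false, true]
  range: [from2to5.test("abcde"), from2to5.test("abcdef")],   // [true, false]
  upTo: [upTo5.test(""), upTo5.test("abcdef")],               // [true, false]
};

// Native regex has no {,5} shorthand: without the u flag it is read as literal text,
// which is why "up to 5" has to be spelled {0,5}.
const literalBraces = /^a{,5}$/.test("a{,5}");   // true
const notACount = /^a{,5}$/.test("aaa");         // false
```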

Lookaround

before x / after x are zero-width positive lookahead / lookbehind; ! in front gives the negatives:

grammar`r = before ')'`.r.source;     // "(?=\\))"
grammar`r = !before ')'`.r.source;    // "(?!\\))"
grammar`r = after '('`.r.source;      // "(?<=\\()"
grammar`r = !after '('`.r.source;     // "(?<!\\()"
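
To see the four assertions at work, here is a sketch with native regex literals. Each one is combined with a \w so there is something to consume; the input string is illustrative.

```javascript
const text = "(a)b";   // indices: "(" 0, "a" 1, ")" 2, "b" 3

const beforeClose = /\w(?=\))/.exec(text);      // "a": followed by ")"
const notBeforeClose = /\w(?!\))/.exec(text);   // "b": not followed by ")"
const afterOpen = /(?<=\()\w/.exec(text);       // "a": preceded by "("
const notAfterOpen = /(?<!\()\w/.exec(text);    // "b": not preceded by "("

// The assertion itself consumes nothing: each match is exactly one character long.
const lengths = [beforeClose, notBeforeClose, afterOpen, notAfterOpen].map((m) => m[0].length);
```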

Backreference

same name re-matches the exact text that an earlier capture of that name (a rule reference) matched. It emits \k<name>, so it is only meaningful inside a grammar where that named capture exists:

const quoted = grammar`
  quote = '"' | char(27)
  r = quote 'hi' same quote
`.r;
quoted.test('"hi"');   // true   — closing quote must match the opener
quoted.test('"hi\'');  // false
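
A hand-written native equivalent looks roughly like this. It is an approximation for illustration; the exact source the library emits for this grammar is not shown in this README, so treat the group layout as an assumption.

```javascript
// Named capture plus \k<name> backreference, written by hand.
const quoted = /(?<quote>"|')hi\k<quote>/;

const doubleQuoted = quoted.test('"hi"');    // true
const singleQuoted = quoted.test("'hi'");    // true
const mismatched = quoted.test('"hi\'');     // false, closer differs from opener
const opener = quoted.exec("'hi'").groups.quote;   // "'"
```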

Negation

!a matches one character that is not a. It compiles to a negated character class, so its operand must be a single character or a class — a shorthand, a literal, a range, or an alternation of those:

pattern`!'a'..'z'`.source;          // "[^a-z]"
pattern`!('a' | '_')`.source;       // "[^a_]"
grammar`x = !digit`.x.source;       // "\\D"  — named terminals resolve inside a grammar
grammar`x = !whitespace+`.x.source; // "\\S+" — a trailing quantifier binds to the negation

Negating a sequence (!('a' 'b')) is an error: regex has no single consuming token for "not the string ab" — that's a negative lookahead (!before 'ab').
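
The difference between the two is easy to see natively. This sketch uses regex literals only: a negated class consumes exactly one character, while a negative lookahead consumes none.

```javascript
// Negated classes: each match consumes one character.
const firstNonLower = /[^a-z]/.exec("abC")[0];     // "C"
const notAOrUnderscore = /[^a_]/.exec("a_b")[0];   // "b"
const nonDigits = /\D+/.exec("12ab34")[0];         // "ab"

// "Not the string ab" is a zero-width check, not a consuming token.
const notAb = /^(?!ab)\w+/;
const rejectsAb = notAb.test("abc");               // false
const acceptsOther = notAb.test("acb");            // true
const width = /(?!ab)/.exec("xy")[0].length;       // 0
```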

until

until x matches the shortest run of any characters (newlines included) up to the first x, without consuming x — the readable form of the [\s\S]*? block-grab you'd otherwise hand-write:

const g = grammar`
  body   = until '</script>'
  script = '<script>' body '</script>'
`;
g.body.source;                                  // "[\\s\\S]*?(?=</script>)"
g.script.match("<script>a</script>b</script>").parts.body;   // "a"  — stops at the first
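
For comparison, here is the same block-grab written by hand with native regex literals. The named group body is added in this sketch to expose the captured text; it is not a claim about how the library lays out its captures.

```javascript
const html = "<script>a</script>b</script>";

// Lazy run plus lookahead: stops at the first closing tag.
const lazy = /<script>(?<body>[\s\S]*?(?=<\/script>))<\/script>/;
const lazyBody = lazy.exec(html).groups.body;       // "a"

// Greedy run: overshoots to the last closing tag.
const greedy = /<script>(?<body>[\s\S]*)<\/script>/;
const greedyBody = greedy.exec(html).groups.body;   // "a</script>b"

// Why [\s\S] and not a bare dot: without the s flag, "." does not match a newline.
const multiline = "<script>\n</script>";
const dotCrosses = /<script>.*?<\/script>/.test(multiline);        // false
const classCrosses = /<script>[\s\S]*?<\/script>/.test(multiline); // true
```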

Built-in terminals

A handful of names resolve to common character classes and zero-width anchors without you defining them. They emit inline (a terminal, never a capture), and a rule of the same name in your own grammar shadows the built-in:

| Name | Compiles to | Matches |
| --- | --- | --- |
| whitespace | \s | space, tab, newline, … |
| digit | \d | 0–9 |
| word | \w | [A-Za-z0-9_] |
| letter | [a-zA-Z] | an ASCII letter |
| boundary | \b | a word boundary (zero-width) |
| start | ^ | start of input/line (zero-width) |
| end | $ | end of input/line (zero-width) |

Built-in terminals only resolve inside a grammar block. A bare pattern`…` has no terminal scope: it composes only by interpolating fragments you defined yourself (pattern`${word}+`), so pattern`boundary` throws an unresolved-reference error.
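
The "input/line" wording for start and end reflects how the native anchors behave: ^ and $ anchor to the whole input unless the m flag is set, in which case they anchor to each line. The sketch below shows that with native regex literals only, along with \b and the ASCII-only reach of \w; how the library exposes the m flag is not covered in this README, so no library call is shown.

```javascript
const text = "use a;\nuse b;";

// Without m, ^ matches only at the start of the input.
const inputAnchored = text.match(/^use \w+;/g).length;    // 1
// With m, ^ also matches after each line break.
const lineAnchored = text.match(/^use \w+;/gm).length;    // 2

// \b is zero-width: it sits between a word character and a non-word character.
const inside = /\bcat\b/.test("concat");          // false
const standalone = /\bcat\b/.test("a cat sat");   // true

// \w is ASCII-only, as the table says: an accented letter is not a word character.
const accentIsWord = /\w/.test("\u00e9");         // false
```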

Unicode property classes

Unicode.* is a namespace of engine-supplied character categories — the General_Category set plus the identifier properties — that can't be enumerated by hand. Each emits a \p{…} escape (class-safe, so it merges and negates like a shorthand) and forces the u flag on:

grammar`r = Unicode.Letter.Uppercase`.r.source;   // "\\p{Lu}"
grammar`r = Unicode.Number.Decimal`.r.source;     // "\\p{Nd}"
grammar`r = Unicode.Identifier.Start`.r.source;   // "\\p{ID_Start}"

Names follow the categories: Unicode.Letter(.Uppercase/.Lowercase/…), Unicode.Number, Unicode.Mark, Unicode.Punctuation, Unicode.Symbol, Unicode.Separator, Unicode.Other, and Unicode.Identifier.Start/.Continue.
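
To illustrate why the u flag has to come on, here is a sketch with native regex literals. The sample characters are illustrative choices, not taken from the library's tests.

```javascript
const upper = /\p{Lu}/u;
const decimal = /\p{Nd}/u;
const idStart = /\p{ID_Start}/u;

const upperHit = upper.test("\u00c9");      // true, uppercase E with acute
const upperMiss = upper.test("\u00e9");     // false, the lowercase form
const arabicDigit = decimal.test("\u0663"); // true, ARABIC-INDIC DIGIT THREE
const asciiDigitOnly = /\d/.test("\u0663"); // false, \d covers 0-9 only
const greekStart = idStart.test("\u03c0");  // true, a letter can start an identifier
const digitStart = idStart.test("1");       // false

// Without the u flag, \p is not a property escape: the pattern matches the literal text "p{Lu}".
const withoutFlag = /\p{Lu}/.test("\u00c9"); // false
const literalText = /\p{Lu}/.test("p{Lu}");  // true
```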

Grammar blocks & captures

A grammar block defines named rules that reference each other by bare name. A reference becomes a named capture, so a match hands back parts keyed by rule name — no separate capture syntax, the rule name is the key.

import { grammar } from "ez-regex-patterns";

const g = grammar`
  alpha = 'a'..'z' | 'A'..'Z'
  digit = '0'..'9'
  word  = alpha | digit | '_' | '-'
  ident = word+
  sigil = '.' | '#' | ':'
  name  = ident
  value = ident
  part  = sigil name ('=' value)?
`;

const m = g.part.match("#hp=low");
m.parts;   // { sigil: "#", name: "hp", value: "low" }

A rule reused under two names (name, value both ident) gets distinct keys, and nested references flatten to plain regex so they never collide. Recursion is rejected — regex can't recurse.
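
For a sense of what the grammar saves you from writing, here is a hand-built native approximation of the part rule using named capture groups. It is a sketch under the assumption that each top-level rule reference maps to one named group; the library's actual emitted source is not shown in this README and may differ.

```javascript
const ident = "[a-zA-Z0-9_\\-]+";

// ident has to be repeated by hand under two different group names.
const part = new RegExp(`(?<sigil>[.#:])(?<name>${ident})(?:=(?<value>${ident}))?`);

const full = part.exec("#hp=low").groups;   // sigil "#", name "hp", value "low"
const bare = part.exec(".btn").groups;      // value is undefined, the optional group did not take part

// Native regex rejects the same group name twice in one sequence,
// which is why distinct keys per reference matter.
let duplicateRejected = false;
try {
  new RegExp("(?<id>a)(?<id>b)");
} catch (err) {
  duplicateRejected = err instanceof SyntaxError;
}
```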

License

MIT
