ez-regex-patterns
v1.0.2
Readable, composable patterns that compile to native RegExp.
Regex isn't unreadable because of its primitives — [\w-]+ is terse and fine. It's
unreadable because it has no naming and no composition: one undifferentiated
expression, no sub-parts you can name and reuse, capture groups far from their meaning.
This library adds exactly that and nothing else. You write fragments in a small, grammar-like language, name them as ordinary variables, and compose them by interpolation. The result compiles — once, at definition time — to a real regex.
Install
npm install ez-regex-patterns
Usage
import { pattern } from "ez-regex-patterns";
const alpha = pattern`'a'..'z' | 'A'..'Z'`;
const digit = pattern`'0'..'9'`;
const word = pattern`${alpha} | ${digit} | '_' | '-'`;
const ident = pattern`${word}+`;
ident.source; // "[a-zA-Z0-9_\\-]+"
A Pattern is usable anywhere a RegExp is: it carries the native surface
(exec, lastIndex, test, source, flags, and the Symbol.replace/match/split
hooks) and drives the String methods directly:
ident.test(input);
ident.exec(input); // and `ident.lastIndex = i` to scan from a cursor
input.replace(ident, (m) => m.toUpperCase());
// The DSL also adds a parts-returning matcher:
ident.match(input); // { text, index, parts } | undefined
ident.matchAll(input); // every match, with parts
Patterns are flag-agnostic; ask for a flag only where you need it. .global,
.ignoreCase, and .sticky each return a flag-flavoured copy (the node is shared, only
the flag differs, and they compose):
const stmt = grammar`r = start 'use ' word+ ';'`.r;
input.replace(stmt.global, fix); // every occurrence
tag.ignoreCase.test('<SCRIPT>'); // case-insensitive
scanner.sticky; // anchors each match at lastIndex
.toRegExp(flags) is still there for an explicit copy under a one-off flag set.
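To make the flag-flavoured copies concrete, here is a minimal sketch of the idea in plain JavaScript. MiniPattern and every name in it are hypothetical stand-ins invented for illustration; the README does not show how the library implements this. The sketch only shows that an object can hand out copies that differ by one flag and still drive String.prototype.replace through the Symbol.replace hook.

```javascript
// Hypothetical stand-in for illustration only; not the library's implementation.
class MiniPattern {
  constructor(source, flags = "") {
    this.source = source;
    this.flags = flags;
    this.regex = new RegExp(source, flags); // compiled once per flag set
  }

  // Each getter returns a copy that shares the source and adds one flag.
  get global() { return this.withFlag("g"); }
  get ignoreCase() { return this.withFlag("i"); }
  get sticky() { return this.withFlag("y"); }

  withFlag(flag) {
    return this.flags.includes(flag)
      ? this
      : new MiniPattern(this.source, this.flags + flag);
  }

  test(input) {
    return this.regex.test(input);
  }

  // String.prototype.replace looks up this hook on its first argument,
  // so a non-RegExp object can drive replace directly.
  [Symbol.replace](input, replacement) {
    return this.regex[Symbol.replace](input, replacement);
  }
}

const identLike = new MiniPattern("[a-z]+");
const upper = (m) => m.toUpperCase();

const firstOnly = "ab cd".replace(identLike, upper);         // "AB cd"
const everyMatch = "ab cd".replace(identLike.global, upper); // "AB CD"
const caseless = identLike.ignoreCase.test("ABC");           // true
const composedFlags = identLike.global.ignoreCase.flags;     // "gi"
```

The base object stays flag-free, so asking for .global at one call site does not change how the same pattern behaves elsewhere.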
The emitter does the two things hand-written regex makes you do by eye:
- Merges an alternation of single chars / ranges into one class:
  'a'..'z' | 'A'..'Z' | '_' becomes [a-zA-Z_], not (?:[a-z]|[A-Z]|_).
- Parenthesizes only where precedence demands it:
  ${word}+ becomes [a-zA-Z0-9_\-]+, and a quantified alternation is auto-grouped, so the + never silently binds to the last branch.
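Both points can be checked against native regex alone. The sketch below uses hand-written RegExp literals (not library output) to show that the merged class accepts the same characters as the alternation, and what goes wrong when a + is attached to an ungrouped alternation.

```javascript
// One merged class versus the alternation it replaces: same accepted characters.
const merged = /^[a-zA-Z_]$/;
const alternation = /^(?:[a-z]|[A-Z]|_)$/;
const samples = ["q", "Q", "_", "7", "-"];
const mergedResults = samples.map((s) => merged.test(s));
const alternationResults = samples.map((s) => alternation.test(s));
// both arrays: [true, true, true, false, false]

// Quantifier precedence: without a group, + applies only to the last branch.
const lastBranchOnly = /^(?:a|b+)$/;       // "a", or one or more "b"
const wholeAlternation = /^(?:(?:a|b)+)$/; // one or more of "a" or "b"

const looseAbab = lastBranchOnly.test("abab");     // false
const groupedAbab = wholeAlternation.test("abab"); // true
```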
Grammar
| Form | Meaning | Example |
| --------------------- | ---------------------------------- | ----------------------------- |
| 'x' | literal | '=', '.' |
| 'a'..'z' | character range | '0'..'9' |
| char(202A) | character by Unicode codepoint | char(00A0) |
| char(a)..char(b) | codepoint range | char(00C0)..char(00D6) |
| a b | sequence (juxtaposition) | ${sigil} ${ident} |
| a \| b | alternation | '.' \| '#' \| ':' |
| ( … ) | grouping | ('=' ${ident})? |
| a+ a* a? | quantifiers | ${word}+ |
| a{n} a{n..} a{n..m} | bounded repetition | ${word}{1..6} |
| !a | negate a char / class | !digit, !('a' \| '_') |
| before x after x | lookahead / lookbehind (zero-width)| before ')' |
| same name | backreference to a captured rule | quote ... same quote |
| until x | lazy run up to (not eating) x | until '</script>' |
| ${fragment} | splice another pattern as an atom | ${word} |
| // … | line comment (stripped) | |
Whitespace between tokens is insignificant, so patterns can be laid out and annotated — the free-spacing mode regex literals never had.
char — codepoints and ranges
char(HEX) names a single character by its Unicode codepoint, emitting \uXXXX (BMP) or
\u{…} (astral) — the readable way to put an otherwise-invisible character (a bidi
control, a zero-width space) into a pattern. char(a)..char(b) is a codepoint range. Both
are class-safe (they merge into […] and negate like a literal); an astral codepoint
forces the engine's unicode (u) flag on automatically.
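For reference, this is how the emitted escapes behave in a native RegExp. The snippet is plain JavaScript with hand-written regexes, not library output, and is meant only to show why an astral codepoint needs the u flag while a BMP one does not.

```javascript
// BMP codepoint: the four-digit \uXXXX escape works with or without the u flag.
const bidiEmbedding = /\u202a/;
const foundBidi = bidiEmbedding.test("abc\u202Adef"); // true

// Codepoint range U+00C0..U+00D6 as a class.
const latinCapitals = /[\u00c0-\u00d6]/;
const matchesAUmlaut = latinCapitals.test("\u00C4"); // true  (U+00C4 is inside the range)
const matchesTimes = latinCapitals.test("\u00D7");   // false (U+00D7 is just past the end)

// Astral codepoint: \u{...} is only a codepoint escape under the u flag.
const astralWithU = new RegExp("\\u{1F600}", "u");
const astralWithoutU = new RegExp("\\u{1F600}");
const emoji = "\u{1F600}";
const withU = astralWithU.test(emoji);       // true
const withoutU = astralWithoutU.test(emoji); // false
```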
grammar`r = char(202A)`.r.source; // "\\u202a"
grammar`r = char(00C0)..char(00D6)`.r.source; // "[\\u00c0-\\u00d6]"
Bounded quantifiers
Beyond + * ?, {…} gives explicit counts, reusing .. for the range (decimal, not
hex like char):
grammar`r = ${word}{3}`.r.source; // "…{3}" exactly 3
grammar`r = ${word}{2..}`.r.source; // "…{2,}" 2 or more
grammar`r = ${word}{2..5}`.r.source; // "…{2,5}" 2 to 5
grammar`r = ${word}{..5}`.r.source; // "…{0,5}" up to 5
Lookaround
before x / after x are zero-width positive lookahead / lookbehind; ! in front gives
the negatives:
grammar`r = before ')'`.r.source; // "(?=\\))"
grammar`r = !before ')'`.r.source; // "(?!\\))"
grammar`r = after '('`.r.source; // "(?<=\\()"
grammar`r = !after '('`.r.source; // "(?<!\\()"
Backreference
same name re-matches the exact text an earlier capture (a rule reference) of that name
matched — emits \k<name>, and only means anything inside a grammar where the named
capture exists:
const quoted = grammar`
quote = '"' | char(27)
r = quote 'hi' same quote
`.r;
quoted.test('"hi"'); // true — closing quote must match the opener
quoted.test('"hi\''); // false
Negation
!a matches one character that is not a. It compiles to a negated character
class, so its operand must be a single character or a class — a shorthand, a literal, a
range, or an alternation of those:
pattern`!'a'..'z'`.source; // "[^a-z]"
pattern`!('a' | '_')`.source; // "[^a_]"
grammar`x = !digit`.x.source; // "\\D" — named terminals resolve inside a grammar
grammar`x = !whitespace+`.x.source; // "\\S+": a trailing quantifier binds to the negation
Negating a sequence (!('a' 'b')) is an error: regex has no single consuming token
for "not the string ab". That is a negative lookahead (!before 'ab').
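The difference between the two is easy to see in native regex. In this sketch (hand-written RegExp literals, not library output), a negated class consumes one character, while a negative lookahead consumes nothing and only rules out what comes next.

```javascript
// Negated class: consumes exactly one character that is neither "a" nor "b".
const notAOrB = /^[^ab]/;
// Negative lookahead: consumes nothing, only asserts that "ab" does not come next.
const notFollowedByAb = /^(?!ab)\w+/;

// "ba" is not the string "ab", yet the class rejects it because it starts with "b".
const classOnBa = notAOrB.test("ba");                  // false
const lookaheadOnBa = notFollowedByAb.test("ba");      // true
const lookaheadOnAbc = notFollowedByAb.test("abc");    // false
const lookaheadMatch = notFollowedByAb.exec("acb")[0]; // "acb"
```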
until
until x matches the shortest run of any characters (newlines included) up to the
first x, without consuming x — the readable form of the [\s\S]*? block-grab
you'd otherwise hand-write:
const g = grammar`
body = until '</script>'
script = '<script>' body '</script>'
`;
g.body.source; // "[\\s\\S]*?(?=</script>)"
g.script.match("<script>a</script>b</script>").parts.body; // "a" (stops at the first </script>)
Built-in terminals
A handful of names resolve to common character classes and zero-width anchors without you defining them. They emit inline (a terminal, never a capture), and a rule of the same name in your own grammar shadows the built-in:
| Name | Compiles to | Matches |
| ------------ | ----------- | -------------------------------- |
| whitespace | \s | space, tab, newline, … |
| digit | \d | 0–9 |
| word | \w | [A-Za-z0-9_] |
| letter | [a-zA-Z] | an ASCII letter |
| boundary | \b | a word boundary (zero-width) |
| start | ^ | start of input/line (zero-width) |
| end | $ | end of input/line (zero-width) |
Built-in terminals only resolve inside a grammar block. A bare pattern`…` has no terminal scope: it composes by interpolating fragments you defined yourself (pattern`${word}+`), so pattern`boundary` throws an unresolved-reference error.
Unicode property classes
Unicode.* is a namespace of engine-supplied character categories — the General_Category
set plus the identifier properties — that can't be enumerated by hand. Each emits a
\p{…} escape (class-safe, so it merges and negates like a shorthand) and forces the u
flag on:
grammar`r = Unicode.Letter.Uppercase`.r.source; // "\\p{Lu}"
grammar`r = Unicode.Number.Decimal`.r.source; // "\\p{Nd}"
grammar`r = Unicode.Identifier.Start`.r.source; // "\\p{ID_Start}"
Names follow the categories: Unicode.Letter(.Uppercase/.Lowercase/…),
Unicode.Number, Unicode.Mark, Unicode.Punctuation, Unicode.Symbol,
Unicode.Separator, Unicode.Other, and Unicode.Identifier.Start/.Continue.
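As a quick check of what those \p{…} escapes match, the following uses native RegExp literals directly (hand-written, not library output). It also shows why the u flag has to be on: without it the engine does not treat \p{…} as a property escape.

```javascript
// General_Category: uppercase letter.
const upperLetter = /\p{Lu}/u;
const upperEAcute = upperLetter.test("\u00C9"); // true  (É)
const lowerEAcute = upperLetter.test("\u00E9"); // false (é)

// Decimal numbers cover non-ASCII digits, which \d does not.
const decimalNumber = /\p{Nd}/u;
const arabicIndicThree = decimalNumber.test("\u0663"); // true
const backslashD = /\d/.test("\u0663");                // false

// Identifier properties.
const identifier = /^\p{ID_Start}\p{ID_Continue}*$/u;
const validName = identifier.test("na\u00EFve_1"); // true
const leadingDigit = identifier.test("1abc");      // false

// Without the u flag, \p{Lu} is not a property escape.
const withoutUFlag = new RegExp("\\p{Lu}").test("\u00C9"); // false
```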
Grammar blocks & captures
A grammar block defines named rules that reference each other by bare name. A
reference becomes a named capture, so a match hands back parts keyed by rule name —
no separate capture syntax, the rule name is the key.
import { grammar } from "ez-regex-patterns";
const g = grammar`
alpha = 'a'..'z' | 'A'..'Z'
digit = '0'..'9'
word = alpha | digit | '_' | '-'
ident = word+
sigil = '.' | '#' | ':'
name = ident
value = ident
part = sigil name ('=' value)?
`;
const m = g.part.match("#hp=low");
m.parts; // { sigil: "#", name: "hp", value: "low" }
A rule reused under two names (name, value both ident) gets distinct keys, and
nested references flatten to plain regex so they never collide. Recursion is rejected —
regex can't recurse.
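In native terms, this corresponds to named capture groups. The sketch below is a hand-written regex that mirrors the part rule; it is an assumption about the general shape of the output, since the README does not show the actual source the library emits for g.part. The identSource and partLike names are made up for the example.

```javascript
// Assumed shape only: a hand-written equivalent of the `part` rule above.
const identSource = "[a-zA-Z0-9_\\-]+"; // the flattened ident class, spliced in twice

const partLike = new RegExp(
  `(?<sigil>[.#:])(?<name>${identSource})(?:=(?<value>${identSource}))?`
);

// Both `name` and `value` reuse the same sub-pattern but get their own keys.
const full = partLike.exec("#hp=low");
const fullParts = { ...full.groups };  // { sigil: "#", name: "hp", value: "low" }

// The optional group leaves its key undefined when it does not participate.
const bare = partLike.exec(".title");
const bareParts = { ...bare.groups };  // { sigil: ".", name: "title", value: undefined }
```

Only the outer references become groups here; the inner ident class is inlined without a name, which is one way to keep nested references from colliding.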
License
MIT
