oniguruma-parser

v0.12.1

Published

10 months ago

Parse, validate, traverse, transform, and optimize Oniguruma regular expressions

0High
0Medium
0Low

slevithan

regex regexp oniguruma parser

oniguruma-parser 🌿

A TypeScript library for parsing, validating, traversing, transforming, and optimizing Oniguruma regular expressions.

[!NOTE] Oniguruma is a regular expression engine written in C that's used in Ruby (via a fork named Onigmo), PHP (mb_ereg, etc.), TextMate grammars (used by VS Code, Shiki, etc.), and many other tools.

This library has been battle-tested by Oniguruma-To-ES and tm-grammars, which are used by Shiki to process tens of thousands of real-world Oniguruma regexes.

📜 Contents

🕹️ Install and use

npm install oniguruma-parser

import {toOnigurumaAst} from 'oniguruma-parser';

The following modules are available in addition to the root 'oniguruma-parser' export:

Parser module: Includes numerous functions and types for constructing and working with OnigurumaAst nodes. Also includes the parse function, wrapped by toOnigurumaAst.
Traverser module: Traverse and transform an OnigurumaAst.
Generator module: Convert an OnigurumaAst to pattern and flags strings.
Optimizer module: Minify and improve the performance of Oniguruma regexes.

🌿 Convert a pattern to an AST

To parse an Oniguruma pattern (with optional flags and compile-time options) and return an AST, call toOnigurumaAst, which uses the following type definition:

function toOnigurumaAst(
  pattern: string,
  options?: {
    flags?: string;
    rules?: {
      captureGroup?: boolean;
      singleline?: boolean;
    };
  }
): OnigurumaAst;

For example:

import {toOnigurumaAst} from 'oniguruma-parser';

const ast = toOnigurumaAst('A.*');
console.log(ast);
/* →
{ type: 'Regex',
  body: [
    { type: 'Alternative',
      body: [
        { type: 'Character',
          value: 65,
        },
        { type: 'Quantifier',
          kind: 'greedy',
          min: 0,
          max: Infinity,
          body: {
            type: 'CharacterSet',
            kind: 'dot',
          },
        },
      ],
    },
  ],
  flags: {
    type: 'Flags',
    ignoreCase: false,
    dotAll: false,
    extended: false,
    digitIsAscii: false,
    posixIsAscii: false,
    spaceIsAscii: false,
    wordIsAscii: false,
    textSegmentMode: null,
  },
}
*/

An error is thrown if the provided pattern or flags aren't valid in Oniguruma.

Note: toOnigurumaAst is a wrapper around the parser module's parse function that makes it easier to use by automatically providing the appropriate Unicode property validation data.

🌀 Traverse and transform an AST

See details and examples in the traverser module's readme.

↩️ Convert an AST to a pattern

See details and examples in the generator module's readme.

🪄 Optimize regexes

This library includes one of the few implementations (for any regex flavor) of a "regex optimizer" that can minify and improve the performance and readability of regexes prior to use.

Example:

(?x) (?:\!{1,}) (\b(?:ark|arm|art)\b) [[^0-9A-Fa-f]\P{^Nd}\p{ Letter }]

Becomes:

!+\b(ar[kmt])\b[\H\d\p{L}]

Optimized regexes always match exactly the same strings.

See more details and examples in the optimizer module's readme.

[!TIP] 🧪 Try the optimizer demo.

🆚 Known differences

Known differences will be resolved in future versions.

Unsupported features

The following rarely-used features throw errors since they aren't yet supported:

Rarely-used character specifiers: Non-A-Za-z with \cx \C-x, meta \M-x \M-\C-x, octal code points \o{…}, and octal encoded bytes ≥ \200.
Code point sequences: \x{H H …} \o{O O …}.
Absence expressions (?~|…|…), stoppers (?~|…), and clearers (?~|).
Conditionals: (?(…)…), etc.
Non-built-in callouts: (?{…}), etc.
Numbered forward backreferences (incl. relative \k<+N>) and backreferences with recursion level (\k<N+N>, etc.).
Flags D P S W y{g} y{w} within pattern modifiers, and whole-pattern modifiers C I L.

Despite these gaps, more than 99.99% of real-world Oniguruma regexes are supported, based on a sample of ~55k regexes used in TextMate grammars (conditionals were used in three regexes, and other unsupported features weren't used at all). Some of the Oniguruma features above are so exotic that they aren't used in any public code on GitHub.

This library currently treats it as an error if a numbered backreference comes before its referenced group. This is a rare issue because:

Most such placements are mistakes and can never match, due to Oniguruma's behavior for backreferences to nonparticipating groups.
Erroring matches the correct behavior of named backreferences.
For unenclosed backreferences, this only affects \1–\9 since it's not a backreference in the first place if using \10 or higher and not as many capturing groups are defined to the left (it's an octal or identity escape).

The following don't yet throw errors, but should:

Special characters that are invalid in backreference names even when referencing a valid group with that name.
- Named backreferences should use a more limited set of allowed characters than named groups and subroutines.
- Note that an error is already correctly thrown for any backreference name that includes - or + (which is separate from how these symbols are used in relative numbered backreferences).
Subroutines used in ways that resemble infinite recursion (#5).
- Such subroutines error at compile time in Oniguruma.

Behavior differences

Unenclosed four-digit backreferences

Although any number of digits are supported for enclosed \k<…>/\k'…' backreferences (assuming the backreference refers to a valid capturing group), unenclosed backreferences currently support only up to three digits (\999). In other words, \1000 is handled as \100 followed by 0 even if 1,000+ captures appear to the left.

Note: An apparent bug in vscode-oniguruma (v2.0.1 tested) prevents any regex with more than 999 captures from working. They fail to match anything, with no error.

Erroring on patterns that trigger Oniguruma bugs

This library intentionally doesn't reproduce bugs, and it currently throws errors for several edge cases that trigger Oniguruma bugs and undefined behavior.

Although nested absence functions like (?~(?~…)) don't throw an error in Oniguruma, they produce self-described "strange" results, and Oniguruma's docs state that "nested absent functions are not supported and the behavior is undefined".

In this library, nested absence functions throw an error. In future versions, parsing of nested absence functions will follow Oniguruma and no longer error.

In Oniguruma, \x is an escape for the NUL character (equivalent to \0, \x00, etc.) if it's not followed by { or a hexadecimal digit.

In this library, bare \x throws an error.

Additional behavior details for \x in Oniguruma:

\x is an error if followed by a { that's followed by a hexadecimal digit but doesn't form a valid \x{…} code point escape. Ex: \x{F and \x{0,2} are errors.
\x matches a literal x if followed by a { that isn't followed by a hexadecimal digit. Ex: \x{ matches x{, \x{G matches x{G, and \x{,2} matches 0–2 x characters, since {,2} is a quantifier with an implicit 0 min.
In Oniguruma 6.9.10 and earlier (report), \x matches a literal x if it appears at the very end of a pattern. This is a bug.

In future versions, parsing of \x will follow the Oniguruma rules above (excluding bugs), removing some cases where it currently errors.

Normally, any incomplete \uHHHH (including bare \u) throws an error. However, in Oniguruma 6.9.10 and earlier (report), bare \u matches a literal u if it appears at the very end of a pattern. This is a bug.

In this library, incomplete \u is always an error.

Context: Unlike \uHHHH and enclosed \x{H…} (which match code points), Oniguruma's unenclosed \xHH represents an encoded byte, which means that, unlike in other regex flavors, \x80 to \xFF are treated as fragments of a code unit. Ex: [\0-\xE2\x82\xAC] is equivalent to [\0-\u20AC].

Invalid standalone encoded bytes should throw an error, but several related bugs are present in Oniguruma 6.9.10 and earlier (report).

In this library, they always throw an error.

Behavior details in Oniguruma:

Standalone \x80 to \xF4 throw an error.
Standalone \xF5 to \xFF fail to match anything, but don't throw. This is a bug.
When used as the end value of a character class range:
- Standalone \x80 to \xBF and \xF5 to \xFF are treated as \x7F. This is a bug.
- If the range is within a negated, non-nested character class (ex: [^\0-\xFF]), \xF5 to \xFF are treated as \x{10FFFF}. This is a bug.

🔢 Oniguruma version

All versions of this library to date have followed the rules of Oniguruma 6.9.10 (released 2025-01-01), which uses Unicode 16.0.0.

At least since Oniguruma 6.0.0 (released 2016-05-09), regex syntax changes in new versions have been backward compatible. Some versions added new syntax that was previously an error (such as new Unicode property names), and in a few cases, edge case parsing bugs were fixed.

Oniguruma 6.9.8 (released 2022-04-29) is an important baseline for JavaScript projects, since that's the version used by vscode-oniguruma 1.7.0 to the latest 2.0.1. It's therefore used in recent versions of various projects, including VS Code and Shiki. However, the regex syntax differences between Oniguruma 6.9.8 and 6.9.10 are so minor that this is a non-issue.

🧩 Contributing

Contributions are welcome. See the guide to help you get started.

🏷️ About

Created by Steven Levithan and contributors.

If you want to support this project, I'd love your help by contributing improvements (guide), sharing it with others, or sponsoring ongoing development.

MIT License.