meta-parser-generator
v1.1.2
Published
A PEG parser generator with packrat parsing written in JavaScript for JavaScript
Readme
Meta Parser Generator
npm install meta-parser-generatorMeta Parser Generator will help you generate an efficient parser using grammar and a token definition. Meta programming is used to generate a single self-contained parser file.
Characteristics
- PEG parser (Parsing Expression Grammar) with ordered choice
- Packrat parsing with memoization for linear time complexity
- Direct left recursion support using Guido van Rossum's algorithm
- Parser code is generated from a grammar
- Good parsing performance (O(n) with memoization)
- Excellent error reporting with context
- Small source code (~600 lines of code), no dependencies
Important: Grammar Order Matters
Unlike LL or LR parsers, PEG parsers use ordered choice. The first matching alternative is selected, and no backtracking occurs across alternatives. This means:
// This grammar will NEVER match 'number' because 'name' matches first!
'VALUE': [
['name'], // matches ANY identifier including '123abc'
['number'], // NEVER reached if name is defined as /^[\w]+/
]
// Correct order: more specific rules first
'VALUE': [
['number'], // try number first
['name'], // then try name
]How to generate and use a parser
This will generate a mathematical operation parser
// generator.js
const { generateParser } = require('meta-parser-generator');
const path = require('path');
// only 3 possible tokens
const tokensDefinition = {
'number': { reg: /^[0-9]+(\.[0-9]*)?/ },
'math_operator': { reg: /^(\+|-|\*|%)/ },
'newline': { str: '\n' }
}
const grammar = {
// START is the convention keyword for the entry point of the grammar
'START': [
// necessary to accept the first line be MATH
['MATH', 'LINE*', 'EOS'], // EOS is the End Of Stream token, added automatically by the tokenizer
// * is the repeating modifier {0,∞}. Better than recursion as it does not use the call stack
['LINE*', 'EOS'],
],
'LINE': [
// we define a line as always starting with a newline
['newline', 'MATH'],
['newline'],
],
'MATH': [
// direct left recursion here
['MATH', 'math_operator', 'number'],
['number'],
],
};Then execute this script node generate.js
// generate.js
const { tokensDefinition, grammar } = require('./generator');
// this generate the executable parser file
generateParser(grammar, tokensDefinition, path.resolve(__dirname, './parser.js'));
console.log('parser generated');Then you can use the generated parser this way
const parser = require('./parser');
const { tokensDefinition, grammar } = require('./generator');
const { displayError } = require('meta-parser-generator');
function parse(input) {
const tokens = parser.tokenize(tokensDefinition, input);
const ast = parser.parse(tokens);
if (!ast.success) {
displayError(tokens, tokensDefinition, grammar, ast);
}
return ast;
}
let ast = parse('9+10-190.3');
console.log(ast)How does the generated parser work?
Each grammar rule you write is transformed into a function, and those grammar functions call each other until the input parsing is successful. The parser uses:
- PEG Ordered Choice: For each rule with multiple alternatives, tries them in order and returns the first match
- Packrat Parsing: Memoization prevents re-parsing the same position, guaranteeing O(n) time complexity
- Left Recursion Handling: Uses a special memoization strategy based on Guido van Rossum's algorithm
The JavaScript call stack is used by the generated parser. So, if you design a very recursive grammar, you might trigger a "Maximum call stack size exceeded" error for a large input.
In our example, the MATH grammar rule has left recursion, allowing you to parse expressions like 1+2+3+4+5+...X, where X is limited by V8's stack size.
To find out the default maximum stack size of V8, run node --v8-options | grep stack-size. If the default size is not enough, you can extend it or rewrite your grammar.
Best practice: Use modifiers (*, +, ?) instead of recursion when possible - they don't use the call stack and handle large inputs better.
Note: For very large files, the memoization cache can grow significantly. The parser clears the cache between parse calls, but memory usage during parsing is proportional to input size × grammar complexity.
AST interface
type ASTNode = RuleNode | Token
export interface RuleNode {
stream_index: number // position of the first token for this rule in the token stream
type: string // name of the rule
sub_rule_index: number // index of this grammar rule in the sub_rule_index array
children: [ASTNode] // list of children
named: { [key: string]: ASTNode; } // named elements withing this rule, see named aliases
}At the leaf of the AST you will find the final tokens. They have a slightly different interface
export interface Token {
stream_index: number // position of the token in the token stream
type: string // name of token
value: string // the value of the token
len: number // shortcut for value.length
line_start: number // line start position in the input
column_start: number // column start position in the input
start: number // character start position in the input
}Modifiers
There is 3 modifiers you can add at the end of a rule or token
* is the {0,∞} repeating modifier
+ is the {1,∞} repeating modifier
? is the {0,1} conditional modifierExample
['PREPOSITION', 'ADJECTIVE*', 'NAME']Named alias
To facilitate your work with the AST, you can name a rule or a token using a colon
'MATH': [
['MATH', 'math_operator:operator', 'number:num'], // tokens math_operator and number
// are named with operator and num
['number:num'], // here only number is named with num
]Then in the corresponding RuleNode you will find the math_operator in the children, but also in the named object.
This is useful to more easily handle and differenciate your grammar rules:
// a function that handle both MATH grammar rules defined above
function handle_MATH_node(node: RuleNode) {
const named = node.named
// if there is an operator, we are dealing with sub rule 0
if(named['operator']) {
const left_recursion = handle_MATH_node(node.children[0])
console.log(`${left_recursion} ${named['operator'].value} ${named['num'].value}`)
} else {
console.log(named['num'].value)
}
}Errors
The util function displayError will display detailed informations about a tokenizer or parsing error. The hint given
is based on the first grammar rule found that consume the most token from the stream.
Projects using this parser
- The Blop language https://github.com/batiste/blop-language
