@prostojs/parser

v0.6.0

Published

3 months ago

Parse anything — a composable, hooks-based parser toolkit

mav-rik

Build a parser for anything — in minutes, not months.

Stop writing ad-hoc regex spaghetti or reaching for heavyweight parser generators. @prostojs/parser gives you composable building blocks: define your nodes, wire them together, and get a working parser with structured output — fast.

Why This Parser?

It's LEGO for parsers. Each node is a self-contained piece — a tag, a string, a comment, an attribute. Snap them together and you have a full grammar. Need to change something? Swap one block, everything else stays.

Output is built during parsing. Hooks fire as tokens are matched — onOpen, onClose, onContent, onChild. Your data is in its final shape the moment parsing ends. No AST-to-output conversion step. No tree walking.

Near-zero boilerplate. Write data: { tag: '', attrs: {} } and it just works — auto-cloned per match, regex named groups auto-mapped to fields. A full XML-to-JSON parser is ~400 lines.

Competitive performance. Although it is a general-purpose toolkit, when parsing XML it performs within 4-36% of fast-xml-parser, a dedicated XML-only library. For most formats you'll parse, no dedicated alternative exists, and this is fast enough.

Install

npm install @prostojs/parser

30-Second Overview

Every parser is a tree of Nodes. Each node knows how to start, how to end, and what it can contain:

import { Node, parse } from '@prostojs/parser'

// A string: starts with a quote, ends with the same quote
const string = new Node<{ quote: string }>({
  name: 'string',
  start: { token: /(?<quote>["'])/, omit: true },
  end: { token: (ctx) => ctx.node.data.quote, omit: true },
  data: { quote: '' },
})

// A key=value pair: key captured from regex, value from content
const pair = new Node<{ key: string; value: string }>({
  name: 'pair',
  start: { token: /(?<key>\w+)\s*=\s*/, omit: true },
  end: { token: /\n|$/, omit: true },
  recognizes: [string],
  data: { key: '', value: '' },
  mapContent: 'value',
})

// Root: contains pairs, closes at EOF
const root = new Node({ name: 'root', eofClose: true, recognizes: [pair] })

const result = parse(root, 'name = "Alice"\nage = "30"')
// result.content → [ParsedNode{key:'name', value:'Alice'}, ...]

That's a working config file parser. No grammar files, no build step, no code generation.

How It Works

1. Define Nodes

A node is a pattern with a start token, an end token, and typed data:

const comment = new Node<{ text: string }>({
  name: 'comment',
  start: { token: '<!--', omit: true },
  end: { token: '-->', omit: true },
  data: { text: '' },
  mapContent: 'text',  // auto-joins text content into data.text
})

Tokens can be strings, RegExps (with named capture groups), or dynamic functions:

// String — exact match
start: '{'

// RegExp — captures data automatically
start: { token: /<(?<tag>\w+)/, omit: true }

// Dynamic — computed from current node's data
end: { token: (ctx) => `</${ctx.node.data.tag}>`, omit: true }
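To make the three token forms concrete, here is a minimal standalone sketch of how a token definition could be resolved against the current node and then tested at a position in the source. It does not import the library: `Ctx`, `Token`, `resolveToken`, and `matchesAt` are hypothetical names invented for this illustration, and the library's real matching logic may differ.

```typescript
// Hypothetical shapes for illustration only; not the library's real types.
type Ctx = { node: { data: Record<string, string> } }
type Token = string | RegExp | ((ctx: Ctx) => string | RegExp)

// Resolve a token definition to a concrete string or RegExp for the current node.
function resolveToken(token: Token, ctx: Ctx): string | RegExp {
  return typeof token === 'function' ? token(ctx) : token
}

// Check whether the resolved token matches exactly at `pos` in `source`.
function matchesAt(source: string, pos: number, token: string | RegExp): boolean {
  if (typeof token === 'string') {
    return source.slice(pos, pos + token.length) === token
  }
  // search() returns the index of the first match within the sliced remainder.
  return source.slice(pos).search(token) === 0
}

// A dynamic end token computed from data captured by the start token.
const ctx: Ctx = { node: { data: { tag: 'div' } } }
const endToken: Token = (c) => `</${c.node.data.tag}>`
const source = '<div>hi</div>'
const resolved = resolveToken(endToken, ctx)
const closesAt7 = matchesAt(source, 7, resolved)
const closesAt0 = matchesAt(source, 0, resolved)
```

The point of the dynamic form is that the closing pattern depends on what the opening pattern captured, so one node definition can cover every tag name.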

Token modifiers:

omit — strip the token from node content
eject — don't consume the match, let the parent handle it
backslash — ignore the token if preceded by \
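As an illustration of the `backslash` modifier described above, the following standalone sketch skips occurrences of a token that are immediately preceded by a backslash. It is not the library's implementation: `isEscaped` and `findUnescaped` are hypothetical helpers, and the sketch assumes only the single preceding character is checked (so a doubled backslash is not treated specially).

```typescript
// True when the character right before `index` is a backslash.
function isEscaped(source: string, index: number): boolean {
  return index > 0 && source.charAt(index - 1) === '\\'
}

// Find the first occurrence of `token` at or after `from` that is not escaped.
// Returns -1 when every occurrence is escaped or the token is absent.
function findUnescaped(source: string, token: string, from = 0): number {
  let i = source.indexOf(token, from)
  while (i !== -1 && isEscaped(source, i)) {
    i = source.indexOf(token, i + 1)
  }
  return i
}

// The two escaped quotes are skipped; the third quote closes the string.
const text = 'say \\"hi\\" now" rest'
const closingQuote = findUnescaped(text, '"')
const body = text.slice(0, closingQuote)
```

This is the behaviour a quoted-string node typically needs: an escaped quote stays part of the content instead of ending the node.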

2. Compose Them

Tell each node what children it can contain:

const root = new Node({ name: 'root', eofClose: true })
root.recognize(comment, tag, cdata)
tag.recognize(attribute, innerContent)
innerContent.recognize(comment, tag, cdata)

That's your grammar. No separate DSL — it's just JavaScript.

3. Add Hooks to Shape Output

Hooks fire during parsing — use them to build your output in its final format:

tag
  .onOpen((node, match) => {
    // start token matched — node.data is ready (named groups already mapped)
    // return false to reject this match
  })
  .onChild((child, node) => {
    // a child node was fully parsed
    // route its data wherever you need it
    if (child.node === attribute) {
      node.data.attrs[child.data.key] = child.data.value
    }
  })
  .onContent((text, node) => {
    // text is about to be added — transform or suppress it
    return text.trim()
  })
  .onClose((node) => {
    // end token matched — finalize the output
  })

4. Parse

import { parse } from '@prostojs/parser'

const result = parse(root, sourceString)
// result: ParsedNode with .content, .data, .start, .end

Key Features

Named Group Auto-Mapping

Regex named groups map directly to data fields — available before onOpen fires:

const tag = new Node<{ tag: string }>({
  start: { token: /<(?<tag>\w+)/ },
  data: { tag: '' },
})
.onOpen((node) => {
  console.log(node.data.tag) // already populated
})
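The mapping itself can be pictured with plain JavaScript regular expressions. The sketch below is a standalone approximation and does not use the library: `mapGroups` is a hypothetical helper, and it assumes every defined named group is copied onto a fresh copy of the data template under the same key.

```typescript
// Copy every defined named capture group into a fresh copy of the template.
function mapGroups<T extends Record<string, unknown>>(
  template: T,
  match: RegExpExecArray,
): T {
  const data: Record<string, unknown> = { ...template }
  const groups = match.groups ?? {}
  for (const key of Object.keys(groups)) {
    // Optional groups that did not participate in the match are undefined.
    if (groups[key] !== undefined) data[key] = groups[key]
  }
  return data as T
}

const startToken = /<(?<tag>\w+)/
const match = startToken.exec('<div class="x">')
const data = match ? mapGroups({ tag: '', attrs: {} }, match) : null
```

Because the group name and the field name are the same string, no hook is needed just to move a captured value into the node's data.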

Plain Data Templates

No factory functions required. Just declare a plain object, and it is auto-cloned per match with an optimized cloner:

data: { tag: '', attrs: {}, children: [] }
// primitives → spread clone
// objects/arrays → shallow clone
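To show why per-match cloning matters, here is a standalone sketch of the cloning rule stated in the comments above: primitives are copied by a spread, and top-level objects and arrays are shallow-cloned so that matches do not share them. `cloneTemplate` is a hypothetical function written for this illustration; the library's optimized cloner is not shown in this document and may work differently.

```typescript
// Clone a data template one level deep.
function cloneTemplate<T extends Record<string, unknown>>(template: T): T {
  // The spread copies primitives by value and object references as-is.
  const copy: Record<string, unknown> = { ...template }
  for (const key of Object.keys(copy)) {
    const value = copy[key]
    if (Array.isArray(value)) {
      copy[key] = value.slice() // new array, same elements
    } else if (value !== null && typeof value === 'object') {
      copy[key] = { ...value } // new object, same property values
    }
  }
  return copy as T
}

const template = {
  tag: '',
  attrs: {} as Record<string, string>,
  children: [] as string[],
}
const first = cloneTemplate(template)
const second = cloneTemplate(template)

// Mutating one clone leaves the other clone and the template untouched.
first.attrs.id = 'first'
first.children.push('text')
```

Without the second step, every matched node would write into the same `attrs` object and `children` array declared in the template.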

`mapContent`

Auto-join all text content into a data field on node close. Replaces the most common onClose pattern:

data: { text: '' },
mapContent: 'text',
// equivalent to: .onClose(node => { node.data.text = textContent(node) })

Utilities

import { textContent, children, findChild, findChildren, walk, printTree } from '@prostojs/parser'

textContent(node)              // joined string content
children(node)                 // child ParsedNodes (no strings)
findChild(node, targetNode)    // first child of a specific node type
findChildren(node, targetNode) // all children of a specific node type
walk(node, (child, depth) => { ... })  // depth-first walk
printTree(node)                // debug visualization
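For intuition about what these helpers do, the sketch below reimplements three of them over a simplified node shape. Everything here is hypothetical: `MiniNode`, `miniTextContent`, `miniChildren`, and `miniWalk` are illustration-only names, and the sketch assumes that content is a mixed array of strings and child nodes, that text is joined from direct strings only, and that the walk visits descendants starting at depth 0. The library's own behaviour may differ on each of these points.

```typescript
// Hypothetical minimal node shape; not the library's ParsedNode.
interface MiniNode {
  name: string
  content: Array<string | MiniNode>
}

// Join the direct string pieces of a node's content.
function miniTextContent(node: MiniNode): string {
  return node.content.filter((c): c is string => typeof c === 'string').join('')
}

// Return only the child nodes, dropping string pieces.
function miniChildren(node: MiniNode): MiniNode[] {
  return node.content.filter((c): c is MiniNode => typeof c !== 'string')
}

// Depth-first, pre-order walk over all descendants.
function miniWalk(
  node: MiniNode,
  visit: (child: MiniNode, depth: number) => void,
  depth = 0,
): void {
  for (const child of miniChildren(node)) {
    visit(child, depth)
    miniWalk(child, visit, depth + 1)
  }
}

const tree: MiniNode = {
  name: 'root',
  content: ['a', { name: 'tag', content: ['b', { name: 'comment', content: ['c'] }] }, 'd'],
}
const visited: string[] = []
miniWalk(tree, (child, depth) => {
  visited.push(`${depth}:${child.name}`)
})
```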

Node Options Reference

| Option | Type | Description |
|--------|------|-------------|
| name | string | Identifier (for debugging / printTree) |
| start | TokenDef \| TokenDef[] | Start token(s) |
| end | TokenDef \| TokenDef[] | End token(s) |
| recognizes | Node[] | Child nodes this node can contain |
| skip | Token \| Token[] | Tokens to silently skip (e.g. whitespace) |
| bad | Token \| Token[] | Tokens that trigger a parse error |
| eofClose | boolean | Allow this node to close at end of input |
| data | T \| () => T | Data template (auto-cloned) or factory |
| mapContent | string | Auto-join text content into this data field |
| hooks | NodeHooks<T> | Inline hook definitions |

Error Handling

import { ParseError } from '@prostojs/parser'

try {
  parse(root, source)
} catch (e) {
  if (e instanceof ParseError) {
    console.log(e.message) // includes line, column, and context
  }
}

Throws on unclosed nodes and bad tokens with precise source positions.
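Reporting a line and column comes down to translating a character offset into a position in the source. The standalone sketch below shows one common way to do that; `lineColumn` is a hypothetical helper, and the 1-based numbering and `\n`-only line breaks are assumptions, not documented behaviour of `ParseError`.

```typescript
// Convert a 0-based character offset into a 1-based line and column.
function lineColumn(source: string, offset: number): { line: number; column: number } {
  // Everything before the offset, split into the lines seen so far.
  const lines = source.slice(0, offset).split('\n')
  return {
    line: lines.length,
    column: lines[lines.length - 1].length + 1,
  }
}

const input = 'name = "Alice"\nage = 30'
const startOfInput = lineColumn(input, 0)
const openingQuote = lineColumn(input, input.indexOf('"'))
const secondLine = lineColumn(input, input.indexOf('age'))
```

Here `secondLine` points at line 2, column 1, since `age` is the first text after the newline.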

Examples

Each example is a standalone parser showcasing different aspects of the API. All source is in the examples/ directory on GitHub.

| Example | What it parses | Highlights |
|---------|---------------|------------|
| XML-to-JSON | Full XML → JSON (fast-xml-parser compatible) | Dynamic end tokens, hooks-based output, entity decoding, ~400 lines |
| JSON | JSON strings → JS values | onContent for bare primitives, state tracking for key/value disambiguation |
| Math Evaluator | 2 + 3 * (4 - 1) → 11 | Recursive group nodes, result computed during parsing, no AST |
| Template String | Hello, {{name}}! → parts array | Minimal 2-node parser, mapContent for zero-hook data capture |
| CSS Selector | div.cls > span:hover → structured parts | Dynamic quote matching, regex tokenization in onContent |
| URL Parser | URLs → protocol/host/path/query/hash | Named group auto-mapping, eject for boundary detection |
| ESM Analyzer | JS/TS source → imports, exports, unused | String/comment nodes as "shields" against false positives |

Migration from v0.5

See MIGRATION.md for a comprehensive guide.

License

MIT