@uih-dsl/tokenizer

v0.0.2

Published

5 months ago

Tokenizer for UIH DSL - AI-friendly UI declaration language

0High
0Medium
0Low

leejaebae

uih dsl tokenizer lexer compiler ui ai

@uih-dsl/tokenizer

Lexical analyzer (tokenizer) for UIH DSL source code.

Overview

The UIH DSL tokenizer converts source code into a stream of tokens for parsing. It's the first stage of the UIH DSL compiler pipeline.

Design Principles:

Deterministic - Same input always produces same output
Strict validation - Enforces identifier and tag name rules
No semantic interpretation - Context determination delegated to parser
FSM-based - Operates in NORMAL, ATTRIBUTES, and STRING_LITERAL modes

Installation

npm install @uih-dsl/tokenizer

Usage

Basic Tokenization

import { tokenize, TokenType } from '@uih-dsl/tokenizer';

const input = `
meta {
  title: "My App"
  route: "/"
}

layout {
  Container {
    H1 {
      "Welcome"
    }
  }
}
`;

const tokens = tokenize(input);

// Examine tokens
tokens.forEach(token => {
  console.log(`${token.type}: "${token.value}" at line ${token.range.start.line}`);
});

Error Handling

try {
  tokenize('meta { title: "Test"; }');  // Semicolon is forbidden
} catch (error) {
  console.error(error.message);
  // "Tokenizer Error at line 1, column 21: Forbidden character ';'"
}

API

`tokenize(input: string): Token[]`

Converts a UIH DSL string into an array of tokens.

Input Requirements:

Only LF (\n) newlines allowed
No tab characters
Throws error on forbidden characters

Returns:

Token[] - Array of tokens

Type Definitions

interface Token {
  type: TokenType;
  value: string;
  range: Range;
}

interface Range {
  start: Position;
  end: Position;
}

interface Position {
  line: number;    // 1-based
  column: number;  // 1-based
  index: number;   // 0-based
}

enum TokenType {
  IDENTIFIER = 'IDENTIFIER',
  TAGNAME = 'TAGNAME',
  STRING = 'STRING',
  NUMBER = 'NUMBER',
  BOOLEAN = 'BOOLEAN',
  LBRACE = 'LBRACE',
  RBRACE = 'RBRACE',
  LPAREN = 'LPAREN',
  RPAREN = 'RPAREN',
  COLON = 'COLON',
  COMMA = 'COMMA',
  NEWLINE = 'NEWLINE',
  EOF = 'EOF'
}

Token Types

| TokenType | Description | Examples | |-----------|-------------|----------| | IDENTIFIER | Lowercase identifier (strict validation) | meta, title, color.primary | | TAGNAME | Component name starting with uppercase (no underscore) | Div, Container, H1 | | STRING | All string literals | "value", "text" | | NUMBER | Numbers | 42, 3.14 | | BOOLEAN | Booleans | true, false | | LBRACE | Opening brace | { | | RBRACE | Closing brace | } | | LPAREN | Opening parenthesis | ( | | RPAREN | Closing parenthesis | ) | | COLON | Colon | : | | COMMA | Comma | , | | NEWLINE | Newline (whitespace is skipped) | \n | | EOF | End of file | - |

Grammar Rules

Identifier (Strict Validation)

Must start with lowercase letter
Only lowercase letters, digits, and dots (.) allowed
No consecutive dots (..)
Cannot start or end with dot
Cannot start with digit

Valid examples:

meta
color.primary
font.size

Invalid examples:

.invalid - starts with dot
invalid. - ends with dot
invalid..name - consecutive dots
9invalid - starts with digit
INVALID - uppercase not allowed

TagName (Strict Validation)

Must start with uppercase letter
Only letters (uppercase/lowercase) and digits allowed
No underscores (_)

Valid examples:

Container
Button
H1
Div2

Invalid examples:

my_component - starts lowercase and has underscore
_Component - starts with underscore
component - starts with lowercase

String Literals

Enclosed in double quotes (")
Only \" escape allowed (Escape Strategy A)
- \" → " (escaped quote)
- Other escape sequences are errors (\n, \t, \\ not allowed)
No multiline strings
Backslash (\) forbidden outside strings

Valid examples:

"Hello"
"He said \"Hi\""
"Value: 123"

Invalid examples:

"Line\nBreak" - \n escape not allowed
"Tab\there" - \t escape not allowed
"Path\\to\\file" - \\ escape not allowed
'single quotes' - only double quotes allowed

Numbers

Integers and decimals supported
No negative numbers
No units

Examples:

42
3.14
0.5

Booleans

true
false

Forbidden Characters

The following characters throw errors when used outside strings:

| Character | Reason | |-----------|--------| | ; | Prevents statement separation confusion | | ' | Only double quotes allowed for strings | | ` | Prevents template literal confusion | | @ | Reserved for future decorators | | # | Reserved for future directives | | $ | Reserved for future template syntax | | %, ^, &, *, =, +, | | Prevents operator confusion | | \ | Forbidden outside strings; only \" allowed inside strings | | <, >, ?, ~ | Prevents comparison operator confusion | | \t | Only spaces allowed for indentation | | \r\n | Only LF newlines allowed |

Backslash Rules

Outside strings (NORMAL/ATTRIBUTES mode):

Backslash (\) throws error
Example: meta { value: \ } ❌

Inside strings (STRING_LITERAL mode):

Only \" escape allowed
All other escapes throw errors
Examples:
- "He said \"Hi\"" ✅
- "Line\nBreak" ❌

Examples

Extract Component Names

import { tokenize, TokenType } from '@uih-dsl/tokenizer';

const source = `
layout {
  Container {
    Button {
      "Click"
    }
  }
}
`;

const tokens = tokenize(source);
const components = tokens
  .filter(t => t.type === TokenType.TAGNAME)
  .map(t => t.value);

console.log(components); // ['Container', 'Button']

Token Position Information

const tokens = tokenize(source);

tokens.forEach(token => {
  const { line, column } = token.range.start;
  console.log(`${token.type} at ${line}:${column}: "${token.value}"`);
});

String Token Context

// All strings are tokenized as STRING type
// Context determination is the parser's responsibility

// Property value
meta { title: "Hello" }  // "Hello" is STRING

// Layout child
layout {
  H1 {
    "Hello"  // "Hello" is also STRING (parser determines context)
  }
}

Custom Token Processing

import { tokenize, TokenType, Token } from '@uih-dsl/tokenizer';

function extractStrings(source: string): string[] {
  const tokens = tokenize(source);
  return tokens
    .filter(t => t.type === TokenType.STRING)
    .map(t => t.value.slice(1, -1)); // Remove quotes
}

function findIdentifiers(source: string): string[] {
  const tokens = tokenize(source);
  return tokens
    .filter(t => t.type === TokenType.IDENTIFIER)
    .map(t => t.value);
}

const source = `
meta {
  title: "My App"
  color.primary: "#0E5EF7"
}
`;

console.log(extractStrings(source));    // ['My App', '#0E5EF7']
console.log(findIdentifiers(source));   // ['meta', 'title', 'color.primary']

Error Messages

The tokenizer provides clear, actionable error messages with precise location information:

Tokenizer Error at line 5, column 12: Forbidden character ';'
Tokenizer Error at line 3, column 8: Invalid identifier - must start with lowercase
Tokenizer Error at line 7, column 15: Invalid string escape sequence '\n'
Tokenizer Error at line 2, column 5: Consecutive dots in identifier

Performance

Speed: ~100,000 tokens/second on modern hardware
Memory: O(n) where n is input length
Complexity: Single-pass tokenization

Benchmark results (1000 lines UIH file):

Tokenization: ~2ms
Memory usage: ~50KB

Testing

# Run tests
npm test

# Run specific test files
npx tsx test-runner.ts
npx tsx test-advanced.ts

# Run with coverage
npm test -- --coverage

Integration with Parser

The tokenizer is designed to work seamlessly with @uih-dsl/parser:

import { tokenize } from '@uih-dsl/tokenizer';
import { parse } from '@uih-dsl/parser';

const source = `
meta {
  title: "My App"
}
`;

const tokens = tokenize(source);
const ast = parse(tokens);

TypeScript Support

Full TypeScript support with exported types:

import type {
  Token,
  TokenType,
  Range,
  Position
} from '@uih-dsl/tokenizer';

Related Packages

@uih-dsl/parser - AST parser
@uih-dsl/cli - CLI tool

License

MIT

Contributing

See CONTRIBUTING.md

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@uih-dsl/tokenizer

Overview

Installation

Usage

Basic Tokenization

Error Handling

API

tokenize(input: string): Token[]

Type Definitions

Token Types

Grammar Rules

Identifier (Strict Validation)

TagName (Strict Validation)

String Literals

Numbers

Booleans

Forbidden Characters

Backslash Rules

Examples

Extract Component Names

Token Position Information

String Token Context

Custom Token Processing

Error Messages

Performance

Testing

Integration with Parser

TypeScript Support

Related Documentation

Related Packages

License

Contributing

Support

`tokenize(input: string): Token[]`