hacklight

v1.1.0

Published

6 months ago

A lightweight, pragmatic syntax highlighter that's good enough™

panphora

syntax highlighter tokenizer lightweight

Hacklight

A lightweight, pragmatic syntax highlighter that's good enough™.

What It Is

Hacklight is a simple tokenizer and syntax highlighter that aims to tokenize roughly 95% of real-world code correctly across eight popular languages. It's designed to be:

Small - Single file, minimal dependencies
Fast - Simple regex-based tokenization
Browser-friendly - Works anywhere JavaScript runs
Pragmatic - Handles the common cases well, doesn't sweat the edge cases

What It Isn't

This is not a perfect parser. It will get some things wrong, especially:

Complex nested template literals
Exotic regex patterns
Language-specific edge cases
Mixed-language contexts beyond HTML/CSS/JS

If you need 100% accuracy, use a proper AST parser. If you need something that works well for 95% of code you'll encounter and fits in a single file, this might be for you.

Supported Languages

JavaScript (including JSX, modern ES2024 features)
TypeScript
HTML (with embedded CSS/JS)
CSS (including CSS variables, at-rules)
Python
Java
C/C++
Go

Usage

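// Assumes createTokenizer and tokensToHtml are already in scope
// (load them from the Hacklight file however your setup requires).
const { tokenize } = createTokenizer('auto');  // auto-detects HTML or JS
const tokens = tokenize(sourceCode);  // sourceCode: the text to highlight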

// Convert to HTML with syntax highlighting
const html = tokensToHtml(tokens);
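
The README does not document what tokensToHtml emits, so the following is only a rough sketch of how a token list could be turned into HTML. The { type, value } token shape, the renderTokens and escapeHtml helpers, and the "hl-" class prefix are all assumptions made for illustration, not Hacklight's actual API or output.

```javascript
// Hypothetical renderer: the { type, value } token shape and the "hl-" class
// prefix are assumptions for illustration, not Hacklight's documented output.
const HTML_ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };

// Escape characters that would otherwise be interpreted as markup.
function escapeHtml(text) {
  return text.replace(/[&<>"]/g, (ch) => HTML_ESCAPES[ch]);
}

function renderTokens(tokens) {
  return tokens
    .map((token) => {
      const text = escapeHtml(token.value);
      // Leave whitespace and newlines unwrapped to keep the markup small.
      if (token.type === 'whitespace' || token.type === 'newline') return text;
      return `<span class="hl-${token.type}">${text}</span>`;
    })
    .join('');
}

// Hand-written token list standing in for the output of tokenize().
const sampleTokens = [
  { type: 'keyword', value: 'if' },
  { type: 'whitespace', value: ' ' },
  { type: 'punctuation', value: '(' },
  { type: 'identifier', value: 'a' },
  { type: 'whitespace', value: ' ' },
  { type: 'operator', value: '<' },
  { type: 'whitespace', value: ' ' },
  { type: 'number', value: '1' },
  { type: 'punctuation', value: ')' },
];

const renderedHtml = renderTokens(sampleTokens);
console.log(renderedHtml);
```

Whatever the real output looks like, escaping each token's text before wrapping it matters: source code routinely contains <, > and &, and inserting it unescaped would corrupt the page.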

Token Types

keyword - Language keywords (if, for, class, etc.)
identifier - Variables, function names, etc.
string - String literals
number - Numeric literals
comment - Comments
operator - Operators (+, -, =, etc.)
punctuation - Brackets, semicolons, etc.
regex - Regular expressions (JS)
html_tag - HTML tags
attr_name / attr_bool - HTML attributes
css_selector - CSS selectors
css_variable - CSS custom properties
css_at - CSS at-rules (@media, etc.)
error_string - Unterminated strings
whitespace / newline - Formatting
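
As one example of putting these types to work, the sketch below scans a token list for error_string tokens and reports the line each one appears on. It assumes tokens are { type, value } objects and that each newline token stands for exactly one line break; neither is confirmed by this README, and findUnterminatedStrings is a hypothetical helper, not part of Hacklight.

```javascript
// Assumption: tokens are { type, value } objects; Hacklight's real shape may differ.
function findUnterminatedStrings(tokens) {
  const problems = [];
  let line = 1;
  for (const token of tokens) {
    if (token.type === 'error_string') {
      problems.push({ line, text: token.value });
    }
    // Assumes each 'newline' token represents exactly one line break.
    if (token.type === 'newline') line += 1;
  }
  return problems;
}

// Hand-written tokens for:  const a=1 <newline> b="oops
const exampleTokens = [
  { type: 'keyword', value: 'const' },
  { type: 'whitespace', value: ' ' },
  { type: 'identifier', value: 'a' },
  { type: 'operator', value: '=' },
  { type: 'number', value: '1' },
  { type: 'newline', value: '\n' },
  { type: 'identifier', value: 'b' },
  { type: 'operator', value: '=' },
  { type: 'error_string', value: '"oops' },
];

const unterminated = findUnterminatedStrings(exampleTokens);
console.log(unterminated);
```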

Philosophy

Perfect is the enemy of good. This tokenizer makes practical trade-offs:

Speed over correctness - Uses regex instead of full parsing
Simplicity over completeness - One file, minimal state machine
Common cases over edge cases - Handles typical code patterns well
Pragmatism over purity - Some heuristics and "good enough" decisions
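
To make the "regex instead of full parsing" trade-off concrete, here is a minimal sketch of a regex-driven tokenizer loop. It is not Hacklight's implementation: the rule list, the keyword set, and the sketchTokenize name are invented for illustration and cover only a tiny JavaScript-like subset.

```javascript
// Illustrative only: these rules are NOT Hacklight's actual patterns.
// Each rule uses the sticky flag (y) so it can only match at lastIndex.
const RULES = [
  ['comment', /\/\/[^\n]*/y],
  ['string', /"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'/y],
  ['number', /\d+(?:\.\d+)?/y],
  ['word', /[A-Za-z_$][\w$]*/y],
  ['newline', /\n/y],
  ['whitespace', /[ \t]+/y],
  ['operator', /[+\-*\/=<>!&|]+/y],
  ['punctuation', /[()[\]{};,.]/y],
];
const KEYWORDS = new Set(['if', 'else', 'for', 'while', 'return', 'const', 'let', 'function', 'class']);

function sketchTokenize(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    let matched = false;
    // First rule that matches at the current position wins.
    for (const [type, pattern] of RULES) {
      pattern.lastIndex = pos;
      const m = pattern.exec(source);
      if (m) {
        const value = m[0];
        // Words are split into keywords and identifiers by a set lookup.
        const finalType = type === 'word' ? (KEYWORDS.has(value) ? 'keyword' : 'identifier') : type;
        tokens.push({ type: finalType, value });
        pos += value.length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Unknown character: emit it as punctuation and move on rather than failing.
      tokens.push({ type: 'punctuation', value: source[pos] });
      pos += 1;
    }
  }
  return tokens;
}

const sketchTokens = sketchTokenize('const x = 42; // hi');
console.log(sketchTokens.map((t) => t.type).join(' '));
```

The loop never backtracks and keeps no grammar state, which is what keeps this style fast and small. The cost is that anything needing context, such as nesting or regex-versus-division decisions, has to be handled by extra heuristics.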

Known Limitations

Telling regex literals apart from division operators relies on heuristics that can be fooled
Template literals, including their interpolations, are tokenized as a single string
HTML/CSS/JS context switching is simplified
No semantic understanding (can't distinguish types from variables)
Some exotic syntax constructs may tokenize incorrectly
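
The regex-versus-division limitation is easier to see with an example of the kind of heuristic involved. The sketch below decides how to treat a "/" by looking only at the previous non-whitespace token. It is a common approach, shown here under assumptions: slashStartsRegex is a hypothetical helper, the { type, value } token shape is assumed, and Hacklight's own rule may differ.

```javascript
// Illustrative heuristic, not Hacklight's actual rule.
// prevToken: the previous non-whitespace token, or null at the start of input.
function slashStartsRegex(prevToken) {
  if (!prevToken) return true; // nothing before it: must be a regex
  switch (prevToken.type) {
    case 'number':
    case 'identifier':
    case 'string':
    case 'regex':
      return false; // a value just ended, so "/" is division
    case 'keyword':
      // "return /x/" starts a regex, but "this / 2" is division.
      return !['this', 'super', 'true', 'false', 'null'].includes(prevToken.value);
    case 'punctuation':
      // After ")" or "]" an expression has usually just ended.
      return prevToken.value !== ')' && prevToken.value !== ']';
    default:
      return true; // operators and anything else: expect an operand
  }
}

console.log(slashStartsRegex({ type: 'identifier', value: 'total' })); // division
console.log(slashStartsRegex({ type: 'operator', value: '=' }));       // regex
```

A rule like this is fooled by code such as if (x) /re/.test(y): the previous token is ")", so the slash is treated as division even though it begins a regex. Getting that case right requires knowing that the parenthesis closed an if condition, which is the sort of parser state a lightweight tokenizer deliberately avoids.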

License

MIT

Contributing

This project is intentionally kept simple. Bug fixes are welcome, but feature additions that add complexity will likely be declined. The goal is to stay small and "good enough".
