# tree-sitter-markdown-text
v0.2.1
Markdown grammar for tree-sitter, with a textlint-style AST shape
Markdown grammar for tree-sitter, shaped so that its AST lines up with the textlint TxtNode model.
It parses `.md` (and `.markdown`, `.mdown`, `.mkd`, `.mkdn`) files into a concrete syntax tree covering the full CommonMark block structure plus common extensions (GFM pipe tables, task lists, GFM alerts, YAML/TOML front matter, Pandoc math and directive blocks, footnotes, MDX JSX). Inline content is surfaced as structured children of the `inline` wrapper: classified tokens (`word_token`, `numeric_token`, `identifier_like_token`, `path_like_token`) and punctuation-class nodes (`terminator`, `separator`, `bracket`, `operator_like`), plus inline structural nodes (`emphasis`, `strong`, `strikethrough`, `link`, `image`, `autolink`, `inline_code`, `html_inline`, `math_inline`, `mdx_jsx_inline`, `footnote_reference`).
## Features

### Block nodes
- Document structure: `document`, nested `section` wrappers around ATX headings, `paragraph`, `blank_line` (as a first-class node).
- Headings: ATX (`#`..`######`) and setext (`===` / `---`), with the heading level exposed as a `level` field on both `atx_heading` and `setext_heading`.
- Code blocks: indented code blocks and fenced code blocks (backtick and tilde), with `info_string` / `language` children for the GFM language tag.
- Math blocks: Pandoc/GitLab/KaTeX display math (`$$…$$`) as a dedicated `math_block` with `math_block_delimiter` / `math_block_content` children.
- Lists: unordered (`+` / `-` / `*`) and ordered (`1.` / `1)`) list markers. GFM task list items are promoted to `task_list_item` (distinct from `list_item`), with `task_list_marker_checked` / `task_list_marker_unchecked` markers.
- Block quotes and callouts: nested quotes and lazy continuations. A block quote whose first paragraph begins with `[!NOTE]` / `[!TIP]` / `[!IMPORTANT]` / `[!WARNING]` / `[!CAUTION]` (or any uppercase-only label) is surfaced as `callout` with a `callout_type` field.
- Thematic breaks: `---`, `***`, `___`.
- HTML blocks: all 7 CommonMark HTML block types; block-level HTML comments are aliased to `html_comment_block` for easy metric extraction.
- MDX JSX blocks: shallow `mdx_jsx_block` for lines that start with an MDX-style JSX element (`<Component ...>`, `<Component/>`, `</Component>`). Component-style mixed-case names disambiguate from all-caps HTML blocks such as `<DIV>`.
- Pipe tables: `pipe_table` with `pipe_table_header`, `pipe_table_delimiter_row`, `pipe_table_row`, `pipe_table_cell`, `pipe_table_align_left` / `pipe_table_align_right`.
- Link reference definitions: `link_reference_definition` with `link_label` / `link_destination` / `link_title` children.
- Footnote definitions: `footnote_definition` (`[^id]: …`) with a `footnote_label` child.
- Directive blocks: generic container directives (`:::name … :::`, per remark-directive / MyST / Pandoc fenced divs) as `directive_block` with `directive_block_delimiter` / `directive_name` / `directive_block_content` children.
- Image blocks: a paragraph consisting of a single block-level image (an image on its own line) is surfaced as `image_block` with `link_label` / `link_destination` children.
- Front matter: YAML (`---` fenced) as `minus_metadata`, TOML (`+++` fenced) as `plus_metadata`.
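
The callout rule above (a block quote whose first paragraph opens with `[!LABEL]`, where the label is uppercase-only) can be illustrated in plain Python. This is only a sketch of the rule as stated, not the grammar's scanner: the function name `callout_type_of` is hypothetical, and treating "uppercase-only" as ASCII `A-Z` is an assumption.

```python
import re
from typing import Optional

# Assumption: "uppercase-only label" means one or more ASCII letters A-Z.
CALLOUT_RE = re.compile(r"\[!([A-Z]+)\]")


def callout_type_of(first_paragraph: str) -> Optional[str]:
    """Return the label if the text opens with [!LABEL], else None."""
    match = CALLOUT_RE.match(first_paragraph.lstrip())
    return match.group(1) if match else None


print(callout_type_of("[!NOTE] Remember to rebuild the parser."))
print(callout_type_of("[!Note] Mixed case is not a callout label."))
```

Under these assumptions a custom label such as `[!CUSTOM]` is accepted as well, which matches the "any uppercase-only label" wording.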
### Inline nodes (children of the `inline` wrapper)
- Classified text tokens: `text_span` wraps runs of classified tokens, namely `word_token` (Unicode alphabetic), `numeric_token` (integers, decimals, versions), `identifier_like_token` (camelCase / PascalCase / snake_case), and `path_like_token` (paths with `/` separators or dotted identifiers).
- Punctuation classes: every punctuation lexeme is classified as `terminator` (`.`, `?`, `!`, `。`, `…`), `separator` (`,`, `;`, `:`), `bracket` (`(`, `)`, `[`, `]`, `{`, `}`, `<`, `>`), or `operator_like` (`::`, `->`, `=>`, `=`, `+`, `-`, `*`, `/`, `|`, `&`, and other punctuation).
- Emphasis / strong / strikethrough: `emphasis` (`*…*` or `_…_`), `strong` (`**…**` or `__…__`), `strikethrough` (`~~…~~`), each with a `_delimiter` / `_content` / `_delimiter` sub-tree.
- Code spans: `inline_code` with matched backtick-run delimiters (1 or 2 backticks).
- Links and images: `link` (inline, full-reference, collapsed-reference, shortcut-reference forms) and `image` (inline or `![alt][ref]` reference form). Both expose `link_label` / `link_destination` / `link_title` children.
- Autolinks: `autolink` with `uri` or `email` children for URI (`<https://…>`) and email autolinks.
- Raw HTML inline: `html_inline` with `html_open_tag` / `html_close_tag` / `html_comment` / `html_cdata` / `html_declaration` / `html_processing_instruction` children.
- MDX JSX inline: shallow `mdx_jsx_inline` with `mdx_jsx_open_tag` / `mdx_jsx_close_tag` / `mdx_jsx_expression` children.
- Inline math: `math_inline` (`$…$`) with `math_inline_delimiter` / `math_inline_content` children. Disambiguated from `math_block` (`$$…$$`).
- Footnote references: `footnote_reference` (`[^id]` inside prose) with a `footnote_reference_label` child.

Injections query: ships a `queries/injections.scm` that injects into fenced-code-block info strings, HTML blocks, and front matter.
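
The punctuation classes listed above amount to a lookup with a fallback, which the following Python sketch mirrors. It is an illustration only: the function name `punctuation_class` is hypothetical, and it assumes the input is already a single lexeme (how the grammar splits runs of punctuation into lexemes is not described here).

```python
# Lexemes copied from the feature list; anything else falls back to operator_like.
TERMINATORS = {".", "?", "!", "。", "…"}
SEPARATORS = {",", ";", ":"}
BRACKETS = {"(", ")", "[", "]", "{", "}", "<", ">"}


def punctuation_class(lexeme: str) -> str:
    """Classify one punctuation lexeme into the node name used for it."""
    if lexeme in TERMINATORS:
        return "terminator"
    if lexeme in SEPARATORS:
        return "separator"
    if lexeme in BRACKETS:
        return "bracket"
    # "::", "->", "=>", "=", "+", "-", "*", "/", "|", "&" and other punctuation.
    return "operator_like"


for lexeme in [".", ":", "::", ">", "->"]:
    print(repr(lexeme), punctuation_class(lexeme))
```

Matching on the whole lexeme is what keeps `::` and `->` in `operator_like` even though `:` and `>` on their own belong to other classes.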
## Example
````markdown
# Heading

A paragraph with inline content.

- one
- two

```go
func main() {}
```
````

Parsed tree (abbreviated):

```
(document
  (section
    (atx_heading
      level: (atx_h1_marker)
      heading_content: (inline))
    (blank_line)
    (paragraph (inline))
    (blank_line)
    (list
      (list_item (list_marker_minus) (paragraph (inline)))
      (list_item (list_marker_minus) (paragraph (inline))))
    (blank_line)
    (fenced_code_block
      (fenced_code_block_delimiter)
      (info_string (language))
      (code_fence_content)
      (fenced_code_block_delimiter))))
```
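
When only the S-expression text is at hand (for example, output captured from one of the binding calls that print the tree), node types can be tallied with plain string processing. The sketch below does that for the abbreviated tree shown above; `count_node_types` is a hypothetical helper, and it assumes the S-expression contains node names and field labels only, with no quoted source text.

```python
import re
from collections import Counter

# The abbreviated tree from the example, joined into one string.
SEXP = (
    "(document (section (atx_heading level: (atx_h1_marker) heading_content: (inline)) "
    "(blank_line) (paragraph (inline)) (blank_line) "
    "(list (list_item (list_marker_minus) (paragraph (inline))) "
    "(list_item (list_marker_minus) (paragraph (inline)))) "
    "(blank_line) (fenced_code_block (fenced_code_block_delimiter) "
    "(info_string (language)) (code_fence_content) (fenced_code_block_delimiter))))"
)


def count_node_types(sexp: str) -> Counter:
    """Count node names, i.e. every identifier that directly follows '('."""
    return Counter(re.findall(r"\(([A-Za-z_][A-Za-z0-9_]*)", sexp))


counts = count_node_types(SEXP)
print(counts["blank_line"], counts["list_item"])  # prints: 3 2
```

Field labels such as `level:` are skipped because they are not preceded by an opening parenthesis.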
## Relationship to textlint
The grammar is structurally close to the textlint AST. Every block-level `TxtNode` type has a direct counterpart here; inline `TxtNode` types (`Str`, `Emphasis`, `Strong`, `Link`, `Image`, `Code`, `Html`, `Delete`, `FootnoteReference`) also have direct counterparts as children of the `inline` wrapper. Names stay snake_case per the tree-sitter convention; consumers map names themselves. See [docs/textlint-mapping.md](docs/textlint-mapping.md) for the full table.
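
Since consumers map names themselves, a lookup table is usually enough on the consumer side. The Python sketch below pairs the inline node names from the feature list with the `TxtNode` types named above. The pairs are guesses inferred from the names alone (in particular `text_span` to `Str` is an assumption), and `to_txtnode_type` is a hypothetical helper; `docs/textlint-mapping.md` remains the authoritative table.

```python
from typing import Optional

# Assumed pairs, inferred from node names; not copied from docs/textlint-mapping.md.
INLINE_TO_TXTNODE = {
    "text_span": "Str",  # assumption: classified token runs play the role of Str
    "emphasis": "Emphasis",
    "strong": "Strong",
    "strikethrough": "Delete",
    "link": "Link",
    "image": "Image",
    "inline_code": "Code",
    "html_inline": "Html",
    "footnote_reference": "FootnoteReference",
}


def to_txtnode_type(node_type: str) -> Optional[str]:
    """Return the assumed TxtNode type for a grammar node name, or None."""
    return INLINE_TO_TXTNODE.get(node_type)


print(to_txtnode_type("strikethrough"))
print(to_txtnode_type("math_inline"))
```

Returning `None` for unmapped names keeps the caller in charge of nodes with no entry in this sketch, such as `math_inline`.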
## Installation
### npm

```sh
npm install tree-sitter-markdown-text
```

### Cargo

```sh
cargo add tree-sitter-markdown-text
```

### PyPI

```sh
pip install tree-sitter-markdown-text
```

### Go

```go
import tree_sitter_markdown_text "github.com/ophidiarium/tree-sitter-markdown-text/bindings/go"
```

The root package also exports the bundled queries via `go:embed`:

```go
import markdown "github.com/ophidiarium/tree-sitter-markdown-text"

lang := markdown.GetLanguage()
query, _ := markdown.GetHighlightsQuery()
```

## Usage
### Node.js

```js
import Parser from "tree-sitter";
import Markdown from "tree-sitter-markdown-text";

const parser = new Parser();
parser.setLanguage(Markdown);

const tree = parser.parse("# hello\n");
console.log(tree.rootNode.toString());
```

### Rust
```rust
let mut parser = tree_sitter::Parser::new();
let language = tree_sitter_markdown_text::LANGUAGE;
parser.set_language(&language.into()).unwrap();

let tree = parser.parse("# hello\n", None).unwrap();
println!("{}", tree.root_node().to_sexp());
```

### Python
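```python
from tree_sitter import Language, Parser
import tree_sitter_markdown_text

parser = Parser(Language(tree_sitter_markdown_text.language()))

tree = parser.parse(b"# hello\n")
# str() yields the S-expression; Node.sexp() is not available in recent py-tree-sitter releases.
print(str(tree.root_node))
```

## Credits and references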
- tree-sitter-grammars/tree-sitter-markdown: upstream grammar, specifically the `split_parser` branch's block grammar, which this grammar is derived from.
- textlint TxtNode: the AST shape this grammar targets for compatibility.
- CommonMark Spec: the block structure this grammar implements.
- GitHub Flavored Markdown: for the pipe-table and task-list extensions.
