takedown
v0.1.4
Customizable markdown parser
Takedown
A markdown parser that puts you in control.
The goal of this project is to have a compliant markdown parser that also allows for full control of the target document structure without going through an AST.
How do I use this?
Install.
> npm install takedown --save
import or require...
import takedown from 'takedown'
// or, for commonjs
let takedown = require('takedown/cjs').default
and then...
let markdown = 'Your markdown *here*!';
// create an "instance"
let td = takedown();
// make some HTML!
let html = td.parse(markdown).doc;
// => <p>Your markdown <em>here</em>!</p>
Simple!
What's the API here?
This section details the api for a takedown parser instance.
let td = takedown();
clone
td.clone(config: object): object
Returns a copy of the Takedown instance, optionally merging config atop the current configuration.
config
td.config: object
A proxy object for managing instance configuration.
Configuration values can be set when creating a parser instance.
let quotation = '<div class="blockquote">{value}</div>';
let td = takedown({ convert: { quotation } });
And config allows them to be updated directly on the instance.
td.config = { convert: { quotation } };
// or
td.config.convert = { quotation };
// or
td.config.convert.quotation = quotation;
All of the update methods above have the same effect (i.e., only the config.convert.quotation setting is affected, and previous defaults/changes remain in place). Errors are thrown for bad config settings.
All the config options are detailed later in this document.
parse
td.parse(markdown: string, config: object): object
Where all the magic happens - takes markdown and converts it to HTML (or whatever document structure is configured).
Use config to set local options that will be merged atop current instance defaults.
let html = td.parse('Welcome to **Takedown**!').doc;
// => <p>Welcome to <strong>Takedown</strong>!</p>
The returned object will have
- doc: the document produced by the converters
- source: the original markdown provided (almost - see below)
- matter: parsed front-matter (if fm.enabled and present in document)
- meta: data accumulated from the parsing process
Metadata (meta) will include:
- id: unique timestamp-based hex value for the document
- refs: link reference data parsed from the document
- globalRefs: link reference data set as refs config option
Note that source might be slightly different than the original markdown provided due to the removal of insecure characters (U+0000) and the replacement of structural tab characters with spaces.
parseMeta
td.parseMeta(markdown: string, fm: object): object
Gets front-matter from a document as object data. Returns undefined if fm.enabled is false.
Use fm to set local options that will be merged atop config.fm instance defaults.
td.config.fm.enabled = true;
// front-matter is parsed as JSON by default
let fm = td.parseMeta(markdown);
See the fm config option for more details on how front-matter is handled.
partition
td.partition(markdown: string, fm: object): array
Returns the unparsed markdown content and the front-matter, as separated via fm.capture, in an array.
Use fm to set local options that will be merged atop config.fm instance defaults.
If you do
let [ source, matter ] = td.partition(`
---
title: Markdown Page
---
# First Header Element
`);
then source would be
# First Header Element
and matter would be
---
title: Markdown Page
---
When fm.enabled is false, matter is undefined and source is returned as-is.
What are the config options?
convert
Strings or functions that specify how markdown entities are converted to document structure.
A string will be interpolated using insertion variables (as per What is "string conversion"? section below).
A function should be of the form (data: object, vars: object): string where
- data contains converter insertion variables, and
- vars are the configured variables (see vars config option)
The string returned from a function can also be interpolated with insertion variables.
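As a rough illustration of the function form, here is a sketch of a link converter that adds a rel attribute to absolute http(s) links. The `externalRel` variable and the `rel` property are names invented for this example, and the hand-made call at the bottom only imitates how a converter might be invoked; it is not Takedown itself.

```javascript
// Hypothetical function converter: adds rel="..." to absolute http(s) links.
// `externalRel` is an assumed entry in the `vars` config, not a built-in.
const link = (data, vars) => {
  // Properties added to `data` are available to the returned string.
  data.rel = /^https?:\/\//i.test(data.href ?? '') ? vars.externalRel : undefined;
  return '<a href="{href??}"{? rel="{rel}"?}{? title="{title}"?}>{value}</a>';
};

const config = {
  vars: { externalRel: 'noopener noreferrer' },
  convert: { link }
};

// Imitate a call by hand to see what the converter does to `data`.
const data = { href: 'https://example.com', value: 'Example' };
const template = config.convert.link(data, config.vars);
console.log(data.rel, template);
```

An object shaped like `config` could then be passed to `takedown(...)` or assigned to `td.config`.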
Here are the converters with default values and their insertion variables:
autolink
/*
Automatic hyperlink (inline).
- value: display URL
- url: encoded URL
*/
autolink: '<a href="{url}">{value}</a>'
code
/*
Code span (inline).
- value: code text
- chars: opening ticks
*/
code: '<code>{value}</code>'
codeblock
/*
Indented code block (block).
- value: code block source
*/
codeblock: '<pre><code>{value}</code></pre>\n'
divide
/*
Thematic break (block).
- chars: symbols used for break
*/
divide: '<hr />\n'
email
/*
Email address (inline).
- value: email address
- email: email address
*/
email: '<a href="mailto:{email}">{value}</a>'
emphasis
/*
Emphasis (inline).
- value: emphasis text
- child: child data
*/
emphasis: '<em>{value}</em>'
fenceblock
/*
Fenced code block (block).
- value: source content
- info: info-string
- fence: opening ticks
*/
fenceblock: e =>
{
e.lang = e.info?.match(/^\s*([^\s]+).*$/s)?.[1];
return '<pre><code{? class="language-{lang}"?}>{value}</code></pre>\n'
}
header
/*
ATX Header (block).
- value: text content
- level: header level (1-6)
- child: child data
*/
header: '<h{level}>{value}</h{level}>\n'
html
/*
HTML markup (inline).
- value: html content
*/
html: '{value}'
htmlblock
/*
HTML markup (block).
- value: html content
*/
htmlblock: '{value}'
image
/*
Image (inline).
- value: image description
- href: encoded image URL
- title: image description
- isref: is from a link ref definition?
- child: child data
*/
image: e =>
{
e.alt = e.value.replace(/<[^>]+?(?:alt="(.*?)"[^>]+?>|>)/ig, '$1');
return `<img src="{href}" alt="{alt}"{? title="{title}"?} />`;
}
linebreak
/*
Hard line break (inline).
nada.
*/
linebreak: '<br />'
link
/*
Hyperlink (inline).
- value: link text
- href: encoded link URL
- title: link description
- isref: is from a link ref definition?
- child: child data
*/
link: '<a href="{href??}"{? title="{title}"?}>{value}</a>'
listitem
/*
List item (block).
- value: list item content
- tight: suppress paragraphs?
- child: child data
*/
listitem: e =>
{
e.nl = e.child.count && (!e.tight || e.child.first !== 'paragraph') ? '\n' : '';
return '<li>{nl}{value}</li>\n';
}
olist
/*
Ordered list (block).
- value: list content
- start: starting index
- tight: suppress paragraphs?
- child: child data
*/
olist: e => `<ol${e.start !== 1 ? ` start="${e.start}"` : ''}>\n{value}</ol>\n`
paragraph
/*
Paragraph (block).
- value: paragraph content
- child: child data
*/
paragraph: ({ parent: p, index }) =>
p.tight ? '{value}' + (p.child.count - 1 === index ? '' : '\n') : '<p>{value}</p>\n'
quotation
/*
Blockquote (block).
- value: text content
- child: child data
*/
quotation: '<blockquote>\n{value}</blockquote>\n'
root
/*
Document root (block).
- value: entire document output
- child: child data
*/
root: '{value}'
setext
/*
Setext Header (block).
- value: setext header tag content
- level: setext header level (1-2)
- child: child data
*/
setext: '<h{level}>{value}</h{level}>\n'
strong
/*
Strong emphasis (inline).
- value: text content
- child: child data
*/
strong: '<strong>{value}</strong>'
ulist
/*
Unordered list (block).
- value: list content
- tight: suppress paragraphs?
- child: child data
*/
ulist: '<ul>\n{value}</ul>\n'
All of the target document structure is defined in the convert settings.
Use only {value} to render the content without any surrounding structure.
// no header tags!
td.config.convert = { header: '{value}' }
Omit {value} to suppress descendant output.
// no header content!
td.config.convert = { header: '<h{level}></h{level}>\n' }
Set to null or empty string to turn off output completely.
// no more headers!
td.config.convert = { header: null }
Where the child insertion variable is available, it will be an object having
- count: number of child entities (including text nodes)
- first: converter name of the first child (or "text" for text node)
- last: converter name of the last child (or "text" for text node)
Some additional variables are also available for every converter.
- name: the converter name
- id: unique timestamp-based hex value for entity
- meta: the same object returned from td.parse
- parent: parent converter's insertion variables (excluding value)
- index: 0-based position in the parent converter
The values of parent and index will be undefined for the root converter.
entities
Elements that parse individual markdown entities.
Each entity can look like
entities:
{
[name]:
{
// converter name
name: string,
// names of entities that can be children
nestable: [ ... string ],
// segment matching order
order: number,
// parent/child contested segment priority
priority: number,
// name of the parsing pattern to use
pattern: string,
// configuration for the pattern
patternData: { ... any },
// delouse settings
delouse:
{
// names of delousers to use for `output`
[output]: [ ... string ],
...
}
},
...
}
With the exception of pattern, all of the individual entity settings are optional. An unset name will default to the entity name, and an unset order or priority defaults to a value that places the entity among the last considered.
The default settings here mostly correlate with the converters, but see the entities page and the delousing doc for additional details.
This area is not well documented yet, and much of it is highly subject to change. It is advised to directly consult the source code if you plan on modifying entities. The eventual idea here is to allow for custom entities to be implemented, but significant work remains before that is possible.
fm
Settings for handling markdown front-matter.
Here are the defaults:
fm:
{
enabled: false,
capture: /^---\s*\n(?<fm>.*?)\n---\s*/s,
parser: source => JSON.parse(source),
useConfig: 'takedown',
varsOnly: false
}
Here's a rundown of the individual fm settings:
- enabled (boolean)
  Set to true to activate front-matter features. When false, td.parseMeta returns undefined, and td.parse assumes everything in the document is markdown.
- capture (RegExp)
  The regular expression to match front-matter. It must have an <fm> capture group, as its contents will be passed to the parser function.
- parser (function)
  Content from capture is passed to this function. It should return an object with parsed data or a nullish value.
- useConfig (boolean|string)
  Names a key in front-matter containing additional config options for the document. These options will be merged atop instance defaults and any manually set options (including those passed to td.parse).
  Set to true to indicate the front-matter itself is config options. Use false to turn this off completely.
- varsOnly (boolean)
  When set to true, front-matter configuration is assumed to consist solely of variable (vars) definitions, and will be merged accordingly. Has no effect if useConfig is false.
For obvious reasons, fm settings appearing in front-matter are ignored.
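To make the capture and parser settings more concrete, the sketch below runs the default capture expression by hand and swaps the JSON parser for a small "key: value" parser. `parseKeyValues` is a hypothetical helper written for this example (it is not real YAML), and the manual `exec` call only approximates what Takedown does internally.

```javascript
// Default capture expression from the fm settings.
const capture = /^---\s*\n(?<fm>.*?)\n---\s*/s;

// Hypothetical parser for flat "key: value" lines (not real YAML).
const parseKeyValues = source => {
  const result = {};
  for (const line of source.split('\n')) {
    const at = line.indexOf(':');
    if (at > 0) result[line.slice(0, at).trim()] = line.slice(at + 1).trim();
  }
  return result;
};

const fm = { enabled: true, capture, parser: parseKeyValues };

// Apply the settings by hand to a small document.
const sample = '---\ntitle: Markdown Page\n---\n# First Header Element\n';
const match = fm.capture.exec(sample);
const matter = fm.parser(match.groups.fm); // contents of the <fm> group
const rest = sample.slice(match[0].length); // everything after the front-matter
console.log(matter, JSON.stringify(rest));
```

An `fm` object like this one could be supplied as the fm config option or as the second argument to td.parseMeta.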
refs
Global link reference definitions.
This setting takes the following form:
refs:
{
[label]:
{
title: string,
url: string
},
...
}
Each entry in refs is a markdown link reference definition identified by a label (link label) and having a url (link destination) and an optional title (link title).
It is also important to note that the link label must be of normalized form, or it will never be matched by a reference link.
"Normalized form" is effectively lowercasing the text, trimming leading and trailing whitespace, and replacing consecutive internal whitespace characters with a single space.
This convenience allows for the use of a set of references across multiple documents. Where a document ref label collides with a global one, the document ref wins.
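Since labels must already be normalized, it can help to build the refs object through a small helper. The function below is only an approximation of the normalization described above (plain lowercasing rather than full Unicode case folding), and `normalizeLabel` is a name made up for this sketch.

```javascript
// Approximates label normalization: trim, collapse whitespace, lowercase.
const normalizeLabel = label => label.trim().replace(/\s+/g, ' ').toLowerCase();

// Hypothetical shared references, keyed by normalized label.
const refs = {
  [normalizeLabel('  CommonMark   Spec ')]: {
    url: 'https://spec.commonmark.org/',
    title: 'CommonMark Spec'
  }
};

console.log(Object.keys(refs)); // [ 'commonmark spec' ]
```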
vars
Insertion variables used in string conversion or passed to conversion functions.
Variable names can include only letters, numbers, and underscores. Nested variables (objects) are allowed and you can use dot-notation to access them in string conversion.
There are no default vars, but here's a shameless example.
vars:
{
something: 'Takedown rules'
}
After setting a variable (above), use it in a converter like so
convert:
{
emphasis: '<em>I gotta tell you {something}!</em>'
}
Dynamic Variables
To make a "dynamic" variable, use a function. Functions will be called with the current converter's insertion variables in string conversion. Functional converters will have to invoke a function variable directly.
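A minimal sketch of a dynamic variable, assuming the function receives an object with the entity's `value` as described above. The `slug` variable and the id attribute on the header are invented for illustration.

```javascript
// Hypothetical dynamic variable: derives a slug from the entity's `value`.
const slug = data =>
  String(data.value)
    .replace(/<[^>]+>/g, '')      // drop inline markup already rendered into value
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')  // runs of other characters become one hyphen
    .replace(/^-+|-+$/g, '');     // no leading or trailing hyphens

const config = {
  vars: { slug },
  convert: { header: '<h{level} id="{slug}">{value}</h{level}>\n' }
};

// Call the variable by hand, as a functional converter would have to.
console.log(config.vars.slug({ value: 'First <em>Header</em> Element' }));
// first-header-element
```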
What is "string conversion"?
It is how strings are interpolated with insertion variables.
There are two facets here:
variables
- variables
  To insert a variable into a string, use {name}, where name is the name of the variable to be inserted. If the replacement value is null or undefined, no replacement is made and the string remains as-is. Only letters, numbers, underscores, and periods are valid characters for name.
  To ensure replacement, use {name??text} syntax where text is the literal value to use when name is nullish.
  If a name is found in both entity data and vars, it is the entity data value that will be used in string conversion. A conversion function would need to be used in order to see both values.
- segments
  Use {?content?} syntax to identify an optional portion (segment) of the string where content will only be rendered if at least one internal variable is replaced. That is, if variable replacement within content results in the exact same string, the entire segment will be omitted.
  Nested segments are processed inside-out, with the results of inner segments constituting the initial state of outer ones.
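The rules above can be imitated in a few lines. This is a standalone sketch written for this document, not Takedown's actual implementation: `lookup`, `fillVariables`, and `interpolate` are hypothetical names, function-valued variables are not handled, and it assumes that substituting a `??` fallback counts as a replacement.

```javascript
// Resolve a dotted name such as "site.title" against an object.
const lookup = (data, name) =>
  name.split('.').reduce((obj, key) => (obj == null ? undefined : obj[key]), data);

// Replace {name} and {name??text}; an unresolved {name} is left as-is.
const fillVariables = (text, data) =>
  text.replace(/\{([\w.]+)(?:\?\?([^{}]*))?\}/g, (token, name, fallback) => {
    const value = lookup(data, name);
    if (value != null) return String(value);
    return fallback !== undefined ? fallback : token;
  });

// Matches one segment that has no other segment nested inside it.
const innermost = /\{\?((?:\{[\w.]+\?\?[^{}]*\}|(?!\{\?|\?\})[\s\S])*)\?\}/;

function interpolate(template, data) {
  let text = template;
  // Inside-out: keep resolving innermost segments until none remain.
  while (innermost.test(text)) {
    text = text.replace(innermost, (_, content) => {
      const filled = fillVariables(content, data);
      return filled === content ? '' : filled; // unchanged content drops the segment
    });
  }
  return fillVariables(text, data);
}

const link = '<a href="{href??}"{? title="{title}"?}>{value}</a>';
console.log(interpolate(link, { href: '/a', value: 'A' }));
// <a href="/a">A</a>
console.log(interpolate(link, { href: '/a', value: 'A', title: 'T' }));
// <a href="/a" title="T">A</a>
```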
Got any usage tips?
Yes!
Local Insertion Variables
The string returned from a function converter also gets interpolated for variables and segments. In the function, properties can be added to data (first parameter) and those will also be available for interpolation.
Metadata Accumulator
Use data.meta object in a function converter to capture information across the parsing run. It could be used by the header converter to build a TOC for the document, for example.
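A sketch of the TOC idea, with the converter called by hand so it runs on its own. The `toc` property is a name invented for this example, and the objects passed in only mimic the insertion variables a real header converter would receive.

```javascript
// Hypothetical header converter that records each heading in data.meta.
// `toc` is a property name invented for this sketch.
const header = data => {
  data.meta.toc ??= [];
  data.meta.toc.push({ level: data.level, text: data.value });
  return '<h{level}>{value}</h{level}>\n';
};

// Imitate two converter calls that share one meta object.
const meta = {};
header({ meta, level: 1, value: 'Intro' });
header({ meta, level: 2, value: 'Details' });
console.log(meta.toc.length); // 2
```

With a real instance, the converter would be assigned via `td.config.convert.header = header`, and the collected entries should then show up on the meta object returned from td.parse.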
Configure Efficiently
A configuration change on an instance (td) causes it to internally be flagged to be "rebuilt" on the next parse call.
When converting lots of documents with distinct configuration needs, it will be more performant to configure a separate instance for each document group rather than configuring a single instance on a per-document basis.
Internally, when options are passed directly to an instance method, or when front-matter is allowed to inform the configuration, the instance is cloned before parsing begins, and this can have the same potential performance hit.
Hopefully, this can be mitigated somewhat in a future release :wink:
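One way to follow this advice is to cache a configured instance per document group. The helper below is a sketch under that assumption: `makeParserPool` and the group names are hypothetical, and a stub factory stands in for `takedown` so the example runs on its own.

```javascript
// Keep one configured parser per document group instead of reconfiguring one instance.
// `createParser` stands in for the takedown factory; `groupConfigs` is hypothetical.
function makeParserPool(createParser, groupConfigs) {
  const pool = new Map();
  return group => {
    if (!pool.has(group)) pool.set(group, createParser(groupConfigs[group]));
    return pool.get(group);
  };
}

// Stub factory so the sketch runs on its own; pass `takedown` here in real use.
let created = 0;
const stubFactory = config => ({ id: ++created, config });

const parserFor = makeParserPool(stubFactory, {
  blog: { convert: { header: '<h{level} class="post">{value}</h{level}>\n' } },
  docs: { fm: { enabled: true } }
});

parserFor('blog');
parserFor('blog'); // reuses the instance created above
parserFor('docs');
console.log(created); // 2
```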
What else do I need to know?
CommonMark
Takedown's parsing and HTML generation out-of-the-box is CommonMark compliant as per spec version 0.31.2. The implementation is pure vanilla and does not add anything to the spec.
There are extra steps taken in the default convert settings (mostly concerning the placement of newlines) to get the output just right for matching the CM test-cases, but these have no effect on the structural correctness of the html output.
Test
To run tests, do
> npm test
The test runner will download the test-cases so an internet connection will be necessary.
Final Notes
Although Takedown is a fully standalone markdown parser, it was originally built to accommodate ACID, and its feature set is primarily driven by the same. As it matures, of course, it should be a great markdown parsing dependency for any application.
As an acknowledgement, this project was initially inspired by this article during the search for the markdown parser of my dreams. :smile:
Happy Markdown Parsing!
