tag-soup
v3.1.0
The fastest pure JS SAX/DOM XML/HTML parser.
TagSoup is the fastest pure JS SAX/DOM XML/HTML parser and serializer.
- Extremely low memory consumption.
- Tolerant of malformed tag nesting, missing end tags, etc.
- Recognizes CDATA sections, processing instructions, and DOCTYPE declarations.
- Supports both strict XML and forgiving HTML parsing modes.
- 20 kB gzipped, including dependencies.
- Check out TagSoup dependencies: Speedy Entities and Flyweight DOM.
npm install --save-prod tag-soup
DOM parsing
TagSoup exports a preconfigured HTMLDOMParser, which parses HTML markup into a DOM node. This parser never throws errors during parsing and forgives malformed markup:
import { HTMLDOMParser, toHTML } from 'tag-soup';
const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment
toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
HTMLDOMParser decodes both HTML entities and numeric character references with decodeHTML.
XMLDOMParser parses XML markup into a DOM node. It throws a ParserError if the markup doesn't satisfy the XML spec:
import { XMLDOMParser, toXML } from 'tag-soup';
XMLDOMParser.parseFragment('<p>hello</br>');
// ❌ ParserError: Unexpected end tag.
const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');
// ⮕ DocumentFragment
toXML(fragment);
// ⮕ '<p>hello<br/></p>'
XMLDOMParser decodes both XML entities and numeric character references with decodeXML.
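To make the decoding step concrete, here is a minimal sketch of what decoding XML entities and numeric character references involves. It is not the Speedy Entities implementation: `decodeXMLSketch` is a hypothetical helper written for illustration, and it assumes that unrecognized references should be left untouched.

```javascript
// Illustrative only: a hypothetical helper, not the decodeXML shipped with TagSoup.
// The five entities predefined by the XML specification.
const XML_ENTITIES = new Map([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
]);

function decodeXMLSketch(input) {
  // Matches &name;, decimal &#233; and hexadecimal &#x2615; references.
  return input.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] !== '#') {
      // Named entity: fall back to the original text when it is unknown.
      return XML_ENTITIES.get(body) ?? match;
    }
    const isHex = body[1] === 'x';
    const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
    // Leave references outside the Unicode range as they were.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeXMLSketch('&lt;p&gt;caf&#233; &#x2615;&lt;/p&gt;'));
// ⮕ '<p>café ☕</p>'
```

HTML decoding follows the same shape but needs a much larger table of named entities, which is one reason to delegate it to a dedicated library.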
TagSoup uses Flyweight DOM nodes, which provide many standard DOM manipulation features:
const document = HTMLDOMParser.parseDocument('<!DOCTYPE html><html>hello</html>');
document.doctype.name;
// ⮕ 'html'
document.textContent;
// ⮕ 'hello'
For example, you can use TreeWalker to traverse DOM nodes:
import { TreeWalker, NodeFilter } from 'flyweight-dom';
const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');
const treeWalker = new TreeWalker(fragment, NodeFilter.SHOW_TEXT);
treeWalker.nextNode();
// ⮕ Text { 'hello' }
Create a custom DOM parser using createDOMParser:
import { createDOMParser } from 'tag-soup';
const myParser = createDOMParser({
  voidTags: ['br'],
});
myParser.parseFragment('<p><br></p>');
// ⮕ DocumentFragment
SAX parsing
TagSoup exports a preconfigured HTMLSAXParser, which parses HTML markup and calls handler methods when a token is read. This parser never throws errors during parsing and forgives malformed markup:
import { HTMLSAXParser } from 'tag-soup';
HTMLSAXParser.parseFragment('<p>hello<p>cool</br>', {
  onStartTagOpening(tagName) {
    // Called with 'p', 'p', and 'br'
  },
  onText(text) {
    // Called with 'hello' and 'cool'
  },
});
XMLSAXParser parses XML markup and calls handler methods when a token is read. It throws a ParserError if the markup doesn't satisfy the XML spec:
import { XMLSAXParser } from 'tag-soup';
XMLSAXParser.parseFragment('<p>hello</br>', {});
// ❌ ParserError: Unexpected end tag.
XMLSAXParser.parseFragment('<p>hello<br/></p>', {
  onEndTag(tagName) {
    // Called with 'br' and 'p'
  },
});
Create a custom SAX parser using createSAXParser:
import { createSAXParser } from 'tag-soup';
const myParser = createSAXParser({
  voidTags: ['br'],
});
myParser.parseFragment('<p><br></p>', {
  onStartTagOpening(tagName) {
    // Called with 'p' and 'br'
  },
});
Tokenization
TagSoup exports a preconfigured HTMLTokenizer, which parses HTML markup and invokes a callback when a token is read. This tokenizer never throws errors during tokenization and forgives malformed markup:
import { HTMLTokenizer } from 'tag-soup';
HTMLTokenizer.tokenizeFragment('<p>hello<p>cool</br>', (token, startIndex, endIndex) => {
  // Handle token
});
XMLTokenizer parses XML markup and invokes a callback when a token is read. It throws a ParserError if the markup doesn't satisfy the XML spec:
import { XMLTokenizer } from 'tag-soup';
XMLTokenizer.tokenizeFragment('<p>hello</br>', (token, startIndex, endIndex) => {});
// ❌ ParserError: Unexpected end tag.
XMLTokenizer.tokenizeFragment('<p>hello<br/></p>', (token, startIndex, endIndex) => {
  // Handle token
});
Create a custom tokenizer using createTokenizer:
import { createTokenizer } from 'tag-soup';
const myTokenizer = createTokenizer({
  voidTags: ['br'],
});
myTokenizer.tokenizeFragment('<p><br></p>', (token, startIndex, endIndex) => {
  // Handle token
});
Serialization
TagSoup exports two preconfigured serializers: toHTML and toXML.
import { HTMLDOMParser, toHTML } from 'tag-soup';
const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment
toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
Create a custom serializer using createSerializer:
import { HTMLDOMParser, createSerializer } from 'tag-soup';
const mySerializer = createSerializer({
  voidTags: ['br'],
});
const fragment = HTMLDOMParser.parseFragment('<p>hello</br>');
// ⮕ DocumentFragment
mySerializer(fragment);
// ⮕ '<p>hello<br></p>'
Performance
Execution performance is measured in operations per second (± 5%); higher is better. Memory consumption (RAM) is measured in bytes; lower is better.
Performance was measured while parsing a 3.8 MB HTML file.
Tests were conducted using TooFast on Apple M1 with Node.js v23.11.1.
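For readers unfamiliar with the metric, the sketch below shows one simple way an operations-per-second figure can be obtained. It is not how TooFast works internally: `measureOpsPerSecond` is a hypothetical helper, the workload is a stand-in for a real parser call, and a proper harness would also add warm-up runs and repeated samples to arrive at an error margin such as ± 5%.

```javascript
// Illustrative only: a hypothetical helper, not part of TooFast or TagSoup.
// Runs `fn` repeatedly for roughly `durationMs` and reports calls per second.
function measureOpsPerSecond(fn, durationMs = 200) {
  let iterations = 0;
  let elapsed = 0;
  const start = performance.now();
  do {
    fn();
    iterations++;
    elapsed = performance.now() - start;
  } while (elapsed < durationMs);
  return (iterations * 1000) / elapsed;
}

// Stand-in workload. In a real measurement this would be a parser call,
// for example HTMLDOMParser.parseFragment(html).
const sample = '<p>hello</p>'.repeat(1000);
const opsPerSecond = measureOpsPerSecond(() => sample.split('<').length);

console.log(`${Math.round(opsPerSecond)} ops/sec`);
```

Numbers produced this way vary with the machine and the Node.js version, so they are only comparable when collected under the same conditions.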
To reproduce the performance test suite results, clone this repo and run:
npm ci
npm run build
npm run perf
Limitations
TagSoup doesn't resolve some quirky element structures that malformed HTML may cause.
Assume the following markup:
<p><strong>okay
<p>nope
With DOMParser, this markup would be transformed to:
<p><strong>okay</strong></p>
<p><strong>nope</strong></p>
TagSoup doesn't insert the second strong tag:
<p><strong>okay</strong></p>
<p>nope</p>
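The difference can be illustrated with a toy stack-based normalizer. This is a sketch under stated assumptions, not TagSoup's parser: `closeImpliedTags` is a hypothetical function that handles only bare lowercase start tags and text, and it assumes that an opening `<p>` implicitly closes the currently open `<p>` together with everything nested inside it.

```javascript
// Illustrative only: a toy normalizer, not TagSoup's actual algorithm.
function closeImpliedTags(markup) {
  const openTags = [];
  let output = '';

  // Emits end tags until only `depth` elements remain open.
  const closeDownTo = (depth) => {
    while (openTags.length > depth) {
      output += `</${openTags.pop()}>`;
    }
  };

  // Each match is either a bare start tag such as <p>, or a run of text.
  for (const [chunk, tagName] of markup.matchAll(/<([a-z]+)>|[^<]+/g)) {
    if (tagName === undefined) {
      // Text: whitespace is trimmed only to keep the output compact.
      output += chunk.trim();
      continue;
    }
    if (tagName === 'p') {
      // A new paragraph closes the open one and its descendants.
      const index = openTags.lastIndexOf('p');
      if (index !== -1) closeDownTo(index);
    }
    openTags.push(tagName);
    output += chunk;
  }

  closeDownTo(0);
  return output;
}

console.log(closeImpliedTags('<p><strong>okay\n<p>nope'));
// ⮕ '<p><strong>okay</strong></p><p>nope</p>'
```

A browser goes one step further: the HTML parsing algorithm remembers `strong` as an active formatting element and reopens it inside the second paragraph. The toy above has no such list, so the second `strong` never appears, which matches the limitation described here.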