@biblioteksentralen/xml-utils
v0.0.3
XML parsing utils
XML parsing and serialization utils based on saxes.
Limitations
XML has tons of different use cases. The XmlElement class focuses on handling XML as simple data
containers in a safe and fast way with more or less the same expressivity as JSON + attributes. It's
not intended for HTML, annotated documents or other use cases, but works well with ONIX, MARC and
similar data containers.
Some limitations:
- An element node can only contain one of the following: text, CDATA, child elements.
- Whitespace outside of text and CDATA nodes is not preserved. This also means that if you parse a document and immediately stringify it, you do not necessarily get exactly the same document back. The serialization is stable, however, so the same input always produces the same output.
- Namespaces are stripped.
- Comments are stripped.
- No support for processing instructions or entity declarations.
- No concept of a document node, only element nodes.
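To make the "JSON + attributes" data model more concrete, here is a minimal sketch of an element type that obeys the first limitation (exactly one kind of content per element). The names `SketchElement`, `SketchContent` and `collectText` are hypothetical and exist only for this illustration; they are not the library's `XmlElement` API.

```typescript
// Hypothetical model for illustration only; NOT the library's XmlElement class.
// An element holds exactly one kind of content: text, CDATA or child elements.
type SketchContent =
  | { kind: "text"; value: string }
  | { kind: "cdata"; value: string }
  | { kind: "children"; value: SketchElement[] };

interface SketchElement {
  name: string;
  attrs: Record<string, string>;
  content: SketchContent;
}

// Collect all text and CDATA values in document order.
function collectText(el: SketchElement): string[] {
  if (el.content.kind !== "children") {
    return [el.content.value];
  }
  const result: string[] = [];
  for (const child of el.content.value) {
    result.push(...collectText(child));
  }
  return result;
}

const products: SketchElement = {
  name: "products",
  attrs: {},
  content: {
    kind: "children",
    value: [
      {
        name: "product",
        attrs: { id: "1" },
        content: { kind: "text", value: "Hello world" },
      },
    ],
  },
};

console.log(collectText(products));
```

Because mixed content (text interleaved with child elements) cannot be represented in such a model, it is a poor fit for HTML or annotated documents but maps naturally onto record-style formats.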
Parsing XML
The parseXml function parses an XML element (provided as a string or buffer) and returns an
XmlElement instance, which includes methods for extracting data from the XML. Example:
import { parseXml } from "@biblioteksentralen/xml-utils";

const source = Buffer.from(
  "<products><product>Hello world</product></products>",
  "utf-8",
);
const rootNode = await parseXml(source);
console.log(rootNode.first("product")?.text());

Iterate over top level nodes
For large XML files, it is more efficient to parse the file iteratively and discard data after it has been processed, rather than reading the entire file into memory. This approach allows files of any size to be parsed with nearly constant memory usage.
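The general principle can be sketched independently of this library. The hypothetical `splitRecords` generator below consumes text chunks and yields newline-delimited records as soon as they are complete, keeping only the unfinished tail in memory. It is a simplified analogy with an assumed delimiter, not how the library splits XML.

```typescript
// Hypothetical helper for illustration; not part of @biblioteksentralen/xml-utils.
// Yields each complete record as soon as it is available, so memory usage
// depends on the largest record rather than on the size of the whole input.
async function* splitRecords(
  chunks: AsyncIterable<string>,
  delimiter = "\n",
): AsyncGenerator<string> {
  let buffer = "";
  for await (const chunk of chunks) {
    buffer += chunk;
    let end = buffer.indexOf(delimiter);
    while (end !== -1) {
      yield buffer.slice(0, end);
      // Drop the record that was just yielded, keeping only the remainder.
      buffer = buffer.slice(end + delimiter.length);
      end = buffer.indexOf(delimiter);
    }
  }
  // Emit whatever is left once the input ends.
  if (buffer.length > 0) {
    yield buffer;
  }
}

// Small stand-in for a readable stream that delivers data in arbitrary chunks.
async function* fromArray(items: string[]): AsyncGenerator<string> {
  for (const item of items) {
    yield item;
  }
}

async function collectRecords(items: string[]): Promise<string[]> {
  const records: string[] = [];
  for await (const record of splitRecords(fromArray(items))) {
    records.push(record);
  }
  return records;
}
```

Note that record boundaries do not need to line up with chunk boundaries: a record split across two chunks is reassembled before it is yielded.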
The library provides a generator method that yields top level nodes from an XML file so these can be iterated over without reading the whole file into memory. Example:
import fs from "node:fs";
import {
  streamTopLevelXmlNodes,
  parseXml,
} from "@biblioteksentralen/xml-utils";

const readable = fs.createReadStream("onix.xml", { encoding: "utf-8" });
for await (const node of streamTopLevelXmlNodes(readable)) {
  console.log(node.nodeName, node.xmlText);
  if (node.nodeName === "header") {
    const headerNode = await parseXml(node.xmlText);
    // ...
  }
}

Note: If the stream contains characters that the XML 1.0 specification considers invalid, such as
control characters, the parser can be told to ignore them by passing the ignoreInvalidCharacters: true
option to streamTopLevelXmlNodes. The invalid characters are then still returned in the output, but
double-encoded so that they do no harm. An error is also logged.
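To check a string for such characters yourself, the XML 1.0 `Char` production can be tested directly. The sketch below uses the hypothetical helpers `isValidXml10CodePoint` and `findInvalidXmlChars`; it only illustrates which characters the specification excludes and makes no claim about how the library detects or encodes them.

```typescript
// Hypothetical helpers for illustration; not part of @biblioteksentralen/xml-utils.
// XML 1.0 Char production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isValidXml10CodePoint(cp: number): boolean {
  return (
    cp === 0x9 ||
    cp === 0xa ||
    cp === 0xd ||
    (cp >= 0x20 && cp <= 0xd7ff) ||
    (cp >= 0xe000 && cp <= 0xfffd) ||
    (cp >= 0x10000 && cp <= 0x10ffff)
  );
}

// Return the UTF-16 index and code point of every invalid character.
function findInvalidXmlChars(
  text: string,
): { index: number; codePoint: number }[] {
  const found: { index: number; codePoint: number }[] = [];
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i) as number;
    if (!isValidXml10CodePoint(cp)) {
      found.push({ index: i, codePoint: cp });
    }
    // Code points above U+FFFF occupy two UTF-16 code units.
    i += cp > 0xffff ? 2 : 1;
  }
  return found;
}

console.log(findInvalidXmlChars("a\u0001b"));
```

Tab, line feed and carriage return are the only control characters below U+0020 that pass this check.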
Serializing XML
The XmlElement can be serialized back to XML:
import { serializeXml } from "@biblioteksentralen/xml-utils";

const serialized = serializeXml(rootNode);
console.log(serialized);

To create pretty-printed (indented) XML:

const serialized = serializeXml(rootNode, true);
console.log(serialized);

Building XML
Example:
import {
  createXmlElement,
  serializeXml,
  type XmlElement,
} from "@biblioteksentralen/xml-utils";

// `input` is assumed to be defined elsewhere, with a string `leader` property.
const fields: XmlElement[] = [
  createXmlElement("leader", { text: input.leader }),
];
const recordNode = createXmlElement("record", {
  attrs: { xmlns: "http://www.loc.gov/MARC21/slim" },
  children: fields,
});
const result = serializeXml(recordNode);