@biblioteksentralen/xml-utils

v0.0.3

Published

a month ago

XML parsing utils

jfulse

danmichaelo-bs

`@biblioteksentralen/xml-utils`

XML parsing and serialization utils based on saxes.

Limitations

XML has tons of different use cases. The XmlElement class focuses on handling XML as simple data containers in a safe and fast way with more or less the same expressivity as JSON + attributes. It's not intended for HTML, annotated documents or other use cases, but works well with ONIX, MARC and similar data containers.

Some limitations:

An element node can only contain one of the following: text, CDATA, child elements.
Whitespace outside of text and CDATA nodes is not preserved. This means you don't necessarily get exactly the same document back if you stringify a document that has just been parsed. The serialization is stable, however, so the same input yields the same serialization each time.
Namespaces are stripped.
Comments are stripped.
No support for processing instructions or entity declarations.
No concept of a document node, only element nodes.
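
To make the first limitation concrete, the content model can be pictured as a tagged union in which each element holds exactly one kind of content. This is only an illustrative sketch: `SketchElement`, `SketchContent` and `describeContent` are hypothetical names introduced here, not the library's actual types or API.

```typescript
// Hypothetical model for illustration only; not the real XmlElement type.
type SketchContent =
  | { kind: "text"; value: string }
  | { kind: "cdata"; value: string }
  | { kind: "children"; elements: SketchElement[] };

interface SketchElement {
  name: string;
  attrs: Record<string, string>;
  content: SketchContent;
}

// Returns a short description of the single kind of content an element holds.
function describeContent(element: SketchElement): string {
  const content = element.content;
  switch (content.kind) {
    case "text":
      return `text (${content.value.length} chars)`;
    case "cdata":
      return `CDATA (${content.value.length} chars)`;
    case "children":
      return `${content.elements.length} child element(s)`;
  }
}

// <products><product>Hello world</product></products>
const product: SketchElement = {
  name: "product",
  attrs: {},
  content: { kind: "text", value: "Hello world" },
};
const products: SketchElement = {
  name: "products",
  attrs: {},
  content: { kind: "children", elements: [product] },
};
```

Under a model like this, mixed content such as text interleaved with inline child elements has no representation, which is why the library is not a good fit for HTML or annotated documents.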

Parsing XML

The parseXml function parses an XML element (provided as a string or buffer) and returns an XmlElement instance, which includes methods for extracting data from the XML. Example:

import { parseXml } from "@biblioteksentralen/xml-utils";

const source = Buffer.from(
  "<products><product>Hello world</product></products>",
  "utf-8",
);
const rootNode = await parseXml(source);
console.log(rootNode.first("product")?.text());

Iterate over top level nodes

For large XML files, it is more efficient to parse the file iteratively and discard data after it has been processed, rather than reading the entire file into memory. This approach allows files of any size to be parsed with nearly constant memory usage.

The library provides a generator function, streamTopLevelXmlNodes, that yields top-level nodes from an XML stream so that they can be processed one at a time without reading the whole file into memory. Example:

import fs from "node:fs";
import {
  streamTopLevelXmlNodes,
  parseXml,
} from "@biblioteksentralen/xml-utils";

const readable = fs.createReadStream("onix.xml", { encoding: "utf-8" });
for await (const node of streamTopLevelXmlNodes(readable)) {
  console.log(node.nodeName, node.xmlText);
  if (node.nodeName === "header") {
    const headerNode = await parseXml(node.xmlText);
    // ...
  }
}

Note: If the stream contains characters that are invalid under the XML 1.0 specification, such as control characters, these can be tolerated by adding the option ignoreInvalidCharacters: true to streamTopLevelXmlNodes. The invalid characters will then be returned in the response, but double encoded so that they cause no harm. An error will also be logged.
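
Because each yielded node exposes nodeName and xmlText (as in the loop above), per-node processing can be written against a plain async iterable and exercised without a file on disk. The following is a minimal sketch that assumes only those two properties; `TopLevelNodeLike`, `countNodesByName` and `fakeNodes` are hypothetical names introduced here and are not part of the library.

```typescript
// Assumed shape: only the two properties used in the example above.
interface TopLevelNodeLike {
  nodeName: string;
  xmlText: string;
}

// Counts nodes per name. Only the running tally is kept in memory;
// each node can be garbage collected after its loop iteration.
async function countNodesByName(
  nodes: AsyncIterable<TopLevelNodeLike>,
): Promise<Map<string, number>> {
  const counts = new Map<string, number>();
  for await (const node of nodes) {
    counts.set(node.nodeName, (counts.get(node.nodeName) ?? 0) + 1);
  }
  return counts;
}

// Stand-in for streamTopLevelXmlNodes(readable), for demonstration only.
async function* fakeNodes(): AsyncGenerator<TopLevelNodeLike> {
  yield { nodeName: "header", xmlText: "<header/>" };
  yield { nodeName: "product", xmlText: "<product>A</product>" };
  yield { nodeName: "product", xmlText: "<product>B</product>" };
}
```

In real use, the result of streamTopLevelXmlNodes(readable) would be passed in place of fakeNodes(), provided the yielded nodes have the shape assumed above.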

Serializing XML

The XmlElement can be serialized back to XML:

import { serializeXml } from "@biblioteksentralen/xml-utils";

const serialized = serializeXml(rootNode);
console.log(serialized);

To create pretty-printed (indented) XML:

const serialized = serializeXml(rootNode, true);
console.log(serialized);

Building XML

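The createXmlElement function builds XmlElement instances, which can then be serialized with serializeXml. Example: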

import {
  createXmlElement,
  serializeXml,
  type XmlElement,
} from "@biblioteksentralen/xml-utils";

// `input` is assumed to be an object defined elsewhere that holds the record data.
const fields: XmlElement[] = [
  createXmlElement("leader", { text: input.leader }),
];

const recordNode = createXmlElement("record", {
  attrs: { xmlns: "http://www.loc.gov/MARC21/slim" },
  children: fields,
});

const result = serializeXml(recordNode);
