gedcom555-token
v555.1.1
Tokenizer for Gedcom 5.5.5
Gedcom 5.5.5 Token
A tokenizer for GEDCOM 5.5.5 files and individual GEDCOM lines.
Install
npm i gedcom555-token
Exports
The package exports:
- tokenizeFromString
- tokenize
- tags
- InvalidGedcomLineError
- LevelNumberError
- LineTerminatorInconsistentError
- LineTerminatorMissingError
- LineTextEmptyError
- LineTextSingleAtEndError
- LineTextSingleAtStartError
- LineTextSingleError
- TagError
- TokenizeLineError
Usage
Tokenize a full GEDCOM string:
import { tokenizeFromString } from "gedcom555-token";
const tokenized = tokenizeFromString(`0 HEAD
1 GEDC
2 VERS 5.5.5
2 FORM LINEAGE-LINKED
3 VERS 5.5.5
1 CHAR UTF-8
1 SOUR gedcom.org
0 @U1@ SUBM
1 NAME gedcom.org
0 TRLR`);
Result:
[
  { level: 0, tag: "HEAD" },
  { level: 1, tag: "GEDC" },
  { level: 2, tag: "VERS", lineItem: "5.5.5" },
  { level: 2, tag: "FORM", lineItem: "LINEAGE-LINKED" },
  { level: 3, tag: "VERS", lineItem: "5.5.5" },
  { level: 1, tag: "CHAR", lineItem: "UTF-8" },
  { level: 1, tag: "SOUR", lineItem: "gedcom.org" },
  { level: 0, tag: "SUBM", xrefId: "@U1@" },
  { level: 1, tag: "NAME", lineItem: "gedcom.org" },
  { level: 0, tag: "TRLR" },
];
Tokenize a single line:
import { tokenize } from "gedcom555-token";
const tokenized = tokenize("0 head");
// { level: 0, tag: "HEAD" }
Options
tokenizeFromString(str, options) supports:
- unknownTagBehavior: "error" | "keep" | "remove"
- ignoreInconsistentLineTerminators: boolean
Default behavior is strict:
- unknown tags throw
- inconsistent line terminators throw
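To make the line terminator rule concrete, here is a minimal sketch of how such a consistency check could work. It is not the package's implementation: `findLineTerminators` and `assertConsistentTerminators` are hypothetical names, and the plain `TypeError` stands in for whatever typed error the package actually throws.

```typescript
// Hypothetical helpers for illustration; not part of the gedcom555-token API.

// Collect the distinct line terminators used in a string.
function findLineTerminators(text: string): string[] {
  // CRLF is listed first so "\r\n" is not counted as a lone CR plus a lone LF.
  const matches: string[] = text.match(/\r\n|\r|\n/g) || [];
  const seen: string[] = [];
  for (const terminator of matches) {
    if (seen.indexOf(terminator) === -1) {
      seen.push(terminator);
    }
  }
  return seen;
}

// Throw when more than one kind of terminator is present,
// unless the caller opted out of the check.
function assertConsistentTerminators(
  text: string,
  ignoreInconsistent: boolean = false
): void {
  const kinds = findLineTerminators(text);
  if (kinds.length > 1 && !ignoreInconsistent) {
    const shown = kinds.map((k) => JSON.stringify(k)).join(", ");
    throw new TypeError("Inconsistent line terminators: " + shown);
  }
}
```

In this sketch, a string that mixes "\n", "\r\n" and "\r" is rejected by default and accepted only when the second argument is true, which mirrors the strict default described above.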
Keep unknown tags:
import { tokenizeFromString } from "gedcom555-token";
const lines = tokenizeFromString(`0 HEAD
1 _CUSTOM something
0 TRLR`, {
  unknownTagBehavior: "keep",
});
Remove unknown tags and their children:
import { tokenizeFromString } from "gedcom555-token";
const lines = tokenizeFromString(`0 HEAD
1 _CUSTOM
2 NAME hidden
0 TRLR`, {
  unknownTagBehavior: "remove",
});
Ignore mixed line terminators:
import { tokenizeFromString } from "gedcom555-token";
const lines = tokenizeFromString("0 HEAD\n1 GEDC\r\n0 TRLR\r", {
  ignoreInconsistentLineTerminators: true,
});
tokenize(line, options) supports:
unknownTagBehavior: "error" | "keep" | "remove"
For single-line tokenization, "keep" and "remove" both allow unknown tags through as uppercase strings. The subtree removal behavior only matters in tokenizeFromString.
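Subtree removal needs the surrounding lines, which is why it only applies to a full string. The sketch below shows one way the idea could be expressed over an array of tokens shaped like the result shown under Usage. `GedcomToken` and `removeUnknownSubtrees` are hypothetical names, and the list of known tags is passed in explicitly as an assumption; the package's real logic may differ.

```typescript
// Token shape assumed from the example result in this README.
interface GedcomToken {
  level: number;
  tag: string;
  lineItem?: string;
  xrefId?: string;
}

// Hypothetical helper: drop every token whose tag is not in `known`,
// together with all following tokens nested beneath it.
function removeUnknownSubtrees(
  tokens: GedcomToken[],
  known: string[]
): GedcomToken[] {
  const kept: GedcomToken[] = [];
  // Level of the unknown tag whose children are currently being skipped.
  let skipBelow: number | null = null;
  for (const token of tokens) {
    if (skipBelow !== null) {
      if (token.level > skipBelow) {
        continue; // still inside the removed subtree
      }
      skipBelow = null; // back at the same or a shallower level
    }
    if (known.indexOf(token.tag) === -1) {
      skipBelow = token.level;
      continue;
    }
    kept.push(token);
  }
  return kept;
}
```

Note that a child is dropped even when its own tag is known: in the "remove" example above, the NAME line under _CUSTOM would disappear along with its parent.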
Errors
Most validation errors are thrown as typed TypeError subclasses. When tokenizing a full string, line-level failures are wrapped in TokenizeLineError, which includes the failing line number in its message.
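The wrapping pattern can be pictured with the following sketch. The classes here are hypothetical stand-ins (`SketchTagError`, `SketchLineError`), not the package's real error classes, and the message format is an assumption; the point is only to show a line-level TypeError subclass being rethrown with a 1-based line number attached.

```typescript
// Hypothetical classes for illustration; the package's real classes may differ.
class SketchTagError extends TypeError {
  constructor(message: string) {
    super(message);
    this.name = "SketchTagError";
  }
}

// Wrapper that records which line failed and keeps the original error.
class SketchLineError extends TypeError {
  readonly lineNumber: number;
  readonly original: Error;
  constructor(lineNumber: number, original: Error) {
    super("Line " + lineNumber + ": " + original.message);
    this.name = "SketchLineError";
    this.lineNumber = lineNumber;
    this.original = original;
  }
}

// Run a per-line check and wrap any failure with its line number.
function checkLines(
  lines: string[],
  checkLine: (line: string) => void
): void {
  lines.forEach((line, index) => {
    try {
      checkLine(line);
    } catch (err) {
      if (err instanceof Error) {
        throw new SketchLineError(index + 1, err);
      }
      throw err;
    }
  });
}
```

A caller can then catch one wrapper type for whole-file work and still reach the underlying line-level error through the wrapper when more detail is needed.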
Example:
import {
  TokenizeLineError,
  tokenizeFromString,
} from "gedcom555-token";
try {
  tokenizeFromString("0 HEAD\n1 BAD@\n0 TRLR");
} catch (err) {
  if (err instanceof TokenizeLineError) {
    console.error(err.message);
  }
}
Notes
- Does not check encoding. The input string is assumed to be Unicode.
- Checks for line terminator consistency. (configurable)
- Checks tags against a list of known tags. (configurable)
- Checks line items for single "@" signs.
- Does not check other grammar rules. These are left for the parser to implement.
- GEDCOM 5.5.5 tags are case-insensitive, so tokenize converts them to upper case.
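As a rough picture of what tokenizing one line involves, the sketch below splits a line into level, optional cross-reference identifier, tag and optional line item, and upper-cases the tag. It is an illustration under simplifying assumptions, not the package's code: `sketchTokenize`, `SketchToken` and the regular expression are hypothetical, single spaces are assumed as delimiters, and the known-tag and "@" checks are left out.

```typescript
// Output shape assumed from the example result in this README.
interface SketchToken {
  level: number;
  tag: string;
  xrefId?: string;
  lineItem?: string;
}

// level, optional @xref@, tag, optional line item, single-space delimiters.
const LINE_PATTERN = /^(\d+) (?:(@[^@]+@) )?([A-Za-z0-9_]+)(?: (.*))?$/;

// Hypothetical single-line tokenizer; far less strict than the real one.
function sketchTokenize(line: string): SketchToken {
  const match = LINE_PATTERN.exec(line);
  if (match === null) {
    throw new TypeError("Not a GEDCOM line: " + JSON.stringify(line));
  }
  if (match[4] === "") {
    // A delimiter after the tag with nothing behind it, e.g. "2 CONT ".
    throw new TypeError("Delimiter without line value: " + JSON.stringify(line));
  }
  const token: SketchToken = {
    level: Number(match[1]),
    tag: match[3].toUpperCase(), // tags are case-insensitive
  };
  if (match[2] !== undefined) {
    token.xrefId = match[2];
  }
  if (match[4] !== undefined) {
    token.lineItem = match[4];
  }
  return token;
}
```

In this sketch a line such as "1 BAD@" fails to match because "@" is not a tag character, and a tag followed by a single trailing space is rejected while a tag followed by two spaces yields a line item of one space.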
Issues / FAQ
- Empty CONT. Per the GEDCOM line definition, a CONT tag can appear without a line value. If so, the line terminator MUST come directly after the tag. A trailing space (delimiter) between the tag and the terminator causes an error.
"2 CONT" : is legal : +1 CONT[terminator]
"2 CONT " : is illegal : +1 CONT[delim space][terminator]
"2 CONT  " : is legal : +1 CONT[delim space][line value space][terminator]
License
MIT
