@pictalk-speech-made-easy/conllu-parser
v1.1.0
conllu-parser
A TypeScript parser for CoNLL-U and CoNLL-UL (Universal Dependencies / Universal Lattices) formats with full type definitions.
Features
- 🔷 Full TypeScript support with comprehensive type definitions
- 📖 Parse CoNLL-U and CoNLL-UL files into structured objects
- ✍️ Serialize back to either format
- 🔀 Auto-detect format from input (10 columns → CoNLL-U, 8-9 → CoNLL-UL)
- 🔍 Query utilities for finding tokens by POS, dependency relation, etc.
- 🌳 Tree navigation helpers for dependency trees
- 🕸️ Lattice utilities for navigating morphological ambiguity graphs
- 🔄 Bidirectional conversion between CoNLL-U sentences and CoNLL-UL lattices
- 🌍 Universal Dependencies compliant — supports all UD features
Installation
npm install conllu-parser
Quick Start: CoNLL-U
import { parseConllu, reconstructText, findByUpos, getRoot } from 'conllu-parser';
const conllu = `# sent_id = example
# text = The cat sat.
1 The the DET _ Definite=Def|PronType=Art 2 det _ _
2 cat cat NOUN _ Number=Sing 3 nsubj _ _
3 sat sit VERB _ Tense=Past 0 root _ SpaceAfter=No
4 . . PUNCT _ _ 3 punct _ _
`;
const doc = parseConllu(conllu);
console.log(doc.sentences[0].metadata.text); // "The cat sat."
const nouns = findByUpos(doc.sentences[0], 'NOUN');
console.log(nouns.map(t => t.form)); // ["cat"]
const root = getRoot(doc.sentences[0]);
console.log(root?.form); // "sat"
console.log(reconstructText(doc.sentences[0])); // "The cat sat."
Quick Start: CoNLL-UL
import {
  parseConllul,
  groupArcsByForm,
  getArcsFrom,
  findPaths,
  isLinearLattice,
  latticeToSentence,
} from 'conllu-parser';
// Morphological lexicon with ambiguity
const conllul = `# sent_id = en-book
0 1 book book NOUN N#s Number=Sing _
0 1 book book AUX V#inf VerbForm=Inf _
0 1 book book VERB V#inf VerbForm=Inf _
`;
const doc = parseConllul(conllul);
const lattice = doc.lattices[0];
// See all competing analyses for a surface form
const grouped = groupArcsByForm(lattice);
console.log(grouped.get('book')?.length); // 3 (NOUN, AUX, VERB)
// Navigate the lattice graph
const arcs = getArcsFrom(lattice, 0);
console.log(arcs.map(a => a.upos)); // ["NOUN", "AUX", "VERB"]
// Check ambiguity and convert if linear
if (isLinearLattice(lattice)) {
  const sentence = latticeToSentence(lattice);
}
Auto-detection
import { parse, serialize } from 'conllu-parser';
// Automatically detects CoNLL-U (10 cols) vs CoNLL-UL (8-9 cols)
const doc = parse(input);
if (doc.format === 'conllu') {
  console.log(doc.sentences.length);
} else {
  console.log(doc.lattices.length);
}
// Serialize back to the original format
const output = serialize(doc);
API Reference
Auto-detecting Functions
parse(input: string): ParsedDocument
Parse either format, auto-detecting from column count. Returns a ConlluDocument or ConllulDocument discriminated by the format field.
serialize(doc: ParsedDocument): string
Serialize either document type back to its format.
detectFormat(input: string): 'conllu' | 'conllul'
Detect the format without parsing.
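The column-count rule behind auto-detection can be sketched in a few lines. This is only an illustration, not the library's detectFormat: the name sketchDetectFormat is hypothetical, and the sketch assumes tab-separated columns and returns undefined when no line is decisive.

```typescript
// Hypothetical helper (not part of conllu-parser): illustrates the
// "10 columns → CoNLL-U, 8-9 columns → CoNLL-UL" rule described above.
type SketchFormat = 'conllu' | 'conllul';

function sketchDetectFormat(input: string): SketchFormat | undefined {
  for (const line of input.split(/\r?\n/)) {
    // Skip blank lines and "# key = value" comment lines.
    if (line.trim() === '' || line.charAt(0) === '#') continue;
    const columns = line.split('\t').length;
    if (columns === 10) return 'conllu';
    if (columns === 8 || columns === 9) return 'conllul';
    // Any other width (for example a source span line) is not decisive.
  }
  return undefined;
}

const sampleRow = ['1', 'The', 'the', 'DET', '_', '_', '2', 'det', '_', '_'].join('\t');
console.log(sketchDetectFormat('# sent_id = example\n' + sampleRow)); // "conllu"
```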
CoNLL-U Functions
parseConllu(input: string): ConlluDocument
Parse a CoNLL-U formatted string into a structured document.
serializeConllu(doc: ConlluDocument): string
Serialize back to CoNLL-U format.
findByUpos(sentence: Sentence, upos: UPOS): Token[]
Find all tokens with a specific Universal POS tag.
findByDeprel(sentence: Sentence, deprel: DepRel): Token[]
Find all tokens with a specific dependency relation.
getRoot(sentence: Sentence): Token | undefined
Get the root token of a sentence's dependency tree.
getChildren(sentence: Sentence, tokenId: string | number): Token[]
Get all tokens that depend on a given token.
getHead(sentence: Sentence, token: Token): Token | undefined
Get the head (parent) of a token in the dependency tree.
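To make the tree helpers above concrete, here is a minimal sketch of how root, children, and head lookups could work over the head column. It uses a simplified, hypothetical SketchToken type and sketch* function names rather than the library's own Sentence and Token types, so treat it as an illustration of the idea only.

```typescript
// Hypothetical, simplified token: only the fields needed for tree navigation.
interface SketchToken {
  id: string;          // "1", "2", ...
  form: string;
  head: number | '_';  // 0 marks the root
}

function sketchGetRoot(tokens: SketchToken[]): SketchToken | undefined {
  return tokens.filter(t => t.head === 0)[0];
}

function sketchGetChildren(tokens: SketchToken[], tokenId: string | number): SketchToken[] {
  const target = Number(tokenId);
  return tokens.filter(t => t.head === target);
}

function sketchGetHead(tokens: SketchToken[], token: SketchToken): SketchToken | undefined {
  if (token.head === '_' || token.head === 0) return undefined; // root or unset
  const headId = String(token.head);
  return tokens.filter(t => t.id === headId)[0];
}

const theTok: SketchToken = { id: '1', form: 'The', head: 2 };
const catTok: SketchToken = { id: '2', form: 'cat', head: 3 };
const satTok: SketchToken = { id: '3', form: 'sat', head: 0 };
const dotTok: SketchToken = { id: '4', form: '.', head: 3 };
const demoTokens = [theTok, catTok, satTok, dotTok];

console.log(sketchGetRoot(demoTokens)?.form);                     // "sat"
console.log(sketchGetChildren(demoTokens, 3).map(t => t.form));   // ["cat", "."]
console.log(sketchGetHead(demoTokens, catTok)?.form);             // "sat"
```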
reconstructText(sentence: Sentence): string
Reconstruct the original text from tokens, respecting SpaceAfter annotations.
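As an illustration of what respecting SpaceAfter means, the sketch below joins word forms with single spaces and suppresses the space after any word whose MISC column carries SpaceAfter=No. The SketchWord type and the sketchReconstructText name are assumptions made for this example; it also ignores multi-word tokens and empty nodes, which a full implementation would have to handle.

```typescript
// Hypothetical, simplified word: MISC is modelled as a key/value map or '_'.
interface SketchWord {
  form: string;
  misc: { [key: string]: string } | '_';
}

function sketchReconstructText(words: SketchWord[]): string {
  let text = '';
  words.forEach((word, index) => {
    text += word.form;
    const noSpace = word.misc !== '_' && word.misc['SpaceAfter'] === 'No';
    const isLast = index === words.length - 1;
    // Add a separating space unless suppressed or at the end of the sentence.
    if (!noSpace && !isLast) text += ' ';
  });
  return text;
}

const demoWords: SketchWord[] = [
  { form: 'The', misc: '_' },
  { form: 'cat', misc: '_' },
  { form: 'sat', misc: { SpaceAfter: 'No' } },
  { form: '.', misc: '_' },
];
console.log(sketchReconstructText(demoWords)); // "The cat sat."
```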
CoNLL-UL Functions
parseConllul(input: string): ConllulDocument
Parse a CoNLL-UL formatted string into lattice structures.
serializeConllul(doc: ConllulDocument, options?): string
Serialize back to CoNLL-UL format. Pass { includeAnchors: false } to omit the 9th column.
getVertices(lattice: Lattice): number[]
Get all unique vertex indices in a lattice, sorted ascending.
getArcsFrom(lattice: Lattice, vertex: number): LatticeArc[]
Get all arcs leaving a vertex — these are competing morphological analyses.
getArcsTo(lattice: Lattice, vertex: number): LatticeArc[]
Get all arcs arriving at a vertex.
getFormsAtVertex(lattice: Lattice, vertex: number): string[]
Get distinct surface forms at a given vertex.
groupArcsByForm(lattice: Lattice): Map<string, LatticeArc[]>
Group all arcs by surface form. Useful for displaying ambiguity (e.g., "tapping" → VERB/Ger, VERB/Part, NOUN).
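Grouping by surface form amounts to bucketing arcs in a map keyed on the form column. The sketch below shows the idea with a hypothetical SketchFormArc type and sketchGroupByForm function; it is not the library's implementation.

```typescript
// Hypothetical, simplified arc: only the fields needed for grouping.
interface SketchFormArc {
  from: number;
  to: number;
  form: string;
  upos: string;
}

function sketchGroupByForm(arcs: SketchFormArc[]): Map<string, SketchFormArc[]> {
  const groups = new Map<string, SketchFormArc[]>();
  for (const arc of arcs) {
    const bucket = groups.get(arc.form);
    if (bucket) {
      bucket.push(arc);              // another analysis of a form already seen
    } else {
      groups.set(arc.form, [arc]);   // first analysis of this form
    }
  }
  return groups;
}

const demoBookArcs: SketchFormArc[] = [
  { from: 0, to: 1, form: 'book', upos: 'NOUN' },
  { from: 0, to: 1, form: 'book', upos: 'AUX' },
  { from: 0, to: 1, form: 'book', upos: 'VERB' },
];
console.log(sketchGroupByForm(demoBookArcs).get('book')?.length); // 3
```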
findPaths(lattice, start?, end?, maxPaths?): LatticeArc[][]
Enumerate all paths through the lattice. Each path is a possible morphological analysis. Use maxPaths (default 1000) to limit computation on highly ambiguous lattices.
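Path enumeration with a cap can be pictured as a depth-first walk that stops once maxPaths results have been collected. The following is a sketch under stated assumptions: SketchArc and sketchFindPaths are hypothetical names, start and end are passed explicitly rather than defaulted, and arcs are assumed to point forward (to greater than from).

```typescript
// Hypothetical, simplified arc type for the example.
interface SketchArc {
  from: number;
  to: number;
  form: string;
  upos: string;
}

function sketchFindPaths(
  arcs: SketchArc[],
  start: number,
  end: number,
  maxPaths: number = 1000,
): SketchArc[][] {
  const paths: SketchArc[][] = [];
  const walk = (vertex: number, prefix: SketchArc[]): void => {
    if (paths.length >= maxPaths) return;   // cap reached: stop exploring
    if (vertex === end) {
      paths.push(prefix);                   // one complete analysis
      return;
    }
    for (const arc of arcs) {
      // Only follow forward arcs, which guarantees termination.
      if (arc.from === vertex && arc.to > vertex) {
        walk(arc.to, prefix.concat([arc]));
      }
    }
  };
  walk(start, []);
  return paths;
}

const demoPathArcs: SketchArc[] = [
  { from: 0, to: 1, form: 'book', upos: 'NOUN' },
  { from: 0, to: 1, form: 'book', upos: 'VERB' },
  { from: 1, to: 2, form: 'flights', upos: 'NOUN' },
];
console.log(sketchFindPaths(demoPathArcs, 0, 2).length);    // 2
console.log(sketchFindPaths(demoPathArcs, 0, 2, 1).length); // 1
```

Each returned path lists its arcs in order, so the two analyses here differ only in the tag of the first arc.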
isLinearLattice(lattice: Lattice): boolean
Check if a lattice has no ambiguity (each vertex has at most one outgoing arc). Linear lattices map directly to CoNLL-U.
getAmbiguityCount(lattice: Lattice, maxPaths?): number
Count distinct paths through the lattice. Returns -1 if maxPaths is exceeded.
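The linearity check and the capped path count described above can be sketched as follows. The names SketchEdge, sketchIsLinear, and sketchCountPaths are hypothetical, and the counting strategy (memoized recursion over forward arcs, with start and end passed explicitly) is an assumption chosen for brevity, not necessarily how the library computes it.

```typescript
// Hypothetical, simplified edge: only the endpoints matter here.
interface SketchEdge {
  from: number;
  to: number;
}

// Linear means no vertex has more than one outgoing arc.
function sketchIsLinear(edges: SketchEdge[]): boolean {
  const outgoing: { [vertex: number]: number } = {};
  for (const edge of edges) {
    const next = (outgoing[edge.from] || 0) + 1;
    outgoing[edge.from] = next;
    if (next > 1) return false;
  }
  return true;
}

// Counts start-to-end paths; returns -1 when the count exceeds maxPaths.
function sketchCountPaths(
  edges: SketchEdge[],
  start: number,
  end: number,
  maxPaths: number = 1000,
): number {
  const memo: { [vertex: number]: number } = {};
  const count = (vertex: number): number => {
    if (vertex === end) return 1;
    const cached = memo[vertex];
    if (cached !== undefined) return cached;
    let total = 0;
    for (const edge of edges) {
      if (edge.from === vertex && edge.to > vertex) total += count(edge.to);
    }
    memo[vertex] = total;
    return total;
  };
  const total = count(start);
  return total > maxPaths ? -1 : total;
}

const demoAmbiguous: SketchEdge[] = [{ from: 0, to: 1 }, { from: 0, to: 1 }, { from: 1, to: 2 }];
console.log(sketchIsLinear(demoAmbiguous));            // false
console.log(sketchCountPaths(demoAmbiguous, 0, 2));    // 2
console.log(sketchCountPaths(demoAmbiguous, 0, 2, 1)); // -1
```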
findArcsByUpos(lattice: Lattice, upos: UPOS): LatticeArc[]
Find all arcs with a specific POS tag.
findArcsByLemma(lattice: Lattice, lemma: string): LatticeArc[]
Find all arcs with a specific lemma.
Conversion Functions
latticeToSentence(lattice: Lattice): Sentence
Convert a linear (unambiguous) CoNLL-UL lattice to a CoNLL-U sentence. Vertex IDs are converted from 0-based to 1-based. Throws if the lattice has ambiguity.
sentenceToLattice(sentence: Sentence): Lattice
Convert a CoNLL-U sentence to a linear CoNLL-UL lattice. Multi-word tokens become source spans.
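The lattice-to-sentence direction can be illustrated with a small sketch: reject ambiguous input, order arcs by start vertex, and shift to 1-based token IDs. The types and the sketchLatticeToTokens name are hypothetical, and deriving the ID as from + 1 is an assumption about how the 0-based to 1-based conversion might be done. HEAD and DEPREL are left as '_' because CoNLL-UL carries no dependency columns.

```typescript
// Hypothetical, simplified input and output types for the example.
interface SketchLatticeArc {
  from: number;
  to: number;
  form: string;
  lemma: string;
  upos: string;
}

interface SketchSentenceToken {
  id: string;
  form: string;
  lemma: string;
  upos: string;
  head: '_';
  deprel: '_';
}

function sketchLatticeToTokens(arcs: SketchLatticeArc[]): SketchSentenceToken[] {
  const seen: { [vertex: number]: boolean } = {};
  for (const arc of arcs) {
    if (seen[arc.from]) {
      throw new Error('Ambiguous lattice: vertex ' + arc.from + ' has several outgoing arcs');
    }
    seen[arc.from] = true;
  }
  return arcs
    .slice()                           // do not mutate the caller's array
    .sort((a, b) => a.from - b.from)
    .map((arc): SketchSentenceToken => ({
      id: String(arc.from + 1),        // 0-based vertex → 1-based token ID
      form: arc.form,
      lemma: arc.lemma,
      upos: arc.upos,
      head: '_',
      deprel: '_',
    }));
}

const demoLinearArcs: SketchLatticeArc[] = [
  { from: 1, to: 2, form: 'cat', lemma: 'cat', upos: 'NOUN' },
  { from: 0, to: 1, form: 'The', lemma: 'the', upos: 'DET' },
];
console.log(sketchLatticeToTokens(demoLinearArcs).map(t => t.id + ':' + t.form)); // ["1:The", "2:cat"]
```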
Type Definitions
CoNLL-U Types
interface ConlluDocument {
  format: 'conllu';
  metadata: DocumentMetadata;
  sentences: Sentence[];
}
interface Sentence {
  metadata: SentenceMetadata;
  tokens: Token[];
}
interface Token {
  id: string;                  // "1", "1-2", or "1.1"
  form: string;                // Word form
  lemma: string;               // Lemma
  upos: UPOS | '_';            // Universal POS tag
  xpos: string;                // Language-specific POS tag
  feats: MorphFeatures | '_';
  head: number | '_';          // Head token ID (0 for root)
  deprel: DepRel | '_';
  deps: EnhancedDep[] | '_';
  misc: MiscFeatures | '_';
}
CoNLL-UL Types
interface ConllulDocument {
  format: 'conllul';
  metadata: DocumentMetadata;
  lattices: Lattice[];
}
interface Lattice {
  metadata: SentenceMetadata;
  sourceSpans: SourceTokenSpan[]; // Multi-segment surface forms
  arcs: LatticeArc[];             // All edges in the lattice
}
interface LatticeArc {
  from: number;                // Start vertex (0-based)
  to: number;                  // End vertex (0-based)
  form: string;                // Word form
  lemma: string;               // Lemma
  upos: UPOS | '_';            // Universal POS tag
  xpos: string;                // Language-specific POS tag
  feats: MorphFeatures | '_';
  misc: MiscFeatures | '_';
  anchors: Anchor | '_';       // Link to gold disambiguation (e.g. goldid=3)
}
interface SourceTokenSpan {
  fromVertex: number;          // Start vertex of the span
  toVertex: number;            // End vertex of the span
  sourceForm: string;          // Surface form of the source token
  misc: string;
}
POS Tags (UPOS)
All 17 Universal POS tags are supported:
type UPOS =
  | 'ADJ' | 'ADP' | 'ADV' | 'AUX' | 'CCONJ' | 'DET'
  | 'INTJ' | 'NOUN' | 'NUM' | 'PART' | 'PRON' | 'PROPN'
  | 'PUNCT' | 'SCONJ' | 'SYM' | 'VERB' | 'X';
Morphological Features
Full support for Universal Dependencies morphological features:
interface MorphFeatures {
  Gender?: string;    // Masc, Fem, Neut, Com
  Number?: string;    // Sing, Plur, Dual
  Case?: string;      // Nom, Acc, Dat, Gen, Voc, Loc, Ins
  Definite?: string;  // Def, Ind, Spec, Cons
  VerbForm?: string;  // Fin, Inf, Part, Conv, Ger, Sup
  Mood?: string;      // Ind, Imp, Cnd, Sub, Opt
  Tense?: string;     // Past, Pres, Fut, Imp, Pqp
  Person?: string;    // 1, 2, 3
  Voice?: string;     // Act, Pass, Mid
  Degree?: string;    // Pos, Cmp, Sup
  PronType?: string;  // Prs, Art, Int, Rel, Dem, Ind
  // Layered features
  'Gender[lex]'?: string;
  'Number[ctxt]'?: string;
  // And more...
  [key: string]: string | undefined;
}
Format Reference
CoNLL-U (10 columns)
ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
Supported features: document-level comments, sentence metadata, all 10 token columns, multi-word tokens (1-2), empty nodes (2.1), enhanced dependencies, layered morphological features.
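To show how the ten columns map onto a token, here is a minimal sketch that splits one tab-separated line and unpacks the FEATS column into key/value pairs. The SketchConlluRow type and both sketch* functions are hypothetical and much simpler than the library's types: DEPS and MISC are kept as raw strings, and no validation beyond the column count is attempted.

```typescript
// Hypothetical, simplified row type: DEPS and MISC stay as raw strings.
interface SketchConlluRow {
  id: string;
  form: string;
  lemma: string;
  upos: string;
  xpos: string;
  feats: { [key: string]: string } | '_';
  head: number | '_';
  deprel: string;
  deps: string;
  misc: string;
}

// "Definite=Def|PronType=Art" → { Definite: 'Def', PronType: 'Art' }
function sketchParseFeats(raw: string): { [key: string]: string } | '_' {
  if (raw === '_') return '_';
  const feats: { [key: string]: string } = {};
  for (const pair of raw.split('|')) {
    const eq = pair.indexOf('=');
    // Layered names such as "Number[ctxt]" are kept verbatim as keys.
    if (eq > 0) feats[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return feats;
}

function sketchParseConlluLine(line: string): SketchConlluRow {
  const cols = line.split('\t');
  if (cols.length !== 10) {
    throw new Error('Expected 10 tab-separated columns, got ' + cols.length);
  }
  const col = (index: number): string => cols[index] ?? '_';
  const headText = col(6);
  return {
    id: col(0),
    form: col(1),
    lemma: col(2),
    upos: col(3),
    xpos: col(4),
    feats: sketchParseFeats(col(5)),
    head: headText === '_' ? '_' : Number(headText),
    deprel: col(7),
    deps: col(8),
    misc: col(9),
  };
}

const demoRow = sketchParseConlluLine(
  ['1', 'The', 'the', 'DET', '_', 'Definite=Def|PronType=Art', '2', 'det', '_', '_'].join('\t'),
);
console.log(demoRow.form, demoRow.head, demoRow.feats);
```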
CoNLL-UL (8-9 columns)
FROM TO FORM LEMMA UPOS XPOS FEATS MISC [ANCHORS]
CoNLL-UL extends CoNLL-U to represent morphological ambiguity via lattice structures. Key differences from CoNLL-U:
- Vertex-based indexing (0-based FROM/TO) instead of linear token IDs
- Multiple arcs from the same vertex represent competing analyses
- Source token spans (e.g., 0-3 BCLM) declare surface forms spanning multiple vertices
- ANCHORS column (optional 9th column) links arcs to gold disambiguation via goldid=N
- No dependency columns (HEAD, DEPREL, DEPS), because CoNLL-UL is pre-syntactic
A linear (unambiguous) CoNLL-UL lattice maps directly to CoNLL-U.
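For comparison with the CoNLL-U layout, the sketch below reads one CoNLL-UL arc line with eight or nine tab-separated columns. SketchUlArc and sketchParseConllulLine are hypothetical names; FEATS, MISC, and ANCHORS are kept as raw strings, source span lines are not handled, and a missing ninth column is represented as '_' by assumption.

```typescript
// Hypothetical, simplified arc type: feature columns stay as raw strings.
interface SketchUlArc {
  from: number;
  to: number;
  form: string;
  lemma: string;
  upos: string;
  xpos: string;
  feats: string;
  misc: string;
  anchors: string; // '_' when the optional 9th column is absent
}

function sketchParseConllulLine(line: string): SketchUlArc {
  const cols = line.split('\t');
  if (cols.length !== 8 && cols.length !== 9) {
    throw new Error('Expected 8 or 9 tab-separated columns, got ' + cols.length);
  }
  const col = (index: number): string => cols[index] ?? '_';
  return {
    from: Number(col(0)), // 0-based start vertex
    to: Number(col(1)),   // 0-based end vertex
    form: col(2),
    lemma: col(3),
    upos: col(4),
    xpos: col(5),
    feats: col(6),
    misc: col(7),
    anchors: col(8),
  };
}

const demoUlArc = sketchParseConllulLine(
  ['0', '1', 'book', 'book', 'NOUN', 'N#s', 'Number=Sing', '_'].join('\t'),
);
console.log(demoUlArc.from, demoUlArc.to, demoUlArc.upos, demoUlArc.anchors);
```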
License
MIT
