@pictalk-speech-made-easy/conllu-parser
v1.0.0
conllu-parser
A TypeScript parser for CoNLL-U (Universal Dependencies) format with full type definitions.
Features
- 🔷 Full TypeScript support with comprehensive type definitions
- 📖 Parse CoNLL-U files into structured objects
- ✍️ Serialize back to CoNLL-U format
- 🔍 Query utilities for finding tokens by POS, dependency relation, etc.
- 🌳 Tree navigation helpers for dependency trees
- 🌍 Universal Dependencies compliant - supports all UD features
Installation
npm install conllu-parser
Quick Start
import { parseConllu, reconstructText, findByUpos, getRoot } from 'conllu-parser';
const conllu = `# sent_id = example
# text = The cat sat.
1	The	the	DET	_	Definite=Def|PronType=Art	2	det	_	_
2	cat	cat	NOUN	_	Number=Sing	3	nsubj	_	_
3	sat	sit	VERB	_	Tense=Past	0	root	_	SpaceAfter=No
4	.	.	PUNCT	_	_	3	punct	_	_
`;
const doc = parseConllu(conllu);
// Access sentences
console.log(doc.sentences[0].metadata.text); // "The cat sat."
// Find tokens by POS
const nouns = findByUpos(doc.sentences[0], 'NOUN');
console.log(nouns.map(t => t.form)); // ["cat"]
// Get the root of the dependency tree
const root = getRoot(doc.sentences[0]);
console.log(root?.form); // "sat"
// Reconstruct text from tokens
console.log(reconstructText(doc.sentences[0])); // "The cat sat."
API Reference
Parsing Functions
parseConllu(input: string): ConlluDocument
Parse a CoNLL-U formatted string into a structured document.
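To make the parsing step concrete, here is a minimal sketch of how CoNLL-U input can be split into sentences and ten-column token lines. It is not this package's implementation: `parseRawSentences`, `parseTokenLine`, `RawToken`, and `RawSentence` are hypothetical names, and every field is kept as a raw string rather than converted to the typed `Token` described later.

```typescript
// Hypothetical sketch, not the library's code. Column names follow the CoNLL-U spec.
const COLUMNS = [
  'id', 'form', 'lemma', 'upos', 'xpos',
  'feats', 'head', 'deprel', 'deps', 'misc',
] as const;

type ColumnName = (typeof COLUMNS)[number];
type RawToken = Record<ColumnName, string>;

interface RawSentence {
  comments: string[]; // comment lines without the leading '#'
  tokens: RawToken[];
}

// Split one token line into its ten tab-separated fields.
function parseTokenLine(line: string): RawToken {
  const fields = line.split('\t');
  if (fields.length !== COLUMNS.length) {
    throw new Error(`Expected 10 tab-separated fields, got ${fields.length}`);
  }
  const token = {} as RawToken;
  COLUMNS.forEach((name, i) => {
    token[name] = fields[i];
  });
  return token;
}

// Group lines into sentences; a blank line closes the current sentence.
// Comment-only blocks (no token lines) are dropped in this sketch.
function parseRawSentences(input: string): RawSentence[] {
  const sentences: RawSentence[] = [];
  let current: RawSentence = { comments: [], tokens: [] };
  for (const line of input.split(/\r?\n/)) {
    if (line.trim() === '') {
      if (current.tokens.length > 0) sentences.push(current);
      current = { comments: [], tokens: [] };
    } else if (line.startsWith('#')) {
      current.comments.push(line.slice(1).trim());
    } else {
      current.tokens.push(parseTokenLine(line));
    }
  }
  if (current.tokens.length > 0) sentences.push(current);
  return sentences;
}

const sample = [
  '# text = Hi.',
  ['1', 'Hi', 'hi', 'INTJ', '_', '_', '0', 'root', '_', 'SpaceAfter=No'].join('\t'),
  ['2', '.', '.', 'PUNCT', '_', '_', '1', 'punct', '_', '_'].join('\t'),
  '',
].join('\n');
console.log(parseRawSentences(sample)[0].tokens.map(t => t.form)); // ["Hi", "."]
```

The sketch splits on tab characters only, because the CoNLL-U specification separates columns with single tabs and allows spaces inside some fields.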
const doc = parseConllu(conlluString);
console.log(doc.sentences.length);
console.log(doc.metadata.columns);
serializeConllu(document: ConlluDocument): string
Convert a parsed document back to CoNLL-U format.
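Serialization is roughly the inverse of parsing. The sketch below shows one way a single token could be written back as a tab-separated line. `FlatToken`, `serializeKeyValues`, and `serializeTokenLine` are hypothetical names used for illustration, the token shape is simplified (for example, `deps` stays a raw string), and key order is left as inserted rather than re-sorted.

```typescript
// Hypothetical, simplified token shape for illustration only.
interface FlatToken {
  id: string;
  form: string;
  lemma: string;
  upos: string;
  xpos: string;
  feats: Record<string, string> | '_';
  head: number | '_';
  deprel: string;
  deps: string; // kept as a raw string in this sketch
  misc: Record<string, string> | '_';
}

// Turn { Definite: 'Def', PronType: 'Art' } into "Definite=Def|PronType=Art".
// An empty object or '_' both serialize to the "_" placeholder.
function serializeKeyValues(value: Record<string, string> | '_'): string {
  if (value === '_') return '_';
  const pairs = Object.keys(value).map(key => `${key}=${value[key]}`);
  return pairs.length > 0 ? pairs.join('|') : '_';
}

// Join the ten columns with tab characters, in CoNLL-U column order.
function serializeTokenLine(token: FlatToken): string {
  return [
    token.id,
    token.form,
    token.lemma,
    token.upos,
    token.xpos,
    serializeKeyValues(token.feats),
    String(token.head),
    token.deprel,
    token.deps,
    serializeKeyValues(token.misc),
  ].join('\t');
}

const line = serializeTokenLine({
  id: '1', form: 'The', lemma: 'the', upos: 'DET', xpos: '_',
  feats: { Definite: 'Def', PronType: 'Art' },
  head: 2, deprel: 'det', deps: '_', misc: '_',
});
console.log(line.split('\t').length); // 10
```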
const conlluString = serializeConllu(doc);
Query Functions
findByUpos(sentence: Sentence, upos: UPOS): Token[]
Find all tokens with a specific Universal POS tag.
const verbs = findByUpos(sentence, 'VERB');
const nouns = findByUpos(sentence, 'NOUN');
findByDeprel(sentence: Sentence, deprel: DepRel): Token[]
Find all tokens with a specific dependency relation.
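Conceptually, both query functions are filters over a sentence's tokens. The sketch below uses hypothetical helpers (`filterByUpos`, `filterByDeprel`, `QueryToken`) on a plain token array. The `matchSubtypes` flag is an assumption added for illustration: this README does not say whether `findByDeprel` also matches subtypes such as `nsubj:pass`.

```typescript
// Hypothetical minimal token shape for illustration only.
interface QueryToken {
  form: string;
  upos: string;
  deprel: string;
}

// Keep tokens whose UPOS tag equals the requested tag.
function filterByUpos(tokens: QueryToken[], upos: string): QueryToken[] {
  return tokens.filter(t => t.upos === upos);
}

// Keep tokens with the given relation. With matchSubtypes set, 'nsubj'
// also matches 'nsubj:pass' and other 'nsubj:*' subtypes (an assumption,
// not documented library behavior).
function filterByDeprel(
  tokens: QueryToken[],
  deprel: string,
  matchSubtypes = false,
): QueryToken[] {
  return tokens.filter(
    t => t.deprel === deprel || (matchSubtypes && t.deprel.startsWith(deprel + ':')),
  );
}

const queryTokens: QueryToken[] = [
  { form: 'cat', upos: 'NOUN', deprel: 'nsubj' },
  { form: 'mouse', upos: 'NOUN', deprel: 'nsubj:pass' },
  { form: 'chased', upos: 'VERB', deprel: 'root' },
];
console.log(filterByDeprel(queryTokens, 'nsubj').length);       // 1
console.log(filterByDeprel(queryTokens, 'nsubj', true).length); // 2
```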
const subjects = findByDeprel(sentence, 'nsubj');
const objects = findByDeprel(sentence, 'obj');
Tree Navigation
getRoot(sentence: Sentence): Token | undefined
Get the root token of a sentence's dependency tree.
const root = getRoot(sentence);
console.log(root?.form, root?.upos);
getChildren(sentence: Sentence, tokenId: string | number): Token[]
Get all tokens that depend on a given token.
// getRoot may return undefined, so guard before reading root.id
const children = root ? getChildren(sentence, root.id) : [];
children.forEach(child => console.log(child.deprel, child.form));
getHead(sentence: Sentence, token: Token): Token | undefined
Get the head (parent) of a token in the dependency tree.
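The tree helpers in this section can be pictured as lookups over the `head` column. The sketch below is an illustration under that assumption, not the package's source: `TreeToken`, `findRoot`, `findChildren`, and `findHead` are hypothetical names, and the token shape is reduced to the fields the lookups need.

```typescript
// Hypothetical minimal token shape for illustration only.
interface TreeToken {
  id: string;
  form: string;
  head: number | '_';
}

// The root is the token whose head is 0.
function findRoot(tokens: TreeToken[]): TreeToken | undefined {
  return tokens.find(t => t.head === 0);
}

// Children are the tokens whose head equals the given ID.
// Non-numeric IDs such as "1-2" become NaN and match nothing.
function findChildren(tokens: TreeToken[], tokenId: string | number): TreeToken[] {
  const target = Number(tokenId);
  return tokens.filter(t => t.head === target);
}

// The head is the token whose ID matches this token's head value.
// The root (head 0) and unannotated tokens ('_') have no head.
function findHead(tokens: TreeToken[], token: TreeToken): TreeToken | undefined {
  if (token.head === '_' || token.head === 0) return undefined;
  return tokens.find(t => t.id === String(token.head));
}

const tree: TreeToken[] = [
  { id: '1', form: 'The', head: 2 },
  { id: '2', form: 'cat', head: 3 },
  { id: '3', form: 'sat', head: 0 },
  { id: '4', form: '.', head: 3 },
];
console.log(findRoot(tree)?.form);                     // "sat"
console.log(findChildren(tree, 3).map(t => t.form));   // ["cat", "."]
console.log(findHead(tree, tree[0])?.form);            // "cat"
```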
const head = getHead(sentence, token);
Utility Functions
reconstructText(sentence: Sentence): string
Reconstruct the original text from tokens, respecting SpaceAfter annotations.
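As an illustration of what respecting `SpaceAfter` involves, the sketch below joins surface forms and omits the space after any token whose MISC column contains `SpaceAfter=No`. The names `TextToken` and `joinForms` are hypothetical. The handling of multi-word tokens (use the range's surface form, skip the words it covers) and of empty nodes (skip them) is an assumption about reasonable behavior, not a description of this package's internals.

```typescript
// Hypothetical minimal token shape for illustration only.
interface TextToken {
  id: string;
  form: string;
  misc: Record<string, string> | '_';
}

function joinForms(tokens: TextToken[]): string {
  let text = '';
  let coveredUpTo = 0; // last word ID covered by the most recent multi-word range
  for (const token of tokens) {
    if (token.id.includes('.')) continue; // empty nodes have no surface form
    if (token.id.includes('-')) {
      // A range such as "1-2" supplies the surface form for words 1 and 2.
      coveredUpTo = Number(token.id.split('-')[1]);
    } else if (Number(token.id) <= coveredUpTo) {
      continue; // this word was already emitted as part of a range
    }
    const noSpace = token.misc !== '_' && token.misc['SpaceAfter'] === 'No';
    text += token.form + (noSpace ? '' : ' ');
  }
  return text.trim();
}

const words: TextToken[] = [
  { id: '1', form: 'The', misc: '_' },
  { id: '2', form: 'cat', misc: '_' },
  { id: '3', form: 'sat', misc: { SpaceAfter: 'No' } },
  { id: '4', form: '.', misc: '_' },
];
console.log(joinForms(words)); // "The cat sat."
```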
const text = reconstructText(sentence);
Type Definitions
Main Types
interface ConlluDocument {
metadata: DocumentMetadata;
sentences: Sentence[];
}
interface Sentence {
metadata: SentenceMetadata;
tokens: Token[];
}
interface Token {
id: string; // Token ID (can be "1", "1-2", or "1.1")
form: string; // Word form
lemma: string; // Lemma
upos: UPOS | '_'; // Universal POS tag
xpos: string | '_'; // Language-specific POS tag
feats: MorphFeatures | '_'; // Morphological features
head: number | '_'; // Head token ID (0 for root)
deprel: DepRel | '_'; // Dependency relation
deps: EnhancedDep[] | '_'; // Enhanced dependencies
misc: MiscFeatures | '_'; // Miscellaneous annotations
}
POS Tags (UPOS)
All 17 Universal POS tags are supported:
type UPOS =
| 'ADJ' | 'ADP' | 'ADV' | 'AUX' | 'CCONJ' | 'DET'
| 'INTJ' | 'NOUN' | 'NUM' | 'PART' | 'PRON' | 'PROPN'
| 'PUNCT' | 'SCONJ' | 'SYM' | 'VERB' | 'X';
Dependency Relations (DepRel)
Universal dependency relations including subtypes:
type DepRel =
| 'nsubj' | 'obj' | 'iobj' | 'csubj' | 'ccomp' | 'xcomp'
| 'obl' | 'vocative' | 'expl' | 'dislocated' | 'advcl'
| 'advmod' | 'discourse' | 'aux' | 'cop' | 'mark'
| 'nmod' | 'appos' | 'nummod' | 'acl' | 'amod' | 'det'
| 'clf' | 'case' | 'conj' | 'cc' | 'fixed' | 'flat'
| 'compound' | 'list' | 'parataxis' | 'orphan' | 'goeswith'
| 'reparandum' | 'punct' | 'root' | 'dep'
// Plus subtypes like 'nsubj:pass', 'obl:mod', 'acl:relcl', etc.
| string;
Morphological Features
Full support for Universal Dependencies morphological features:
interface MorphFeatures {
// Nominal
Gender?: 'Masc' | 'Fem' | 'Neut' | 'Com';
Number?: 'Sing' | 'Plur' | 'Dual';
Case?: 'Nom' | 'Acc' | 'Dat' | 'Gen' | 'Voc' | 'Loc' | 'Ins';
Definite?: 'Def' | 'Ind' | 'Spec' | 'Cons';
// Verbal
VerbForm?: 'Fin' | 'Inf' | 'Part' | 'Conv' | 'Ger' | 'Sup';
Mood?: 'Ind' | 'Imp' | 'Cnd' | 'Sub' | 'Opt';
Tense?: 'Past' | 'Pres' | 'Fut' | 'Imp' | 'Pqp';
Person?: '1' | '2' | '3';
Voice?: 'Act' | 'Pass' | 'Mid';
// Pronominal
PronType?: 'Prs' | 'Art' | 'Int' | 'Rel' | 'Dem' | 'Ind';
Poss?: 'Yes';
Reflex?: 'Yes';
// And many more...
[key: string]: string | undefined;
}
CoNLL-U Format
This parser supports the full CoNLL-U format including:
- ✅ Document-level comments (e.g., # global.columns = ...)
- ✅ Sentence metadata (# sent_id, # text, custom fields)
- ✅ All 10 token columns
- ✅ Multi-word tokens (e.g., 1-2 don't)
- ✅ Empty nodes (e.g., 2.1 ...)
- ✅ Enhanced dependencies
- ✅ Morphological features with layered annotations (e.g., Gender[lex])
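Two of the items above, the three ID forms and layered feature names, can be sketched as follows. `classifyTokenId` and `parseFeatures` are hypothetical helpers written for this illustration, not exports of this package. In this sketch, a layered feature such as `Gender[lex]` is stored under its full bracketed name, and the `_` placeholder yields an empty object rather than the `'_'` literal used in the `Token` type.

```typescript
type TokenIdKind = 'word' | 'multiword' | 'empty';

// Distinguish "1" (word), "1-2" (multi-word token range), and "2.1" (empty node).
function classifyTokenId(id: string): TokenIdKind {
  if (/^\d+-\d+$/.test(id)) return 'multiword';
  if (/^\d+\.\d+$/.test(id)) return 'empty';
  if (/^\d+$/.test(id)) return 'word';
  throw new Error(`Unrecognized token ID: ${id}`);
}

// Parse a FEATS field such as "Gender[lex]=Masc|Number=Sing" into key-value pairs.
function parseFeatures(field: string): Record<string, string> {
  const feats: Record<string, string> = {};
  if (field === '_') return feats; // "_" means no features
  for (const pair of field.split('|')) {
    const eq = pair.indexOf('=');
    if (eq === -1) continue; // ignore malformed pairs in this sketch
    feats[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return feats;
}

console.log(classifyTokenId('1-2'));                                        // "multiword"
console.log(parseFeatures('Gender[lex]=Masc|Number=Sing')['Gender[lex]']);  // "Masc"
```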
License
MIT
