conllu
The conllu module provides the standard CoNLL-U format implementation, including the Token and Sentence classes and pre-configured Format instances for reading and writing CoNLL-U files.
Overview
This module is the primary entry point for working with CoNLL-U files. It provides:
Token - The CoNLL-U token schema with all standard fields.
Sentence - The CoNLL-U sentence schema, which can create a Tree model and provides access to metadata and tokens.
conllu - A pre-configured Format instance for CoNLL-U. Prefer this over fast_conllu in most use cases.
fast_conllu - A pre-configured Format instance for CoNLL-U that trades increased memory usage for faster parsing.
ConlluFormat - A type alias that saves you from spelling out the full types of conllu and fast_conllu.
The conllu Format Instance
The module exports pre-configured Format[Token] instances named conllu and fast_conllu that are ready to use and are completely interchangeable.
from pyconll.conllu import conllu, fast_conllu, ConlluFormat
cformat: ConlluFormat = conllu  # or fast_conllu
# Load entire file into memory
sentences = cformat.load_from_file('train.conllu')
# Stream large files
for sentence in cformat.iter_from_file('huge.conllu'):
    process(sentence)
# Parse from string
text = """# sent_id = 1
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No
"""
sentences = cformat.load_from_string(text)
# Write back to file
with open('output.conllu', 'w', encoding='utf-8') as f:
    cformat.write_corpus(sentences, f)
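To make the difference between loading and streaming concrete, here is a minimal standalone sketch of blank-line-delimited sentence iteration. It uses only the standard library and is not how iter_from_file is implemented; the helper name iter_sentence_blocks is hypothetical and exists only for this illustration.

```python
import io
from typing import Iterable, Iterator


def iter_sentence_blocks(lines: Iterable[str]) -> Iterator[list[str]]:
    """Yield one sentence at a time as a list of its non-blank lines.

    Hypothetical helper for illustration only, not part of pyconll.
    """
    block: list[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            # A blank line ends the current sentence.
            yield block
            block = []
    if block:
        # Final sentence without a trailing blank line.
        yield block


sample = io.StringIO(
    "# sent_id = 1\n"
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tcat\tcat\tNOUN\t_\t_\t0\troot\t_\t_\n"
)
blocks = list(iter_sentence_blocks(sample))
```

Because the generator holds only one sentence at a time, memory use stays roughly constant regardless of file size, which is the general idea behind preferring a streaming call for very large corpora.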
The Token Class
The Token class defines the CoNLL-U token schema, with one field for each of the 10 standard columns:
Fields
id: str - Token ID (e.g., “1”, “2-3”, “2.1”)
form: Optional[str] - Word form or punctuation symbol
lemma: Optional[str] - Lemma or stem
upos: Optional[str] - Universal part-of-speech tag
xpos: Optional[str] - Language-specific part-of-speech tag
feats: dict[str, set[str]] - Morphological features
head: Optional[str] - Head token ID
deprel: Optional[str] - Dependency relation
deps: dict[str, tuple[str, ...]] - Enhanced dependencies
misc: dict[str, Optional[set[str]]] - Miscellaneous annotations
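As a rough illustration of how the 10 tab-separated columns line up with these field names, the following standalone sketch splits one raw line into a dictionary of strings. It does not use pyconll and does not perform the typed conversion that Token does; split_columns and FIELD_NAMES are hypothetical names chosen for this example.

```python
# Column order as defined by the CoNLL-U format.
FIELD_NAMES = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def split_columns(line: str) -> dict[str, str]:
    """Map the raw columns of one token line to their field names.

    Hypothetical helper for illustration only, not part of pyconll.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELD_NAMES):
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return dict(zip(FIELD_NAMES, columns))


raw = split_columns("2\tcat\tcat\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No")
```

Here every value is still a plain string, including the "_" placeholder for empty columns.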
Example Usage
from pyconll.conllu import conllu
sentences = conllu.load_from_file('train.conllu')
for sentence in sentences:
    for token in sentence.tokens:
        # Access basic fields
        if token.upos == 'VERB':
            print(f"Verb: {token.form} -> {token.lemma}")
        # Modify features
        if token.upos == 'NOUN':
            if 'Number' not in token.feats:
                token.feats['Number'] = set()
            token.feats['Number'].add('Sing')
        # Add misc annotations
        token.misc['Analyzed'] = None  # Singleton feature
# Write modified corpus
with open('output.conllu', 'w', encoding='utf-8') as f:
    conllu.write_corpus(sentences, f)
Dictionary Fields
Three fields (feats, deps, misc) are dictionaries.
feats
Morphological features as key-value pairs:
# Example: Gender=Fem|Number=Sing
token.feats = {
'Gender': {'Fem'},
'Number': {'Sing'}
}
# Modify
token.feats['Case'] = {'Nom'}
token.feats['Number'].add('Plur') # Now {'Sing', 'Plur'}
# Serializes to: Case=Nom|Gender=Fem|Number=Plur,Sing
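The mapping between the FEATS column and a dict[str, set[str]] can be sketched with a small standalone parser. This is an illustration of the column syntax (pairs separated by "|", values separated by ","), not pyconll's own parsing code; parse_feats is a hypothetical name, and treating "_" as an empty dictionary is an assumption of this sketch.

```python
def parse_feats(column: str) -> dict[str, set[str]]:
    """Parse a raw FEATS column into a key-to-value-set mapping.

    Hypothetical helper for illustration only, not part of pyconll.
    """
    if column == "_":
        # Assumption: an empty column becomes an empty dictionary.
        return {}
    feats: dict[str, set[str]] = {}
    for pair in column.split("|"):
        key, _, values = pair.partition("=")
        feats[key] = set(values.split(","))
    return feats


feats = parse_feats("Gender=Fem|Number=Plur,Sing")
```

Using a set for the values reflects that a feature's values have no meaningful order and cannot repeat.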
deps
Enhanced dependencies as head-to-relation mappings:
# Example: 4:nsubj
token.deps = {
'4': ('nsubj',)
}
# Each tuple has a fixed size, but that size can differ between entries.
# In CoNLL-U, there are usually only two elements in this field.
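One way a raw DEPS column could map onto a dict[str, tuple[str, ...]] is sketched below. This is a standalone illustration, not pyconll's parser: parse_deps is a hypothetical name, and splitting the relation on ":" into a tuple of parts (so that a subtyped relation yields a longer tuple) is an assumption made to match the '4': ('nsubj',) example above.

```python
def parse_deps(column: str) -> dict[str, tuple[str, ...]]:
    """Parse a raw DEPS column into a head-to-relation mapping.

    Hypothetical helper for illustration only, not part of pyconll.
    """
    if column == "_":
        # Assumption: an empty column becomes an empty dictionary.
        return {}
    deps: dict[str, tuple[str, ...]] = {}
    for entry in column.split("|"):
        # The first part is the head ID; the rest describe the relation.
        head, *relation = entry.split(":")
        deps[head] = tuple(relation)
    return deps


deps = parse_deps("4:nsubj|6:conj:and")
```

Under these assumptions, the entry for head "4" is a one-element tuple and the entry for head "6" is a two-element tuple, which shows how tuple sizes can differ between entries.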
misc
Miscellaneous annotations with optional values:
# Singleton features (no value)
token.misc['SpaceAfter'] = None # Serializes as "SpaceAfter"
# Features with values
token.misc['Translit'] = {'example'} # Serializes as "Translit=example"
# Multiple values
token.misc['Gloss'] = {'cat', 'feline'} # Serializes as "Gloss=cat,feline"
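The three serialization cases above can be combined in a short standalone sketch. It is not pyconll's serializer: serialize_misc is a hypothetical name, and sorting keys and values alphabetically is an assumption made here so that the output is deterministic.

```python
from typing import Optional


def serialize_misc(misc: dict[str, Optional[set[str]]]) -> str:
    """Render a misc mapping as a raw MISC column.

    Hypothetical helper for illustration only, not part of pyconll.
    """
    if not misc:
        return "_"
    parts = []
    # Assumption: keys and values are emitted in sorted order.
    for key in sorted(misc):
        values = misc[key]
        if values is None:
            # Singleton feature: key only, no "=".
            parts.append(key)
        else:
            parts.append(f"{key}={','.join(sorted(values))}")
    return "|".join(parts)


line = serialize_misc({
    "SpaceAfter": None,
    "Translit": {"example"},
    "Gloss": {"cat", "feline"},
})
```

The None case is what distinguishes misc from feats: a key can appear on its own without any value.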