conllu
The conllu module provides the standard CoNLL-U format implementation, including the Token and Sentence classes and pre-configured Format instances for reading and writing CoNLL-U files.
Overview
This module is the primary entry point for working with CoNLL-U files. It provides:
- Token - The CoNLL-U token schema with all standard fields.
- Sentence - The CoNLL-U sentence schema, which can create a Tree model and provides access to metadata and tokens.
- conllu - A pre-configured Format instance for CoNLL-U. This should be the default in most use cases, as opposed to fast_conllu.
- fast_conllu - A pre-configured Format instance for CoNLL-U which parses faster at the cost of increased memory usage.
- ConlluFormat - A type alias that avoids having to spell out the full types of conllu and fast_conllu.
The conllu Format Instance
The module exports pre-configured Format[Token] instances named conllu and fast_conllu that are ready to use and are completely interchangeable.
from pyconll.conllu import conllu, fast_conllu, ConlluFormat
cformat: ConlluFormat = conllu # or fast_conllu
# Load entire file into memory
sentences = cformat.load_from_file('train.conllu')
# Stream large files
for sentence in cformat.iter_from_file('huge.conllu'):
    process(sentence)
# Parse from string
text = """# sent_id = 1
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No
"""
sentences = cformat.load_from_string(text)
# Write back to file
with open('output.conllu', 'w') as f:
    cformat.write_corpus(sentences, f)
The Token Class
The Token class defines the CoNLL-U format with 10 standard columns:
Fields
- id: str - Token ID (e.g., “1”, “2-3”, “2.1”)
- form: Optional[str] - Word form or punctuation symbol
- lemma: Optional[str] - Lemma or stem
- upos: Optional[str] - Universal part-of-speech tag
- xpos: Optional[str] - Language-specific part-of-speech tag
- feats: dict[str, set[str]] - Morphological features
- head: Optional[str] - Head token ID
- deprel: Optional[str] - Dependency relation
- deps: dict[str, tuple[str, ...]] - Enhanced dependencies
- misc: dict[str, Optional[set[str]]] - Miscellaneous annotations
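To make the column layout concrete, the following is a minimal plain-Python sketch that maps one raw token line onto the ten field names listed above. It is illustrative only and is not how pyconll parses lines; the names COLUMNS and split_columns are hypothetical, and the values are left as raw strings rather than converted to the typed fields.

```python
# Illustrative sketch only; not pyconll's parser.
# COLUMNS and split_columns are hypothetical names.
COLUMNS = ("id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc")


def split_columns(line: str) -> dict[str, str]:
    """Map the ten tab-separated columns of a token line to field names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return dict(zip(COLUMNS, values))


raw = split_columns("2\tcat\tcat\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No")
print(raw["form"], raw["upos"], raw["head"])
```

Every value here is still the raw column text; the typed fields (for example the feats dictionary) are the result of further per-column parsing.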
Example Usage
from pyconll.conllu import conllu
sentences = conllu.load_from_file('train.conllu')
for sentence in sentences:
    for token in sentence.tokens:
        # Access basic fields
        if token.upos == 'VERB':
            print(f"Verb: {token.form} -> {token.lemma}")
        # Modify features
        if token.upos == 'NOUN':
            if 'Number' not in token.feats:
                token.feats['Number'] = set()
            token.feats['Number'].add('Sing')
        # Add misc annotations
        token.misc['Analyzed'] = None # Singleton feature
# Write modified corpus
with open('output.conllu', 'w') as f:
    conllu.write_corpus(sentences, f)
Dictionary Fields
Three fields (feats, deps, misc) are dictionaries.
feats
Morphological features as key-value pairs:
# Example: Gender=Fem|Number=Sing
token.feats = {
    'Gender': {'Fem'},
    'Number': {'Sing'}
}
# Modify
token.feats['Case'] = {'Nom'}
token.feats['Number'].add('Plur') # Now {'Sing', 'Plur'}
# Serializes to: Case=Nom|Gender=Fem|Number=Sing,Plur
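The mapping between the FEATS column and this dictionary shape can be sketched in plain Python. This is a minimal illustration under stated assumptions, not pyconll's implementation: parse_feats and serialize_feats are hypothetical helpers, and sorting keys and values case-insensitively is a choice made here for deterministic output.

```python
# Illustrative sketch only; parse_feats and serialize_feats are
# hypothetical helpers, not part of pyconll.
def parse_feats(column: str) -> dict[str, set[str]]:
    """Turn 'Gender=Fem|Number=Sing' into {'Gender': {'Fem'}, 'Number': {'Sing'}}."""
    if column == "_":
        return {}  # an underscore marks an empty column
    feats: dict[str, set[str]] = {}
    for pair in column.split("|"):
        key, _, values = pair.partition("=")
        feats[key] = set(values.split(","))
    return feats


def serialize_feats(feats: dict[str, set[str]]) -> str:
    """Join pairs with '|' and values with ','; ordering is an assumption."""
    if not feats:
        return "_"
    parts = []
    for key in sorted(feats, key=str.lower):
        values = ",".join(sorted(feats[key], key=str.lower))
        parts.append(f"{key}={values}")
    return "|".join(parts)


feats = parse_feats("Gender=Fem|Number=Sing")
feats["Case"] = {"Nom"}
print(serialize_feats(feats))
```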
deps
Enhanced dependencies as head-to-relation mappings:
# Example: 4:nsubj
token.deps = {
    '4': ('nsubj',)
}
# Each tuple has a fixed size, but the size can differ between entries.
# In CoNLL-U, there are usually only two elements in this field.
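As a rough illustration of how a DEPS column could map onto this dictionary, the sketch below splits each entry on ':' and treats the first component as the head. The helper parse_deps is hypothetical, and the way relation subtypes are split into tuple elements is an assumption that may not match pyconll's behavior.

```python
# Illustrative sketch only; parse_deps is a hypothetical helper and the
# splitting of relation subtypes on ':' is an assumption.
def parse_deps(column: str) -> dict[str, tuple[str, ...]]:
    """Turn '4:nsubj|6:conj:and' into {'4': ('nsubj',), '6': ('conj', 'and')}."""
    if column == "_":
        return {}  # an underscore marks an empty column
    deps: dict[str, tuple[str, ...]] = {}
    for entry in column.split("|"):
        head, *relation = entry.split(":")
        deps[head] = tuple(relation)
    return deps


deps = parse_deps("4:nsubj|6:conj:and")
print(deps)
```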
misc
Miscellaneous annotations with optional values:
# Singleton features (no value)
token.misc['SpaceAfter'] = None # Serializes as "SpaceAfter"
# Features with values
token.misc['Translit'] = {'example'} # Serializes as "Translit=example"
# Multiple values
token.misc['Gloss'] = {'cat', 'feline'} # Serializes as "Gloss=cat,feline"
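The serialization comments above can be summarized with a small plain-Python sketch. It is not pyconll's serializer: serialize_misc is a hypothetical helper, and sorting keys and values is an assumption made here so the output is deterministic.

```python
from typing import Optional


# Illustrative sketch only; serialize_misc is a hypothetical helper.
def serialize_misc(misc: dict[str, Optional[set[str]]]) -> str:
    """Write singleton keys bare and valued keys as key=v1,v2."""
    if not misc:
        return "_"  # an underscore marks an empty column
    parts = []
    for key in sorted(misc):
        values = misc[key]
        if values is None:
            parts.append(key)  # singleton feature, no '=' at all
        else:
            parts.append(f"{key}={','.join(sorted(values))}")
    return "|".join(parts)


print(serialize_misc({"SpaceAfter": None, "Gloss": {"cat", "feline"}}))
```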
API
Defines the Token type and its parsing and output logic. A Token is the basic unit in CoNLL-U, so the data and parsing in this module are central to the CoNLL-U format.
- class pyconll.conllu.Sentence
A sentence in a CoNLL-U file. A sentence consists of several components.
First are the comments. Per the UD v2 guidelines, each sentence must have two comments: sent_id and text. Comments are stored as a dict in the meta field. For singleton comments with no key-value structure, the value in the dict is None.
For convenience, and because of their importance, the sent_id comment is also exposed through the id property and the text comment through the text property. The text property is read-only, as are the paragraph and document ids. The paragraph and document ids are read-only because they are not defined per Sentence but across multiple sentences. Instead, these fields can be changed by modifying the metadata of the Sentences.
Then come the token annotations. Each sentence is made up of many token lines that annotate the provided text. While a sentence usually means a collection of tokens, in the CoNLL-U sense it is more useful to think of it as a collection of annotations with some associated metadata. Therefore the text of the sentence cannot be changed with this class; only the associated annotations can be changed.
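The comment handling described above, with key-value comments and singleton comments mapped to None, can be sketched in plain Python. This is only an illustration of the resulting dict shape; parse_comments is a hypothetical helper and not part of pyconll.

```python
from typing import Optional


# Illustrative sketch only; parse_comments is a hypothetical helper.
def parse_comments(lines: list[str]) -> dict[str, Optional[str]]:
    """Collect '# key = value' comments; singleton comments map to None."""
    meta: dict[str, Optional[str]] = {}
    for line in lines:
        if not line.startswith("#"):
            continue  # token lines are not comments
        body = line[1:].strip()
        key, sep, value = body.partition("=")
        meta[key.strip()] = value.strip() if sep else None
    return meta


meta = parse_comments([
    "# newpar",
    "# sent_id = 1",
    "# text = The cat",
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_",
])
print(meta)
```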
- to_tree() -> Tree[Token]
Create a tree from the default, pre-defined CoNLL-U tokens.
This follows the assumptions of the CoNLL-U format, such as that the root token has a parent id of “0”, and that empty and multiword tokens do not participate in the underlying tree structure.
- Parameters:
tokens – The token objects to create a tree structure from.
- Returns:
The constructed Tree object.
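The assumptions listed for to_tree() can be illustrated without the library. The sketch below works on plain (id, head) pairs rather than Token objects and returns a simple child map rather than a Tree; build_children is a hypothetical name, and treating "-" and "." in an id as the markers for multiword and empty tokens follows the id examples given for the Token fields.

```python
from collections import defaultdict
from typing import Optional


# Illustrative sketch only; build_children is a hypothetical helper that
# returns a plain child map, not a pyconll Tree.
def build_children(
    tokens: list[tuple[str, Optional[str]]],
) -> tuple[str, dict[str, list[str]]]:
    """Return the root id and a head-id -> child-ids map."""
    children: dict[str, list[str]] = defaultdict(list)
    root: Optional[str] = None
    for token_id, head in tokens:
        if "-" in token_id or "." in token_id:
            continue  # multiword ("3-4") and empty ("4.1") tokens are skipped
        if head == "0":
            if root is not None:
                raise ValueError("more than one root token")
            root = token_id
        elif head is not None:
            children[head].append(token_id)
    if root is None:
        raise ValueError("no token with head '0'")
    return root, dict(children)


root, children = build_children([
    ("1", "2"), ("2", "0"), ("3-4", None), ("3", "2"), ("4", "3"), ("4.1", None),
])
print(root, children)
```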
- class pyconll.conllu.Token(id, form, lemma, upos, xpos, feats, head, deprel, deps, misc)
The base Token definition which will be used for both the Standard and Compact implementations.
This defines the attributes and any behavior on the CoNLL-U data model.
- __post_init__() -> None
Post-initialization logic beyond per-field serialization needed to properly create Token.
Specifically, this handles the case where both the form and the lemma are underscores, in which case they are kept as their raw values, that is, as literal underscores.
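A minimal sketch of this rule is shown below. It assumes that a lone underscore in either column otherwise means an unspecified value and maps to None, which is consistent with the Optional[str] types of form and lemma but is not confirmed by this page; read_form_and_lemma is a hypothetical helper and not the library's __post_init__.

```python
from typing import Optional


# Illustrative sketch only; read_form_and_lemma is a hypothetical helper.
def read_form_and_lemma(
    form: str, lemma: str
) -> tuple[Optional[str], Optional[str]]:
    """Keep '_' literally when both columns are '_', else treat '_' as empty."""
    if form == "_" and lemma == "_":
        return "_", "_"  # the token itself is an underscore character
    # Assumption: a lone underscore marks an unspecified value.
    return (
        None if form == "_" else form,
        None if lemma == "_" else lemma,
    )


print(read_form_and_lemma("_", "_"))
print(read_form_and_lemma("cats", "_"))
```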
- pyconll.conllu.conllu: ConlluFormat = <pyconll.format.Format object>
The Format instance which handles CoNLL-U and should be used in most scenarios. It is about 10-15% slower than fast_conllu but creates a much more compact in-memory representation (about 30% smaller). The largest treebanks in the CoNLL-U corpus can be difficult to load on a normal laptop with multiple processes open, and the compact representation avoids these memory issues. This instance provides both parsing and serialization capabilities in a single interface.
- pyconll.conllu.fast_conllu: ConlluFormat = <pyconll.format.Format object>
The Format instance which has the same interface as the default conllu Format instance, but runs about 10% faster at the cost of higher memory usage. When using the iter_* family of methods (where the full treebank is never loaded into memory at once anyway), this instance can be preferred. It provides both parsing and serialization capabilities in a single interface.