conllu

The conllu module provides the standard CoNLL-U format implementation, including the Token and Sentence classes and pre-configured Format instances for reading and writing CoNLL-U files.

Overview

This module is the primary entry point for working with CoNLL-U files. It provides:

  • Token - The CoNLL-U token schema with all standard fields.

  • Sentence - The CoNLL-U sentence schema which can create a Tree model and provides access to metadata and tokens.

  • conllu - A pre-configured Format instance for CoNLL-U. This should be the default in most use cases as opposed to fast_conllu.

  • fast_conllu - A pre-configured Format instance for CoNLL-U which parses faster at the cost of increased memory usage.

  • ConlluFormat - A type alias that abstracts away the full types of conllu and fast_conllu.

The conllu Format Instance

The module exports pre-configured Format[Token] instances named conllu and fast_conllu that are ready to use and are completely interchangeable.

from pyconll.conllu import conllu, fast_conllu, ConlluFormat

cformat: ConlluFormat = conllu  # or fast_conllu

# Load entire file into memory
sentences = cformat.load_from_file('train.conllu')

# Stream large files
for sentence in cformat.iter_from_file('huge.conllu'):
    process(sentence)

# Parse from string
text = """# sent_id = 1
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No

"""
sentences = cformat.load_from_string(text)

# Write back to file
with open('output.conllu', 'w') as f:
    cformat.write_corpus(sentences, f)

The Token Class

The Token class defines the CoNLL-U format with 10 standard columns:

Fields

  1. id: str - Token ID (e.g., “1”, “2-3”, “2.1”)

  2. form: Optional[str] - Word form or punctuation symbol

  3. lemma: Optional[str] - Lemma or stem

  4. upos: Optional[str] - Universal part-of-speech tag

  5. xpos: Optional[str] - Language-specific part-of-speech tag

  6. feats: dict[str, set[str]] - Morphological features

  7. head: Optional[str] - Head token ID

  8. deprel: Optional[str] - Dependency relation

  9. deps: dict[str, tuple[str, ...]] - Enhanced dependencies

  10. misc: dict[str, Optional[set[str]]] - Miscellaneous annotations
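
To make the column order above concrete, here is a minimal sketch that splits one raw token line into the ten named columns. The helper split_columns and the COLUMNS tuple are hypothetical names used only for illustration. They are not part of pyconll, and the library's own parser additionally converts each column into the typed fields listed above.

```python
# Hypothetical helper, not part of pyconll: name the ten raw CoNLL-U columns.
COLUMNS = ("id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc")


def split_columns(line: str) -> dict[str, str]:
    """Split a tab-separated token line into a column-name -> raw-string dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return dict(zip(COLUMNS, values))


raw = split_columns("2\tcat\tcat\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No")
print(raw["form"], raw["upos"], raw["misc"])
```

All values are still raw strings at this stage; an underscore marks a column with no annotation.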

Example Usage

from pyconll.conllu import conllu

sentences = conllu.load_from_file('train.conllu')

for sentence in sentences:
    for token in sentence.tokens:
        # Access basic fields
        if token.upos == 'VERB':
            print(f"Verb: {token.form} -> {token.lemma}")

        # Modify features
        if token.upos == 'NOUN':
            if 'Number' not in token.feats:
                token.feats['Number'] = set()
            token.feats['Number'].add('Sing')

        # Add misc annotations
        token.misc['Analyzed'] = None  # Singleton feature

# Write modified corpus
with open('output.conllu', 'w') as f:
    conllu.write_corpus(sentences, f)

Dictionary Fields

Three fields (feats, deps, misc) are dictionaries.

feats

Morphological features as key-value pairs:

# Example: Gender=Fem|Number=Sing
token.feats = {
    'Gender': {'Fem'},
    'Number': {'Sing'}
}

# Modify
token.feats['Case'] = {'Nom'}
token.feats['Number'].add('Plur')  # Now {'Sing', 'Plur'}

# Serializes to: Case=Nom|Gender=Fem|Number=Sing,Plur
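
As a rough illustration of how a raw FEATS column maps onto the dict[str, set[str]] shape shown above, the following sketch parses the column by hand. parse_feats is a hypothetical helper, not the library's parser, and it assumes the usual CoNLL-U conventions of "|" between features, "=" between key and values, and "," between multiple values.

```python
def parse_feats(column: str) -> dict[str, set[str]]:
    """Parse a raw FEATS column into a feature -> set-of-values dict (sketch)."""
    feats: dict[str, set[str]] = {}
    if column == "_":  # underscore marks an empty column
        return feats
    for pair in column.split("|"):
        key, _, values = pair.partition("=")
        feats.setdefault(key, set()).update(values.split(","))
    return feats


feats = parse_feats("Gender=Fem|Number=Sing")
feats["Number"].add("Plur")  # sets make adding a second value straightforward
print(feats)
```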

deps

Enhanced dependencies as head-to-relation mappings:

# Example: 4:nsubj
token.deps = {
    '4': ('nsubj',)
}

# Each tuple has a fixed size, but the size can differ between entries.
# In CoNLL-U, there are usually only two elements in this field.
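
The head-to-relation shape can be sketched with a small stand-alone parser. parse_deps is a hypothetical helper, and it assumes that everything after the head id is split on ":" into the tuple. That split is a guess based on the example above and may not match how pyconll actually groups relation subtypes.

```python
def parse_deps(column: str) -> dict[str, tuple[str, ...]]:
    """Parse a raw DEPS column into a head-id -> relation-parts dict (sketch)."""
    deps: dict[str, tuple[str, ...]] = {}
    if column == "_":  # underscore marks an empty column
        return deps
    for entry in column.split("|"):
        # Assumption: the first colon-separated part is the head id and the
        # remaining parts form the relation tuple.
        head, *relation = entry.split(":")
        deps[head] = tuple(relation)
    return deps


print(parse_deps("4:nsubj"))
print(parse_deps("4:nsubj|5:conj:and"))
```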

misc

Miscellaneous annotations with optional values:

# Singleton features (no value)
token.misc['SpaceAfter'] = None  # Serializes as "SpaceAfter"

# Features with values
token.misc['Translit'] = {'example'}  # Serializes as "Translit=example"

# Multiple values
token.misc['Gloss'] = {'cat', 'feline'}  # Serializes as "Gloss=cat,feline"
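
The difference between singleton entries and entries with values can be sketched as a small serializer. serialize_misc is a hypothetical helper rather than the library's code, and the alphabetical ordering of keys and values is an assumption made only so that the output is deterministic.

```python
from typing import Optional


def serialize_misc(misc: dict[str, Optional[set[str]]]) -> str:
    """Serialize a misc dict back into a raw MISC column (sketch)."""
    if not misc:
        return "_"  # underscore marks an empty column
    parts = []
    for key in sorted(misc):  # assumption: keys are written alphabetically
        values = misc[key]
        if values is None:
            parts.append(key)  # singleton entry: key only, no "="
        else:
            parts.append(f"{key}={','.join(sorted(values))}")
    return "|".join(parts)


print(serialize_misc({"SpaceAfter": None, "Translit": {"example"}}))
```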

API

Defines the Token type and parsing and output logic. A Token is the basic unit in CoNLL-U, so the data and parsing in this module are central to the CoNLL-U format.

class pyconll.conllu.Sentence

A sentence in a CoNLL-U file. A sentence consists of several components.

First are the comments. Per the UD v2 guidelines, each sentence must have two comments: sent_id and text. Comments are stored as a dict in the meta field. For singleton comments with no key-value structure, the value in the dict is None.

Note that, for usability and because of their importance as comments, the sent_id field is also exposed through the id property and the text field through the text property. The text property is read-only, as are the paragraph and document ids, because the paragraph and document ids are not defined per Sentence but across multiple sentences. These fields can instead be changed by modifying the metadata of the Sentences.

Then come the token annotations. Each sentence is made up of many token lines that annotate the provided text. While a sentence usually means a collection of tokens, in the CoNLL-U sense it is more useful to think of it as a collection of annotations with some associated metadata. Therefore the text of the sentence cannot be changed with this class; only the associated annotations can be changed.
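
The comment handling described above, with key-value comments and singleton comments mapped to None, can be illustrated with a stand-alone sketch. parse_comment is a hypothetical helper and not how pyconll necessarily implements it; it only mirrors the shape of the meta dict.

```python
from typing import Optional


def parse_comment(line: str) -> tuple[str, Optional[str]]:
    """Turn one '#' comment line into a (key, value) pair (sketch)."""
    body = line.lstrip("#").strip()
    key, sep, value = body.partition("=")
    if not sep:
        return body, None  # singleton comment such as "# newpar"
    return key.strip(), value.strip()


lines = ["# sent_id = 1", "# text = The cat", "# newpar"]
meta = dict(parse_comment(line) for line in lines)
print(meta)
```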

to_tree() → Tree[Token]

Create a tree from the default, pre-defined CoNLL-U tokens.

This follows the assumptions of the CoNLL-U format, such as that the root token has a parent id of “0”, and that empty and multiword tokens do not participate in the underlying tree structure.

Parameters:

tokens – The token objects to create a tree structure from.

Returns:

The constructed Tree object.
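
The assumptions listed for to_tree() can be sketched without the library by working on plain (id, head) pairs. build_children is a hypothetical helper that returns a simple parent-to-children mapping instead of a pyconll Tree, and it detects multiword and empty tokens by looking for "-" or "." in the id.

```python
def build_children(tokens: list[tuple[str, str]]) -> tuple[str, dict[str, list[str]]]:
    """Return (root id, parent id -> child ids) for (id, head) pairs (sketch)."""
    children: dict[str, list[str]] = {}
    root = None
    for token_id, head in tokens:
        if "-" in token_id or "." in token_id:
            continue  # multiword and empty tokens stay out of the tree
        if head == "0":
            root = token_id  # the root token has a parent id of "0"
        else:
            children.setdefault(head, []).append(token_id)
    if root is None:
        raise ValueError("no token has head '0'")
    return root, children


pairs = [("1", "2"), ("2", "0"), ("3-4", "_"), ("3", "2"), ("4", "3"), ("4.1", "_")]
root, children = build_children(pairs)
print(root, children)
```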

class pyconll.conllu.Token(id, form, lemma, upos, xpos, feats, head, deprel, deps, misc)

The base Token definition which will be used for both the Standard and Compact implementations.

This defines the attributes and any behavior on the CoNLL-U data model.

__post_init__() → None

Post-initialization logic, beyond per-field serialization, needed to properly create a Token.

Specifically, this handles the case where both the form and lemma are underscores, in which case they should be treated as their raw values.

is_empty_node() → bool

Checks if this Token is an empty node, used for ellipsis annotation.

Note that this is separate from any field being empty; rather, it means the id contains a period.

Returns:

True if this token is an empty node and False otherwise.

is_multiword() → bool

Checks if this Token is a multiword token.

Returns:

True if this token is a multiword token, and False otherwise.
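
The two id checks can be sketched on plain id strings. classify_id is a hypothetical helper, not part of pyconll. It assumes, in line with the id examples given for the Token fields, that a multiword token id is a range such as "2-3" and an empty node id contains a period such as "2.1".

```python
def classify_id(token_id: str) -> str:
    """Classify a CoNLL-U id string as 'multiword', 'empty', or 'word' (sketch)."""
    if "-" in token_id:
        return "multiword"  # a range such as "2-3"
    if "." in token_id:
        return "empty"  # an empty node such as "2.1"
    return "word"


for token_id in ("1", "2-3", "2.1"):
    print(token_id, classify_id(token_id))
```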

pyconll.conllu.conllu: ConlluFormat = <pyconll.format.Format object>

The Format instance which handles CoNLL-U and should be used in most scenarios. It is about 10-15% slower than fast_conllu but creates a much more compact in-memory representation (about 30% smaller). This matters because the largest treebanks in the CoNLL-U corpus can be difficult to load on a normal laptop with multiple processes open, and the compact representation avoids such memory issues. This instance provides both parsing and serialization capabilities in a single interface.

pyconll.conllu.fast_conllu: ConlluFormat = <pyconll.format.Format object>

The Format instance which has the same interface as the default conllu Format instance but runs about 10% faster at the cost of more memory. When using the iter_* family of methods (where the full treebank is not loaded into memory at once anyway), this instance can be preferred. It provides both parsing and serialization capabilities in a single interface.