format

The format module defines the core interface for reading and writing tabular data formats. It provides three main classes: ReadFormat, WriteFormat, and Format (which inherits from both).

Overview

The Format system is built around the tokenspec decorator and the AbstractSentence ABC, allowing you to define custom token and sentence types and automatically generate optimized parsers and serializers for them. This makes pyconll flexible enough to work with CoNLL-U or any other tabular format.

The Format class compiles reading and writing logic based on your token schema at initialization time.

Classes

ReadFormat[T, S]

Provides methods for parsing tabular data into Python objects. It offers operations on individual Tokens and Sentences, but most usage is on collections of Sentences.

WriteFormat[T, S]

Provides methods for serializing Python objects to tabular format. Like ReadFormat, it offers operations on individual Tokens and Sentences, but most usage is on collections of Sentences.

Format[T, S]

Combines both ReadFormat and WriteFormat functionality. This is the class you’ll typically use. Keeping the read and write sides separate leaves room for future changes that allow serialization-only or deserialization-only types.
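The split can be pictured with a minimal sketch. The names below (LineReader, LineWriter, LineFormat, parse_token, serialize_token) are hypothetical stand-ins and not pyconll's actual definitions. The sketch only shows how a combined class can inherit a generic read side and a generic write side that share the same token and sentence type parameters.

```python
from typing import Generic, TypeVar

T = TypeVar("T")  # token type
S = TypeVar("S")  # sentence type


class LineReader(Generic[T, S]):
    """Hypothetical read side: text -> Python objects."""

    def parse_token(self, line: str) -> list[str]:
        # A real reader would build a T; this sketch only splits columns.
        return line.split("\t")


class LineWriter(Generic[T, S]):
    """Hypothetical write side: Python objects -> text."""

    def serialize_token(self, columns: list[str]) -> str:
        return "\t".join(columns)


class LineFormat(LineReader[T, S], LineWriter[T, S]):
    """Combined class: inherits both sides and adds nothing of its own."""


fmt: LineFormat[list[str], list[list[str]]] = LineFormat()
line = "1\tdogs\tdog\tNOUN"

# Reading and then writing a line reproduces the original text.
assert fmt.serialize_token(fmt.parse_token(line)) == line
```

Code that only needs one direction can be annotated against the narrower read or write class, which is what makes one-directional types possible later.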

Example

Creating a custom format for CoNLL-X:

from pyconll.format import Format
from pyconll.schema import tokenspec, nullable, unique_array, field
from pyconll.shared import Sentence
from typing import Optional

@tokenspec
class TokenX:
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: set[str] = field(unique_array(str, "|", "_"))
    head: int
    deprel: str
    phead: Optional[int] = field(nullable(int, "_"))
    pdeprel: Optional[str] = field(nullable(str, "_"))

# Create format instance
conllx = Format(TokenX, Sentence[TokenX], comment_marker="#", delimiter="\t")

# Load data
sentences = conllx.load_from_file("data.conllx")

# Modify data
for sentence in sentences:
    for token in sentence.tokens:
        if token.postag == "NN":
            token.feats.add("Modified")

# Write back
with open("output.conllx", "w", encoding="utf-8") as f:
    conllx.write_corpus(sentences, f)

Using the pre-configured CoNLL-U format:

from pyconll.conllu import conllu  # Pre-defined Format instance

# Load
sentences = conllu.load_from_file("train.conllu")

# Stream for large files
for sentence in conllu.iter_from_file("huge.conllu"):
    process(sentence)  # process() is a placeholder for your own function

Performance Notes

The Format class uses dynamic code generation (via Python’s compile() and exec()) to create optimized parsers and serializers. This compilation happens once at Format initialization, so:

  • Creating a Format instance has some overhead (typically milliseconds).

  • Once created, the instance reuses its compiled parsers and serializers for every call.

  • Reuse Format instances rather than recreating them.

For CoNLL-U specifically, use the pre-configured conllu or fast_conllu instance from pyconll.conllu rather than creating your own.
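To make the compile-once idea concrete, here is a minimal sketch of schema-driven code generation with compile() and exec(). It is not pyconll's implementation. The build_parser helper and the dictionary-based schema are assumptions made for illustration, and the generated function returns a plain dict rather than a token object.

```python
from typing import Any, Callable


def build_parser(fields: dict[str, type], delimiter: str = "\t") -> Callable[[str], dict[str, Any]]:
    """Generate a parse(line) function specialised for the given fields."""
    namespace: dict[str, Any] = {}
    lines = [
        "def parse(line):",
        f"    cols = line.split({delimiter!r})",
        "    return {",
    ]
    for index, (name, converter) in enumerate(fields.items()):
        # Each converter is exposed to the generated code under a fixed name.
        namespace[f"_conv{index}"] = converter
        lines.append(f"        {name!r}: _conv{index}(cols[{index}]),")
    lines.append("    }")
    source = "\n".join(lines)

    # Compile the generated source once and pull the function out.
    exec(compile(source, "<generated parser>", "exec"), namespace)
    return namespace["parse"]


# The cost of generating and compiling is paid here, once.
parse = build_parser({"id": int, "form": str, "head": int})

# Every later call runs the already-compiled function.
row = parse("1\tdogs\t2")
assert row == {"id": 1, "form": "dogs", "head": 2}
```

The generated function contains no loop over the schema and no per-field lookups, which is the kind of saving that makes rebuilding the parser on every call wasteful.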

Advanced: Dynamic Field Descriptors

The Format constructor accepts a field_descriptors parameter that allows you to provide field descriptors dynamically instead of as class attributes. This is useful for:

  • Switching between different serialization strategies at runtime

  • Performance tuning (e.g., using sys.intern for string interning)

  • Sharing token classes across multiple formats

from pyconll.format import Format
from pyconll.schema import tokenspec, nullable, via, FieldDescriptor
from pyconll.shared import Sentence
import sys
from typing import Optional

@tokenspec
class Token:
    id: str
    form: str
    lemma: str
    upos: str

# Define descriptors separately
standard_descriptors: dict[str, Optional[FieldDescriptor]] = {
    'id': None,  # None for primitive types (str, int, float)
    'form': nullable(str, "_"),
    'lemma': nullable(str, "_"),
    'upos': nullable(str, "_"),
}

# Compact version using string interning for memory efficiency
compact_descriptors: dict[str, Optional[FieldDescriptor]] = {
    'id': via(sys.intern),
    'form': nullable(via(sys.intern), "_"),
    'lemma': nullable(via(sys.intern), "_"),
    'upos': nullable(via(sys.intern), "_"),
}

# Create two different Format instances with the same Token class
standard_format = Format(Token, Sentence[Token], field_descriptors=standard_descriptors)
compact_format = Format(Token, Sentence[Token], field_descriptors=compact_descriptors)
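The motivation for the compact variant can be shown with the standard library alone. sys.intern returns one canonical object per distinct string value, so columns that repeat a small set of values (such as upos tags) can share storage instead of holding many equal copies. The snippet below is a standalone sketch and does not call pyconll.

```python
import sys

# Build equal strings at runtime, as a parser does when slicing input lines.
first = "".join(["NO", "UN"])
second = "".join(["NO", "UN"])

interned_first = sys.intern(first)
interned_second = sys.intern(second)

# Interning maps equal values onto the very same object.
assert interned_first == interned_second
assert interned_first is interned_second
```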

When both class attributes (using field()) and field_descriptors are provided, field_descriptors takes precedence.

The extra_primitives parameter lets you specify additional types that should be treated as primitives: they are constructed via their type constructor and serialized via str(). It likewise takes precedence over anything provided on @tokenspec.

Note that one downside of the field_descriptors parameter is that no type checking is performed, unlike with field() on the class definition. Use it with care, and only where the exact schema implementation varies at runtime.
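The contract described for extra_primitives (construct via the type constructor, serialize via str()) can be checked for a candidate type before handing it to a Format. The roundtrip_primitive helper below is hypothetical and not part of pyconll. It only illustrates which types satisfy that contract.

```python
from decimal import Decimal


def roundtrip_primitive(primitive_type: type, text: str) -> str:
    """Parse text with the type's constructor, then serialize with str()."""
    value = primitive_type(text)
    return str(value)


# Decimal keeps the original digits, so the text survives a round trip.
assert roundtrip_primitive(Decimal, "1.50") == "1.50"

# int satisfies the same contract.
assert roundtrip_primitive(int, "42") == "42"

# float does not always reproduce its input, which matters when
# byte-identical output is expected.
assert roundtrip_primitive(float, "1.50") == "1.5"
```

A type whose str() output cannot be fed back into its constructor is a poor fit for this mechanism and is better served by an explicit field descriptor.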

API