format

The format module defines the core interface for reading and writing tabular data formats. It provides three main classes: ReadFormat, WriteFormat, and Format (which inherits from both).

Overview

The Format system is built around the tokenspec decorator and the AbstractSentence ABC, allowing you to define custom token and sentence types and automatically generate optimized parsers and serializers for them. This makes pyconll flexible enough to work with CoNLL-U or any other tabular format.

The Format class compiles reading and writing logic based on your token schema at initialization time.

Classes

ReadFormat[T, S]

Provides methods for parsing tabular data into Python objects. It supports individual Tokens and Sentences, but most usage operates on collections of Sentences.

WriteFormat[T, S]

Provides methods for serializing Python objects to tabular format. Like ReadFormat, it supports individual Tokens and Sentences, but most usage operates on collections of Sentences.

Format[T, S]

Combines both ReadFormat and WriteFormat functionality. This is the class you’ll typically use. Keeping the read and write sides separate leaves room for future types that support only serialization or only deserialization.
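The split into a read side and a write side can be pictured with a minimal sketch. The class names below (SketchReader, SketchWriter, SketchFormat) are hypothetical stand-ins that use only the standard library; they are not pyconll's classes and only illustrate how one type can inherit both capabilities.

```python
class SketchReader:
    """Hypothetical read side: turns text into Python values."""

    def parse_token(self, buffer: str) -> list[str]:
        # Stand-in logic: split one line into tab-separated columns.
        return buffer.split("\t")


class SketchWriter:
    """Hypothetical write side: turns Python values back into text."""

    def serialize_token(self, columns: list[str]) -> str:
        return "\t".join(columns)


class SketchFormat(SketchReader, SketchWriter):
    """Inherits both sides, so one object can read and write."""


fmt = SketchFormat()
line = "1\tdog\tdog\tNOUN"
round_tripped = fmt.serialize_token(fmt.parse_token(line))
```

Because each side is its own class, code that only needs to parse can depend on the reader alone, which is the kind of flexibility the separation is meant to preserve.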

Example

Creating a custom format for CoNLL-X:

from pyconll.format import Format
from pyconll.schema import tokenspec, nullable, unique_array, field
from pyconll.shared import Sentence
from typing import Optional

@tokenspec
class TokenX:
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: set[str] = field(unique_array(str, "|", "_"))
    head: int
    deprel: str
    phead: Optional[int] = field(nullable(int, "_"))
    pdeprel: Optional[str] = field(nullable(str, "_"))

# Create format instance
conllx = Format(TokenX, Sentence[TokenX], comment_marker="#", delimiter="\t")

# Load data
sentences = conllx.load_from_file("data.conllx")

# Modify data
for sentence in sentences:
    for token in sentence.tokens:
        if token.postag == "NN":
            token.feats.add("Modified")

# Write back (UTF-8, matching the encoding assumed when reading)
with open("output.conllx", "w", encoding="utf-8") as f:
    conllx.write_corpus(sentences, f)

Using the pre-configured CoNLL-U format:

from pyconll.conllu import conllu  # Pre-defined Format instance

# Load
sentences = conllu.load_from_file("train.conllu")

# Stream for large files
for sentence in conllu.iter_from_file("huge.conllu"):
    process(sentence)

Performance Notes

The Format class uses dynamic code generation (via Python’s compile() and exec()) to create optimized parsers and serializers. This compilation happens once at Format initialization, so:

  • Creating a Format instance has some overhead (typically milliseconds).

  • Once created, parsing and serialization are optimized and cached.

  • Reuse Format instances rather than recreating them.

For CoNLL-U specifically, use the pre-configured conllu or fast_conllu instance from pyconll.conllu rather than creating your own.
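To make the idea of compile-once code generation concrete, here is a minimal standard-library sketch. It is not pyconll's generator: build_row_parser and the generated parse_row are hypothetical names, and the sketch assumes every field type is a builtin whose constructor accepts a string.

```python
from typing import Callable


def build_row_parser(fields: dict[str, type], delimiter: str = "\t") -> Callable[[str], dict]:
    """Generate a specialised parser function for a fixed column layout."""
    count = len(fields)
    lines = [
        "def parse_row(line):",
        f"    cols = line.split({delimiter!r})",
        f"    if len(cols) != {count}:",
        f"        raise ValueError('expected {count} columns, got %d' % len(cols))",
        "    return {",
    ]
    for index, (name, kind) in enumerate(fields.items()):
        # Each field becomes one straight-line conversion in the generated source.
        # Assumption: kind is a builtin such as int or str, resolvable by name.
        lines.append(f"        {name!r}: {kind.__name__}(cols[{index}]),")
    lines.append("    }")
    source = "\n".join(lines)

    namespace: dict = {}
    # Compile and execute the source once; the function is then reused per row.
    exec(compile(source, "<generated parser>", "exec"), namespace)
    return namespace["parse_row"]


parse_row = build_row_parser({"id": int, "form": str, "head": int})
row = parse_row("1\tdog\t0")
```

The cost of building the source and compiling it is paid once in build_row_parser, which mirrors why reusing a single Format instance is recommended over recreating it.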

Advanced: Dynamic Field Descriptors

The Format constructor accepts a field_descriptors parameter that allows you to provide field descriptors dynamically instead of as class attributes. This is useful for:

  • Switching between different serialization strategies at runtime

  • Performance tuning (e.g., using sys.intern for string interning)

  • Sharing token classes across multiple formats

from pyconll.format import Format
from pyconll.schema import tokenspec, nullable, via, FieldDescriptor
from pyconll.shared import Sentence
import sys
from typing import Optional

@tokenspec
class Token:
    id: str
    form: str
    lemma: str
    upos: str

# Define descriptors separately
standard_descriptors: dict[str, Optional[FieldDescriptor]] = {
    'id': None,  # None for primitive types (str, int, float)
    'form': nullable(str, "_"),
    'lemma': nullable(str, "_"),
    'upos': nullable(str, "_"),
}

# Compact version using string interning for memory efficiency
compact_descriptors: dict[str, Optional[FieldDescriptor]] = {
    'id': via(sys.intern),
    'form': nullable(via(sys.intern), "_"),
    'lemma': nullable(via(sys.intern), "_"),
    'upos': nullable(via(sys.intern), "_"),
}

# Create two different Format instances with the same Token class
standard_format = Format(Token, Sentence[Token], field_descriptors=standard_descriptors)
compact_format = Format(Token, Sentence[Token], field_descriptors=compact_descriptors)

When both class attributes (using field()) and field_descriptors are provided, field_descriptors takes precedence. The extra_primitives parameter allows you to specify additional types that should be treated as primitives (constructed via their type constructor, serialized via str()). This also takes precedence over anything provided on @tokenspec.

One downside of the field_descriptors parameter is that no type checking is performed, unlike field() on the class definition. Use it with care, and only where the exact schema implementation varies at runtime.
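The memory benefit behind the compact descriptors comes from sys.intern itself, which can be seen in isolation. The sketch below uses only the standard library and does not involve pyconll; the variable names are illustrative.

```python
import sys

# Build two equal strings at runtime, much as a parser does when slicing lines.
first = "".join(["NO", "UN"])
second = "".join(["NO", "UN"])

# Interning maps equal strings to one shared object.
interned_first = sys.intern(first)
interned_second = sys.intern(second)

same_value = first == second
same_object_after_intern = interned_first is interned_second
```

In a corpus where values such as part-of-speech tags repeat many times, sharing one object per distinct value is what reduces memory use.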

API

Format module providing consolidated interfaces for CoNLL data parsing and serialization.

This module defines three classes:

  • ReadFormat: For read-only parsing operations

  • WriteFormat: For write-only serialization operations

  • Format: Combines both reading and writing capabilities

For typical use cases where both read and write operations are needed, use Format. For specialized read-only or write-only scenarios, use ReadFormat or WriteFormat directly.

class pyconll.format.Format(token_schema: type[T], sentence_schema: type[S], comment_marker: str = '#', delimiter: str = '\t', collapse_delimiters: bool = False, field_descriptors: dict[str, FieldDescriptor | None] | None = None, extra_primitives: set[type] | None = None)[source]

A unified interface for both parsing and serializing CoNLL formatted data.

This class combines the functionality of ReadFormat and WriteFormat through multiple inheritance, providing a complete read/write interface for CoNLL data. It maintains consistent formatting options (comment markers, delimiters) across both parsing and serialization operations.

For typical use cases where both reading and writing are needed, this is the recommended class to use.

__init__(token_schema: type[T], sentence_schema: type[S], comment_marker: str = '#', delimiter: str = '\t', collapse_delimiters: bool = False, field_descriptors: dict[str, FieldDescriptor | None] | None = None, extra_primitives: set[type] | None = None) None[source]

Initialize the format handler with both read and write capabilities.

Parameters:
  • token_schema – The Token type to use for parsing and serialization.

  • sentence_schema – The Sentence type to use for parsing and serialization.

  • comment_marker – The character that marks the beginning of comments. Defaults to ‘#’.

  • delimiter – The delimiter between the columns on a token line. Defaults to tab.

  • collapse_delimiters – Whether sequential delimiters are collapsed into a single delimiter (True) or each one is kept, so that adjacent delimiters denote an empty value (False). Defaults to False.

  • field_descriptors – The descriptors for the fields on the schema as a mapping from the field name to the descriptor instance. For primitive types, use None as the descriptor. This takes precedence over anything on the type itself.

  • extra_primitives – The set of types to consider as primitives (default construction and the str() operator are appropriate). This takes precedence over what is given on the tokenspec decorator.
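The two readings of sequential delimiters controlled by collapse_delimiters can be sketched with the standard library. split_columns is a hypothetical helper, not part of pyconll, and how the library treats leading or trailing delimiters is not covered here.

```python
import re


def split_columns(line: str, delimiter: str = "\t", collapse_delimiters: bool = False) -> list[str]:
    """Split a token line into columns under one reading of the flag."""
    if collapse_delimiters:
        # Treat any run of delimiters as a single separator.
        return re.split(f"(?:{re.escape(delimiter)})+", line)
    # Keep every delimiter, so adjacent delimiters produce an empty column.
    return line.split(delimiter)


line = "1\t\tdog"
strict = split_columns(line)
collapsed = split_columns(line, collapse_delimiters=True)
```

With the flag off, the doubled tab yields three columns with an empty one in the middle; with it on, the same line yields two columns.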

class pyconll.format.ReadFormat(token_schema: type[T], sentence_schema: type[S], comment_marker: str = '#', delimiter: str = '\t', collapse_delimiters: bool = False, field_descriptors: dict[str, FieldDescriptor | None] | None = None, extra_primitives: set[type] | None = None)[source]

A read-only interface for parsing CoNLL formatted data.

This class wraps Parser functionality and provides methods to parse CoNLL data from various sources including strings, files, and IO resources. Use this when only parsing operations are needed.

__init__(token_schema: type[T], sentence_schema: type[S], comment_marker: str = '#', delimiter: str = '\t', collapse_delimiters: bool = False, field_descriptors: dict[str, FieldDescriptor | None] | None = None, extra_primitives: set[type] | None = None) None[source]

Initialize the read format handler.

Parameters:
  • token_schema – The Token type to use for parsing.

  • sentence_schema – The Sentence type to use for parsing.

  • comment_marker – The character that marks the beginning of comments. Defaults to ‘#’.

  • delimiter – The delimiter between the columns on a token line. Defaults to tab.

  • collapse_delimiters – Whether sequential delimiters are collapsed into a single delimiter (True) or each one is kept, so that adjacent delimiters denote an empty value (False). Defaults to False.

  • field_descriptors – The descriptors for the fields on the schema as a mapping from the field name to the descriptor instance. For primitive types, use None as the descriptor. This takes precedence over anything on the type itself.

  • extra_primitives – The set of types to consider as primitives (default construction and the str() operator are appropriate). This takes precedence over what is given on the tokenspec decorator.

iter_from_file(filepath: str | bytes | PathLike) Iterator[source]

Iterate over the Sentences contained within the file.

Assumes that the file is UTF-8 encoded.

Parameters:

filepath – The path descriptor of the file to parse.

Returns:

The sentence iterator.

Raises:
  • IOError – If there is an error opening the given filename.

  • ParseError – If there is an error parsing the input.

iter_from_resource(resource: TextIOBase) Iterator[source]

Iterate over the Sentences contained within the resource.

Parameters:

resource – The resource from which to read the strings. The resource must have universal newline reading enabled.

Returns:

An iterator over the parsed Sentences within the resource.

Raises:

ParseError – If there is an error parsing the input.
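The difference between the iter_* and load_* methods is the usual one between a generator and a list. As a rough standard-library sketch of lazy iteration over a text resource (iter_sentence_blocks is a hypothetical helper that only groups lines separated by blank lines; it does not parse tokens and is not pyconll's parser):

```python
import io
from typing import Iterator, TextIO


def iter_sentence_blocks(resource: TextIO) -> Iterator[list[str]]:
    """Yield one blank-line-separated block of lines at a time."""
    block: list[str] = []
    for raw in resource:
        line = raw.rstrip("\n")
        if line:
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        # Final block when the input does not end with a blank line.
        yield block


text = "# sent_id = 1\n1\tdog\n\n# sent_id = 2\n1\tcat\n"
blocks = list(iter_sentence_blocks(io.StringIO(text)))
```

Only one block is held in memory at a time while iterating, which is why the streaming methods suit large files. A file opened with the built-in open() in text mode uses universal newlines by default (newline=None).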

iter_from_string(source: str) Iterator[source]

Iterate over the Sentences contained within the string.

Parameters:

source – The source string to extract the Sentence iterator from.

Returns:

The sentence iterator.

Raises:

ParseError – If there is an error parsing the input.

load_from_file(filepath: str | bytes | PathLike) list[S][source]

Parse a CoNLL file into a list of sentences.

Assumes the file is UTF-8 encoded.

Parameters:

filepath – The path descriptor of the file to parse.

Returns:

A list of Sentence objects parsed from the file.

Raises:
  • IOError – If there is an error opening the given filename.

  • ParseError – If there is an error parsing the input.

load_from_resource(resource: TextIOBase) list[S][source]

Parse a CoNLL resource into a list of sentences.

Parameters:

resource – The resource from which to read the strings. The resource must have universal newline reading enabled.

Returns:

A list of Sentence objects parsed from the resource.

Raises:

ParseError – If there is an error parsing the input.

load_from_string(source: str) list[S][source]

Parse a CoNLL formatted string into a list of sentences.

Parameters:

source – The CoNLL formatted string.

Returns:

A list of Sentence objects parsed from the source.

Raises:

ParseError – If there is an error parsing the input.

parse_sentence(buffer: str) S[source]

Parse a single sentence from the buffer.

If there is more than one sentence in the buffer, an error is raised.

Parameters:

buffer – The string to parse for a single sentence.

Returns:

The single sentence that was parsed out of the string.

parse_token(buffer: str) T[source]

Parse a buffer into a Token.

Parameters:

buffer – The string to parse into a Token. No newline splitting is done on the input.

Returns:

The buffer parsed into the underlying Token type.

class pyconll.format.WriteFormat(token_schema: type[T], comment_marker: str = '#', delimiter: str = '\t', field_descriptors: dict[str, FieldDescriptor | None] | None = None, extra_primitives: set[type] | None = None)[source]

A write-only interface for serializing CoNLL formatted data.

This class wraps Serializer functionality and provides methods to serialize CoNLL data to various output formats including strings and IO resources. Use this when only serialization operations are needed.

__init__(token_schema: type[T], comment_marker: str = '#', delimiter: str = '\t', field_descriptors: dict[str, FieldDescriptor | None] | None = None, extra_primitives: set[type] | None = None) None[source]

Initialize the write format handler.

Parameters:
  • token_schema – The Token type to use for serialization.

  • comment_marker – The prefix to use for comments or metadata. Defaults to ‘#’.

  • delimiter – The delimiter between Token columns. Defaults to tab.

  • field_descriptors – The descriptors for the fields on the schema as a mapping from the field name to the descriptor instance. For primitive types, use None as the descriptor. This takes precedence over anything on the type itself.

  • extra_primitives – The set of types to consider as primitives (default construction and the str() operator are appropriate). This takes precedence over what is given on the tokenspec decorator.

serialize_sentence(sentence: S) str[source]

Serialize a Sentence to a string representation.

Parameters:

sentence – The sentence to serialize.

Returns:

The serialized representation of the sentence.

serialize_token(token: T) str[source]

Serialize a token to a string representation.

Parameters:

token – The token to serialize.

Returns:

The serialized representation of the token.

write_corpus(corpus: Iterable, writable: IO[str]) None[source]

Write out the entire corpus to the IO buffer.

Parameters:
  • corpus – The sequence of sentences to write out.

  • writable – The IO buffer to write the sentences to.

Raises:

FormatError – If a Token could not be serialized.

write_sentence(sentence: S, writable: IO[str]) None[source]

Write an individual sentence to an IO buffer.

Note that the buffer always has a newline added at the end.

Parameters:
  • sentence – The sentence to write to the buffer.

  • writable – The buffer to do the writing to.

Raises:

FormatError – If a Token could not be serialized.
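Because the write methods accept any IO[str], an in-memory io.StringIO can stand in for a file, which is convenient in tests. The sketch below shows that pattern with a hypothetical write_blocks helper built on the standard library; the blank line it writes after each block is an assumption about a typical CoNLL layout, not a statement of pyconll's exact output.

```python
import io
from typing import IO, Iterable


def write_blocks(blocks: Iterable[list[str]], writable: IO[str]) -> None:
    """Write each block of lines, followed by a separating blank line."""
    for block in blocks:
        for line in block:
            writable.write(line + "\n")
        # Assumed layout: one empty line terminates each sentence block.
        writable.write("\n")


buffer = io.StringIO()
write_blocks([["# sent_id = 1", "1\tdog"], ["1\tcat"]], buffer)
written = buffer.getvalue()
```

The same call works unchanged with a file object returned by open(path, "w", encoding="utf-8").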