schema

The schema module defines the @tokenspec decorator and field descriptors for defining custom token types. This is the foundation of pyconll’s flexible format system.

@tokenspec Decorator

The @tokenspec decorator is used to mark a class as a token specification that can be used with the Format system. To create a custom token schema:

  1. Define a class with typed fields using Python type hints.

  2. Decorate the class with @tokenspec.

  3. Optionally use field descriptors for more complex serialization definitions.

  4. Add any necessary extra behavior to your class that can use the deserialized values.

The field order in your class definition determines the column order in the serialized output.

Basic Example

from pyconll.schema import tokenspec

@tokenspec
class SimpleToken:
    id: int
    form: str
    lemma: str
    pos: str

# This defines a 4-column format where columns are parsed as:
# int, str, str, str
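
To make the column-order rule concrete, here is a plain-Python sketch that does not use pyconll at all. The helper `parse_line` and the tab delimiter are assumptions made for illustration; the sketch only shows how annotations, read in definition order, can drive column parsing.

```python
from typing import get_type_hints


class SimpleToken:
    id: int
    form: str
    lemma: str
    pos: str


def parse_line(cls, line, delimiter="\t"):
    """Hypothetical helper: map columns to annotated fields in definition order."""
    hints = get_type_hints(cls)  # keeps the order of the class body
    columns = line.split(delimiter)
    if len(columns) != len(hints):
        raise ValueError(f"expected {len(hints)} columns, got {len(columns)}")
    token = cls()
    for (name, typ), text in zip(hints.items(), columns):
        setattr(token, name, typ(text))  # int("1"), str("dogs"), ...
    return token


token = parse_line(SimpleToken, "1\tdogs\tdog\tNOUN")
print(token.id, token.form, token.lemma, token.pos)  # 1 dogs dog NOUN
```

Reordering the annotations in the class body would reorder the columns this helper expects, which is the behavior described above.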

Supported Field Types

Basic Types

These types can be used directly without field descriptors:

  • str - String column

  • int - Integer column

  • float - Float column

Field Descriptors

For more complex column types, use field descriptors. The following points may help in understanding the API.

  • The empty_marker parameter specifies the text that maps exclusively to the empty value of the field’s native type: None for a nullable, an empty set for a unique_array, and so on.

  • Each field descriptor also takes a nested mapper. This allows for composition of multiple descriptors. The mapper can be another descriptor, or one of the supported primitive types.
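
Both points can be illustrated without pyconll. The sketch below models each descriptor as a plain deserialization function; `make_array` and `make_nullable` are hypothetical stand-ins for illustration, not the library's implementation, and only the deserialization direction is shown.

```python
from typing import Callable, Optional


def make_array(el_mapper: Callable[[str], object], delimiter: str, empty_marker: str):
    """Hypothetical stand-in for an array descriptor (deserialization only)."""
    def deserialize(text: str) -> list:
        if text == empty_marker:
            return []  # the empty marker maps to the empty list
        return [el_mapper(part) for part in text.split(delimiter)]
    return deserialize


def make_nullable(mapper: Callable[[str], object], empty_marker: str):
    """Hypothetical stand-in for a nullable descriptor (deserialization only)."""
    def deserialize(text: str) -> Optional[object]:
        return None if text == empty_marker else mapper(text)
    return deserialize


# Compose: an optional list of integers, where "_" means None and "" means [].
parse_ids = make_nullable(make_array(int, ",", ""), "_")

print(parse_ids("_"))      # None
print(parse_ids(""))       # []
print(parse_ids("1,2,3"))  # [1, 2, 3]
```

Each layer checks its own empty marker first and hands everything else to the nested mapper, which is why the inner and outer markers should differ.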

nullable

Represents optional values with an empty marker.

from typing import Optional

from pyconll.schema import tokenspec, nullable, field

@tokenspec
class Token:
    id: str
    lemma: Optional[str] = field(nullable(str, "_"))
    # "_" represents None/null, otherwise parsed as string

array

Represents lists with a delimiter.

from pyconll.schema import tokenspec, array, field

@tokenspec
class Token:
    features: list[str] = field(array(str, "|", "_"))
    # Values separated by "|", "_" for empty list

unique_array

Represents sets (unordered, unique values) with optional ordering for serialization.

from pyconll.schema import tokenspec, unique_array, field

@tokenspec
class Token:
    tags: set[str] = field(unique_array(str, "|", "_", str.lower))
    # Set of strings, serialized in order by lowercase value
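
Because a set has no inherent order, an ordering key is what makes the serialized text reproducible. The following is a minimal plain-Python sketch of that idea; `serialize_tags` is a hypothetical helper, not part of pyconll.

```python
def serialize_tags(tags, delimiter="|", empty_marker="_", ordering_key=str.lower):
    """Hypothetical helper: write a set in a stable order chosen by ordering_key."""
    if not tags:
        return empty_marker
    return delimiter.join(sorted(tags, key=ordering_key))


print(serialize_tags({"b", "A", "C"}))  # A|b|C
print(serialize_tags(set()))            # _
```

With a plain `sorted` and no key, the uppercase entries would come first (`A|C|b`); the lowercase key gives a case-insensitive order instead.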

fixed_array

Represents tuples (fixed-length sequences).

from pyconll.schema import tokenspec, fixed_array, field

@tokenspec
class Token:
    tup: tuple[str, ...] = field(fixed_array(str, "|", "_"))

mapping

Represents dictionaries with key-value pairs.

from pyconll.schema import tokenspec, mapping, unique_array, field

@tokenspec
class Token:
    # CoNLL-U feats: Gender=Fem|Number=Sing
    feats: dict[str, set[str]] = field(
        mapping(
            str,                           # Key mapper
            unique_array(str, ","),        # Value mapper (set of strings)
            "|",                           # Pair delimiter
            "=",                           # Key-value delimiter
            "_",                           # Empty marker
            lambda p: p[0].lower()         # Ordering key
        )
    )
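
As a rough picture of what these arguments mean, the sketch below parses and re-serializes a feats string in plain Python. `parse_feats` and `serialize_feats` are hypothetical helpers, and sorting the values inside each set is an assumption made here only to keep the output deterministic.

```python
def parse_feats(text, pair_delimiter="|", kv_delimiter="=", empty_marker="_"):
    """Hypothetical helper mirroring the descriptor arguments above."""
    if text == empty_marker:
        return {}
    feats = {}
    for pair in text.split(pair_delimiter):
        key, _, value = pair.partition(kv_delimiter)
        feats[key] = set(value.split(","))  # value mapper: comma-separated set
    return feats


def serialize_feats(feats, pair_delimiter="|", kv_delimiter="=", empty_marker="_"):
    """Hypothetical inverse; pairs are ordered by lowercase key."""
    if not feats:
        return empty_marker
    pairs = sorted(feats.items(), key=lambda p: p[0].lower())
    return pair_delimiter.join(
        key + kv_delimiter + ",".join(sorted(values)) for key, values in pairs
    )


feats = parse_feats("Number=Sing|Gender=Fem,Masc")
print(feats["Gender"] == {"Fem", "Masc"})  # True
print(serialize_feats(feats))              # Gender=Fem,Masc|Number=Sing
```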

mapping_ext

Represents dictionaries that can have singleton keys (keys without values). This is useful for formats like CoNLL-U’s MISC field where you can have both Key=Value and standalone Key entries.

from typing import Optional

from pyconll.schema import tokenspec, mapping_ext, unique_array, field

@tokenspec
class Token:
    # Misc field: SpaceAfter=No|Translit=example|SpellId
    # SpellId is a singleton (no value)
    misc: dict[str, Optional[set[str]]] = field(
        mapping_ext(
            str,                           # Key mapper
            unique_array(str, ","),        # Value mapper (set of strings)
            None,                          # Singleton marker (value when no = present)
            "|",                           # Pair delimiter
            "=",                           # Key-value delimiter
            "_",                           # Empty marker
            lambda p: p[0].lower()         # Ordering key
        )
    )
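
The singleton behavior can be sketched in plain Python as follows. `parse_misc` is a hypothetical helper written for illustration; it is not pyconll's generated parser.

```python
def parse_misc(text, singleton=None, pair_delimiter="|", kv_delimiter="=", empty_marker="_"):
    """Hypothetical helper: keys without a key-value delimiter get the singleton marker."""
    if text == empty_marker:
        return {}
    misc = {}
    for pair in text.split(pair_delimiter):
        if kv_delimiter in pair:
            key, value = pair.split(kv_delimiter, 1)
            misc[key] = set(value.split(","))
        else:
            misc[pair] = singleton  # standalone key such as SpellId
    return misc


misc = parse_misc("SpaceAfter=No|Translit=example|SpellId")
print(misc["SpaceAfter"])  # {'No'}
print(misc["SpellId"])     # None
```

Using None as the singleton marker keeps "key present without a value" distinguishable from "key present with a value", which is why the field is annotated with an Optional value type.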

varcols

Represents a variable number of columns (including zero). It does not have to be the last field, but only one field in a given class can use it.

from pyconll.schema import tokenspec, varcols, field

@tokenspec
class Token:
    id: int
    form: str
    extra: list[str] = field(varcols(str))
    # All remaining columns parsed as strings
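
One way to picture why only one such field is allowed: with a single variable-width field, its extent follows from the number of fixed columns before and after it. The helper `split_columns` below is a hypothetical plain-Python sketch of that bookkeeping, not pyconll's implementation.

```python
def split_columns(columns, n_before, n_after):
    """Hypothetical helper: carve a row into fixed leading columns, a
    variable-width middle, and fixed trailing columns."""
    if len(columns) < n_before + n_after:
        raise ValueError("not enough columns for the fixed fields")
    end = len(columns) - n_after
    return columns[:n_before], columns[n_before:end], columns[end:]


row = "1\tdogs\tx\ty\tz".split("\t")
head, extra, tail = split_columns(row, n_before=2, n_after=0)
print(head, extra, tail)  # ['1', 'dogs'] ['x', 'y', 'z'] []

# The variable-width part may also be empty or sit in the middle.
print(split_columns(["1", "dogs", "NOUN"], n_before=2, n_after=1))
# (['1', 'dogs'], [], ['NOUN'])
```

With two variable-width fields the split point between them would be ambiguous, which is presumably the reason for the one-per-class restriction.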

via

Custom (de)serialization functions.

from pyconll.schema import tokenspec, via, field
from datetime import datetime

def parse_date(s: str) -> datetime:
    return datetime.fromisoformat(s)

def serialize_date(d: datetime) -> str:
    return d.isoformat()

@tokenspec
class Token:
    timestamp: datetime = field(via(parse_date, serialize_date))
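
Before wiring a pair of callables into via, it can help to check how they round-trip on their own. The snippet below uses only the standard library and the same two functions as above.

```python
from datetime import datetime


def parse_date(s: str) -> datetime:
    return datetime.fromisoformat(s)


def serialize_date(d: datetime) -> str:
    return d.isoformat()


text = "2024-01-15T10:30:00"
stamp = parse_date(text)
print(stamp.year, stamp.hour)         # 2024 10
print(serialize_date(stamp) == text)  # True

# A date-only input is accepted but written back in the longer form,
# so the round trip is not always the identity.
print(serialize_date(parse_date("2024-01-15")))  # 2024-01-15T00:00:00
```

The last line shows a mild asymmetry: the text read in and the text written out can differ even when both callables are individually correct.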

Complete Example: CoNLL-U

The CoNLL-U Token schema demonstrates many of these features in concert:

from pyconll.schema import (
    tokenspec, nullable, mapping, mapping_ext, unique_array,
    fixed_array, field
)
from typing import Optional

@tokenspec
class Token:
    id: str
    form: Optional[str] = field(nullable(str, "_"))
    lemma: Optional[str] = field(nullable(str, "_"))
    upos: Optional[str] = field(nullable(str, "_"))
    xpos: Optional[str] = field(nullable(str, "_"))

    # Features: Gender=Fem|Number=Sing
    feats: dict[str, set[str]] = field(
        mapping(
            str,
            unique_array(str, ",", "", str.lower),
            "|",
            "=",
            "_",
            lambda p: p[0].lower()
        )
    )

    head: Optional[str] = field(nullable(str, "_"))
    deprel: Optional[str] = field(nullable(str, "_"))

    # Enhanced dependencies: 4:nsubj|8:nmod:tmod
    deps: dict[str, tuple[str, ...]] = field(
        mapping(
            str,
            fixed_array(str, ":"),
            "|",
            ":",
            "_",
            lambda p: p[0]
        )
    )

    # Misc: SpaceAfter=No|Translit=example or singleton keys like SpellId
    misc: dict[str, Optional[set[str]]] = field(
        mapping_ext(
            str,
            unique_array(str, ",", "", str.lower),
            None,  # Singleton marker for keys without values
            "|",
            "=",
            "_",
            lambda p: p[0].lower()
        )
    )

Token Lifecycle Hooks

The @tokenspec decorator supports a __post_init__ method that runs custom logic after token initialization:

from pyconll.schema import tokenspec

@tokenspec
class Token:
    id: int
    form: str
    processed: bool = False

    def __post_init__(self) -> None:
        # This runs after all fields are set during parsing
        self.processed = True

# After parsing, token.processed will be True

This is useful for:

  • Computing derived fields

  • Validation

  • Normalization

  • Special handling (e.g., CoNLL-U’s form/lemma underscore logic)
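
The hook has the same name as the `__post_init__` hook in the standard library's dataclasses. As an analogy only (this sketch uses `dataclasses`, not `@tokenspec`, and the `normalized` field is made up for illustration), validation and a derived field might look like this:

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int
    form: str
    lemma: str
    normalized: str = ""

    def __post_init__(self) -> None:
        # Validation: reject non-positive ids.
        if self.id < 1:
            raise ValueError(f"id must be positive, got {self.id}")
        # Derived field: a case-folded copy of the form.
        self.normalized = self.form.casefold()


token = Token(1, "Dogs", "dog")
print(token.normalized)  # dogs
```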

Advanced: Dynamic Field Descriptors

While the typical approach is to use field() as class attributes, you can also provide field descriptors dynamically via the field_descriptors parameter in the Format constructor. This is useful for:

  • Switching between different descriptor implementations at runtime

  • Sharing token classes across different formats

  • Advanced performance tuning

from typing import Optional

from pyconll.format import Format
from pyconll.schema import tokenspec, nullable, FieldDescriptor
from pyconll.shared import Sentence

@tokenspec
class Token:
    id: str
    form: str
    lemma: str
    upos: str

# Define descriptors separately
field_descriptors: dict[str, Optional[FieldDescriptor]] = {
    'id': None,  # None for primitive types (int, float, str)
    'form': nullable(str, "_"),
    'lemma': nullable(str, "_"),
    'upos': nullable(str, "_"),
}

# Pass to Format constructor
my_format = Format(Token, Sentence[Token], field_descriptors=field_descriptors)

When both class attributes and field_descriptors are provided, field_descriptors takes precedence. This allows you to override the class-level descriptors at Format creation time.

AbstractSentence Interface

AbstractSentence is an abstract interface that defines how Sentence implementations work with the Format system. Any Sentence type used with Format must implement this interface.

Required Methods and Properties

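from collections import OrderedDict
from collections.abc import MutableMapping, MutableSequence
from typing import Optional

from pyconll.schema import AbstractSentence

# MyToken is assumed to be a @tokenspec-decorated class defined elsewhere.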

class MySentence(AbstractSentence[MyToken]):
    def __init__(self) -> None:
        # Must have a no-argument constructor
        self._meta: OrderedDict[str, Optional[str]] = OrderedDict()
        self._tokens: list[MyToken] = []

    @property
    def meta(self) -> MutableMapping[str, Optional[str]]:
        return self._meta

    @meta.setter
    def meta(self, value: MutableMapping[str, Optional[str]]) -> None:
        self._meta = value

    @property
    def tokens(self) -> MutableSequence[MyToken]:
        return self._tokens

    @tokens.setter
    def tokens(self, value: MutableSequence[MyToken]) -> None:
        self._tokens = value

    def __accept_meta__(self, key: str, value: Optional[str]) -> None:
        # Called during parsing for each metadata pair
        self.meta[key] = value

    def __accept_token__(self, t: MyToken) -> None:
        # Called during parsing for each token
        self.tokens.append(t)

    def __finalize__(self) -> None:
        # Called when sentence parsing is complete
        pass

The lifecycle methods (__accept_meta__, __accept_token__, __finalize__) allow custom sentence implementations to process data incrementally during parsing, enabling streaming scenarios and custom initialization logic.
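
To show the order in which these hooks fire, here is a self-contained sketch with a toy driver. `SimpleSentence`, `parse_block`, and the `# key = value` metadata syntax are assumptions made for illustration; the real parser lives inside pyconll and is not reproduced here.

```python
from typing import Optional


class SimpleSentence:
    """Hypothetical sentence that records what it receives."""

    def __init__(self) -> None:
        self.meta: dict[str, Optional[str]] = {}
        self.tokens: list[str] = []
        self.finalized = False

    def __accept_meta__(self, key: str, value: Optional[str]) -> None:
        self.meta[key] = value

    def __accept_token__(self, t: str) -> None:
        self.tokens.append(t)

    def __finalize__(self) -> None:
        self.finalized = True


def parse_block(lines: list[str]) -> SimpleSentence:
    """Hypothetical driver: '# key = value' lines are metadata, the rest are tokens."""
    sentence = SimpleSentence()  # relies on the no-argument constructor
    for line in lines:
        if line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            sentence.__accept_meta__(key.strip(), value.strip() if sep else None)
        else:
            sentence.__accept_token__(line)
    sentence.__finalize__()
    return sentence


sentence = parse_block(["# sent_id = 1", "# newpar", "1\tdogs", "2\tbark"])
print(sentence.meta)         # {'sent_id': '1', 'newpar': None}
print(len(sentence.tokens))  # 2
```

Because the sentence sees each item as it arrives, an implementation could, for example, keep running statistics or discard tokens it does not need instead of storing everything.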

A functional equivalent of MySentence is provided as pyconll.shared.Sentence[T], since this is likely the most common sentence implementation needed for CoNLL-based parsing.

API

Module containing the concepts for defining the schema of a CoNLL format: the structural Token parsing schema components, the descriptor building blocks, and the Sentence interface requirements.

class pyconll.schema.AbstractSentence

The interface that all Sentence implementations need to implement to work with the (de)serialization libraries. This defines the operations on the sentence for how to handle new metadata and tokens while parsing a text stream along with how to access the values themselves once in memory.

abstractmethod __accept_meta__(key: str, value: str | None) -> None

The lifecycle operation during parsing where the next metadata pair is received by the Sentence object.

Parameters:
  • key – The key of the metadata.

  • value – The value of the metadata or None if the metadata is a singleton.

abstractmethod __accept_token__(t: T) -> None

The lifecycle operation during parsing where the next parsed token is received by the Sentence object.

Parameters:

t – The next parsed token object to receive on this Sentence.

abstractmethod __finalize__() -> None

Called once there is no more information to parse for the sentence.

abstract property meta: MutableMapping[str, str | None]

The read view of the meta property.

Returns:

The mapping that represents the sentence’s metadata.

abstract property tokens: MutableSequence

The read view of the tokens property.

Returns:

The sequence that represents the sentence’s tokens.

class pyconll.schema.BaseFieldDescriptor

A FieldDescriptor to use for most scenarios where the descriptor has to generate code or an actual method.

deserialize_codegen(namespace: dict[str, Any]) -> str

Adds the deserialization method to the given namespace and returns the method name.

Parameters:

namespace – The codegen namespace to define the method in.

Returns:

The name of the method that was generated.

serialize_codegen(namespace: dict[str, Any]) -> str

Adds the serialization method to the given namespace and returns the method name.

Parameters:

namespace – The codegen namespace to define the method in.

Returns:

The name of the method that was generated.

class pyconll.schema.FieldDescriptor

Base class to represent the different types of descriptors that can be defined for the Token fields. Each descriptor needs to be able to dynamically generate the relevant python code for (de)serialization.

abstractmethod deserialize_codegen(namespace: dict[str, Any]) -> str

Adds the deserialization method to the given namespace and returns the method name.

Parameters:

namespace – The codegen namespace to define the method in.

Returns:

The name of the method that was generated.

abstractmethod serialize_codegen(namespace: dict[str, Any]) -> str

Adds the serialization method to the given namespace and returns the method name.

Parameters:

namespace – The codegen namespace to define the method in.

Returns:

The name of the method that was generated.

pyconll.schema.array(el_mapper: type[T] | FieldDescriptor, delimiter: str, empty_marker: str = '') -> _ArrayDescriptor

Describe a serialization schema for a list.

Parameters:
  • el_mapper – The nested mapper to describe the serialization scheme of array elements.

  • delimiter – The string which separates array elements in the serialized representation.

  • empty_marker – The string representation which maps to an empty list.

Returns:

The FieldDescriptor to use for compiling the structural Token parser.

pyconll.schema.field(desc: FieldDescriptor) -> T

Method to help with type-checking on structural Token definitions.

Use on the outer-most level of each Token’s field descriptor to unwrap the appropriate type. The only application of this method is on structural Token definitions.

Parameters:

desc – The FieldDescriptor whose type is being unwrapped.

Returns:

The descriptor originally provided by force cast to the type it describes.

pyconll.schema.fixed_array(el_mapper: type[T] | FieldDescriptor, delimiter: str, empty_marker: str = '') -> _FixedArrayDescriptor

Describe a serialization schema for a tuple.

Parameters:
  • el_mapper – The nested mapper to describe the serialization scheme of the tuple elements.

  • delimiter – The string which separates tuple elements in the serialized representation.

  • empty_marker – The string representation which maps to an empty tuple.

Returns:

The FieldDescriptor to use for compiling the structural Token parser.

pyconll.schema.mapping(kmapper: type[K] | FieldDescriptor, vmapper: type[V] | FieldDescriptor, pair_delimiter: str, kv_delimiter: str, empty_marker: str = '', ordering_key: Callable[[tuple[K, V]], SupportsRichComparison] | None = None) -> _MappingDescriptor

Describe a serialization scheme for a dictionary.

Parameters:
  • kmapper – The nested mapper to describe the serialization scheme for keys in the map.

  • vmapper – The nested mapper to describe the serialization scheme for values in the map.

  • pair_delimiter – The string to delimit key-value pairs in the serialized representation.

  • kv_delimiter – The string to delimit the key and value within a single pair.

  • empty_marker – The string representation which maps to an empty dict.

  • ordering_key – If provided, describes the order in which the dict entries are serialized.

Returns:

The FieldDescriptor to use for compiling the structural Token parser.

pyconll.schema.mapping_ext(kmapper: type[K] | FieldDescriptor, vmapper: type[V] | FieldDescriptor, singleton: S, pair_delimiter: str, kv_delimiter: str, empty_marker: str = '', ordering_key: Callable[[tuple[K, V]], SupportsRichComparison] | None = None) -> _MappingExtDescriptor

Describe a serialization scheme for a dictionary with various extensions over the default.

Parameters:
  • kmapper – The nested mapper to describe the serialization scheme for keys in the map.

  • vmapper – The nested mapper to describe the serialization scheme for values in the map.

  • singleton – The value to use as the singleton marker.

  • pair_delimiter – The string to delimit key-value pairs in the serialized representation.

  • kv_delimiter – The string to delimit the key and value within a single pair.

  • empty_marker – The string representation which maps to an empty dict.

  • ordering_key – If provided, describes the order in which the dict entries are serialized.

Returns:

The FieldDescriptor to use for compiling the structural Token parser.

pyconll.schema.nullable(mapper: type[T] | FieldDescriptor, empty_marker: str = '') -> _NullableDescriptor

Describe a serialization schema for an optional value.

Parameters:
  • mapper – The nested mapper to describe the serialization scheme of the underlying type.

  • empty_marker – The string value which represents None.

Returns:

The FieldDescriptor to use for compiling the structural Token parser.

pyconll.schema.tokenspec(cls: T, /, *, slots: bool = False, gen_repr: bool = False, extra_primitives: Iterable[type] | None = None) -> T
pyconll.schema.tokenspec(cls: None = None, /, *, slots: bool = False, gen_repr: bool = False, extra_primitives: Iterable[type] | None = None) -> Callable[[T], T]

Mark a class as a token specification for use with the Format system.

Parameters:
  • cls – The class to decorate as a token specification.

  • slots – Flag if the generated class should use slots for member storage.

  • gen_repr – Flag if a repr method should be generated.

  • extra_primitives – Types that should be considered as “primitives” in addition to int, float, and str. This means that during compilation of the parsing and serialization code, values of these types are constructed in memory directly by the type constructor and are serialized with str.

Returns:

The decorated class instance that can be used with Format operations.
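
The extra_primitives description above implies a simple contract: the type must be constructible from a string, and str must produce text the constructor accepts again. A minimal standard-library sketch of that contract follows; `deserialize_primitive` and `serialize_primitive` are hypothetical helpers, and `Fraction` is just an example of a type that satisfies it.

```python
from fractions import Fraction


def deserialize_primitive(typ, text):
    """Hypothetical helper: build the value directly with the type constructor."""
    return typ(text)


def serialize_primitive(value):
    """Hypothetical helper: write the value back with str()."""
    return str(value)


weight = deserialize_primitive(Fraction, "3/4")
print(weight * 2)                   # 3/2
print(serialize_primitive(weight))  # 3/4
```

A type whose str output cannot be fed back into its constructor would not round-trip under this scheme and is probably better handled with via.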

pyconll.schema.unique_array(el_mapper: type[T] | FieldDescriptor, delimiter: str, empty_marker: str = '', ordering_key: Callable[[T], Any] | None = None, single_escape_hatch: bool = False) -> _UniqueArrayDescriptor

Describe a serialization schema for a set.

Parameters:
  • el_mapper – The nested mapper to describe the serialization scheme of the set elements.

  • delimiter – The string which separates set elements in the serialized representation.

  • empty_marker – The string representation which maps to an empty set.

  • ordering_key – If provided, describes the order in which the set entries are serialized.

  • single_escape_hatch – If set, a value consisting solely of the delimiter is interpreted as a single-item set, rather than as two empty strings that collapse into one element.

Returns:

The FieldDescriptor to use for compiling the structural Token parser.

pyconll.schema.varcols(mapper: type[T] | FieldDescriptor) -> _VarColsDescriptor

Describe an entry that has a variable number of fields.

Parameters:

mapper – The nested mapper to describe the serialization schema for the targeted columns.

Returns:

The FieldDescriptor to use for compiling the structural Token parser.

pyconll.schema.via(deserialize: Callable[[str], T], serialize: Callable[[T], str] = str) -> _ViaDescriptor

Describe a user-provided serialization scheme which uses arbitrary callables.

Other descriptors create symmetric (de)serialization schemes, while this one allows for asymmetric definitions (that is, reading in one thing and writing out another). This is a possible feature that is not fully fleshed out, but could be better explored in the future.

Parameters:
  • deserialize – The callable to deserialize the string into the in-memory representation.

  • serialize – The callable to serialize the in-memory representation to a string.

Returns:

The FieldDescriptor to use for compiling the structural Token parser.