schema
The schema module defines the @tokenspec decorator and field descriptors for defining custom token types. This is the foundation of pyconll’s flexible format system.
@tokenspec Decorator
The @tokenspec decorator is used to mark a class as a token specification that can be used with the Format system. To create a custom token schema:
Define a class with typed fields using Python type hints.
Decorate the class with @tokenspec.
Optionally use field descriptors for more complex serialization definitions.
Add any necessary extra behavior to your class that can use the deserialized values.
The field order in your class definition determines the column order in the serialized output.
Basic Example
from pyconll.schema import tokenspec

@tokenspec
class SimpleToken:
    id: int
    form: str
    lemma: str
    pos: str

# This defines a 4-column format where columns are parsed as:
# int, str, str, str
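The column-order rule can be illustrated without pyconll itself. The following is a minimal stdlib-only sketch, not the library's implementation: it assumes tab-separated columns, and the helper name parse_line is hypothetical. It reads a plain class's annotations in definition order and converts each column with the annotated type.

```python
from typing import get_type_hints


class SimpleToken:
    id: int
    form: str
    lemma: str
    pos: str


def parse_line(cls, line, delimiter="\t"):
    """Hypothetical helper: map columns to fields in annotation order."""
    hints = get_type_hints(cls)  # keeps the class definition order
    columns = line.split(delimiter)
    if len(columns) != len(hints):
        raise ValueError(f"expected {len(hints)} columns, got {len(columns)}")
    token = cls()
    for (name, typ), text in zip(hints.items(), columns):
        setattr(token, name, typ(text))  # e.g. int("1"), str("dogs")
    return token


token = parse_line(SimpleToken, "1\tdogs\tdog\tNOUN")
print(token.id, token.form, token.lemma, token.pos)
```

Reordering the annotations in the class body would change which column feeds which field, which is why the definition order matters.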
Supported Field Types
Basic Types
These types can be used directly without field descriptors:
str - String column
int - Integer column
float - Float column
Field Descriptors
For more complex column types, use field descriptors. Two points help in understanding the API:
The empty_marker parameter specifies the text that maps exclusively to the empty value of the field’s native type. For a nullable that is None, for a unique_array it is an empty set, and so on.
Each field descriptor also takes a nested mapper, which allows multiple descriptors to be composed. The mapper can be another descriptor or one of the supported primitive types.
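The two ideas above, an empty marker and a nested mapper, can be sketched with plain functions. This is an illustration of the composition pattern only, not pyconll's implementation; nullable_parser and array_parser are hypothetical names, and the "?" marker is chosen arbitrarily for the example.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def nullable_parser(inner: Callable[[str], T], empty_marker: str) -> Callable[[str], Optional[T]]:
    """Hypothetical: the empty marker maps to None, anything else goes to `inner`."""
    def parse(text: str) -> Optional[T]:
        return None if text == empty_marker else inner(text)
    return parse


def array_parser(inner: Callable[[str], T], delimiter: str, empty_marker: str) -> Callable[[str], List[T]]:
    """Hypothetical: the empty marker maps to [], otherwise split and map each part."""
    def parse(text: str) -> List[T]:
        if text == empty_marker:
            return []
        return [inner(part) for part in text.split(delimiter)]
    return parse


# Composition: a list whose elements may themselves be missing.
parse_scores = array_parser(nullable_parser(int, "?"), "|", "_")

print(parse_scores("_"))      # []
print(parse_scores("1|?|3"))  # [1, None, 3]
```

Each layer only knows its own empty marker and delegates everything else to the mapper nested inside it.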
nullable
Represents optional values with an empty marker.
from typing import Optional

from pyconll.schema import tokenspec, nullable, field

@tokenspec
class Token:
    id: str
    lemma: Optional[str] = field(nullable(str, "_"))
    # "_" represents None/null, otherwise parsed as string
array
Represents lists with a delimiter.
from pyconll.schema import tokenspec, array, field

@tokenspec
class Token:
    features: list[str] = field(array(str, "|", "_"))
    # Values separated by "|", "_" for empty list
unique_array
Represents sets (unordered, unique values) with optional ordering for serialization.
from pyconll.schema import tokenspec, unique_array, field

@tokenspec
class Token:
    tags: set[str] = field(unique_array(str, "|", "_", str.lower))
    # Set of strings, serialized in order by lowercase value
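Because a set has no inherent order, an ordering key is what makes the serialized text reproducible. The sketch below shows the idea with a hypothetical serialize_set helper; it is not pyconll's code, and falling back to the default sort order when no key is given is an assumption of this sketch.

```python
def serialize_set(values, delimiter, empty_marker, ordering_key=None):
    """Hypothetical helper: write a set as delimited text in a stable order."""
    if not values:
        return empty_marker
    return delimiter.join(sorted(values, key=ordering_key))


tags = {"b", "A", "C"}
print(serialize_set(tags, "|", "_", str.lower))  # A|b|C
print(serialize_set(tags, "|", "_"))             # A|C|b
print(serialize_set(set(), "|", "_"))            # _
```

With str.lower as the key the output is case-insensitive alphabetical; without it, uppercase letters sort before lowercase ones.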
fixed_array
Represents tuples (fixed-length sequences).
from pyconll.schema import tokenspec, fixed_array, field

@tokenspec
class Token:
    tup: tuple[str, ...] = field(fixed_array(str, "|", "_"))
mapping
Represents dictionaries with key-value pairs.
from pyconll.schema import tokenspec, mapping, unique_array, field

@tokenspec
class Token:
    # CoNLL-U feats: Gender=Fem|Number=Sing
    feats: dict[str, set[str]] = field(
        mapping(
            str,                     # Key mapper
            unique_array(str, ","),  # Value mapper (set of strings)
            "|",                     # Pair delimiter
            "=",                     # Key-value delimiter
            "_",                     # Empty marker
            lambda p: p[0].lower()   # Ordering key
        )
    )
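To show what the delimiters and the ordering key in this declaration amount to, here is a stdlib-only sketch of the round trip for a feats-style column. The functions parse_feats and serialize_feats are hypothetical stand-ins, not pyconll APIs, and sorting the values alphabetically on output is an assumption of the sketch.

```python
def parse_feats(text, pair_delim="|", kv_delim="=", value_delim=",", empty_marker="_"):
    """Hypothetical: 'Gender=Fem|Number=Sing' -> {'Gender': {'Fem'}, 'Number': {'Sing'}}."""
    if text == empty_marker:
        return {}
    feats = {}
    for pair in text.split(pair_delim):
        key, value = pair.split(kv_delim, 1)
        feats[key] = set(value.split(value_delim))
    return feats


def serialize_feats(feats, pair_delim="|", kv_delim="=", value_delim=",", empty_marker="_"):
    """Hypothetical inverse; pairs are ordered by lowercased key, as in the ordering key above."""
    if not feats:
        return empty_marker
    pairs = sorted(feats.items(), key=lambda p: p[0].lower())
    return pair_delim.join(
        f"{key}{kv_delim}{value_delim.join(sorted(values))}" for key, values in pairs
    )


feats = parse_feats("Number=Sing|Gender=Fem,Masc")
print(feats == {"Number": {"Sing"}, "Gender": {"Fem", "Masc"}})  # True
print(serialize_feats(feats))  # Gender=Fem,Masc|Number=Sing
```

Note that the serialized order depends only on the ordering key, not on the order in which the pairs appeared in the input.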
mapping_ext
Represents dictionaries that can have singleton keys (keys without values). This is useful for formats like CoNLL-U’s MISC field where you can have both Key=Value and standalone Key entries.
from typing import Optional

from pyconll.schema import tokenspec, mapping_ext, unique_array, field

@tokenspec
class Token:
    # Misc field: SpaceAfter=No|Translit=example|SpellId
    # SpellId is a singleton (no value)
    misc: dict[str, Optional[set[str]]] = field(
        mapping_ext(
            str,                     # Key mapper
            unique_array(str, ","),  # Value mapper (set of strings)
            None,                    # Singleton marker (value when no = present)
            "|",                     # Pair delimiter
            "=",                     # Key-value delimiter
            "_",                     # Empty marker
            lambda p: p[0].lower()   # Ordering key
        )
    )
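The difference from mapping is the handling of entries that contain no key-value delimiter. A minimal sketch of that behavior, using a hypothetical parse_misc function with the delimiters hard-coded to match the example above (this is not the library's implementation):

```python
def parse_misc(text, singleton_marker=None):
    """Hypothetical: keys without '=' receive the singleton marker as their value."""
    if text == "_":
        return {}
    misc = {}
    for entry in text.split("|"):
        key, sep, value = entry.partition("=")
        # An empty `sep` means no "=" was present, so this is a singleton key.
        misc[key] = set(value.split(",")) if sep else singleton_marker
    return misc


misc = parse_misc("SpaceAfter=No|Translit=example|SpellId")
print(misc["SpaceAfter"])  # {'No'}
print(misc["SpellId"])     # None
```

Using None as the singleton marker keeps "key present without a value" distinguishable from "key present with a value", which is why the field is typed with an Optional value.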
varcols
Represents a variable number of columns (possibly zero). The field does not have to be the last one, but only one field per class can use varcols.
from pyconll.schema import tokenspec, varcols, field

@tokenspec
class Token:
    id: int
    form: str
    extra: list[str] = field(varcols(str))
    # All remaining columns parsed as strings
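Allowing only one variable-width field is what keeps the column assignment unambiguous: every other field takes exactly one column, and the variadic field absorbs whatever is left. The sketch below illustrates that arithmetic with a hypothetical split_columns helper; it is not how pyconll is implemented.

```python
def split_columns(columns, field_names, variadic):
    """Hypothetical: assign columns to fields, letting one field absorb the surplus."""
    index = field_names.index(variadic)
    width = len(columns) - (len(field_names) - 1)  # columns left for the variadic field
    if width < 0:
        raise ValueError("not enough columns for the fixed fields")
    result = {}
    for name, text in zip(field_names[:index], columns[:index]):
        result[name] = text
    result[variadic] = columns[index:index + width]
    for name, text in zip(field_names[index + 1:], columns[index + width:]):
        result[name] = text
    return result


# Variadic field last, as in the example above.
print(split_columns(["1", "dogs", "a", "b"], ["id", "form", "extra"], "extra"))
# Variadic field in the middle, with zero columns available for it.
print(split_columns(["1", "end"], ["id", "extra", "last"], "extra"))
```

With two variadic fields there would be no single way to divide the surplus columns between them.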
via
Custom (de)serialization functions.
from pyconll.schema import tokenspec, via, field
from datetime import datetime

def parse_date(s: str) -> datetime:
    return datetime.fromisoformat(s)

def serialize_date(d: datetime) -> str:
    return d.isoformat()

@tokenspec
class Token:
    timestamp: datetime = field(via(parse_date, serialize_date))
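When writing a parse/serialize pair like this, it is worth checking whether the two functions round-trip, since a file that is loaded and written back will otherwise change. The snippet below exercises the same two functions using only the standard library; whether pyconll itself requires an exact round trip is not stated here, so treat this as a general sanity check.

```python
from datetime import datetime


def parse_date(s: str) -> datetime:
    return datetime.fromisoformat(s)


def serialize_date(d: datetime) -> str:
    return d.isoformat()


# A full timestamp survives the round trip unchanged.
full = "2024-03-01T12:30:00"
print(serialize_date(parse_date(full)) == full)  # True

# A date-only input is accepted but comes back normalized to midnight.
print(serialize_date(parse_date("2024-03-01")))  # 2024-03-01T00:00:00
```

If the original text must be preserved exactly, the serializer needs to reproduce the input format rather than a canonical one.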
Complete Example: CoNLL-U
The CoNLL-U Token schema demonstrates many of these features in concert:
from pyconll.schema import (
    tokenspec, nullable, mapping, mapping_ext, unique_array,
    fixed_array, field
)
from typing import Optional

@tokenspec
class Token:
    id: str
    form: Optional[str] = field(nullable(str, "_"))
    lemma: Optional[str] = field(nullable(str, "_"))
    upos: Optional[str] = field(nullable(str, "_"))
    xpos: Optional[str] = field(nullable(str, "_"))

    # Features: Gender=Fem|Number=Sing
    feats: dict[str, set[str]] = field(
        mapping(
            str,
            unique_array(str, ",", "", str.lower),
            "|",
            "=",
            "_",
            lambda p: p[0].lower()
        )
    )

    head: Optional[str] = field(nullable(str, "_"))
    deprel: Optional[str] = field(nullable(str, "_"))

    # Enhanced dependencies: 4:nsubj|8:nmod:tmod
    deps: dict[str, tuple[str, ...]] = field(
        mapping(
            str,
            fixed_array(str, ":"),
            "|",
            ":",
            "_",
            lambda p: p[0]
        )
    )

    # Misc: SpaceAfter=No|Translit=example or singleton keys like SpellId
    misc: dict[str, Optional[set[str]]] = field(
        mapping_ext(
            str,
            unique_array(str, ",", "", str.lower),
            None,  # Singleton marker for keys without values
            "|",
            "=",
            "_",
            lambda p: p[0].lower()
        )
    )
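The deps column is the least obvious part of this schema, because ":" serves both as the key-value delimiter and as the delimiter inside the value tuple. The sketch below shows one reading of that declaration, assuming the key is split off at the first ":" and the remainder is then split into the tuple. This split-at-first-delimiter behavior is an assumption of the sketch, and parse_deps is a hypothetical function rather than part of pyconll.

```python
def parse_deps(text):
    """Hypothetical: '8:nmod:tmod' -> head '8', relation tuple ('nmod', 'tmod')."""
    if text == "_":
        return {}
    deps = {}
    for pair in text.split("|"):
        head, _, relation = pair.partition(":")  # split at the first ":" only
        deps[head] = tuple(relation.split(":"))
    return deps


deps = parse_deps("4:nsubj|8:nmod:tmod")
print(deps)  # {'4': ('nsubj',), '8': ('nmod', 'tmod')}
```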
Token Lifecycle Hooks
The @tokenspec decorator supports a __post_init__ method that runs custom logic after token initialization:
from pyconll.schema import tokenspec

@tokenspec
class Token:
    id: int
    form: str
    processed: bool = False

    def __post_init__(self) -> None:
        # This runs after all fields are set during parsing
        self.processed = True

# After parsing, token.processed will be True
This is useful for:
Computing derived fields
Validation
Normalization
Special handling (e.g., CoNLL-U’s form/lemma underscore logic)
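The standard library's dataclasses use the same hook name, so the pattern can be tried out without pyconll. The sketch below uses @dataclass as a stand-in for @tokenspec (an analogy only; the two decorators are not claimed to behave identically) to show validation and a derived field. The form_lower field is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    id: int
    form: Optional[str]
    form_lower: str = ""  # hypothetical derived field

    def __post_init__(self) -> None:
        # Validation: reject ids that cannot be valid token positions.
        if self.id < 1:
            raise ValueError(f"token id must be positive, got {self.id}")
        # Derived field: computed once, after all fields are set.
        self.form_lower = self.form.lower() if self.form is not None else ""


token = Token(1, "Dogs")
print(token.form_lower)  # dogs
```

Keeping this logic in the hook means every token produced by the parser passes through it, rather than relying on callers to normalize values afterwards.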
Advanced: Dynamic Field Descriptors
While the typical approach is to use field() as class attributes, you can also provide field descriptors dynamically via the field_descriptors parameter in the Format constructor. This is useful for:
Switching between different descriptor implementations at runtime
Sharing token classes across different formats
Advanced performance tuning
from typing import Optional

from pyconll.schema import tokenspec, nullable, FieldDescriptor
from pyconll.format import Format

@tokenspec
class Token:
    id: str
    form: str
    lemma: str
    upos: str

# Define descriptors separately
field_descriptors: dict[str, Optional[FieldDescriptor]] = {
    'id': None,  # None for primitive types (int, float, str)
    'form': nullable(str, "_"),
    'lemma': nullable(str, "_"),
    'upos': nullable(str, "_"),
}

# Pass to Format constructor.
# Sentence is a SentenceBase implementation that must already be in scope;
# its import is not shown in this example.
my_format = Format(Token, Sentence[Token], field_descriptors=field_descriptors)
When both class attributes and field_descriptors are provided, field_descriptors takes precedence. This allows you to override the class-level descriptors at Format creation time.
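The precedence rule can be pictured as a dictionary merge. The sketch below is only an illustration: resolve_descriptors is a hypothetical helper, the descriptors are represented by placeholder strings, and it assumes the override is applied field by field, which the text above does not spell out.

```python
def resolve_descriptors(class_level, overrides=None):
    """Hypothetical: entries passed at Format creation replace class-level ones."""
    resolved = dict(class_level)
    resolved.update(overrides or {})
    return resolved


# Placeholder strings stand in for real descriptor objects.
class_level = {"id": None, "form": "nullable(str, '_')", "lemma": "nullable(str, '_')"}
overrides = {"form": "nullable(str, '-')"}

resolved = resolve_descriptors(class_level, overrides)
print(resolved["form"])   # nullable(str, '-')
print(resolved["lemma"])  # nullable(str, '_')
```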
SentenceBase Interface
SentenceBase is an abstract interface that defines how Sentence implementations work with the Format system. Any Sentence type used with Format must implement this interface.
Required Methods and Properties
from collections import OrderedDict
from collections.abc import MutableMapping, MutableSequence
from typing import Optional

from pyconll.schema import SentenceBase

# MyToken is assumed to be a @tokenspec class defined elsewhere.
class MySentence(SentenceBase[MyToken]):
    def __init__(self) -> None:
        # Must have a no-argument constructor
        self._meta: OrderedDict[str, Optional[str]] = OrderedDict()
        self._tokens: list[MyToken] = []

    @property
    def meta(self) -> MutableMapping[str, Optional[str]]:
        return self._meta

    @meta.setter
    def meta(self, value: MutableMapping[str, Optional[str]]) -> None:
        self._meta = value

    @property
    def tokens(self) -> MutableSequence[MyToken]:
        return self._tokens

    @tokens.setter
    def tokens(self, value: MutableSequence[MyToken]) -> None:
        self._tokens = value

    def __accept_meta__(self, key: str, value: Optional[str]) -> None:
        # Called during parsing for each metadata pair
        self.meta[key] = value

    def __accept_token__(self, t: MyToken) -> None:
        # Called during parsing for each token
        self.tokens.append(t)

    def __finalize__(self) -> None:
        # Called when sentence parsing is complete
        pass
The lifecycle methods (__accept_meta__, __accept_token__, __finalize__) allow custom sentence implementations to process data incrementally during parsing, enabling streaming scenarios and custom initialization logic.
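The order in which the hooks fire can be seen with a small stand-in driver. Everything below is a hypothetical sketch using only the standard library: read_sentence imitates what a parser might do for one sentence, the comment-line handling is simplified, and CountingSentence does not inherit from SentenceBase so that the example runs on its own.

```python
class CountingSentence:
    """Stand-in sentence that records what the hooks receive."""

    def __init__(self):
        self.meta = {}
        self.tokens = []
        self.finalized = False

    def __accept_meta__(self, key, value):
        self.meta[key] = value

    def __accept_token__(self, t):
        self.tokens.append(t)

    def __finalize__(self):
        self.finalized = True


def read_sentence(lines, sentence_factory):
    """Hypothetical driver: feed one sentence's lines through the hooks."""
    sentence = sentence_factory()  # relies on the no-argument constructor
    for line in lines:
        if line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            sentence.__accept_meta__(key.strip(), value.strip() if sep else None)
        else:
            sentence.__accept_token__(line.split("\t"))
    sentence.__finalize__()  # called once, after the last line
    return sentence


s = read_sentence(["# sent_id = 1", "# newpar", "1\tdogs", "2\tbark"], CountingSentence)
print(s.meta)         # {'sent_id': '1', 'newpar': None}
print(len(s.tokens))  # 2
print(s.finalized)    # True
```

Because each token is handed over as soon as it is read, a custom sentence can aggregate or discard data in __accept_token__ without holding every token in memory.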