Custom Token Schemas

Version 4.0 lets you define custom token formats, so you can parse and serialize formats other than the base CoNLL-U format.

from pyconll.format import Format
from pyconll.schema import tokenspec, nullable, unique_array, field, SentenceBase
from pyconll.shared import Sentence
from typing import Optional

@tokenspec
class CoNLLX:
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: set[str] = field(unique_array(str, "|", "_"))
    head: int
    deprel: str
    phead: Optional[int] = field(nullable(int, "_"))
    pdeprel: Optional[str] = field(nullable(str, "_"))

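# Build the format once; the compilation happens when the Format is created.
conllx = Format(CoNLLX, Sentence[CoNLLX])

# Load sentences from a file using the custom format.
sentences = conllx.load_from_file('data.conllx')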
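
To make the two field helpers in the schema above more concrete, here is a minimal plain-Python sketch of the parsing behavior their names and arguments suggest. It assumes that `nullable(int, "_")` maps the placeholder `_` to `None`, and that `unique_array(str, "|", "_")` splits a column on `|` into a set, with `_` meaning an empty set. The functions `parse_nullable` and `parse_unique_array` are hypothetical stand-ins written for illustration; they are not part of pyconll, and the library's actual behavior may differ.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def parse_nullable(raw: str, convert: Callable[[str], T], empty: str = "_") -> Optional[T]:
    """Hypothetical stand-in: return None for the empty marker, else convert."""
    return None if raw == empty else convert(raw)


def parse_unique_array(
    raw: str, convert: Callable[[str], T], sep: str = "|", empty: str = "_"
) -> set:
    """Hypothetical stand-in: split on sep into a set; the empty marker gives an empty set."""
    if raw == empty:
        return set()
    return {convert(part) for part in raw.split(sep)}


# Column values as they might appear in a CoNLL-X line (illustrative only).
feats = parse_unique_array("num=sg|gen=m", str)
phead = parse_nullable("_", int)
head = parse_nullable("2", int)
```

Under these assumptions, a `feats` column becomes a set of feature strings, and an unset `phead` becomes `None` rather than the literal string `_`.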

See the schema and format documentation for more details.

The schema is compiled once, when the Format instance is created, so reuse Format instances instead of rebuilding them for best performance.
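
One way to follow the reuse advice is to build the format on first request and hand the same instance to every caller. The sketch below does this with `functools.lru_cache`. `StandInFormat` and `get_format` are hypothetical names: `StandInFormat` only counts constructions in place of pyconll's `Format`, so the example runs without the library and makes no claim about how pyconll works internally.

```python
from functools import lru_cache


class StandInFormat:
    """Hypothetical stand-in for an object that is expensive to construct."""

    constructions = 0  # counts how many times the costly setup ran

    def __init__(self, name: str) -> None:
        StandInFormat.constructions += 1
        self.name = name


@lru_cache(maxsize=None)
def get_format(name: str) -> StandInFormat:
    """Build the format on the first request, then return the cached instance."""
    return StandInFormat(name)


# Both calls return the same object; the costly setup runs only once.
first = get_format("conllx")
second = get_format("conllx")
```

In real code, the body of `get_format` would construct the actual Format. A module-level constant such as `conllx` in the example above achieves the same effect with less machinery.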