format =================================== The ``format`` module defines the core interface for reading and writing tabular data formats. It provides three main classes: ``ReadFormat``, ``WriteFormat``, and ``Format`` (which inherits both). Overview ---------------------------------- The Format system is built around the ``tokenspec`` decorator and the ``AbstractSentence`` ABC, allowing you to define custom token and sentence types and automatically generate optimized parsers and serializers for them. This makes ``pyconll`` flexible enough to work with CoNLL-U or any other tabular format. The ``Format`` class compiles reading and writing logic based on your token schema at initialization time. Classes ---------------------------------- ReadFormat[T, S] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Provides methods for parsing tabular data into Python objects. It provides operations for Tokens and Sentences, but most usage would be primarily on collections of Sentences. WriteFormat[T, S] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Provides methods for serializing Python objects to tabular format. Like ReadFormat, it provides operations for Tokens and Sentences, but most usage would be primarily on collections of Sentences. Format[T, S] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Combines both ``ReadFormat`` and ``WriteFormat`` functionality. This is the class you'll typically use. By separating out the read and write side future changes allowing for serialization or deserialization only types is possible. Example ----------------------------------- Creating a custom format for CoNLL-X: .. code:: python from pyconll.format import Format from pyconll.schema import tokenspec, nullable, unique_array, field from pyconll.shared import Sentence from typing import Optional @tokenspec class TokenX: id: int form: str lemma: str cpostag: str postag: str feats: set[str] = field(unique_array(str, "|", "_")) head: int deprel: str phead: Optional[int] = field(nullable(int, "_")) pdeprel: Optional[str] = field(nullable(str, "_")) # Create format instance conllx = Format(TokenX, Sentence[TokenX], comment_marker="#", delimiter="\t") # Load data sentences = conllx.load_from_file("data.conllx") # Modify data for sentence in sentences: for token in sentence.tokens: if token.postag == "NN": token.feats.add("Modified") # Write back with open("output.conllx", "w") as f: conllx.write_corpus(sentences, f) Using the pre-configured CoNLL-U format: .. code:: python from pyconll.conllu import conllu # Pre-defined Format instance # Load sentences = conllu.load_from_file("train.conllu") # Stream for large files for sentence in conllu.iter_from_file("huge.conllu"): process(sentence) Performance Notes ---------------------------------- The Format class uses dynamic code generation (via Python's ``compile()`` and ``exec()``) to create optimized parsers and serializers. This compilation happens once at Format initialization, so: - Creating a Format instance has some overhead (typically milliseconds). - Once created, parsing and serialization are optimized and cached. - Reuse Format instances rather than recreating them. For CoNLL-U specifically, use the pre-configured ``conllu`` or ``fast_conllu`` instance from ``pyconll.conllu`` rather than creating your own. Advanced: Dynamic Field Descriptors ---------------------------------- The ``Format`` constructor accepts a ``field_descriptors`` parameter that allows you to provide field descriptors dynamically instead of as class attributes. This is useful for: - Switching between different serialization strategies at runtime - Performance tuning (e.g., using ``sys.intern`` for string interning) - Sharing token classes across multiple formats .. code:: python from pyconll.format import Format from pyconll.schema import tokenspec, nullable, via, FieldDescriptor from pyconll.shared import Sentence import sys from typing import Optional @tokenspec class Token: id: str form: str lemma: str upos: str # Define descriptors separately standard_descriptors: dict[str, Optional[FieldDescriptor]] = { 'id': None, # None for primitive types (str, int, float) 'form': nullable(str, "_"), 'lemma': nullable(str, "_"), 'upos': nullable(str, "_"), } # Compact version using string interning for memory efficiency compact_descriptors: dict[str, Optional[FieldDescriptor]] = { 'id': via(sys.intern), 'form': nullable(via(sys.intern), "_"), 'lemma': nullable(via(sys.intern), "_"), 'upos': nullable(via(sys.intern), "_"), } # Create two different Format instances with the same Token class standard_format = Format(Token, Sentence[Token], field_descriptors=standard_descriptors) compact_format = Format(Token, Sentence[Token], field_descriptors=compact_descriptors) When both class attributes (using ``field()``) and ``field_descriptors`` are provided, ``field_descriptors`` takes precedence. The ``extra_primitives`` parameter allows you to specify additional types that should be treated as primitives (constructed via their type constructor, serialized via ``str()``). This also takes precedence over anything provided on ``@tokenspec``. Note that one downside of the ``field_descriptors`` parameter is not that no type checking is performed, as opposed to using ``field()`` on the class definition, so this should be used with care and only in instances where the exact schema implementation will vary at runtime. API ---------------------------------- .. automodule:: pyconll.format :members: :exclude-members: __dict__, __weakref__