conllu
===================================

The ``conllu`` module provides the standard CoNLL-U format implementation, including the ``Token`` and ``Sentence`` class and pre-configured ``Format`` instance for reading and writing CoNLL-U files.

Overview
----------------------------------

This module is the primary entry point for working with CoNLL-U files. It provides:

- ``Token`` - The CoNLL-U token schema with all standard fields.
- ``Sentence`` - The CoNLL-U sentence schema which can create a Tree model and provides access to metadata and tokens.
- ``conllu`` - A pre-configured ``Format`` instance for CoNLL-U. This should be the default in most use cases as opposed to ``fast_conllu``.
- ``fast_conllu`` - A pre-configured ``Format`` instance for CoNLL-U which trades off parser speed for increased memory usage.
- ``ConllFormat`` which provides a type alias to abstract from having to use the full types of ``conllu`` and ``fast_conllu``.

The ``conllu`` Format Instance
----------------------------------

The module exports pre-configured ``Format[Token]`` instances named ``conllu`` and ``fast_conllu`` that are ready to use and are completely interchangeable.

.. code:: python

    from pyconll.conllu import conllu, ConlluFormat

    cformat: ConlluFormat = conllu # or conllu.fast_conllu

    # Load entire file into memory
    sentences = cformat.load_from_file('train.conllu')

    # Stream large files
    for sentence in cformat.iter_from_file('huge.conllu'):
        process(sentence)

    # Parse from string
    text = """# sent_id = 1
    1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
    2\tcat\tcat\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No

    """
    sentences = cformat.load_from_string(text)

    # Write back to file
    with open('output.conllu', 'w') as f:
        cformat.write_corpus(sentences, f)

The Token Class
----------------------------------

The ``Token`` class defines the CoNLL-U format with 10 standard columns:

Fields
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. ``id: str`` - Token ID (e.g., "1", "2-3", "2.1")
2. ``form: Optional[str]`` - Word form or punctuation symbol
3. ``lemma: Optional[str]`` - Lemma or stem
4. ``upos: Optional[str]`` - Universal part-of-speech tag
5. ``xpos: Optional[str]`` - Language-specific part-of-speech tag
6. ``feats: dict[str, set[str]]`` - Morphological features
7. ``head: Optional[str]`` - Head token ID
8. ``deprel: Optional[str]`` - Dependency relation
9. ``deps: dict[str, tuple[str, ...]]`` - Enhanced dependencies
10. ``misc: dict[str, Optional[set[str]]]`` - Miscellaneous annotations

Example Usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    from pyconll.conllu import conllu

    sentences = conllu.load_from_file('train.conllu')

    for sentence in sentences:
        for token in sentence.tokens:
            # Access basic fields
            if token.upos == 'VERB':
                print(f"Verb: {token.form} -> {token.lemma}")

            # Modify features
            if token.upos == 'NOUN':
                if 'Number' not in token.feats:
                    token.feats['Number'] = set()
                token.feats['Number'].add('Sing')

            # Add misc annotations
            token.misc['Analyzed'] = None  # Singleton feature

    # Write modified corpus
    with open('output.conllu', 'w') as f:
        conllu.write_corpus(sentences, f)

Dictionary Fields
----------------------------------

Three fields (``feats``, ``deps``, ``misc``) are dictionaries.

feats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Morphological features as key-value pairs:

.. code:: python

    # Example: Gender=Fem|Number=Sing
    token.feats = {
        'Gender': {'Fem'},
        'Number': {'Sing'}
    }

    # Modify
    token.feats['Case'] = {'Nom'}
    token.feats['Number'].add('Plur')  # Now {'Sing', 'Plur'}

    # Serializes to: Case=Nom|Gender=Fem|Number=Sing,Plur

deps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enhanced dependencies as head-to-relation mappings:

.. code:: python

    # Example: 4:nsubj
    token.deps = {
        '4': ('nsubj',)
    }

    # The tuple is a fixed size per element but can vary between elements.
    # In CoNLL-U, there is usually only two elements in this field.

misc
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Miscellaneous annotations with optional values:

.. code:: python

    # Singleton features (no value)
    token.misc['SpaceAfter'] = None  # Serializes as "SpaceAfter"

    # Features with values
    token.misc['Translit'] = {'example'}  # Serializes as "Translit=example"

    # Multiple values
    token.misc['Gloss'] = {'cat', 'feline'}  # Serializes as "Gloss=cat,feline"

API
----------------------------------
.. automodule:: pyconll.conllu
    :members:
    :exclude-members: __dict__, __weakref__