sentence =================================== The ``Sentence`` class defined in ``pyconll.shared`` represents a sentence across different formats. It inherits from ``AbstractSentence`` which describes the requirements for a sentence type. Most formats will have the same sentence structure, so one base case is given, but more advanced usage can be derived from a new class inheriting from ``AbstractSentence`` directly. A ``Sentence`` is a simple container with two main components: - ``meta: OrderedDict[str, Optional[str]]`` - Metadata/comments - ``tokens: list[T]`` - List of token objects with the Sentence being generic to the exact token type. There is a ``Sentence`` class defined in ``pyconll.conll`` which is built off of this base and adds the ``to_tree`` method. Metadata ---------------------------------- Metadata (comments in the CoNLL-U file) are stored as an ordered dictionary. Comments are treated as key-value pairs, separated by the ``=`` character. A singleton comment has no ``=`` present; in this situation the key is the comment string, and the value is ``None``. Accessing Metadata ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: python from pyconll.conllu import conllu sentences = conllu.load_from_file('train.conllu') sentence = sentences[0] # Access metadata sent_id = sentence.meta['sent_id'] text = sentence.meta['text'] # Add new metadata sentence.meta['custom'] = 'value' # Singleton metadata sentence.meta['newpar'] = None Common Metadata Keys ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In CoNLL-U, common metadata keys include: - ``sent_id`` - Sentence identifier - ``text`` - The original sentence text - ``newdoc id`` - Document boundary marker - ``newpar id`` - Paragraph boundary marker Tokens ---------------------------------- Tokens are stored as a simple list. The type of tokens depends on the exact token specification provided when parsing. For CoNLL-U files, tokens are of type ``Token`` from ``pyconll.conllu``. Accessing Tokens ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: python from pyconll.conllu import conllu sentences = conllu.load_from_file('train.conllu') sentence = sentences[0] # Iterate over tokens for token in sentence.tokens: print(token.form, token.upos) # Access by index first_token = sentence.tokens[0] # Build ID index if needed token_by_id = {t.id: t for t in sentence.tokens} token = token_by_id['5'] API ---------------------------------- .. automodule:: pyconll.shared :members: :exclude-members: __dict__, __weakref__