sentence

The Sentence class defined in pyconll.shared represents a sentence independently of any particular format. It inherits from AbstractSentence, which describes the requirements for a sentence type. Most formats share the same sentence structure, so this one base implementation is provided; more advanced use cases can define a new class that inherits from AbstractSentence directly.

A Sentence is a simple container with two main components:

  • meta: OrderedDict[str, Optional[str]] - Metadata/comments

  • tokens: list[T] - List of token objects; Sentence is generic over the token type T.
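As a rough illustration of the two components above, the container can be pictured as a small generic dataclass. This is only a sketch: SketchSentence is a hypothetical stand-in written for this example, not the class shipped in pyconll.shared, and it omits anything AbstractSentence may require.

```python
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class SketchSentence(Generic[T]):
    """Hypothetical stand-in for illustration; not pyconll's actual class."""

    # Comments as key -> value; a singleton comment maps to None.
    meta: "OrderedDict[str, Optional[str]]" = field(default_factory=OrderedDict)
    # Tokens in file order; T is whatever token type the format defines.
    tokens: list[T] = field(default_factory=list)


# Plain strings stand in for token objects here.
sentence: SketchSentence[str] = SketchSentence()
sentence.meta["sent_id"] = "demo-1"
sentence.tokens.extend(["Hello", "world"])
```

Keeping the token type as a type parameter is what would let one container serve several formats, each with its own token class.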

pyconll.conllu also defines a Sentence class, which builds on this base and adds the to_tree method.

Metadata

Metadata (the comment lines in a CoNLL-U file) is stored as an ordered dictionary. Each comment is treated as a key-value pair, with key and value separated by the = character. A singleton comment contains no =; in that case the key is the whole comment string and the value is None.
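The key-value rule described above can be sketched in a few lines of plain Python. The helper split_comment is hypothetical and exists only for this example; in particular, splitting on the first = and trimming surrounding whitespace are assumptions, and pyconll's own parser may treat edge cases differently.

```python
from typing import Optional, Tuple


def split_comment(line: str) -> Tuple[str, Optional[str]]:
    """Split one comment line into (key, value); value is None for singletons."""
    # Drop the leading '#' marker and surrounding whitespace.
    body = line.lstrip("#").strip()
    if "=" not in body:
        # Singleton comment: the whole string is the key.
        return body, None
    # Assumption: only the first '=' separates key from value.
    key, value = body.split("=", 1)
    return key.strip(), value.strip()


pair = split_comment("# sent_id = train-s1")
singleton = split_comment("# newpar")
```

With these inputs, pair holds the key sent_id with the value train-s1, and singleton holds the key newpar with the value None.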

Accessing Metadata

from pyconll.conllu import conllu

sentences = conllu.load_from_file('train.conllu')
sentence = sentences[0]

# Access metadata
sent_id = sentence.meta['sent_id']
text = sentence.meta['text']

# Add new metadata
sentence.meta['custom'] = 'value'

# Singleton metadata
sentence.meta['newpar'] = None

Common Metadata Keys

In CoNLL-U, common metadata keys include:

  • sent_id - Sentence identifier

  • text - The original sentence text

  • newdoc id - Document boundary marker

  • newpar id - Paragraph boundary marker
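To show how these keys relate to the comment lines of a file, the sketch below turns a metadata dictionary back into CoNLL-U style comments. The function meta_to_comments and the sample values are hypothetical, chosen for illustration; this is not how pyconll itself serializes sentences.

```python
from collections import OrderedDict
from typing import List, Optional


def meta_to_comments(meta: "OrderedDict[str, Optional[str]]") -> List[str]:
    """Render metadata as comment lines, preserving insertion order."""
    lines = []
    for key, value in meta.items():
        # Singleton entries (value None) are written without '='.
        lines.append(f"# {key}" if value is None else f"# {key} = {value}")
    return lines


meta = OrderedDict()
meta["newdoc id"] = "doc-1"
meta["newpar"] = None
meta["sent_id"] = "doc-1-s1"
meta["text"] = "Hello world."
comment_lines = meta_to_comments(meta)
```

Because the dictionary is ordered, the comments come out in the order they were added, which matters for markers such as newdoc that conventionally precede sent_id.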

Tokens

Tokens are stored as a simple list. The token type depends on the token specification provided when parsing.

For CoNLL-U files, tokens are of type Token from pyconll.conllu.

Accessing Tokens

from pyconll.conllu import conllu

sentences = conllu.load_from_file('train.conllu')
sentence = sentences[0]

# Iterate over tokens
for token in sentence.tokens:
    print(token.form, token.upos)

# Access by index
first_token = sentence.tokens[0]

# Build ID index if needed
token_by_id = {t.id: t for t in sentence.tokens}
token = token_by_id['5']
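When building an ID index, keep in mind that CoNLL-U IDs are not always plain integers: multiword tokens use ranges such as 1-2, and empty nodes use decimals such as 2.1. The sketch below restricts the index to regular words. It assumes IDs are held as strings, as in the lookup above; FakeToken and both helper functions are hypothetical stand-ins, not part of pyconll.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a parsed token; only id and form are modeled."""
    id: str
    form: str


def is_multiword_id(token_id: str) -> bool:
    # Multiword token ranges look like "1-2".
    return "-" in token_id


def is_empty_node_id(token_id: str) -> bool:
    # Empty nodes use decimal IDs like "2.1".
    return "." in token_id


tokens = [
    FakeToken("1-2", "vámonos"),
    FakeToken("1", "vamos"),
    FakeToken("2", "nos"),
    FakeToken("2.1", "_"),
    FakeToken("3", "ya"),
]

# Index only regular words, keyed by their string ID.
word_by_id = {
    t.id: t
    for t in tokens
    if not is_multiword_id(t.id) and not is_empty_node_id(t.id)
}
```

Filtering this way keeps the index aligned with the IDs that dependency heads normally refer to in the basic tree.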

API