sentence
The Sentence class defined in pyconll.shared represents a sentence across different formats. It inherits from AbstractSentence, which describes the requirements for a sentence type. Most formats share the same sentence structure, so this single base implementation is provided; for more advanced needs, define a new class that inherits from AbstractSentence directly.
A Sentence is a simple container with two main components:
meta: OrderedDict[str, Optional[str]] - Metadata/comments
tokens: list[T] - List of token objects; Sentence is generic over the exact token type
pyconll.conllu also defines a Sentence class, which builds on this base and adds the to_tree method.
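To make the container shape concrete, here is a minimal sketch of a generic two-field sentence type. SketchSentence is a hypothetical stand-in written only for illustration; it is not pyconll's class and does not inherit from AbstractSentence.

```python
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")  # the token type the sentence is generic over


@dataclass
class SketchSentence(Generic[T]):
    """Hypothetical illustration of the meta/tokens layout (not pyconll code)."""

    # Ordered mapping of comment keys to values; None marks a singleton comment.
    meta: "OrderedDict[str, Optional[str]]" = field(default_factory=OrderedDict)
    # Plain list of token objects of type T.
    tokens: List[T] = field(default_factory=list)


# Plain strings stand in for token objects here.
sent: SketchSentence[str] = SketchSentence()
sent.meta["sent_id"] = "demo-1"
sent.meta["newpar"] = None
sent.tokens.extend(["Hello", "world"])
```

Because both fields are ordinary Python containers, reading and modifying a sentence needs no special API beyond dict and list operations.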
Metadata
Metadata (the comments in a CoNLL-U file) is stored as an ordered dictionary. Each comment is treated as a key-value pair, with key and value separated by the = character. A singleton comment contains no =; in that case the key is the whole comment string and the value is None.
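The key-value rule above can be sketched in a few lines of standard-library Python. This is an illustration of the rule, not pyconll's parser: parse_comment_lines is a hypothetical helper, and splitting on the first = and trimming surrounding whitespace are assumptions made for the example.

```python
from collections import OrderedDict
from typing import List, Optional


def parse_comment_lines(lines: List[str]) -> "OrderedDict[str, Optional[str]]":
    """Hypothetical helper: turn '# key = value' comment lines into an ordered mapping."""
    meta: "OrderedDict[str, Optional[str]]" = OrderedDict()
    for line in lines:
        # Drop the leading '#' marker and surrounding whitespace.
        body = line.lstrip("#").strip()
        if "=" in body:
            # Assumption: split on the first '=' only, so values may contain '='.
            key, _, value = body.partition("=")
            meta[key.strip()] = value.strip()
        else:
            # Singleton comment: the whole string is the key, the value is None.
            meta[body] = None
    return meta


meta = parse_comment_lines(["# sent_id = 2", "# text = a = b", "# newpar"])
```

Here meta maps sent_id to '2', text to 'a = b', and newpar to None, in the order the comments appeared.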
Accessing Metadata
from pyconll.conllu import conllu
sentences = conllu.load_from_file('train.conllu')
sentence = sentences[0]
# Access metadata
sent_id = sentence.meta['sent_id']
text = sentence.meta['text']
# Add new metadata
sentence.meta['custom'] = 'value'
# Singleton metadata
sentence.meta['newpar'] = None
Common Metadata Keys
In CoNLL-U, common metadata keys include:
sent_id - Sentence identifier
text - The original sentence text
newdoc id - Document boundary marker
newpar id - Paragraph boundary marker
Tokens
Tokens are stored as a simple list. The type of tokens depends on the exact token specification provided when parsing.
For CoNLL-U files, tokens are of type Token from pyconll.conllu.
Accessing Tokens
from pyconll.conllu import conllu
sentences = conllu.load_from_file('train.conllu')
sentence = sentences[0]
# Iterate over tokens
for token in sentence.tokens:
    print(token.form, token.upos)
# Access by index
first_token = sentence.tokens[0]
# Build ID index if needed
token_by_id = {t.id: t for t in sentence.tokens}
token = token_by_id['5']
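The index above is keyed by strings because CoNLL-U IDs are not always plain integers: multiword tokens use a range such as 1-2, and empty nodes use a decimal such as 2.1. The sketch below classifies ID strings on that basis; id_kind is a hypothetical helper for illustration, not part of pyconll, and it assumes the IDs are already well formed.

```python
def id_kind(token_id: str) -> str:
    """Hypothetical helper: classify a CoNLL-U ID string by its shape."""
    if "-" in token_id:
        return "multiword"  # range ID such as "1-2"
    if "." in token_id:
        return "empty"  # empty-node ID such as "2.1"
    return "word"  # plain integer ID such as "3"


ids = ["1-2", "1", "2", "2.1", "3"]
# Keep only the plain word IDs, e.g. before building an index of syntactic words.
word_ids = [i for i in ids if id_kind(i) == "word"]
```

A filter like this can be applied to t.id when building token_by_id if only syntactic words should be indexed.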