sentence

The Sentence class defined in pyconll.shared represents a sentence across different formats. It inherits from AbstractSentence, which describes the requirements for a sentence type. Most formats share the same sentence structure, so this one base implementation is provided; for more advanced usage, define a new class that inherits from AbstractSentence directly.

A Sentence is a simple container with two main components:

  • meta: OrderedDict[str, Optional[str]] - Metadata/comments

  • tokens: list[T] - List of token objects; Sentence is generic over the token type T.

The Sentence class defined in pyconll.conllu builds on this base and adds the to_tree method.

Metadata

Metadata (the comment lines in a CoNLL-U file) is stored as an ordered dictionary. Each comment is treated as a key-value pair separated by the = character. A singleton comment contains no =; in that case the key is the whole comment string and the value is None.
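The key-value rule described above can be sketched in plain Python. This is only an illustration of the rule, not pyconll's actual parser; the helper name parse_comment and the choice to split on the first = and strip surrounding whitespace are assumptions.

```python
from collections import OrderedDict
from typing import Optional, Tuple


def parse_comment(line: str) -> Tuple[str, Optional[str]]:
    """Split one comment line into (key, value); hypothetical helper."""
    # Drop the leading '#' marker and surrounding whitespace.
    body = line.lstrip("#").strip()
    if "=" in body:
        # Assumption: only the first '=' separates key from value.
        key, value = body.split("=", 1)
        return key.strip(), value.strip()
    # Singleton comment: the whole string is the key, the value is None.
    return body, None


meta: "OrderedDict[str, Optional[str]]" = OrderedDict()
for line in ["# sent_id = s1", "# text = Hello world.", "# newpar"]:
    key, value = parse_comment(line)
    meta[key] = value
```

After the loop, meta holds the three keys in file order, with newpar mapped to None.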

Accessing Metadata

from pyconll.conllu import conllu

sentences = conllu.load_from_file('train.conllu')
sentence = sentences[0]

# Access metadata
sent_id = sentence.meta['sent_id']
text = sentence.meta['text']

# Add new metadata
sentence.meta['custom'] = 'value'

# Singleton metadata
sentence.meta['newpar'] = None
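Because singleton metadata is stored with the value None, a plain .get() call cannot tell a singleton from a missing key. The sketch below uses an OrderedDict as a stand-in for sentence.meta; the helpers describe and to_comment_lines are hypothetical, and the output layout simply mirrors CoNLL-U comment lines rather than pyconll's own serializer.

```python
from collections import OrderedDict
from typing import List, Optional

# Stand-in for sentence.meta, so the example runs without a corpus file.
meta: "OrderedDict[str, Optional[str]]" = OrderedDict(
    [("sent_id", "s1"), ("newpar", None)]
)


def describe(meta: "OrderedDict[str, Optional[str]]", key: str) -> str:
    """Classify a key as absent, singleton, or carrying a value."""
    if key not in meta:
        return "absent"
    if meta[key] is None:
        return "singleton"
    return "value"


def to_comment_lines(meta: "OrderedDict[str, Optional[str]]") -> List[str]:
    """Render metadata back into comment lines, preserving order."""
    return [f"# {k}" if v is None else f"# {k} = {v}" for k, v in meta.items()]
```

Testing membership with the in operator first is what separates a singleton key from one that was never set.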

Common Metadata Keys

In CoNLL-U, common metadata keys include:

  • sent_id - Sentence identifier

  • text - The original sentence text

  • newdoc id - Document boundary marker

  • newpar id - Paragraph boundary marker

Tokens

Tokens are stored as a simple list. The type of tokens depends on the exact token specification provided when parsing.

For CoNLL-U files, tokens are of type Token from pyconll.conllu.

Accessing Tokens

from pyconll.conllu import conllu

sentences = conllu.load_from_file('train.conllu')
sentence = sentences[0]

# Iterate over tokens
for token in sentence.tokens:
    print(token.form, token.upos)

# Access by index
first_token = sentence.tokens[0]

# Build ID index if needed
token_by_id = {t.id: t for t in sentence.tokens}
token = token_by_id['5']
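In CoNLL-U, token IDs are not always plain integers: multiword tokens use ranges such as 1-2 and empty nodes use decimals such as 2.1. The sketch below shows one way to keep only the basic words when building an index. StubToken is a stand-in with string IDs (matching the string lookup above), not pyconll's Token class, and is_basic_id is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class StubToken:
    """Minimal stand-in for a token; only id and form are modeled."""
    id: str
    form: str


tokens = [
    StubToken("1-2", "cannot"),  # multiword token spanning words 1 and 2
    StubToken("1", "can"),
    StubToken("2", "not"),
    StubToken("2.1", "_"),       # empty node
]


def is_basic_id(token_id: str) -> bool:
    """True for plain integer IDs; False for ranges and empty nodes."""
    return token_id.isdigit()


# Keep only the basic words, in sentence order.
words = [t for t in tokens if is_basic_id(t.id)]

# An index keyed by the raw ID string still covers every token kind.
token_by_id = {t.id: t for t in tokens}
```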

API

Contains definitions for concepts that are shared among many CoNLL variations.

class pyconll.shared.Sentence

A very basic sentence type that can be used for most use cases. It simply stores the metadata and tokens in the order they were received. It can be used as a base level for other sentence implementations which want to add additional operations on top of this very common logic.

__accept_meta__(key: str, value: str | None) -> None

Accept the next metadata key-value pair.

Parameters:
  • key – The key of the metadata.

  • value – The value of the metadata or None if it is a singleton.

__accept_token__(t: T) -> None

Accept the next token value.

Parameters:
  • t – The next token value for this Sentence to accept.

__finalize__() -> None

There is nothing to finalize for this Sentence type.
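The three hooks above suggest a simple build protocol: metadata and tokens are handed over one at a time, then the sentence is finalized. The following is a minimal sketch of a container with the same hook names, not the library's implementation. SketchSentence is a hypothetical class, and the call order in the driver code is an assumption about how a parser would use the hooks.

```python
from collections import OrderedDict
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class SketchSentence(Generic[T]):
    """Hypothetical container mirroring the documented hook names."""

    def __init__(self) -> None:
        self.meta: "OrderedDict[str, Optional[str]]" = OrderedDict()
        self.tokens: List[T] = []

    def __accept_meta__(self, key: str, value: Optional[str]) -> None:
        # Store metadata in the order it is received.
        self.meta[key] = value

    def __accept_token__(self, t: T) -> None:
        # Store tokens in the order they are received.
        self.tokens.append(t)

    def __finalize__(self) -> None:
        # Nothing to do for a plain container.
        pass


# Assumed driver: feed metadata, then tokens, then finalize.
s: SketchSentence[str] = SketchSentence()
s.__accept_meta__("sent_id", "s1")
s.__accept_meta__("newpar", None)
for form in ["Hello", "world"]:
    s.__accept_token__(form)
s.__finalize__()
```

A subclass that needs derived state, such as a tree, could override __finalize__ to compute it once all tokens have arrived.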

__init__() -> None

Create a new structured Sentence object.

__repr__() -> str

Create a string that represents this Sentence object.

Returns:

The constructed string.