sentence
The Sentence module represents an entire CoNLL sentence, which is composed of comments and tokens.
Comments
Comments are treated as key-value pairs, separated by the = character. A singleton comment has no = present. In this situation the key is the comment string, and the value is None. Comments on Sentences are read with the meta_ prefixed methods (meta_value and meta_present) and written with set_meta and remove_meta, all of which are found below.
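The key-value and singleton rule above can be illustrated with a small standalone sketch. This is not pyconll's own parser: the function name parse_comment is hypothetical, and splitting on only the first = is an assumption made here so that values may contain = themselves.

```python
from typing import Optional, Tuple


def parse_comment(line: str) -> Tuple[str, Optional[str]]:
    """Split one CoNLL-U comment line into a (key, value) pair.

    Hypothetical helper for illustration; not part of pyconll.
    """
    # Drop the leading '#' marker and any surrounding whitespace.
    body = line.strip().lstrip('#').strip()
    # Assumption: only the first '=' separates the key from the value.
    key, sep, value = body.partition('=')
    if not sep:
        # Singleton comment: the whole string is the key, the value is None.
        return body, None
    return key.strip(), value.strip()


# A key-value comment and a singleton comment.
pair = parse_comment('# sent_id = example-1')
singleton = parse_comment('# newpar')
```

Here pair holds a key with a string value, while singleton holds the comment string as the key and None as the value.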
For convenience, the sent_id and text comments are accessible through member properties on the Sentence in addition to the metadata methods. So sentence.id and sentence.meta_value('sent_id') are equivalent, but the former is more concise and readable. Since this API does not support changing a token’s form, the text comment cannot be changed. Text translations or transliterations can still be added just like any other comment.
Document and Paragraph ID
In previous versions of pyconll, the document and paragraph id of a Sentence were extracted similarly to the text and id information. This caused strange results and semantics when adding Sentences to a Conll object, since the added sentence may have a newpar or newdoc comment which affects all subsequent Sentence ids. For simplicity’s sake, this information is now only directly available as normal metadata information.
Tokens
Tokens are the heart of the sentence. Tokens can be indexed on Sentences through their id value, as a string, or through a numeric index. So all of the following calls are valid: sentence['5'], sentence['2-3'], sentence['2.1'], and sentence[2]. Note that sentence[x] and sentence[str(x)] are not interchangeable. Both calls are valid but have different meanings: the integer form uses the numeric index of the Token, while the string form uses the Token's id.
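The difference between the two key types can be seen in a toy model that does not use pyconll at all. The ids below are made up, lookup is a hypothetical function, and treating an integer as a zero-based position among the token lines is an assumption of this sketch.

```python
from typing import List, Union

# Token ids in file order for a made-up sentence containing a multiword
# token (3-4) and an empty node (5.1).
token_ids: List[str] = ['1', '2', '3-4', '3', '4', '5', '5.1']

# Map each id string to its position, mimicking lookup by id.
position_of = {token_id: pos for pos, token_id in enumerate(token_ids)}


def lookup(key: Union[int, str]) -> str:
    """Return the id of the token selected by an integer or string key."""
    if isinstance(key, int):
        # Assumption: an integer is a zero-based position among token lines.
        return token_ids[key]
    # A string is matched against the id column instead.
    return token_ids[position_of[key]]


by_position = lookup(2)   # third token line, which is the multiword token
by_id = lookup('2')       # the token whose id is '2'
```

In this model lookup(2) and lookup(str(2)) select different tokens, which is why the two forms should not be swapped for one another.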
API
Defines the Sentence type and the associated parsing and output logic.
class pyconll.unit.sentence.Sentence(source: str)
A sentence in a CoNLL-U file. A sentence consists of several components.
First are comments. Per the UD v2 guidelines, each sentence must have two comments, sent_id and text. Comments are stored as a dict in the meta field. For singleton comments with no key-value structure, the value in the dict is None.
Note that the sent_id field is also assigned to the id property, and the text field is assigned to the text property, for usability and because of their importance as comments. The text property is read-only, along with the paragraph and document id. This is because the paragraph and document id are not defined per Sentence but across multiple sentences. Instead, these fields can be changed by changing the metadata of the Sentences.
Then come the token annotations. Each sentence is made up of many token lines that provide annotation to the text provided. While a sentence usually means a collection of tokens, in the CoNLL-U sense it is more useful to think of it as a collection of annotations with some associated metadata. Therefore the text of the sentence cannot be changed with this class; only the associated annotations can be changed.
__getitem__(key: str) → pyconll.unit.token.Token
__getitem__(key: int) → pyconll.unit.token.Token
__getitem__(key: slice) → Sequence[pyconll.unit.token.Token]
Return the desired tokens from the Sentence.
- Parameters
key – The indicator for the tokens to return. Can be an integer, a string, or a slice. For an integer, the numeric index of the Token is used. For a string, the id of the Token is used. For a slice, the start and end must be of the same type, either both strings or both integers.
- Returns
If the key is a string or an integer, the matching Token. If the key is a slice, a sequence of Tokens.
__init__(source: str) → None
Construct a Sentence object from the provided CoNLL-U string.
- Parameters
source – The raw CoNLL-U string to parse. Comments must precede token lines.
- Raises
ParseError – If any token is not valid.
__iter__() → Iterator[pyconll.unit.token.Token]
Iterate through all the tokens in the Sentence, including multiword tokens.
__len__() → int
Get the length of this sentence.
- Returns
The number of tokens in this sentence. In the CoNLL-U sense, this includes both the multiword tokens and their decompositions.
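The counting rule above can be sketched on raw CoNLL-U lines without pyconll. The helper row and the example sentence are illustrative assumptions; the point is only that a multiword token line and the lines for its parts each count once.

```python
def row(token_id: str, form: str) -> str:
    """Build a 10-column CoNLL-U token line with unspecified annotations.

    Hypothetical helper for illustration only.
    """
    return '\t'.join([token_id, form] + ['_'] * 8)


lines = [
    '# sent_id = example-1',
    '# text = vámonos',
    row('1-2', 'vámonos'),  # multiword token spanning words 1 and 2
    row('1', 'vamos'),      # first part of the decomposition
    row('2', 'nos'),        # second part of the decomposition
]

# Every non-comment line is a token line and contributes to the length.
token_lines = [line for line in lines if not line.startswith('#')]
length = len(token_lines)

# Excluding multiword ranges leaves only the decomposed words.
words = [line for line in token_lines if '-' not in line.split('\t')[0]]
```

Under this rule the length is larger than the number of words whenever the sentence contains a multiword token.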
conll() → str
Convert the sentence to a CoNLL-U representation.
- Returns
A string representing the Sentence in CoNLL-U format.
- Raises
FormatError – If the Sentence or underlying Tokens cannot be converted to the CoNLL format.
property id
Get the sentence id.
- Returns
The sentence id. If there is none, then returns None.
meta_present(key: str) → bool
Check if the key is present as a singleton or as a pair.
- Parameters
key – The value to check for in the comments.
- Returns
True if the key was provided as a singleton or as a key value pair. False otherwise.
meta_value(key: str) → Optional[str]
Returns the value associated with the key in the metadata (comments).
- Parameters
key – The key whose value to look up.
- Returns
The value associated with the key as a string. If the key is a singleton then None is returned.
- Raises
KeyError – If the key is not present in the comments.
remove_meta(key: str) → None
Remove a metadata element associated with the Sentence.
- Parameters
key – The name of the metadata / comment.
- Raises
KeyError – If the key is not present in the Sentence metadata.
ValueError – If the text key is provided, regardless of presence.
set_meta(key: str, value: Optional[str] = None) → None
Set or add the metadata or comments associated with this Sentence.
- Parameters
key – The key for the comment.
value – The value to associate with the key. If the comment is a singleton, this field can be ignored or set to None.
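The documented behaviour of the four metadata methods can be mirrored by a small dict-backed stand-in. MetaStore is a hypothetical class written for this illustration, not pyconll's Sentence, and it reproduces only the return values and Raises entries listed above.

```python
from typing import Dict, Optional


class MetaStore:
    """Toy stand-in for the comment handling documented above."""

    def __init__(self) -> None:
        # Singleton comments are stored with a value of None.
        self._meta: Dict[str, Optional[str]] = {}

    def meta_present(self, key: str) -> bool:
        # True for both singletons and key-value pairs.
        return key in self._meta

    def meta_value(self, key: str) -> Optional[str]:
        # Raises KeyError if the key is not present.
        return self._meta[key]

    def set_meta(self, key: str, value: Optional[str] = None) -> None:
        # Omitting the value records a singleton comment.
        self._meta[key] = value

    def remove_meta(self, key: str) -> None:
        # The text key is rejected regardless of presence.
        if key == 'text':
            raise ValueError('the text comment cannot be removed')
        # Raises KeyError if the key is not present.
        del self._meta[key]


store = MetaStore()
store.set_meta('newpar')               # singleton comment
store.set_meta('sent_id', 'example-1')  # key-value comment
```

Because a singleton's value is None, meta_present is the way to tell a singleton apart from a missing key; meta_value raises KeyError for the latter.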
property text
Get the continuous text for this sentence. Read-only.
- Returns
The continuous text of this sentence. If none is provided in comments, then None is returned.
to_tree() → pyconll.tree.tree.Tree[pyconll.unit.token.Token]
Creates a Tree data structure from the current sentence.
An empty sentence cannot be converted into a Tree and will throw an exception. The children for a node in the tree are ordered as they appear in the sentence. So the earliest child of a token appears first in the token’s children in the tree.
Each Tree node has a data member that references the actual Token represented by the node. Multiword tokens are not included in the tree since they are more like virtual Tokens and do not participate in any dependency relationships.
- Returns
A constructed Tree that represents the dependency graph of the sentence.
- Raises
ValueError – If the sentence cannot be made into a tree because a token has an empty head value or if there is no root token.
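The construction described for to_tree can be sketched with plain (id, head) pairs. Node and build_tree are hypothetical names, the node data here is just the id string rather than a Token, and the sketch assumes a single root whose head is '0'; it is not pyconll's implementation.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """Minimal tree node holding a data value and ordered children."""

    def __init__(self, data: str) -> None:
        self.data = data
        self.children: List['Node'] = []


def build_tree(tokens: List[Tuple[str, Optional[str]]]) -> Node:
    """Build a dependency tree from (id, head) pairs given in sentence order."""
    # Multiword tokens (ids such as '3-4') are left out of the tree.
    words = [(tid, head) for tid, head in tokens if '-' not in tid]
    nodes: Dict[str, Node] = {tid: Node(tid) for tid, _ in words}
    root: Optional[Node] = None
    for tid, head in words:
        if head is None:
            raise ValueError('token {} has an empty head'.format(tid))
        if head == '0':
            # Assumption: exactly one token has head '0'.
            root = nodes[tid]
        else:
            # Appending in sentence order keeps children in that order.
            nodes[head].children.append(nodes[tid])
    if root is None:
        raise ValueError('no root token found')
    return root


# Token 2 is the root; the multiword token 3-4 carries no head.
tree = build_tree([('1', '2'), ('2', '0'), ('3-4', None), ('3', '2'), ('4', '3')])
```

An empty input falls through to the missing-root check, so it raises ValueError as well, consistent with the note that an empty sentence cannot be converted.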