token

The Token module represents a single token (multiword or otherwise) in a CoNLL-U file. In text, this corresponds to one non-empty, non-comment line. Token has several members that correspond to the columns of that line. Most values are stored as strings; in particular, ids are strings and not numeric. The fields are listed below and correspond exactly to those found in the Universal Dependencies project:

id form lemma upos xpos feats head deprel deps misc

Currently, all fields are strings except for feats, deps, and misc, which are dicts. Each of these has specific semantics according to the UDv2 guidelines. feats is a dictionary of attribute-value pairs in which an attribute can have multiple values, so each value in feats is a set. deps is also a dictionary of attribute-value pairs, but each attribute has only one value, so the values are strings. Lastly, for misc, the documentation only specifies that entries are separated by a ‘|’. For this reason, an entry with no ‘=’ is stored as a key whose value is None, and any other entry is stored as an attribute-value pair.
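The column semantics above can be illustrated with a small standalone sketch. The helpers parse_feats, parse_deps, and parse_misc are hypothetical names used only for illustration and are not pyconll functions. The sketch assumes the usual UD conventions: ‘|’ between entries, ‘=’ between attribute and value in feats and misc, ‘,’ between multiple feats values, ‘:’ between head and relation in deps, and ‘_’ for an empty column. Storing a misc value as a plain string is a simplification made here.

```python
def parse_feats(column):
    """Parse a FEATS column such as 'Case=Nom|Number=Sing,Plur'."""
    if column == "_":  # assumption: '_' marks an empty column
        return {}
    feats = {}
    for pair in column.split("|"):
        attr, values = pair.split("=", 1)
        # An attribute may carry several comma-separated values, so use a set.
        feats[attr] = set(values.split(","))
    return feats


def parse_deps(column):
    """Parse a DEPS column such as '2:nsubj|4:nsubj:pass'."""
    if column == "_":
        return {}
    deps = {}
    for pair in column.split("|"):
        # Split on the first ':' only, since relations may contain ':'.
        head, relation = pair.split(":", 1)
        deps[head] = relation
    return deps


def parse_misc(column):
    """Parse a MISC column; entries without '=' map to None."""
    if column == "_":
        return {}
    misc = {}
    for entry in column.split("|"):
        if "=" in entry:
            attr, value = entry.split("=", 1)
            misc[attr] = value
        else:
            misc[entry] = None
    return misc
```

Note that the keys of deps stay strings, in line with ids being stored as strings.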

To use this class explicitly, you can import pyconll.unit and refer to pyconll.unit.Token, or use the from ... import ... syntax.

API

class pyconll.unit.token.Token(source, empty=True, _line_number=None)[source]

A token in a CoNLL-U file. This consists of 10 columns, each separated by a single tab character and ending in an LF (‘\n’) line break. Each of the 10 column values corresponds to a specific component of the token, such as id, word form, lemma, etc.
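As a rough illustration of the line layout described above, the following sketch splits a raw line into its ten named columns. The names FIELDS and split_columns are hypothetical and not part of pyconll; the sketch only mirrors the documented behavior of accepting an optional trailing LF and raising ValueError when the column count is wrong.

```python
# Column names in the order they appear on a CoNLL-U token line.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def split_columns(line):
    """Split one token line into a dict keyed by column name."""
    # Accept the line with or without its trailing LF.
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} tab-separated columns, got {len(columns)}"
        )
    return dict(zip(FIELDS, columns))
```

All values remain strings at this stage; any further interpretation of a column happens separately.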

This class does not validate formatting on input or output. Invalid input may therefore be processed and then written back out, and client changes to a token may produce invalid data that is then output. Properly formatted CoNLL-U will always work on input, and output will work as expected as long as all basic units are strings, although the result may not be proper CoNLL-U.

Also note that the word form for a token is immutable. This is because CoNLL-U is inherently interested in annotation schemes and not storing sentences.

__eq__(other)[source]

Test if this Token is equal to other.

Args: other: The other token to compare against.

Returns: True if this Token and the other are the same. Two tokens are considered the same when all columns are the same.

__init__(source, empty=True, _line_number=None)[source]

Construct the token from the given source.

A Token line must end in an LF line break according to the specification. However, this method will accept a line with or without this ending line break.

Further, a ‘_’ that appears in the form and lemma is ambiguous and can refer either to an empty value or to an actual underscore. The empty flag allows control over this when the answer is known from outside information. If the token is a multiword token, all fields except for form should be empty.

Note that no validation is done on input. Valid input will be processed properly, but there is no guarantee as to invalid input that does not follow the CoNLL-U specifications.

Args:
source: The line that represents the Token in CoNLL-U format.
empty: A flag to signify if the word form and lemma can be assumed to be empty and not the token signifying empty. Only if the form and lemma are both the same token as empty and there is no empty assumption will they not be assigned to None.
_line_number: The line number for this Token in a CoNLL-U file. For internal use mostly.

Raises: ValueError if the provided source is not composed of 10 tab separated columns.
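The handling of the empty flag can be sketched as follows. This is one reading of the description above, not the library's code; resolve_form_and_lemma is a hypothetical helper name, and ‘_’ is taken to be the token signifying empty.

```python
def resolve_form_and_lemma(form, lemma, empty=True):
    """Decide whether '_' means 'no value' or a literal underscore."""
    both_underscore = form == "_" and lemma == "_"
    if both_underscore and not empty:
        # No empty assumption and both columns are '_':
        # keep them as actual underscores.
        return form, lemma
    # Otherwise a lone '_' is treated as an empty value.
    return (
        None if form == "_" else form,
        None if lemma == "_" else lemma,
    )
```

With the default empty=True, a line whose form and lemma are both ‘_’ yields None for both; passing empty=False keeps them as literal underscores.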

conll()[source]

Convert Token to the CoNLL-U representation.

Note that this does not include a newline at the end.

Returns: A string representing the token as a line in a CoNLL-U file.
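A minimal sketch of this kind of serialization, assuming the ten column values are already available as strings (or None for an empty column). The function to_conll_line is a hypothetical stand-in and not the method itself; mapping None back to ‘_’ is an assumption made for the example.

```python
def to_conll_line(columns):
    """Join ten column values into one tab-separated line, no newline."""
    if len(columns) != 10:
        raise ValueError("a token line needs exactly 10 columns")
    # Assumption: None stands for an empty column and is written as '_'.
    return "\t".join("_" if value is None else value for value in columns)
```

Because the result has no trailing newline, a caller writing several tokens to a file would add the LF between lines itself.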

form

Provide the word form of this Token. The form is exposed as a property so that it is read-only.

Returns: The Token word form.
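The read-only behavior can be pictured with a generic Python pattern: a property defined without a setter. The class ReadOnlyForm below is purely illustrative and is not how Token is necessarily written.

```python
class ReadOnlyForm:
    """Toy class whose word form cannot be reassigned after creation."""

    def __init__(self, form):
        self._form = form

    @property
    def form(self):
        # No setter is defined, so assigning to .form raises AttributeError.
        return self._form
```

Reading the form works as usual, while an assignment such as obj.form = "cat" fails with AttributeError.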

is_multiword()[source]

Checks if this token is a multiword token.

Returns: True if this token is a multiword token, and False otherwise.
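In CoNLL-U, a multiword token is identified by a range id such as 1-2, while ordinary words use integer ids and empty nodes use decimal ids such as 3.1. A check along these lines could look like the sketch below; is_multiword_id is a hypothetical helper operating on the id string, and the actual method may be implemented differently.

```python
def is_multiword_id(token_id):
    """Return True when the id is a range such as '1-2'."""
    # Ids are strings, so a range is recognized by its hyphen.
    return "-" in token_id
```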