token

The Token module represents a single token (multiword or otherwise) in a CoNLL-U file. In text, this corresponds to one non-empty, non-comment line. Token has several members that correspond to the columns of that line. Simple column values are stored as strings, so ids are strings and not numeric. The fields are listed below and correspond exactly to those found in the Universal Dependencies project:

id form lemma upos xpos feats head deprel deps misc

Fields

Currently, all fields are strings except for feats, deps, and misc, which are dicts. Each of these has specific semantics according to the UDv2 guidelines. The current approach is to expose these fields directly as dicts, as described below, rather than to provide an extra interface for them.

Since these three fields are dicts, modifying the value of a nonexistent key raises a KeyError. New keys must therefore be added as in a normal dict. For the set-based dicts (feats, and the attribute-value entries of misc), a new key must first be assigned a set, such as an empty one, before values are added to it. More details on this below.

feats

feats is a dictionary of attribute-value pairs, where an attribute can have multiple values. The keys are str and each value is a set of str. Do not assign a str or any other type as a value. Note that keys with empty sets are not output.
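The dict-of-sets shape described above can be sketched with plain Python objects. This is a minimal illustration, not the library's own code: format_feats is a hypothetical helper, and the alphabetical ordering of keys and values is an assumption made here so that the output is deterministic.

```python
def format_feats(feats):
    """Hypothetical helper: join a feats-style dict into one column string."""
    parts = []
    for key in sorted(feats, key=str.lower):  # ordering is an assumption
        values = feats[key]
        if not values:
            continue  # keys with empty sets are not output
        parts.append(key + "=" + ",".join(sorted(values, key=str.lower)))
    return "|".join(parts) if parts else "_"


# A stand-in for token.feats: str keys, set-of-str values.
feats = {"Case": {"Nom"}, "Number": {"Sing"}}

# A new key must be assigned a set before values are added to it.
feats["Gender"] = set()
feats["Gender"].add("Fem")

# A key left with an empty set is skipped on output.
feats["Tense"] = set()

# Adding to a key that was never assigned raises KeyError, as with any dict.
try:
    feats["Mood"].add("Ind")
except KeyError:
    missing_key_failed = True
else:
    missing_key_failed = False

feats_column = format_feats(feats)
print(feats_column)
```

The same pattern applies to token.feats on a parsed Token: assign a set first, then add values to it.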

deps

deps is also a dictionary of attribute-value pairs, where the values are tuples of cardinality 4. Most Universal Dependencies treebanks use only a token index and a relation in deps, but according to the documentation, this field can have up to 4 components, not including the token index. Note that this fixed-size parsing was introduced in version 1.0 and is not backward compatible. When adding new deps, the values should therefore also be 4-tuples.
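As a rough sketch of the 4-tuple requirement, the helper below builds fixed-size values for a deps-style dict. make_dep is hypothetical, and padding unused components with None is an assumption of this sketch; check the library's behavior before relying on a particular filler value.

```python
def make_dep(*components):
    """Hypothetical helper: pad 1 to 4 relation components to a 4-tuple."""
    if not 1 <= len(components) <= 4:
        raise ValueError("expected between 1 and 4 components")
    # Assumption: unused trailing components are represented by None.
    return tuple(components) + (None,) * (4 - len(components))


# A stand-in for token.deps: the key is the head token index as a str,
# the value is always a tuple of exactly 4 components.
deps = {"2": make_dep("nsubj")}
deps["5"] = make_dep("obl", "in")

for head, relation in deps.items():
    print(head, relation)
```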

misc

Lastly, for misc, the documentation only specifies that the values are separated by a ‘|’. An entry can therefore be either an attribute-value pair, as in feats, or a single value. For this reason, the value stored for a misc key is None for entries with no ‘=’, and a set of str otherwise. A key with a value of None is output as a singleton, while a key with an empty set is not output, as with feats.

When adding a new key, the key must first be initialized manually as follows:

token.misc['NewFeature'] = set(('No', ))

or alternatively as:

token.misc['NewFeature'] = set()
token.misc['NewFeature'].add('No')
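The output rules for misc described above (None becomes a singleton, an empty set is dropped) can be sketched with a plain dict. format_misc is a hypothetical helper written for this illustration, and the alphabetical ordering is an assumption, not documented behavior.

```python
def format_misc(misc):
    """Hypothetical helper: join a misc-style dict into one column string."""
    parts = []
    for key in sorted(misc, key=str.lower):  # ordering is an assumption
        value = misc[key]
        if value is None:
            parts.append(key)  # entry with no '=' is output as a singleton
        elif value:
            parts.append(key + "=" + ",".join(sorted(value, key=str.lower)))
        # a key with an empty set is not output
    return "|".join(parts) if parts else "_"


# A stand-in for token.misc: values are either None or a set of str.
misc = {"SpaceAfter": {"No"}, "Foreign": None}

# Initialize the new key with a set, then add to it.
misc["NewFeature"] = set()
misc["NewFeature"].add("No")

# This key is dropped on output because its set is empty.
misc["Unused"] = set()

misc_column = format_misc(misc)
print(misc_column)
```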

API

class pyconll.unit.token.Token(source, empty=True, _line_number=None)[source]

A token in a CoNLL-U file. This consists of 10 columns, each separated by a single tab character and ending in an LF (‘\n’) line break. Each of the 10 column values corresponds to a specific component of the token, such as id, word form, lemma, etc.

This class does not validate formatting on input or output. Invalid input may therefore be processed and output without error, and client changes to a token may produce invalid data that is then output. Properly formatted CoNLL-U will always be read correctly, and as long as all basic units are strings, output will work as expected, although the result may not be proper CoNLL-U.

Also note that the word form for a token is immutable. This is because CoNLL-U is inherently interested in annotation schemes and not storing sentences.

__eq__(other)[source]

Test if this Token is equal to other.

Parameters:other – The other token to compare against.
Returns:True if this Token and the other are the same. Two tokens are considered the same when all columns are the same.
__init__(source, empty=True, _line_number=None)[source]

Construct the token from the given source.

A Token line must end in an LF line break according to the specification. However, this method will accept a line with or without this ending line break.

Further, a ‘_’ that appears in the form and lemma is ambiguous and can refer either to an empty value or to an actual underscore. The flag empty allows for control over this if it is known from outside information. If the token is a multiword token, all fields except for form should be empty.

Note that no validation is done on input. Valid input will be processed properly, but there is no guarantee as to invalid input that does not follow the CoNLL-U specifications.

Parameters:
  • source – The line that represents the Token in CoNLL-U format.
  • empty – A flag signifying whether a ‘_’ in the word form and lemma can be assumed to be an empty value rather than a literal underscore. Only when the form and lemma are both ‘_’ and there is no empty assumption are they kept as underscores instead of being assigned None.
  • _line_number – The line number for this Token in a CoNLL-U file. For internal use mostly.
Raises:

ParseError – If the provided source is not composed of 10 tab separated columns.
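The column handling described above can be sketched without the library. The function below is a minimal illustration, not the actual constructor: split_columns and ColumnCountError are hypothetical names, with ColumnCountError standing in for ParseError. It accepts a line with or without the trailing LF and rejects anything that is not 10 tab-separated columns.

```python
FIELD_NAMES = ("id", "form", "lemma", "upos", "xpos",
               "feats", "head", "deprel", "deps", "misc")


class ColumnCountError(ValueError):
    """Hypothetical stand-in for the ParseError mentioned above."""


def split_columns(source):
    """Split one token line into its 10 raw column strings."""
    # Accept the line with or without the ending LF line break.
    line = source[:-1] if source.endswith("\n") else source
    columns = line.split("\t")
    if len(columns) != len(FIELD_NAMES):
        raise ColumnCountError(
            "expected 10 tab-separated columns, got %d" % len(columns))
    return dict(zip(FIELD_NAMES, columns))


line = "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n"
fields = split_columns(line)
print(fields["id"], fields["form"], fields["upos"])
```

Note that every value in fields is still a raw string at this point; turning feats, deps, and misc into dicts is a separate step that this sketch leaves out.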

conll()[source]

Convert Token to the CoNLL-U representation.

Note that this does not include a newline at the end.

Returns:A string representing the token as a line in a CoNLL-U file.
form

Provide the word form of this Token. The form is exposed as a read-only property.

Returns:The Token word form.
is_multiword()[source]

Checks if this token is a multiword token.

Returns:True if this token is a multiword token, and False otherwise.
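In CoNLL-U, a multiword token carries a range id such as 1-2, while ordinary words use plain integers and empty nodes use decimal ids such as 3.1. A check along those lines can be sketched as follows; looks_multiword is a hypothetical function for illustration, and the method's actual implementation may differ.

```python
def looks_multiword(token_id):
    """Hypothetical check: range ids such as '1-2' mark multiword tokens."""
    # Ids are strings, so this is a substring test, not arithmetic.
    return "-" in token_id


ids = ["1-2", "1", "2", "3.1"]
flags = [looks_multiword(token_id) for token_id in ids]
print(flags)
```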