token

The Token module represents a CoNLL token annotation. In a CoNLL file, this corresponds to a non-empty, non-comment line. Token members correspond directly to the fields of the Universal Dependencies CoNLL-U definition, and values are stored as strings, so ids are strings as well. The fields are: id, form, lemma, upos, xpos, feats, head, deprel, deps, and misc.

Fields

All fields are strings except for feats, deps, and misc, which are dicts. Each of these fields has specific semantics per the UDv2 guidelines.

Since feats, deps, and misc are dicts, accessing a nonexistent key, for example to add a value to it, results in a KeyError. This means that new values must be added as in a normal dict. For the set-based dicts, feats and specific fields of misc, the new key must first be assigned an empty set. More details on this below.

feats

feats is a key-value mapping from str to set. Note that any key with an empty set will raise an error, as every key must have at least one feature.
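
As a rough illustration of the rule above, the sketch below renders a dict of str to set in the FEATS column syntax and rejects keys whose set is empty. The helper serialize_feats is hypothetical and is not part of pyconll, and ValueError is only a stand-in for whatever error the library actually raises.

```python
def serialize_feats(feats):
    """Render a dict of str -> set of str as a FEATS column value.

    Hypothetical helper for illustration; not pyconll's implementation.
    """
    if not feats:
        return '_'  # CoNLL-U marks an empty column with an underscore

    parts = []
    for key in sorted(feats, key=str.lower):
        values = feats[key]
        if not values:
            # Assumption: ValueError stands in for the library's own error type.
            raise ValueError(f'feature {key!r} has no values')
        parts.append(f"{key}={','.join(sorted(values, key=str.lower))}")
    return '|'.join(parts)


feats = {'Number': {'Sing'}, 'Case': {'Nom', 'Acc'}}
print(serialize_feats(feats))  # Case=Acc,Nom|Number=Sing
```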

deps

deps is a key-value mapping from str to tuple of cardinality 4. Most Universal Dependencies treebanks use only 2 of these 4 dimensions: the token index and the relation. See the Universal Dependencies guidelines for more information on these 4 components. When adding new deps, the values must also be tuples of cardinality 4. Note that deps parsing is broken before version 1.0.
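
The cardinality requirement can be sketched with a plain dict. The exact layout of the four slots is not spelled out here, so the layout below (head index as the key, relation first, unused slots as None) is an assumption, and pad_relation is a hypothetical helper rather than a pyconll function.

```python
def pad_relation(*parts, size=4):
    """Pad the given relation parts with None up to a fixed tuple size.

    Hypothetical helper; the slot layout is an assumption for illustration.
    """
    if len(parts) > size:
        raise ValueError(f'expected at most {size} parts, got {len(parts)}')
    return tuple(parts) + (None,) * (size - len(parts))


deps = {}  # plain-dict stand-in for token.deps
deps['2'] = pad_relation('nsubj')
deps['5'] = pad_relation('conj', 'and')

# Every value keeps cardinality 4, whatever number of parts was supplied.
assert all(len(value) == 4 for value in deps.values())
```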

misc

Lastly, for misc, the documentation only specifies that the values are separated by a ‘|’, so not every component has to have a value. The values in misc are therefore either None, for entries with no ‘=’, or a set of str. A key with a value of None is output as a singleton, that is, the key alone with no ‘=’.
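
A minimal sketch of this mapping, assuming that multiple values for one key are comma separated, could look as follows. Both parse_misc and serialize_misc are hypothetical names used only for illustration and do not reflect pyconll's internals.

```python
def parse_misc(column):
    """Parse a MISC column into a dict of str -> set of str, or None."""
    if column == '_':
        return {}

    misc = {}
    for entry in column.split('|'):
        if '=' in entry:
            key, raw = entry.split('=', 1)
            # Assumption: several values for one key are comma separated.
            misc[key] = set(raw.split(','))
        else:
            misc[entry] = None  # singleton entry with no '='
    return misc


def serialize_misc(misc):
    """Render the dict back into MISC column syntax."""
    if not misc:
        return '_'

    parts = []
    for key, values in misc.items():
        if values is None:
            parts.append(key)  # output the key alone
        else:
            parts.append(f"{key}={','.join(sorted(values))}")
    return '|'.join(parts)


misc = parse_misc('SpaceAfter=No|Independent')
print(misc['SpaceAfter'], misc['Independent'])
print(serialize_misc(misc))
```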

Example

Below is an example of adding a new feature to a token, where the key must first be initialized:

token.feats['NewFeature'] = set(('No', ))

or alternatively as:

token.feats['NewFeature'] = set()
token.feats['NewFeature'].add('No')
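
The initialization requirement can also be seen with a plain dict standing in for token.feats, which is an assumption made so that the snippet runs without a parsed token. It also shows dict.setdefault, a standard dict method that creates the set only when the key is missing.

```python
feats = {'Number': {'Sing'}}  # plain-dict stand-in for token.feats

try:
    feats['NewFeature'].add('No')  # the key does not exist yet
except KeyError:
    print('NewFeature must be initialized first')

# setdefault inserts an empty set only if the key is absent, then returns it.
feats.setdefault('NewFeature', set()).add('No')
feats.setdefault('Number', set()).add('Plur')
```

Because the value is added right after the set is created, no key is left holding an empty set.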

API

Defines the Token type and the associated parsing and output logic.

class pyconll.unit.token.Token(source, empty=True, _line_number=None)

A token in a CoNLL-U file. This consists of 10 columns, each separated by a single tab character and ending in an LF (‘\n’) line break. Each of the 10 column values corresponds to a specific component of the token, such as id, word form, lemma, etc.

This class does not do any formatting validation on input or output. This means that invalid input may be processed and then output without complaint, and that client changes to the token may produce invalid data that can then be output. Properly formatted CoNLL-U will always work on input, and output will work as expected as long as all basic units are strings, although the result may not be proper CoNLL-U.

Also note that the word form for a token is immutable. This is because CoNLL-U is inherently interested in annotation schemes and not storing sentences.

__eq__(other)

Test if this Token is equal to other.

Parameters:other – The other token to compare against.
Returns:True if this Token and the other are the same. Two tokens are considered the same when all columns are the same.
__init__(source, empty=True, _line_number=None)

Construct the token from the given source.

A Token line must end in an LF line break according to the specification. However, this method will accept a line with or without this ending line break.

Further, a ‘_’ that appears in the form and lemma is ambiguous and can refer either to an empty value or to an actual underscore. The empty flag allows for control over this if it is known from outside information. If the token is a multiword token, all fields except for form should be empty.

Note that no validation is done on input. Valid input will be processed properly, but there is no guarantee as to invalid input that does not follow the CoNLL-U specifications.

Parameters:
  • source – The line that represents the Token in CoNLL-U format.
  • empty – A flag to signify whether a ‘_’ in the word form and lemma can be assumed to mean an empty value rather than a literal underscore. The form and lemma are kept as underscores, and not assigned None, only when both are ‘_’ and empty is False.
  • _line_number – The line number for this Token in a CoNLL-U file. For internal use mostly.
Raises:

ParseError – If the provided source is not composed of 10 tab separated columns.
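
The column check described above can be sketched as follows. This is not the constructor's actual code: split_columns and ColumnCountError are hypothetical names, with ColumnCountError standing in for the library's ParseError, and the sample line is made up for illustration.

```python
class ColumnCountError(ValueError):
    """Hypothetical stand-in for the library's ParseError."""


FIELDS = ('id', 'form', 'lemma', 'upos', 'xpos',
          'feats', 'head', 'deprel', 'deps', 'misc')


def split_columns(source):
    """Split one token line into its 10 tab-separated columns."""
    # Accept the line with or without its ending LF line break.
    line = source[:-1] if source.endswith('\n') else source
    columns = line.split('\t')
    if len(columns) != len(FIELDS):
        raise ColumnCountError(
            f'expected {len(FIELDS)} columns, found {len(columns)}')
    return dict(zip(FIELDS, columns))


line = '1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n'
columns = split_columns(line)
print(columns['form'], columns['head'])  # The 2
```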

conll()

Convert Token to the CoNLL-U representation.

Note that this does not include a newline at the end.

Returns:A string representing the token as a line in a CoNLL-U file.
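
Since the returned string has no trailing newline, the caller adds the line break when writing to a file. The sketch below illustrates this with a hypothetical to_conll_line helper, not the method itself, and assumes that empty columns are written as an underscore.

```python
import io


def to_conll_line(columns):
    """Join 10 column values with tabs, without a trailing newline.

    Hypothetical helper; None is assumed to mean an empty column.
    """
    return '\t'.join('_' if column is None else column for column in columns)


columns = ['1', 'The', 'the', 'DET', 'DT',
           'Definite=Def|PronType=Art', '2', 'det', None, None]
line = to_conll_line(columns)

# The caller is responsible for the LF line break when writing.
buffer = io.StringIO()
buffer.write(line + '\n')
```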
form

Provide the word form of this Token. This property is read-only.

Returns:The Token word form.
is_multiword()

Checks if this token is a multiword token.

Returns:True if this token is a multiword token, and False otherwise.
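
In CoNLL-U, a multiword token is identified by an id that is a range such as 1-2, which is one reason ids are kept as strings. A minimal check on the id string, using a hypothetical is_multiword_id function that is not necessarily how the method is implemented, could look like this:

```python
def is_multiword_id(token_id):
    """Return True if the id string is a range such as '1-2'."""
    return '-' in token_id


for token_id in ('1', '1-2', '8.1'):
    print(token_id, is_multiword_id(token_id))
```

Here 8.1 is the id form used for empty nodes, which is not a range and so is not treated as a multiword token.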