token¶
The Token module represents a single token (multiword or otherwise) in a CoNLL-U file. In text, this corresponds to one non-empty, non-comment line. Token has several members that correspond to the columns of that line. Simple values are stored as strings, so ids are strings and not numeric. The fields are listed below and correspond exactly to those found in the Universal Dependencies project:
id form lemma upos xpos feats head deprel deps misc
Fields¶
Currently, all fields are strings except for feats, deps, and misc, which are dicts. There are specific semantics for each of these according to the UDv2 guidelines. Again, the current approach is for these fields to be dicts as described below rather than providing an extra interface for these fields.
Since all of these fields are dicts, modifying the value of a nonexistent key results in a KeyError. This means that new values must be added as in a normal dict. For set-based dicts, that is feats and specific fields of misc, the new key must first be assigned an empty set. More details on this below.
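The dict behavior described above can be illustrated with a plain Python dict standing in for token.feats. This is a minimal sketch that deliberately does not import pyconll, and the feature names are only examples.

```python
# A plain dict of str -> set, standing in for token.feats (assumption:
# no pyconll import; this only mirrors the semantics described above).
feats = {"Number": {"Sing"}}

# Modifying the value of a key that does not exist raises KeyError.
try:
    feats["Case"].add("Nom")
    raised = False
except KeyError:
    raised = True

# The key has to be created with an empty set first, then filled.
feats["Case"] = set()
feats["Case"].add("Nom")
```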
feats¶
feats is a dictionary of attribute-value pairs in which an attribute can have multiple values, so each value in feats is a set when parsed. The keys are str and the values are set. Do not assign a str or any other type as a value. Note that any key with an empty set will not be output.
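As a rough illustration of the dict-of-sets shape and of the rule that empty sets are not output, the sketch below parses and re-serializes a FEATS column by hand. The helpers parse_feats and format_feats are hypothetical names written for this example and are not part of pyconll; sorting keys and values is an assumption made to keep the output deterministic.

```python
def parse_feats(field):
    """Parse a FEATS column such as 'Case=Nom|Number=Sing' into a dict of sets."""
    if field == "_":
        return {}
    feats = {}
    for pair in field.split("|"):
        key, values = pair.split("=", 1)
        feats[key] = set(values.split(","))
    return feats


def format_feats(feats):
    """Serialize a dict of sets, skipping keys whose set is empty."""
    parts = [
        key + "=" + ",".join(sorted(feats[key]))
        for key in sorted(feats)
        if feats[key]
    ]
    return "|".join(parts) if parts else "_"


feats = parse_feats("Case=Nom|Number=Sing")
feats["Gender"] = set()        # empty set: dropped on output
feats["Number"].add("Plur")    # several values for one attribute
line = format_feats(feats)
```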
deps¶
deps is also a dictionary of attribute-value pairs, where the values are tuples of cardinality 4. Most Universal Dependencies treebanks use only a token index and a relation in deps, but according to the documentation, there are up to 4 components in this field, not including the token index. Note that this fixed parsing was introduced in version 1.0 and is not backward compatible. When adding new deps, the values should therefore also be 4-tuples.
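The 4-tuple shape can be sketched with a hand-written parser for a DEPS column. The helper parse_deps is a hypothetical name and is not pyconll's implementation; padding missing components with None, and dropping anything beyond four components, are assumptions made for illustration.

```python
def parse_deps(field):
    """Parse a DEPS column such as '4:nsubj|11:conj:and' into a dict of 4-tuples."""
    if field == "_":
        return {}
    deps = {}
    for entry in field.split("|"):
        head, *parts = entry.split(":")
        # Pad to exactly four components (None padding is an assumption).
        padded = parts + [None] * (4 - len(parts))
        deps[head] = tuple(padded[:4])
    return deps


deps = parse_deps("4:nsubj|11:conj:and")

# A new entry added by hand should also be a 4-tuple.
deps["2"] = ("obj", None, None, None)
```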
misc¶
Lastly, for misc, the documentation only specifies that the values are separated by a '|'. A value can therefore be either an attribute-values pair, as in feats, or a single value. For this reason, the value for a key in misc is None for entries with no '=', and a set of str otherwise. A key with a value of None is output as a singleton, while a key with an empty set is not output, as with feats.
When adding a new key, the key must first be initialized manually, like so:
token.misc['NewFeature'] = set(('No', ))
or alternatively:
token.misc['NewFeature'] = set()
token.misc['NewFeature'].add('No')
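The output rules for misc described above (None becomes a singleton, a non-empty set becomes key=values, an empty set is omitted) can be sketched with a plain dict. The helper format_misc is a hypothetical name, not a pyconll function, and the sorted ordering is an assumption for deterministic output.

```python
def format_misc(misc):
    """Serialize a misc-style dict whose values are None or sets of str."""
    parts = []
    for key in sorted(misc):
        value = misc[key]
        if value is None:
            # No '=' in the original entry: output the key alone.
            parts.append(key)
        elif value:
            parts.append(key + "=" + ",".join(sorted(value)))
        # An empty set falls through and is not output.
    return "|".join(parts) if parts else "_"


misc = {"SpaceAfter": {"No"}, "Flag": None, "Empty": set()}
out = format_misc(misc)
```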
API¶
class pyconll.unit.token.Token(source, empty=True, _line_number=None)
A token in a CoNLL-U file. This consists of 10 columns, each separated by a single tab character and ending in an LF ('\n') line break. Each of the 10 column values corresponds to a specific component of the token, such as id, word form, lemma, etc.
This class does not validate formatting on input or output. Invalid input may therefore be processed and output without error, and client changes to a token may produce invalid data that is then output. Properly formatted CoNLL-U will always be accepted on input, and output will work as expected as long as all basic units are strings, although the result may not be proper CoNLL-U.
Also note that the word form for a token is immutable. This is because CoNLL-U is inherently interested in annotation schemes and not storing sentences.
__eq__(other)
Test if this Token is equal to other.
Parameters: other – The other token to compare against.
Returns: True if this Token and the other are the same. Two tokens are considered the same when all columns are the same.
__init__(source, empty=True, _line_number=None)
Construct the token from the given source.
A Token line must end in an LF line break according to the specification. However, this method will accept a line with or without this ending line break.
Further, a '_' that appears in the form and lemma is ambiguous and can refer either to an empty value or to an actual underscore. The empty flag allows control over this when the answer is known from outside information. If the token is a multiword token, all fields except for form should be empty.
Note that no validation is done on input. Valid input will be processed properly, but there is no guarantee for invalid input that does not follow the CoNLL-U specification.
Parameters:
- source – The line that represents the Token in CoNLL-U format.
- empty – A flag that signifies whether a form and lemma equal to the empty token ('_') can be assumed to be empty rather than literal underscores. The form and lemma are left unassigned to None only if both equal the empty token and no empty assumption is made.
- _line_number – The line number for this Token in a CoNLL-U file. Mostly for internal use.
Raises: ParseError – If the provided source is not composed of 10 tab-separated columns.
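The constructor behavior described above can be approximated in a few lines. This is a sketch of one reading of the parameter description, not pyconll's actual code: read_form_and_lemma is a hypothetical helper, and ValueError stands in for ParseError so that the example stays self-contained.

```python
def read_form_and_lemma(source, empty=True):
    """Split a token line and resolve the '_' ambiguity for form and lemma."""
    # Accept the line with or without the trailing LF.
    if source.endswith("\n"):
        source = source[:-1]

    fields = source.split("\t")
    if len(fields) != 10:
        # Stand-in for ParseError (assumption for this sketch).
        raise ValueError("expected 10 tab-separated columns")

    form, lemma = fields[1], fields[2]
    if form == "_" and lemma == "_" and not empty:
        # No empty assumption: keep both as literal underscores.
        return form, lemma

    # Otherwise a lone '_' is treated as an empty value.
    form = None if form == "_" else form
    lemma = None if lemma == "_" else lemma
    return form, lemma


line = "1\t_\t_\tPUNCT\t_\t_\t0\troot\t_\t_\n"
as_empty = read_form_and_lemma(line, empty=True)
as_literal = read_form_and_lemma(line, empty=False)
```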
conll()
Convert Token to the CoNLL-U representation.
Note that this does not include a newline at the end.
Returns: A string representing the token as a line in a CoNLL-U file.
form
Provide the word form of this Token. This property makes it read-only.
Returns: The Token word form.