The Token module represents a CoNLL token annotation. In a CoNLL file, this is a non-empty, non-comment line.
Token members correspond directly to the fields of the Universal Dependencies CoNLL-U definition and, with the exceptions noted below, members are stored as strings. This means ids are strings as well. These fields are: id, form, lemma, upos, xpos, feats, head, deprel, deps, and misc. More information on these is found below.
All fields are strings except for feats, deps, and misc, which are dicts. Each of these fields has specific semantics per the UDv2 guidelines. Since these fields are dicts, modifying them uses Python’s natural syntax for dictionaries.
feats is a key-value mapping from str to set. An example entry would be the key Gender with the value set(('Feminine',)). More features can be added to an existing key by adding to its set, or a new feature can be added by adding a key to the dictionary. All features must have at least one value, so any key with an empty set will raise an error on serialization back to text.
deps is a key-value mapping from str to tuple of cardinality 4. This field represents enhanced dependencies. The key is the index of the token head, and the tuple elements define the enhanced dependency. Most Universal Dependencies treebanks use only 2 of these 4 dimensions: the token index and the relation. See the Universal Dependencies guidelines for more information on these 4 components. When adding new deps, the values must also be tuples of cardinality 4.
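As a minimal sketch of the cardinality rule above, the check below uses a plain dict standing in for token.deps. The helper name check_deps and the use of None to pad unused tuple positions are assumptions made for illustration; they are not taken from the library.

```python
def check_deps(deps):
    """Return the head ids whose value is not a tuple of exactly 4 elements."""
    return [head for head, value in deps.items()
            if not (isinstance(value, tuple) and len(value) == 4)]


# A plain dict standing in for token.deps. Padding the unused
# positions with None is an assumption made for this example.
deps = {'4': ('nsubj', None, None, None)}
deps['2'] = ('conj', 'and')  # too short: only 2 elements

bad_heads = check_deps(deps)  # lists the keys that break the rule
```

Running a check like this before serializing is one way to catch malformed entries early, since the class itself does not validate its data.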
For misc, the documentation only specifies that values be separated by a ‘|’, so not all keys have to have a value. So, values on misc are either None or a set of str. A key with a value of None is output as a singleton, with no separating ‘=’. A key with a corresponding set value will be handled like feats.
Below is an example of adding a new feature to a token, where the key must first be initialized:
token.feats['NewFeature'] = set(('No', ))
or alternatively as:
token.feats['NewFeature'] = set()
token.feats['NewFeature'].add('No')
On the miscellaneous column, adding a singleton field is done with the following line:
token.misc['SingletonFeature'] = None
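The serialization rules described above for feats and misc can be sketched with plain dicts. This is an illustration under stated assumptions, not the library’s implementation: the function name serialize_mapping, its allow_singletons parameter, the use of ValueError, and the sorting of keys and values (chosen here only to make the output deterministic) are all hypothetical.

```python
def serialize_mapping(mapping, allow_singletons=False):
    """Serialize a feats-like or misc-like dict to a CoNLL-U column string."""
    if not mapping:
        return '_'  # CoNLL-U uses an underscore for an unspecified column
    parts = []
    for key in sorted(mapping):
        values = mapping[key]
        if values is None:
            # Singleton key: only meaningful for misc in this sketch.
            if not allow_singletons:
                raise ValueError('key %r has no value' % key)
            parts.append(key)
        elif len(values) == 0:
            # A key with an empty set cannot be written out.
            raise ValueError('key %r has an empty value set' % key)
        else:
            parts.append(key + '=' + ','.join(sorted(values)))
    return '|'.join(parts)


feats = {'Gender': {'Fem'}, 'Number': {'Sing'}}
feats['NewFeature'] = {'No'}
feats_text = serialize_mapping(feats)

misc = {'SpaceAfter': {'No'}, 'SingletonFeature': None}
misc_text = serialize_mapping(misc, allow_singletons=True)
```

Here feats_text is 'Gender=Fem|NewFeature=No|Number=Sing' and misc_text is 'SingletonFeature|SpaceAfter=No', with the singleton written without an ‘=’.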
Defines the Token type and its parsing and output logic. A Token is the basic unit in CoNLL-U, so the data and parsing in this module are central to the CoNLL-U format.
A token in a CoNLL-U file. This consists of 10 columns, each separated by a single tab character and ending in an LF (‘\n’) line break. Each of the 10 column values corresponds to a specific component of the token, such as id, word form, lemma, etc.
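To make the 10-column layout concrete, here is a minimal sketch that splits a token line on tabs and joins it back. The column names follow the field order of the Universal Dependencies CoNLL-U definition; the function name split_token_line and the use of ValueError are assumptions for this example, not the class’s actual parsing code.

```python
COLUMNS = ('id', 'form', 'lemma', 'upos', 'xpos',
           'feats', 'head', 'deprel', 'deps', 'misc')


def split_token_line(line):
    """Split one CoNLL-U token line into a dict of its 10 raw column strings."""
    # Accept the line with or without its trailing LF.
    if line.endswith('\n'):
        line = line[:-1]
    fields = line.split('\t')
    if len(fields) != len(COLUMNS):
        raise ValueError('expected %d columns, got %d'
                         % (len(COLUMNS), len(fields)))
    return dict(zip(COLUMNS, fields))


line = '1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n'
columns = split_token_line(line)

# Joining the raw columns with tabs restores the line without the LF.
rejoined = '\t'.join(columns[name] for name in COLUMNS)
```

The sketch keeps every column as a raw string; turning feats, deps, and misc into dicts would be a separate step.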
This class does not validate formatting on input or output. Invalid input may therefore be parsed without error and written back out, and client changes to a token may produce invalid data that is then output. Properly formatted CoNLL-U will always be accepted on input, and output will work as expected as long as all basic units are strings; the result may just not be valid CoNLL-U.
Also note that the word form for a token is immutable. This is because CoNLL-U is inherently interested in annotation schemes and not in storing sentences.
Construct a Token from the given source line.
A Token line ends in an LF line break according to the CoNLL-U specification. However, this method accepts a line with or without the LF line break.
On parsing, a ‘_’ in the form and lemma is ambiguous and either refers to an empty value or to an actual underscore. The empty parameter flag controls how this situation should be handled.
This method guarantees that valid input is processed properly, but makes no such guarantee for invalid input. Some inputs that do not follow the CoNLL-U specification may still be parsed as expected, so successful parsing is not an indication of validity.
- line – The line that represents the Token in CoNLL-U format.
- empty – A flag controlling whether an underscore in the word form and lemma is assumed to mean an empty value rather than a literal underscore. If both the form and lemma are underscores and empty is set to False (there is no empty assumption), then the form and lemma will be underscores rather than None.
ParseError – On various parsing errors, such as not enough columns or improper column values.
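The effect of the empty flag on an underscore form and lemma can be sketched as below. This covers only the case documented above, where both columns are underscores; the function name resolve_form_and_lemma is hypothetical, and returning None for both values when empty is True is inferred from the parameter description rather than taken from the implementation.

```python
def resolve_form_and_lemma(form, lemma, empty):
    """Apply the documented rule for the case where both columns are '_'."""
    if form == '_' and lemma == '_':
        if empty:
            # Empty assumption: treat both underscores as missing values.
            return None, None
        # No empty assumption: keep both as literal underscores.
        return '_', '_'
    # Other combinations are not covered by the description above,
    # so this sketch leaves them unchanged.
    return form, lemma


as_empty = resolve_form_and_lemma('_', '_', empty=True)
as_literal = resolve_form_and_lemma('_', '_', empty=False)
```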
Convert this Token to its CoNLL-U representation.
A Token’s CoNLL-U representation is a line. Note that this method does not include a newline at the end.
Returns: A string representing the Token in CoNLL-U format.
Provide the word form of this Token. This property is read only.
Returns: The Token form.
Checks if this Token is a multiword token.
Returns: True if this token is a multiword token, and False otherwise.
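In CoNLL-U, a multiword token is indexed with an integer range such as 1-2, while an ordinary word has a single integer id. A minimal sketch of a check based on that convention, assuming ids are strings as described earlier, follows. The function name is_multiword_id is hypothetical, and this is not necessarily how the method is implemented.

```python
def is_multiword_id(token_id):
    """Return True if a CoNLL-U id string denotes a multiword token range."""
    # Multiword tokens use a range id such as '1-2'; ordinary words use
    # a single integer id such as '3'.
    return '-' in token_id


range_result = is_multiword_id('1-2')
word_result = is_multiword_id('3')
```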