token¶
The Token module represents a CoNLL token annotation. In a CoNLL file, this is a non-empty, non-comment line. Token members correspond directly with the Universal Dependencies CoNLL definition, and the scalar members are stored as strings. This means ids are strings as well. These fields are: id, form, lemma, upos, xpos, feats, head, deprel, deps, misc. More information on these is found below.
Fields¶
All fields are optional strings except for feats, deps, and misc, which are dicts. As optional strings, they can be either None or a string value. Fields which are dictionaries have specific semantics per the UDv2 guidelines. Since these fields are dicts, modifying them uses Python’s natural syntax for dictionaries.
feats¶
feats is a key-value mapping from str to set. An example entry would be the key Gender with the value set(('Feminine',)). More features can be added to an existing key by adding to its set, or a new feature can be added by adding a new key to the dictionary. All features must have at least one value, so any key with an empty set will raise an error on serialization back to text.
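The rules above can be illustrated without the library. The following is a minimal sketch, not pyconll's own implementation: `serialize_feats` is a hypothetical helper, the case-insensitive sort order is an assumption, and `ValueError` stands in for whatever error the library raises.

```python
def serialize_feats(feats):
    """Render a dict of str -> set of str as a FEATS-style column value.

    Hypothetical helper for illustration; not part of pyconll.
    """
    if not feats:
        return '_'  # CoNLL-U writes an underscore for an empty column
    parts = []
    for key in sorted(feats, key=str.lower):  # sort order is an assumption
        values = feats[key]
        if not values:
            # Every feature needs at least one value.
            raise ValueError(f'feature {key!r} has no values')
        joined = ','.join(sorted(values, key=str.lower))
        parts.append(f'{key}={joined}')
    return '|'.join(parts)


feats = {'Gender': {'Fem'}}
feats['Gender'].add('Masc')   # add a value to an existing key
feats['Number'] = {'Sing'}    # add a new key to the dictionary
print(serialize_feats(feats))  # Gender=Fem,Masc|Number=Sing
```

Leaving a key with an empty set, for example `feats['Case'] = set()`, makes this sketch raise on serialization, mirroring the constraint described above.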
deps¶
deps is a key-value mapping from str to tuple of cardinality 4. This field represents enhanced dependencies. The key is the index of the token head, and the tuple elements define the enhanced dependency. Most Universal Dependencies treebanks use only 2 of these 4 dimensions: the token index and the relation. See the Universal Dependencies guidelines for more information on these 4 components. When adding new deps, the values must also be tuples of cardinality 4.
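As a rough sketch of the cardinality rule, the snippet below guards a plain dict so that only 4-element tuples are stored. `add_dep` is a hypothetical helper, and the tuple layout shown (relation first, unused components set to None) is an assumption made for illustration rather than something confirmed by this page.

```python
def add_dep(deps, head, value):
    """Store an enhanced dependency, insisting on a tuple of exactly 4 elements.

    Hypothetical helper for illustration; not part of pyconll.
    """
    if not isinstance(value, tuple) or len(value) != 4:
        raise ValueError('deps values must be tuples of cardinality 4')
    deps[head] = value  # the key is the head index, stored as a string


deps = {}
# Assumed layout: the relation comes first, unused components are None.
add_dep(deps, '2', ('nsubj', None, None, None))
```

Passing a shorter tuple such as `('nsubj',)` is rejected by this sketch instead of being stored.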
misc¶
For misc, the documentation only specifies that values be separated by a ‘|’, so not all keys have to have a value. Values on misc are therefore either None or a set of str. A key with a value of None is output as a singleton, with no separating ‘=’. A key with a corresponding set value is handled like feats.
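The singleton behavior can be sketched with a plain dict. `serialize_misc` below is a hypothetical helper written for this illustration, not the library's code, and the sort order and the `ValueError` are assumptions.

```python
def serialize_misc(misc):
    """Render a misc-style dict whose values are None or a set of str.

    Hypothetical helper for illustration; not part of pyconll.
    """
    if not misc:
        return '_'  # CoNLL-U writes an underscore for an empty column
    parts = []
    for key in sorted(misc, key=str.lower):  # sort order is an assumption
        values = misc[key]
        if values is None:
            parts.append(key)  # singleton: no separating '='
        elif not values:
            raise ValueError(f'key {key!r} has an empty set')
        else:
            joined = ','.join(sorted(values, key=str.lower))
            parts.append(f'{key}={joined}')
    return '|'.join(parts)


misc = {'SpaceAfter': {'No'}, 'SingletonFeature': None}
print(serialize_misc(misc))  # SingletonFeature|SpaceAfter=No
```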
Examples¶
Below is an example of adding a new feature to a token, where the key must first be initialized:
token.feats['NewFeature'] = set(('No', ))
or alternatively as:
token.feats['NewFeature'] = set()
token.feats['NewFeature'].add('No')
On the miscellaneous column, adding a singleton field is done with the following line:
token.misc['SingletonFeature'] = None
API¶
Defines the Token type and parsing and output logic. A Token is the basic unit in CoNLL-U, and so the data and parsing in this module are central to the CoNLL-U format.
- class pyconll.unit.token.Token(source: str, empty: bool = False)
A token in a CoNLL-U file. This consists of 10 columns, each separated by a single tab character and ending in an LF (‘\n’) line break. Each of the 10 column values corresponds to a specific component of the token, such as id, word form, lemma, etc.
This class does not do any formatting validation on input or output. This means that invalid input may be processed without error and then output, or that client changes to the token may produce invalid data that can then be output. Properly formatted CoNLL-U will always work on input, and as long as all basic units are strings, output will work as expected. The result may just not be proper CoNLL-U.
Also note that the word form for a token is immutable. This is because CoNLL-U is inherently interested in annotation schemes and not storing sentences.
- __init__(source: str, empty: bool = False) → None
Construct a Token from the given source line.
A Token line ends in an LF line break according to the CoNLL-U specification. However, this method accepts a line with or without the LF line break.
On parsing, a ‘_’ in the form and lemma is ambiguous and either refers to an empty value or to an actual underscore. The empty parameter flag controls how this situation should be handled.
This method guarantees that valid input is processed properly, but invalid input may not be parsed properly. Some inputs that do not follow the CoNLL-U specification may still be parsed as expected, so successful parsing is not an indication of validity.
- Parameters
source – The line that represents the Token in CoNLL-U format.
empty – A flag that controls whether an underscore in the word form and lemma is treated as an empty value rather than as a literal underscore. If both the form and lemma are underscores and empty is set to False (there is no empty assumption), then the form and lemma will be underscores rather than None.
- Raises
ParseError – On various parsing errors, such as not enough columns or improper column values.
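To make the constructor's input handling concrete, here is a minimal sketch of the steps described above: accept the line with or without a trailing LF, split on tabs, and resolve the ambiguous underscores using the empty flag. `split_token_line` is a hypothetical function, `ValueError` stands in for ParseError, and only the case where both form and lemma are underscores is handled. Other combinations are left untouched, which is a simplification.

```python
def split_token_line(source, empty=False):
    """Split a CoNLL-U token line and resolve '_' in the form and lemma.

    Hypothetical sketch for illustration; not pyconll's parser.
    Returns the id, form and lemma columns.
    """
    line = source.rstrip('\n')  # accept the line with or without the LF
    columns = line.split('\t')
    if len(columns) != 10:
        # Stand-in for the library's ParseError.
        raise ValueError(f'expected 10 columns, found {len(columns)}')
    token_id, form, lemma = columns[0], columns[1], columns[2]
    if form == '_' and lemma == '_' and empty:
        # With the empty assumption, the underscores mean "no value".
        form, lemma = None, None
    # Without the assumption, the underscores are kept as literal strings.
    return token_id, form, lemma


line = '3\t_\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_\n'
print(split_token_line(line))              # ('3', '_', '_')
print(split_token_line(line, empty=True))  # ('3', None, None)
```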
- conll() → str
Convert this Token to its CoNLL-U representation.
A Token’s CoNLL-U representation is a line. Note that this method does not include a newline at the end.
- Returns
A string representing the Token in CoNLL-U format.
- Raises
FormatError – If the Token cannot be converted to the CoNLL format.
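The output shape can be sketched as a tab-join of the 10 column values with no trailing newline. `join_columns` is a hypothetical helper, and mapping None to an underscore is an assumption based on the CoNLL-U convention for unspecified values. The library's actual serialization of each column may differ.

```python
def join_columns(columns):
    """Join 10 already-serialized column values into one token line.

    Hypothetical helper for illustration; not part of pyconll.
    """
    if len(columns) != 10:
        raise ValueError(f'expected 10 columns, found {len(columns)}')
    # Assumption: a missing value (None) is written as an underscore.
    return '\t'.join('_' if value is None else value for value in columns)


columns = ['1', 'The', 'the', 'DET', None, 'Definite=Def', '2', 'det', None, None]
text = join_columns(columns)
print(text.endswith('\n'))  # False: the caller adds the line break
```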
- property form
Provide the word form of this Token. This property is read only.
- Returns
The Token form.
-