load
This module defines the main interface for loading CoNLL treebank resources. CoNLL treebanks can be loaded from a string or a file (or technically anything that can function as a string iterator). CoNLL resources can be loaded and held in memory, or iterated through one sentence at a time, which is useful for handling very large files.
The fully qualified name of the module is pyconll.load, but these methods are also imported at the pyconll namespace level. This module provides the wrappers for loading from a string or file, but if another string iterator is available, for example a network resource, it can be passed directly to the Conll constructor as well.
Example
This example counts the number of times a token with the lemma linguistic appears in the treebank. If all the operations that will be done on the CoNLL file are read-only or are data aggregations, the iter_from alternatives are more memory efficient. These methods return an iterator over the sentences in the CoNLL resource rather than storing the Conll object in memory, which is convenient when dealing with large files that do not need to be completely loaded. This example uses the load_from_file method for illustration purposes.
import pyconll

example_treebank = '/home/myuser/englishdata.conll'
conll = pyconll.load_from_file(example_treebank)

count = 0
for sentence in conll:
    for word in sentence:
        if word.lemma == 'linguistic':
            count += 1

print(count)
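For comparison, the same count can be written so that sentences are streamed through iter_from_file rather than held in memory. This is a minimal sketch: count_lemma and count_lemma_in_file are hypothetical helper names chosen for illustration, not part of pyconll.

```python
from typing import Iterable


def count_lemma(sentences: Iterable, lemma: str) -> int:
    """Count tokens whose lemma equals `lemma`.

    `sentences` can be any iterable of token iterables, for example a
    loaded Conll object or the iterator returned by an iter_from method.
    """
    return sum(
        1
        for sentence in sentences
        for token in sentence
        if token.lemma == lemma
    )


def count_lemma_in_file(path: str, lemma: str) -> int:
    """Stream a CoNLL-U file sentence by sentence and count a lemma."""
    # Hypothetical helper. Requires the third-party pyconll package; it is
    # imported here so that count_lemma above stays usable on its own.
    import pyconll

    return count_lemma(pyconll.iter_from_file(path), lemma)
```

Calling count_lemma_in_file('/home/myuser/englishdata.conll', 'linguistic') should give the same count as the example above without building the full Conll object.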
API
A wrapper around the Conll class to easily load treebanks from multiple formats. This module can also load resources by iterating over treebank data without storing Conll objects in memory. This module is the main entry point to pyconll’s functionality.
pyconll.load.iter_from_file(file_descriptor: Union[str, bytes, os.PathLike]) → Iterator[pyconll.unit.sentence.Sentence]
Iterate over a CoNLL-U file’s sentences.
- Parameters
file_descriptor – The file to iterate the CoNLL-U data from. This can be a file path given as a string or a Path object, or a file descriptor.
- Yields
The sentences that make up the CoNLL-U file.
- Raises
IOError – If there is an error opening the file.
ParseError – If there is an error parsing the input into a Conll object.
pyconll.load.iter_from_resource(resource: Iterable[str]) → Iterator[pyconll.unit.sentence.Sentence]
Iterate over the sentences from an iterable string resource.
This is a generic method that allows for any general resource that can provide data (like a streaming network request or memory mapped data) to be parsed as a CoNLL-U data source.
- Parameters
resource – The line source. Each iterated string should be a line in a CoNLL-U formatted file.
- Yields
The sentences that make up the CoNLL-U file.
- Raises
ParseError – If there is an error parsing the input into a Conll object.
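As an illustration of a generic line source, the sketch below streams sentences from a gzip-compressed CoNLL-U file. The helper names gzip_lines and iter_gzip_sentences are hypothetical, and the example assumes the file is UTF-8 encoded and that lines may keep their trailing newline when handed to iter_from_resource.

```python
import gzip
from typing import Iterator


def gzip_lines(path: str) -> Iterator[str]:
    """Yield the text lines of a gzip-compressed file one at a time."""
    # "rt" opens the archive in text mode, so each item is a str line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def iter_gzip_sentences(path: str):
    """Yield sentences from a gzip-compressed CoNLL-U file."""
    # Hypothetical helper. Requires the third-party pyconll package.
    import pyconll

    # Any iterable of CoNLL-U lines can act as the resource.
    yield from pyconll.iter_from_resource(gzip_lines(path))
```

Because both functions are generators, the file is decompressed and parsed only as sentences are consumed.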
pyconll.load.iter_from_string(source: str) → Iterator[pyconll.unit.sentence.Sentence]
Iterate over a CoNLL-U string’s sentences.
Use this method if you only need to iterate over the CoNLL-U file once and do not need to create or store the Conll object.
- Parameters
source – The CoNLL-U string.
- Yields
The sentences that make up the CoNLL-U file.
- Raises
ParseError – If there is an error parsing the input into a Conll object.
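A small sketch of iterating over an in-memory string follows. SAMPLE is a made-up one-sentence treebank built with explicit tab joins so that the ten CoNLL-U columns stay intact, and sentence_lemmas is a hypothetical helper name.

```python
def token_line(*fields: str) -> str:
    """Join the ten CoNLL-U columns of one token with tabs."""
    return "\t".join(fields)


# A made-up sentence used only for illustration.
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    "# text = Dogs bark.",
    token_line("1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"),
    token_line("2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"),
    token_line("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
    "",
])


def sentence_lemmas(source: str) -> list:
    """Return one list of lemmas per sentence in a CoNLL-U string."""
    # Hypothetical helper. Requires the third-party pyconll package.
    import pyconll

    return [
        [token.lemma for token in sentence]
        for sentence in pyconll.iter_from_string(source)
    ]
```

sentence_lemmas(SAMPLE) walks the string once and keeps only the lemmas, so no Conll object is created or stored.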
pyconll.load.load_from_file(file_descriptor: Union[str, bytes, os.PathLike]) → pyconll.unit.conll.Conll
Load a CoNLL-U file given its location.
- Parameters
file_descriptor – The file to load the CoNLL-U data from. This can be a file path given as a string or a Path object, or a file descriptor.
- Returns
A Conll object equivalent to the provided file.
- Raises
IOError – If there is an error opening the given filename.
ParseError – If there is an error parsing the input into a Conll object.
pyconll.load.load_from_resource(resource: Iterable[str]) → pyconll.unit.conll.Conll
Load a CoNLL-U file from a generic string resource.
- Parameters
resource – The generic string resource. Each string from the resource is assumed to be a line in a CoNLL-U formatted resource.
- Returns
A Conll object equivalent to the string resource provided.
- Raises
ParseError – If there is an error parsing the input into a Conll object.
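One possible generic string resource is an HTTP response. The sketch below decodes the response's byte lines and hands them to load_from_resource. The helper names decode_lines and load_from_url are hypothetical, and the example assumes the remote file is UTF-8 encoded and that load_from_resource reads the whole resource before returning, so the connection can be closed afterwards.

```python
from typing import Iterable, Iterator
from urllib.request import urlopen


def decode_lines(byte_lines: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Decode an iterable of byte lines into str lines, one at a time."""
    for raw in byte_lines:
        yield raw.decode(encoding)


def load_from_url(url: str):
    """Load a remote CoNLL-U file into a Conll object."""
    # Hypothetical helper. Requires the third-party pyconll package.
    import pyconll

    # Iterating over the HTTP response yields one bytes line at a time.
    with urlopen(url) as response:
        return pyconll.load_from_resource(decode_lines(response))
```

The same decode_lines generator works with any other byte-oriented line source, such as a file opened in binary mode.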
pyconll.load.load_from_string(source: str) → pyconll.unit.conll.Conll
Load the CoNLL-U source in a string into a Conll object.
- Parameters
source – The CoNLL-U formatted string.
- Returns
A Conll object equivalent to the provided source.
- Raises
ParseError – If there is an error parsing the input into a Conll object.