tree
Tree is a simple, generic tree data structure for representing hierarchical relationships between tokens (such as dependency trees). A Tree can have multiple children and at most one parent; the root has none.
Overview
The tree module provides:
- Tree[T] - A generic tree node containing data of type T
- from_tokens() - A function to build trees from sequences of tokens
Structure
A Tree has the following key components:
- data: T - The data stored at this node (e.g., a Token)
- parent: Optional[Tree[T]] - The parent node (None for root)
- __getitem__(i) - Access children by index
- __iter__() - Iterate over children
- __len__() - Number of children
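Because each node keeps a reference to its parent, the path from any node up to the root can be followed with the parent attribute alone. The following is a minimal sketch under stated assumptions, not part of pyconll: ancestors and depth are hypothetical helpers, and Node is a small stand-in that mimics the documented interface so the example runs without a treebank.

```python
class Node:
    """Hypothetical stand-in for Tree: exposes data, parent, and indexing."""

    def __init__(self, data, children=()):
        self.data = data
        self.parent = None
        self._children = list(children)
        for child in self._children:
            child.parent = self

    def __getitem__(self, i):
        return self._children[i]


def ancestors(node):
    """Yield the data of each ancestor, from the direct parent up to the root."""
    current = node.parent
    while current is not None:
        yield current.data
        current = current.parent


def depth(node):
    """Number of edges between the node and the root (0 for the root)."""
    return sum(1 for _ in ancestors(node))


# "ran" is the root; "the" hangs under "dog".
root = Node("ran", [Node("dog", [Node("the")]), Node("quickly")])
the = root[0][0]
print(list(ancestors(the)))     # ['dog', 'ran']
print(depth(the), depth(root))  # 2 0
```

With a real Tree the same helpers should apply unchanged, provided parent is None at the root as described above.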
Creating Trees
Generic Tree Creation
Use tree.from_tokens() to create trees from any sequence of tokens:
from pyconll.tree import from_tokens

tree = from_tokens(
    tokens=my_tokens,
    root_id='0',                 # ID that the root token's head points to
    to_id=lambda t: t.id,        # Extract token ID
    to_head=lambda t: t.head,    # Extract parent ID
    skip=lambda t: '-' in t.id   # Skip multiword tokens
)
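To make the roles of the four arguments concrete, the sketch below rebuilds the same idea with plain Python data. It is an illustration only and not pyconll's implementation: build_nested is a hypothetical function that returns nested (token, children) tuples instead of Tree objects, and the single-root check is an assumption of this sketch.

```python
from collections import defaultdict


def build_nested(tokens, root_id, to_id, to_head, skip=None):
    """Return (token, [children...]) tuples rooted at the child of root_id."""
    # Group every participating token under the ID of its head.
    children = defaultdict(list)
    for token in tokens:
        if skip is not None and skip(token):
            continue
        children[to_head(token)].append(token)

    # Assumption of this sketch: exactly one token points at root_id.
    roots = children.get(root_id, [])
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")

    def build(token):
        kids = children.get(to_id(token), [])
        return (token, [build(kid) for kid in kids])

    return build(roots[0])


# Hypothetical tokens stored as dicts; "1-2" is a multiword token.
tokens = [
    {"id": "1-2", "head": "_", "form": "Don't"},
    {"id": "1", "head": "3", "form": "Do"},
    {"id": "2", "head": "3", "form": "n't"},
    {"id": "3", "head": "0", "form": "go"},
]
nested = build_nested(
    tokens,
    root_id="0",
    to_id=lambda t: t["id"],
    to_head=lambda t: t["head"],
    skip=lambda t: "-" in t["id"],
)
print(nested[0]["form"])                      # go
print([kid[0]["form"] for kid in nested[1]])  # ['Do', "n't"]
```

Because the token type is only ever touched through to_id, to_head, and skip, the same call shape works for dicts, dataclasses, or CoNLL-U tokens.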
CoNLL-U Tree Creation
For the CoNLL-U model, Sentences have a to_tree method which can be used directly.
from pyconll.conllu import conllu

sentences = conllu.load_from_file('train.conllu')
for sentence in sentences:
    tree = sentence.to_tree()

    # Tree root is the token with head="0"
    root_token = tree.data
    print(f"Root: {root_token.form}")

    # Iterate over dependents
    for child_tree in tree:
        child_token = child_tree.data
        print(f"  Dependent: {child_token.form}")
Traversing Trees
from pyconll.conllu import conllu

sentences = conllu.load_from_file('train.conllu')
tree = sentences[0].to_tree()

# Access root data
root = tree.data
print(f"Root word: {root.form}, POS: {root.upos}")

# Iterate over direct children
for child_tree in tree:
    child = child_tree.data
    print(f"Dependent: {child.form} ({child.deprel})")

    # Descend one level further into the child's subtree
    for grandchild_tree in child_tree:
        grandchild = grandchild_tree.data
        print(f"  Grandchild: {grandchild.form}")

# Access children by index
if len(tree) > 0:
    first_child = tree[0]
    print(f"First dependent: {first_child.data.form}")
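The nested loops above stop at grandchildren. For trees of arbitrary depth, a recursive generator is one way to visit every node. The sketch below is not part of pyconll: walk is a hypothetical helper that relies only on data and child iteration, and Node is a stand-in for Tree so the example is runnable on its own.

```python
class Node:
    """Hypothetical stand-in for Tree: exposes data and child iteration."""

    def __init__(self, data, children=()):
        self.data = data
        self._children = list(children)

    def __iter__(self):
        return iter(self._children)


def walk(tree, depth=0):
    """Yield (depth, data) for every node, each parent before its children."""
    yield depth, tree.data
    for child in tree:
        yield from walk(child, depth + 1)


root = Node("ran", [Node("dog", [Node("the")]), Node("quickly")])

# Print the tree as an indented outline, two spaces per level.
for level, form in walk(root):
    print("  " * level + form)
```

Yielding the depth alongside the data keeps formatting decisions, such as indentation, outside the traversal itself.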
Example: Finding Non-Projective Dependencies
from pyconll.conllu import conllu

def collect_ids(tree, token_ids):
    """Add the integer ID of every token in the subtree to token_ids."""
    # Assumes the tree holds only plain integer IDs (no multiword tokens).
    token_ids.add(int(tree.data.id))
    for child in tree:
        collect_ids(child, token_ids)

def has_nonprojective(tree):
    """Check if tree has non-projective dependencies.

    A tree is projective exactly when every subtree covers a contiguous
    span of token IDs, so a gap in any span signals a non-projective arc.
    """
    token_ids = set()
    collect_ids(tree, token_ids)
    # A contiguous span holds exactly max - min + 1 distinct IDs.
    if max(token_ids) - min(token_ids) + 1 != len(token_ids):
        return True
    return any(has_nonprojective(child) for child in tree)

sentences = conllu.load_from_file('train.conllu')
for sentence in sentences:
    tree = sentence.to_tree()
    if has_nonprojective(tree):
        print(f"Non-projective: {sentence.meta['sent_id']}")
API
A general immutable tree module. This module is used when parsing a serial sentence into a Tree structure.
- class pyconll.tree.Tree(data: T)
A tree node. This is the base representation for a tree, which can have many children that are accessible by child index. The tree’s structure is immutable, so the data, parent, and children cannot be changed once created.
On its own this class is of little use; larger trees must be created with the TreeBuilder module, which acts as a sort of friend class of Tree to maintain its immutable public contract.
- __getitem__(key: int) -> Tree[T]
- __getitem__(key: slice) -> list['Tree[T]']
Get specific children from the Tree. The key can be an integer or a slice.
- Parameters:
key – The indexer for the item.
- __init__(data: T) -> None
Create a tree holding the value. To create a larger Tree, use TreeBuilder.
- Parameters:
data – The data to put with the Tree node.
- __len__() -> int
Provides the number of direct children on the tree.
- Returns:
The number of direct children on the tree.
- property data: T
The data on the tree node. Exposing it as a property keeps it read-only.
- Returns:
The data stored on the Tree.
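Since __len__ reports direct children only, totals over a whole subtree have to be computed by recursion. The sketch below shows one way to do that. It is not part of pyconll: subtree_size and leaves are hypothetical helpers, and Node is a stand-in for Tree that supports iteration and len().

```python
class Node:
    """Hypothetical stand-in for Tree: exposes child iteration and len()."""

    def __init__(self, data, children=()):
        self.data = data
        self._children = list(children)

    def __iter__(self):
        return iter(self._children)

    def __len__(self):
        return len(self._children)


def subtree_size(tree):
    """Count the node itself plus all of its descendants."""
    return 1 + sum(subtree_size(child) for child in tree)


def leaves(tree):
    """Yield the data of every node that has no children."""
    if len(tree) == 0:
        yield tree.data
    for child in tree:
        yield from leaves(child)


root = Node("ran", [Node("dog", [Node("the")]), Node("quickly")])
print(len(root))           # 2 direct children
print(subtree_size(root))  # 4 nodes in total
print(list(leaves(root)))  # ['the', 'quickly']
```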
- pyconll.tree.from_tokens(tokens: Sequence, root_id: I, to_id: Callable[[K], I], to_head: Callable[[K], I], skip: Callable[[K], bool] | None = None) -> Tree
The completely generic function to create a Tree structure for a sequence of Tokens.
This can be used for tokens other than the pre-defined CoNLL-U schema.
- Parameters:
tokens – The tokens to create the tree from.
root_id – The root token of the tree will be a child of this id.
to_id – The mapper from the token to its id.
to_head – The mapper from the token to the id of its parent.
skip – The optional guard to skip certain tokens that may not participate in the Tree structure.