Migration Guide: Version 3.x to 4.0
Version 4.0 introduces significant architectural improvements to pyconll. This guide helps you migrate from earlier versions to 4.0.
Overview of Changes
Version 4.0 brings major improvements:
Flexible schema system for custom tabular formats
Improved performance through compiled parsers/serializers
Simplified object model with standard Python collections
Better type safety with generics
Quick Migration Checklist
Update imports from "import pyconll" to "from pyconll.conllu import conllu". All methods that were previously exposed on pyconll can now be found on the conllu instance.
Change return type annotations for load_from_* and iter_from_* from Conll to list[Sentence]
Update token access from sentence[token_index] to sentence.tokens[token_index]
Update metadata access from sentence.id to sentence.meta['sent_id']
Update serialization from .conll() methods to WriteFormat methods
Detailed Migration Steps
1. Import Changes
Before:
import pyconll
After:
from pyconll.conllu import conllu
The module structure has changed to support multiple formats. For CoNLL-U, use the conllu module.
2. Loading Data
Before:
import pyconll
# Load into memory
corpus = pyconll.load_from_file('train.conllu') # Returns Conll object
# Stream
for sentence in pyconll.iter_from_file('train.conllu'):
    pass
After:
from pyconll.conllu import conllu
# Load into memory
corpus = conllu.load_from_file('train.conllu') # Returns list[Sentence]
# Stream
for sentence in conllu.iter_from_file('train.conllu'):
    pass
The Conll wrapper object is gone. Loading methods now return standard Python lists.
3. Iterating Over Corpus
Before:
corpus = pyconll.load_from_file('train.conllu')
for sentence in corpus:  # Conll implements MutableSequence
    for token in sentence:  # Sentence is iterable over tokens
        print(token.form)
After:
corpus = conllu.load_from_file('train.conllu')
for sentence in corpus:  # Standard list iteration
    for token in sentence.tokens:  # Access .tokens attribute
        print(token.form)
The main difference is accessing sentence.tokens instead of iterating directly over sentence.
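If existing code walks over every token in many places, a small helper can keep the new .tokens access in one spot. The sketch below is illustrative only: FakeToken and FakeSentence are hypothetical stand-ins for real pyconll objects so that the example runs on its own, and the only thing taken from this guide is that a sentence exposes a tokens list.

```python
from dataclasses import dataclass, field
from itertools import chain
from typing import Iterable, Iterator


@dataclass
class FakeToken:
    """Hypothetical stand-in for a pyconll token; only `form` is modelled."""
    form: str


@dataclass
class FakeSentence:
    """Hypothetical stand-in for a 4.0 sentence exposing a `tokens` list."""
    tokens: list = field(default_factory=list)


def iter_all_tokens(corpus: Iterable) -> Iterator:
    """Yield every token of every sentence, in corpus order."""
    return chain.from_iterable(sentence.tokens for sentence in corpus)


corpus = [
    FakeSentence([FakeToken("Hello"), FakeToken("world")]),
    FakeSentence([FakeToken("Bye")]),
]
forms = [token.form for token in iter_all_tokens(corpus)]
print(forms)
```

Because the helper only relies on the tokens attribute, it should work the same way on a loaded list and on a streaming iterator of sentences.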
4. Accessing Tokens by ID
Before:
for sentence in corpus:
    for token in sentence:
        if token.head != '0':
            head_token = sentence[token.head]  # Direct ID lookup
            print(f"{token.form} -> {head_token.form}")
After:
for sentence in corpus:
    # Build token index
    token_by_id = {t.id: t for t in sentence.tokens}
    for token in sentence.tokens:
        if token.head != '0':
            head_token = token_by_id.get(token.head)
            if head_token:
                print(f"{token.form} -> {head_token.form}")
Sentences no longer support indexing by token ID. Build your own index if needed.
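The index-then-lookup pattern above can be wrapped in a helper that also handles the root and missing heads. This is a minimal sketch, not part of pyconll: FakeToken is a hypothetical stand-in carrying only id, form, and head, and find_head is an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeToken:
    """Hypothetical stand-in for a pyconll token (id, form, head only)."""
    id: str
    form: str
    head: Optional[str]


def find_head(token, token_by_id):
    """Return the head token, or None for the root or an unknown head ID."""
    if token.head is None or token.head == "0":
        return None
    return token_by_id.get(token.head)


tokens = [
    FakeToken("1", "Dogs", "2"),
    FakeToken("2", "bark", "0"),  # root: its head is "0"
    FakeToken("3", "loudly", "2"),
]

# Build the index once per sentence, then reuse it for every lookup.
token_by_id = {t.id: t for t in tokens}

pairs = []
for t in tokens:
    head = find_head(t, token_by_id)
    if head is not None:
        pairs.append((t.form, head.form))
print(pairs)
```

Building the dictionary once per sentence keeps each lookup constant-time, which matters when every token's head is resolved.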
5. Accessing Sentence Metadata
Before:
for sentence in corpus:
    print(sentence.id)  # Direct property access
    print(sentence.text)
    # Or via meta methods
    sent_id = sentence.meta_value('sent_id')
After:
for sentence in corpus:
    print(sentence.meta['sent_id'])  # Dictionary access
    print(sentence.meta['text'])
    # Add metadata
    sentence.meta['custom'] = 'value'
    # Singleton metadata
    sentence.meta['newpar'] = None
Metadata is now accessed as a standard OrderedDict.
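Since the metadata is an ordinary mapping, the usual dictionary idioms apply. The sketch below uses a plain OrderedDict as a stand-in for sentence.meta (no pyconll objects are involved, and the key missing_key is made up) to show the difference between indexing and .get(), and how a singleton key with a None value differs from an absent key.

```python
from collections import OrderedDict

# Stand-in for `sentence.meta`; a real sentence would supply this mapping.
meta = OrderedDict()
meta["sent_id"] = "train-s1"
meta["text"] = "Dogs bark."
meta["newpar"] = None  # singleton metadata: key present, value None

# Indexing raises KeyError for a missing key; .get() returns a default instead.
fallback = meta.get("missing_key", "unknown")

# `in` distinguishes "key absent" from "key present with value None".
has_newpar = "newpar" in meta
has_missing = "missing_key" in meta

# Insertion order is preserved when iterating.
keys = list(meta)
print(fallback, has_newpar, has_missing, keys)
```

Note that meta.get('newpar') and meta.get('missing_key') both return None, so use the in operator when the presence of a singleton key is what matters.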
6. Serialization (Writing Output)
Before:
# Serialize to string
conll_string = corpus.conll()
sentence_string = sentence.conll()
token_string = token.conll()
# Write to file
with open('output.conllu', 'w') as f:
    corpus.write(f)
After:
from pyconll.conllu import conllu
# Serialize individual items
sentence_string = conllu.serialize_sentence(sentence)
token_string = conllu.serialize_token(token)
# Write to file (recommended)
with open('output.conllu', 'w') as f:
    conllu.write_corpus(corpus, f)
Serialization is now handled by the Format instance (conllu in these examples) rather than by methods on the objects themselves. Serializing individual sentences or tokens to strings is not a common scenario, but it is listed here for completeness.
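The examples above do not show a direct replacement for corpus.conll(), which returned the whole corpus as a string. One possible approach is to write into an in-memory buffer. This is a sketch under an assumption: it presumes that write_corpus accepts any text file-like object, which this guide only demonstrates with a real file. The names corpus_to_string and fake_write_corpus are hypothetical, and the fake writer exists only so the example runs without pyconll.

```python
import io


def corpus_to_string(write_corpus, corpus):
    """Collect the output of a `write_corpus(corpus, file)` call in a string."""
    buffer = io.StringIO()
    write_corpus(corpus, buffer)
    return buffer.getvalue()


def fake_write_corpus(corpus, f):
    """Hypothetical writer, used only to make this sketch runnable."""
    for item in corpus:
        f.write(f"{item}\n")


result = corpus_to_string(fake_write_corpus, ["a", "b"])
print(repr(result))
```

With pyconll, the call would presumably be corpus_to_string(conllu.write_corpus, corpus); verify against the library documentation that in-memory buffers are supported before relying on it.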
Complete Migration Example
Before:
import pyconll
# Load
train = pyconll.load_from_file('./ud/train.conllu')
# Process
for sentence in train:
    # Access metadata
    print(f"Sentence: {sentence.id}")
    # Build tree
    tree = sentence.to_tree()
    # Process tokens
    for token in sentence:
        if token.upos == 'VERB':
            # Look up head
            if token.head != '0':
                head = sentence[token.head]
                print(f"{token.form} -> {head.form}")
# Write output
with open('output.conllu', 'w') as f:
    train.write(f)
After:
from pyconll.conllu import conllu
# Load
train = conllu.load_from_file('./ud/train.conllu')
# Process
for sentence in train:
    # Access metadata
    print(f"Sentence: {sentence.meta['sent_id']}")
    # Build tree
    tree = sentence.to_tree()
    # Build token index for lookups
    token_by_id = {t.id: t for t in sentence.tokens}
    # Process tokens
    for token in sentence.tokens:
        if token.upos == 'VERB':
            # Look up head
            if token.head != '0':
                # .get() returns None for an unknown ID instead of raising KeyError
                head = token_by_id.get(token.head)
                if head:
                    print(f"{token.form} -> {head.form}")
# Write output
with open('output.conllu', 'w') as f:
    conllu.write_corpus(train, f)
Troubleshooting
Common Issues
AttributeError: 'Sentence' object has no attribute 'id'
Use sentence.meta.get('sent_id') instead of sentence.id.
TypeError: 'Sentence' object is not subscriptable
Sentences no longer support sentence[token_id]. Build an index:
token_by_id = {t.id: t for t in sentence.tokens}
token = token_by_id.get(token_id)
AttributeError: 'list' object has no attribute 'write'
The Conll object is gone. Use the Format instance:
with open('output.conllu', 'w') as f:
    conllu.write_corpus(sentences, f)
Getting Help
If you encounter issues during migration:
Check the updated documentation
Review the examples in the repository
Ask questions on GitHub