Migration Guide: Version 3.x to 4.0

Version 4.0 introduces significant architectural improvements to pyconll. This guide helps you migrate from earlier versions to 4.0.

Overview of Changes

Version 4.0 brings major improvements:

  • Flexible schema system for custom tabular formats

  • Improved performance through compiled parsers/serializers

  • Simplified object model with standard Python collections

  • Better type safety with generics

Quick Migration Checklist

  1. Change the import statement from "import pyconll" to "from pyconll.conllu import conllu". Everything that was previously exposed on the pyconll module is now available on the conllu instance.

  2. Change return type annotations for load_from_* from Conll to list[Sentence]. The iter_from_* functions stream sentences one at a time rather than returning a list.

  3. Update token access from sentence[token_index] to sentence.tokens[token_index]

  4. Update metadata access from sentence.id to sentence.meta['sent_id']

  5. Update serialization from .conll() methods to WriteFormat methods

Detailed Migration Steps

1. Import Changes

Before:

import pyconll

After:

from pyconll.conllu import conllu

The module structure has changed to support multiple formats. For CoNLL-U, use the conllu module.

2. Loading Data

Before:

import pyconll

# Load into memory
corpus = pyconll.load_from_file('train.conllu')  # Returns Conll object

# Stream
for sentence in pyconll.iter_from_file('train.conllu'):
    pass

After:

from pyconll.conllu import conllu

# Load into memory
corpus = conllu.load_from_file('train.conllu')  # Returns list[Sentence]

# Stream
for sentence in conllu.iter_from_file('train.conllu'):
    pass

The Conll wrapper object is gone. The load_from_* methods now return a standard Python list of Sentence objects.

3. Iterating Over Corpus

Before:

corpus = pyconll.load_from_file('train.conllu')

for sentence in corpus:  # Conll implements MutableSequence
    for token in sentence:  # Sentence is iterable over tokens
        print(token.form)

After:

corpus = conllu.load_from_file('train.conllu')

for sentence in corpus:  # Standard list iteration
    for token in sentence.tokens:  # Access .tokens attribute
        print(token.form)

The main difference is accessing sentence.tokens instead of iterating directly over sentence.
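Code that has to run against both versions during an incremental migration may want a small shim. The following is only a sketch: iter_tokens is a hypothetical helper and is not part of pyconll, and the OldStyleSentence and NewStyleSentence classes are stand-ins that exist only to make the example self-contained. It assumes 3.x sentence objects expose no tokens attribute.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Stand-in for a pyconll token; only `form` is modelled here."""
    form: str


@dataclass
class NewStyleSentence:
    """Stand-in for a 4.0 sentence: tokens live on a `.tokens` list."""
    tokens: list = field(default_factory=list)


class OldStyleSentence:
    """Stand-in for a 3.x sentence: iterating yields tokens directly."""

    def __init__(self, tokens):
        self._tokens = list(tokens)

    def __iter__(self):
        return iter(self._tokens)


def iter_tokens(sentence):
    """Hypothetical helper: iterate tokens from either sentence shape."""
    # Prefer the 4.0 `.tokens` attribute; fall back to 3.x direct iteration.
    tokens = getattr(sentence, "tokens", None)
    return iter(tokens if tokens is not None else sentence)


old = OldStyleSentence([Token("Hello"), Token("world")])
new = NewStyleSentence([Token("Hello"), Token("world")])

old_forms = [t.form for t in iter_tokens(old)]
new_forms = [t.form for t in iter_tokens(new)]
```

Once the migration is complete, the shim can be dropped in favour of iterating over sentence.tokens directly.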

4. Accessing Tokens by ID

Before:

for sentence in corpus:
    for token in sentence:
        if token.head != '0':
            head_token = sentence[token.head]  # Direct ID lookup
            print(f"{token.form} -> {head_token.form}")

After:

for sentence in corpus:
    # Build token index
    token_by_id = {t.id: t for t in sentence.tokens}

    for token in sentence.tokens:
        if token.head != '0':
            head_token = token_by_id.get(token.head)
            if head_token is not None:
                print(f"{token.form} -> {head_token.form}")

Sentences no longer support indexing by token ID. Build your own index if needed.
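If head lookups occur in several places, the index can be wrapped in a helper. This is a minimal sketch under stated assumptions: index_tokens and head_pairs are hypothetical names, not pyconll functions, and the Token and Sentence dataclasses are stand-ins that model only the id, form, head, and tokens attributes used in this guide.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    """Stand-in token with only the fields this example needs."""
    id: str
    form: str
    head: Optional[str]


@dataclass
class Sentence:
    """Stand-in for a 4.0 sentence exposing a `.tokens` list."""
    tokens: list = field(default_factory=list)


def index_tokens(sentence):
    """Hypothetical helper: map token ID -> token for one sentence."""
    return {t.id: t for t in sentence.tokens}


def head_pairs(sentence):
    """Yield (dependent form, head form) pairs, skipping root tokens."""
    token_by_id = index_tokens(sentence)  # built once per sentence
    for token in sentence.tokens:
        # '0' marks the root; a missing head (None) has nothing to look up.
        if token.head in (None, "0"):
            continue
        head = token_by_id.get(token.head)
        if head is not None:
            yield token.form, head.form


sentence = Sentence([
    Token("1", "Dogs", "2"),
    Token("2", "bark", "0"),
    Token("3", "loudly", "2"),
])
pairs = list(head_pairs(sentence))
```

Building the dictionary once per sentence keeps each lookup constant-time, which matters when every token's head is resolved.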

5. Accessing Sentence Metadata

Before:

for sentence in corpus:
    print(sentence.id)    # Direct property access
    print(sentence.text)

    # Or via meta methods
    sent_id = sentence.meta_value('sent_id')

After:

for sentence in corpus:
    print(sentence.meta['sent_id'])  # Dictionary access
    print(sentence.meta['text'])

    # Add metadata
    sentence.meta['custom'] = 'value'

    # Singleton metadata
    sentence.meta['newpar'] = None

Metadata is now a standard OrderedDict, available as sentence.meta. Singleton entries such as newpar are stored with a value of None.
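Because the metadata is an ordinary dictionary, the usual dictionary idioms apply. The sketch below uses a plain OrderedDict as a stand-in for sentence.meta; the keys and values are made up for illustration, and the "source" key is hypothetical.

```python
from collections import OrderedDict

# Stand-in for `sentence.meta` on a parsed sentence (illustrative values).
meta = OrderedDict()
meta["sent_id"] = "train-s1"
meta["text"] = "Dogs bark loudly."
meta["newpar"] = None  # singleton metadata: key present, no value

# Direct indexing raises KeyError for a missing key; .get() returns a default.
source = meta.get("source", "unknown")

# A singleton key maps to None, so test membership rather than truthiness.
starts_paragraph = "newpar" in meta
has_value = meta.get("newpar") is not None

# Keys come back in insertion order.
keys = list(meta)
```

Note that meta.get('newpar') returns None both when the key is absent and when it is present as a singleton, so the membership test is the reliable check.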

6. Serialization (Writing Output)

Before:

# Serialize to string
conll_string = corpus.conll()
sentence_string = sentence.conll()
token_string = token.conll()

# Write to file
with open('output.conllu', 'w') as f:
    corpus.write(f)

After:

from pyconll.conllu import conllu

# Serialize individual items
sentence_string = conllu.serialize_sentence(sentence)
token_string = conllu.serialize_token(token)

# Write to file (recommended)
with open('output.conllu', 'w') as f:
    conllu.write_corpus(corpus, f)

Serialization is now handled by the Format instance (conllu in these examples) rather than by methods on the objects themselves. This is not a common scenario but is listed here for completeness.

Complete Migration Example

Before:

import pyconll

# Load
train = pyconll.load_from_file('./ud/train.conllu')

# Process
for sentence in train:
    # Access metadata
    print(f"Sentence: {sentence.id}")

    # Build tree
    tree = sentence.to_tree()

    # Process tokens
    for token in sentence:
        if token.upos == 'VERB':
            # Look up head
            if token.head != '0':
                head = sentence[token.head]
                print(f"{token.form} -> {head.form}")

# Write output
with open('output.conllu', 'w') as f:
    train.write(f)

After:

from pyconll.conllu import conllu

# Load
train = conllu.load_from_file('./ud/train.conllu')

# Process
for sentence in train:
    # Access metadata
    print(f"Sentence: {sentence.meta['sent_id']}")

    # Build tree
    tree = sentence.to_tree()

    # Build token index for lookups
    token_by_id = {t.id: t for t in sentence.tokens}

    # Process tokens
    for token in sentence.tokens:
        if token.upos == 'VERB':
            # Look up head
            if token.head != '0':
                head = token_by_id.get(token.head)
                if head is not None:
                    print(f"{token.form} -> {head.form}")

# Write output
with open('output.conllu', 'w') as f:
    conllu.write_corpus(train, f)

Troubleshooting

Common Issues

AttributeError: 'Sentence' object has no attribute 'id'

Use sentence.meta.get('sent_id') instead of sentence.id.

TypeError: 'Sentence' object is not subscriptable

Sentences no longer support sentence[token_id]. Build an index:

token_by_id = {t.id: t for t in sentence.tokens}
token = token_by_id.get(token_id)

AttributeError: 'list' object has no attribute 'write'

The Conll object is gone. Use the Format instance:

with open('output.conllu', 'w') as f:
    conllu.write_corpus(sentences, f)

Getting Help

If you encounter issues during migration: