Migrating content with alembic

6. May 2013. Tagged python, sql, sqlalchemy, alembic, development.

This post will be mainly about a way to migrate content with alembic. If you are searching for information on how to get started with alembic read the linked blog post about that.

You may have come to a point at some time where you had to update your data because of changes in the structure. An example might be that you only had a name column, but now need a finer grain of control and thus want to split up the name column in two columns.

With a standard alembic migration you won’t get far, but don’t worry, this is not impossible.

First I will give you an example to work with. Let’s use the example we had before (it might be an address book), which might include the following code:

1
2
3
4
5
6
7
8
9
10
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Contact(Base):
__tablename__ = 'contacts'
id = Column(Integer, primary_key=True)
name = Column(String(length=100))

So we create all this and set it up (for a detailed description of what I did when, you can check out the commit history of the github repository for this post). Now we just create some data to actually make the whole thing work, I used this code for it:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
sballmer = Contact('Steve Ballmer')
sjobs = Contact('Steve Jobs')
mzuckerberg = Contact('Mark Zuckerberg')

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
engine = create_engine(
    'mysql://alexander:@localhost/alembic_content_migration_example'
)
Session = sessionmaker(bind=engine)
session = Session()
session.add(sballmer)
session.add(sjobs)
session.add(mzuckerberg)
session.commit()

This is also what the function create_example_contacts in database.py does, just in case you want to use it.

So now we have some data and we realize that it actually would be better to have saved the contact names as first- and lastname. Well, damn! So what now?

So I adjust the model as follows:

1
2
3
4
5
class Contact(Base):
    __tablename__ = 'contacts'
    id = Column(Integer, primary_key=True)
    firstname = Column(String(length=30))
    lastname = Column(String(length=70))

If you just automatically generate a revision, this is what will happen:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
"""split up name in first- and lastname.

Revision ID: 3f05b8f881a3
Revises: 3821916c0277
Create Date: 2013-05-06 15:24:23.815282

"""

# revision identifiers, used by Alembic.
revision = '3f05b8f881a3'
down_revision = '3821916c0277'

from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import mysql

def upgrade():
    op.add_column('contacts', sa.Column('lastname', sa.String(length=70), nullable=True))
    op.add_column('contacts', sa.Column('firstname', sa.String(length=30), nullable=True))
    op.drop_column('contacts', u'name')


def downgrade():
    op.add_column('contacts', sa.Column(u'name', mysql.VARCHAR(length=100), nullable=True))
    op.drop_column('contacts', 'firstname')
    op.drop_column('contacts', 'lastname')

As you can see you would loose all names. The column name will be dropped and two new ones will be created without any data transfer. Which is certainly not what you want. So we will have some extra magic here. Read up in the comments of the following code what I did what and why.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
"""split up name in first- and lastname.

Revision ID: 3f05b8f881a3
Revises: 3821916c0277
Create Date: 2013-05-06 15:24:23.815282

"""

# revision identifiers, used by Alembic.
revision = '3f05b8f881a3'
down_revision = '3821916c0277'

from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import mysql

# now we build a helper table, that is kind of a hybrid. it has both,
# the name column and the first- and lastname attribute because it will
# be used in the migration, when both are available and necessary.
# in this case we define all attributes, because we also need the id
# to identify rows, if you had more columns you would only have to specify
# the relevant ones
contacthelper = sa.Table(
    'contacts',
    sa.MetaData(),
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('name', sa.String(length=100)),
    sa.Column('firstname', sa.String(length=30)),
    sa.Column('lastname', sa.String(length=70)),
)


def upgrade():
    # we build a quick link for the current connection of alembic
    connection = op.get_bind()
    
    # we add the new columns first
    op.add_column(
        'contacts',
        sa.Column(
            'firstname',
            sa.String(length=30),
            nullable=True
        )
    )
    op.add_column(
        'contacts',
        sa.Column(
            'lastname',
            sa.String(length=70),
            nullable=True
        )
    )
    # at this state right now, the old column is not deleted and the
    # new columns are present already. So now is the time to run the
    # content migration. We use the connection to grab all data from
    # the table, split up name into first- and lastname and update the
    # row, which is identified by its id
    for contact in connection.execute(contacthelper.select()):
        firstname, lastname = contact.name.split(' ')
        connection.execute(
            contacthelper.update().where(
                contacthelper.c.id == contact.id
            ).values(
                firstname=firstname,
                lastname=lastname
            )
        )
    # now that all data is migrated we can just drop the old column
    # without having lost any data
    op.drop_column('contacts', u'name')


def downgrade():
    # we build a quick link for the current connection of alembic
    connection = op.get_bind()
    # for downgrading we do it exactly the other way around
    # we add the old column again
    op.add_column(
        'contacts',
        sa.Column(
            u'name',
            mysql.VARCHAR(length=100),
            nullable=True
        )
    )
    # select all data, join firstname and lastname together to name
    # and update the entry identified by it's id.
    for contact in connection.execute(contacthelper.select()):
        name = "%s %s" % (contact.firstname, contact.lastname)
        connection.execute(
            contacthelper.update().where(
                contacthelper.c.id == contact.id
            ).values(
                name=name
            )
        )
    # now we can drop the two new columns without having lost any data.
    op.drop_column('contacts', 'firstname')
    op.drop_column('contacts', 'lastname')

As you can see you can migrate contents this way, without having any data. I have to warn you though that this might not be very stable or reliable for huge data sets. You should have a backup of your data at any time. This might seem complicated, but it is the only way I could find so far to migrate content at all. Please let me know if you have any feedback, improvements or ideas.

Don’t forget to check out the github repository for the complete code