Live migrating busy (kvm) VMs in a ganeti cluster

Migrating a VM from its primary to its secondary node is usually a very simple task for ganeti. With busy machines, however, it sometimes does not work. Here is what helped me resolve the situation.

When running

gnt-instance migrate $vm

it can happen that the command stalls at the step “starting memory transfer”:

root@node1:# gnt-instance migrate -f vmname
Tue Dec 16 14:30 2014 Migrating instance vmname
Tue Dec 16 14:30 2014 * checking disk consistency between source and target
Tue Dec 16 14:30 2014 * switching node node2 to secondary mode
Tue Dec 16 14:30 2014 * changing into standalone mode
Tue Dec 16 14:30 2014 * changing disks into dual-master mode
Tue Dec 16 14:30 2014 * wait until resync is done
Tue Dec 16 14:30 2014 * preparing node2 to accept the instance
Tue Dec 16 14:30 2014 * migrating instance to node2
Tue Dec 16 14:30 2014 * starting memory transfer

and the long-awaited

Tue Dec 16 14:37 2014 * memory transfer complete

just never arrives. Looking at the secondary node, you can see a new qemu process whose memory usage grows to whatever the VM is configured to have, but the migration just never succeeds.
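If you want to confirm what is going on, you can look for the incoming qemu process on the secondary node. A minimal check, assuming a standard Linux userland (the bracketed grep pattern is just a trick to exclude the grep process itself):

# on the secondary node: the freshly spawned qemu process for the instance
ps aux | grep '[q]emu' | grep "$vmname"

Its resident memory (RSS) should keep growing while the transfer is running.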

This is very likely due to a default setting for the network bandwidth used for the memory transfer. By default, that is capped at 32 MiB/s, roughly 270 Mbit/s. Look at the traffic graph my monitoring system recorded when I faced the problem:

[Traffic graph recorded by the monitoring system during the stalled migration]

So if the VM is busy and dirties more than 32 MiB of its memory per second, live migration is never going to converge. Luckily, you can increase the bandwidth limit by running

gnt-cluster modify -H kvm:migration_bandwidth=$newlimit

The unit is MiB/s.
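As an illustration, and assuming your replication network has headroom to spare, raising the limit and verifying it could look like this (200 is just an example value, not a recommendation):

# raise the cluster-wide migration bandwidth cap to 200 MiB/s (example value)
gnt-cluster modify -H kvm:migration_bandwidth=200

# verify the hypervisor parameter took effect
gnt-cluster info | grep -i migration_bandwidth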

Update 1

You can debug this while the migration is going on by running

echo "info migrate" | \
/usr/bin/socat STDIO UNIX-CONNECT:\
/var/run/ganeti/kvm-hypervisor/ctrl/$vmname.monitor
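If you want to follow the transfer continuously instead of querying once, a small shell loop around the same monitor command works; this is just a sketch, adjust the interval as you like:

# poll the QEMU monitor every two seconds to watch the transfer progress
while true; do
  echo "info migrate" | \
    /usr/bin/socat STDIO UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/$vmname.monitor
  sleep 2
done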

and you can adjust the parameter for just the current migration using

echo 'migrate_set_speed 1000m' | \
/usr/bin/socat STDIO UNIX-CONNECT:\
/var/run/ganeti/kvm-hypervisor/ctrl/$vmname.monitor

Update 2

There is also a second tunable in this area: the acceptable downtime for the VM during the migration. The migration only completes once the remaining dirty memory can be copied over within that window, so for a busy VM the default value of 30 milliseconds may simply never be reached. It can be changed globally with

gnt-cluster modify -H kvm:migration_downtime=60

or for individual VMs

gnt-instance modify -H migration_downtime=60 $vmname
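To see what a given instance ends up with, the hypervisor parameters printed by gnt-instance info should include it (the grep is only there to shorten the output; adjust if your ganeti version labels it differently):

gnt-instance info $vmname | grep -i migration_downtime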

It can also be adjusted on the fly for a migration that is already running:

echo 'migrate_set_downtime 1000ms' | \
/usr/bin/socat STDIO UNIX-CONNECT:\
/var/run/ganeti/kvm-hypervisor/ctrl/$vmname.monitor