Migrating a VM from its primary to its secondary node is usually a very simple task for Ganeti. But with busy machines it sometimes does not work. Here is what helped me solve this situation.
After starting the migration with

gnt-instance migrate $vm

the command can appear to stall at the step “starting memory transfer”:
root@node1:# gnt-instance migrate -f vmname
Tue Dec 16 14:30 2014 Migrating instance vmname
Tue Dec 16 14:30 2014 * checking disk consistency between source and target
Tue Dec 16 14:30 2014 * switching node node2 to secondary mode
Tue Dec 16 14:30 2014 * changing into standalone mode
Tue Dec 16 14:30 2014 * changing disks into dual-master mode
Tue Dec 16 14:30 2014 * wait until resync is done
Tue Dec 16 14:30 2014 * preparing node2 to accept the instance
Tue Dec 16 14:30 2014 * migrating instance to node2
Tue Dec 16 14:30 2014 * starting memory transfer
and the long-awaited
Tue Dec 16 14:37 2014 * memory transfer complete
just never arrives. Looking at the secondary node, you can see a new QEMU process whose memory usage grows up to whatever the VM is configured to have, but the migration never succeeds.
This is very likely due to a default setting for the network bandwidth used during the memory transfer. By default, Ganeti allows a maximum of 32 MiB/s, i.e. roughly 270 Mbit/s. Look at the traffic graph my monitoring system recorded when I faced the problem:
So if the VM is busy and dirties more than 32 MiB of its memory per second, live migration is never going to converge. Luckily, you can raise the bandwidth limit by running
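As a quick sanity check on those numbers (the dirty rate below is a made-up example, not a measured value):

```shell
# Default limit: 32 MiB/s. Converted to Mbit/s:
limit_mib=32
echo "$(( limit_mib * 1024 * 1024 * 8 / 1000000 )) Mbit/s"

# Pre-copy migration only converges if the guest dirties memory
# slower than the transfer limit. Hypothetical dirty rate:
dirty_mib=40
if [ "$dirty_mib" -gt "$limit_mib" ]; then
  echo "migration will not converge"
else
  echo "migration can converge"
fi
```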
gnt-cluster modify -H kvm:migration_bandwidth=$newlimit
The unit is MiB/s.
You can debug this while the migration is going on by running
echo "info migrate" | \
  /usr/bin/socat STDIO \
  UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/$vmname.monitor
and you can adjust the parameter for just the current migration using
echo 'migrate_set_speed 1000m' | \
  /usr/bin/socat STDIO \
  UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/$vmname.monitor
There is also a second tunable in this area: the acceptable downtime of the VM during the migration. The default value of 30 milliseconds may be too ambitious for some workloads. It can be changed globally
gnt-cluster modify -H kvm:migration_downtime=60
or for individual VMs
gnt-instance modify -H migration_downtime=60 $vmname
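To get a feeling for what this value means: QEMU only pauses the guest for the final stop-and-copy phase once the remaining dirty memory can be transferred within the allowed downtime. A back-of-the-envelope calculation with the two defaults from above:

```shell
# Budget for the final stop-and-copy phase at the default settings:
bandwidth_kib=$(( 32 * 1024 ))   # 32 MiB/s in KiB/s
downtime_ms=30
budget_kib=$(( bandwidth_kib * downtime_ms / 1000 ))
echo "${budget_kib} KiB may remain dirty when the guest is paused"
```

Under a megabyte of constantly rewritten memory is not much for a busy VM, which is why raising the downtime (or the bandwidth) helps the migration finish.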
It is also adjustable on the fly for currently running migrations by running
echo 'migrate_set_downtime 1000ms' | \
  /usr/bin/socat STDIO \
  UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/$vmname.monitor
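All three monitor interactions above follow the same pattern, so a tiny wrapper can save some typing. This is a hypothetical helper, not part of Ganeti; with DRY_RUN=1 it only prints the command it would run, which also lets you verify the socket path first:

```shell
# Send an arbitrary command to the QEMU monitor socket that Ganeti
# exposes for each KVM instance (hypothetical convenience wrapper).
monitor() {
  vm="$1"; shift
  sock="/var/run/ganeti/kvm-hypervisor/ctrl/${vm}.monitor"
  if [ -n "$DRY_RUN" ]; then
    # Show what would be executed instead of touching the socket.
    echo "echo '$*' | socat STDIO UNIX-CONNECT:${sock}"
  else
    echo "$*" | /usr/bin/socat STDIO "UNIX-CONNECT:${sock}"
  fi
}

DRY_RUN=1 monitor vmname migrate_set_speed 1000m
```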