Halfway through setting up continuous deployment with Jenkins and Ansible, I sporadically saw “unknown errors” on SSH connections when Ansible tried to run tasks on a series of hosts. Searching for “SSH encountered an unknown error during the connection” turned up a LOT of results, but nothing that really helped. So I had to dig deeper and eventually found the problem.
Migrating a VM from its primary to its secondary node is usually a very simple task for Ganeti. But with busy machines, it apparently sometimes does not work. Here’s what helped me resolve the situation.
A friend of mine received photos from a photographer who seems to have a really limited email account. To “bypass” the attachment size limit, the photographer used pkzip to split a zip archive into multiple smaller files. Since my friend doesn’t know much about computers (and did not have pkzip installed on his Windows machine), he failed to extract the pictures. In case anyone else ever has to extract a multi-part pkzip archive, here’s what to do (on Linux) …
It’s not very often that I use terms like “great”, “awesome” or even “love” when it comes to computer things any more, but when it comes to DRBD, I can’t help it. It’s a truly great piece of software that I just love to work with. It has helped me solve a lot of problems in my job over the last couple of years. Here’s a tribute …
The first time I played with DRBD was in summer of 2007. I was trying to build my first open-source HA cluster and needed “something” that took care of the data-replication part. After I tried DRBD, there was nothing else to look at because it did exactly what I needed and did a very good job.
DRBD is a Linux kernel module that provides a block device and replicates data between two nodes. On one node, the device is writable; on the other node, the device cannot be accessed. It’s an active/backup, master/slave, primary/secondary, call-it-what-you-like kind of setup. Nowadays (actually, for a couple of years already) you can even go active/active, but I’ve never really used that, so I won’t cover it here.
Not only does it do a great job in safely replicating your data to another machine, it also allows for data verification (in case you don’t trust the data on the backup node) and is 100% oblivious to applications. So basically “anything” can be made highly available with DRBD (and a cluster manager). It can even save you an outage if your disk dies.
Working with blocks, it does not matter to DRBD what you do on top of it. Whether you want to encrypt the device, use it as a raw device for a database or just create a standard file system is not interesting for DRBD. It will just sit there and replicate whatever blocks change.
So you say it can protect me from disk failures, huh? There’s a configuration option called “on-io-error” and it can be set to “detach”. This means that in case of a local I/O error on the active machine, DRBD will just detach the backing block device from the DRBD device and write changes to the backup node only. Read operations will also silently be served by the backup node. The application using the DRBD device won’t even notice that the physical disk failed. Once you have replaced the disk, you just attach the DRBD resource to the new disk, and your customers will not even notice you had a disk failure. Well, maybe things will be a little slower temporarily, but that’s still way better than an outage in the middle of the day.
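As a rough illustration, the option lives in the disk section of a resource definition. The fragment below is a minimal sketch in DRBD 8.x configuration syntax; the resource name r0 is a placeholder, and the rest of the resource definition (hosts, devices, addresses) is left out.

```
resource r0 {
  disk {
    # On a local I/O error, drop the backing disk and keep running diskless.
    on-io-error detach;
  }
}
```

After swapping the failed disk, the resource would typically be re-attached with “drbdadm attach r0”. With internal metadata, fresh metadata usually has to be created first with “drbdadm create-md r0”, so check the documentation for your DRBD version before relying on this.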
So, thank you DRBD. Thank you Linbit. For this well-documented, awesome piece of free software that let me sleep at night instead of rushing to fix things, that enabled me to run things on standard hardware instead of having to buy overly expensive SAN gear, that even helped me overcome my laziness about creating backups of my digital photos.
Well, you actually _can_ view the console and as long as you don’t have to type special characters, it _is_ somewhat usable. But try to type a pipe, for example (for you German Windows guys reading this: that’s the symbol that appears when you press “ALT-GR” and “>” at the same time; it looks like this: “|”). Quite a usual task when working on the command line … you might say. But it seems impossible to do that from a Linux desktop. We tried a couple of things but couldn’t get it to work, and since the task was somewhat urgent, we decided to try a Windows XP VM we had in a lab environment.
At first, the remote desktop login to the VM didn’t work because we had mixed up one byte of the node’s IP address. Apparently someone in Russia, who happened to own the IP we connected to by accident, had a remote desktop session running, but since everything was in Russian letters, we couldn’t read what was going on. At any rate, we could see the login didn’t work, and once we verified the IP, we noticed the mistake. Unfortunately, by that time we had already entered the lab password a couple of times … So this needs changing …
Then, once the connection to the actual Windows system was established, we started a browser to connect to the DRAC’s HTTPS interface, only to find out we could not type the URL. Every time we hit “r”, the system showed the “start”-“execute” dialog. Erm, yeah. After a reboot this was fixed, but it turned out the node was in a network segment that was not allowed to connect to the DRAC.
Luckily, someone had a Windows laptop around from which we could then try to actually do the work. Obviously … there was an old Java version installed which DRAC didn’t like, so we had to upgrade Java. Download 80+ MB, extract, install … it took about 15 minutes, but finally it was possible to start the DRAC remote console and actually type special characters.
We couldn’t help but think of A Bug’s Life at this moment.
Do not panic! We are trained professionals!
Three (apparent) IT professionals needed about one hour to connect to a remote console and execute a rather trivial task (one for which the network had to be disconnected and which therefore could only be executed via a remote console). Prost!
Today I saw another fine example of why having a fencing mechanism in a cluster is a sane idea.
The old setup used the heartbeat version 1 DRBD agent “drbddisk”, which assumes DRBD is started during the boot process and had that as the first item of a group with a filesystem resource, an IP address and a MySQL database. The group was tied to the pingd attribute which corresponds to the connection to the default gateway and that was it. Nothing fancy … and nothing fency either.
The goal was to re-create the service with the new cluster stack and perform the failover tests that once were done with the old setup. A lot of things worked, but in one situation the new cluster stack was apparently behaving worse than the old setup.
Consider node1 in primary mode, node2 in secondary mode. Now run “pkill -9 heartbeat” on node1 to simulate a heartbeat crash. Node2 realizes that node1 is gone and tries to promote DRBD. This obviously does not work since, in a default setup, DRBD does not allow for two primaries at the same time. So we have a promotion failure. Pacemaker correctly initiates a recovery process consisting of a demote, a stop, a start and eventually a promote on the DRBD device.
Step one, demote, basically executes “drbdadm secondary $res”, which works. Step two, stop, runs “drbdadm down $res”; this also works. Start in step three is not much more than “drbdadm up $res” and finally, in step four, promote is basically “drbdadm primary $res”. I thought the last step would just fail again since, after “up”, the connection would be established and then promote would fail as it did in the first place.
But what actually happened was that Pacemaker was so fast in executing “start” and “promote” that the promotion happened before the DRBD network connection was established. Therefore, “drbdadm primary” did not refuse to go to primary mode and Pacemaker happily started the rest of the group. As a result, the IP address was available on both machines at the same time, causing all kinds of ARP trouble, and the corresponding MySQL database … well … was started and writable on both nodes. So it might just happen (depending on the ARP cache policy of the client) that one client writes to node1 while another client writes to node2, leaving you to sort things out manually.
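To make the race more tangible, here is a minimal shell sketch that only promotes once the connection state reported by “drbdadm cstate” reads “Connected”. The function names, the DRBDADM variable and the retry count are assumptions made up for this illustration, and the state names are those of DRBD 8.x. This is not how the Pacemaker resource agent works, and it is no substitute for fencing: if the peer is really dead, the connection never comes up and this sketch refuses to promote at all.

```shell
#!/bin/sh
# Illustration only: DRBDADM, wait_connected and safe_promote are hypothetical names.
DRBDADM="${DRBDADM:-drbdadm}"

# Poll the connection state until it reads "Connected" or we run out of tries.
wait_connected() {
    res="$1"
    tries="${2:-10}"
    while [ "$tries" -gt 0 ]; do
        if [ "$($DRBDADM cstate "$res" 2>/dev/null)" = "Connected" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Promote only when the peer connection is up, so DRBD itself can still
# refuse the promotion if the peer is already primary.
safe_promote() {
    res="$1"
    if wait_connected "$res" "${2:-10}"; then
        $DRBDADM primary "$res"
    else
        echo "refusing to promote $res: peer not connected" >&2
        return 1
    fi
}
```

Calling “safe_promote r0 30” would wait up to roughly 30 seconds for the connection before giving up. With the connection established, “drbdadm primary” gets the chance to refuse, just as it did on the first promotion attempt.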
Now consider the same situation with the old drbddisk agent. After node2 realized node1 was gone, it would also have tried to put the DRBD resource into primary mode, but it would not have succeeded. Since its “stop” operation is just “drbdadm secondary” rather than “drbdadm down”, it would not have torn down the network connection, and so DRBD would have kept refusing to become primary. From a database admin point of view, this is the sane thing to do. It’s not the database that crashed but the cluster software. So why restart the database?
From a cluster point of view, the cluster could have recovered from that situation correctly, if the admin had provided the proper tools to do so. Since he did not (stonith disabled and no-quorum-policy set to ignore), the cluster leaves him to sort out the mess.
So what would have happened if stonith had been enabled in that situation? Node2 would have powered off node1 and then, only after that power-off procedure succeeded, it would have promoted the DRBD device and started the database service. Then, once node1 had rebooted, it would have re-joined the cluster, and the only visible effect would have been a restart of the database. No having-to-sort-out-the-mess at all.
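For reference, in the crm shell’s configuration syntax the relevant cluster property looks roughly like the line below. This is only a sketch: enabling the property achieves nothing without a working stonith resource that can actually power off each node, and which stonith agent and parameters to use depends entirely on the hardware, so none is shown here.

```
property stonith-enabled="true"
```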
In my experience, fencing is often considered an optional component of a cluster and people build clusters without a proper implementation of this mechanism. And then, in a tricky situation like this, they most likely fail to understand what actually went wrong. Maybe because understanding and admitting that you yourself are what went wrong is hard to do, and even harder to explain to $BOSS. Making the cluster look bad by blindly assuming it did the wrong thing is really easy in comparison.