Unexpected pitfall when building MySQL HA Servers

I have built quite a few HA setups for numerous pieces of software over the past years. But what happened this week is something I had never seen … well … until now.

One of the most awesome pieces of open-source software: DRBD

It’s not very often that I use terms like “great”, “awesome” or even “love” when it comes to computer things any more, but when it comes to DRBD, I can’t help it. It’s a truly great piece of software that I just love to work with. It helped me solve a lot of problems I had to deal with in my job over the last couple of years – here’s a tribute …

The first time I played with DRBD was in summer of 2007. I was trying to build my first open-source HA cluster and needed “something” that took care of the data-replication part. After I tried DRBD, there was nothing else to look at because it did exactly what I needed and did a very good job.

DRBD is a Linux kernel module that provides a block device and replicates data between two nodes. On one node, the device is writable; on the other node, the device cannot be accessed – it’s an active/backup, master/slave, primary/secondary, call it what you like, kind of setup. Nowadays (actually, for a couple of years already), you can even go active/active, but I’ve never really used that, so I won’t cover it here.
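
To give a rough idea of what working with these roles looks like, here is a minimal sketch (the resource name “r0”, the device minor and the mount point are made up for illustration):

# on the active node
drbdadm primary r0            # make the local side writable
mkfs.ext4 /dev/drbd0          # first time only: create a filesystem on top
mount /dev/drbd0 /srv/data

# for a manual switch-over: unmount and demote here, then promote on the peer
umount /srv/data
drbdadm secondary r0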

Not only does it do a great job of safely replicating your data to another machine, it also allows for data verification (in case you don’t trust the data on the backup node) and is completely transparent to the applications running on top of it. So basically “anything” can be made highly available with DRBD (and a cluster manager). It can even save you from an outage if your disk dies.

Because it works at the block level, DRBD does not care what you do on top of it. Whether you encrypt the device, use it as a raw device for a database or just create a standard file system makes no difference to DRBD. It will just sit there and replicate whatever blocks change.

So you say it can protect me from disk failures, huh? There’s a configuration option called “on-io-error” which can be set to “detach”. This means that in case of a local I/O error on the active machine, DRBD will simply detach the backing block device from the DRBD device and write all changes to the backup node; read operations will also silently be served by the backup node. The application using the DRBD device won’t even notice that the physical disk failed. Once you have replaced the disk, you can just re-attach the DRBD resource to the new disk and your customers will not even notice you had an outage. Well, maybe things will be a little slower temporarily, but that’s still way better than an outage in the middle of the day.
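
For reference, this is roughly what that looks like in drbd.conf (a minimal sketch; the resource name is a placeholder and the rest of the configuration is omitted):

resource foo {
        disk {
                on-io-error     detach;   # on a local I/O error: detach the backing disk
                                          # and keep serving I/O via the peer
        }
        # ... device, disk, addresses etc. as usual ...
}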

So, thank you DRBD. Thank you Linbit. For this well documented awesome piece of free software that made me sleep at night instead of rushing to fix things, that enabled me to run things on standard hardware instead of having to buy overly expensive SAN things, that even helped me overcome laziness in creating backups of my digital photos.

Fencing shouldn’t be considered optional

Today I saw another fine example of why having a fencing mechanism in a cluster is a sane idea.

We were comparing an old heartbeat 2.1.3 setup with an (almost) up-to-date cluster stack formed by heartbeat 3.0.3, pacemaker 1.0.9 and DRBD 8.3.11.

The old setup used the heartbeat version 1 DRBD agent “drbddisk”, which assumes DRBD is started during the boot process, and had it as the first item of a group, followed by a filesystem resource, an IP address and a MySQL database. The group was tied to the pingd attribute, which reflects connectivity to the default gateway, and that was it. Nothing fancy … and nothing fency either.
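
For illustration, such a group would look roughly like this in crm shell syntax (all names and parameters here are made up and the pingd constraint is left out; this is not the original configuration):

primitive drbd_mysql heartbeat:drbddisk params 1="mysql"
primitive fs_mysql ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/var/lib/mysql" fstype="ext3"
primitive ip_mysql ocf:heartbeat:IPaddr2 params ip="10.0.0.10"
primitive mysqld lsb:mysql
group grp_mysql drbd_mysql fs_mysql ip_mysql mysqld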

The goal was to re-create the service with the new cluster stack and perform the failover tests that once were done with the old setup. A lot of things worked, but in one situation the new cluster stack was apparently behaving worse than the old setup.

Consider node1 in primary mode, node2 in secondary mode. Now pkill -9 heartbeat on node1 to simulate a heartbeat crash. Node2 realizes that node1 is gone and tries to promote DRBD. This obviously does not work since, in a default setup, DRBD does not allow for two primaries at the same time. So we have a promotion failure. Pacemaker correctly initiates a recovery process consisting of a demote, a stop, a start and eventually a promote on the DRBD device.

Step one, demote, basically executes drbdadm secondary $res – which works. Step two, stop, runs drbdadm down $res – this also works. Start in step three is not much more than drbdadm up $res, and finally, in step four, promote is basically drbdadm primary $res. I expected the last step to simply fail again, since after “up” the connection would be re-established and the promote would be refused just as it was in the first place.

But what actually happened was that pacemaker was so fast in executing “start” and “promote” that the promotion happened before the DRBD network connection was established. With no connected peer in sight, drbdadm primary did not refuse to go to primary mode, and pacemaker happily started the rest of the group. So the IP address was now active on both machines at the same time, causing all kinds of ARP trouble, and the corresponding MySQL database … well … was started and writable on both nodes. So it might just happen (depending on the ARP cache policy of the client) that one client writes to node1 while another client writes to node2, leaving you to sort things out manually.

Now consider the same situation with the old drbddisk agent. After node2 realized node1 was gone, it would also have tried to put the DRBD resource into primary mode, but it would not have succeeded. Since its “stop” operation is just “drbdadm secondary” instead of “drbdadm down”, it would not have torn down the network connection, and so DRBD would have refused to become primary. From a database admin’s point of view, this is the sane thing to do. It’s not the database that crashed but the cluster software. So why restart the database?

From a cluster point of view, the cluster could have recovered from that situation correctly, if the admin had provided the proper tools to do so. Since he did not (stonith disabled and no-quorum-policy set to ignore), the cluster left him to sort out the mess.

So what would have happened if stonith had been enabled in that situation? Node2 would have powered off node1 and then, only after that power off procedure succeeded, it would have promoted the DRBD device and started the database service. Then, once node1 had rebooted, it would have re-joined the cluster and all that would have happened would have been a restart of the database. No having-to-sort-out-the-mess at all.
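
As a rough sketch of what enabling fencing could look like in crm shell syntax (the stonith plugin and all of its parameters below are placeholders – use whatever matches the management boards your nodes actually have):

crm configure property stonith-enabled="true"
# one fencing device per peer; external/ipmi is just an example plugin
crm configure primitive fence_node1 stonith:external/ipmi \
        params hostname="node1" ipaddr="192.168.100.1" userid="admin" passwd="secret" interface="lan"
crm configure primitive fence_node2 stonith:external/ipmi \
        params hostname="node2" ipaddr="192.168.100.2" userid="admin" passwd="secret" interface="lan"
# keep each fencing device away from the node it is supposed to kill
crm configure location l_fence_node1 fence_node1 -inf: node1
crm configure location l_fence_node2 fence_node2 -inf: node2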

In my experience, fencing is often considered an optional component of a cluster, and people build clusters without a proper implementation of this mechanism. And then, in a tricky situation like this, they most likely fail to understand what actually went wrong. Maybe that’s because understanding and admitting that you yourself are what went wrong is a hard thing to do and to explain to $BOSS. Just making the cluster look bad by blindly assuming it did the wrong thing is really easy in comparison.

Migrating VMs between DRBD backed clusters

DRBD is a kernel module that lets you mirror block devices over the network. Every bit you write is mirrored to a second node, and your writing process only gets control back once the second node has finished writing¹. So at every point in time, you have an exact copy of whatever it is you’re writing to that block device.

One of many common use cases is to export DRBD devices to virtual machines and have them use those as hard drives. If you think about what I wrote in the first paragraph of this post, you’ll understand that if a VM uses a DRBD device as a hard disk, you can effectively run the VM on either of your two machines. So, say you have to shut down the currently active machine for maintenance, you can move the VM to the other node and your service does not have to go down with its hardware – the service the VM provides does not have to be interrupted².

Several years ago, I built such a platform to run a bunch of VMs (like 15 or so) and thereby reduce 15 physical machines to 2. Time goes by, and now that VM cluster hardware is to be replaced by more powerful hardware in order to be able to run more virtual machines.

The first task is to migrate the currently running VMs to the new hardware, and here’s how I did that using DRBD. Let’s first paint a picture of what I’m talking about:

So right now, all VMs run on Node1 and Node2, and the DRBD replication takes place over their back-to-back connection on network 192.168.0.0/30. The goal is to move all VMs to Nodes 3 and 4 and replicate data over their back-to-back connection on network 192.168.1.0/30.

Steps to move one VM from Node1 to Node3:

  1. Node1:
    1. Disconnect the DRBD device:
      drbdadm disconnect foo
    2. Re-configure drbd.conf to replicate to Node3 instead of Node2 using the common network 10.0.0.0/8:
      resource foo {
              protocol C;
              device          /dev/drbdXX;
              disk            /dev/vg1/foo;
              meta-disk       internal;
              on Node1 {
      #                address 192.168.0.1:7788;
                      address 10.0.0.1:7788;
              }
      #        on Node2 {
              on Node3 {
      #                address 192.168.0.2:7788;
                      address 10.0.0.3:7788;
              }
      }
      
    3. Load this config:
      drbdadm adjust foo
    4. Connect this config:
      drbdadm connect foo
  2. Node3:
    1. Create backing device with the same specs as on Node1
    2. Create drbd.conf that uses this backing device and replicates from Node1 using the common network 10.0.0.0/8:
      resource foo {
              protocol C;
              device          /dev/drbdXX;
              disk            /dev/vg1/foo;
              meta-disk       internal;
              on Node1 {
                      address 10.0.0.1:7788;
              }
              on Node3 {
                      address 10.0.0.3:7788;
              }
      }
      
    3. Create metadata on this new DRBD device:
      drbdadm create-md foo
    4. Bring this device up:
      drbdadm up foo
  3. Watch the device sync (see the note after this list):
    drbdadm status
  4. Node1: After the initial sync, shut down the VM and put the device into secondary mode:
    drbdadm secondary foo
  5. Node3: Put device into primary mode:
    drbdadm primary foo
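
A note on the “watch the device sync” step (here and again when syncing to Node4 below): depending on your DRBD version, drbdadm status may not be available; on the older 8.x releases you can simply watch /proc/drbd instead, for example:

watch -n1 cat /proc/drbd    # wait for ds:UpToDate/UpToDate before switching roles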

Once that’s done, copy your VM’s configuration file to Node3, make adjustments as needed (maybe network bridge names changed) and try to boot up the VM. If you’re using the same hypervisor and have a sane configuration, this should just work. The VM’s data is identical to what it was on Node1 before.
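
If you happen to be using libvirt/KVM, that copy step could look roughly like this (a sketch – the hypervisor, the domain name “foo” and the paths are assumptions, not taken from the setup described here):

# on Node1
virsh dumpxml foo > /tmp/foo.xml
scp /tmp/foo.xml Node3:/tmp/foo.xml

# on Node3, after editing bridge names, disk paths, etc. in the XML
virsh define /tmp/foo.xml
virsh start foo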

So now we need to re-configure Node3 to replicate data to Node4.

  1. Node3
    1. Disconnect the DRBD device:
      drbdadm disconnect foo
    2. Re-configure drbd.conf to replicate to Node4 instead of Node1 using the back to back network 192.168.1.0/30:
      resource foo {
              protocol C;
              device          /dev/drbdXX;
              disk            /dev/vg1/foo;
              meta-disk       internal;
              on Node3 {
                      address 192.168.1.1:7788;
              }
              on Node4 {
                      address 192.168.1.2:7788;
              }
      }
      
    3. Load the new config:
      drbdadm adjust foo
    4. Connect the new config:
      drbdadm connect foo
  2. Node4:
    1. Create backing device with the same specs as on Node3
    2. Create drbd.conf that uses this backing device and replicates from Node3 using the back to back network 192.168.1.0/30:
      resource foo {
              protocol C;
              device          /dev/drbdXX;
              disk            /dev/vg1/foo;
              meta-disk       internal;
              on Node3 {
                      address 192.168.1.1:7788;
              }
              on Node4 {
                      address 192.168.1.2:7788;
              }
      }
      
    3. Create metadata on this new DRBD device:
      drbdadm create-md foo
    4. Bring this device up:
      drbdadm up foo
  3. Watch the device sync:
    drbdadm status
  4. Node3: After the initial sync, shut down the VM and put the device into secondary mode:
    drbdadm secondary foo
  5. Node4: Put the device into primary mode:
    drbdadm primary foo

Now copy the VM configuration from Node3 and try to start up the VM on Node4. This, too, should just work.

I thought this was an impressively easy way to migrate things to a new cluster and once again, DRBD “just worked” for me.

Cheers

¹) But that’s only one way of using DRBD – have a look at their page if you don’t know DRBD yet.
²) Without any further setup you’d technically have to shut down the VM on the active node and boot it up on the second node, which would give you the downtime of a reboot, but this can be optimized.

OpenNMS UCE 2011

This Thursday and Friday I attended the Users Conference Europe 2011 of the OpenNMS project. In the last two years, the program consisted of talks that were submitted before the actual conference, and as an attendee you only had to choose which talks you wanted to listen to. This year, they had decided to hold a course covering the basics of the software on day one (they usually run this course over 4 days) and a barcamp-style event on day two.

To be honest, I didn’t expect much from day one since I’ve been using OpenNMS for over 4 years now … Once again, Tarus did a great job talking about OpenNMS and keeping everyone interested and awake by putting in an anecdote or a joke every now and then. And actually, I did learn quite a few things. For example, I now understand that an “RRA” is a round robin archive and what the numbers in such a configuration actually mean.

RRA:AVERAGE:0.5:1:2016
RRA:AVERAGE:0.5:12:148

So the first RRA would store 2016 entries, each holding the average value of 1 sample. The second one would store 148 entries, each holding the average value of 12 samples. The 0.5 means that at least 6 of the 12 samples (0.5, or 50%) have to be known in order to actually store a consolidated value.
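
To put those numbers into perspective, assuming the usual 5-minute collection step (the step itself is not part of these RRA lines, so treat this as an illustration):

RRA:AVERAGE:0.5:1:2016   ->  2016 x 5 min               =  7 days at full resolution
RRA:AVERAGE:0.5:12:148   ->  148 x (12 x 5 min = 1 h)  ~=  6 days of hourly averages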

Aaand … I finally understood what a – drumroll – alarm is. I had never had a use for this since I could apparently do anything I wanted OpenNMS to do with just events, and so I never really tried to understand the alarms concept. Turns out, afaiu, alarms are just something a user sees in the WebUI, so I’m not too interested in that. Although … there was one thing that sounded interesting … namely being able to only keep the most recent event of a certain type around instead of storing all of them. I’ll need to dig into that, I guess.

For day 2 I really did not know what to expect. I had never attended a barcamp before, and what I read about it on the internet didn’t really give me a good idea of what was going to happen. Maybe I didn’t read carefully enough. So anyway, they had everybody get up, come to the front, introduce themselves and then briefly say what they were interested in regarding OpenNMS and what they might offer to talk about.

So I figured that since I had done the HA talk last year and the slides were still on my laptop, I could offer to give that presentation if anyone was interested. After everybody had said what they’d like to hear or wanted to talk about, everyone got to vote on the available topics, and the nine top-voted talks would then be held in the 3 rooms available for the conference. Turns out my topic was among the top 5 of the offered talks. So, about 10 minutes later, I started giving that talk from last year.

I hadn’t looked at the presentation since then, it was about 80 slides, and I only had 90 minutes for the talk, so as soon as the projector was working I started rambling about open-source HA clusters and went through the setup I had created about a year ago. Actually, I think it went rather smoothly considering I had not thought about this talk for about a year, and I’m quite satisfied with how it turned out. Some guy (sorry, I’m _really_ bad with names) even gave me some positive feedback on the talk, which always feels good.

After lunch, there was a talk on provisioning, for which I also volunteered to share my use case of provisioning nodes from the data of a DNS server. While David did most of the talk, I was able to slide this in, and I think there was some interest in it.

This barcamp approach was a completely new conference concept to me, but I can’t say I didn’t like it. While you could see that some people were not that comfortable talking in front of the entire group of 60 (!!!) people, I think I kind of got used to that over the last couple of years, and I’m quite happy about it.

So let’s get some sleep and go bike riding tomorrow :)