Today I saw another fine example of why having a fencing mechanism in a cluster is a sane idea.
The old setup used the heartbeat version 1 DRBD agent “drbddisk”, which assumes DRBD is started during the boot process. That agent was the first item of a group, followed by a filesystem resource, an IP address and a MySQL database. The group was tied to the pingd attribute, which reflects the connection to the default gateway, and that was it. Nothing fancy … and nothing fency either.
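For illustration, the old setup might have looked something like the following crm-shell sketch; all resource names, device paths and addresses here are invented, and the exact parameter syntax of the old configuration may have differed:

```
# Hypothetical sketch of the old setup (invented names/values)
primitive drbd_r0 heartbeat:drbddisk params 1: r0
primitive fs_r0 ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/var/lib/mysql fstype=ext3
primitive ip_db ocf:heartbeat:IPaddr2 params ip=192.168.1.100
primitive mysql lsb:mysql
group g_db drbd_r0 fs_r0 ip_db mysql
# Group only runs where the gateway is reachable (pingd attribute)
location l_db g_db rule -inf: not_defined pingd or pingd lte 0
```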
The goal was to re-create the service with the new cluster stack and perform the failover tests that once were done with the old setup. A lot of things worked, but in one situation the new cluster stack was apparently behaving worse than the old setup.
Consider node1 in primary mode and node2 in secondary mode. Now run “pkill -9 heartbeat” on node1 to simulate a heartbeat crash. Node2 realizes that node1 is gone and tries to promote DRBD. This obviously does not work since, in a default setup, DRBD does not allow two primaries at the same time. So we have a promotion failure. Pacemaker correctly initiates a recovery process consisting of a demote, a stop, a start and eventually a promote of the DRBD device.
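As a session transcript, the test and the first symptom look roughly like this (resource name r0 assumed, output paraphrased):

```
# On node1 (Primary): simulate a crash of the cluster stack only.
# DRBD itself keeps running in the kernel.
pkill -9 heartbeat

# On node2: the promotion attempt fails.
crm_mon -1        # shows a failed promote operation on the DRBD resource
drbdadm role r0   # Secondary/Primary: the peer is still connected
                  # and still Primary, so promotion is refused
```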
Step one, demote, basically executes “drbdadm secondary $res”, which works. Step two, stop, runs “drbdadm down $res”; this also works. Step three, start, is not much more than “drbdadm up $res”, and finally, step four, promote, is basically “drbdadm primary $res”. I thought the last step would simply fail again since, after “up”, the connection would be established and promote would then be refused just as it was in the first place.
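Spelled out as commands, the recovery sequence is roughly the following (resource name r0 assumed; the real agent does a bit more bookkeeping):

```
drbdadm secondary r0   # demote
drbdadm down r0        # stop: this also tears down the network connection
drbdadm up r0          # start: the connection is re-established
                       # asynchronously, some time later
drbdadm primary r0     # promote
```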
But what actually happened was that Pacemaker executed “start” and “promote” so quickly that the promotion happened before the DRBD network connection was established. Without a connection, “drbdadm primary” had no way of seeing that the peer was still primary, so it did not refuse, and Pacemaker happily started the rest of the group. The IP address was now active on both machines at the same time, causing all kinds of ARP trouble, and the corresponding MySQL database … well … was started and writable on both nodes. So it might just happen (depending on the ARP cache state of the clients) that one client writes to node1 while another client writes to node2, leaving you to sort things out manually.
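The race can be illustrated with a toy model, written in Python just for this post; this is of course not real DRBD code, merely a sketch of the state logic involved:

```python
class ToyDRBD:
    """Toy model of one node's view of a DRBD resource. Not real DRBD code."""

    def __init__(self):
        self.connected = False
        self.peer_is_primary = False

    def up(self):
        # 'drbdadm up' brings the device up; the network connection to the
        # peer is established asynchronously, some time later.
        self.connected = False

    def establish_connection(self, peer_is_primary):
        self.connected = True
        self.peer_is_primary = peer_is_primary

    def promote(self):
        # A default (single-primary) setup refuses to become Primary only
        # if it can actually see that the peer is already Primary.
        if self.connected and self.peer_is_primary:
            return False  # refused: would create a dual-primary situation
        return True       # promoted

# Fast cluster manager: promote races ahead of the connection.
fast = ToyDRBD()
fast.up()
print(fast.promote())  # True -> both nodes Primary, split brain

# Connection established first, then promote: correctly refused.
slow = ToyDRBD()
slow.up()
slow.establish_connection(peer_is_primary=True)
print(slow.promote())  # False
```

The point of the sketch: the refusal depends on information that only arrives with the connection, so issuing promote before the connection exists silently succeeds.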
Now consider the same situation with the old drbddisk agent. After node2 realized node1 was gone, it would also have tried to put the DRBD resource into primary mode, but it would not have succeeded. Since its “stop” operation is just “drbdadm secondary” rather than “drbdadm down”, it would never have torn down the network connection, and so DRBD would have kept refusing to become primary. From a database admin's point of view, this is the sane thing to do: it is not the database that crashed but the cluster software, so why restart the database?
From a cluster point of view, the cluster could have recovered from that situation correctly if the admin had provided the proper tools to do so. Since he did not (stonith disabled and no-quorum-policy set to ignore), the cluster left him to sort out the mess.
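In crm-shell terms, the cluster was effectively running with fencing switched off, and fixing that means enabling stonith and configuring at least one real fencing device. The IPMI parameters below are invented placeholders:

```
# What was configured: fencing effectively disabled
property stonith-enabled=false
property no-quorum-policy=ignore

# What a recoverable setup needs (device and credentials are made up):
property stonith-enabled=true
primitive st_node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=10.0.0.11 userid=admin passwd=secret
location l_st_node1 st_node1 -inf: node1   # never run a node's own fence device on itself
```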
So what would have happened if stonith had been enabled in that situation? Node2 would have powered off node1 and then, only after that power off procedure succeeded, it would have promoted the DRBD device and started the database service. Then, once node1 had rebooted, it would have re-joined the cluster and all that would have happened would have been a restart of the database. No having-to-sort-out-the-mess at all.
In my experience, fencing is often considered an optional component of a cluster, and people build clusters without a proper implementation of this mechanism. Then, in a tricky situation like this one, they most likely fail to understand what actually went wrong. Maybe because understanding, and admitting to $BOSS, that oneself is what went wrong is a hard thing to do. Blindly assuming the cluster did the wrong thing and making it look bad is really easy in comparison.