100% uptime is impossible. All we can do is get close.
Last week 2 hard disks failed (simultaneously!) in node 2 of cluster 2. Besides being backed up in real time to a second cluster node, each node runs RAID-5 with hot-swap drives. A single drive can fail and be replaced with no down time. But if a second drive fails, it’s fatal.
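For anyone wondering why RAID-5 survives exactly one drive failure, here is a toy sketch of the parity idea. It has nothing to do with our actual storage stack or cluster software; the block values are made up and real arrays work on whole stripes, but the XOR arithmetic is the reason one lost drive is recoverable and two are not:

```python
# Toy illustration of RAID-5 style parity (hypothetical values, not our setup).
from functools import reduce

data_blocks = [0b10110010, 0b01101100, 0b11100001]   # made-up data stripes
parity = reduce(lambda a, b: a ^ b, data_blocks)      # parity stripe = XOR of data

# Lose one data block: XOR of the survivors and the parity recovers it exactly.
lost = data_blocks[1]
recovered = reduce(lambda a, b: a ^ b, [data_blocks[0], data_blocks[2], parity])
assert recovered == lost

# Lose a second block and there is no longer enough information:
# one parity equation cannot solve for two unknowns.
```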
It wasn’t a clean failure, and both the fail-over system and the failure reporting performed less than perfectly. Initial symptoms seemed to point to a network card failure. The cluster software did fail over properly, but we had to clean up some databases. Some sites were not in good shape for 2 or 3 hours.
The next day, the remaining node was bombarding us with emails about the failed node. I had to shut everything down and power it back up outside the cluster. Total down time from this was probably 10 minutes. This was necessary because otherwise we could easily have missed emails about failures we were not yet aware of. The signal-to-noise ratio was way too low.
On Monday we replaced all the hard drives in the failed node, re-installed the operating system and all the cluster software, and began manually syncing the drives from the node that was still in operation. Synchronization completed overnight last night (Tuesday night).
This morning at 5 AM I began the task of moving services back into the cluster. I will spare you the details, but it’s a nasty and error-prone process. All the safeguards, checks and balances in the cluster software really get in the way while doing this. Sites were up and down several times. My guess at total down time today is something like 30 minutes.
Everything is completely back to normal now.
This was the first major real-world test of the clustered live fail-over system we put in place 18 months ago. I’m not totally happy with it. Previous tests were done by pulling plugs, simulating total failures. In that situation, performance was flawless; down time was so short that no one noticed. Real-world failures are usually messy, like this one was. The fail-over system worked, but it needed a little help. It was still a big win compared to re-installing a server and restoring backups, which could take a day or more.
There is a recurring pattern with problems like these. There is a period of a few days or a week during which problems come up and quickly or gradually get ironed out. In retrospect these periods feel much longer than they really were, because the worry and frustration when a server is down is intense. An hour is remembered as half a day. Related problems recurring a few times over several days are remembered as lasting a week or more. It’s human nature. Problem periods are followed by long stretches, many months or a year, during which everything runs smoothly.
If you look at our up time over longer periods it’s actually very good: something over 99.99%. My perfectionist nature often makes me lose sight of that. But nobody does any better, so it’s worth a reminder.
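For a rough sense of what those percentages mean in practice, here is a back-of-the-envelope calculation. The numbers below are just the arithmetic of the percentages themselves, not anything pulled from our monitoring:

```python
# Rough down time budget implied by a given uptime percentage (illustrative only).
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_budget(uptime_percent):
    """Minutes of allowed down time per year at the given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {downtime_budget(pct):.1f} minutes of down time per year")
```

At 99.99%, that works out to roughly 53 minutes of down time per year.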