In the last 6 months we have twice had nearly simultaneous drive failures leading to service outages. It was hard at first to grasp how something so seemingly unlikely could have happened. When it happened a second time, it was time for some serious scrutiny. What seemed like common sense might not be correct.
We run servers in pairs. A hard disk on one server has a corresponding hard disk on another server. If one disk fails, the service can simply be powered up on the other server. Both servers have to be down to cause an outage.
The solution became obvious once the problem was understood.
Google released a study of their experience with a very large population of hard disks and their failures. If you have a taste for a dry technical paper, you can find it here: Google media research. What they found was revealing. The data set is based on consumer-grade drives. We use enterprise-grade drives, which have a much longer life, but the general observations should be about the same. This chart summarizes failure times:
As expected, hard disks show high infant mortality followed by a period of (in our case) several years of reliable service. Then failure rates suddenly increase. The decline in failure rates at 4 years is unexpected, but a gradually increasing rate after that is just what you might expect. There is no comparable data showing a rise and fall like that for enterprise drives; it may or may not happen.
When we think about reliability, what we want to know is the likelihood of a failure event in a given time interval. Then we can make statements (these are made-up numbers) such as: the odds of a drive failure in a server over a month's time are 1 in 300. If the mirror drive in the second server has the same odds and the failures are independent, the chance of both drives failing in the same month is 1 in 300 times 1 in 300, or about 1 in 90,000. Since replacing a failed drive and re-mirroring takes 2 days, what actually matters is the chance that the surviving drive fails within that 2-day window; a 2-day slice of a 1-in-300 month is roughly 1 in 4,500, which puts the odds of a double failure before we could recover with no down time at something like 1 in 1.35 million per pair per month. That seems reasonable enough, but it turns out not to be correct. The problem is the failure rate distribution.
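For anyone who wants to check that arithmetic, here is a minimal sketch of the naive calculation, using the same made-up numbers (1-in-300 monthly failure odds, a 2-day rebuild, a 30-day month):

```python
# Naive, independent-failure arithmetic for a mirrored pair.
# All numbers here are the made-up illustrative figures from the text.

p_month = 1 / 300          # chance a given drive fails in a given month
rebuild_days = 2           # time to replace a failed drive and re-mirror
days_per_month = 30

# Chance the surviving mirror also fails during the 2-day rebuild window,
# assuming its failure odds are spread evenly across the month.
p_mirror_in_window = p_month * (rebuild_days / days_per_month)

# Chance of an outage for one pair in a given month: first drive fails,
# then its mirror fails before re-mirroring completes.
p_outage_month = p_month * p_mirror_in_window

print(f"P(mirror fails during rebuild) ~ 1 in {1 / p_mirror_in_window:,.0f}")
print(f"P(double failure per pair per month) ~ 1 in {1 / p_outage_month:,.0f}")
```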
Many people are familiar with “the bell curve”, what in statistics is called the normal distribution. The graph looks like this:
If you tossed a coin 100 times, counted the heads, and repeated that experiment a few thousand times, a graph of the counts would look like that: a big hump around 50 heads, trailing off toward mostly-heads runs on one side and mostly-tails runs on the other.
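If you would rather see it than take it on faith, here is a minimal sketch of that experiment; the run length, trial count, and histogram scaling are arbitrary choices:

```python
# Count heads in many runs of 100 coin tosses and print a rough text histogram.
import random

trials, tosses = 5000, 100
counts = [0] * (tosses + 1)
for _ in range(trials):
    heads = sum(random.randint(0, 1) for _ in range(tosses))
    counts[heads] += 1

# Print just the middle of the range, where nearly all the runs land.
for heads in range(35, 66):
    print(f"{heads:3d} heads | {'#' * (counts[heads] // 10)}")
```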
Hard disk manufacturers supply a statistic meant to show product life called the mean time between failures, or MTBF. If the number were 5 years, the expectation is that most drives would last about that long. What they report generally doesn't relate to reality very well, as the Google paper shows. Still, it's a useful statistic. If the MTBF were 5 years and we charted failure times for a large population of disks, you would expect something like a normal distribution with the peak of the curve at 5 years. Lacking data, my guess at the standard deviation for a set of 5-year-MTBF drives would be something like 3 to 6 months. Failure of a specific drive is random within a time frame, so it's reasonable to expect a failure curve to look something like a normal distribution. We are (were) working with two sets of hard drives, all manufactured at the same time, all in exactly the same kind of server, and all in service for exactly the same amount of time. What that means is that the top of the curve is going to be much narrower and the sides much steeper. In statistical terms, the standard deviation will be a much smaller number.
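To put a rough number on how much that narrowing matters, here is a minimal Monte Carlo sketch. The 5-year mean life, the normal lifetime distribution, the 2-day rebuild window, and the spreads tried are all assumptions for illustration, not measured data:

```python
# Compare double-failure odds for mirrored pairs installed on the same day,
# varying only how tightly the drives' lifetimes cluster around the mean.
import random

def double_failure_rate(sd_days, pairs=200_000, mean_days=5 * 365, rebuild_days=2):
    """Fraction of same-age pairs whose drives fail within the rebuild window of each other."""
    hits = 0
    for _ in range(pairs):
        a = random.gauss(mean_days, sd_days)   # lifetime of drive A, in days
        b = random.gauss(mean_days, sd_days)   # lifetime of drive B, installed the same day
        if abs(a - b) < rebuild_days:
            hits += 1
    return hits / pairs

for sd_months in (6, 3, 1):
    rate = double_failure_rate(sd_days=sd_months * 30)
    print(f"spread of {sd_months} months: ~1 in {1 / rate:,.0f} pairs hit a double failure")
```

The exact numbers mean nothing; the direction does. Squeezing the same mean life into a tighter spread makes it several times more likely that both halves of a pair fail within the rebuild window.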
So, the obvious solution? Add randomness. Add new drives, but not all new drives. The older drives have life in them yet; besides being a waste of money, replacing all of them at once would just set up the same situation again. What we have done is replace half of them, so that each replication pair consists of an older drive and a newer one. When an older drive fails, it will be replaced by another older drive until we run out of them, so new drives will be introduced at relatively random intervals. That spreads the peaks of those failure curves apart, drive by drive. We may not get the odds against a double failure anywhere near the naive estimate above, but clearly it will be a huge improvement. It would be nice to have actual data to predict from. We don't, so I will have to make a guess. Based on a lot of consideration, 1 in 1,000 seems reasonable. It's also a number we can live with.
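Continuing the same sketch as above, here is what staggering the pairs does. The one-to-three-year head start for the older drive and the tight one-month spread for same-batch drives are, again, illustrative assumptions:

```python
# Compare pairs installed on the same day against pairs where one drive
# already has a year or more of service behind it.
import random

MEAN_DAYS, SD_DAYS, REBUILD_DAYS = 5 * 365, 30, 2   # same tight 1-month spread as the same-batch case above

def pair_double_failure(offset_days):
    """True if both drives of a pair fail within the rebuild window of each other, in calendar time."""
    older = random.gauss(MEAN_DAYS, SD_DAYS) - offset_days  # drive that already has offset_days of service
    newer = random.gauss(MEAN_DAYS, SD_DAYS)                # drive installed today
    return abs(older - newer) < REBUILD_DAYS

trials = 200_000
matched = sum(pair_double_failure(0) for _ in range(trials))
mixed = sum(pair_double_failure(random.uniform(365, 3 * 365)) for _ in range(trials))

print(f"matched-age pairs: {matched:,} double failures out of {trials:,}")
print(f"mixed-age pairs:   {mixed:,} double failures out of {trials:,}")
```

In this toy model the mixed-age pairs essentially never wear out together, which overstates the benefit: it ignores infant mortality and plain random failures. That is why a guess like 1 in 1,000 is more honest than the near-zero the simulation suggests.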
Jon Fuchs says
I think the key is three. Meaning three drives or storage points. At home, whenever I’m backing up important documents or client projects, I (try my damnedest to remember to) save one copy on my desktop hard drive, one on my laptop, and one on an external hard drive. I’m not a math guy, but I assume the odds of catastrophic failure on all three during the same time frame “greatly” decrease.
If nothing else, it gives me a sense of security in knowing there are three copies.