Hello everyone, and welcome to Part 5 of the Nimble OS 2.1 blog series. Today we're going to cover RAID-3P, our new Triple Parity RAID implementation for all systems in the field.
Firstly, I'd like to recommend a quick read of a recent blog post by Ajay Singh (our VP of Product Management here at Nimble), which discusses the thought process behind Triple Parity RAID. It can be found here: Nimble Storage Blog | Reliability Without Compromise.
In the old world of Nimble, all systems installed in the field ran a typical RAID 6 + Hot Spare implementation on the provided 7.2K NL-SAS drives. This was true for both the head shelf and expansion shelves, although each shelf would have its own RAID + Hot Spare build rather than stretching the RAID across shelves as other vendors choose to do. See below a CS400 array on NOS 1.4.9.
RAID 6 + Hot Spare means that, in theory, I could withstand a double-drive failure on a single shelf of storage in a Nimble implementation without any loss of data. However, if I were to lose a third drive for any reason, my RAID set would be compromised and data loss would be highly probable, as I only have dual parity across the drives.
Another downside to RAID 6 is rebuild times: as an industry we are moving into the uncharted territory of very high capacity drives (4/6TB drives, and beyond). The larger the drive, the more data there is to rebuild in the event of a failure; and typically, the more data there is to rebuild, the longer the rebuild time of the RAID set. And the time at which the RAID set is most vulnerable to another drive failure is precisely while a reconstruction is taking place!
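To put rough numbers on this, here is a back-of-the-envelope sketch (my own illustrative figures, not Nimble measurements) of how long a full-drive rebuild takes as capacity grows, assuming the rebuild is bottlenecked by a sustained sequential rate:

```python
# Back-of-the-envelope full-drive rebuild time.
# The 100 MB/s sustained rate is an illustrative assumption for a
# 7.2K NL-SAS drive, not a measured Nimble figure.

def rebuild_hours(capacity_tb: float, rate_mb_s: float = 100.0) -> float:
    """Hours to read/write an entire drive of `capacity_tb` terabytes."""
    seconds = capacity_tb * 1e12 / (rate_mb_s * 1e6)
    return seconds / 3600

for tb in (2, 4, 6):
    print(f"{tb} TB drive: ~{rebuild_hours(tb):.1f} hours")
```

At a constant rebuild rate the exposure window grows linearly with capacity, which is exactly why bigger drives make the reconstruction period more nerve-wracking.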
Yet the biggest problem with larger drives is something called the Bit Error Rate (BER for short). With 4TB+ drives, the probability of encountering a bit error increases simply because there are more bits on a single drive - so as an industry we introduce a higher probability of an unrecoverable error (and hence an effective drive failure) when we adopt larger drives. A scary thought, indeed. In fact Robin Harris (aka StorageMojo) blogged in 2010 forecasting this exact problem, although perhaps he was being a bit too generous with the 2019 timeframe! Does RAID 6 stop working in 2019? — StorageMojo.
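As a rough illustration of why this matters (my own sketch; the 1-in-10^15 bits figure is a commonly quoted spec for enterprise NL-SAS drives, not a Nimble number), here is the probability of hitting at least one unrecoverable bit error while reading an entire drive end to end:

```python
import math

def p_read_error(capacity_tb: float, ber: float = 1e-15) -> float:
    """Probability of at least one unrecoverable bit error when reading
    a whole drive, assuming independent errors at `ber` errors per bit."""
    bits = capacity_tb * 1e12 * 8
    # 1 - (1 - ber)**bits, computed stably for a tiny ber
    return -math.expm1(bits * math.log1p(-ber))

for tb in (2, 4, 6):
    print(f"{tb} TB drive: {p_read_error(tb):.1%} chance of a bit error")
```

Even at that spec, a full read of a 4 TB drive carries roughly a 3% chance of tripping an error - and a traditional rebuild has to read every surviving drive in the set, so the odds compound.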
(It's worth noting that Nimble is quite different from a lot of other storage vendors in that, if a RAID rebuild is required, we only rebuild the individual compressed blocks that map to sectors on the failed drive. In contrast, other vendors choose to rebuild the whole drive regardless of whether the blocks are in use, fragmented, or simply white space. This is why our rebuilds don't take that long in the first place, and is where the comparison with Robin Harris' blog ends.)
As a forward-thinking technology vendor we absolutely want to remain competitive by offering the biggest and best drive capacities to our customers; but we are not willing to do that at the risk of increased data loss, or of impacting system performance for applications. A dilemma indeed!
Therefore, in Nimble OS 2.1 we are moving to Triple Parity RAID (or RAID-3P as we are calling it), which allows for greater protection of your data in a drive failure scenario, yet has zero impact on performance or usable capacity for any system in the field. It is applied as part of the upgrade to NOS 2.1 on any currently installed Nimble system, and does NOT require a RAID rebuild.
As you can see in the screenshot above, my CS400 is now running without a dedicated Hot Spare in the head shelf; we instead use that drive to hold the third parity, meaning I've lost no usable capacity in my array. And thanks to CASL's offloading and coalescing of write IO away from the drives, we are not hit by the additional RAID parity calculation that other implementations would need - so there is no degradation in performance!
So what does this mean? In all previous versions of NOS we would gracefully pause data services if we ran into a double-drive failure, to protect the data on the array from a potentially catastrophic event (although the array would still be online - an important fact when reading competitive rubbish from other vendors!). From NOS 2.1 onwards we can sustain three simultaneous drive failures on a single shelf before we have to take any corrective action.
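To see why that extra parity drive buys so much, here is a deliberately naive binomial sketch of my own - the 12-drive shelf and the 1% per-drive failure probability during a rebuild window are made-up illustrative numbers, and real drive failures are not independent - but it shows the direction of the win:

```python
from math import comb

def p_data_loss(drives: int, p_fail: float, parity: int) -> float:
    """Probability that more than `parity` drives fail at once,
    assuming independent failures with probability `p_fail` each."""
    survive = sum(comb(drives, k) * p_fail**k * (1 - p_fail)**(drives - k)
                  for k in range(parity + 1))
    return 1 - survive

dual = p_data_loss(12, 0.01, parity=2)    # RAID 6: tolerates 2 failures
triple = p_data_loss(12, 0.01, parity=3)  # RAID-3P: tolerates 3 failures
print(f"dual parity:   {dual:.2e}")
print(f"triple parity: {triple:.2e}")
print(f"improvement:   ~{dual / triple:.0f}x")
```

Even under this crude model, the third parity drive cuts the shelf-level data-loss probability by well over an order of magnitude.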
As an example of this, I removed two drives from the CS400 and observed that the array was 100% online, still serving data, and could even withstand another drive failure in the same shelf before it would have to take any corrective action to preserve the data. Cool huh?
I hope you found this blog post useful. Next up Rich Fenton is back discussing the ability to use other SSL certificates for the Nimble Web UI!