"Catastrophic IT Failure of a Lifetime"

I recently heard an interesting story from an IT manager at one of our largest financial industry clients. Since this firm manages ‘significant assets for significant clients’, it came as no surprise when he asked to remain anonymous. But I did want to share his experience in hopes of keeping others from having to live through a similar ‘catastrophic IT meltdown’. Here’s a recap of our conversation…

The IT manager said he has very clear flashbacks to the day of the failure, and he refers to it as the ‘crash of a lifetime.’ It was a Saturday morning when the financial firm’s Dell EqualLogic array went down. It started out as just a single drive failure, but after repeated controller failovers and failbacks, that troubled array succeeded in corrupting ALL of the firm’s VMware LUNs. He lost his entire virtual environment.

The IT manager said they could have recovered if the array had just ‘done its business’ and died, because they were replicating everything to another storage array at their backup site. But unfortunately, the failing array kept right on replicating. And that’s when things really started heading south.

Due to space limitations on the array, they retained only a two-hour window of snapshots at the datacenter that weren’t corrupt. By the time the IT department realized what had happened, they were well past that two-hour window. The normal back-out plan for an array failure was to fail over to the second site. But since replication had kept going, the corrupted data had already been copied over!

Dell then shipped out a replacement array, since they wanted the failed one back for troubleshooting. The IT manager spent the entire week recovering everything from their Symantec Backup Exec tape and disk backups onto the new array. Recapping the event in his words: “It was one of the most painful experiences in all my years of working in IT.”

Two weeks later, EqualLogic Support still had no idea why the controller failures had caused volume corruption. Even after many years of good experience with the platform, it took only one truly awful experience for him to lose confidence in the Dell solution. Dell even offered him new controllers for free, which he declined, stating, “If they couldn’t tell me what had happened, they certainly couldn’t reassure me it wouldn’t happen again.” It was a risk he was unwilling to take. That traumatic experience created the urgency to get Nimble gear installed quickly. We shipped out two Nimble arrays the next day, and they promptly moved all of their data off the EqualLogic arrays.

Obviously, you can never be too careful when it comes to your organization’s business-critical data. I give props to the IT manager for having a proactive mindset and putting data protection high on the list of data center priorities. In this case, the strategy was sound, but the underlying storage infrastructure clearly wasn’t up to the task. And what’s worse, the customer’s existing storage equipment kept chugging along in spite of this critical failure, faithfully replicating corrupted data to the backup site.

Aside from offering customers a robust and highly efficient means of data protection (via thin snapshots and replication), Nimble Storage gives its customers powerful, data sciences-based insight into overall storage health through InfoSight, which includes actionable dashboards on the state of data protection and customizable alerts on snapshot and replication events. And for critical situations, InfoSight can automatically identify and generate resolutions for some of storage’s trickiest issues, sparing customers lengthy downtime and the nightmare of trying to recover data from threats that would otherwise go undetected.