(This post first appeared on Recovery Monkey and is being cross-posted here)
This topic is very near and dear to me, and is one of the big reasons I came over to Nimble Storage.
I’ve always believed that storage systems should behave gracefully and predictably under pressure. Automatically. Even under complex and difficult situations.
It sounds like a simple request and it makes a whole lot of sense, but very few storage systems out there actually behave this way. This creates business challenges and increases risk and OpEx.
The simplest way to state the problem is that most storage systems can enter conditions where workloads can suffer from unfair and abrupt performance starvation under several circumstances.
OK, maybe that wasn’t the simplest way.
Consider the following scenarios:
- A huge sequential I/O job (backup, analytics, data loads etc.) happening in the middle of latency-sensitive transaction processing
- Heavy array-generated workloads (garbage collection, post-process dedupe, replication, big snapshot deletions etc.) happening at the same time as user I/O
- Failed drives
- Controller failover (due to an actual problem or simply a software update)
The last two scenarios (failed drives and controller failover) are more obvious - a well-behaved system will ensure high performance even during a drive failure (or three), and even after a controller fails over. For instance, if total system headroom is automatically kept at 50% for a dual-controller system (or, simplistically, 100/n percent, where n is the controller count for shared-everything architectures), performance should be fine even after a controller fails.
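The headroom arithmetic is simple enough to write down. This is just a sketch of the simplistic 100/n rule for a shared-everything design; the function names are made up for illustration and do not come from any array's software.

```python
def failover_headroom_pct(controllers: int) -> float:
    """Percent of total capacity to keep free so that the surviving
    controllers can absorb the load of one failed controller (100/n)."""
    if controllers < 2:
        raise ValueError("need at least two controllers to survive a failover")
    return 100.0 / controllers


def max_safe_utilization_pct(controllers: int) -> float:
    """Highest steady-state utilization that still leaves 100/n free."""
    return 100.0 - failover_headroom_pct(controllers)
```

For a dual-controller system this gives 50% headroom; for four controllers, the same rule gives 25% headroom, so utilization up to 75% still survives the loss of one controller.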
The first two scenarios (a huge sequential job and heavy array-generated workloads competing with user I/O) are a bit more complicated to deal with. Let’s look at this in more detail.
The Case of Competing Workloads During Hard Times
Inside every array, at any given moment, a balancing act occurs. Multiple things need to happen simultaneously.
Several user-generated workloads, for instance:
- File Services
Various internal array processes - they also are workloads, just array-generated, and often critical:
- Data reduction (dedupe, compression)
- Cleanup (object deletion, garbage collection)
- Data protection (integrity-related)
- Backups (snaps, replication)
If the system has enough headroom, all these things will happen without performance problems.
If the system runs out of headroom, most arrays struggle to prioritize what happens when.
The most common way a system may run out of headroom is the sudden appearance of a hostile “bully” workload. This is also called a “noisy neighbor”. Here’s an example of system behavior in the presence of a bully workload:
In this example, the latency-sensitive workload will greatly and unfairly suffer after the "noisy neighbor" suddenly appears. If the latency-sensitive workload is a mission-critical application, this could cause a serious business problem (slow processing of financial transactions, for instance).
This is an extremely common scenario. A lot of the time it’s not even a new workload. Often, an existing workload changes behavior (possibly due to an application change - for instance a patch or a modified SQL query, or a big garbage collection job on the array). This stuff happens.
How Some Vendors Have Tried to Fix the Issue with Manual “QoS”
As always, there is more than one way to skin a cat, if one is so inclined. Here are a couple of manual methods to fix workload contention:
- Some arrays have a simple IOPS or throughput limit that an administrator can manually adjust in order to fix a performance problem. This is an iterative and reactive method and hard to automate properly in real time. In addition, if the issue was caused by an internal array-generated workload, there is often no tooling available to throttle those processes.
- Other arrays insist on the user setting up minimum, maximum and burst IOPS values for every single volume in the system, upon volume creation. This assumes the user knows in advance what performance envelope is required, in detail, per volume. The reality is that almost nobody knows these things beforehand, and getting the numbers wrong can itself cause a huge problem with latencies. Most people just want to get on with their lives and have their stuff work without babysitting. This type of system also usually offers no tooling to throttle internal processes.
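To make the manual approach concrete, here is a minimal sketch of a static per-volume IOPS cap, written as a token bucket. This is a generic illustration and not any vendor's implementation; the class name `IopsLimiter` and the parameters `max_iops` and `burst_ios` are assumptions made up for this example. Time is passed in explicitly so the behavior is easy to reason about.

```python
class IopsLimiter:
    """Token bucket: tokens refill at max_iops per second, up to burst_ios."""

    def __init__(self, max_iops: float, burst_ios: float):
        self.rate = max_iops        # steady-state limit (hypothetical knob)
        self.capacity = burst_ios   # largest burst allowed (hypothetical knob)
        self.tokens = burst_ios     # start with a full bucket
        self.last = 0.0             # time of the previous call, in seconds

    def allow(self, now: float, ios: int = 1) -> bool:
        """Return True if `ios` I/Os may proceed at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, but never beyond the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= ios:
            self.tokens -= ios
            return True
        return False  # over the limit: the caller has to queue or retry
```

Note what the sketch does not do: it has no idea whether the array is actually under pressure, it only throttles the one volume it is attached to, and somebody still has to pick `max_iops` and `burst_ios` ahead of time and revisit them when the workload changes.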
Manual mechanisms for fixing the "bully" workload challenge result in systems that are hard to consume and complex to support while under performance pressure. Moreover, when a performance issue occurs, speed of resolution is critical. The issue needs to be resolved immediately, especially for latency-sensitive workloads. Manual methods will simply not be fast enough. Business will be impacted.
How Nimble Storage Fixed the Noisy Neighbor Issue
No cats were harmed in the process. Nimble engineers looked at the extensive telemetry in InfoSight, used data science, and neatly identified areas that could be massively automated in order to optimize system behavior under a wide variety of adverse conditions. Some of what was done:
- Highly advanced Fair Share disk scheduling (separate mechanisms that deal with different scenarios)
- Fair Share CPU scheduling
- Dynamic Weight Adjustment - automatically adjusts priorities in various ways under different resource contention conditions, so that the system can always complete critical tasks and not fall dangerously behind. This works much better than always giving every process an equal share, since it automatically handles several situations that plain fair sharing cannot.
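The general idea behind fair sharing with adjustable weights can be sketched in a few lines. To be clear, this is a textbook-style illustration (weighted max-min fairness plus a crude backlog-based weight boost), not Nimble's actual algorithm, which is not described here. All names, the `threshold` and `boost` values, and the workload numbers are assumptions for the example.

```python
def fair_share(capacity, demands, weights):
    """Split capacity by weight, never giving a workload more than it asks for.

    Capacity that a lightly loaded workload leaves unused is re-divided
    among the others (weighted max-min fairness). Weights must be positive.
    """
    alloc = {name: 0.0 for name in demands}
    active = {name for name, d in demands.items() if d > 0}
    remaining = float(capacity)
    while active and remaining > 0:
        total_weight = sum(weights[n] for n in active)
        share = {n: remaining * weights[n] / total_weight for n in active}
        # Workloads whose whole demand fits inside their share are fully served.
        satisfied = {n for n in active if demands[n] - alloc[n] <= share[n]}
        if not satisfied:
            # Everyone wants more than their share: hand out the shares and stop.
            for n in active:
                alloc[n] += share[n]
            break
        for n in satisfied:
            remaining -= demands[n] - alloc[n]
            alloc[n] = float(demands[n])
        active -= satisfied
    return alloc


def adjust_weights(weights, backlog, threshold, boost=2.0):
    """Raise the weight of any task whose backlog exceeds the threshold
    (hypothetical rule standing in for dynamic weight adjustment)."""
    return {n: w * boost if backlog.get(n, 0) > threshold else w
            for n, w in weights.items()}
```

With equal weights and 100 units of capacity, a transactional workload asking for 20 and a bully asking for 500 end up with 20 and 80: the bully only gets what is left over. If an internal task such as garbage collection then falls far behind, boosting its weight gives it a larger slice until it catches up, which a fixed equal split would never do.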
The end result is a system that:
- Lets system latency increase gracefully and progressively as load increases
- Carefully and automatically balances user and system workloads depending on actual conditions
- Meets I/O deadlines and supports preemption
- Eliminates the Noisy Neighbor problem without the need for any manual QoS adjustments
- Allows latency-sensitive small-block I/O to proceed without interference from bully workloads
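One generic way to picture deadlines and preemption is an earliest-deadline-first queue: every request gets a deadline when it arrives, and the request with the nearest deadline is dispatched next. The sketch below is only that, a generic illustration; the class name, the millisecond values and the idea of assigning short deadlines to small-block I/O are assumptions, and nothing here is taken from the actual Nimble scheduler.

```python
import heapq
from itertools import count


class DeadlineQueue:
    """Earliest-deadline-first dispatch queue (illustrative only)."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker: equal deadlines dispatch in arrival order

    def submit(self, name: str, arrival_ms: int, max_wait_ms: int) -> None:
        """Queue a request that should start within max_wait_ms of arrival."""
        deadline = arrival_ms + max_wait_ms
        heapq.heappush(self._heap, (deadline, next(self._seq), name))

    def dispatch(self):
        """Return the queued request with the nearest deadline, or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

If two large backup reads are queued with a 500 ms budget and a small transactional read arrives afterwards with a 5 ms budget, the transactional read is dispatched first even though it arrived last, and the backup reads still go out in their original order.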
What Should Nimble Customers Do to Get This Capability?
As is typical with Nimble systems and their impressive Ease of Consumption, nothing fancy needs to be done apart from simply upgrading to a specific release of the code (in this case 3.1 and up - 2.3 did some of the magic but 3.1 is the fully realized vision).
A bit anticlimactic, apologies… if you like complexity, watching this instead is probably more fun than juggling QoS manually.