Dimitrios Krekoukias

Lowering Risk Through Automated Headroom Management

Blog post created by Dimitrios Krekoukias on Feb 21, 2017

(Cross-posted at my Recovery Monkey blog; the version here is slightly more Nimble-specific.)

Before we begin: This is a vendor-neutral post. I realize there may be no architecture that can do everything I’m proposing, but some may come closer to what you need than others. Whether you’re a vendor or a customer, see this as a list of things you should be doing or asking for, respectively…

Headroom!

Headroom is a term that applies to almost all technologies, and it’s crucially important for all of them. Some examples:

  • Photography
  • Cars
  • Bridges
  • Storage arrays…

Why is Sufficient Headroom Important?

Maintaining sufficient headroom in any solution is a way to ensure safety and predictability of operation under most conditions (especially under unfavorable ones).


For instance, if the maximum load for an evenly loaded bridge before it collapses is X, the overall recommended load will be a fraction of that. But even the weight/length/axle count of a single truck on a bridge will also be subject to certain strict limits in order to avoid excessive localized stress on the structure.

Headroom in Storage Arrays

Apologies to the seasoned storage pros for all the foundational material, but it’s crucial to take this step by step.


It is important to note that headroom in arrays is not necessarily as simple as how busy the CPU is. Headroom is a multi-dimensional concept.


More factors than just CPU come into play, including how busy the underlying storage media are, how saturated the various buses are, and how much of the CPU is spent on true workload versus opportunistic tasks (which could be deferred). Not to mention that in some systems, certain tasks are single-threaded and can become the overall headroom bottleneck by maxing out a single CPU core while the rest of the CPU sits nearly idle (see Amdahl’s Law).
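As a rough illustration of the multi-dimensional point (the function and the dimensions measured are hypothetical, not any vendor’s actual formula), overall headroom is bounded by the most constrained dimension, not by average CPU alone:

```python
# Illustrative sketch: overall headroom is the minimum of the
# per-dimension headrooms. A system whose average CPU looks fine can
# still be out of headroom if one single-threaded task maxes out a core.

def overall_headroom_pct(dimensions):
    """dimensions: dict of {name: utilization_pct}.
    Per-dimension headroom is 100 - utilization; the system's overall
    headroom is the worst (minimum) of those."""
    return min(100 - util for util in dimensions.values())

system = {
    "cpu_average": 40,    # average across all cores looks comfortable...
    "busiest_core": 95,   # ...but one single-threaded task is nearly maxed
    "media": 60,
    "backend_bus": 30,
}

print(overall_headroom_pct(system))  # → 5: the hot core is the bottleneck
```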


Maintaining sufficient headroom in storage arrays is necessary in order to provide acceptable latency, especially in the event of high load during a controller failover. Depending on the underlying architecture of an array, different headroom approaches and calculations are necessary. Some examples of different architectures:


  • Active-Active controllers, per-controller pool
  • Active-Standby, single pool
  • Active-Active, single pool
  • Grid, single pool
  • Permutations thereof (it’s beyond the scope of this article to explore all possible options)


The single- vs. multiple-pool question complicates things a bit, and details like disk ownership are also hugely important. This isn’t an argument about which architecture is better (it depends anyway), but rather about headroom management in different architectures.

Dual-Controller Headroom

Dual-controller architectures need to be extremely careful with headroom management. After all, there are only two controllers in play. Here’s what sufficient headroom looks like in a dual-controller system:

[Image: HeadroomHA2]

Not much needs to be done to keep things healthy in a dual-controller architecture. In an Active-Standby system, the Standby controller is ready to take over immediately (this is the technique Nimble Storage uses). There is no danger in loading up the Active controller, aside from expected load-related latency (which, in a well-designed system, should also be managed automatically).


In an Active-Active HA system, maintaining a healthy amount of headroom has to be managed so that there is, overall, an entire controller’s worth of free headroom available.
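A minimal sketch of that invariant (the function name and percentages are illustrative assumptions): because the surviving controller must absorb both loads, the combined load must fit within one controller.

```python
def failover_safe(load_a_pct, load_b_pct):
    """In an Active-Active pair, a failover means one controller must
    absorb both workloads, so the combined load must not exceed one
    controller's worth (100%)."""
    return load_a_pct + load_b_pct <= 100

print(failover_safe(60, 40))   # → True: 50/50, 60/40, 70/30 splits all fit
print(failover_safe(90, 75))   # → False: 165% cannot fit on one controller
```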

Headroom in a Cluster of HA Pairs Architecture

There are several implementations that make use of a multiple HA Pair architecture. Often, the multiple HA pairs present a virtual pool to the outside world, even if, internally, there are multiple private pools. Some implementations just keep it to pools owned by each controller.


Here’s an example of healthy headroom in such a system:

[Image: HeadroomMultiHA2]

Even though there are multiple controllers (at least 4 total), in order to maintain an overall healthy system, a total of 100% headroom needs to be maintained in each HA pair, otherwise the performance of an underlying private pool (in green) might suffer, making the overall virtual pool performance (light blue) unpredictable.
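The same idea can be sketched for the virtual pool as a whole (illustrative only; the pair loads are hypothetical): the virtual pool is only as safe as its most loaded HA pair, not the cluster-wide average.

```python
def virtual_pool_headroom_pct(pair_loads):
    """pair_loads: list of (controller_a_pct, controller_b_pct) tuples,
    one per HA pair. Each pair must keep a full controller's worth of
    headroom; the virtual pool's effective headroom is the worst pair's."""
    return min(100 - (a + b) for a, b in pair_loads)

print(virtual_pool_headroom_pct([(30, 40), (20, 30)]))  # → 30: healthy
print(virtual_pool_headroom_pct([(60, 50), (20, 30)]))  # → -10: one pair overloaded
```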


In a Nimble Scale-out deployment, each HA engine would simply consist of a 100% headroom hot standby node, making this a very safe way to do scale-out:


[Image: Headroom_ScaleOutNimble]

Headroom in a True Grid Architecture

Grid (also known as shard) architectures spread the overall load among multiple nodes (often plain servers with some disks inside, connected via a network).


In such a scheme, the minimum overall headroom that needs to be maintained per node as a percentage is 100/N, where N is the number of nodes in the storage cluster.


So, in a 4-node cluster, 100/4=25% headroom per node needs to be maintained. This doesn’t account for the significant work that rebalancing after a node failure takes in such architectures, nor the capacity headroom needed, but it’s roughly accurate enough for our purposes.
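The 100/N formula can be sanity-checked with a quick calculation (the helper function is illustrative): with the minimum headroom in place, the survivors of a single node failure land exactly at full load.

```python
def min_headroom_per_node_pct(n_nodes):
    """Minimum per-node headroom so the remaining N-1 nodes can absorb a
    failed node's load: 100/N percent (ignores rebalancing effort and
    capacity headroom, as noted above)."""
    return 100 / n_nodes

# Sanity check for a 4-node cluster: 25% headroom means each node runs at
# up to 75%; one node fails, and its 75% is spread over the 3 survivors
# (25% extra each), putting the survivors exactly at 100%.
n = 4
load = 100 - min_headroom_per_node_pct(n)
print(load + load / (n - 1))   # → 100.0
```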


Schematically:

[Image: HeadroomGrid2]

How Headroom is Managed is Crucial

In order to manage headroom, four things need to be able to happen first:

  1. Be able to calculate headroom
  2. Be able to throttle workloads
  3. Be able to prioritize between types of workload
  4. Be able to move workloads around (architecture-dependent).

The only architecture that inherently makes this a bit easier is Active-Standby since there is always a controller waiting to take over if anything bad happens, and nothing can make that controller busy. But even with a single active controller, headroom needs to be managed in order to avoid bad latency conditions during normal operation (see here for an example approach). Remember, headroom is a multi-dimensional thing.
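To make the throttle-and-prioritize idea concrete, here is a hypothetical sketch (the data structure, priorities, and the min_share floor are all assumptions, not any real array’s algorithm): the least critical workloads are throttled first, until the headroom floor is restored.

```python
def enforce_headroom(workloads, floor_pct=25, min_share=0.5):
    """workloads: list of dicts with 'name', 'priority' (higher = more
    critical), and 'load_pct'. Throttle the least critical workloads
    first, each down to at most min_share of its current load, until
    total headroom reaches floor_pct."""
    for wl in sorted(workloads, key=lambda w: w["priority"]):
        used = sum(w["load_pct"] for w in workloads)
        if 100 - used >= floor_pct:
            break  # headroom floor restored; stop throttling
        deficit = floor_pct - (100 - used)
        reducible = wl["load_pct"] * (1 - min_share)
        wl["load_pct"] -= min(deficit, reducible)
    return workloads

demo = [
    {"name": "backup", "priority": 1, "load_pct": 40},  # opportunistic
    {"name": "oltp",   "priority": 3, "load_pct": 50},  # latency-critical
]
enforce_headroom(demo)
print(demo)  # backup throttled 40 → 25; oltp untouched; headroom back to 25%
```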

Example Problem Case: Imbalanced & Overloaded Controllers

Consider the following scenario: an Active-Active system has both controllers heavily loaded, one far more than the other:

[Image: Headroom Imbalanced2]

Clearly, there are a few problems with this picture:

  1. It may be impossible to fail over in the event of a controller failure (the total load is 165% of a single controller’s capacity)
  2. The first controller may already be experiencing latency issues
  3. Why was the system allowed to even get to this point in the first place?

This is a commonplace occurrence, unfortunately.

Automation is Key in Managing Headroom

The biggest problem in our example is actually the last point: Why was the system allowed to get to that state to begin with?


Even if a system is able to calculate headroom, throttle workloads and move workloads around, if nothing is done automatically to prevent problems, it’s extremely easy for users to get into the problem situation depicted above. I’ve seen it affect critical production systems far too many times for comfort.

Manual QoS is Not The Best Answer

Being able to manually throttle workloads can obviously help in such a situation. The problems with the manual QoS approach are outlined in another article, but, in summary, most users simply have absolutely no idea what the actual limits should be (nor should they be expected to). Most importantly, placing QoS limits up front doesn’t result in balanced controllers in multi-node systems… and may even result in other kinds of performance problems.


Of course, using QoS limits reactively is not going to prevent the problems from occurring in the first place. Some companies offer Data Classification as a Professional Services engagement, in order to try to figure out an IOPS/TB/application metric. Even if that is done, it, again, doesn’t result in balanced controllers… and it’s not very useful in dynamic environments. It’s used more as a guideline for setting up manual QoS.

Automation Mechanisms to Consider for Managing Headroom

Clearly, pervasive automation is needed in order to keep headroom at safe levels. I will split up the proposed mechanisms per architecture. There is some common functionality needed regardless of architecture:

Common Automation Needed

Every architecture needs to have the ability to automatically achieve the following, without user intervention at any point:


  1. Conserve headroom per controller
  2. Differentiate between different kinds of user workloads
  3. Differentiate between different kinds of system workloads
  4. Automatically prioritize between different workloads, especially under pressure
  5. Automatically throttle different kinds of workloads, especially under pressure

And now for the extra automation needed per architecture:

Active-Standby Automation

If in a single HA pair, nothing else is needed. If in a scale-out cluster of Active-Standby pairs:


  1. Automatically balance capacity and headroom utilization between HA pairs even if they’re different types
  2. Be able to auto-migrate workloads to other cluster nodes (if using multiple pools instead of one)
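A hypothetical placement sketch for the balancing point above (the field names and IOPS figures are assumptions): new workloads go to the HA pair with the most absolute performance headroom, which naturally accounts for pairs of different hardware types.

```python
def place_workload(pairs, demand):
    """pairs: list of dicts with 'name', 'perf_capacity' (work the pair's
    active controller can sustain; pairs may be different models), and
    'perf_used'. Place new work on the pair with the most absolute
    headroom, so mixed-model clusters balance by capability, not count."""
    best = max(pairs, key=lambda p: p["perf_capacity"] - p["perf_used"])
    best["perf_used"] += demand
    return best["name"]

pairs = [
    {"name": "pair-1", "perf_capacity": 100_000, "perf_used": 60_000},
    {"name": "pair-2", "perf_capacity": 200_000, "perf_used": 120_000},
]
print(place_workload(pairs, 20_000))  # → pair-2 (80k free vs. pair-1's 40k)
```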

Active-Active Automation

  1. Automatically conserve one node’s worth of headroom across the HA pair (50/50, 60/40, 70/30 – all are OK)
  2. When provisioning new workloads, auto-balance them by performance and capacity across the nodes
  3. Be able to balance by auto-migrating workloads to the other node (if using multiple pools instead of one)

Active-Active with Multiple HA Pairs Automation

  1. Automatically conserve one node’s worth of headroom per HA pair
  2. Be able to auto-migrate workloads to any other node
  3. Automatically balance workloads and capacity utilization in the underlying per-HA pools

Grid Automation

  1. Automatically conserve at least one node’s worth of headroom across the grid
  2. Automatically conserve enough capacity to be able to lose one node, rebalance, and have enough capacity left to lose another one (the more cautious may want the capability to lose 2-3 nodes simultaneously)
  3. Automatically take into account grid size and rebalancing effort in order to conserve the right amount of headroom
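Point 2 above can be sketched as a simple capacity floor (illustrative only; it ignores rebalancing time and any metadata overhead): all data must still fit on the nodes that survive.

```python
def max_usable_capacity_pct(n_nodes, tolerate_failures=1):
    """Maximum usable capacity (as % of total raw cluster capacity) so the
    grid can lose 'tolerate_failures' nodes, rebalance, and still hold all
    of its data on the surviving nodes."""
    surviving = n_nodes - tolerate_failures
    return 100 * surviving / n_nodes

print(max_usable_capacity_pct(8, 1))   # → 87.5: fill at most 87.5% of an 8-node grid
print(max_usable_capacity_pct(8, 2))   # → 75.0: the more cautious 2-failure stance
```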

In Closing…

If you’re a consumer of storage systems, always run your storage with sufficient headroom to sustain a major failure without overly affecting your performance.

In addition, when looking to refresh your storage system, always ask the vendors about how they automate their headroom management.

Finally, if any vendor is quoting you performance numbers, always ask them how much headroom is left on the array at that performance level… (in addition to the extra questions about read/write percentages and latencies you should be asking already).

The answer may surprise you.

D
