In the latest instalment of the NimbleOS 4 - A detailed exploration blog series, I will be introducing you to the Quality of Service feature.
Quite often, you will see a NimbleOS feature that has developed through several releases. Quality of Service (QoS) is precisely one such feature!
Every storage array has shared resources, and governing access to those shared resources is critical to ensure no single workload is able to consume or 'hog' them to the detriment of all other applications. If we consider a Nimble controller, the following are shared resources:
- Backend throughput (sequential access to the media - Flash or Disk)
- Cache (applicable only in an Adaptive array)
For the last several NimbleOS releases, our Engineering team has been implementing fencing algorithms that stop any one resource from being consumed by one workload. For example, in 2.2 we introduced CPU fair scheduling, in 2.3 we introduced Volume Pinning (Adaptive-only) and Disk (Bandwidth) fair scheduling, and more recently in NimbleOS 3 we introduced QoS-Auto, also known as Noisy Neighbour Avoidance, which is detailed in Dimitris Krekoukias' excellent blog: The Well-Behaved Storage System: Automatic Noisy Neighbor Avoidance. Of course, you cannot directly 'manage' any of these features, as they are all functionality that resides within NimbleOS to ensure the array (and its associated services) is self-healing and no single workload is being starved. One of the simplest forms of management is something that requires zero management! You could also argue that with some of the Nimble arrays, the performance is so over-provisioned that managing quality of service or governance is seldom required.
The Use Cases for Quality of Service
Most implementations of QoS prioritise or govern access to a given resource that is 'topped out', in order to decide who gets priority when things are busy. Strictly speaking, this is Class of Service. QoS-Limits is quite different, as it sets and maintains a utilisation limit regardless of the available resources. Generically, I see three use cases where implementing Quality of Service is highly desirable:
Providing Only The Performance That is Needed (or Purchased)
This requirement is incredibly common in the Service Provider landscape, where there is a customer/tenant/application and a desire to restrict that application to a prescribed performance level based on the requirements or the service level that has been purchased. Note: Noisy Neighbour Avoidance wouldn't address this use case. If the performance was available then it would be honoured, allowing the application or tenant to exceed their prescribed amount.
Limiting performance in this case allows the array to restrict performance to the desired level (regardless of what is available to the array) and also gives the hosting provider the ability to offer a 'bursting' service, where more performance can optionally be made available for a period of time. Fundamentally, it provides control to limit a user to a prescribed level of performance.
Consistent Service Levels
Consider an environment where a brand new controller has been deployed. The first application is deployed and has free rein to use all the resources available. On day 1 the workload experience is fantastic, as the workload has unrivalled access, but later, as more and more applications are deployed, the first application is competing with all the other applications for fair usage of the resources. The perception of the first application owner is that performance has gradually worsened, but in reality the application simply no longer has dedicated, sole use of the array. Again, QoS would assist here by limiting the workload from day one, ensuring the same level of performance is maintained regardless of what other workloads are hosted on the controller.
Service Introduction or 'Fear of the Unknown'
Quite often a user may not know how busy a specific workload may be, or the impact of a change. Of course, QoS-Auto helps here as it ensures no one workload starves the others once it is introduced. However, QoS-Limits in this instance allows the admin to once again control the resource and set limits on what can be consumed by the application, providing a fence for introducing new services in a staged and safe manner. As applying a QoS limit is dynamic, infrastructure admins can increase the performance as and when required once the service has been introduced.
What does QoS limit?
QoS-Limit allows a user to limit either the IOPS or the MB/s performance of a specific workload. Having the ability to limit both IOPS and MB/s is important, as quite often a single workload will have different peaks and troughs during the operational day. For instance, an OLTP workload may be very sensitive to small block updates during the working day, when rows and tables are frequently being accessed or updated (this will tend to be very IOPS/latency sensitive). In the evening the same database may be receiving feeds from other systems (or performing bulk updates, analysis or index rebuilds), at which point the application will cease to be IOPS sensitive and will instead be bandwidth (MB/s) sensitive. In NimbleOS 4 a user can limit a workload by either IOPS or MB/s, or specify limits for both. If either limit is reached then the volume will be restricted accordingly.
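To make the "whichever limit is reached first" behaviour concrete, here is a minimal sketch of the arithmetic for a workload issuing IO at a fixed block size. It is an illustration only, not how NimbleOS computes anything internally: the function name and parameters are hypothetical, and it assumes MB/s means 1024 x 1024 bytes per second.

```python
MIB = 1024 * 1024  # assumption: "MB/s" is treated as MiB/s in this sketch


def effective_limits(block_size_bytes, iops_limit=None, mbps_limit=None):
    """Return the (iops, mbps) ceiling for a fixed block size.

    A limit of None means "unlimited". Whichever limit translates
    into the lower IOPS figure is the one that binds.
    """
    candidates = []
    if iops_limit is not None:
        candidates.append(float(iops_limit))
    if mbps_limit is not None:
        # Convert the bandwidth limit into an equivalent IOPS figure.
        candidates.append(mbps_limit * MIB / block_size_bytes)
    if not candidates:
        return None, None
    iops = min(candidates)
    return iops, iops * block_size_bytes / MIB


# Daytime OLTP: 8 KiB blocks, so the IOPS limit binds.
day = effective_limits(8 * 1024, iops_limit=5000, mbps_limit=100)
# Evening bulk load: 256 KiB blocks, so the MB/s limit binds.
night = effective_limits(256 * 1024, iops_limit=5000, mbps_limit=100)
```

With the same pair of limits, the small-block daytime workload is capped by IOPS while the large-block evening workload is capped by bandwidth, which is why being able to set both is useful.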
QoS-Limits are applied completely dynamically: as soon as a limit is set or unset, it is enforced or lifted accordingly.
At what level of granularity can QoS-Limit be applied?
QoS-Limits can be set on either an individual volume or a Folder (which is a collection of volumes). The concept of Folders was formally introduced in NimbleOS 3, and you can read more about them in Nick Dyer's blog: NimbleOS 3 - Introduction To Folders. Essentially, a Folder could represent several volumes that make up an app (or an environment), or define a tenant or internal customer.
What is the impact of setting QoS-Limits?
Setting QoS-Limits clearly has the potential to limit IOPS/MB/s performance on the volume/folder; that is, after all, the nature of the feature. It essentially limits the performance by applying a delay to the IO, so that only the set number of IOs are serviced, in accordance with a virtual clock that is maintained with each object in NimbleOS. If a QoS-Limit is in place and the Folder/Volume is exceeding its limited level of resources, then a delay is introduced to slow the volume down and limit its performance. A simple analogy is a motorway with several lanes. The motorway is clear and a car wishes to travel quickly down it to its destination. However, the car has been restricted by cruise control, which limits the accelerator to a pre-determined speed. The conditions exist to allow the car to go faster, but in this instance it is being artificially limited by the cruise control. The side effect of this is that latency will clearly increase (in the same way it will take the car in the analogy longer to reach its destination). So don't be surprised to see your latency increase if you set an IOPS QoS-Limit!
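The delay-based approach described above can be sketched with a simple per-object virtual clock. This is a generic illustration of the technique, assuming each IO "costs" 1/limit seconds of virtual time. It is not the NimbleOS implementation, and the class and method names are hypothetical.

```python
class VirtualClockLimiter:
    """Toy IOPS limiter: delays IOs so that at most `iops_limit`
    of them start per second (hypothetical sketch)."""

    def __init__(self, iops_limit):
        self.interval = 1.0 / iops_limit  # virtual time consumed per IO
        self.virtual_time = 0.0           # earliest start time of the next IO

    def delay_for(self, arrival_time):
        """Return how long an IO arriving at `arrival_time` must wait."""
        # An IO may start no earlier than the virtual clock allows.
        start = max(arrival_time, self.virtual_time)
        # Advance the clock so the next IO is spaced one interval later.
        self.virtual_time = start + self.interval
        return start - arrival_time


limiter = VirtualClockLimiter(iops_limit=4)
# Four IOs arrive together at t=0: each waits longer than the last.
burst_delays = [limiter.delay_for(0.0) for _ in range(4)]
# An IO arriving after the volume has gone quiet is not delayed.
idle_delay = limiter.delay_for(5.0)
```

In this sketch a burst that exceeds the limit sees steadily growing waits (0, 0.25, 0.5 and 0.75 seconds at a limit of 4 IOPS), which is the latency increase the cruise control analogy describes, while IO arriving under the limit passes straight through.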
How do I set QoS-Limit?
QoS-Limits can be set in several ways:
- NimbleOS GUI
- Command Line Interface
- Scripted via API
- VMware vSphere Plugin and via VVOLS Policy
The NimbleOS GUI in 4.x has been redesigned, and there is a great blog by Craig Sullivan which details the new GUI (NimbleOS 4 - Next Generation HTML 5 GUI). Essentially, if you go to the Create Volume workflow (or the Edit Volume workflow) you will see the ability to set QoS-Limit on the volume in the Performance tab. Here is an example:
The same can be accomplished via the Command Line Interface by setting the volume limit and then returning the volume QoS-Limit to unrestricted using the following command:
and finally within vCenter if you're creating/editing a Datastore using the vCenter plugin:
or if you're using Storage Policy Based Management with VVOLS. In order to access this, from vCenter click on Home > VM Storage Policies > Select or Create your policy > Edit Settings > Rule-Set:
As mentioned above, you can also set QoS-Limit using the API; full documentation can be found on InfoSight. You will find APIs on the Volume and Folder objects, and I will be posting a sample API script to set QoS on a volume later in the series.
How do I know QoS-Limit is set?
The array performance graphs will show you when QoS-Limit is set. We wanted to show a visual representation so that anyone looking at performance can see at a glance that QoS-Limit might be at play. When QoS-Limit is set on a volume you will see an orange dashed line that shows the level QoS is set to. Here is an example, where I have just set both an IOPS and an MB/s QoS-Limit policy on the volume:
The orange lines represent the QoS-Limit setting and one can see how the performance has dropped to meet/enforce that setting.
What's the right level of QoS to set?
Finally, a common question is: what should I set my volumes to? What is a good value to use? At Nimble, we always want to give you good guidance on when and how to use a feature. Fortunately, the array and InfoSight allow us to be much more predictive than recommending you set and tweak over time!
Firstly, the array itself will look at the past 24 hours of performance and give you guidance on what the max and peak IOPS and MB/s have been for that object. You can see this when you go and set QoS-Limit on the volume:
We expect our data scientists to publish data soon on what sorts of IO levels we generally see per GiB for typical workloads hosted on Nimble. You've got to love the power of InfoSight, the telemetry it produces and the insights into the install base it provides.
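If you want to reason about a starting value yourself, the 24-hour look-back idea described above can be sketched as follows. This is purely illustrative: the function name, the sample data and the `headroom` multiplier are hypothetical, and this is not how the array derives its guidance.

```python
from statistics import mean


def lookback_guidance(samples, headroom=1.25):
    """Summarise a window of IOPS (or MB/s) samples.

    `headroom` is a hypothetical safety margin applied to the observed
    peak to produce a candidate starting limit.
    """
    if not samples:
        raise ValueError("need at least one sample")
    peak = max(samples)
    return {
        "average": mean(samples),
        "peak": peak,
        "candidate_limit": peak * headroom,
    }


# Made-up IOPS samples standing in for a 24-hour window.
guidance = lookback_guidance([1000, 1500, 4000, 1500])
```

Setting a limit somewhere above the observed peak leaves normal behaviour untouched and only caps unexpected spikes. Setting it below the peak deliberately trades latency for predictability.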
What's the License Cost or Overhead for this feature?
Come on, you know better than to ask this question. As with all NimbleOS features, this feature is free to use once you have upgraded to NimbleOS 4.x. There is no performance overhead on the controller from using this feature (although limiting performance on a volume or a folder will, of course, potentially impact the performance of that volume!).
Where can I see this in action?
I have posted two videos demonstrating Nimble QoS-Limit here:
Nimble QOS using the GUI
Nimble QOS using the CLI
Please post any comments, questions or queries below!