With this week’s announcement of Nimble’s new All Flash arrays, the topic of data reduction becomes more important than ever, in particular data deduplication, or dedupe.
For someone unfamiliar with data storage technology, dedupe might appear to be a very simple feature, something your array either has or doesn’t have. But from an engineer’s perspective, there’s a lot more going on under the hood – and these technical differences contribute a huge amount of business value.
We started with a blank slate, but with certain absolute requirements for the dedupe capability:
- inline dedupe;
- high performance;
- high capacity;
- ability to opt out.
In order to achieve these objectives, our engineering teams made significant innovations in a few areas, in particular:
- Nimble’s unique data path;
- scalable indexing;
- variable block dedupe;
- advanced garbage collection.
Let’s look first at the requirements set.
When it comes to removing duplicates, there are two broad approaches. The first is to dedupe the data before it lands on the disk; this is inline dedupe. Alternatively, you can write the data as-is and use a background operation to remove duplicates; this is post-process dedupe. These approaches are not mutually exclusive.
Nimble’s Cache Accelerated Sequential Layout (CASL) architecture uses inline compression, and we wanted to use inline dedupe as well, because it’s especially important for flash-only systems. Post-process dedupe requires provisioning more space than the deduplicated data ultimately needs. In applications like VDI (Virtual Desktop Infrastructure), this over-provisioning can be as high as 5x, which would bloat the cost of a flash-only system significantly. Further, post-process dedupe increases write amplification on the flash chips because it writes more data, sometimes five times more, than is necessary.
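The provisioning argument can be checked with simple arithmetic. The sketch below is illustrative only: the function name `flash_needed_tb` and the 50 TB figure are made up for the example, and it assumes the worst case in which all incoming data lands on flash before the background pass runs.

```python
def flash_needed_tb(logical_tb, dedupe_ratio, inline):
    """Peak flash capacity (TB) needed to absorb `logical_tb` of incoming data."""
    if inline:
        # Duplicates are removed before they reach flash.
        return logical_tb / dedupe_ratio
    # Post-process: everything lands first and is cleaned up later.
    return logical_tb


logical_tb = 50.0  # hypothetical VDI data set
inline_tb = flash_needed_tb(logical_tb, dedupe_ratio=5.0, inline=True)
post_process_tb = flash_needed_tb(logical_tb, dedupe_ratio=5.0, inline=False)
overprovisioning = post_process_tb / inline_tb
```

With a 5:1 dedupe ratio, the post-process case needs five times the flash of the inline case, and it writes five times as much data to the flash chips.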
The flip side of inline dedupe is that it affects write performance, no matter how it’s done. It has to look for duplicates, which involves CPU-bound operations such as metadata lookup and data/metadata comparisons. The write performance of CASL has always been a function of available CPU – yes, even in Adaptive Flash systems, as we always write sequentially. This means that any additional CPU work could potentially affect the system’s maximum write IOPS.
But write IOPS cannot be compromised, because real-life workloads are largely write-dominated. We know this because we have data from a diverse set of more than 7,500 customers, as described in the blog post “InfoSight insights into real life workloads.” So even with dedupe, we had to keep the bar high on write performance without compromising anything on read performance.
Another dimension of performance is sustained performance, which is measured as the system ages and fills up. In a log-structured file system like CASL, the process of garbage collection (GC) is pivotal for maintaining high sustained performance. CASL’s garbage collection has always been lean and efficient, and it was important to keep it that way.
Efficient Memory Organization
Apart from the CPU, dedupe can also place substantial demands on memory. In-memory metadata organization is critical for achieving high dedupe efficiency and performance. Some storage systems assume that metadata will be pinned in memory. This is easier to implement from the system designer’s point of view, but inevitably becomes costly and unwieldy, and difficult to scale in capacity.
We wanted an extremely efficient memory scheme, so that the design could scale to petabytes of storage per array while being able to address terabytes of working set data from memory.
Despite our goal of making inline dedupe extremely fast and lightweight, there will always be some applications or customer requirements where the cost / benefit analysis does not favor allocating any resources to dedupe. In such cases, it doesn’t make sense to force users to dedicate CPU and memory to dedupe. So our aim was to give users as much insight as possible into savings by application type, and let them opt out of dedupe in such situations.
With this design vision as context, let’s look at what goes on under the hood and why it matters.
The Nimble I/O Path
For data storage systems, the I/O path is first and foremost. The picture below depicts the write I/O path. For reads, very little has changed from our original flash-plus-disk hybrid arrays. Reads are practically unaware of whether the block is unique or deduped.
Our inline dedupe engine sits directly in the data path from NVRAM (non-volatile RAM) to the data drives. Blocks are deduped before they are compressed, and only then are they written to the data drives, always in full-stripe writes. Notice that both dedupe and compression work with variable sizes. Compression has always worked that way, compressing each block to a variable number of bytes; dedupe is now adaptive too, because it does not measure all volumes with the same yardstick.
Duplicate detection uses SHA-256, the 256-bit version of the SHA-2 cryptographic hash function. This makes deduplication extremely reliable while keeping the cost of computation and comparison very low, which saves precious CPU cycles.
Even though dedupe is inline, it has no impact on write latency, because writes are always acknowledged from NVRAM, which is orders of magnitude faster than the SAS- or NVMe-attached SSDs used by other designs.
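To make the fingerprinting idea concrete, here is a minimal sketch of hash-based duplicate detection using SHA-256 from Python’s standard `hashlib` module. The `FingerprintIndex` class, its in-memory dictionary, and the 4 KiB block size are assumptions made for illustration; they do not describe Nimble’s actual implementation.

```python
import hashlib


class FingerprintIndex:
    """Toy in-memory dedupe index keyed by SHA-256 fingerprints (hypothetical)."""

    def __init__(self):
        self._by_digest = {}  # 32-byte digest -> id of the stored copy
        self._next_id = 0

    def write(self, block: bytes):
        """Return (block_id, is_duplicate) for an incoming block."""
        digest = hashlib.sha256(block).digest()
        existing = self._by_digest.get(digest)
        if existing is not None:
            return existing, True  # share the block that is already stored
        block_id = self._next_id
        self._next_id += 1
        self._by_digest[digest] = block_id
        return block_id, False


index = FingerprintIndex()
first = index.write(b"A" * 4096)
second = index.write(b"B" * 4096)
third = index.write(b"A" * 4096)  # same content as the first block
```

Here the third write is reported as a duplicate of the first, so only two blocks would need to be stored. Comparing fixed-size 32-byte digests is what keeps the lookup cheap regardless of block size.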
Efficient, Scalable Indices
We have incorporated several intelligent, patent-pending techniques to support petabytes of storage capacity while not losing the edge on performance.
For example, there are heuristics that exploit the “flocking” nature of duplicate blocks. In practice, duplicate blocks tend to be found in colonies – duplicate files, email attachments, or VDI images tend to be larger than a block. This gets exploited in Nimble dedupe by using locality-aware indices that make it very easy and efficient to detect a flock.
On the other hand, some of the indices exploit the fact that it is much more efficient to rule out a block that has no duplicate than to identify a duplicate.
Further, as desired, the indices are not pinned in memory but can overflow to SSDs to support petabytes of resident storage. At the same time, algorithms make sure that the index entries for terabytes of working set data stay in memory.
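The post does not say which data structure performs this early elimination. One well-known technique with this property is a Bloom filter, which can say “definitely not seen before” without consulting the full index. The sketch below is a generic Bloom filter offered as an illustration, not Nimble’s method; the class and its parameters are assumptions.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=8192, num_hashes=4):
        # num_hashes must be at most 8 here: each position uses 4 of the
        # 32 bytes in a SHA-256 digest.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive the bit positions from slices of a single SHA-256 digest.
        digest = hashlib.sha256(key).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


seen = BloomFilter()
seen.add(b"block-1")
known_block_passes = seen.might_contain(b"block-1")
```

If `might_contain` returns `False`, the block is certainly unique and the expensive index lookup can be skipped. A `True` result only means the block might be a duplicate and still has to be confirmed against the real index.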
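One way to picture an index that is not pinned in memory is a bounded in-memory cache backed by a larger, slower tier. The sketch below uses a least-recently-used (LRU) policy; the `TieredIndex` class, the LRU choice, and the plain dictionary standing in for SSD are all assumptions for illustration, since the post does not describe the actual algorithms.

```python
from collections import OrderedDict


class TieredIndex:
    """Hot entries stay in a bounded in-memory LRU; cold ones spill over."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = OrderedDict()  # stands in for DRAM
        self.overflow = {}           # stands in for an SSD-resident index

    def put(self, fingerprint, location):
        self.overflow.pop(fingerprint, None)
        self.memory[fingerprint] = location
        self.memory.move_to_end(fingerprint)  # mark as most recently used
        self._evict()

    def get(self, fingerprint):
        if fingerprint in self.memory:
            self.memory.move_to_end(fingerprint)
            return self.memory[fingerprint]
        if fingerprint in self.overflow:
            # Promote on access so the working set migrates into memory.
            location = self.overflow.pop(fingerprint)
            self.put(fingerprint, location)
            return location
        return None

    def _evict(self):
        while len(self.memory) > self.memory_capacity:
            cold_key, cold_value = self.memory.popitem(last=False)
            self.overflow[cold_key] = cold_value


idx = TieredIndex(memory_capacity=2)
idx.put("a", 1)
idx.put("b", 2)
idx.put("c", 3)           # "a" is the coldest entry and spills to overflow
promoted = idx.get("a")   # brings "a" back into memory, pushing "b" out
```

Total index size is bounded by the overflow tier rather than by memory, while recently used entries are still served from memory.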
Variable Block Dedupe
In today’s data centers, workload consolidation requires that storage systems be extremely flexible. So it doesn’t make sense to measure all apps with the same yardstick (i.e., the dedupe unit size), or to put all apps in the same block-sharing domain.
For instance, why would you want a Microsoft Exchange application using a 32K block size doing the same work and taking up the same memory as a VDI application using 4K blocks? And why would you want your database applications sharing blocks with an email application when that hurts more than it helps?
Here’s an analogy: isn’t it harder to find a matching sock in a pile of your whole family’s socks than in a pile of just your own? The bigger, unsorted pile takes more time to get the same work done. Similarly, mixing data blocks from unrelated applications causes unnecessary CPU and memory churn.
So we decided to make deduplication unit sizes variable and app-aware. Nimble dedupe uses a unit size that fits the application, which maximizes metadata efficiency, provides high performance, and helps maintain high dedupe effectiveness. Block sharing happens within an application domain.
As with other features, customers can bypass this one and use a single global application policy for all data sets (like other, less intelligent AFAs). For most customers, though, having this feature enabled by default means they’ll experience faster performance with zero management overhead.
Beyond that, Nimble lets customers leave dedupe disabled for volumes or applications that do not benefit from it. Better still, this decision is not sticky and can be toggled at any time after a volume is created.
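The ideas in this section (a per-application unit size, block sharing confined to an application domain, and a per-volume opt-out) can be sketched together in a few lines. Everything here is hypothetical: the `DedupeDomain` class, the policy names, and the 8K database unit size are invented for the example, while the 4K and 32K sizes echo the figures above.

```python
import hashlib


class DedupeDomain:
    """One block-sharing domain with its own dedupe unit size (hypothetical)."""

    def __init__(self, unit_size, enabled=True):
        self.unit_size = unit_size
        self.enabled = enabled      # can be flipped at any time
        self.fingerprints = set()   # fingerprints seen in this domain only

    def write(self, data: bytes):
        """Split data into units and return how many units were stored as new."""
        stored = 0
        for offset in range(0, len(data), self.unit_size):
            unit = data[offset:offset + self.unit_size]
            if not self.enabled:
                stored += 1         # dedupe is off: store every unit
                continue
            digest = hashlib.sha256(unit).digest()
            if digest not in self.fingerprints:
                self.fingerprints.add(digest)
                stored += 1
        return stored


# Assumed example policies, one domain per application type.
domains = {
    "vdi": DedupeDomain(unit_size=4 * 1024),
    "exchange": DedupeDomain(unit_size=32 * 1024),
    "database": DedupeDomain(unit_size=8 * 1024, enabled=False),
}

payload = b"x" * (64 * 1024)
vdi_stored = domains["vdi"].write(payload)            # 16 identical 4K units
exchange_stored = domains["exchange"].write(payload)  # 2 identical 32K units
database_stored = domains["database"].write(payload)  # dedupe off, 8 units
```

Because each domain keeps its own fingerprint set, a lookup for a VDI block never has to search through Exchange or database fingerprints, which is the sock-pile argument in code form.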
Lightweight Garbage Collection
CASL uses the process of garbage collection for reclaiming dead space. Currently, most of our deployed Adaptive Flash (hybrid) arrays run with greater than 70% space utilization, with GC running all the time. But no one ever notices because it is very lean and lightweight.
Think about it: if garbage collection is lean enough that systems with low-RPM spinning disks and relatively small controllers barely notice it, it’s going to be almost invisible on an All Flash array. The only additional complication is dedupe and the block sharing that comes with it.
Nimble dedupe is designed to play well with other functions such as garbage collection. In other systems, the higher the dedupe ratio, the greater the strain on GC and the slower the system. Nimble dedupe, by contrast, is designed with GC in mind and does not slow down. Efficient GC ultimately translates to high sustained performance as the system ages and fills up.
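To see why block sharing complicates garbage collection, consider that a deduped block can only be reclaimed once every logical reference to it is gone. Reference counting is one common way to track this; the sketch below illustrates that general idea and is not a description of how CASL’s GC actually works. The `BlockStore` class and block ids are made up for the example.

```python
class BlockStore:
    """Toy store in which deduped blocks are shared via reference counts."""

    def __init__(self):
        self.refcount = {}  # block id -> number of logical references

    def add_reference(self, block_id):
        self.refcount[block_id] = self.refcount.get(block_id, 0) + 1

    def drop_reference(self, block_id):
        self.refcount[block_id] -= 1

    def collect_garbage(self):
        """Reclaim blocks that nothing refers to; return their ids."""
        dead = sorted(b for b, count in self.refcount.items() if count == 0)
        for block_id in dead:
            del self.refcount[block_id]
        return dead


store = BlockStore()
store.add_reference(7)   # two volumes share block 7 after dedupe
store.add_reference(7)
store.add_reference(9)
store.drop_reference(7)  # one volume overwrites its copy
first_pass = store.collect_garbage()   # block 7 is still referenced
store.drop_reference(7)
store.drop_reference(9)
second_pass = store.collect_garbage()  # now both blocks are dead
```

The first pass reclaims nothing because block 7 still has a live reference; only the second pass frees blocks 7 and 9. The more sharing there is, the more of this bookkeeping GC has to consult, which is the strain described above.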
At this point, we’ve achieved pretty much everything in our original design vision. But we’ll close with one final capability that I believe highlights our obsessive focus on the customer.
Most storage arrays report the space savings from data reduction technologies as one overall number, so it’s unclear how much each technology contributed to the overall savings for each workload. To maximize transparency and help our customers make data-driven decisions, we improved the way space reduction is reported: compression, deduplication, and clone savings are shown separately. A customer can turn a feature off if its savings are minimal, freeing those resources for other work.
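As a rough illustration of separate reporting, the sketch below breaks an overall reduction ratio into per-technology ratios. The function name, the sample sizes, and the order in which clone savings are accounted for are assumptions; the post only states that dedupe runs before compression.

```python
def savings_report(logical, after_clones, after_dedupe, after_compression):
    """Return per-technology reduction ratios (all sizes in the same unit)."""
    return {
        "clones": logical / after_clones,
        "dedupe": after_clones / after_dedupe,
        "compression": after_dedupe / after_compression,
        "overall": logical / after_compression,
    }


# Hypothetical volume: 1200 GB logical, 100 GB physically stored.
report = savings_report(
    logical=1200, after_clones=600, after_dedupe=200, after_compression=100
)
```

In this made-up case the overall 12x reduction splits into 2x from clones, 3x from dedupe, and 2x from compression. A volume whose dedupe ratio stays close to 1.0 would be a candidate for turning dedupe off.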
I’m fortunate to work closely with Nimble people on both the engineering and product teams, and we’re all extremely excited about the innovations being delivered with these new dedupe capabilities, which we know will deliver exceptional business value to our customers.