5 Replies Latest reply: Jan 13, 2017 11:58 AM by Dimitris Krekoukias RSS

    How to improve low sequential read performance for large files?

    David Baril Wayfarer

      I was running some performance tests on a small CS/1000 with a well tuned Centos 7.2 system running under VMware, using an external iSCSI LUN from the Nimble CS/1000.


      I was very pleased with write performance of both small and large files (Nimble's forte), but was disappointed with the moderate levels of sequential read performance for large, multi-gigabyte files.


      For example, when writing a 10 GiB file to a tuned XFS file system, I am achieving over 550 MB/sec rates ... using uncompressable data. When I read the same file back (after flushing the Linux file cache buffers), I achieve only 145 to 180 MB/sec.


      I'm using Linux block-layer settings on the "sd" entries of max_io_kb of 1024, read_ahead_kb of 4096, nr_requests of 128, with a queue depth of 64.  The same settings are on the multipath pseudo device. 


      Monitoring the performance as the test is running, I can confirm that 1 MB reads are being done at the iSCSI layer.  The system uses a dual 10GbE NICs, with dual subnets into the two 10GbE ports on the Nimble storage.


      I understand that Nimble CASL architecture identifies large sequential reads and performs them directly from the disks.  However, with an 18+3 CASL layout with a total of 21 disks, I would expect a higher sequential read performance than 9-10 MB/sec per physical disk.


      The Nimble volume is using a 4kb Nimble block size, with a 4kb XFS page size, with proper alignment.


      What large file sequential read performance should be expected on a CS/1000 ?


      Thanks for your help.


      Dave B.

        • Re: How to improve low sequential read performance for large files?
          Nick Dyer Navigator



          Using tools such as IOMeter/IOStat with 256K block sizes, 32 outstanding I/Os and multiple CPU threads with a minimum of 4 iSCSI paths on dual 10Gb NICs we have observed the CS1000 to perform at around 1000MB/sec and writes being around 700MB/sec as an absolute maximum. Of course this is in a perfect environment so your mileage may vary.


          Sounds like something isn't configured correctly somewhere in the stack. Might be worth a call into Support (and your local Nimble SE) to see if they can assist.


          To give you an example I performed a similar test with a CS1000 with 8GB FC using IOMeter recently - and was able to stress it at 735MB/sec without any tuning whatsoever. I'm pretty sure I could get it higher if I spent the time tweaking various knobs. Mixed 50/50 read/write sequential performance was 585MB/sec read, 522MB/sec write for a total of 1107MB/sec.


          Screen Shot 2016-12-15 at 12.14.42.png

            • Re: How to improve low sequential read performance for large files?
              David Baril Wayfarer

              Hi Nick,

              Thank you for the information.  I read it with caution, because the comparisons are not really equivalent.  But, I really appreciate that you took the effort to reply.


              The comparison is really apples and oranges.  Your suggestion to call Nimble support is likely the best guidance.


              Iometer is a useful tool, but for many of us, running a zillion threads to different files with a 50/50 read/write mix is not the real world, especially for bulk "administrative" -kind of tasks.


              But as a comparison point, with it is a useful marker.


              Also, please note, that not all storage vendors have architectures that handle "large files" well, and these vendors are very successful in in the non-large-file market.


              There are a broad class of storage vendors that operate in the 12-15 MiB/sec per physical spindle tier of performance, so my 9-10 MB/sec is not that bad for a simple test.  With compressible data, that number could double. There are also other storage vendors that operate at a ~30 MiB/sec per spindle performance tier.


              Storage vendors that target the HPC and "big data" market typically operate at a ~60 MiB/sec per spindle performance tier, and the top-end for 7200 RPM disk performance (multi-stream sequential) is around 75-90 MiB/sec per spindle.  One of the gating factors to these higher levels of per-spindle performance is the IO size per disk.  To sustain multi-stream sequential performance at 60 MiB/sec/spindle, you have to ensure a minimum effective IO size of at least 1 Mbyte to each disk.  The fact that many storage vendor's architectures were limited to using smaller IO sizes was part of the justification to Hadoop-like software-defined architectures that ensure 1-4 MiB or larger IO sizes per spindle.  I have worked for HPC-class storage vendors, and also for HPC-class customers doing their own storage integration, so I understand the underlying issues.


              As a new Nimble user with significant storage experience, I am trying to understand how well the Nimble handles "large files" ... which admittedly is outside its primary market. The challenge with "large files" is that it effectively negates the benefits of the SSDs for the "data".  The SSDs are still very important for write caching, metadata, and non-large files.


              The significant differences in your described iometer test:

              1)     I am interested in a SINGLE large (10+ Gbyte) file, sequential access, read performance.  Given the CASL architecture, this kind of workload would bypass the SSDs and be serviced directly from the disks. This is also uni-directional performance, which stresses the unidirectional performance of the connections (other than a small amount of IO acknowledgements going the other way.  The only available IO optimizations are likely compression, read-ahead, how the IO is spread across multiple disks and the effective read IO size per spindle, the tightness and CPU efficiency of the IO stack, and how efficient the math algorithms are that are used to generate checksums, parity, and the like.  Nimble has a valuable feature of explicitly handling silent data corruption, which ultimately requires a computation and comparison of a checksum across the data ... for every read.


              The large 10 Gbyte file size, accessed in a single pass also minimizes the potential impact of staging the data into the SSD tier or the RAM cache.  Ultimately, you can't go any faster than the disks ... and the effectiveness of the fan-out parallelism to read multiple disks in parallel.  This one-to-many parallel expansion is a function of the Nimble firmware and the host read-ahead settings.  I was using the recommended Linux setting of a 4MB read ahead with 1 MB max_sectors_kb IO size.


              2) iSCSI over Ethernet/TCP.  iSCSI has significantly more "compute" overhead than fibre channel with a good controller. Properly configured, a 4 MB FC IO generates one host interrupt.  iSCSI requires several hundred at best. This allows a host IO stack to process higher throughput levels with less CPU overhead.  But the iSCSI infrastructure can be significantly less expensive. From personal experience, I can run at 98% of quad-channel FC on the host ... and do it into a single multipath IO stream (assuming the storage system(s) can keep up).  Getting dual 10GbE NICs to operate above ~ 900 Gbytes/sec (which is ~75%) is increasingly difficult, especially at low thread counts.



              BTW, I was testing an best-case scenario.  I verified that the 10 GiB file was physically contiguous in the file system, and it was written all at once on a relatively idle system.


              If I bypass the XFS file system, and do a sequential read of the LVM logical volume, I get similar performance.  I can also read at the LVM physical volume level, the multipath device level, and the block device "sd{xx}" level and performance is similar, so there is minimal additional overhead being introduced on this large-file sequential read.


              By going through the levels, however, I did discover at the LVM logical-volume level, the logical volume appears as a pseudo-disk, with disk-like attributes, such as "max_sectors_kb", read_ahead_kb, queue depth, number of requests, and others, but NOT the IO scheduler type.  Nimble documentation suggests settings for these class of attributes at the block-layer and multi-path layer, but not the logical-volume layer.  Nimble's recommended max_sectors_kb of 1024 for a 1MB IO size existed at the multipath layer, and the block layer, but the logical volume layer had the default value of 512 kb.  So the 1 MB IO was being decomposed to 2x512kb in the logical volume layer, and then coalesced back at the multipath layer.  Even so, the throughput was effectively the same, there was a bit more host CPU expended.


              I will call Nimble support, but from my testing, it appears that the Nimble CS/1000 uses physical disk management architecture that yields the typical ~15 MB/sec per spindle (as many other vendors do), which can be increased by the compression factor. The underlying physics would be an effective per-disk read IO size of 256kb, or 4 back-to-back 64kb reads before incurring a rotational latency penalty.


              Again, Nick, thank you for taking the time to contribute the information.


              The iometer multithreaded read/write results are interesting, but still illustrate a maximum ~500 MB/sec read performance of compressible data, which could be 250 GB/sec of uncompressible data. My single-threaded large file read test (with read ahead) of uncompressible data is approaching 200 MB/sec ... which is not that far off.  200-250 MB/sec for 21 disks is in the classic 10-15 MB/sec design class.


              Dave B

                • Re: How to improve low sequential read performance for large files?
                  Dimitris Krekoukias Newbie

                  David, out of curiosity, what is the application that needs to do a single threaded read of a large file? Can the read not be multithreaded instead?


                  Single threaded sequential is subject to things like queuing issues that can slow things down. Applications like DBs that need to do a massive table scan (which looks like sequential reads) still do this in a multithreaded fashion.


                  But maybe your app is single-threaded.





                    • Re: How to improve low sequential read performance for large files?
                      David Baril Wayfarer

                      Hi Dimitris,


                      Thank you for chiming in.


                      The primary purpose of my question about sequential read rate (with read-ahead), is that I have found it a good indication of he sequential and "large IO" capabilities of the storage system, and to help identify if there are mis-configurations in the IO stack.  Sequential IO of a large file is very easy to parallel-ize with system read ahead threads, without any fancy programming.  It also is often quite typical of administrative tasks working with large files.


                      So please let me clarify ... this is single-threaded at the application level, but potentially deeply threaded with kernel read-ahead threads.  Nimble's standard recommendation is to set the Linux block-layer "read_ahead_kb" setting to 4 MiB,  with a maximum IO size (max_sectors_kb) set to 1 MiB, as an example. In this case, when the Linux detected a sequential pattern, 4 x 1 MiB reads would be launched by kernel threads.  There are read-ahead and IO size attributes at the block, multipath pseudo-device, and LVM logical volume layers of the IO stack, independent of what other read ahead might be done at the file system layer.  XFS, the file system I was using does not itself have any additional read-ahead functionality, but other file systems that can run on Linux can.


                      In reality, when an simple input-process-output style application reads a large file sequentially at the application level using normal IO in a file system such as XFS, very quickly, a sequential access pattern is detected by Linux and read ahead is initiated by asynchronous kernel threads, while the original read request's data is returned to the application.  Ideally, the read-ahead kernel threads can read data from the storage system faster than the application can consume it.  In my simple test program, there is no processing of the data, so the application is basically measuring how fast the read-ahead can work in conjunction with the storage's capabilities.


                      As an aside, when I was working in storage systems that were designed focusing on "big data" applications and multi-petabyte file systems, a single threaded read application as described ran at 3,150 MB/sec on a system with 4x8gbit FC controllers. To sustain that read rate, a 6 x 8 MB = 48 MB or larger read-ahead was occurring, with each parallel 8MB IO running at about 600-650 MB/sec each, with the IO being multiplexed across the quad FC controllers using a well-tuned Linux dm-multipath.


                      Another example of a popular "single threaded" application is "rsync" .  Rsync is used to copy and synchronize changes across disparate source and targets.  It handles one file at a time, perhaps overlapping some input with output, and depends upon the source's read-ahead capability, and the targets write-behind capability to enhance its performance.  When rsync hits a large file, the file is handled serially.


                      A second major factor for large file performance is how the file is striped across the disks, and how large the IO size given to a single disk, before moving to the next disk in the stripe.  With 7200 RPM disks as the basis, the storage marketplace falls into 4 major tiers.  General purpose storage typically yield ~ 12-15 MiB/sec per physical spindle, based on a 256kb or 4 x 64kb max stripe depth.  The second tier is at 25-30 MiB/sec per spindle with 512kb or 4x128 kb stripe depths. The third tier, which is the low-end of the "big data" and HPC-focused storage is at ~ 60 Mib /sec per spindle, with 1 Mib or 4x256kb stripe depth.  The fourth tier is at 75-80 MiB/sec per spindle with 2 MiB or 4 x 512 kb stripe depth.  These last two tiers are often associated with Hadoop, IBM Spectum storage (GPFS), Lustre, Quantum StorNext, ZFS, Panasas, Infinidat, and others.  These IO sequential IO rates are for disk-based IO, not SSDs. How these systems ensure and maintain 1MiB and larger IO per spindle (without breaking it up) is part of their "secret sauce", with several different approaches in the marketplace.


                      I don't know the details of the internal architecture of Nimble.  The CASL architecture has significant benefits and ease of use that these other large-file-focused systems do not have.  If Nimble's core disk topology is oriented toward 256 kb stripe depths, as most general purpose storage is, then there is a 12-15 MiB/sec/spindle ceiling, assuming no other bottlenecking issues, and my 8-12 MiB per spindle read rates are "normal", and something is not grossly mis-configured on the host or network side.  If the Nimble core disk topology is oriented to the 1Mib or greater stripe depth, then this would be the 60+ MiB/sec per spindle performance tier, and my 8-12 MiB/sec per spindle rates would be indicative of some gross mis-configuration.


                      Back to Dimitri's question, there are many applications that deal with large, multi-gb flat files, with a simple single-threaded approach ... at the application-level.  These applications depend on the kernel and/or file systems read-ahead and write-behind threads to parallel-ized the IO stream for higher levels of throughput.  Being able to use simple off-the-shelf, input-process-output style applications and still get excellent performance is a benefit.  Most of the standard Linux tools that handle "bulk" data operations of moving, copying, loading, restoring are often serial at the application level, and are examples of these class of applications.


                      Dave B.

                        • Re: How to improve low sequential read performance for large files?
                          Dimitris Krekoukias Newbie

                          Dave, look up Little's Law.


                          For a single threaded workload, network latency does affect throughput quite a bit. For example, if testing with FC, network latency is lower, and throughput even for a single thread can be faster. With iSCSI, latency might be naturally higher, and that could affect throughput (regardless of array architecture - this is a pipe thing).


                          You can calculate this on your own and derive theoretical max bandwidth for a single threaded workload regardless of array. Just calculate your host to array latency really accurately so your math is correct.


                          What's the array latency when you're doing your single threaded test?


                          Also, what's the array performance policy and block size? If you have workloads that do large sequential reads you may want to set one up with 32K (on the array side) if you have a volume that will consistently expect high sequential IO with large block sizes (say for an HPC workload). If it's truly mixed, then leave it at whatever the application really is (general purpose or a specific one like a DB profile).


                          You can always look at InfoSight Labs to see, per volume, what your true I/O heatmap is block-size-wise.


                          In general though, our systems are optimized for multi-host, concurrent multi-workload access, with auto-QoS, auto headroom optimization, etc etc (check my blog at recoverymonkey.org). The sequential-optimized systems you mention do really poorly on general-purpose stuff (and have next to zero data services, fancy checksums and no triple parity RAID), but do well on single threaded workloads, especially if network latency is really low (which is why many HPC systems don't even use a network but rather use direct SAS and in the future NVMe connections to the disks - removing network latency is important for those apps).


                          The question is, what do you need to do primarily?


                          You could also go the FC route if you want to lower network latency.