10 Replies Latest reply: Dec 22, 2016 2:18 PM by Shiva Krishna Merla

    Linux iscsi.conf nr_sessions.  Why 4 or 2?

    David Baril Wayfarer

      Hello all,

       

      Background:

      Nimble has several best-practice-oriented documents that provide some instructions to override the iSCSI configuration defaults under Linux. Most of these documents are somewhat stale, as they do not include information about RHEL/CentOS version 7.x, which is based on the Linux 3.10 kernel.  The documents that I am referring to are "BEST PRACTICES GUIDE, Nimble Storage for Red Hat Enterprise Linux 6 & Oracle Linux 6", "Deployment Considerations Guide, Nimble Storage Deployment Considerations for Linux on iSCSI", "NFS Gateway Deployment Considerations", "TECHNICAL WHITE PAPER, Nimble Storage for Splunk on Oracle Linux & RHEL 6", and "TECHNICAL REPORT, Nimble Storage Setup Guide for Single Instance Oracle 11gR2 on Oracle Linux 6.4".  There are likely other documents that include Linux iSCSI configuration for Nimble, but these are the ones that I have found so far.

       

      Note:  If you do a search in Nimble Infosight, documentation section for "nr_sessions", you get NO hits. I had to use Google searches to find most of these documents.

       

      These documents provide some guidance in configuring the Linux "iscsi.conf" file (which is now named iscsid.conf under RHEL/CentOS 7.x), and discuss overriding several of the configuration file defaults.  Some of these suggested changes have a description that discusses the rationale behind the change; other settings have no rationale discussed.

       

      The suggested settings also vary by document, without any rationale for the basis of the different settings. The lack of rationale and the inconsistent recommendations across the Nimble papers lead to confusion as to which set of parameters is "better".  I also suggest that some of the recommended settings are sub-optimal, and can lead to under-exploiting the high performance of Nimble storage.

       

      For this posting, I would like to focus on the iscsi.conf configuration variable "session.nr_sessions", which controls the number of iSCSI sessions created per host-initiator:Nimble-target-port pair.  All iscsi sessions created for the same host-initiator:Nimble-target-port pair share the same failure risks.  If you want higher availability, you would try to configure "dual fabrics" for iscsi, which ideally would involve dual NICs on the host, resulting in dual host iscsi initiators (called an iscsi "interface") cabled to dual external network switches, connected to two different Nimble ports, and overall using dual subnets to improve the network isolation. I will admit that this is the "ideal", and many Nimble customers do not have such a topology.

       

      So we have a configuration with two separate parallel paths from the Linux host to the Nimble storage.  Using the iscsi.conf default value of "1" for the "session.nr_sessions" parameter, we get one iSCSI session per host-initiator:Nimble-target-port pair (as expected).  With Linux dm-multipath properly configured, you have a total of 2 active iscsi paths to the Nimble volume, and a robust high-availability configuration.  If you properly configure the remainder of the IO stack, you can drive very high levels of IOPs and/or large IO bandwidth across the dual paths, and scale performance beyond the level of a single 10GbE connection .... if the storage side and the host side can drive those levels of performance before bottlenecking.

       

      Why then, does Nimble recommend 2 sessions or 4 sessions PER host-initiator:Nimble-target-port pair?  With the iscsi "session.nr_sessions" parameter set to 4 in this example, there will be 4 sessions per host-initiator:Nimble-target-port pair, or 8 iscsi sessions total, and Linux dm-multipath will assemble what looks like an 8-path multipath device whose paths in reality share only two physical paths.
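The session arithmetic above can be written out as a small sketch. The function name and the loop values are illustrative only (they are not taken from any Nimble document); the sketch simply multiplies the number of physical initiator:target-port pairs by the nr_sessions value.

```python
def total_sessions(physical_paths: int, nr_sessions: int) -> int:
    """Number of iSCSI sessions that dm-multipath would see as separate paths.

    physical_paths: count of host-initiator:target-port pairs (2 in the
                    dual-fabric example).
    nr_sessions:    value of node.session.nr_sessions in the iSCSI config.
    """
    return physical_paths * nr_sessions


# Dual-fabric example from the post: 2 physical paths.
for nr in (1, 2, 4):
    print(f"nr_sessions={nr}: {total_sessions(2, nr)} sessions over 2 physical paths")
```

With nr_sessions set to 4, the multipath device reports 8 paths, but the number of physical paths is still 2.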

       

      Why multiplex 4 sessions on a single host-initiator:Nimble-target-port pair?  This is more complex, and under load can generate some unneeded congestion between the 4 sessions sharing the physical path. This recommendation seems to imply that there is some resource constraint or bottleneck that prevents full utilization of the physical path between the host-initiator:Nimble-target-port pair, that is remediated by using multiple sessions per physical path ... from the same host.

       

      If there is a legitimate per-iscsi-session resource constraint (after the other settings are properly configured), it would be useful to be aware of it.  Perhaps there are methods available with newer hardware, NICs, and software IO stack tuning that can help address these inferred restrictions.

       

      For example, Nimble suggests using the vmxnet3 para-virtualized driver under VMware for a Linux VM.  This driver implements and enables multi-queue receives, but does not disable irqbalance, nor set per-queue affinities.  The vmxnet3 driver also implements multi-queue transmits, but does NOT enable the feature, nor set irq affinities.  Not surprisingly, while the "default" VMware Linux vmxnet3 driver is very good, properly enabling multi-queue capabilities, assigning irq affinities for the queues, and stopping irqbalance from randomizing them further improves networking performance, and thereby iSCSI performance, and Nimble performance .... to a single host.

       

      So ... what is the rationale for recommending 4 iscsi sessions per host-initiator:Nimble-target-port pair, especially when using a dual-NIC, dual fabric topology?

       

      Thank you for your help.

       

      Dave B

        • Re: Linux iscsi.conf nr_sessions.  Why 4 or 2?
          Nick Dyer Pioneer

          Hello David,

           

          A lot of your points refer to old iterations of the documents, which are filed on Infosight for historical reasons.

           

          I must admit I'm not a Linux guy, however I'm pretty sure that the reasons why we recommend this are exactly the same as why we create a minimum of four paths within ESX and Windows environments; it's to do with queue depth, sequential and single threaded IO stacks. If we created a single interface bond from a NIC to a target NIC then we may not yield best I/O performance due to block size, outstanding queues or even network congestion. Therefore in iSCSI environments it's recommended to create multiple paths in order to drive higher outstanding I/Os to fully utilise a network to its potential.

           

          You will want to look into using the Nimble Connection Manager for Linux, which will do a lot of this automated path management for you behind the scenes. That is supported for RHEL 6.5 and above. NCM for Linux is available on Infosight to download.

           

          This also ties into the way that we handle virtual path management using the Discovery IP address (rather than physical data IPs) and how we also scale-out systems non-disruptively. We introduced this technology in NimbleOS 2.0: Nimble OS 2.0 Part 1: Manual vs Automatic Networking

           

          I'll send this onto some of our Linux guys internally to see if they can provide some additional discussion points.

          • Re: Linux iscsi.conf nr_sessions.  Why 4 or 2?
            David Baril Wayfarer

            Hi Nick,

            Thank you for your prompt reply and useful information.  I value your offer to relay my post to some of your "Linux guys".

             

            Your first point was about the Nimble documents being old.  I agree, but what is the alternative for us customers? I was using the most recent documents available, which refer to an OS version that was released over 3 years ago. That is part of the "problem".

             

            I understand that you are not a "Linux guy", but Linux has made great strides in the past several years, and your comments about restricted queue depths, sequential and single threaded IO stacks, are no longer true, if the system architect chooses the proper components, and configures them appropriately.

             

            For example, with a small Nimble CS/1000, I can run a "single-threaded" small file create benchmark that, with proper configuration, runs almost compute bound on a 4-core Linux VM using dual vmxnet3 NICs and creates almost 4,000 small files per second, with an average file size of 1 KB.  I happened to have configured the system to perform up to 18 parallel file creates into a single XFS file system.  And yes, in this case, due to proper "tuning", an IO bound workflow has been transformed into a compute bound one, with multi-queue receives and transmits, dual NICs, and a plethora of asynchronous kernel threads doing an admirable job of "write behind" stuff, and caching of several hundred thousand inodes.  If I had access to an 8-CPU configuration, I could have scaled it higher.

             

            If you know what you are doing, this is not difficult to accomplish with Linux ... and it can leverage and exploit the high performance capabilities of Nimble very well.

             

            And BTW .... I judiciously chose which suggested Nimble Linux settings to use, and I found that, if you properly remove the internal Linux IO path bottlenecks to allow full 10GbE throughput and multi-hundred-thousand cached IOPs ... then multiplexing multiple iSCSI sessions on one physical path works against you.

             

            As to the Nimble Connection manager for Linux, I would prefer not to use it at this time.  We had some stability and data corruption issues with the previous connection manager version under VMware.  It also represents a vendor-proprietary multipath manager that does not interoperate well with non-Nimble storage volumes.  While we have multiple Nimble arrays in multiple sites, none of our systems rely solely on Nimble Storage, and have to access storage from multiple vendors.

             

            I look forward to Nimble expanding the information flow to the Linux community. Many of us in the Linux community have been exposed to very high performance systems (with very high prices), and already understand how to configure the host to reach those very high performance levels.  We would like to translate and adapt those techniques to Nimble, and offload some layers of administration complexity by exploiting Nimble's feature set.  The current challenge is that there is a lack of detailed technical information for the Linux community to adapt their well-proven techniques and methods.

             

            Thank you for your feedback.  I look forward to learning more details about the rationale of the Nimble suggested settings for Linux.

             

            Dave B

              • Re: Linux iscsi.conf nr_sessions.  Why 4 or 2?
                Freddy Grahn Wayfarer

                Hi David,

                 

                So first...the authoritative document that you should be using for iSCSI configuration is the following:

                Nimble Storage Deployment Considerations for Linux on iSCSI

                 

                That document can be found on InfoSight.  All of the other documents you refer to (for the most part) have been removed from InfoSight, and are no longer valid. We are working on removing those same documents from the Google search as well...it looks like there were some leftover files still out there. Sorry about the inconveniences!

                 

                That being said, the deployment consideration guide was created to give our customers the basic suggestions of how to configure iSCSI with a Nimble Storage system. We attempted to keep this document very basic on purpose...We did this because, as you have brought up, there are so many considerations and differences amongst the different flavors of Linux, versions etc, and manipulating these settings can have major ramifications on performance etc.

                 

                If you look at the document there are very specific parameters that we mention in this document:

                node.session.nr_sessions = 2

                node.session.timeo.replacement_timeout = 120

                node.conn[0].timeo.noop_out_interval = 5

                node.conn[0].timeo.noop_out_timeout = 10

                 

                In the document, we don't mention why we suggest these settings, which I know we should and will change...however, the general idea was to provide some guidance, without providing an exact number, since there are so many different configurations and IO requirements.  Our idea was to recommend at least nr_sessions to be 2, but also allow for our customers to adjust it to meet their requirements. In certain cases, in order to get peak performance, the only way was to increase the nr_sessions.

                 

                I know that doesn't provide you an exact answer, but I'm hoping that it at least provides you the right document to refer to, as well as the reasoning behind what this document says.

                 

                Please let me know if you have any additional questions!

                Thanks!

                Freddy

              • Re: Linux iscsi.conf nr_sessions.  Why 4 or 2?
                Freddy Grahn Wayfarer

                Hi David,

                 

                In addition, I would like to follow up on the NCM part of your post earlier.  The Nimble Linux Toolkit 2.0 does provide NCM as a part of the installer. I know that it does make our suggested changes to the multipath.conf file on Linux.  Would it be possible to have someone from our team connect with you to try to understand what may have happened, and why you are hesitant to utilize the tool? We are always attempting to make our products better, and I know they would appreciate any and all feedback, suggestions etc.

                 

                Please let me know if that would be ok!

                Thanks!

                Freddy

                • Re: Linux iscsi.conf nr_sessions.  Why 4 or 2?
                  David Baril Wayfarer

                  Hi Freddy,

                   

                  Thank you for the additional information and the reference to the Nimble Storage Deployment Considerations for Linux on iSCSI.  Unfortunately, if you search for "nr_sessions" in Infosight, this document is not returned.

                   

                  I had seen this document before, and it was one of the documents that mentioned using a value of 2 rather than 4 for the nr_sessions setting.  This document was a minority in using the value of 2 with no rationale.  This document also introduced a very different Multipath.conf configuration using ALUA, and the Deployment Considerations manual for Fibre Channel also specified ALUA for the Multipath setting with no explanation.  Since Nimble just launched the FC connectivity feature, it could in reality need ALUA mode.

                   

                  Knowing quite a bit about Linux Multipath, and the specifics of the difference between uniform active-active and ALUA, I found the changeover from uniform active-active to ALUA with no explanation confusing.  I dismissed the ALUA-mode Multipath.conf iSCSI entry as an incorrect cut/paste of the fibre-channel centric Multipath.conf entry into the iSCSI document.  We have several Nimble storage systems and have been running uniform active-active for years.  I did not have the time to explore and test ALUA with iSCSI as the document suggested.

                   

                  The other side effect of using nr_sessions of 2 or more, which results in multiple sessions per physical path, is that the Linux kernel discovers paths using a "walk" that is primarily depth-first. This tends to create all the sessions (and Linux block devices) for one iscsi interface grouped together, and then the sessions on the second interface, and so on.  When the typical multipath device is created, you end up with "clumping" of the paths on one interface, and then on the next.  This creates a momentary starvation condition, since two adjacent IOs end up being sent to the same interface, rather than to alternate interfaces. While one interface is clumping, the other interface has momentary starvation.  Statistically, this will likely limit throughput to 78% to 83% of the maximum due to the 2 IO clumping.  This all assumes that the compute intensity of the IO stack does not get in the way.  This is difficult to achieve with 10/40/100 GbE unless you are using some high-end hardware with very tight drivers.  With fibre channel, on the other hand, it is relatively easy.  Fibre channel is also vulnerable to the less-than-optimum ordering of paths when there are multiple paths per interface.
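As a rough illustration of the clumping described above, the sketch below issues IOs round-robin (one IO per path) over a path list and counts how often two adjacent IOs land on the same interface. The interface names "eth0"/"eth1" and the function name are hypothetical, and the model only counts adjacency; it does not model throughput and does not attempt to reproduce the 78% to 83% estimate.

```python
from itertools import cycle, islice


def adjacent_same_interface(path_order, n_ios=1000):
    """Fraction of consecutive IO pairs sent to the same interface.

    path_order: interface name of each path, in the order the multipath
                device lists them. IOs are issued round-robin, one per path.
    """
    ifaces = list(islice(cycle(path_order), n_ios))
    same = sum(1 for a, b in zip(ifaces, ifaces[1:]) if a == b)
    return same / (len(ifaces) - 1)


# nr_sessions=2 on two interfaces (hypothetical names).
grouped = ["eth0", "eth0", "eth1", "eth1"]      # depth-first discovery order
interleaved = ["eth0", "eth1", "eth0", "eth1"]  # alternating order

print("grouped:    ", adjacent_same_interface(grouped))
print("interleaved:", adjacent_same_interface(interleaved))
```

In the grouped order roughly half of all adjacent IO pairs hit the same interface; in the interleaved order none do.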

                   

                  To make a long story short, I was deploying quad 8-Gbit FC, scaling to 98% across all four interfaces, full duplex ... over 5 years ago, with RHEL 6.x.  It is not really a fair fight, since properly configured, a 4 MiB FC IO generates a single host interrupt, vs several thousand for Ethernet.   I was able to configure standard Linux Multipath under RHEL6 and newer to achieve higher performance levels and lower latencies across multiple vendors' storage than the "premier" EMC PowerPath product, which supported a very limited set of non-EMC storage.

                   

                  Yes, I am willing to have off-line discussions with Nimble technical staff about Nimble Connection Manager.  The problems I described with stability and data corruption were related to the VMware version, not the new Linux version.  I was not directly involved with that deployment, but because of that in-house negative experience, NCM is not being used company-wide.  With only a single small CS/1000 to attach, the risks appeared to outweigh the benefits.

                   

                  Please contact me offline to set up additional discussions.  I had previously also been talking to both Brian MacDonald, and Stephen Daniel to set up some information exchange, but nothing has come of it so far.  I had shared some of my past papers with Brian and Stephen as an example of what could be done.


                  I have already applied my optimization techniques to one of our Nimble configurations, and the results have been good, given the capabilities of the storage model and host systems being used.  I wonder how much better we could do if I had access to additional information.  Unfortunately, I can't justify spending a lot of time reverse-engineering behaviors at my current position.


                  I look forward to a continued discussion on Linux IO path optimizations for Nimble.


                  I already have in place an existing workload system that scanned a 21 million-file system, and logged all the file names, types, and lengths.  I can "playback" this "worklist" in various ways to better simulate the range of file sizes that we use.  Unfortunately, we have millions of small files and symbolic links, and then a reasonable amount of medium-to-large files.  The "average" file size is very misleading due to the skew toward tiny files.



                  I am currently using a total of 2 iSCSI sessions, one per 10GbE interface in a dual-fabric/dual subnet topology, to a very small CS/1000.  I have timings of the 18 million mixed file read and write tests.  "If I get a chance" I will try 2-sessions per interface, but I believe that I am already saturating the capabilities of the modest CS/1000.


                  Regards,


                  Dave B.




                    • Re: Linux iscsi.conf nr_sessions.  Why 4 or 2?
                      Freddy Grahn Wayfarer

                      Hi David,

                       

                      In regards to the multipath.conf file that you see in the iSCSI Deployment Consideration guide, this comes directly from our interoperability test group.  We have attempted to simplify the multipath.conf file, so it will look very similar to the FC multipath.conf file.  Our interop team tests using this multipath.conf configuration, and makes modifications based on things we are seeing in the field, as well as known issues that we've seen before.

                       

                      There are only a couple of small changes between the FC and iSCSI multipath.conf.

                       

                      Hope this helps as well!

                      Thanks!

                      Freddy

                    • Re: Linux iscsi.conf nr_sessions.  Why 4 or 2?
                      David Baril Wayfarer

                      Hi Freddy,

                       

                      Yes, I understand that the differences between the FC and iSCSI multipath.conf versions are small in NUMBER, but NOT small in functionality.  ALUA, asymmetric logical unit access, is very different than "const" priorities, unless the priorities returned by the storage are equal, which, according to the example listings, they appear to be.  However, the "rr_weight" field is set to "priorities", and the priority happens to be "50", according to the listings.  This value, 50, is then multiplied by "rr_min_io_rq" (which is 1) to determine how many IOs go down a path before switching.  So, this "ALUA" multipath configuration pro-actively creates 50-IO clumping on a single path (and 50-IO starvation on the other paths), whereas the past non-ALUA multipath configuration used "const" for the priority type, "uniform" for the rr_weight, and 1 for rr_min_io_rq ... resulting in only 1 IO being sent down a path before switching, and therefore no IO clumping.  With the IO clumping behavior of 50, as the new recommendations suggest ... you might need multiple iscsi sessions per physical connection as a mechanism to counteract the clumping.  The real solution is to use the proper Multipath settings in the first place, and do NOT amplify potential bottlenecks by introducing forced clumping.
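The arithmetic in the paragraph above can be sketched as follows. This is a simplified model of the round-robin selector as described in this thread (weight = priority x rr_min_io_rq when rr_weight is "priorities", otherwise rr_min_io_rq alone); the function name is hypothetical and the real dm-multipath behavior may vary by version.

```python
def ios_before_switch(rr_weight: str, prio: int, rr_min_io_rq: int) -> int:
    """IOs sent down one path before round-robin moves to the next path.

    Simplified model taken from the discussion above, not from the
    dm-multipath source code.
    """
    if rr_weight == "priorities":
        return prio * rr_min_io_rq
    if rr_weight == "uniform":
        return rr_min_io_rq
    raise ValueError(f"unknown rr_weight: {rr_weight!r}")


# The ALUA listing discussed above: prio 50, rr_min_io_rq 1.
print(ios_before_switch("priorities", 50, 1))  # 50-IO clumps per path
# The earlier non-ALUA configuration: uniform weight, rr_min_io_rq 1.
print(ios_before_switch("uniform", 50, 1))     # switch after every IO
```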

                       

                      Unfortunately, Nimble is not the only storage vendor that treats multipath and other configuration settings casually, and these inaccurate settings can have negative impacts on performance, and negate other "optimizations" elsewhere.

                       

                      Regrettably, we, the customers suffer the consequences, and these inaccuracies damage the credibility of the vendor.  It also opens the door for a less-capable competitor with more appropriate "settings" to operate at higher efficiency levels.

                       

                      I have the battle scars of having identified and "corrected" the multipath and other settings of several storage vendors, yielding substantial efficiency gains.

                       

                      The reason I asked the original questions was to elicit the rationale for why an iscsi.conf setting was proper, because it seemed counter-intuitive.  The multipath.conf settings are also very inaccurate, in my opinion, and there are many other settings that are incorrect due to omission (like the block-layer settings for the LVM logical volume).

                       

                      I look forward to an ongoing discussion within the Community, and would suggest that Nimble re-visit the explicit values of the various settings or the acceptability of the "defaults".

                       

                      Yes, the settings "work", in that the device is operational, but the difference in efficiency and resulting throughput (or in hardware requirements) can be substantial.

                       

                      Dave B

                        • Re: Linux iscsi.conf nr_sessions.  Why 4 or 2?
                          Shiva Krishna Merla Newbie

                          Hi David,

                           

                          Really appreciate your inputs and in-depth analysis of the current settings. We tried to update recent documents to a certain extent and will be refining them further soon. I will try to briefly answer some of the questions. nr_sessions of 2 is suggested to obtain redundancy from the target side. During discovery, the iSCSI initiator will try to connect to the target discovery_ip and the target will redirect the connection to a certain data_ip. By default Linux will create a single session to the target and end up using a single data_ip connection per initiator. With nr_sessions set to 2, iscsid will duplicate the connection and try to connect to the discovery_ip again, which will be redirected to a different data_ip than the original session. Thus each initiator will log in to both data_ip addresses. However, if only a single data_ip NIC is connected on the array, this is not necessary, as exact duplicate initiator-target pairs will be created (not very useful).

                           

                          Regarding multipath.conf settings, I honestly agree that we need to refine these settings, and we are working on this to update them shortly. Our goal was to maintain the same settings in the device section for both iSCSI and FC.

                           

                          path_group settings:

                            prio: alua

                            path_grouping_policy "group_by_prio"

                           

                          These two settings ensure that for FC, stand-by paths are treated with low priority and that path group is not used for I/O unless there is an active path failure (prio=50 for active-optimized paths, prio=1 for stand-by (ghost) paths). For iSCSI, there will be a single path_group and only active paths will be discovered (prio=50).

                           

                          path selector settings:

                            rr_weight: uniform

                            rr_min_io_rq: 1

                            path_selector: round-robin.

                           

                          These are the default value settings in device-mapper-multipath (in the latest versions, service-time is the default though). We are working on refining these, as we currently suggest setting "rr_weight" to priorities and "rr_min_io_rq" to 20, which doesn't seem to be the right fit for all workloads, as 1000 I/Os will be sent to each path before switching (prio(50) * rr_min_io_rq(20)).

                           

                          hardware_handler settings:

                            hardware_handler "1 alua"

                           

                          We need scsi_dh_alua to handle ALUA-specific error conditions returned from the target. This is to allow the SCSI midlayer to retry NOT_READY check conditions during transition. This is applicable for both iSCSI and FC, even though there are no stand-by paths for the iSCSI config.

                           

                          Regarding the I/O clumping you mentioned with paths originating from the same initiator port (both FC and iSCSI), I had posted a patch in the past and it got merged into device-mapper-multipath as well. With this, multipath will currently group similar paths by alternating initiator ports/IP addresses. Let me know if you are seeing any further issues with this as well.

                           

                          [dm-devel] [PATCH]multipath-tools: Re-ordering of child paths in priorit
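To illustrate the idea of alternating initiator ports when ordering paths, here is a minimal sketch. It is not the algorithm from the patch referenced above; the function name, the interface names, and the device names are all hypothetical, and the sketch only shows one simple way to interleave paths so that adjacent entries come from different initiators where possible.

```python
from collections import defaultdict
from itertools import zip_longest


def interleave_by_initiator(paths):
    """Reorder (initiator, device) tuples so initiators alternate.

    paths: list of (initiator, device) tuples in discovery order.
    Returns a new list; relative order within each initiator is kept.
    """
    groups = defaultdict(list)
    for initiator, dev in paths:
        groups[initiator].append((initiator, dev))
    ordered = []
    # Take one path from each initiator in turn until all are consumed.
    for row in zip_longest(*groups.values()):
        ordered.extend(p for p in row if p is not None)
    return ordered


# Depth-first discovery order with 2 sessions per interface (hypothetical).
discovered = [("eth0", "sda"), ("eth0", "sdb"), ("eth1", "sdc"), ("eth1", "sdd")]
print(interleave_by_initiator(discovered))
```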

                           

                          Hope this is helpful.

                           

                          Thanks

                          Shiva

                            • Re: Linux iscsi.conf nr_sessions.  Why 4 or 2?
                              David Baril Wayfarer

                              Hello Shiva,

                               

                              Thank you for your thoughtful reply.

                               

                              Small world.  I went to look at the dm-multipath patch that you discussed, and discovered that you previously worked for a vendor I was using at a previous job, where I was optimizing 8 and 16 total paths across 4 FC controllers to each LUN, on a system with over 5000 paths.  We may have been indirectly involved with each other, as I shared my findings, and my user-space workarounds, with your past employer. Using the techniques I had identified, we were achieving 98% scaling across 4 FC controllers, full duplex into a single dual-socket Nehalem/Westmere class system. It would be even better today with native PCIe 3.0 controllers and Sandy Bridge/Ivy Bridge class and faster systems.

                               

                              Please contact me offline and I will share my research report on dm-multipath optimization. Your patch is a very useful first step. The basic issue can be expanded to a generic multi-level "topology vector", where the ultimate goal is to maximize disjointness between adjacent paths.  There are also methods to perform the reverse topology mapping ... to quickly translate a pseudo-block device path, like "sdx", into its path components (topology vector) from end to end, which enables some interesting performance analysis of a shared sub-component in the path. The expanded concepts would also apply to balancing across multiple storage arrays.

                               

                              As to the iscsi session connection behavior under Linux ... my observations with our CS/1000 are different than you describe.  Perhaps it is sensitive to using the discovery IP address rather than the IP address of one of the targets.

                               

                              In our dual-NIC, dual-subnet, dual fabric topology with 1 10GbE port on the Nimble connected to each subnet, with the iSCSI nr_sessions set to 1 ... we get a total of two sessions, as expected.

                                   NIC1 on subnet1 => Nimble port on subnet 1

                                   NIC2 on subnet2 => Nimble port on subnet 2

                               

                              There are only 2 physical paths possible.  If I run the iscsiadm discovery command to one of the TARGET IPs (not the discovery IP), it returns the 2 paths, not just its own.  The two paths returned also include the "other" path, but an "other" path that is in a disjoint subnet.  I believe this is similar to using the discovery address.
                               
                                   iscsiadm -m discovery -t st -p {Nimble_target_IP}:3260

                               

                              If I increase the iscsid.conf nr_sessions to "2", the result is that an additional set of paths is returned, each a duplicate of one of the existing paths.

                               

                              I always use the "sendtargets" type of discovery, and I have never experienced a case with Nimble where a sendtargets request on one IP returned only 1 path when more were available.  Therefore, there is no need for iscsiadm to query the Nimble more than once to yield the additional paths.

                               

                              So I suggest that some additional clarification may be needed regarding the different behavior when using the Nimble discovery address, and the behavior when using one of the target IP's.   If the behavior is different, it will be confusing to the customer.

                               

                              From my limited testing, iSCSI discovery using "sendtargets" to a target IP works properly and returns the correct number of paths. If "sendtargets" discovery finds all the paths, then there is no need to multiply the number of sessions PER PATH by 2 or 4 with the nr_sessions setting.

                               

                              Your description implies that with nr_sessions set to 1, a "sendtargets" discovery returns an incomplete subset of the total number of paths, and therefore needs a larger-than-1 value of nr_sessions to correctly identify all the paths.  I have NOT seen this behavior under CentOS 7.x or 6.x with the in-box version of the iscsi packages.

                               

                              Please also be aware that the Linux iscsid.conf file is a GLOBAL configuration file and applies to all iSCSI devices, some of which may not be Nimble devices.

                               

                              Shiva, you did correctly indicate that the "rr_weight" value in the multipath.conf file should be "uniform" and NOT "priorities".  This is contrary to the documentation that I mentioned earlier, and was a source of confusion. From my past experience with dm-multipath, I was aware that any rr_weight value other than "uniform" is inappropriate.
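A rough sketch of why "priorities" skews the load. It assumes, as the multipath.conf documentation describes, that rr_weight "priorities" weights each path by rr_min_io times its prio, while "uniform" gives every path the same weight. The function name and the prio values are illustrative only, and it assumes all the paths sit in the same path group.

```python
def io_share(prios, rr_min_io=1, weighted=True):
    """Fraction of IOs each path receives in one full round-robin cycle.

    weighted=True  models rr_weight "priorities" (weight = rr_min_io * prio).
    weighted=False models rr_weight "uniform"    (weight = rr_min_io).
    """
    weights = [rr_min_io * p if weighted else rr_min_io for p in prios]
    total = sum(weights)
    return [w / total for w in weights]


prios = [130, 50, 1]  # illustrative prio values for three paths
for label, weighted in (("priorities", True), ("uniform", False)):
    shares = io_share(prios, weighted=weighted)
    print(label, [f"{s:.1%}" for s in shares])
```

Under these assumptions the highest-prio path takes roughly 72% of the IO and the lowest-prio path under 1%, while "uniform" splits the IO evenly.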

                               

                              May I also suggest that the apparent lack of performance impact when changing rr_min_io_req is because the impact of that setting also depends on rr_weight, ALUA/non-ALUA, the numeric value of path priorities returned by ALUA, the system defaults, and the Linux version (which determines the version of dm-multipath).

                               

                              There are two major versions of dm-multipath: the earlier BIO-based dm-multipath, where the largest IO "chunk" was typically 256 KB or 512 KB AND was dependent on other configuration settings, and the newer request-based dm-multipath, in which the largest IO chunk is the requested IO size coming down the stack, which could be multiple blocks.  The details are not as important as understanding that the BEHAVIOR was different, and this change occurred during the transition from RHEL/CentOS 5.x to 6.x.  The other important factor is that the Linux system defaults changed from version to version, and sometimes from update to update.  Customers often copy their old multipath.conf file and use it in a newer Linux version, possibly carrying over settings that were the defaults of the older version.

                               

                              So be very wary of assuming that the "default" behavior matches some out-of-version documentation that has not been updated in years. You need to confirm what defaults are in effect on that specific system.  "multipathd -k" (then "show config" at its prompt) is an easy way to display the in-effect configuration under newer versions of Linux.

                               

                              Yes, it is confusing, but the hardware and software vendor community seems to be amplifying the issue, not adding clarity.  And with this confusion come less-than-optimal configurations and less-efficient operation.

                               

                              Two other points.

                               

                              The effect of the performance-centric settings of dm-multipath can also be masked by the settings of the IO stack layers above and below dm-multipath. Some of these layers have their own maximum IO size settings, as well as queue and scheduler related settings. A layer above or below dm-multipath can negate or amplify the effect of the performance-centric settings.  This is often why changing a setting in one layer appears not to "work": the effect is being masked by another layer.  For example, iscsid.conf can set global iSCSI maximum IO size settings and some queuing parameters, as can the LVM logical volume layer, the multipath device layer, the block IO layer, and the controller driver.  The IO stack may also be dependent on memory allocation, which may be indirectly specified somewhere else.  With iSCSI, you then add all the settings related to the networking stack.
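As a starting point for spotting this kind of masking, here is a minimal sketch that reads a few block-layer queue tunables from sysfs for a device and for the devices directly beneath it (for example, the sd paths under a dm device). The helper names are hypothetical. It also rests on a simplifying assumption of mine: that the smallest max_sectors_kb in the stack bounds the IO size that reaches the wire. It covers only the block layer, not iscsid.conf or the network stack.

```python
import os

QUEUE_FILES = ("max_sectors_kb", "read_ahead_kb", "nr_requests")


def queue_settings(device, sysfs_root="/sys/block"):
    """Read a few queue tunables for one block device; missing files map to None."""
    settings = {}
    for name in QUEUE_FILES:
        path = os.path.join(sysfs_root, device, "queue", name)
        try:
            with open(path) as f:
                settings[name] = int(f.read().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings


def lower_devices(device, sysfs_root="/sys/block"):
    """List the devices directly beneath `device` (its sysfs "slaves" entries)."""
    slaves = os.path.join(sysfs_root, device, "slaves")
    return sorted(os.listdir(slaves)) if os.path.isdir(slaves) else []


def smallest_max_io_kb(device, sysfs_root="/sys/block"):
    """Smallest max_sectors_kb among `device` and the devices directly beneath it."""
    stack = [device] + lower_devices(device, sysfs_root)
    values = [queue_settings(d, sysfs_root)["max_sectors_kb"] for d in stack]
    values = [v for v in values if v is not None]
    return min(values) if values else None
```

For example, `smallest_max_io_kb("dm-0")` compares a multipath device (the name is a placeholder) with its underlying paths; a small value on any one layer is a hint that a larger setting elsewhere is being masked.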

                               

                              Shiva mentioned that sometimes changing the multipath rr_min_io_req setting from 1 to 20 had little noticeable effect.  This can often be easily explained by a lower layer re-coalescing the 20 IOs into fewer IOs, and

                               

                              "service-time" dm-multipath scheduling.

                               

                              I have tested the "service-time" setting on Nimble.  This was done by accident, since it is the RHEL/CentOS 7.x default. Red Hat suggests that "service-time" is "better" than "round-robin", at least for Fibre Channel. From my extensive real-world testing at past employers across multiple petabytes of FC storage, we found this NOT to be the case.

                               

                              When I tested "service-time" with an iSCSI LUN on a Nimble CS/1000 with dual-NIC, dual-subnet, dual 10GbE connections, the service-time setting caused IO to "stick" to a single IO path when under a heavy single-stream load. This also meant that any multi-threaded read-ahead or write-behind optimizations in the IO stack would not be able to utilize the "other" path that was idle.  You probably would not see this behavior on a busy system with multiple LUNs and multiple activity flows, but a single, high-intensity stream could get repeatedly "stuck" on a single path. Perhaps it was related to low-level interrupt affinity settings, with the IO-intensive application "closer" to one NIC than the other, but the end result was a 500+ MiB/sec stream on one path, with zero on the other path, for multiple 10-second performance monitor screen updates. This was IO clumping big-time.

                               

                              Also, selecting "service-time" at the multipath layer, and then using a block-layer IO scheduler other than "noop", will result in the multipath scheduler measuring the latency of the block-layer queuing and not the actual congestion of the external path to the device.  If you take the Red Hat/CentOS 7.x defaults, which combine "service-time" multipath scheduling with timesharing-oriented "cfq" block-layer scheduling, you get the worst possible combination, effectively disabling the multi-threaded optimization algorithms in an advanced filesystem such as XFS.
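To see which block-layer scheduler is actually in effect, a small sketch like the following can help. It relies on the sysfs convention that /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets. The function names are my own.

```python
import glob
import os


def active_scheduler(text):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def report_schedulers(pattern="/sys/block/*/queue/scheduler"):
    """Map each block device name to its active scheduler (None if unreadable)."""
    result = {}
    for path in glob.glob(pattern):
        # .../<device>/queue/scheduler -> <device>
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as f:
                result[device] = active_scheduler(f.read())
        except OSError:
            result[device] = None
    return result


if __name__ == "__main__":
    for device in sorted(report_schedulers()):
        print(device, report_schedulers()[device])
```

Any sd path device or dm device reporting "cfq" here is a candidate for the masking effect described above.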

                               

                              Changing to "round-robin", with "noop" at the block layers above and below dm-multipath yielded the best overall performance, especially to a super-smart device like Nimble. You want to push the IO out as fast as possible, with the best balance across the paths to the intelligent storage, so the storage intelligence can make more informed optimization decisions.  Perhaps if you were connected directly to non-intelligent JBOD you would want some host-based IO biased scheduling, but not for intelligent storage.

                               

                              Shiva, please contact me offline for my additional multipath research findings.

                               

                              Regards,

                               

                              Dave B

                                • Re: Linux iscsi.conf nr_sessions.  Why 4 or 2?
                                  Shiva Krishna Merla Newbie

                                  Hi David,

                                   

                                  Small world indeed! I remember now where the I/O reordering issue and suggestion came from. I will be happy to expand the solution and post additional patches upstream. I will contact you offline, and we can use your help to further refine our settings. Brief responses to your further questions follow.

                                   

                                  "nr_sessions": You are right, setting this to 2 will only be helpful if you have at least 2 data IPs in the same subnet. That is when the array will redirect the host login to both of the data IPs in the same subnet. In your case this doesn't apply. As each data IP is in its own subnet, sessions will be duplicated end-to-end. We will make this clear in the documentation. Our target will return portals from each subnet configured on the array, so that the host can log in to both subnets by default. We suggested changing the value to 2 or 4 in order to make use of the redundancy of data IPs in the same subnet.

                                   

                                  "rr_weight": Yes, it will be updated in the documentation to "uniform". Because prio values can be 130, 50, or 1, "priorities" would cause uneven I/O distribution and starvation of paths by the round-robin path selector. Your input on the service-time path selector is helpful. We have decided to use and suggest only the round-robin path selector with Nimble devices.

                                   

                                  BIO/Request based DM: We already denote different multipath.conf settings for 2.6.18 kernels vs 2.6.32 and above. But we will add a note for migrations from 5.x to 6.x versions as well.

                                   

                                  I/O stack tuning: We can discuss offline the various settings you have in mind (max_sectors_kb, scheduler, read_ahead_kb, etc.).

                                   

                                  Thanks again!

                                  Shiva