16 Replies. Latest reply: Jan 3, 2016 8:41 AM by Alan Price

    A disruptive non-disruptive failover?

    Alan Price Adventurer

      I thought about opening a support ticket but decided that I'd reach out to the friendly Nimble community (not to be confused with your Friendly Neighborhood SE) to start a discussion and find out what I must have done wrong.

       

      Since sometime early in NOS 2.x, possibly about when I switched over to using NCM for path management, we have had a bit of an issue during our formerly non-disruptive updates.  The first few upgrades after we bought our Nimble were done midday and never caused a stir, but lately our whole network takes a pause after the controller failover.  Watching the Nimble during the upgrade I see that the tge interfaces have no traffic and, consequently, neither do the volumes.  After a couple of minutes the paths all seem to reconnect and the servers get their drives back.  Luckily our VMware guests handle this pretty well and just kind of sit there while they wait for their disk requests to go through.  But I suspect it's not a great idea to tear out a bunch of hard drives while they're in use.

       

      Has anyone else seen this kind of behavior?  Our topology hasn't changed except for NOS 2.x (currently 2.5.0), NCM, and some UCS firmware releases.  I've reviewed all of the setup guides a couple of times to make sure I'm not doing something obvious, so I hope this turns out to not be something obvious.

       

      The Details

      We have a Nimble CS220G-X8 connected to a pair of Cisco UCS FIs as a direct-attached appliance.  VLANs are set so that FI-A has one and FI-B has another; the Nimble's tge1 ports run to FI-A and the tge2 ports run to FI-B.  We use VMFS datastores and NCM multipathing, which VMware reports is fully functional.  I haven't been connected to our ESXi hosts during a failover so I'm not sure what they report for their paths during the failover.  We use the software iSCSI client and have one vmnic bound to one VMkernel port per iSCSI VLAN.

        • Re: A disruptive non-disruptive failover?
          valdereth Adventurer

          Never hurts to start a case with support.

           

          What address zone are you using?  (Single Zone, Bisect, Even/Odd)

            • Re: A disruptive non-disruptive failover?
              rfenton Tracker

              Hi Alan,

               

              I've performed several controller upgrades, both physical and software.  Typically you will see a pause in IO as the controllers transition from the Active to the Standby controller.  If you were pinging the management interface you may see a dropped packet.  When viewing ESX or a guest you will typically see a pause in the IO activity for a few seconds.  More often than not this is around 5-10 seconds, but it is recommended to set the SCSI timeout to 120 seconds (which is best practice for pretty much all storage vendors).

               

              This is also the value that NCM will set for Windows/ESX.
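A minimal sketch for checking that value from inside a Linux guest, assuming the guest exposes its SCSI disks as `sd*` under `/sys/block` and that each disk's timeout (in seconds) can be read from `device/timeout`.  The function name and the 120-second threshold constant are illustrative, not part of NCM or any Nimble tooling.

```python
from pathlib import Path

RECOMMENDED_TIMEOUT_S = 120  # the value quoted in the post above


def find_short_timeouts(sys_block=Path("/sys/block"), minimum=RECOMMENDED_TIMEOUT_S):
    """Return {device: timeout} for sd* disks whose timeout is below `minimum`."""
    short = {}
    for timeout_file in sorted(Path(sys_block).glob("sd*/device/timeout")):
        try:
            value = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or unexpected content: skip this disk
        if value < minimum:
            # .../sda/device/timeout -> "sda"
            short[timeout_file.parent.parent.name] = value
    return short


if __name__ == "__main__":
    for device, value in find_short_timeouts().items():
        print(f"{device}: {value}s is below {RECOMMENDED_TIMEOUT_S}s")
```

This only reports disks that fall short; it does not change anything, so it should be safe to run on a production guest.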

               

              valdereth's advice is good - if you're seeing unexpected behaviour then a call to Support should be made so they can check over your configuration.   I doubt address zoning will come into it if you're plugged directly into the Fabric Interconnects of your UCS, as this really only comes into play when there are inter-switch links in the mix.

               

              Cheers

              Rich

                • Re: A disruptive non-disruptive failover?
                  valdereth Adventurer

                  That'd be my lack of UCS experience showing through

                   

                  The longest period where I've seen dropped pings to the discovery addresses was probably only 5-10 seconds during a software update.  This has always been during low I/O periods - I'm not sure if timeouts increase during intense I/O.

              • Re: A disruptive non-disruptive failover?
                Alan Price Adventurer

                Support is my next stop but I was curious to see if anyone else would have noticed the same behavior.  I do expect an IO pause during a failover while everything is re-learned, and we used to have those short pauses, but now it's a couple of minutes and noticeably causes servers to stop responding.

                 

                I'm using single zones since we split into two distinct subnets on two switches.

                 

                Thanks!

                • Re: A disruptive non-disruptive failover?
                  Alan Price Adventurer

                  Hi all.

                  I might be onto something but it will take until the next software upgrade to confirm if I've got it fixed.  While I was at the NIOP course I got to thinking about a setting that changed in NOS 2.x.  It turns out it hadn't been set right after the upgrade (my bad).  I tried a test failover today and it was as smooth as expected; a software upgrade will be the real challenge.  I'll post back as soon as that's done.

                   

                  Also, Marty, to your request:

                  Know that Nimble's failovers are supposed to be transparent, and I've seen them work that way personally in the past.  Nimble's InfoSight metrics still show well over half of their customers performing software upgrades during business hours.  If you haven't seen the blog post from November, check it out: Nimble Storage Blog | Go Ahead – Update Your Storage Operating System in the Middle of the Day.  Those numbers are still tracking from what I've heard, and I'll be back at midday upgrades once I get this bug worked out.

                   

                  Alan

                  • Re: A disruptive non-disruptive failover?
                    Alan Price Adventurer

                    After updating to 2.2.6.0 today it appears I still have a problem.  Our network took a brief pause during the update and vCenter logged an "all paths down" event for every Nimble datastore on each of our hosts.  The outage lasted over two minutes, as indicated by other logs noting that the event had run for over 140 seconds and that the hosts were switching to I/O fast fail mode.  Looks like I'll need to open a support ticket.
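For anyone wanting to time a failover independently of the vCenter logs, here is a rough sketch that repeatedly attempts a TCP connection to the iSCSI port (3260) and reports the longest stretch without a successful connect.  It is an assumption-laden stand-in for watching pings: a TCP connect only shows that the portal is accepting connections, not that ESXi paths have recovered, and the function names and the `192.0.2.10` address in the usage note are placeholders.

```python
import socket
import time


def port_is_open(host, port=3260, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(samples):
    """Longest unreachable stretch, in seconds.

    `samples` is a time-ordered iterable of (timestamp_seconds, reachable).
    An outage runs from the first failed probe to the next successful one
    (or to the last sample if the target never came back).
    """
    longest = 0.0
    down_since = None
    last_ts = None
    for ts, ok in samples:
        if ok and down_since is not None:
            longest = max(longest, ts - down_since)
            down_since = None
        elif not ok and down_since is None:
            down_since = ts
        last_ts = ts
    if down_since is not None:
        longest = max(longest, last_ts - down_since)
    return longest


def watch(host, duration_s=600, interval_s=1.0):
    """Probe `host` for `duration_s` seconds and return the longest outage."""
    samples = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append((time.monotonic(), port_is_open(host)))
        time.sleep(interval_s)
    return longest_outage(samples)
```

Starting something like `watch("192.0.2.10", duration_s=900)` against each discovery IP just before the upgrade would give a number to compare with the 120-second SCSI timeout mentioned earlier in the thread and with the 140 seconds seen in the host logs.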

                    • Re: A disruptive non-disruptive failover?
                      Amirul Islam Adventurer

                      Alan, please review the following:

                      Network Control Policy in UCS

                      Flow Control policy in UCS

                      Portfast is enabled on the switches connecting from the FIs

                      I saw a similar issue in a slightly different configuration and it was down to spanning tree on the uplink switches.

                        • Re: A disruptive non-disruptive failover?
                          Alan Price Adventurer

                          I double-checked those policies and our core switch and they're all set to the recommended configurations.  Spanning tree is a very good thought.  It would describe the problem I'm seeing, but I'm using Appliance Ports and have an STP edge directive on our uplinks, so the obvious areas aren't falling victim to an STP timer.  I'm keeping it in mind for continued research, though.

                           

                          Thanks!

                          Alan

                        • Re: A disruptive non-disruptive failover?
                          christoph.berthoud@vista.co Wayfarer

                          We also had our first disruptive failover with 2.2.6.0 and support are investigating.

                          • Re: A disruptive non-disruptive failover?
                            Alan Price Adventurer

                            Hi all.

                            I've been working with Support on this issue and we checked a few things I wanted to share.  Our last upgrade this past weekend worked great, but we've also had things work great in the past only to break again.  So, I don't consider these a resolution yet but they did appear to help.  I'll confirm as the next few releases roll out and I install them.

                             

                            • Array logs showed that there was an iSCSI login timeout during the last upgrade, so hosts didn't reconnect to the datastores in a timely fashion.  We don't use anything beyond initiator IQN authentication so it's not a CHAP issue.
                            • I double-checked Discovery IPs.
                            • The support engineer double-checked our UCS network control and flow control policies per the integration guide, and as noted above by Amirul, and found them to be correct.
                            • The engineer double-checked our NCM installation to make sure that it was indeed still current and running, and yes, it was.
                            • The engineer mentioned that he's seen problems before when outdated (read: default) VMware NIC drivers are used and suggested I make sure the Cisco drivers are current.  I installed the latest Cisco enic bundle on all of our hosts, since we use SW iSCSI.  The drivers, if you're looking for them, are available from the vSphere download pages or as an all-in-one ISO from Cisco.  Only the enic drivers apply to us, but make sure to check your fnic or other vHBA drivers as required.

                             

                            In summary, the one change I made at Support's request was to update the NIC drivers.  I did that, things worked fine during the update, and I'll post back after the next upgrade.

                             

                            Alan

                            • Re: A disruptive non-disruptive failover?
                              Alan Price Adventurer

                              We haven't had a maintenance window in some time but just completed one over the holidays. Everything seemed to be fine this time with the only disruption to one particular Linux VM (which belongs to a family we've had many different issues with before). I think Support's answer regarding the Cisco custom drivers was the best one, since everything else had been checked a few times before.

                               

                              Hope someone else finds this useful in the future!

                              Alan