
A disruptive non-disruptive failover?

Question asked by Alan Price on Mar 25, 2015
Latest reply on Jan 3, 2016 by Alan Price

I thought about opening a support ticket but decided that I'd reach out to the friendly Nimble community (not to be confused with your Friendly Neighborhood SE) to start a discussion and find out what I must have done wrong.


Since some point early in NOS 2.x, possibly around when I switched over to using NCM for path management, we have had a bit of an issue during our formerly non-disruptive updates.  The first few upgrades after we bought our Nimble were done midday and never caused a stir, but lately our whole network takes a pause after the controller failover.  Watching the Nimble during the upgrade, I see that the tge interfaces have no traffic and, consequently, neither do the volumes.  After a couple of minutes the paths all seem to reconnect and the servers get their drives back.  Luckily our VMware guests handle this pretty well and just sit there while they wait for their disk requests to go through.  But I suspect it's not a great idea to tear out a bunch of hard drives while they're in use.


Has anyone else seen this kind of behavior?  Our topology hasn't changed except for NOS 2.x (currently 2.5.0), NCM, and some UCS firmware releases.  I've reviewed all of the setup guides a couple of times to make sure I'm not doing something obvious, so I hope this turns out to not be something obvious.


The Details

We have a Nimble CS220G-X8 connected to a pair of Cisco UCS FIs as a direct-attached appliance.  VLANs are set so that FI-A has one and FI-B has another; the Nimble's tge1 ports run to FI-A and its tge2 ports run to FI-B.  We use VMFS datastores and NCM multipathing, which VMware reports is fully functional.  I haven't been connected to our ESXi hosts during a failover, so I'm not sure what they report for their paths during the failover.  We use the software iSCSI client and have one vmnic bound to one vmkernel port per iSCSI VLAN.
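Since the host-side view during the failover is the missing piece, one option is to capture `esxcli storage core path list` on a host before, during, and after the next controller failover and compare path states per device. Below is a minimal sketch for summarizing a saved capture, assuming the output is blocks separated by blank lines, each starting with an unindented path name followed by indented "Key: Value" fields that include `Device` and `State`. The function names `parse_path_blocks` and `summarize` are hypothetical, not part of any VMware or Nimble tool.

```python
from collections import Counter


def parse_path_blocks(text: str) -> list[dict[str, str]]:
    """Split saved path-list output into one dict of fields per path.

    Assumption: blank lines separate paths, and each field is an indented
    "Key: Value" line. The unindented first line of a block (the path name)
    is skipped because it is not a field.
    """
    paths: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current path block.
            if current:
                paths.append(current)
                current = {}
            continue
        if not line[0].isspace():
            continue  # path-name header line, not a field
        key, sep, value = line.strip().partition(": ")
        if sep:
            current[key] = value.strip()
    if current:
        paths.append(current)
    return paths


def summarize(text: str) -> dict[str, Counter]:
    """Map each device to a Counter of its path states (e.g. active, dead)."""
    summary: dict[str, Counter] = {}
    for path in parse_path_blocks(text):
        device = path.get("Device", "unknown")
        state = path.get("State", "unknown")
        summary.setdefault(device, Counter())[state] += 1
    return summary
```

Comparing the summaries from the three captures would show whether every path to a volume drops at once during the pause or whether some paths stay up while others fail over.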