
    In-guest stun times during failover

    Vlad Valeriu Velciu Wayfarer

      Hi all,

       

      I was wondering: what is the usual duration of the in-guest stuns you see during failovers, from the moment you lose connectivity to the moment you regain it (or from the moment the guest stops disk activity until it resumes)?

       

      We did some manual failovers and saw recovery times between 20 and 30 seconds. In my opinion that is a bit much, and I was hoping to find some examples from your experience.
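
      For what it's worth, a simple way to measure that window from inside a Linux guest is a timestamp probe along these lines (the mount point and interval are just examples):

          # append a timestamp to a file on the vmdk-backed disk twice a second;
          # a gap between consecutive entries shows how long IO was frozen
          while true; do
              date '+%s.%N' >> /mnt/probe/stun.log
              sync                  # push the write through to the disk
              sleep 0.5
          done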

       

      Thanks

        • Re: In-guest stun times during failover
          Chuck Colht Wayfarer

          Not sure what you mean by 'stuns'. I associate 'stun' with the VM freeze that occurs during vMotions, which affects the entire VM. Array failover only affects IO. I have seen iSCSI connections fail for up to 30 seconds during failover of the array controllers; most of the time it is around 15-20 seconds. Since my SCSI timeouts are longer than that, it isn't an issue for most VMs. However, some MS Failover Cluster nodes freak out if they miss a partner IO, so I have to monitor them during planned failovers. VMware datastores act the same, no issues there at all. I haven't had an unplanned failover yet, but I have tested by pulling controllers and the results are about the same.
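
          If you want to check what a guest will actually ride out, the disk timeout is easy to inspect. On a Linux guest it lives in sysfs (sda below is just an example device; Windows guests keep the equivalent in the Disk\TimeoutValue registry entry):

              # current SCSI disk timeout, in seconds (sda is an example device)
              cat /sys/block/sda/device/timeout
              # raise it to 180s, the value VMware Tools typically configures
              echo 180 > /sys/block/sda/device/timeout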

           

          Nimble likes to point out that lots of companies upgrade during business hours but I think that is silly. Why run the risk? I'd never schedule a failover during potentially heavy IO periods.

            • Re: In-guest stun times during failover
              Vlad Valeriu Velciu Wayfarer

              Thanks Chuck for sharing your experience.

               

              By stuns I mean the IO freeze to the VMDK, not a guest stun.

               

              I also haven't seen any issues with guests, but VMware datastores report all paths down after 10 seconds of connection loss. Other than this event, nothing major. I was just hoping it might be lower than 10 seconds, so as to avoid the APD errors.
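
              That 10-second window matches the software iSCSI RecoveryTimeout default, which can be inspected and changed per adapter with esxcli (vmhba35 is just an example adapter name; check Nimble's recommendation before changing anything):

                  # show the adapter's current iSCSI parameters, including RecoveryTimeout
                  esxcli iscsi adapter param get -A vmhba35
                  # example only: raise the session recovery timeout above the 10s default
                  esxcli iscsi adapter param set -A vmhba35 -k RecoveryTimeout -v 25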

               

              I agree with you on not upgrading during business hours.

            • Re: In-guest stun times during failover
              Alan Price Adventurer

              Hi Vlad.

              You might take a look at a post I made a while back regarding similar problems: A disruptive non-disruptive failover?. There's a lot of different feedback there. It's tied to UCS in particular, but it also deals with VMware. One thing I'd note is to make sure your NCM package is current, since it helps with path selection and failover to the array. Once I made all of the changes and updated our drivers, our guests stopped noticing the IO pause and just kept running.
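
              A quick sanity check that NCM is installed and actually claiming your volumes looks something like this (output naming varies by version):

                  # confirm the Nimble Connection Manager packages are on the host
                  esxcli software vib list | grep -i nimble
                  # check the Path Selection Policy per device; with NCM in place the
                  # Nimble volumes should show Nimble's PSP rather than a generic VMW_PSP_*
                  esxcli storage nmp device list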

               

              Alan

                • Re: In-guest stun times during failover
                  Vlad Valeriu Velciu Wayfarer

                  Hi Alan,

                   

                  Thanks for chipping in. I went through your post in the past and checked the host network from the start. Everything is up to date, and when pinging the Nimble iSCSI interface from the host during a failover, we only see a ping fail or get delayed. So the switch and NICs work fine. I went through the ESXi logs, and it seems that Nimble doesn't start processing iSCSI immediately, only after 20-30 seconds.
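
                  (For anyone wanting to reproduce the test, it was essentially a ping from the host's iSCSI vmkernel port; vmk2 and the address below are examples.)

                      # ping the Nimble data IP from the host's iSCSI vmkernel interface
                      vmkping -I vmk2 -c 60 192.168.100.50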

                   

                  I am waiting on support to get a confirmation.

                   

                  Vlad

                    • Re: In-guest stun times during failover
                      Alan Price Adventurer

                      Sounds good. As I noted in the post I marked as an answer, when Support pulled our logs they found abnormal login timeouts. Installing Cisco's enic drivers helped get those under control, along with keeping our ESXi and NCM patches up to date.

                       

                      Out of curiosity, do you have your Nimble connected to dedicated uplink switches? Our UCS setup uses Appliance Ports, so some minor issues come up because more paths and VLAN routes have to go down during failover of the controllers or the UCS FIs. In previous iSCSI setups I've run, we had fully redundant uplink switches, so there was very little disruption to the paths even during an array failover (since there were multiple physical paths to a given interface or VLAN). I've opted not to purchase switches just for this purpose, since UCS and Nimble are designed to handle changes gracefully, but that means I've seen slightly longer, though rarely problematic, failovers.
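
                      If you want to see exactly how many paths a host has to ride out a failover with, the core path list is the quickest check:

                          # list every path the host sees, grouped per device
                          esxcli storage core path list
                          # or just count the active ones
                          esxcli storage core path list | grep -c 'State: active'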

                       

                      Alan

                        • Re: In-guest stun times during failover
                          Vlad Valeriu Velciu Wayfarer

                          We have an HP 5412zl chassis with two 8-port 10 GbE modules, used only for iSCSI traffic. Each module is on a separate VLAN.

                           

                          During failover we have not seen any IP disruption, only the occasional delayed ping between the host iSCSI port and the Nimble iSCSI port, which is expected as the target IP floats between controllers.

                           

                          According to the default VMware timeouts, and to my understanding of them, after 10 seconds of no iSCSI servicing (RecoveryTimeout) the ESXi host drops the iSCSI connection and tries to log in again. It then retries the login every 5 seconds until it succeeds.
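
                          You can watch that drop-and-retry cycle from the ESXi shell while a failover runs (the log paths below are the ESXi 5.x defaults):

                              # iSCSI session drops and the login retries show up in vmkernel.log
                              tail -f /var/log/vmkernel.log | grep -i iscsi
                              # APD enter/exit events are recorded in vobd.log
                              grep -i apd /var/log/vobd.log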

                           

                          We have applied the settings from KB-000087, although only LoginTimeout seemed relevant, even for a small environment like ours, but we have seen no change in behavior.
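
                          For completeness, the LoginTimeout change is applied per adapter like this (vmhba35 and the value are examples; LoginTimeout only became settable around ESXi 5.1, if I remember right):

                              # example only: set the iSCSI login timeout and verify it took
                              esxcli iscsi adapter param set -A vmhba35 -k LoginTimeout -v 30
                              esxcli iscsi adapter param get -A vmhba35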