11 Replies Latest reply: May 14, 2015 7:38 AM by Gary Martin

    VMware ESXi w/ iSCSI boot - controller failover behavior

    jjohnston1127 Newbie

      Hi,

       

      I have run into many occurrences over time where ESXi hosts throw up an alarm stating they lost connectivity to the datastore backing the boot filesystem if they were booted from iSCSI.  I have experienced it on every array that supports iSCSI boot, and the only fix is to restart the management agents on the hosts.  This usually happens during a storage controller failover, which typically takes anywhere from 15-30 seconds.

       

      I was doing some research and found a software iSCSI adapter setting named Recovery Timeout that, according to Cormac Hogan, is the number of seconds before an active path is marked dead.  It is currently set to 10 seconds.  I was wondering whether there would be any adverse effects anyone could think of if I changed the setting to something like 60 seconds.  My thought is that within 60 seconds the controller failover should have completed, and in the rare case that both controllers are dead, it wouldn't matter that 60 seconds went by before the hosts started freaking out.
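
      For reference, the same setting can be checked and changed from the ESXi shell with something like the commands below (vmhba33 is just a placeholder for the software iSCSI adapter name on your host):

      # Show the current parameters for the software iSCSI adapter, including RecoveryTimeout
      esxcli iscsi adapter param get --adapter=vmhba33

      # Raise RecoveryTimeout from the default 10 seconds to 60 seconds
      esxcli iscsi adapter param set --adapter=vmhba33 --key=RecoveryTimeout --value=60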


      Thoughts?

        • Re: VMware ESXi w/ iSCSI boot - controller failover behavior
          Nick Dyer Navigator

          Hi,

           

           Out of interest, do you have the Nimble NCM for VMware (PSP) installed on your ESX hosts? I'm curious whether it would resolve the path timeout issue you've observed.

          • Re: VMware ESXi w/ iSCSI boot - controller failover behavior
            Kevin Losey Newbie

            Interested in this also. Have you found the correct fix? I get the warning on all my hosts, and NCM (NCS and PSP) is installed. I am manually failing over controllers for pre-production testing.

             

            I changed the Recovery Timeout to 60 and still get the warnings.

             

            jjohnston1127, is this post yours? I am not finding much info online about this.

            iSCSI boot hosts - lose access to boot file system upon controller failover

              • Re: VMware ESXi w/ iSCSI boot - controller failover behavior
                jjohnston1127 Newbie

                I have not found a fix.  I have multiple targets for the boot LUN: the original static target that VMware sets at boot, pointing to TG1, plus one static target for each of the discovery IPs for the LUN.  When I look at the paths for the datastore inside ESXi, all three show active (two going to TG1 and one going to the TG2 IP), but it seems that if the path to the LUN the server booted from goes down, VMware freaks out even though there is still an active path to the datastore.  I did post that in the VMware communities, hoping someone would have run into this before or would tell me I have something configured wrong, but no luck.  Nothing obvious stands out to me.
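
                If it helps, this is roughly how the same path states can be checked from the ESXi shell (the naa ID is a placeholder for the boot volume's device identifier):

                # List the iSCSI sessions on the software adapter
                esxcli iscsi session list

                # List the paths to the boot device and their runtime state (active/dead)
                esxcli storage nmp path list --device naa.xxxxxxxxxxxxxxxx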

                  • Re: VMware ESXi w/ iSCSI boot - controller failover behavior
                    Kevin Losey Newbie

                    In my testing, manual failover does work: no loss of datastores or VM corruption; I only receive the warning on the hosts.

                     

                    I have been running FC-attached storage with VMware since 2002 but am new to iSCSI. I too believe my iSCSI config is good.

                     

                    I would be willing to compare configs sometime.

                     

                    Kevin

                      • Re: VMware ESXi w/ iSCSI boot - controller failover behavior
                        jjohnston1127 Newbie

                        Yeah, I've been doing iSCSI for a long time and iSCSI boot for quite some time. My experience is that any datastore with multiple paths fails over fine, but the boot LUN for whatever reason does not, or at least not quickly enough to avoid the warning message and having to restart the host management agents.

                         

                         My iSCSI setup is pretty standard, per the recommended configuration: two iSCSI vmkernel ports, each bound to one host vmnic with no standby adapters, both vmkernels selected under the adapter's network port binding, and dynamic discovery pointed at both iSCSI discovery IPs.
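
                         In esxcli terms the binding and discovery pieces look roughly like this (vmhba33, vmk1/vmk2, and the discovery IPs are placeholders for the real values):

                         # Bind both iSCSI vmkernel ports to the software iSCSI adapter
                         esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk1
                         esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk2

                         # Point dynamic (send targets) discovery at both discovery IPs
                         esxcli iscsi adapter discovery sendtarget add --adapter=vmhba33 --address=192.168.10.10:3260
                         esxcli iscsi adapter discovery sendtarget add --adapter=vmhba33 --address=192.168.20.10:3260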

                          • Re: VMware ESXi w/ iSCSI boot - controller failover behavior
                            Kevin Losey Newbie

                            Similar configuration here. I am going to hit up my local Nimble engineer and will post any findings. Thanks.

                            • Re: VMware ESXi w/ iSCSI boot - controller failover behavior
                              Brad Fluit Newbie

                              I've been doing iSCSI boot for some time as well, also right in line with best practices.  Not until yesterday did I understand why the lost connectivity error comes up when there is an iSCSI interruption on the boot volume.  I'll try to summarize.

                               

                              - I am usually working in Cisco UCS environments where we create 2 iSCSI boot options.  It's important to note that this doesn't equal multipathing, since the BIOS only chooses one of those paths to boot from.  Once booted, that remains the only NIC/path used for the boot LUN.

                              - MPIO is not supported for iSCSI boot LUNs/volumes, so the path chosen at boot time remains the only path at runtime.  This explains why the error is only seen on the boot volume, while all other volumes continue to run fine after a failover.

                              - If the path for the boot LUN/volume is interrupted, vSphere throws the warning we are all familiar with and never clears it.  After a controller failover, my experience has been that the host is able to see its boot volume again, but the error remains.

                              - A reboot of the host or, less disruptively, a restart of the management agents on the host will clear the error.

                               

                              Hope that helps you out a little.

                                • Re: VMware ESXi w/ iSCSI boot - controller failover behavior
                                  Kevin Losey Newbie

                                  Thanks for the reply. I understand why we get the warning; I was hoping there was a setting to eliminate it.

                                   

                                  Thanks guys

                                    • Re: VMware ESXi w/ iSCSI boot - controller failover behavior
                                      Mark Weimer Wayfarer

                                      I have the exact same issue with my iSCSI boot volumes and I found a solution/workaround.

                                       

                                       If you lose connectivity on the NIC that carries the boot LUN (switch reboot, cable disconnect, controller reboot/failover, etc.), you will see the following error: Lost connectivity to the device backing the boot filesystem. As a result, host configuration changes will not be saved to persistent storage. This error is displayed because connectivity is lost and iSCSI boot does not support multipathing, which means that if connectivity is lost between the controller on the Nimble and the NIC on the host, the host can no longer access its boot LUN and cannot write logs, etc. The good news is that the whole ESXi OS is loaded into memory, so there is no outage for the VMs or the hosts; once connectivity is restored, the host can access the storage again. The bad news is that the error does not clear automatically. I can neither confirm nor deny that the host does in fact re-establish connectivity automatically after the failover and would be able to write logs even while still displaying the error message. I suspect that this is the case, but perhaps someone with a deeper understanding can speak to that.

                                       

                                      The easiest way to fix this error/warning is to put the host into maintenance mode and reboot it. Unfortunately, this takes time and requires lots of vMotion activity.
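
                                       If you manage the hosts with PowerCLI, that route can be scripted with something along these lines (the host name is a placeholder, and -Evacuate assumes DRS can vMotion the running VMs off):

                                       # Evacuate the host, reboot it, then bring it back out of maintenance mode
                                       $esx = Get-VMHost -Name "esx01.example.local"
                                       Set-VMHost -VMHost $esx -State Maintenance -Evacuate | Out-Null
                                       Restart-VMHost -VMHost $esx -Confirm:$false
                                       # ...wait for the host to finish rebooting, then:
                                       Set-VMHost -VMHost $esx -State Connected | Out-Null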

                                       

                                      The other way to resolve this (and it can be done without a reboot) is to restart the management agents on the host. This can be done in two ways:

                                       

                                      1) Use the remote KVM of each host, log into the ESXi console and follow the menu options to restart the management agents.

                                      2) SSH into each host and run the commands to restart the management agents.
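
                                       For option 2, the commands from an SSH session are typically something like:

                                       # Restart the host agent and the vCenter agent
                                       /etc/init.d/hostd restart
                                       /etc/init.d/vpxa restart

                                       # Or restart all of the management services in one go
                                       services.sh restart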

                                       

                                      I'm including a link here to a blog post that outlines these processes well.

                                      https://fvandonk.wordpress.com/2014/01/08/iscsi-boot-disk-disconnect-fix/

                                        • Re: VMware ESXi w/ iSCSI boot - controller failover behavior
                                          Gary Martin Wayfarer

                                          Hi,

                                           

                                           Just picked up this thread because I am about to embark on moving our ESX boot volumes from NetApp to Nimble.  It's quite timely, too, as I recently had an issue where I took down one NetApp node in a cluster, but because it was running in single-image mode the boot configuration had locked in the boot path (as found above, there is no multipathing for the boot volume).  This killed off a few hosts and did some things I didn't like.  I was hopeful that moving to Nimble might free me of this limitation, but it looks like it might actually be ever so slightly worse (I was dumb to take down my NetApp node, as I had disabled clustering).

                                           

                                           So, it looks like I will need to rebuild my ESX hosts with a slightly bigger datastore (currently booting from a 1GB LUN, with no local datastore and a remote datastore for swap/logs) booting from Nimble.  I'm almost tempted to add local disks to my servers, but that seems like a waste of UCS (I'm trying to keep the hosts stateless).

                                           

                                           I'll keep in mind the information here.  I found some info on the VMware Communities site about using PowerShell and PowerCLI to restart the management agents (which might be quicker than enabling SSH or using the KVM console on each box).

                                           

                                          PowerCLI command to restart management agents o... | VMware Communities

                                           

                                           The script is:

                                           # Restart the vpxa management agent on a single host (replace MyEsx with the host name)
                                           Get-VMHostService -VMHost "MyEsx" | Where-Object {$_.Key -eq "vpxa"} | Restart-VMHostService -Confirm:$false -ErrorAction SilentlyContinue


                                           I could probably get the hosts in a cluster and pipe them into that command to restart the agent on each host, maybe with a sleep between each one so they don't all stop responding at the same time; something like the sketch below.
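
                                           Something along these lines (the cluster name and the 60-second pause are placeholders to adjust):

                                           # Restart vpxa on every host in a cluster, one host at a time
                                           Get-Cluster -Name "MyCluster" | Get-VMHost | ForEach-Object {
                                               $esx = $_
                                               Get-VMHostService -VMHost $esx | Where-Object {$_.Key -eq "vpxa"} | Restart-VMHostService -Confirm:$false -ErrorAction SilentlyContinue
                                               # give each host's agent time to come back before touching the next one
                                               Start-Sleep -Seconds 60
                                           }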


                                           Ideally VMware would allow changing the configuration location (/etc) to a datastore (where MPIO would be available), but then there is really an overlap between that and PXE booting.  I'd love to do PXE boot, but we don't have the kind of money required for Enterprise Plus licensing.  Failing that, even just a way to adjust that disk timeout other than the one already tried would help.