Just wanted to share some of our experiences with SRM (VMware Site Recovery Manager) and Nimble. Last night we performed our fourth SRM test, and every time we run a test we learn something new. We thought we had resolved all our issues from the last test and were hoping this one would go smoothly so we could get home at a reasonable hour. Of course that didn't happen; this is IT, and something new always seems to pop up. Here is our system's current configuration.
- Nimble CS240's OS 1.4.6
- Nimble SRA 126.96.36.199
- vCenter 5.1
- SRM 5.1
- ESXi 5.0
- Servers in this SRM Recovery Plan - 23
- 50Mb MPLS between sites
We have tested SRM in the past and have had issues once the machines come up at the DR site, but last night was a little different. SRM usually just works; this time it failed, and then failed again and again and again. The good part is that SRM seems to be very forgiving: you just keep running the recovery plan until it works. That isn't a problem as long as you have scheduled enough downtime to get the system back to the production site. In our experience the recovery from HQ to DR usually takes about 60 minutes, the reprotect another 30 minutes or so, and the recovery back to HQ another 60 minutes. Then don't forget the second reprotect, again about 30 minutes in our case; all told, a full round trip runs about three hours.
Last night our first problem was that the vCenter Server service at the production site just stopped about 18 minutes into the recovery plan. We looked at the event viewer and noticed we had hit the 10GB limit on SQL Express for vCenter. This VMware KB article explains how to resolve the problem. We actually ran into this once before in the middle of the day and changed the retention period from 180 days to 90 days. We have now changed it to 30 days and will see what it looks like in 30 days. Running the purge took about 40 minutes on our vCenter server and cleaned up around 5GB of space. The recovery plan failed (this was a first) and, as you can imagine, we were a little worried about what would happen to the recovery.

So on to our second failure. We started the recovery again, and it skipped all the steps it had completed successfully the first time around. The first time it failed at "Prepare Protected Site VMs for Migration"; the second time it failed at "Change Recovery Site Storage to Writable." We aren't exactly sure what happened this time; all the VMware event said was "data.faults". It appears the VCAdmin account was getting locked out by an old backup process. This VMware KB article touches on it. We killed the backup device and tried again.

On to the third failure. This time the recovery plan failed while powering on the VMs and waiting for VMware Tools to respond. The default timeout is 300 seconds, and a couple of machines didn't respond in time. We ran the plan AGAIN, and it finally completed successfully. We were expecting this recovery plan to take about 60 minutes to complete; with all the errors and troubleshooting it took about 140 minutes. When we reversed the recovery plan from DR to production it took about 60 minutes.
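Since the 10GB SQL Express cap is what killed our first run, we now eyeball the vCenter data file size before kicking off a plan. Here is a minimal sketch of that check in Python; the function names and the idea of checking the .mdf on disk are ours, not anything VMware ships, so treat it as a starting point only.

```python
import os

# SQL Server Express (the edition bundled with vCenter 5.x) caps each
# database at 10 GB of data. Hitting it stopped writes and took our
# vCenter service down mid-recovery.
SQL_EXPRESS_LIMIT_BYTES = 10 * 1024**3

def near_limit(size_bytes, limit_bytes=SQL_EXPRESS_LIMIT_BYTES, warn_ratio=0.8):
    """True once the data file is within warn_ratio of the Express cap."""
    return size_bytes >= limit_bytes * warn_ratio

def check_vcenter_db(mdf_path):
    """Size-check the vCenter .mdf on disk before starting a recovery plan."""
    size = os.path.getsize(mdf_path)
    return size, near_limit(size)
```

Shrinking the retention period (we went 180 to 90 to 30 days) and running the purge from the KB article is still what actually frees the space; this just tells you whether you need to do it before the plan runs, not during it.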
Here is a breakdown of how long each step takes when everything works with no problems.
- Pre-synchronize Storage - 16 minutes
- Shutdown VMs at Protected Site - 5 minutes
- Resume VMs Suspended by Previous Recovery - Inactive
- Restore Hosts from Standby - Not used
- Prepare Protected Site VMs for Migration - 10 minutes
- Synchronize Storage - 12 minutes
- Suspend Non-critical VMs at Recovery Site - Inactive
- Change Recovery Site Storage to Writable - 5 minutes
- Power On VMs, reconfigure NICs, reboot again - 15 minutes
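Adding those steps up shows where our roughly 60-minute estimate comes from (the inactive and unused steps contribute nothing):

```python
# Step timings in minutes from one of our clean runs; inactive/unused
# steps (resume suspended VMs, restore hosts, suspend non-critical VMs)
# are omitted since they add nothing.
steps = {
    "Pre-synchronize Storage": 16,
    "Shutdown VMs at Protected Site": 5,
    "Prepare Protected Site VMs for Migration": 10,
    "Synchronize Storage": 12,
    "Change Recovery Site Storage to Writable": 5,
    "Power On VMs, reconfigure NICs, reboot": 15,
}
total = sum(steps.values())
print(f"Clean-run total: {total} minutes")  # 63 minutes
```

So a clean run pencils out to about 63 minutes, which lines up with the ~60 minutes we actually see when nothing goes wrong.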
Here are some of the other issues we have had with SRM in the past.
- Servers register in DNS and then fall out of DNS - this happens to every Windows Server 2008 box, and you need This Hotfix from Microsoft to resolve it.
- Our DNS in general needed some tweaking; we added some manual "AD Domain Services Connections" in AD Sites and Services so DNS would replicate more quickly.
- We had several small issues where things would not work; if you just bounce the box again, it usually works fine. I am not sure whether that is because too many servers are powering on at once or something else.
- SQL TempDB data and log drives have the wrong drive letters at the DR site. We have this problem because we did not want to replicate TempDB data or logs. This article explains why and how to do it; I didn't follow it exactly, which is why we have the problem. We just have to change the drive letter and reboot the SQL box, and that fixes it.
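Until we redo the TempDB setup properly, a quick pre-check saves us from discovering the drive-letter problem only when SQL Server refuses to start. A small sketch of that check; the example paths are our own layout (purely illustrative), so substitute whatever your TempDB data and log drives should be.

```python
import os

# Drive letters/paths we expect TempDB to live on at the DR site.
# These example paths are our own layout - adjust to match yours.
EXPECTED_TEMPDB_PATHS = [r"T:\TempDB\Data", r"L:\TempDB\Logs"]

def missing_tempdb_paths(paths=EXPECTED_TEMPDB_PATHS):
    """Return the expected TempDB paths that don't exist yet.

    If anything comes back, the recovered VM's drive letters are still
    wrong and SQL Server won't start until they're fixed and the box
    is rebooted.
    """
    return [p for p in paths if not os.path.isdir(p)]
```

Run it on the recovered SQL box right after power-on; an empty list means the drive letters landed where SQL Server expects them.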
- We recovered a couple of Domino servers (Traveler and Quickr); for whatever reason the Domino service will not start, so we have to start the Domino server as an application.
- Our Citrix boxes and NetScalers still have a few issues; we will have to test that again.
- SRM didn't change the IP address of one of the servers; we changed it manually and all was well.
- One of the servers was not responding on the network; it turns out the NIC was disconnected in the VMware settings. Enabled it and all was well.
- IBM Lotus Domino mail and app servers have no problems since they don't use SRM. They are clustered out of the box and just work. Domino has been the shining star of our entire DR plan: easy to set up, easy to maintain, and it works.
So onto the lessons learned.
- Test your Recovery Plan
- Test it again
- Estimate how long you think it will take to test the plan, then double that number and tell management.
- Check your SQL Express DB size before you start a recovery plan.
- Bounce vCenter and SRM at both sites before kicking off the plan. I am not sure whether this helps, but SRM had a lot of weird issues, and I am assuming a reboot beforehand would have eliminated some of our failures.
- Install the MS Hotfix for DNS issues beforehand.
- Change SanProvider.fixRecoveredDatastores in SRM so it will automatically rename your volumes on failback. This can be found in the SRM Admin Guide.
- Nimble just plain works, no issues with the SRA.
- Test your Recovery Plan
This is my experience with our setup and our recovery plan. If something doesn't make sense or sound right, let's discuss. I am always looking to learn new ways to do things.