Hello sir, welcome to the Nimble forums!
These are great questions, and are always hot topics with every storage system. Unfortunately the answer really is "it depends!".
In my experience, users create large, multi-10s-of-TB volumes for VMFS datastores because they're easier to manage than lots of little datastores. Problems often occur with this method, though, as there tend to be I/O, locking and timeout issues on the datastore, since you only have a set number of connections to the volume. For example, in a typical dual 10Gb switch, dual 10Gb NIC, single 10Gb array setup there would be a total of 8 iSCSI or Fibre Channel connections per volume per ESX host. If you have 30-40 VMs on said volume, you're now funnelling 30-40 servers down 8 connections, which can lead to performance issues.
The other thing to consider is the dreaded "noisy neighbour" VM - a VM that dominates I/O, throughput or latency - which can also hurt performance, as the other 30-40 VMs are effectively starved of resources. Troubleshooting that problem is hard in its own right, which is why we designed VMVision as part of InfoSight: to help find and troubleshoot those pesky noisy neighbours.
Finally, be aware of the dreaded VMware snapshot issues: every snapshot of the datastore will pause and potentially disrupt all 30-40 VMs at the same time, and the process can only continue after all VMs have responded. If each VM takes 30-90 seconds for the workflow, that's a lot of time waiting!
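To make the oversubscription point concrete, here's a minimal back-of-the-envelope sketch in Python. The function name and the numbers are purely illustrative (taken from the example above), not a Nimble formula:

```python
def per_vm_path_share(total_paths: int, vm_count: int) -> float:
    """Average number of storage connections each VM effectively gets.

    total_paths: iSCSI/FC connections per volume per ESX host
    vm_count:    VMs placed on that datastore
    """
    return total_paths / vm_count

# Dual 10Gb switches, dual 10Gb NICs, single array: 8 connections per volume.
# 40 VMs sharing them works out to 0.2 connections per VM.
print(per_vm_path_share(8, 40))  # 0.2
```

The smaller that number, the more contention you can expect on the volume.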
Whilst in no way affiliated with Nimble whatsoever, there was a great piece of work carried out on this subject by Jason Boche here, which is an interesting read. In his studies he concluded that the optimum balance of VMs per volume was 10 high-I/O VMs, 15 average-I/O VMs or 20 low-I/O VMs. Of course the next question is what constitutes high I/O vs average I/O, but it gives you good guidance.
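If you want to turn that guidance into a rough volume count, a sketch like the one below works. The tier labels and the helper itself are my own illustration of Boche's numbers, not anything official:

```python
import math

# Per-volume VM counts quoted above (Jason Boche's study), keyed by I/O tier.
RECOMMENDED_VMS_PER_VOLUME = {"high": 10, "average": 15, "low": 20}

def volumes_needed(vm_counts: dict) -> int:
    """Estimate how many datastores you'd need if each I/O tier
    is kept on its own set of volumes."""
    return sum(math.ceil(n / RECOMMENDED_VMS_PER_VOLUME[tier])
               for tier, n in vm_counts.items())

# e.g. 12 high-I/O, 30 average, 40 low -> 2 + 2 + 2 = 6 volumes
print(volumes_needed({"high": 12, "average": 30, "low": 40}))  # 6
```

Classifying your VMs into tiers is the hard part, of course; the arithmetic is the easy bit.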
Finally - the whole discussion above falls away with the introduction of VMware Virtual Volumes (aka VVols), which are available as part of vSphere 6.x and NimbleOS 3. If you're interested in finding out more about VVols, I recorded a webinar + demo on the implementation, and it's available on YouTube (I've linked directly to the video below). It's a 1.0 implementation today - expect those little bugs to be ironed out by VMware in the future.
Hope this helps!
Thanks for your reply.
It gave me some things to work with.
I can see that my 4 volumes are too few to handle my 144 VMs, so I will create a few more, smaller ones, to spread the I/O across more volumes.
Do you have any input on thin vs. thick provisioning in VMware when the underlying volumes are thin provisioned on the array?
Allan R. Larsen
Nick's points are great and match my experiences so far.
I am running almost 500 VMs on my Nimble 500 arrays. I have found that maximizing the network connections to the arrays and making sure they are on switches that can handle the load is a good first step. Originally we had 1Gb copper but moved to 10Gb networking, which helped with performance. Another thing to note: if you have any MS SQL VMs, I have found they tend to run better with the database, log, and temp volumes directly attached to the VM using the Nimble Windows Toolkit and the MS iSCSI initiator.
Finally, watch any snapshots carefully. If there is a busy VM that causes things to time out on a volume, VMware sometimes will not commit the snap and you end up with multiple "orphaned" snapshots that you don't see until you get to the datastore level. This definitely causes performance issues when a VM is running on top of a week's worth of snapshots. To be clear, I am referring to the VMware snapshot that occurs when you take a Nimble snapshot.
I have had very good performance with my VMs since I increased the network throughput and learned to keep a close eye on snapshots. Oh, one more thing: don't forget to have NCM (Nimble Connection Manager) installed on the hosts.
Hope this gives you some additional insight.
I'm not an expert on Nimble best practices but I can tell you what is working well for me.
We've been using a CS500 for about half a year and it's been excellent. In fact, it's been so fast compared to what we're used to that we haven't spent a lot of time optimizing and balancing our setup. We have our datastores sized at 6 TB, which allows us to put 15-30 thin VMs on each one. That means we range from right where Nimble recommends (20 VMs/datastore is what I was told) to 50% over what they recommend. Despite that, we still average right at 1 millisecond latency for reads and 0.16 milliseconds for writes.
Aside from these general datastores, we also have a few that are reserved for specific VMs such as large/busy file servers or DB servers. I've found that this sizing lets us just ask, "Will my VM fit in this datastore without filling it past VMware's 80% warning threshold?" If yes, it can go there; if no, use a different datastore or possibly create a new one. Since the VMware datastore selection windows show space rather than VM or VMDK count, this seems to be the simplest way of distributing VMs.
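That "will it fit under the 80% warning?" check is simple enough to express as a one-liner. This is just a sketch of the rule of thumb described above; the function name and the default threshold are mine, though 80% is VMware's default datastore usage alarm:

```python
def fits_in_datastore(datastore_tb: float, used_tb: float, vm_tb: float,
                      warn_threshold: float = 0.80) -> bool:
    """Would adding a VM of vm_tb push usage past the warning threshold?"""
    return (used_tb + vm_tb) / datastore_tb <= warn_threshold

# 6 TB datastore with 3.5 TB used: a 1 TB VM lands at 4.5/6 = 75% -> OK.
print(fits_in_datastore(6.0, 3.5, 1.0))  # True
# With 4.0 TB used the same VM lands at 5/6 ~ 83% -> pick another datastore.
print(fits_in_datastore(6.0, 4.0, 1.0))  # False
```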
Once we deploy NimbleOS 3 (whenever it's available for hybrid arrays) we will be migrating all of our VMs to new VMFS datastores to implement deduplication and encryption. When this happens, I also plan on changing our datastores to 8 TB and all thick provisioned. The reason for this change is because it's very hard to get a good, high level view of how much data is actually being consumed by our VMware environment. Since Nimble only knows how much data a VMFS datastore ever grew to and those datastores in turn only know how big a VMDK ever got, both systems are unreliable. Add to that the great data reduction rates we're seeing on the CS500 which also skew the reality of data consumed and it's very hard to get that overview. Thus, we are planning to make all of our VMs thick provisioned since multiple layers of thin provisioning don't provide any benefit and we feel that Nimble has the best view of how much data has been consumed so we will only thin provision on the array.
I hope that helps.
Thanks for all of your feedback.
It's really good to read how your setups are running and how you have built them.
We are running FC to the Nimble SAN at 8Gbit, and are planning to upgrade to 16Gbit FC in a few weeks.
My last concern is the number of volumes in the Nimble.
4 volumes at 10 TB each, to cover 144 VMs.
Is this a good idea? (Apparently a Nimble technician created the volumes when my company bought the unit.)
I'm an SE here at Nimble. As Nick pointed out above, it's more about shared SCSI queues per LUN/host in ESX than the size of the datastore. All VMs on a given host/datastore share the same SCSI queue, so depending on the amount of I/O per VM you may want to limit the number of VMs per datastore.
In your example you have 144 VMs across 4 datastores, so 36 VMs per datastore. The default number of outstanding I/Os is 32, and these would be shared by the 36 VMs on that datastore/LUN. The next question is how many hosts you have in your cluster. If you had one host, all 36 VMs would share the 32 outstanding I/Os (possibly still OK for low-I/O VMs), but if you have, say, 4 hosts and the VMs are evenly balanced, you would have 9 VMs per host on that datastore sharing the 32 outstanding I/Os for that LUN, which would likely be fine.
I'm assuming you do have more than 2 hosts given the number of VMs so I would expect this design to work ok.
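The arithmetic in the example above can be sketched like this. It's a rough model only (it assumes perfectly even VM balance, and the function name is mine, not a VMware setting):

```python
import math

def outstanding_ios_per_vm(queue_depth: int, total_vms: int, hosts: int) -> float:
    """Average share of the per-LUN queue each VM gets on one host,
    assuming VMs are evenly balanced across the cluster."""
    vms_per_host = math.ceil(total_vms / hosts)
    return queue_depth / vms_per_host

# 36 VMs on one datastore, default per-LUN queue depth of 32:
print(outstanding_ios_per_vm(32, 36, 1))  # ~0.89 -> heavily shared queue
print(outstanding_ios_per_vm(32, 36, 4))  # ~3.56 -> 9 VMs/host, likely fine
```

More hosts (or fewer VMs per datastore) gives each VM a bigger slice of the queue, which is the whole argument for spreading the 144 VMs across more, smaller volumes.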
From our experience with vSphere 6 and CS700 arrays, we have found our balance point to be 2 TB volumes. I know, right? We had to balance the number of datastores per host against the number of VMs per datastore and how the Nimble handles snapshots. Especially if you do any vCenter-coordinated snapshots, you really have to watch the number of VMs on those datastores or you run into timeout issues and orphaned snapshots. We have set our VMs to be thin provisioned and our Nimble volumes to 0% reserve for a totally thin-provisioned environment. vSphere thinks it has filled up the datastores long before Nimble does; so far we haven't had any issues with that. If you have any questions, please let me know and I would be happy to share our experiences with you.