I have a new Nimble CS-1000 connected via 10GbE with jumbo frames enabled, and I am in the process of validating host-side configuration settings.
I ran a simple read test that repeatedly re-reads the same 4 KB block using unbuffered direct I/O.
The goal of the test is to read directly from the Nimble controller's cache, so that it exercises only the network connection and network latency.
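For reference, here is a minimal sketch of that re-read loop in Python (my reconstruction of the idea, not the exact tool I used; the device path is illustrative):

```python
# Time N synchronous reads of the same block. O_DIRECT bypasses the page
# cache and read-ahead, so each read must go over the wire.
import mmap
import os
import time

def reread_iops(path, block_size=4096, count=1000, flags=os.O_DIRECT):
    fd = os.open(path, os.O_RDONLY | flags)
    buf = mmap.mmap(-1, block_size)      # page-aligned buffer, as O_DIRECT requires
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.preadv(fd, [buf], 0)      # always the same offset
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed

# reread_iops("/dev/mapper/mpatha")      # illustrative device path
```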
Ideally, on a tight network with low latencies, performance would approach wire speed as the I/O size increased, since the round-trip network latency would be amortized across larger requests. This also verifies that the large TCP window sizes are working properly.
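As a back-of-envelope model of that amortization (the RTT here is my assumption, not a measurement): each synchronous IO costs one round trip plus transfer time at wire speed, so larger requests dilute the fixed RTT.

```python
# Per-IO time ~= fixed round trip + transfer time at the 10GbE ceiling.
RTT_S = 0.00030               # assumed 0.3 ms round trip
WIRE_B_PER_S = 10e9 / 8       # 10GbE payload ceiling, ignoring protocol overhead

for size_kb in (4, 8, 64, 1024):
    size = size_kb * 1024
    t = RTT_S + size / WIRE_B_PER_S
    print(f"{size_kb:5d} KB: {1 / t:7.0f} IOPS, {size / t / 1e6:6.1f} MB/s")
```

Under this model, small reads are almost entirely round-trip-bound, which is why the IOPS figure barely moves between 4 KB and 8 KB.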
I am getting unusually low numbers ... only about 2,700 - 3,000 4 KB reads per second from cache of the SAME data.
I've seen the read IOPS increase to ~3,500 when I increase the read size to 8 KB ... this could be explainable given Nimble's preference for an 8 KB page size. If I increase the read size (from the same locations) to 1 MB or more, the I/O rate approaches 350 MB/s single-threaded, with no overlap or read-ahead ... which is a reasonable rate.
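Inverting those observed rates gives the implied per-IO service time, which lands squarely in the 0.3 - 0.4 ms round-trip range I am seeing on the network:

```python
# Convert observed synchronous IOPS into per-IO service time.
for iops in (2700, 3000, 3500):
    print(f"{iops} IOPS -> {1e6 / iops:.0f} us per IO")
# prints 370, 333, and 286 us respectively
```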
The iSCSI immediate-data maximum size is 32 KB.
CentOS 7.2 is the host OS, running the bundled open-iscsi initiator stack.
I've gathered host-side network statistics at the Ethernet and TCP layers; for a 1,000-IO test, 1,002 to 1,005 packets are sent and received. No retransmits, no delayed ACKs. I even have a latency plot, which shows a few outliers but otherwise looks fairly consistent ... except that the round-trip time is 0.3 to 0.4 milliseconds, with Linux "ping" reporting min/avg/max/mdev of 0.224/0.274/0.431/0.058 ms.
The Linux host is running as a VM under VMware ESXi 6.x on a relatively idle system, using the vmxnet3 paravirtual NIC.
I'm running dual-fabric 10GbE for iSCSI, on separate subnets. The Nimble volume is mounted as an external iSCSI LUN.
For this initial testing, I am only using a single 10GbE path.
The network path is vmxnet3 virtual NIC => VMware virtual switch => ESXi physical 10GbE NIC => External 10GbE Switch => Nimble.
I am not that familiar with iSCSI network latencies, especially in a virtualized environment. I do have extensive experience with 10GbE and faster networking in non-virtualized (non-iSCSI) environments, and with low-latency Fibre Channel (non-virtualized, single-hop).
When I increase the number of threads and introduce multipath, some unexpected curves show up.
Let me emphasize ... these synthetic tests are designed to read from the Nimble memory cache ... to stress the network connection.
I am running "noop" IO scheduler, on both block devices and the multipath pseudo-device.
My gut feeling is that I am experiencing some additional latency caused by some form of interrupt moderation, large receive offload, or host TCP stack coalescing ... but this test results in a single packet transmitted (the read request) and a single packet received (the data from the Nimble). Since this artificial test is unbuffered and synchronous, there is no opportunity for coalescing.
The latency chart shows no massive spikes of the kind that delayed ACKs or packet retransmits would cause, and the network statistics confirm there were no retransmits or delayed ACKs.
This is likely not a Nimble issue per se, but I was hoping that others in the community may have experienced similar behavior and identified which host or ESXi configuration setting was adding the extra latency that is so easily seen with "ping".
I will admit I have not implemented every procedure identified in the VMware best-practices guide for low-latency operation, but on a relatively idle, large ESXi host I was not expecting it to make that much of a difference ... at these performance levels.
Are these iSCSI latencies representative, or am I overlooking some configuration setting?
Thanks for your help.