NFS Shared Datastore 50% Throughput

jgebhart2 · ‎2019-08-05

I'm seeing a weird issue that seems to be specific to a particular application.

First, this application is a 3rd party/customized application which processes data sets around 100GB in size. The raw data is read, analyzed, and written back out.

We moved this data set from physical servers to VMware with NFS datastores. For various reasons, it's a single server which is processing this data, and in some cases, we only have 1 GbE available for the storage. I don't expect to see more than 1 Gbps throughput, but I do expect to see it in each direction as the network is Full Duplex.

However, when this data set is processed, I see almost exactly 1 Gbps of throughput... half read, half write... as if there's a Half Duplex link somewhere in the mix. After enlisting the help of our server and network teams and not finding the smoking gun, I fired up a synthetic workload (iometer) and was able to validate that we get 1 Gbps in each direction simultaneously. So I know there is no misconfiguration - the infrastructure is capable of Full Duplex 1 Gbps.

However, this application reaches a very suspicious plateau at 500 Mbps read and 500 Mbps write. Unfortunately, when this occurs, something gets so congested that disk queue length on the guest shoots up to about 25, which is roughly 1 per spindle in the aggregate, and the server becomes extremely sluggish and unresponsive. It takes roughly 5 minutes to log in via RDP.
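To put the plateau in perspective, here is a rough back-of-the-envelope sketch of how long one direction of a ~100 GB pass takes at the full 1 Gbps versus the observed 500 Mbps. It assumes decimal units and ignores protocol overhead; the function name is just illustrative.

```python
def transfer_seconds(data_gb, rate_mbps):
    """Seconds to move data_gb (decimal gigabytes) at rate_mbps (megabits/s)."""
    bytes_total = data_gb * 1_000_000_000
    bytes_per_sec = rate_mbps * 1_000_000 / 8  # megabits/s -> bytes/s
    return bytes_total / bytes_per_sec

# Assumption: ~100 GB moved in one direction, no protocol overhead.
full_rate = transfer_seconds(100, 1000)  # link fully used in that direction
plateau = transfer_seconds(100, 500)     # observed 500 Mbps plateau
print(f"full rate: {full_rate:.0f} s, plateau: {plateau:.0f} s")
```

At the plateau, each direction takes twice as long as the wire would allow.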

While testing, I moved the workload to a 10 GbE network and again verified with iometer that I could push more than 5 Gbps. However, the application does not seem to be able to. It doesn't have the disk queue length issue and sluggishness, but I have a feeling that if I increase the workload, I'll find that "50% limit" on the 10 GbE network, just like on the 1 GbE network.

I moved the server's C:\ disk to block storage and left the data drive on the NFS datastore, and this seems to have resolved the sluggish responsiveness of the guest, even though disk queue length on the guest is 25 on the data drive. So I think I found the workaround... what I'd like to understand better is why disk queue length jumps so high when the application is only filling the wire 50%.
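One way to reason about the queue length is Little's Law (outstanding I/Os = I/O rate x time per I/O). The sketch below estimates the per-I/O latency implied by a queue length of 25 at the observed throughput. The 64 KiB I/O size is purely an assumption for illustration, since I don't know the application's actual I/O size, and the function name is hypothetical.

```python
def implied_latency_ms(queue_length, throughput_mbps, io_size_bytes):
    """Little's Law: L = rate * W, so W = L / rate (returned in milliseconds)."""
    bytes_per_sec = throughput_mbps * 1_000_000 / 8  # megabits/s -> bytes/s
    iops = bytes_per_sec / io_size_bytes             # I/O completions per second
    return queue_length / iops * 1000

# Observed: queue length ~25, 500 Mbps read + 500 Mbps write = 1000 Mbps combined.
# Assumed (not measured): 64 KiB per I/O.
latency = implied_latency_ms(25, 1000, 64 * 1024)
print(f"implied average latency per I/O: {latency:.1f} ms")
```

Under those assumptions, each I/O spends about 13 ms in flight as seen by the guest. A smaller real I/O size would mean a lower implied latency, and a larger one a higher latency, so the actual I/O size is worth measuring before drawing conclusions.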

Has anyone ever seen this before? Can anyone offer an explanation?

SpindleNinja · ‎2019-08-06

What was your NFS queue depth setting on the VMware side?

jgebhart2 · ‎2019-08-06

It's the default for this version, 4294967295
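For context, that number is the largest value an unsigned 32-bit integer can hold, so the default is effectively "no limit". A quick check:

```python
default_queue_depth = 4294967295

# 2**32 - 1 is the maximum unsigned 32-bit value (0xFFFFFFFF).
is_uint32_max = default_queue_depth == 2**32 - 1 == 0xFFFFFFFF
print(is_uint32_max)
```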

SpindleNinja · ‎2019-08-06

Try 128.

jgebhart2 · ‎2019-08-06

Setting the queue depth to 128 actually made the condition worse. We are completely unable to log into the server remotely with the workload running.

SpindleNinja · ‎2019-08-06

Yikes... sorry about that. It's helped in the past on some troubleshooting issues I've run through. 64 is another option.

And it's just this one VM that's having this issue?

And have you opened a performance case with NetApp?

jgebhart2 · ‎2019-08-06

More than one VM... There are several that process this type of data set around the country and they have the same behavior.

Haven't opened a performance case because I can't find evidence of a performance problem from the NetApp perspective. Volume latency is fine, network throughput is at 50%, and I can push a larger load with a synthetic workload.

SpindleNinja · ‎2019-08-07

From the network side, is everything pretty consistent / best practices applied?

jgebhart2 · ‎2019-08-09

As far as I can tell, we only seem to differ from best practice in that we are not using 9000 byte frames. However, I don't believe that is an issue in this case because again, with the synthetic tool, I can push the link to 100% utilization with multiple different workloads modeled.

SpindleNinja · ‎2019-08-09

That's consistent across all datastores/hosts though, correct? I.e., no host has Jumbo Frames running.
