Network Throughput NetApp FAS8060

Neutrino · ‎2020-05-24

I have a strange issue with an NFS-provisioned datastore in a VMware environment.

For some reason I can only achieve around 5.6 Gbit/s, which is far from the 10 Gbit/s network speed between the storage and the ESXi host.

We have 4 x 10 Gbit/s ports used for NFS data. They are connected to the upstream switches and configured according to best practices.

NetApp: NetApp Release 9.6P3

ESXi: VMware ESXi, 6.7.0, 15160138

The limited throughput was first observed with Storage vMotion, which appeared to be capped.

We have carefully examined our configuration, and now the best practices described on the link below are also in place.

https://library.netapp.com/ecmdocs/ECMLP2843689/html/GUID-346ACB95-6AD4-4DEA-8901-C9697AC3530F.html

However "the limit" is still there.

In addition, I have verified the speed with the following command, run from the NetApp toward a VM running iperf on the ESXi host itself:

network test-link run-test -vserver my-veserver -destination ip.addr.of.iperf

I found a discussion about a similar issue, but unfortunately it did not help me. Somehow the network behaves as if it were half duplex.

Any suggestions or help will be greatly appreciated.

Ontapforrum · ‎2020-05-25

Hi,

Some pointers.

There could be a number of factors responsible for the slow bandwidth. However, the very first thing I would like you to check is flow-control (just in case you haven't considered it yet).

flow-control: what are the flow-control settings end-to-end in your environment?

Over the years, NetApp's recommendations for flow-control have evolved. At present NetApp only recommends disabling flow-control on cluster ports (where it is disabled by default); the rest of the ports, such as mgmt and data, should be in line with the rest of the settings in your network.

My advice would be to check:
1) What are the flow-control settings on the ESXi host interface?
2) What are the flow-control settings on the switch?
3) What are the flow-control settings on the NetApp?

To identify the flow-control settings on the NetApp, please run these commands:

First: Identify the physical ports that are part of the vlan-ifgrp serving ESXi:
::> network port ifgrp show -node node-xx

Next: Identify the flow-control on the physical ports bonded to the vlan-ifgrp (not the settings of the ifgrp or VLAN themselves):
::> network port show -fields flowcontrol-admin,flowcontrol-oper -node node-xx

Please note: flow-control only applies to physical ports. It is not applicable to an interface group (ifgrp) or VLAN, so you don't have to note their values.

Once you know the current flow-control settings on the NetApp side (physical ports), ensure they are the same end-to-end. For example: if it is disabled (set to none), then disable flow-control on the switch and on the host side as well.

According to newer studies by network evangelists and use-case recommendations:
TCP is the 'real' end-to-end flow control mechanism. It is more granular and scalable, and it handles this better higher up the stack (instead of pacing data for the entire port, it is better handled up the stack by TCP). Therefore, the recommendation would be to disable flow-control end-to-end (i.e. on NetApp, switch and host).

Give this a try and see if it makes any difference. If it's already done (flow-control disabled) and you are still experiencing slowness, please log a ticket with NetApp.
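The end-to-end comparison can be written down as a small check. This is only a sketch: the device names and the values in `observed` are hypothetical, typed in by hand after looking at each device, and nothing here queries the NetApp, the switch or ESXi.

```python
def flow_control_mismatches(settings: dict[str, str]) -> list[str]:
    """Return the devices whose setting differs from the NetApp physical port."""
    reference = settings["netapp"]
    return [dev for dev, value in settings.items() if value != reference]


# Hypothetical values for illustration, not taken from the thread.
observed = {"netapp": "full", "switch": "none", "esxi": "none"}
print(flow_control_mismatches(observed))
```

An empty list would mean the three hops agree; any name in the list is a hop to bring in line with the NetApp port.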

Thanks!

Neutrino · ‎2020-06-07

Hello,

Please accept my apologies for the delayed answer on this.

We have tried the suggestions. We ended up with flow control disabled on all the participating devices.

In the initial tests, flow control was still enabled on the NetApp nodes but disabled on the switches.

However, the speed still appears to be limited. The limit actually concerns a single TCP stream. In theory this has something to do with the TCP window size, so I went and changed that on the NetApp for a particular SVM:
https://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.cdot-famg-nfs%2FGUID-52C721B0-567A-4EA0-A534-29E0713CC972.html
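To make the window-size theory concrete: a single TCP stream can have at most one window of data in flight per round trip, so its ceiling is roughly window / RTT. The sketch below only does that arithmetic. The round-trip time and window size are assumed values for illustration, not measurements from this environment.

```python
def max_single_stream_gbit(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound for one TCP stream: one window in flight per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e9


def window_needed_bytes(target_gbit: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to sustain the target rate."""
    return target_gbit * 1e9 * rtt_seconds / 8


# Assumed values for illustration only; measure the real RTT and window.
rtt = 0.0002        # 0.2 ms round trip (assumption)
window = 64 * 1024  # 64 KiB window (assumption)

print(f"ceiling: {max_single_stream_gbit(window, rtt):.2f} Gbit/s")
print(f"window for 10 Gbit/s: {window_needed_bytes(10, rtt) / 1024:.0f} KiB")
```

If the measured RTT and effective window give a ceiling near 5.6 Gbit/s, the window is a plausible suspect; if the ceiling comes out well above 10 Gbit/s, the limit is probably elsewhere.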

On the ESXi side I have tried changing Net.TcpipDefLROMaxLength to the maximum, as suggested here:
https://docs.vmware.com/en/VMware-vSphere/6.5/com.vmware.vsphere.networking.doc/GUID-8E451976-8BD4-4052-A492-FFCAC0105690.html

And still no luck.

The tests were performed with iperf running as a server on the ESXi host and "network test-link run-test -vserver svm-nnn -destination ip.of.the.esxi" running on the NetApp.

Ontapforrum · ‎2020-06-07

Thanks for the update. We will have to investigate further here.

In theory:
10 Gbit/s = you should expect around 1.25 GB/sec

You are getting:
5.6 Gbit/s = 700 MB/sec, which is around 56% of the theoretical value.
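For reference, the conversion behind these figures, as a minimal sketch. It uses decimal units and 8 bits per byte, and it ignores Ethernet/IP/TCP header overhead, so the real payload ceiling on a 10 Gbit/s link is somewhat below 1250 MB/s.

```python
def gbit_to_mbyte_per_s(gbit_per_s: float) -> float:
    """Convert a rate in Gbit/s to MB/s (decimal units, 8 bits per byte)."""
    return gbit_per_s * 1000 / 8


line_rate = gbit_to_mbyte_per_s(10.0)  # raw 10 Gbit/s line rate
observed = gbit_to_mbyte_per_s(5.6)    # rate reported in this thread

print(f"{observed:.0f} of {line_rate:.0f} MB/s = {observed / line_rate:.0%}")
```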

May I ask these questions:
As I understand it, you are only able to achieve 5.6 Gbit/s? Is the host pushing enough data to saturate the pipe? Just trying to understand the issue so that we can isolate the cause further.

Q1) Is the ifgrp (10g ports) dedicated to NFS alone, or are there other protocols/services riding on it?
Q2) What is the MTU set on the storage, client and data switches?

Could you give us the output of the following:
1) ::> ifgrp show -fields node,ifgrp,ports,mode
2) ::> vlan show -port <ifgrp_port>
3) ::> node run -node <whichever_node_the_ports_exist_on>
>sysconfig -a

Also, could you try:
1) Carve out a 500GB NFS volume and mount it on a Linux machine on the same ifgrp-vlan port
2) Go to /mnt/nfs-volume
3) Dump some big chunks of data (around 100G) and check the network throughput
4) During the copy, how is the CPU utilization on your NetApp node?
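If a scripted version of step 3 is handy, something like the sketch below could be used on the Linux client. It is only one way to do it: the function name and the test file name are hypothetical, and the fsync is counted in the timing so the result reflects data actually sent to the server instead of data sitting in the client cache.

```python
import os
import time


def write_throughput_mb_s(path: str, total_bytes: int,
                          chunk_bytes: int = 1024 * 1024) -> float:
    """Write total_bytes to path in chunks, fsync, and return MB/s (decimal)."""
    chunk = os.urandom(chunk_bytes)  # random data, so it is not all zeros
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # wait until the data has left the client cache
    elapsed = time.perf_counter() - start
    return written / elapsed / 1e6
```

For example, `write_throughput_mb_s("/mnt/nfs-volume/throughput-test.bin", 100 * 1000**3)` would write about 100 GB to the mounted volume; multiplying the result by 8 / 1000 gives Gbit/s for comparison with the 5.6 figure. A single writer like this is one stream, so it may hit the same per-stream limit seen with iperf.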

Thanks!