How to determine if a performance issue is from the cluster or from the client side?

heightsnj · ‎2020-08-12

An NFS datastore with one Windows VM in it, in a vSphere environment, on an AFF SSD aggregate.

The user is complaining that a job is taking much longer to complete, and the graph shows about 75 ms latency over about 18 hours.

During the same period, OCUM shows about 10 ms latency for this volume, along with 17k IOPS and 1000 MB/s throughput, which is much higher than at other times.

I have checked around and I don't see any issues on the storage cluster. 10 ms is not good, but not terribly bad either. So, how do I tell if the slowness was really caused by the Windows VM or the storage cluster?

AlainTansi · ‎2020-08-12

Hi,

Where are you reading the 75ms latency?

From what OCUM is telling you, we can clear the suspicion of this being a storage-side bottleneck. Although you see 10 ms, that is not bad.

You can further validate from the cluster side by reviewing QoS statistics:

cluster1::> qos statistics workload latency show -iterations 100 -rows 3

cluster1::> qos statistics volume latency show -volume VOLUME_NAME -vserver VSERVER_NAME

Check the latency reported in ms in the Network column for any abnormal readings.

How to troubleshoot performance issues in both clustered Data ONTAP and Data ONTAP 7-Mode systems

https://kb.netapp.com/app/answers/answer_view/a_id/1031202/~/how-to-troubleshoot-performance-issues-in-both-clustered-data-ontap-and-data

I am not particularly sure how you can determine performance from the client side, but the above will tell us if there is a performance issue on the cluster side.

Please share your feedback

netappmagic · ‎2020-08-12

What is the method to tell how many IOPS an SSD aggregate can provide?

heightsnj · ‎2020-08-12

The 75 ms was shown in one of the vSphere performance graphs for this particular Windows VM, in the same period where I saw 10 ms in OCUM.

What if people say that 10 ms is too high, since it should be 2 ms as seen on the storage most of the time, and that this means the storage could not handle the workload?

paul_stejskal · ‎2020-08-12

Perf L2 TSE here! First of all, perf issues can be a pain to track down sometimes. We are working on some KBs to try to simplify the process, but could use some feedback as well!

So, a few things:

1) What kind of filer, and what kind of disks (X # with fw version please too).

2) In AIQUM, in the volume, you can see the breakdown of cluster components in the latency graph. Where is the latency?

3) What protocol?

4) Have you ever had latency at 10 ms?

5) Is this workload normal or unusual in terms of load?

6) What is the expected latency? If not sure, will need to validate with AIQUM.

7) Version of ONTAP?

Now I can think of a few scenarios this could be:

1) Latency amplification because vSphere is reporting upstream of ONTAP. In this case, if you took a VROPs/vSphere latency graph at the datastore level compared to an AIQUM graph, they would have the same or similar graphs, but just higher peaks and valleys in VMware. In this case, I'm not surprised because it may be queuing at the network/host side a tad bit. Probably 10 ms is too high here.

2) A huge discrepancy due to network delay. If iSCSI/NFS, I just like to get a tcpdump from the filer during the issue. It's honestly the best way to troubleshoot network problems. Bonus points if you can get tcpdumps from ESX (you'll need to start two, one for ingress and one for egress and merge with Wireshark).

3) This latency has been this way all this time, but something changed in the amount of data the job has to process.
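For scenario 1, one way to check whether the VM-side graph is just an amplified copy of the storage-side graph is to export both latency series for the same interval and compare them. The following is a minimal Python sketch, not a NetApp or VMware tool: the function names and sample numbers are hypothetical, and it assumes both series were sampled at the same timestamps.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long latency series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


def amplification(host_ms, storage_ms):
    """Per-sample ratio of host-side latency to storage-side latency."""
    return [h / s for h, s in zip(host_ms, storage_ms) if s > 0]


# Hypothetical samples (ms), one value per interval, same timestamps.
storage_ms = [2, 4, 12, 4, 2]    # e.g. exported from AIQUM
host_ms = [10, 40, 75, 45, 12]   # e.g. exported from vSphere

print("correlation:", round(pearson(host_ms, storage_ms), 2))
print("amplification per sample:", amplification(host_ms, storage_ms))
```

A correlation close to 1 with a fairly steady ratio would point toward amplification upstream of ONTAP (scenario 1). A low correlation would be more in line with an independent source of delay such as the network (scenario 2).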

And to the question above about the max IOPS an aggregate can produce: we don't publish those guidelines because, unlike SolidFire for example, ONTAP is designed for many different workloads and varying performance profiles, from small to large block, and from light to heavy metadata needs. Each op has a different cost, and that cost is highly dependent on your individual workloads. Customers have different workloads even if they say they run a "standard" VMware install, for example, so I've found there is no standardization.

If you want sizing numbers, please get with your account rep/SE to discuss sizing and what your limits are. Generally that is done before sale of a system, even for an upgrade/head swap.

TMACMD · ‎2020-08-12

Check the basics first?

Using jumbo frames all the way through? (An NFS datastore was indicated.)

-> net ping -lif nfs_lif -vserver esx_svm -dest esx_nfs_ip -d true -p 5000

Is there any routing going on? Some people do (against best practices) route to get to datastores, which increases latency.

-> net int show -vserver esx_svm

Have the appropriate NFS tunings been set in ESXi per Virtual Storage Console?

All easy and verifiable.

heightsnj · ‎2020-08-14

@paul_stejskal

1) What kind of filer, and what kind of disks (X # with fw version please too).
2x A300 nodes within an 8-node cluster. 7TB SSD aggr, FW: NA52

2) In AIQUM, in the volume, you can see the breakdown of cluster components in the latency graph. Where is the latency?
During the entire 18 hours when the VM averaged around 45 ms, with a 75 ms peak, the volume was only around 4 ms, and reached a 12 ms max during 2 hours.

3) What protocol?
NFS Datastore/vSphere.

4) Have you ever had latency at 10 ms?
No, most of the time it is only 1-2 ms.

5) Is this workload normal or unusual in terms of load?
Don't know. The user didn't clearly share that information, and just believed the storage caused the issue.

6) What is the expected latency? If not sure, will need to validate with AIQUM.
Expecting 2 ms, as seen most of the time.

7) Version of ONTAP?
9.6p2

Yes, the graphs for the VM and the filer are similar, just amplified on the VM side. Also, IOPS was about 16k and throughput was almost 1000 MB/s for the entire 10 hours. At other, normal times, IOPS is only about 300, with very small throughput. So, to me, it was just due to a high workload on the VM during this period.

I will take your advice and run tcpdump on both sides if the issue starts to occur again.
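As a rough sanity check on those numbers, the average I/O size and the average number of I/Os in flight can be estimated from IOPS, throughput, and latency (the second estimate is Little's Law). This is only a sketch: it assumes the throughput figure is in MB/s with 1 MB = 1000 KB, that the VM and the volume see roughly the same IOPS, and the function names are made up for illustration.

```python
def avg_io_size_kb(throughput_mb_s, iops):
    """Average I/O size in KB, assuming 1 MB = 1000 KB."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return throughput_mb_s * 1000 / iops


def outstanding_ios(iops, latency_ms):
    """Little's Law: average I/Os in flight = arrival rate * time in system."""
    return iops * latency_ms / 1000


# Figures quoted in this thread for the busy period.
io_size = avg_io_size_kb(1000, 16000)      # 62.5 KB per op
at_volume = outstanding_ios(16000, 4)      # 64.0 in flight at ~4 ms (volume view)
at_vm = outstanding_ios(16000, 45)         # 720.0 in flight at ~45 ms (VM view)

print(f"avg I/O size: {io_size} KB")
print(f"in flight at the volume: {at_volume}, at the VM: {at_vm}")
```

Under these assumptions, the job was pushing fairly large I/Os at roughly 50 times the normal op rate, and most of the outstanding I/O was waiting somewhere upstream of the volume (guest, host, or network) rather than inside ONTAP.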

The difficulty is how to explain to them that the problem was not on the storage side?

@TMACMD

We are using jumbo frames, and everything is in the same Layer 2 subnet. VAAI is installed. We tested, and jumbo frames don't make much difference.

TMACMD · ‎2020-08-14

If you are using jumbo frames, you should verify that they are working all the way through. Do a quick check: ping from each ONTAP NFS LIF to each NFS vmk on each ESXi host.

I have seen cases (especially where host profiles are not used) where a host or two was overlooked and did not have jumbo enabled like everyone else.

Does not hurt to do a quick check.
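The reason a large, non-fragmentable ping works as a jumbo check is simple arithmetic: a packet bigger than the smallest MTU on the path cannot get through when fragmentation is disallowed. A small Python sketch of that arithmetic follows. It assumes IPv4 with no IP options (20-byte header) and an 8-byte ICMP header; the function names are illustrative, and whether a given ping tool counts the headers in its packet-size value should be checked in that tool's documentation.

```python
IPV4_HEADER_BYTES = 20   # IPv4 header without options
ICMP_HEADER_BYTES = 8    # ICMP echo header


def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("MTU is smaller than the IP and ICMP headers")
    return payload


def passes_without_fragmentation(payload_bytes, path_mtu):
    """True if a ping with this payload fits the smallest MTU on the path."""
    return payload_bytes <= max_icmp_payload(path_mtu)


print(max_icmp_payload(9000))                      # 8972
print(max_icmp_payload(1500))                      # 1472
print(passes_without_fragmentation(5000, 9000))    # True
print(passes_without_fragmentation(5000, 1500))    # False
```

So a 5000-byte ping with fragmentation disallowed should succeed only if every hop is set for jumbo frames, and 8972 bytes is the largest payload that tests a full 9000-byte MTU under these assumptions.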

TMACMD · ‎2020-08-12

Have you enabled jumbo frames?

Are you routing? In other words, are the ESXi host vmk and the NFS datastore on the same subnet?

Do you have VAAI installed?

Are you using VSC? Have you enabled the tunings that VSC suggests?