Based on those stats, am I correct that the 2050A is hitting an I/O performance issue? The disk util sometimes hits 100%, and the utilization of the five data disks in the RAID group reaches ~90% (normally it is 50–70%).
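For anyone who wants to see where those numbers come from, this is roughly how I pulled them on the filer (7-mode CLI; the interval is just an example):

    filer> sysstat -x 1      # per-second view; the "Disk util" column shows the busiest disk
    filer> statit -b         # begin collecting per-disk statistics
    ... wait a minute or so under normal load ...
    filer> statit -e         # end collection; the ut% column shows utilization per data disk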
Or could some other issue, like the network, be causing the NFS timeouts? I have submitted a case to NetApp support, and they claimed that enabling flow control should fix the issue, but to me it looks more like an I/O bottleneck. Does anyone have an idea how to verify whether there is a network issue?
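In the meantime, these are the checks I know of to rule the network in or out (filer commands are 7-mode; the client command assumes a Linux NFS client):

    filer> ifstat -a         # look for receive/transmit errors and discards on each NIC
    filer> netstat -s        # protocol counters; TCP retransmits and drops are the interesting ones
    client# nfsstat -rc      # on the client: a climbing "retrans" counter means requests are being resent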
That is also what I am thinking: five data disks are not fast enough to handle the I/O load. We submitted a case to NetApp before, and interestingly, here is their reply:
"The disk util being between 90 to 100% is no cause for concern. This is simply indicating that ONTAP is utilizing all or most memory available when needed to maximize performance. The figures do not indicate high disk I/O. I understand how some
figures on perfstat can be misleading."
So apparently high disk util doesn't mean anything and should not be a concern. I am confused about this, and it also contradicts what I found on Google.
They also pointed out that it is a network bottleneck and suggested that enabling full flow control on the network switch should fix the issue, as they found that "Misses" in the NFS reply cache statistics are high relative to "In Progress", which they say implies more retransmission.
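For reference, the reply cache counters they are talking about come from the filer's NFS statistics (7-mode):

    filer> nfsstat -d        # detailed NFS stats; the reply cache section contains the
                             # "In progress" and "Misses" counters support referred to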
I can't really justify what they said, but we will try enabling full flow control and observe whether it improves the situation. Has anyone of you encountered this before, and did flow control fix it?
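If we do try it, my understanding is flow control has to match on both ends, along these lines (interface names are examples, and the switch syntax here is Cisco IOS; yours may differ):

    filer> ifconfig e0a flowcontrol full        # on the filer; add it to /etc/rc so it survives a reboot
    switch(config)# interface Gi0/1
    switch(config-if)# flowcontrol receive on   # some platforms also support "flowcontrol send on"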
We have now managed to add more disks to the 2050. There is a significant improvement on our filer and a large reduction in the number of NFS timeout occurrences. The disk util% has dropped to ~80%. This proves it was an I/O bottleneck.
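For anyone hitting the same wall, what we effectively did was along these lines (aggregate, volume name, and disk count are examples):

    filer> aggr status -r                 # check the current RAID group layout first
    filer> aggr add aggr0 4               # add 4 spare disks to the aggregate
    filer> reallocate on                  # enable reallocation scans
    filer> reallocate start -f /vol/vol1  # spread existing data across the new spindles;
                                          # without this, mostly new writes benefit from them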
I have to say, tech support pointed us in the wrong direction by saying it was a network problem while failing to identify the I/O bottleneck from the perfstat we sent. They also claimed that disk util being 90% to 100% was no cause for concern, which is not the case. We lost time tracing the problem from the network perspective.