Network and Storage Protocols
Hi,
I occasionally get NFS timeout errors when accessing volumes on our FAS2050A storage system from multiple Linux clients:
nfs: server 192.168.x.x not responding, timed out
nfs: server 192.168.x.x not responding, timed out
nfs: server 192.168.x.x not responding, timed out
I ran a simple sysstat and statit to capture some stats:
disk              ut%   xfers  ureads--chain-usecs  writes--chain-usecs  cpreads-chain-usecs  greads--chain-usecs  gwrites-chain-usecs
/aggr0/plex0/rg0:
0a.17               3    1.49    0.00  ....      .   1.38  13.42   1516   0.11  32.00     63    0.00  ....      .   0.00  ....      .
0a.19               7    3.10    0.34  1.00 138667   2.64   7.48   1453   0.11  32.00     63    0.00  ....      .   0.00  ....      .
0a.21              89  104.58  101.71  1.15  40939   2.18   8.16   1574   0.69   8.33   2900    0.00  ....      .   0.00  ....      .
0a.23              89  109.29  107.79  1.39  32168   0.69  24.67   1432   0.80   6.14   4000    0.00  ....      .   0.00  ....      .
0a.25              90  108.94  106.76  1.41  32073   1.26  13.45   1459   0.92   5.25   3905    0.00  ....      .   0.00  ....      .
0a.27              88  106.19  104.81  1.39  33839   0.69  24.83   1121   0.69   7.17   2279    0.00  ....      .   0.00  ....      .
0a.29              88  110.55  108.83  1.39  33139   0.80  21.29    953   0.92   5.25   1690    0.00  ....      .   0.00  ....      .
 CPU   NFS  CIFS  HTTP  Total    Net kB/s    Disk kB/s    Tape kB/s  Cache Cache  CP   CP Disk   FCP iSCSI    FCP kB/s
                                 in   out   read  write   read write   age   hit time  ty util                 in  out
 14%  1575     0     0   1575   353   745   2540      0      0     0     8   97%   0%   -  100%     0     0     0    0
 15%   953     0     0    953   192  2461   4528     24      0     0     8   95%   0%   -  100%     0     0     0    0
 13%   970     0     0    970   221  1769   3364      0      0     0     9   95%   0%   -  100%     0     0     0    0
 38%  1364     0     0   1364   425  5721   5284   4804      0     0     9   99%  59%  Zf  100%     0     0     0    0
 33%  2404     0     0   2404   472  6967   7112   2224      0     0     9   95%  77%  Zf   78%     0     0     0    0
 28%  2865     0     0   2865   596  7068   5804   1212      0     0     9   94%  94%  2f   71%     0     0     0    0
 12%  1051     0     0   1051   221  1691   1884    668      0     0     9   96% 100%  :f  100%     0     0     0    0
 20%   777     0     0    777   343   328    748    212      0     0     9   96%  28%   :   55%     0     0     0    0
Based on those stats, am I correct that the 2050A is hitting an I/O performance problem, since the disk util sometimes hits 100% and the utilization of the five data disks in the raid group reaches ~90% (normally 50-70%)?
Or could something else, like the network, be causing the NFS timeouts? I have submitted a case to NetApp support. They claim that enabling flow control should fix the issue, but to me it looks more like an I/O bottleneck. Does anyone have an idea how to check/verify whether there is a network issue?
Thanks
Hi,
It does look like a disk bottleneck at the particular time the data was captured - which, with only five data disks, is not a big surprise I think.
Can you either run sysstat for longer, or (even better) a perfstat for a day?
It would be handy to correlate the peak times with what is actually going on in your environment - occasional peaks are not necessarily very bad (well, server timeouts are, though).
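(A minimal way to do the longer capture, assuming rsh or ssh access to the filer is already set up - the address, interval and file name below are only placeholders:)
# run from a Linux admin host; -c 720 with a 5-second interval gives roughly an hour of samples
rsh 192.168.x.x sysstat -c 720 -x 5 | tee sysstat_longrun.log
perfstat does the same kind of thing but bundles sysstat, statit and the network counters into one archive, which is what support will want to look at anyway.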
Regards,
Radek
Hi,
That is also what I am thinking - five data disks are not fast enough to handle the I/O requests. We submitted a case to NetApp before; interestingly, here is their reply:
"The disk util being between 90 to 100% is no cause for concern. This is simply indicating that ONTAP is utilizing all or most memory available when needed to maximize performance. The figures do not indicate high disk I/O. I understand how some
figures on perfstat can be misleading."
It seems high disk util doesn't mean anything and should not be a concern. I am confused about this, and it also contradicts what I found on Google.
They also pointed out that it is a network bottleneck and suggested that enabling full flow control on the network switch should fix the issue, as they found that "Misses" in the NFS reply cache statistics is high relative to "In progress", which implies a lot of retransmission:
UDP:
In progress   Delay hits   Misses   Idempotent   Non-idempotent
        606            0    71944           17                0
I can't really justify what they said, but we will try enabling full flow control and observe whether it improves the situation. Just wondering if anyone here has encountered this error before and whether flow control fixed it?
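(Before and after turning it on, the flow-control setting and any retransmissions can at least be sanity-checked from the Linux side - eth0 below is just an example interface name:)
ethtool -a eth0                               # shows whether RX/TX pause (flow control) is currently negotiated
ethtool -S eth0 | grep -iE 'pause|drop|err'   # NIC-level pause frames and drops (counter names vary by driver)
nfsstat -rc                                   # client RPC stats - a climbing "retrans" count means requests really are being resent
If retrans stays near zero while the filer's disks sit at 100%, that would point back at the I/O side rather than the network.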
Thanks,
Jason
Hi Jason,
Using sysstat to troubleshoot networking issues might not be the right tool. Typically I'd use:
ping
traceroute
enable debugging:
options nfs.mountd.trace on
options nfs.per_client_stats.enable on
pktt (packet trace)
rdfile /etc/messages - look for error messages here
ifstat -a - check the error counters here
check for errors on your clients; see if they all use the same NFS version, whether they mount over UDP or TCP, etc. (a quick way to check that is just below)
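For the client side, something like this on each Linux box is usually enough (plain Linux tools, nothing filer-specific):
nfsstat -m                    # per-mount options: NFS version, proto=udp/tcp, timeo, retrans
grep nfs /proc/mounts         # another view of the options each mount actually negotiated
dmesg | grep -i nfs           # the "not responding" messages with timestamps to correlate against sysstat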
I'd also configure NFS to use TCP instead of UDP (an example mount line is at the end of this post), but that's your call. From experience I can tell you this is typically a networking error; your sysstat does not indicate an issue. If you are worried about disk bottlenecks then again sysstat is not the best tool - use "statit":
priv set advanced
statit -b
wait a couple of mins
statit -e
read the output. If you don't know how to interpret it, send it to tech support - I wouldn't trust someone on Google more than tech support.
It's in NetApp's interest to solve your issue; they're not just trying to get rid of you, mate.
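And if you do try TCP, it's just a mount option change on the clients - the export path and mount point below are made-up examples:
umount /mnt/vol1
mount -t nfs -o proto=tcp,vers=3,hard,intr 192.168.x.x:/vol/vol1 /mnt/vol1
# or the matching /etc/fstab line
192.168.x.x:/vol/vol1  /mnt/vol1  nfs  proto=tcp,vers=3,hard,intr  0 0
TCP won't make slow disks any faster, but it copes with packet loss much better than UDP, so it's worth trying either way.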
Cheers,
Eric
Hi,
We managed to add more disks to the 2050 now. We are seeing a significant improvement on our filer and a big drop in the number of NFS timed-out occurrences. The disk util% has come down to around 80%. This proves it was an I/O bottleneck.
I have to say, tech support sent us in the wrong direction by saying it was a network problem, and they couldn't identify the I/O bottleneck from the perfstat we sent. They also claimed that disk util being 90% to 100% is no cause for concern, which is not the case. We lost time chasing the problem from the network perspective.
Thanks,
Jason