Hi,
I occasionally get NFS timeout errors when accessing volumes on the FAS2050A storage system from multiple Linux clients:
nfs: server 192.168.x.x not responding, timed out
nfs: server 192.168.x.x not responding, timed out
nfs: server 192.168.x.x not responding, timed out
I ran sysstat and statit to capture some statistics:
disk   ut%   xfers  ureads--chain-usecs  writes--chain-usecs  cpreads-chain-usecs  greads--chain-usecs  gwrites-chain-usecs
/aggr0/plex0/rg0:
0a.17    3    1.49    0.00  ....      .    1.38 13.42   1516    0.11 32.00     63    0.00  ....      .    0.00  ....      .
0a.19    7    3.10    0.34  1.00 138667    2.64  7.48   1453    0.11 32.00     63    0.00  ....      .    0.00  ....      .
0a.21   89  104.58  101.71  1.15  40939    2.18  8.16   1574    0.69  8.33   2900    0.00  ....      .    0.00  ....      .
0a.23   89  109.29  107.79  1.39  32168    0.69 24.67   1432    0.80  6.14   4000    0.00  ....      .    0.00  ....      .
0a.25   90  108.94  106.76  1.41  32073    1.26 13.45   1459    0.92  5.25   3905    0.00  ....      .    0.00  ....      .
0a.27   88  106.19  104.81  1.39  33839    0.69 24.83   1121    0.69  7.17   2279    0.00  ....      .    0.00  ....      .
0a.29   88  110.55  108.83  1.39  33139    0.80 21.29    953    0.92  5.25   1690    0.00  ....      .    0.00  ....      .
 CPU   NFS  CIFS  HTTP Total    Net kB/s   Disk kB/s   Tape kB/s Cache Cache    CP    CP  Disk   FCP iSCSI    FCP kB/s
                                in   out  read write  read write   age   hit  time    ty  util                in   out
 14%  1575     0     0  1575   353   745  2540     0     0     0     8   97%    0%     -  100%     0     0     0     0
 15%   953     0     0   953   192  2461  4528    24     0     0     8   95%    0%     -  100%     0     0     0     0
 13%   970     0     0   970   221  1769  3364     0     0     0     9   95%    0%     -  100%     0     0     0     0
 38%  1364     0     0  1364   425  5721  5284  4804     0     0     9   99%   59%    Zf  100%     0     0     0     0
 33%  2404     0     0  2404   472  6967  7112  2224     0     0     9   95%   77%    Zf   78%     0     0     0     0
 28%  2865     0     0  2865   596  7068  5804  1212     0     0     9   94%   94%    2f   71%     0     0     0     0
 12%  1051     0     0  1051   221  1691  1884   668     0     0     9   96%  100%    :f  100%     0     0     0     0
 20%   777     0     0   777   343   328   748   212     0     0     9   96%   28%     :   55%     0     0     0     0
Based on these stats, am I correct that the 2050A is hitting an I/O performance limit? The sysstat disk util sometimes hits 100%, and the utilization of the five data disks in the RAID group reaches ~90% (normally it is 50-70%).
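To put numbers on the figures quoted above, here is a minimal Python sketch that summarizes the two captures. The values are copied by hand from the statit and sysstat output in this post; the 50% cutoff used to pick out the busy disks and the function name `summarize` are my own assumptions for illustration, not anything Data ONTAP defines.

```python
# ut% per disk, copied from the statit output in this post
STATIT_UT = {
    "0a.17": 3, "0a.19": 7,
    "0a.21": 89, "0a.23": 89, "0a.25": 90, "0a.27": 88, "0a.29": 88,
}

# "Disk util" column (percent) from the eight sysstat samples in this post
SYSSTAT_DISK_UTIL = [100, 100, 100, 100, 78, 71, 100, 55]

# Assumption: disks above this ut% are treated as the busy data disks.
BUSY_THRESHOLD = 50


def summarize(statit_ut, sysstat_util, threshold=BUSY_THRESHOLD):
    """Return (busy disk names, mean ut% of busy disks, samples at 100% util)."""
    busy = {disk: ut for disk, ut in statit_ut.items() if ut > threshold}
    mean_busy = sum(busy.values()) / len(busy)
    saturated = sum(1 for util in sysstat_util if util >= 100)
    return sorted(busy), mean_busy, saturated


if __name__ == "__main__":
    disks, mean_busy, saturated = summarize(STATIT_UT, SYSSTAT_DISK_UTIL)
    print(f"busy disks: {', '.join(disks)}")
    print(f"mean ut% of busy disks: {mean_busy:.1f}")
    print(f"sysstat samples at 100% disk util: {saturated} of {len(SYSSTAT_DISK_UTIL)}")
```

On this capture the five busy disks average 88.8% utilization, and 5 of the 8 sysstat samples show 100% disk util. This only summarizes the samples shown here and says nothing about the network side.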
Could some other issue, such as the network, be causing the NFS timeouts? I have opened a case with NetApp support. They claim that enabling flow control should fix the issue, but it looks more like an I/O bottleneck to me. Does anyone have an idea how to check whether there is a network issue?
Thanks