Network and Storage Protocols

NFS timed out error on 2050A

jasonwo

Hi,

I occasionally get NFS timeout errors when accessing volumes on the FAS2050A storage system from multiple Linux clients:

nfs: server 192.168.x.x not responding, timed out
nfs: server 192.168.x.x not responding, timed out
nfs: server 192.168.x.x not responding, timed out


I ran a simple sysstat and statit to capture some stats:

disk             ut%  xfers  ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr0/plex0/rg0:
0a.17              3   1.49    0.00   ....     .   1.38  13.42  1516   0.11  32.00    63   0.00   ....     .   0.00   ....     .
0a.19              7   3.10    0.34   1.00 138667   2.64   7.48  1453   0.11  32.00    63   0.00   ....     .   0.00   ....     .
0a.21             89 104.58  101.71   1.15 40939   2.18   8.16  1574   0.69   8.33  2900   0.00   ....     .   0.00   ....     .
0a.23             89 109.29  107.79   1.39 32168   0.69  24.67  1432   0.80   6.14  4000   0.00   ....     .   0.00   ....     .
0a.25             90 108.94  106.76   1.41 32073   1.26  13.45  1459   0.92   5.25  3905   0.00   ....     .   0.00   ....     .
0a.27             88 106.19  104.81   1.39 33839   0.69  24.83  1121   0.69   7.17  2279   0.00   ....     .   0.00   ....     .
0a.29             88 110.55  108.83   1.39 33139   0.80  21.29   953   0.92   5.25  1690   0.00   ....     .   0.00   ....     .

CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out
14%  1575     0     0    1575   353   745   2540      0     0     0     8   97%   0%  -  100%      0     0     0     0
15%   953     0     0     953   192  2461   4528     24     0     0     8   95%   0%  -  100%      0     0     0     0
13%   970     0     0     970   221  1769   3364      0     0     0     9   95%   0%  -  100%      0     0     0     0
38%  1364     0     0    1364   425  5721   5284   4804     0     0     9   99%  59%  Zf 100%      0     0     0     0
33%  2404     0     0    2404   472  6967   7112   2224     0     0     9   95%  77%  Zf  78%      0     0     0     0
28%  2865     0     0    2865   596  7068   5804   1212     0     0     9   94%  94%  2f  71%      0     0     0     0
12%  1051     0     0    1051   221  1691   1884    668     0     0     9   96% 100%  :f 100%      0     0     0     0
20%   777     0     0     777   343   328    748    212     0     0     9   96%  28%  :   55%      0     0     0     0

Based on these stats, am I correct that the 2050A is hitting an I/O performance issue, since disk util sometimes hits 100% and the utilization of the five data disks in the raid group reaches ~90% (normally 50-70%)?

Or could there be some other issue, like the network, causing the NFS timeouts? I have submitted a case to NetApp support. They claimed that enabling flow control should fix the issue, but to me it looks more like an I/O bottleneck. Does anyone have an idea how to check whether there is a network issue?
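For what it's worth, the only checks I know of on the Linux client side are along these lines (just a sketch - the IP is the same placeholder as in the log above):

nfsstat -rc                 (client RPC stats; a steadily growing retrans count points at packet loss or a slow server)
nfsstat -m                  (mount options actually in use: proto=udp/tcp, rsize/wsize, timeo, retrans)
cat /proc/net/dev           (per-interface error and drop counters on the client)
ping -s 8000 192.168.x.x    (large pings to the filer to spot fragmentation or loss on the path)

I am not sure these are enough to rule the network in or out, though.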

Thanks


radek_kubka

Hi,

It does look like a disk bottleneck at the particular time the data was captured - which, with only five data disks, is not a big surprise I think.

Can you either run sysstat for longer, or (even better) a perfstat for a day?

It would be handy to correlate peak times with what is actually going on in your environment - occasional peaks are not necessarily very bad (well, server timeouts are, though).
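For example - the exact perfstat flags differ between versions, so treat this as a rough sketch rather than the exact syntax:

sysstat -x 5                                        (on the filer console; extended output every 5 seconds, leave it running across a busy period)
perfstat.sh -f <filer> -t 5 -i 288 > perfstat.out   (from an admin host; roughly 24 hours of 5-minute iterations)

Perfstat is a separate download from NetApp support.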

Regards,
Radek

jasonwo

Hi,

That is also what I am thinking - five data disks are not fast enough to handle the I/O requests. We submitted a case to NetApp before; interestingly, here is their reply:

"The disk util being between 90 to 100% is no cause for concern. This is simply indicating that ONTAP is utilizing all or most memory available when needed to maximize performance. The figures do not indicate high disk I/O. I understand how some

figures on perfstat can be misleading."

So apparently high disk util doesn't mean anything and should not be a concern. I am confused about this - it is also contrary to what I found on Google.

They also pointed out that it is a network bottleneck and suggested that enabling full flow control on the network switch should fix the issue, as they found the "Misses" count in the NFS reply cache statistics is high relative to "In progress", which they say implies retransmissions:


UDP:
In progress   Delay hits   Misses   Idempotent   Non-idempotent
        606            0    71944           17                0

I can't really verify what they said, but we will try enabling full flow control and see if it improves the situation. I just wonder if any of you have encountered this error before, and whether flow control fixed it?
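For reference, this is roughly what we plan to try - a sketch only, e0a is just an example interface, and the connected switch port has to be set to match:

nfsstat -d                       (on the filer; the reply cache statistics above come from here - the exact flag may differ by ONTAP version)
ifconfig e0a flowcontrol full    (enable full flow control on the filer interface)
rdfile /etc/rc                   (add the flowcontrol setting to the ifconfig line here so it survives a reboot)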

Thanks,

Jason

eric_barlier

Hi Jason,

Using sysstat to troubleshoot networking issues might not be the right approach. Typically I'd use:

ping

traceroute

enable debugging:

nfs.mountd.trace on

nfs.per_client_stats.enable on

pktt  (packet trace)

rdfile /etc/messages        > look for error messages here

ifstat -a                          > look for error messages here

Check for errors on your clients too - see if they have got the same NFS version, whether they use UDP or TCP, etc.
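On the filer side that boils down to something like this - e0a is only an example, use whichever interface the clients mount over, and adjust the trace directory to taste:

options nfs.mountd.trace on
options nfs.per_client_stats.enable on
pktt start e0a -d /etc/crash     (packet trace written under /etc/crash; read it later with ethereal/wireshark)
pktt stop e0a
ifstat -a
rdfile /etc/messages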

I'd also configure NFS to use TCP instead of UDP, but that's your call. From experience I can tell this is typically a networking error; your sysstat does not indicate an issue. If you are worried about disk bottlenecks, then again sysstat is not the best tool - use "statit":

priv set advanced

statit -b

wait a couple of mins

statit -e

Read the output. If you don't know how to interpret it, send it to tech support. I wouldn't trust someone on Google more than tech support.

It's in NetApp's interest to solve your issue; they're not just trying to get rid of you, mate.

Cheers,

Eric

jasonwo

Hi,

We managed to add more disks to the 2050 now. There is a significant improvement on our filer and a big reduction in the number of NFS timed out occurrences. The disk util% has dropped to around 80. This proves it was an I/O bottleneck.
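In case it helps anyone else, the expansion itself was just along these lines (disk count is an example, not our exact layout):

aggr add aggr0 4                        (add spare disks to the existing aggregate)
reallocate start -f /vol/<volname>      (optional: spread existing data across the new disks; new writes spread out on their own)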

I have to say, tech support sent us in the wrong direction by saying it was a network problem and failing to identify the I/O bottleneck from the perfstat we sent. They also claimed that disk util being 90% to 100% was no cause for concern, which was not the case. We lost time tracing the problem from the network perspective.

Thanks,

Jason
