Network and Storage Protocols
Dear NetApp Users,
I'm having some problems with Linux NFS clients connecting to a FAS270 filer. The error is the infamous:
nfs: server xxxx not responding, still trying
nfs: server xxxx not responding, still trying
nfs: server xxxx not responding, still trying
nfs: server xxxx OK
nfs: server xxxx OK
nfs: server xxxx OK
I've tried different options on the client side (both UDP and TCP NFS). Depending on the combination of NFS mount flags the problem becomes less frequent, but sooner or later this message always shows up.
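Just to illustrate the kind of combinations I mean (the export path, mount point and exact values below are only placeholders, not my real config):
[root@bender root]# mount -t nfs -o tcp,hard,intr,rsize=32768,wsize=32768,timeo=600 gea:/vol/vol0/home /mnt/gea
[root@bender root]# mount -t nfs -o udp,hard,intr,rsize=8192,wsize=8192,timeo=11,retrans=5 gea:/vol/vol0/home /mnt/gea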
Since there are two hops from those Linux clients to the filer, I thought a network problem might be occurring.
So I started to check for network errors and found this very alarming behaviour with the filer:
If I issue a "ping -f filer" I see very high packet loss (from 30% to 70%). I've checked this between other hosts on the same switch and LAN and it doesn't happen (0% packet loss). It only happens when I ping a filer, no matter whether it's on the same switch (0 hops) or on another switch in a different place (2 hops).
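For reference, the test is essentially this, run as root from the Linux host (the packet count is only there to bound the flood; the address is the filer's private interface):
[root@bender root]# ping -f -c 1000 192.168.0.201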
Is this normal?
The filer is a NetApp FAS270 running Data ONTAP 7.2.1.1. The client machines from which I issue the pings are Linux (kernel 2.6) and Mac OS X boxes.
Thanks in advance.
Best Regards,
David
Hi,
have you tried on your Linux box:
1) netstat -in
2) traceroute -I <netapp>
3) ip -s link show
On the netapp itself:
1) netstat -in
2) vif status
3) sysconfig -c 10 6
Please paste the results here.
-Jakub.
Hi Jakub!
Thank you very much for your prompt reply.
This is the output of the commands you suggested:
On the Linux side:
[root@bender root]# netstat -in
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 433164336 0 0 0 216933618 0 0 0 BMRU
eth1 1500 0 457557747 11 0 0 325302497 0 0 0 BMRU
lo 16436 0 91271105 0 0 0 91271105 0 0 0 LRU
[root@bender root]# traceroute -I 192.168.0.201
traceroute to 192.168.0.201 (192.168.0.201), 30 hops max, 38 byte packets
1 gea (192.168.0.201) 0.213 ms (255) 0.132 ms (255) 0.107 ms (255)
[root@bender root]# ifconfig
eth1 Link encap:Ethernet HWaddr 00:13:72:52:BB:86
inet addr:192.168.0.100 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:457567208 errors:11 dropped:0 overruns:0 frame:6
TX packets:325311954 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3836596886 (3.5 GiB) TX bytes:4211863382 (3.9 GiB)
Base address:0xdcc0 Memory:fe4e0000-fe500000
On the filer side:
gea> netstat -in
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Collis Queue
e0a 1500 158.109.208/ 158.109.209. 2g 11 2g 0 0 0
e0b 1500 192.168.0/24 192.168.0.20 1g 0 902m 0 0 0
lo 9188 127 127.0.0.1 87k 0 87k 0 0 0
gea> vif status
No configured vifs present
gea> sysconfig -c 10 6
No arguments are allowed with this option.
usage: sysconfig [ -A | -c | -d | -m | -r | -t | -V ]
sysconfig [ -av ] [ <slot> ]
gea> sysconfig -c
sysconfig: There are no configuration errors.
gea> sysconfig
NetApp Release 7.2.1.1: Tue Jan 23 00:43:25 PST 2007
System ID: 0084276005 (gea)
System Serial Number: 2065922 (gea)
System Rev: D0
slot 0: System Board
Processors: 2
Processor revision: B2
Processor type: 1250
Memory Size: 1022 MB
slot 0: FC Host Adapter 0b
14 Disks: 5923.5GB
1 shelf with AT-FCX, 1 shelf with EFH
slot 0: FC Host Adapter 0c
slot 0: Dual SB1250-Gigabit Ethernet Controller
e0a MAC Address: 00:a0:98:07:0f:26 (auto-100tx-fd-up)
e0b MAC Address: 00:a0:98:07:0f:27 (auto-1000t-fd-up)
slot 0: NetApp ATA/IDE Adapter 0a (0x00000000000001f0)
0a.0 245MB
OK, correction, on the NetApp:
1) ifconfig -a
2) sysstat -c 10 6
3) sysstat -x -c 10 6
-Jakub.
Hi Jakub,
Thanks for helping me. Here's what you asked for:
1) ifconfig -a
I'm hiding the public IP of the host here to avoid problems. I guess that's OK.
gea> ifconfig -a
e0a: flags=848043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 158.109.X.X netmask 0xfffff800 broadcast 158.109.X.X
ether 00:a0:98:07:0f:26 (auto-100tx-fd-up) flowcontrol full
e0b: flags=848043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.0.201 netmask 0xffffff00 broadcast 192.168.0.255
ether 00:a0:98:07:0f:27 (auto-1000t-fd-up) flowcontrol full
lo: flags=1948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 9188
inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
2) sysstat -c 10 6
I see that this command is kind of a "top" for CPU and network traffic on the NetApp, so, just in case, I attach two outputs: the first without pinging the filer, the second while pinging with -f from the Linux host.
FIRST :
gea> sysstat -c 10 6
CPU NFS CIFS HTTP Net kB/s Disk kB/s Tape kB/s Cache
in out read write read write age
7% 0 0 0 2 0 456 4988 0 0 >60
0% 0 0 0 5 0 1 0 0 0 >60
1% 0 0 0 2 0 124 134 0 0 >60
2% 52 0 0 891 29 121 131 0 0 >60
0% 0 0 0 3 0 1 0 0 0 >60
2% 0 0 0 2 0 276 1157 0 0 >60
0% 0 0 0 3 0 1 0 0 0 >60
5% 125 0 0 4294 117 125 134 0 0 >60
4% 0 0 0 4 0 156 772 0 0 >60
4% 6 0 0 6 1 125 4260 0 0 >60
SECOND :
gea> sysstat -c 10 6
CPU NFS CIFS HTTP Net kB/s Disk kB/s Tape kB/s Cache
in out read write read write age
0% 1 0 0 12 4 0 0 0 0 >60
1% 3 0 0 17 4 137 150 0 0 >60
0% 0 0 0 12 4 0 0 0 0 >60
1% 4 0 0 15 4 109 122 0 0 >60
1% 2 0 0 11 4 126 141 0 0 >60
0% 0 0 0 11 4 1 0 0 0 >60
1% 2 0 0 13 22 128 137 0 0 >60
0% 0 0 0 12 4 1 0 0 0 >60
1% 13 0 0 14 6 143 145 0 0 >60
0% 0 0 0 11 4 27 1 0 0 >60
3) sysstat -x -c 10 6
Same as before, two outputs just in case: the first with normal filer operation, the second with normal filer operation plus ping -f from the Linux host.
FIRST :
gea> sysstat -x -c 10 6
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
4% 98 0 0 98 3211 88 132 141 0 0 >60 100% 5% T 4% 0 0 0 0
0% 2 0 0 2 3 0 0 0 0 0 >60 100% 0% - 0% 0 0 0 0
6% 0 0 0 0 3 0 485 3801 0 0 >60 100% 20% T 16% 0 0 0 0
1% 0 0 0 0 3 0 151 145 0 0 >60 100% 3% Tf 2% 0 0 0 0
0% 0 0 0 0 2 0 7 5 0 0 >60 99% 2% : 2% 0 0 0 0
1% 10 0 0 10 5 3 122 132 0 0 >60 100% 5% T 4% 0 0 0 0
0% 2 0 0 2 6 1 1 0 0 0 >60 100% 0% - 0% 0 0 0 0
1% 8 0 0 8 5 3 123 141 0 0 >60 100% 5% T 3% 0 0 0 0
3% 96 0 0 96 3219 88 1 0 0 0 >60 100% 0% - 0% 0 0 0 0
6% 3 0 0 3 21 1 392 3791 0 0 >60 100% 19% T 15% 0 0 0 0
SECOND :
gea> sysstat -x -c 10 6
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
26% 245 0 0 245 8326 235 689 13038 0 0 >60 100% 35% 2 28% 0 0 0 0
0% 11 0 0 11 253 11 0 0 0 0 >60 100% 0% - 0% 0 0 0 0
16% 235 0 0 235 7984 222 431 6555 0 0 >60 100% 19% T 13% 0 0 0 0
1% 7 0 0 7 188 9 1 0 0 0 >60 100% 0% - 0% 0 0 0 0
5% 1 0 0 1 12 4 439 2781 0 0 >60 100% 16% T 20% 0 0 0 0
3% 97 0 0 97 3283 93 1 0 0 0 >60 100% 0% - 0% 0 0 0 0
6% 1 0 0 1 18 4 284 3787 0 0 >60 100% 19% T 14% 0 0 0 0
0% 0 0 0 0 13 4 2 0 0 0 >60 99% 0% - 0% 0 0 0 0
1% 0 0 0 0 13 4 127 138 0 0 >60 100% 5% T 4% 0 0 0 0
1% 0 0 0 0 12 4 140 157 0 0 >60 100% 5% T 4% 0 0 0 0
Any ideas?
Jakub,
I've found a strange thing:
Both the Linux host and the filer have one public network interface and one private interface. The private interface is plugged into the same 3Com switch as the Linux host and many other servers, and the public one into a switch not managed by me.
It goes like this:
- If I ping the filer from the Linux host, using its private address, I get 60% packet loss
- If I ping the filer from the Linux host, using its public address, I get 0% packet loss
Then I thought ... That's it! There must be a problem with the private network interface ...
But ... if I ping the filer at its public address from, for instance, my workstation, I get 60% packet loss again.
I don't get it!!!!!!
Edit: Sorry, forget this message. It happens on both the public and private interfaces no matter from where I issue the ping.
I don't understand how you can ping the public interface from the Linux host (the traceroute flow should be completely different). You have some default routing, I guess?
On the NetApp, review the logs via "rdfile /etc/messages", and check ifstat -a, ifinfo -a, etc.
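To narrow things down to one interface, something like this (interface name taken from your earlier ifconfig -a output) should show the per-interface receive and transmit error counters:
gea> ifstat e0b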
Some further ideas, from Linux:
1) traceroute -I <publicNetappIP>
2) mii-tool
3) perhaps don't try to ICMP flood, but try something more sane: ping -i 0.2 -c 100 <netappIP> # paste only last 3-4 lines
4) netstat -rn
5) ensure you don't have firewalling enabled:
a) iptables -nvL
b) lsmod | grep -i -e track -e conn -e nf -e ip
From the NetApp side I can see that you are mostly using the 100 Mb/s link (high Ipkts and Opkts values in the netstat -in output).
ping -f is not the best tool for checking this due to various ICMP rate limits. I would set up a normal/typical NFS mount from the NetApp filer using the options "tcp,hard,intr,rsize=32768,wsize=32768" and start from there.
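For example, an /etc/fstab entry along these lines (export path and mount point are placeholders, adjust to your environment):
gea:/vol/vol0/home  /mnt/gea  nfs  tcp,hard,intr,rsize=32768,wsize=32768  0 0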
-Jakub.
Hi David,
You may also want to check to see what your option:
"ip.ping_throttle.drop_level"
is set to. It is likely at the default of 150.
"ip.ping_throttle.drop_level
Specifies the maximum number of ICMP echo or echo reply packets (ping
packets) that the filer will accept per second. Any further packets
within one second are dropped to prevent ping flood denial of service
attacks. The default value is 150."
Your client could very well be sending more than whatever this value is per second. You can test this pretty quickly by changing the value of this option to something quite high and running the 'ping flood' test again.
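For example (the value is only an illustration; anything comfortably above your flood rate will do):
gea> options ip.ping_throttle.drop_level 4096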
Best Regards,
-paul
Hi Paul!
Thanks for the tip. I've tried it and, unfortunately, the behaviour is exactly the same.
I was having exactly the same issue with a FAS3020 while pinging it with a flood ping.
You can simply disable ping throttling by doing:
filer> options ip.ping_throttle.drop_level 0
This will disable the ping throttling mechanism and the filer won't be dropping packets anymore. After I changed this option I did not see ICMP packet loss anymore.
You can read more about it in the Data ONTAP Network Management Guide.
Is the answer editing the config to disable the ping throttling? I'm getting the same disconnects from the server/VMs. I tried pinging from the server to the filer and got 30 to 50% packet loss. I tried a pktt trace from the filer to the host IP and had no connection loss, but now I can't vSphere to the host nor PuTTY into the VM using the filer datastore.
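For reference, I started the trace on the filer with roughly these commands (the interface name and host IP are placeholders for my setup):
filer> pktt start e0a -i <hostIP>
filer> pktt dump e0a
filer> pktt stop e0a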
thanks for the help---