Network and Storage Protocols
Dear NetApp Users,
I'm having some problems with Linux NFS clients connecting to a FAS270 filer. The error is the infamous:
nfs: server xxxx not responding, still trying
nfs: server xxxx not responding, still trying
nfs: server xxxx not responding, still trying
nfs: server xxxx OK
nfs: server xxxx OK
nfs: server xxxx OK
I've tried different options on the client side (both UDP and TCP NFS). Depending on the combination of NFS mount flags the problem becomes less frequent, but sooner or later this message always shows up.
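Just to illustrate the kind of combinations I mean (the export path, mount point and exact values below are only placeholders, not my real config):
[root@bender root]# mount -t nfs -o tcp,hard,intr,rsize=32768,wsize=32768,timeo=600 gea:/vol/vol0/home /mnt/gea
[root@bender root]# mount -t nfs -o udp,hard,intr,rsize=8192,wsize=8192,timeo=11,retrans=5 gea:/vol/vol0/home /mnt/gea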
Since there are two hops from those Linux clients to the filer, I thought a network problem might be occurring.
So I started to check for network errors and found this very alarming behaviour with the filer:
If I issue a "ping -f filer" I see very high packet loss (from 30% to 70%). I've checked this between other hosts on the same switch and LAN and it doesn't happen (0% packet loss). It only happens when I ping a filer, no matter whether it's on the same switch (0 hops) or on another switch in a different place (2 hops).
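For reference, the test is essentially this, run as root from the Linux host (the packet count is only there to bound the flood; the address is the filer's private interface):
[root@bender root]# ping -f -c 1000 192.168.0.201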
Is this normal?
The filer is a NetApp FAS270 running Data ONTAP 7.2.1.1. The client machines from which I issue the pings are Linux (kernel 2.6) and Mac OS X boxes.
Thanks in advance.
Best Regards,
David
Hi,
have you tried on your Linux box:
1) netstat -in
2) traceroute -I <netapp>
3) ip -s link show
On the netapp itself:
1) netstat -in
2) vif status
3) sysconfig -c 10 6
Please paste the results here.
-Jakub.
Hi Jakub!
Thank you very much for your prompt reply.
This is the output of the commands you suggested:
On the Linux side:
[root@bender root]# netstat -in
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 433164336 0 0 0 216933618 0 0 0 BMRU
eth1 1500 0 457557747 11 0 0 325302497 0 0 0 BMRU
lo 16436 0 91271105 0 0 0 91271105 0 0 0 LRU
[root@bender root]# traceroute -I 192.168.0.201
traceroute to 192.168.0.201 (192.168.0.201), 30 hops max, 38 byte packets
1 gea (192.168.0.201) 0.213 ms (255) 0.132 ms (255) 0.107 ms (255)
[root@bender root]# ifconfig
eth1 Link encap:Ethernet HWaddr 00:13:72:52:BB:86
inet addr:192.168.0.100 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:457567208 errors:11 dropped:0 overruns:0 frame:6
TX packets:325311954 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3836596886 (3.5 GiB) TX bytes:4211863382 (3.9 GiB)
Base address:0xdcc0 Memory:fe4e0000-fe500000
On the filer side:
gea> netstat -in
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Collis Queue
e0a 1500 158.109.208/ 158.109.209. 2g 11 2g 0 0 0
e0b 1500 192.168.0/24 192.168.0.20 1g 0 902m 0 0 0
lo 9188 127 127.0.0.1 87k 0 87k 0 0 0
gea> vif status
No configured vifs present
gea> sysconfig -c 10 6
No arguments are allowed with this option.
usage: sysconfig [ -A | -c | -d | -m | -r | -t | -V ]
sysconfig [ -av ] [ <slot> ]
gea> sysconfig -c
sysconfig: There are no configuration errors.
gea> sysconfig
NetApp Release 7.2.1.1: Tue Jan 23 00:43:25 PST 2007
System ID: 0084276005 (gea)
System Serial Number: 2065922 (gea)
System Rev: D0
slot 0: System Board
Processors: 2
Processor revision: B2
Processor type: 1250
Memory Size: 1022 MB
slot 0: FC Host Adapter 0b
14 Disks: 5923.5GB
1 shelf with AT-FCX, 1 shelf with EFH
slot 0: FC Host Adapter 0c
slot 0: Dual SB1250-Gigabit Ethernet Controller
e0a MAC Address: 00:a0:98:07:0f:26 (auto-100tx-fd-up)
e0b MAC Address: 00:a0:98:07:0f:27 (auto-1000t-fd-up)
slot 0: NetApp ATA/IDE Adapter 0a (0x00000000000001f0)
0a.0 245MB
OK, correction, on the NetApp:
1) ifconfig -a
2) sysstat -c 10 6
3) sysstat -x -c 10 6
-Jakub.
Hi Jakub,
Thanks for helping me. Here's what you asked for:
1) ifconfig -a
I'm hiding the public IP of the host here to avoid problems. I guess that's OK.
gea> ifconfig -a
e0a: flags=848043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 158.109.X.X netmask 0xfffff800 broadcast 158.109.X.X
ether 00:a0:98:07:0f:26 (auto-100tx-fd-up) flowcontrol full
e0b: flags=848043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.0.201 netmask 0xffffff00 broadcast 192.168.0.255
ether 00:a0:98:07:0f:27 (auto-1000t-fd-up) flowcontrol full
lo: flags=1948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 9188
inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
2) sysstat -c 10 6
I see that this command is kind of a "top" for CPU and network traffic on the NetApp, so, just in case, I attach two outputs: the first without pinging the filer, the second while pinging with -f from the Linux host.
FIRST :
gea> sysstat -c 10 6
CPU NFS CIFS HTTP Net kB/s Disk kB/s Tape kB/s Cache
in out read write read write age
7% 0 0 0 2 0 456 4988 0 0 >60
0% 0 0 0 5 0 1 0 0 0 >60
1% 0 0 0 2 0 124 134 0 0 >60
2% 52 0 0 891 29 121 131 0 0 >60
0% 0 0 0 3 0 1 0 0 0 >60
2% 0 0 0 2 0 276 1157 0 0 >60
0% 0 0 0 3 0 1 0 0 0 >60
5% 125 0 0 4294 117 125 134 0 0 >60
4% 0 0 0 4 0 156 772 0 0 >60
4% 6 0 0 6 1 125 4260 0 0 >60
SECOND :
gea> sysstat -c 10 6
CPU NFS CIFS HTTP Net kB/s Disk kB/s Tape kB/s Cache
in out read write read write age
0% 1 0 0 12 4 0 0 0 0 >60
1% 3 0 0 17 4 137 150 0 0 >60
0% 0 0 0 12 4 0 0 0 0 >60
1% 4 0 0 15 4 109 122 0 0 >60
1% 2 0 0 11 4 126 141 0 0 >60
0% 0 0 0 11 4 1 0 0 0 >60
1% 2 0 0 13 22 128 137 0 0 >60
0% 0 0 0 12 4 1 0 0 0 >60
1% 13 0 0 14 6 143 145 0 0 >60
0% 0 0 0 11 4 27 1 0 0 >60
3) sysstat -x -c 10 6
Same as before, two outputs just in case: the first with normal filer operation, the second with normal filer operation plus ping -f from the Linux host.
FIRST :
gea> sysstat -x -c 10 6
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
4% 98 0 0 98 3211 88 132 141 0 0 >60 100% 5% T 4% 0 0 0 0
0% 2 0 0 2 3 0 0 0 0 0 >60 100% 0% - 0% 0 0 0 0
6% 0 0 0 0 3 0 485 3801 0 0 >60 100% 20% T 16% 0 0 0 0
1% 0 0 0 0 3 0 151 145 0 0 >60 100% 3% Tf 2% 0 0 0 0
0% 0 0 0 0 2 0 7 5 0 0 >60 99% 2% : 2% 0 0 0 0
1% 10 0 0 10 5 3 122 132 0 0 >60 100% 5% T 4% 0 0 0 0
0% 2 0 0 2 6 1 1 0 0 0 >60 100% 0% - 0% 0 0 0 0
1% 8 0 0 8 5 3 123 141 0 0 >60 100% 5% T 3% 0 0 0 0
3% 96 0 0 96 3219 88 1 0 0 0 >60 100% 0% - 0% 0 0 0 0
6% 3 0 0 3 21 1 392 3791 0 0 >60 100% 19% T 15% 0 0 0 0
SECOND :
gea> sysstat -x -c 10 6
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
26% 245 0 0 245 8326 235 689 13038 0 0 >60 100% 35% 2 28% 0 0 0 0
0% 11 0 0 11 253 11 0 0 0 0 >60 100% 0% - 0% 0 0 0 0
16% 235 0 0 235 7984 222 431 6555 0 0 >60 100% 19% T 13% 0 0 0 0
1% 7 0 0 7 188 9 1 0 0 0 >60 100% 0% - 0% 0 0 0 0
5% 1 0 0 1 12 4 439 2781 0 0 >60 100% 16% T 20% 0 0 0 0
3% 97 0 0 97 3283 93 1 0 0 0 >60 100% 0% - 0% 0 0 0 0
6% 1 0 0 1 18 4 284 3787 0 0 >60 100% 19% T 14% 0 0 0 0
0% 0 0 0 0 13 4 2 0 0 0 >60 99% 0% - 0% 0 0 0 0
1% 0 0 0 0 13 4 127 138 0 0 >60 100% 5% T 4% 0 0 0 0
1% 0 0 0 0 12 4 140 157 0 0 >60 100% 5% T 4% 0 0 0 0
Any ideas?
Jakub,
I've found a strange thing:
Both the Linux host and the filer have one public network interface and one private interface. The private interface is plugged into the same 3Com switch as the Linux host and many other servers, and the public one into a switch not managed by me.
It goes like this:
- If I ping the filer from the Linux host, using its private address, I get 60% packet loss
- If I ping the filer from the Linux host, using its public address, I get 0% packet loss
Then I thought ... That's it! There must be a problem with the private network interface ...
But ... if I ping the filer at its public address from, for instance, my workstation, I get 60% packet loss again.
I don't get it!!!!!!
Edit: Sorry, forget this message. It happens on both the public and private interfaces no matter from where I issue the ping.
I don't understand how you can ping the public interface from the Linux host (the traceroute flow should be completely different). You have some default routing, I guess?
On the NetApp, review the logs via "rdfile /etc/messages", and check ifstat -a, ifinfo -a, etc.
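To narrow things down to one interface, something like this (interface name taken from your earlier ifconfig -a output) should show the per-interface receive and transmit error counters:
gea> ifstat e0b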
Some further ideas, from Linux:
1) traceroute -I <publicNetappIP>
2) mii-tool
3) perhaps don't try to ICMP flood, but try something more sane: ping -i 0.2 -c 100 <netappIP> # paste only last 3-4 lines
4) netstat -rn
5) ensure you don't have firewalling enabled:
a) iptables -nvL
b) lsmod | grep -i -e track -e conn -e nf -e ip
From the NetApp side I can see that you are mostly using the 100 Mb/s link (high Ipkts and Opkts values in the netstat -in output).
ping -f is not the best tool for checking this due to various ICMP rate limits. I would set up a normal/typical NFS mount from the NetApp filer using the options "tcp,hard,intr,rsize=32768,wsize=32768" and start from there.
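For example, an /etc/fstab entry along these lines (export path and mount point are placeholders, adjust to your environment):
gea:/vol/vol0/home  /mnt/gea  nfs  tcp,hard,intr,rsize=32768,wsize=32768  0 0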
-Jakub.
Hi David,
You may also want to check to see what your option:
"ip.ping_throttle.drop_level"
is set to. It is likely at the default of 150.
"ip.ping_throttle.drop_level
Specifies the maximum number of ICMP echo or echo reply packets (ping
packets) that the filer will accept per second. Any further packets
within one second are dropped to prevent ping flood denial of service
attacks. The default value is 150."
Your client could very well be sending more than whatever this value is per second. You can test this pretty quickly by changing the value of this option to something quite high and running the 'ping flood' test again.
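For example (the value is only an illustration; anything comfortably above your flood rate will do):
gea> options ip.ping_throttle.drop_level 4096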
Best Regards,
-paul
Hi Paul!
Thanks for the tip. I've tried it and, unfortunately, the behaviour is exactly the same.
I was having exactly the same issue with a FAS3020 while pinging it with a flood ping.
You can simply disable ping throttling by doing:
filer> options ip.ping_throttle.drop_level 0
This will disable the ping throttling mechanism and the filer won't be dropping packets anymore. After I changed this option I did not see ICMP packet loss anymore.
You can read more about it in the Data ONTAP Network Management Guide.
Is the answer editing the config to disable the ping throttling? I'm getting the same disconnects from the server/VMs. I tried pinging from the server to the filer and got 30 to 50% packet loss. I tried a pktt trace from the filer to the host IP and had no connection loss, but now I can't vSphere to the host nor PuTTY into the VM using the filer datastore.
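For reference, I started the trace on the filer with roughly these commands (the interface name and host IP are placeholders for my setup):
filer> pktt start e0a -i <hostIP>
filer> pktt dump e0a
filer> pktt stop e0a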
thanks for the help---