ONTAP Hardware
Hi
I hope someone can help me with this issue. It's not a normal sort of network issue, that's for sure. I have checked everything I can think of and have provided as much detail below as I can. Feel free to ask for more if you have any ideas.
Thanks
Summary:
NetApp FAS2020 controller (Data ONTAP 7-Mode)
Cisco 3750 Stack Switch Etherchannel LACP link
Version: NetApp Release 7.3.7P2: Sun Apr 21 03:24:44 PDT 2013
Issue: Regular timeouts for management traffic, SNMP, NDMP
Detail:
I am seeing what I believe is a layer 2 issue occurring on this one.
Intermittently (approximately every 20 minutes) our monitoring server, NetApp DFM and my workstation are all unable to communicate with the NetApp FAS2020, an old SAN we use for test/dev work. I can be entering commands on the filer when suddenly the connection is dropped. We have four of these controllers on the network and each has had the issue. A case was raised with NetApp but there was no resolution; the suggestion was simply to run performance gathering, which is of course no use if the link goes down.
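To pin down the "approximately every 20 minutes" cadence, it may help to log the outage timestamps and compute the gaps between them; a fixed interval strongly suggests a timer rather than load. A minimal sketch (the timestamps in the example are hypothetical):

```python
from datetime import datetime

def outage_intervals(timestamps):
    """Minutes between consecutive outage timestamps ('HH:MM' strings)."""
    times = [datetime.strptime(t, "%H:%M") for t in timestamps]
    return [(b - a).seconds // 60 for a, b in zip(times, times[1:])]

# e.g. outage_intervals(["09:00", "09:21", "09:40"]) -> [21, 19]
```

If the gaps cluster tightly around a single value, that points at a cache or lease expiry somewhere in the path rather than congestion.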
Every time the filer stops responding to pings, SNMP or NDMP, DFM sends a pile of alerts. I have seen nothing from NFS etc., however, that shows there is an issue there.
I have checked all the network configuration and can find no errors at all. What I do see, however, is a pile of odd counters in the netstat -s output indicating destination-unreachable issues.
The filer uses a trunked multimode (LACP) connection across two Ethernet ports, e0a and e0b.
The NFS, Server VLAN and Management subnets are all carried across this trunk.
NFS and the server subnet connect directly to the VM hosts, so no routing is required for them.
Management is pinged/connected from various subnets, so a local gateway is required; this is the management subnet's gateway.
I have out-of-band management enabled and can connect to the filer console through it. When the filer loses connectivity, if I ping the management gateway from the SAN (i.e. from the filer, ping the default gateway), the filer reports the gateway is alive after about a second AND outside connectivity is then restored as well: monitoring/DFM and workstation pings to the trunked management IP all work again.
The monitoring servers are on the 192.168.98.x subnet, so they don't require the gateway to connect back.
I have confirmed the IP ping-throttle settings are disabled, and multipath makes no difference either.
It's almost as if I have an ARP issue for some reason. Looking at the ARP cache, I did see the gateway's entry showing an incomplete MAC address, though this also appears to be an intermittent occurrence. I am not seeing any disconnect issues on the NFS path.
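If it is ARP, the roughly 20-minute cadence is suggestive: if I remember right, the classic BSD ARP cache lifetime (which ONTAP's IP stack descends from) defaults to 20 minutes. One way to confirm would be to watch the gateway's ARP entry from the out-of-band console and log when it goes incomplete. A minimal sketch of the parsing side, assuming BSD-style `arp -a` output (the gateway IP is from this post; the output format shown in the docstring is an assumption):

```python
def gateway_arp_state(arp_output: str, gateway_ip: str) -> str:
    """Return 'ok', 'incomplete', or 'missing' for the gateway's ARP entry.

    Assumes BSD-style `arp -a` lines such as:
      ? (192.168.97.129) at 00:1a:2b:3c:4d:5e on e0a
      ? (192.168.97.129) at (incomplete) on e0a
    """
    for line in arp_output.splitlines():
        if f"({gateway_ip})" not in line:
            continue
        return "incomplete" if "incomplete" in line else "ok"
    return "missing"
```

Running something like this every 30 seconds and timestamping the transitions would show whether the "incomplete" state lines up with the ping/SNMP/NDMP timeouts.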
The Cisco 3750 switches show no issues in their logs or on their interfaces either. The traffic path across them is also short. I see no timeouts on any other devices on the LAN.
Switch Topology:
Core Switch ----------> Datacenter Switch ---------------> SAN e0a,e0b (Mgmt VLAN)
Filer Network configuration:
STOR01> rdfile /etc/rc
hostname STOR01
ifconfig e0a flowcontrol full
ifconfig e0b flowcontrol full
vif create lacp VIF-STOR01 -b ip e0a e0b
vlan create VIF-STOR01 70 90 94
ifconfig VIF-STOR01-70 `hostname`-VIF-STOR01-70 netmask 255.255.255.0 partner VIF-STOR02-70 mtusize 1500 trusted -wins up
ifconfig VIF-STOR01-70 alias 192.168.98.21 netmask 255.255.255.0
ifconfig VIF-STOR01-90 `hostname`-VIF-STOR01-90 netmask 255.255.255.192 partner VIF-STOR02-90 mtusize 1500 trusted -wins up
ifconfig VIF-STOR01-90 alias 192.168.97.11 netmask 255.255.255.192
ifconfig VIF-STOR01-94 `hostname`-VIF-STOR01-94 netmask 255.255.255.128 partner VIF-STOR02-94 mtusize 1500 trusted -wins up
route add default 192.168.97.129 1
routed off
options dns.domainname name.com
options dns.enable on
options nis.enable off
savecore
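As background on the `vif create lacp VIF-STOR01 -b ip e0a e0b` line in the rc file above: with IP-based load balancing, the outbound link for each conversation is chosen by hashing the source and destination IP addresses, so a given host pair always lands on the same physical port. A toy illustration of the concept (this is not ONTAP's actual hash, just the idea):

```python
def pick_link(src_ip: str, dst_ip: str, n_links: int = 2) -> int:
    """Toy IP-pair hash: XOR the last octets, modulo the link count.

    Illustrative only -- ONTAP's real `-b ip` hash is internal; the point
    is that the result is deterministic for a given src/dst pair.
    """
    last_octet = lambda ip: int(ip.rsplit(".", 1)[1])
    return (last_octet(src_ip) ^ last_octet(dst_ip)) % n_links
```

One consequence worth keeping in mind when capturing traffic: the switch side of the port-channel may hash differently, so frames for the same host pair can leave on one member port and return on the other, which means a SPAN of a single member port can miss half the conversation (including ARP replies).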
ip.fastpath.enable on (value might be overwritten in takeover)
ip.icmp_ignore_redirect.enable off (value might be overwritten in takeover)
ip.ipsec.enable off
ip.match_any_ifaddr on (value might be overwritten in takeover)
ip.path_mtu_discovery.enable on (value might be overwritten in takeover)
ip.ping_throttle.alarm_interval 0 (value might be overwritten in takeover)
ip.ping_throttle.drop_level 0 (value might be overwritten in takeover)
ip.tcp.newreno.enable on (value might be overwritten in takeover)
ip.tcp.sack.enable on (value might be overwritten in takeover)
ip.v6.enable off (value might be overwritten in takeover)
ip.v6.ra_enable on (value might be overwritten in takeover)
3750 Config:
interface GigabitEthernet2/0/18
description EC7 SAN01 C2 e0b
switchport trunk encapsulation dot1q
switchport mode trunk
flowcontrol receive on
channel-group 7 mode active
spanning-tree portfast trunk
spanning-tree guard root
end
interface Port-channel7
description EC7 SAN01 C2
switchport trunk encapsulation dot1q
switchport mode trunk
flowcontrol receive on
end
Filer netstat -s ICMP counters:
icmp:
44 calls to icmp_error
0 errors not generated because old message was icmp
Output histogram:
echo reply: 2160282
destination unreachable: 44
0 messages with bad code fields
0 messages < minimum length
0 bad checksums
0 messages with bad length
Input histogram:
echo reply: 57
destination unreachable: 3884
echo: 2160282
0 pings dropped due to throttling
0 ping replies dropped due to throttling
0 redirects ignored
2160282 message responses generated
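Those counters are easier to trend than to eyeball: 3884 incoming destination-unreachables against roughly 2.16M echoes. A small parser to snapshot the two histograms periodically and diff them over time (a sketch matching the netstat -s output format quoted above):

```python
def parse_icmp_histograms(netstat_s: str) -> dict:
    """Split netstat -s ICMP output into Output/Input histogram counters.

    Section headers and counter names follow the output quoted in this
    post; a line without a 'name: number' shape ends the current section.
    """
    section, hists = None, {"output": {}, "input": {}}
    for raw in netstat_s.splitlines():
        line = raw.strip()
        if line == "Output histogram:":
            section = "output"
        elif line == "Input histogram:":
            section = "input"
        elif section and ":" in line:
            name, _, value = line.rpartition(":")
            if value.strip().isdigit():
                hists[section][name.strip()] = int(value)
            else:
                section = None  # left the histogram block
        else:
            section = None
    return hists
```

Diffing two snapshots taken before and after an outage window would show whether the destination-unreachable count jumps exactly when connectivity drops, or grows steadily in the background.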
Did you have any luck resolving this issue?
Yup, we have a FAS2240... same issue. Any resolution?