I hope someone can help me with this issue. It's not a normal sort of network issue, that's for sure. I have checked everything I can think of, and I have provided as much detail below as I can. Feel free to ask for more if you have some ideas.
NetApp FAS2020 Cluster-Mode Controller
Cisco 3750 stack switch, EtherChannel LACP link
Version: NetApp Release 7.3.7P2: Sun Apr 21 03:24:44 PDT 2013
Issue: regular timeouts for management traffic, SNMP, NDMP
I believe I am seeing a Layer 2 issue on this one.
Intermittently (approximately every 20 minutes) our monitoring server, NetApp DFM and my workstation are all unable to communicate with the old NetApp FAS2020 SAN we use for Test/Dev work. I can be entering commands on the filer and suddenly the connection is dropped. We have four of these controllers on the network and each has had the issue. A case was raised with NetApp but there was no resolution; the suggestion was simply to run performance gathering, which is of course no use if the link goes down.
Every time the filer stops responding to pings, SNMP or NDMP, DFM sends a pile of alerts. I have seen nothing, however, that indicates any issue on the NFS side.
I have checked all the network configuration and can find no errors at all. What I do see, however, is a pile of odd counters in the netstat -s output indicating destination issues (the icmp_error calls shown below).
The filer uses a trunked multimode (LACP) connection across two Ethernet ports, e0a and e0b.
The subnets for NFS, the Server VLAN and Management are all carried across this trunk.
NFS and the Server subnet connect directly to the VM hosts, so no routing is required for them.
Management is pinged/connected from various subnets, so a local gateway is required; this is the Management subnet gateway.
I have out-of-band management enabled and can connect to the filer console through it. When the filer loses connectivity, if I ping the management gateway from the filer console (i.e. from the filer, ping the default gateway), the filer reports the gateway is alive after about a second, AND outside connectivity is restored at the same time: Monitoring/DFM and workstation pings to the trunked management IP start responding again.
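As an interim workaround I have been considering scripting that same manual fix from an admin box: probe the filer's management IP and the gateway every few minutes so the relevant ARP/CAM entries never age out. A minimal sketch of what I mean, run from a Linux monitoring host (the IPs are placeholders for our real addresses, and the flags are Linux iputils ping flags):

```python
import subprocess
import time

GATEWAY = "192.168.98.1"   # placeholder: management subnet gateway
FILER = "192.168.98.50"    # placeholder: filer's trunked management IP
INTERVAL = 300             # seconds between keepalive probes

def build_ping_cmd(host, count=1, timeout=2):
    """Linux iputils ping: -c packet count, -W reply timeout in seconds."""
    return ["ping", "-c", str(count), "-W", str(timeout), host]

def probe(host):
    """Return True if the host answered a single ping."""
    return subprocess.run(build_ping_cmd(host),
                          stdout=subprocess.DEVNULL).returncode == 0

if __name__ == "__main__":
    while True:
        for host in (FILER, GATEWAY):
            ok = probe(host)
            print(time.strftime("%H:%M:%S"), host, "up" if ok else "DOWN")
        time.sleep(INTERVAL)
```

Obviously this masks the problem rather than fixing it, but it would at least stop the DFM alert storms while I dig further.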
The monitoring servers are on the 192.168.98.x subnet, so they don't require the gateway to reach the filer.
I have checked that the IP ping-throttle settings are disabled, and multipath makes no difference either.
It's almost as if I have an ARP issue for some reason. Looking at the ARP cache, the gateway's entry sometimes shows an incomplete MAC address, though this too appears to be intermittent. I am not seeing any disconnect issues on the NFS path.
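If it is ARP expiry, the drops should recur on a fairly fixed cycle, and the ~20-minute cadence is suspicious: 1200 seconds is, I believe, the classic BSD ARP entry lifetime, and ONTAP's IP stack is BSD-derived. To check the cadence, I have been thinking of timestamping the drops from the monitoring server with something like the sketch below (the filer IP is a placeholder, and the ping flags are Linux iputils flags):

```python
import subprocess
import time

FILER = "192.168.98.50"   # placeholder: filer's trunked management IP
INTERVAL = 30             # probe every 30 s

def find_outages(samples):
    """samples: list of (timestamp, reachable) pairs sorted by time.
    Returns (start, end) pairs: start is the first failed probe, end is
    the first successful probe after it (None if still down at the end)."""
    outages, start = [], None
    for t, ok in samples:
        if not ok and start is None:
            start = t
        elif ok and start is not None:
            outages.append((start, t))
            start = None
    if start is not None:
        outages.append((start, None))
    return outages

def probe(host):
    """True if the host answers one ping (-c count, -W timeout in seconds)."""
    cmd = ["ping", "-c", "1", "-W", "2", host]
    return subprocess.run(cmd, stdout=subprocess.DEVNULL).returncode == 0

if __name__ == "__main__":
    samples = []
    try:
        while True:
            samples.append((time.time(), probe(FILER)))
            time.sleep(INTERVAL)
    except KeyboardInterrupt:
        for s, e in find_outages(samples):
            print("down from", time.ctime(s),
                  "until", time.ctime(e) if e else "(still down)")
```

If the intervals between outage starts cluster around 1200 s, that would point firmly at an ARP timer rather than anything on the switch side.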
The Cisco 3750 switches show no issues in their logs or on their interfaces either. The traffic path across them is short, and I see no timeouts to any other devices on the LAN.
ip.fastpath.enable              on   (value might be overwritten in takeover)
ip.icmp_ignore_redirect.enable  off  (value might be overwritten in takeover)
ip.ipsec.enable                 off
ip.match_any_ifaddr             on   (value might be overwritten in takeover)
ip.path_mtu_discovery.enable    on   (value might be overwritten in takeover)
ip.ping_throttle.alarm_interval 0    (value might be overwritten in takeover)
ip.ping_throttle.drop_level     0    (value might be overwritten in takeover)
ip.tcp.newreno.enable           on   (value might be overwritten in takeover)
ip.tcp.sack.enable              on   (value might be overwritten in takeover)
ip.v6.enable                    off  (value might be overwritten in takeover)
ip.v6.ra_enable                 on   (value might be overwritten in takeover)
interface GigabitEthernet2/0/18
 description EC7 SAN01 C2 e0b
 switchport trunk encapsulation dot1q
 switchport mode trunk
 flowcontrol receive on
 channel-group 7 mode active
 spanning-tree portfast trunk
 spanning-tree guard root
end

interface Port-channel7
 description EC7 SAN01 C2
 switchport trunk encapsulation dot1q
 switchport mode trunk
 flowcontrol receive on
end
Filer netstat -s errors:
44 calls to icmp_error
0 errors not generated because old message was icmp