ONTAP Hardware
Hi
I hope someone can help me with this issue. It's not a normal sort of network issue, that's for sure. I have checked everything I can think of and have provided as much detail below as I can. Feel free to ask for more if you have any ideas.
Thanks
Summary:
NetApp FAS2020 controller (Data ONTAP 7-Mode)
Cisco 3750 Stack Switch Etherchannel LACP link
Version: NetApp Release 7.3.7P2: Sun Apr 21 03:24:44 PDT 2013
Issue: Regular timeouts for management traffic, SNMP, NDMP
Detail:
I am seeing what I believe is a layer 2 issue occurring on this one.
Intermittently (approximately every 20 minutes) our monitoring server, NetApp DFM and my workstation are all unable to communicate with the NetApp FAS2020, an old SAN we use for test/dev work. I can be entering commands on the filer when suddenly the connection is dropped. We have four of these controllers on the network and each has had the issue. A case was raised with NetApp but there was no resolution; the suggestion was simply to run performance gathering, which is of course no use if the link goes down.
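To pin down the "approximately every 20 minutes" cadence, it may help to log the outage timestamps and compute the gaps between them; a fixed interval strongly suggests a timer rather than load. A minimal sketch (the timestamps in the example are hypothetical):

```python
from datetime import datetime

def outage_intervals(timestamps):
    """Minutes between consecutive outage timestamps ('HH:MM' strings)."""
    times = [datetime.strptime(t, "%H:%M") for t in timestamps]
    return [(b - a).seconds // 60 for a, b in zip(times, times[1:])]

# e.g. outage_intervals(["09:00", "09:21", "09:40"]) -> [21, 19]
```

If the gaps cluster tightly around a single value, that points at a cache or lease expiry somewhere in the path rather than congestion.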
Every time the filer stops responding to pings, SNMP or NDMP, DFM sends a pile of alerts. I have seen nothing from NFS etc., however, that shows there is an issue there.
I have checked all the network configuration and can find no errors at all. What I do see, however, is a pile of odd counters in the netstat -s output indicating destination-unreachable issues.
The filer uses a trunked multimode (LACP) connection across two Ethernet ports, e0a and e0b.
The NFS, Server VLAN and Management subnets are all carried across this trunk.
NFS and the server subnet connect directly to the VM hosts, so no routing is required for them.
Management is pinged/connected from various subnets, so a local gateway is required; this is the management subnet's gateway.
I have out-of-band management enabled and can connect to the filer console through it. When the filer loses connectivity, if I ping the management gateway from the SAN (i.e. from the filer, ping the default gateway), the filer reports the gateway is alive after about a second AND outside connectivity is then restored as well: monitoring/DFM and workstation pings to the trunked management IP all work again.
The monitoring servers are on the 192.168.98.x subnet, so they don't require the gateway to connect back.
I have confirmed the IP ping-throttle settings are disabled, and multipath makes no difference either.
It's almost as if I have an ARP issue for some reason. Looking at the ARP cache, I did see the gateway's entry showing an incomplete MAC address, though this also appears to be an intermittent occurrence. I am not seeing any disconnect issues on the NFS path.
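If it is ARP, the roughly 20-minute cadence is suggestive: if I remember right, the classic BSD ARP cache lifetime (which ONTAP's IP stack descends from) defaults to 20 minutes. One way to confirm would be to watch the gateway's ARP entry from the out-of-band console and log when it goes incomplete. A minimal sketch of the parsing side, assuming BSD-style `arp -a` output (the gateway IP is from this post; the output format shown in the docstring is an assumption):

```python
def gateway_arp_state(arp_output: str, gateway_ip: str) -> str:
    """Return 'ok', 'incomplete', or 'missing' for the gateway's ARP entry.

    Assumes BSD-style `arp -a` lines such as:
      ? (192.168.97.129) at 00:1a:2b:3c:4d:5e on e0a
      ? (192.168.97.129) at (incomplete) on e0a
    """
    for line in arp_output.splitlines():
        if f"({gateway_ip})" not in line:
            continue
        return "incomplete" if "incomplete" in line else "ok"
    return "missing"
```

Running something like this every 30 seconds and timestamping the transitions would show whether the "incomplete" state lines up with the ping/SNMP/NDMP timeouts.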
The Cisco 3750 switches show no issues in their logs or on their interfaces either. The traffic path across them is also short. I see no timeouts on any other devices on the LAN.
Switch Topology:
Core Switch ----------> Datacenter Switch ---------------> SAN e0a,e0b (Mgmt VLAN)
Filer Network configuration:
STOR01> rdfile /etc/rc
hostname STOR01
ifconfig e0a flowcontrol full
ifconfig e0b flowcontrol full
vif create lacp VIF-STOR01 -b ip e0a e0b
vlan create VIF-STOR01 70 90 94
ifconfig VIF-STOR01-70 `hostname`-VIF-STOR01-70 netmask 255.255.255.0 partner VIF-STOR02-70 mtusize 1500 trusted -wins up
ifconfig VIF-STOR01-70 alias 192.168.98.21 netmask 255.255.255.0
ifconfig VIF-STOR01-90 `hostname`-VIF-STOR01-90 netmask 255.255.255.192 partner VIF-STOR02-90 mtusize 1500 trusted -wins up
ifconfig VIF-STOR01-90 alias 192.168.97.11 netmask 255.255.255.192
ifconfig VIF-STOR01-94 `hostname`-VIF-STOR01-94 netmask 255.255.255.128 partner VIF-STOR02-94 mtusize 1500 trusted -wins up
route add default 192.168.97.129 1
routed off
options dns.domainname name.com
options dns.enable on
options nis.enable off
savecore
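As background on the `vif create lacp VIF-STOR01 -b ip e0a e0b` line in the rc file above: with IP-based load balancing, the outbound link for each conversation is chosen by hashing the source and destination IP addresses, so a given host pair always lands on the same physical port. A toy illustration of the concept (this is not ONTAP's actual hash, just the idea):

```python
def pick_link(src_ip: str, dst_ip: str, n_links: int = 2) -> int:
    """Toy IP-pair hash: XOR the last octets, modulo the link count.

    Illustrative only -- ONTAP's real `-b ip` hash is internal; the point
    is that the result is deterministic for a given src/dst pair.
    """
    last_octet = lambda ip: int(ip.rsplit(".", 1)[1])
    return (last_octet(src_ip) ^ last_octet(dst_ip)) % n_links
```

One consequence worth keeping in mind when capturing traffic: the switch side of the port-channel may hash differently, so frames for the same host pair can leave on one member port and return on the other, which means a SPAN of a single member port can miss half the conversation (including ARP replies).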
ip.fastpath.enable on (value might be overwritten in takeover)
ip.icmp_ignore_redirect.enable off (value might be overwritten in takeover)
ip.ipsec.enable off
ip.match_any_ifaddr on (value might be overwritten in takeover)
ip.path_mtu_discovery.enable on (value might be overwritten in takeover)
ip.ping_throttle.alarm_interval 0 (value might be overwritten in takeover)
ip.ping_throttle.drop_level 0 (value might be overwritten in takeover)
ip.tcp.newreno.enable on (value might be overwritten in takeover)
ip.tcp.sack.enable on (value might be overwritten in takeover)
ip.v6.enable off (value might be overwritten in takeover)
ip.v6.ra_enable on (value might be overwritten in takeover)
3750 Config:
interface GigabitEthernet2/0/18
description EC7 SAN01 C2 e0b
switchport trunk encapsulation dot1q
switchport mode trunk
flowcontrol receive on
channel-group 7 mode active
spanning-tree portfast trunk
spanning-tree guard root
end
interface Port-channel7
description EC7 SAN01 C2
switchport trunk encapsulation dot1q
switchport mode trunk
flowcontrol receive on
end
Filer netstat -s ICMP counters:
icmp:
44 calls to icmp_error
0 errors not generated because old message was icmp
Output histogram:
echo reply: 2160282
destination unreachable: 44
0 messages with bad code fields
0 messages < minimum length
0 bad checksums
0 messages with bad length
Input histogram:
echo reply: 57
destination unreachable: 3884
echo: 2160282
0 pings dropped due to throttling
0 ping replies dropped due to throttling
0 redirects ignored
2160282 message responses generated
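Those counters are easier to trend than to eyeball: 3884 incoming destination-unreachables against roughly 2.16M echoes. A small parser to snapshot the two histograms periodically and diff them over time (a sketch matching the netstat -s output format quoted above):

```python
def parse_icmp_histograms(netstat_s: str) -> dict:
    """Split netstat -s ICMP output into Output/Input histogram counters.

    Section headers and counter names follow the output quoted in this
    post; a line without a 'name: number' shape ends the current section.
    """
    section, hists = None, {"output": {}, "input": {}}
    for raw in netstat_s.splitlines():
        line = raw.strip()
        if line == "Output histogram:":
            section = "output"
        elif line == "Input histogram:":
            section = "input"
        elif section and ":" in line:
            name, _, value = line.rpartition(":")
            if value.strip().isdigit():
                hists[section][name.strip()] = int(value)
            else:
                section = None  # left the histogram block
        else:
            section = None
    return hists
```

Diffing two snapshots taken before and after an outage window would show whether the destination-unreachable count jumps exactly when connectivity drops, or grows steadily in the background.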
Did you have any luck resolving this issue?
Yup, we have a FAS2240... same issue. Any resolution?