I'm a network engineer working with a team that manages a NetApp FAS8200 connected to my network via a four-port interface group with LACP (e0e, e0f, e0g, e0h). We're having an issue where VMs disconnect from storage for 30-60 seconds when we unplug two of the four links from the interface group. During this time, we only drop one (or zero) pings between the VMware host and the NetApp - so the network seems fine overall.
The one thing I noticed from the network side is that I'm being sent two different system IDs from the NetApp. On two links, I'm getting 00:00:00:00:00:00. On the other two links, I'm seeing a normal-looking MAC/system ID that matches on those interfaces.
Obviously this is configured incorrectly on the NetApp side. Every other port-channel/aggregated-ethernet bundle I have on my switches sees the same system ID coming from the remote end on all member interfaces and everything works fine.
Also - what we determined is that we can pull two links with no failure, as long as we pull one from each system ID. It only fails when we pull both links with the same system ID.
This is also tied to a VIF and they can't tell me whether or not it is failing to another node when this happens.
Has anyone else seen something similar? Could you point me to what I should tell the NetApp admins on our side to look at with their configuration?
I'm assuming this is configured for NFS datastores for VMware?
It's a pretty straightforward config for LACP groups on ONTAP, but let's look at the NetApp config. Do you have access, or can you get them to send you output from a few commands?
Can you post:
net int show
net port show
net port ifgrp show -instance
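If it helps, here's roughly what a healthy four-port LACP ifgrp looks like in that last command (illustrative output only - the node/ifgrp names and MAC are made up, but the fields are standard). The things to check are that all four ports are in ONE interface group, the create policy is multimode_lacp, and port participation is "full":

```
cluster::> network port ifgrp show -instance -node node1 -ifgrp a0a

                 Node: node1
 Interface Group Name: a0a
Distribution Function: port
        Create Policy: multimode_lacp
          MAC Address: 02:a0:98:xx:xx:xx
   Port Participation: full
        Network Ports: e0e, e0f, e0g, e0h
             Up Ports: e0e, e0f, e0g, e0h
           Down Ports: -
```

If you instead see two 2-port ifgrps (say e0e/e0f in one and e0g/e0h in another), that would line up with the two system IDs you're seeing.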
"This is also tied to a VIF and they can't tell me whether or not it is failing to another node when this happens."
Running "net int show" during the "broken" state will show whether each LIF is home or not.
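For example (purely a sketch - the LIF names would be whatever they've configured), during the outage they could run:

```
cluster::> network interface show -is-home false
cluster::> network interface show -fields home-node,home-port,curr-node,curr-port
```

If any NFS LIF shows up as not home, or curr-port differs from home-port, then a LIF is migrating during your link pulls.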
Can you clarify this: "Also - what we determined is that we can pull two links with no failure - as long as we pull one from each system-ID. It's only when we pull both links with the same system-ID that it fails."
Are you using VCP?
Also, make sure everyone is following best-practice configs for VMware/NetApp/etc.
Correction - the system IDs do match. They had disabled two interfaces for a test right when I ran the command to check system IDs.
But the behavior is:
e0e, e0f, e0g, e0h are in an interface group with LACP.
- disconnect any one interface = no failure
- disconnect e0e, e0f = failure
- disconnect e0g, e0h = failure
- disconnect one of e0e/e0f AND one of e0g/e0h = no failure
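One thing you can check from your side (assuming Cisco here - the equivalent exists on most vendors) is whether all four member links report the same partner system ID while everything is healthy:

```
switch# show lacp neighbor
```

If e0e/e0f report one partner system ID and e0g/e0h report another, the NetApp side is presenting two separate aggregates, which would explain exactly this failure pattern.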
I don't have access to run any commands, and the team we're working with has been pretty hostile, insisting the network is the issue. We've had at least 10 different people from the vendor look at it, and they've all insisted that the network configuration and behavior are exactly as they're supposed to be. The fact that we don't drop pings means the network is working. Everything is on the same VLAN, going through switches only, so the network wouldn't treat pings differently than other types of traffic.
I wish we had more visibility, but I was hoping that if I described the symptoms here, it might be something obvious.
In ONTAP/NetApp, a LIF migrating, or an ifgrp failing (even partially), shouldn't cause any outage. The LIF will move according to whatever failover policy it has when the port/ifgrp becomes unavailable, but that's it.
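As a sketch of what to ask them for (standard ONTAP CLI - the vserver/LIF names would be theirs):

```
cluster::> network interface show -fields failover-policy,failover-group
cluster::> network interface failover-groups show
```

That shows where each LIF is allowed to fail over to; a misconfigured failover group can turn what should be a seamless LIF move into an outage.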
But at this point, without looking at any configs/layouts (network or NetApp) or knowing what protocol(s) are being used, there's not much more help I can offer.
I still have this odd feeling that they're trying to run NFSv4.