LACP Question

lando1234

I am a network engineer working with a team that manages a NetApp FAS8200. It is connected to my network through a 4-port interface group with LACP (e0e, e0f, e0g, e0h). We're having an issue where VMs disconnect from storage for 30-60 seconds when we unplug two of the four links from the interface group. During this time, we only drop one (or zero) pings between the VMware host and the NetApp, so the network seems fine overall.

 

The one thing I noticed from the network side is that the NetApp is sending me two different system IDs. On two links, I'm getting 00:00:00:00:00:00. On the other two links, I'm seeing a normal-looking MAC/system ID that matches across those interfaces.

 

Obviously this is configured incorrectly on the NetApp side.  Every other port-channel/aggregated-ethernet bundle I have on my switches sees the same system ID coming from the remote end on all member interfaces and everything works fine.

 

Also - what we determined is that we can pull two links with no failure - as long as we pull one from each system-ID.  It's only when we pull both links with the same system-ID that it fails.

 

This is also tied to a VIF and they can't tell me whether or not it is failing to another node when this happens.

 

Has anyone else seen something similar?  Could you point me to what I should tell the NetApp admins on our side to look at with their configuration?

 

Any help is appreciated.

 

Thanks!

7 REPLIES

SpindleNinja

I'm assuming this is the config for NFS datastores for VMware?

It's a pretty straightforward config for LACP groups on ONTAP, but let's look at the NetApp config. Do you have access, or can you get them to send you output from commands?

Can you post: 

row 0 

net int show 

net port show 

net port ifgrp show  -instance 

 

"This is also tied to a VIF and they can't tell me whether or not it is failing to another node when this happens."  

The command "net int show" during the "broken" state will show if a LIF is home or not.   

 

Can you clarify this: "Also - what we determined is that we can pull two links with no failure - as long as we pull one from each system-ID.  It's only when we pull both links with the same system-ID that it fails."

 

Are you using VCP? 

 

Also, make sure everyone is following the best-practice configs for VMware, NetApp, etc.

lando1234

Correction - the system IDs do match.  They had actually disabled two interfaces to test right when I ran the command to check system IDs.

 

But the behavior is:

 

e0e, e0f, e0g, e0h are in an interface group with LACP.

 

- disconnect any one interface = no failure

- disconnect e0e,e0f = failure

- disconnect e0g,e0h = failure

- disconnect one of e0e or e0f AND one of e0g or e0h = no failure
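
The observations above can be restated as a small lookup, purely as a sketch for reasoning about further tests. It assumes the second failing pair is e0g and e0h (which the no-failure case implies), and the names `OBSERVED_FAILING_PAIRS`, `outage_expected`, and `summarize` are made up for illustration. It only encodes what was seen; it says nothing about the cause.

```python
from itertools import combinations

PORTS = ["e0e", "e0f", "e0g", "e0h"]

# Pairs whose joint removal produced the storage outage in testing.
# Assumption: the second pair is e0g + e0h.
OBSERVED_FAILING_PAIRS = {
    frozenset({"e0e", "e0f"}),
    frozenset({"e0g", "e0h"}),
}


def outage_expected(pulled):
    """Return True if the pulled ports include a whole pair that failed in testing."""
    pulled = frozenset(pulled)
    return any(pair <= pulled for pair in OBSERVED_FAILING_PAIRS)


def summarize():
    """List every one- and two-port pull with the outcome the observations imply."""
    rows = []
    for count in (1, 2):
        for combo in combinations(PORTS, count):
            rows.append((combo, outage_expected(combo)))
    return rows


if __name__ == "__main__":
    for combo, fails in summarize():
        print(",".join(combo), "=", "failure" if fails else "no failure")
```

Laid out this way, the four ports behave like two pairs, and an outage only shows up when an entire pair is gone.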

 

I don't have access to run any commands and the team we're working with has been pretty hostile, insisting the network is the issue.  We've had at least 10 different people from the vendor look at it and they've all insisted that the network configuration and behavior is exactly as it is supposed to be.  The fact that we don't drop pings means that the network is working.  Everything is on the same VLAN, going through switches only - so the network wouldn't treat pings differently than other types of traffic.

 

I wish we had more visibility, but I was hoping I could describe the symptoms here and it would be something obvious.

 

SpindleNinja

And there's no VCP in play, correct?

 

Are they trying to run NFSv4 for VMware datastores? 

lando1234

On the switch side, it is spine & leaf EVPN/VXLAN.  I'm not sure what they're doing on the NetApp side.

 

We've recreated the way the NetApp connects and tested a bunch of failure scenarios with other gear, and we never see an outage of more than 100 ms when we fail links.

SpindleNinja

In ONTAP/NetApp, a LIF moving over or an ifgrp failing, or even partially failing, shouldn't cause any outage. The LIF will move, according to whatever failover policy it has, when the port/ifgrp becomes unavailable, but that's it.

 

But at this point, without looking at any configs/layouts (network or NetApp) or knowing what protocol(s) are being used, there's not much more help I can offer.

 

I still have this odd feeling that they're trying to run NFSv4. 

 

lando1234

We just got off a troubleshooting call with NetApp and the switching vendor. Not much happened, but they confirmed (verbally at least) that they aren't running NFSv4.

 

Now it's time for packet captures.
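
For the captures, one thing worth measuring is the longest silence per flow, since the pings and the storage traffic are separate flows and may not be hit the same way. Below is a minimal sketch, assuming the per-packet timestamps (in seconds) have already been exported from the capture for each flow; the export step is not shown, and `longest_gap` plus the sample numbers are hypothetical.

```python
def longest_gap(timestamps):
    """Return (gap_seconds, gap_start, gap_end) for the largest silence
    between consecutive packets, or (0.0, None, None) if fewer than two."""
    ordered = sorted(timestamps)
    best = (0.0, None, None)
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if gap > best[0]:
            best = (gap, prev, cur)
    return best


if __name__ == "__main__":
    # Hypothetical timestamps, not real capture data.
    flows = {
        "icmp": [0.0, 1.0, 2.0, 3.0, 4.0],
        "storage": [0.0, 0.5, 1.0, 46.0, 46.5],
    }
    for name, stamps in flows.items():
        gap, start, end = longest_gap(stamps)
        print(f"{name}: longest gap {gap:.1f}s between {start} and {end}")
```

Comparing the gap start time against the moment the links were pulled should show whether the storage flow stalls at the failure itself or some time after it.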

 

Thanks for your help!  I'll report back when we find the problem.

SpindleNinja

Good luck to ya. Post an update on the cause if/when it's found!

 
