Re: IFGRP port "imbalance"

colsen · ‎2017-11-28

Hello,

So I've been scratching my head on this one and was wondering if anyone else out there has run into it. We have an AFF8080 (2-node HA pair), and each node has an IFGRP of two 10GbE ports configured for LACP:

node   ifgrp  distr-func  mode            up-ports
-----  -----  ----------  --------------  --------
node1  a0a    ip          multimode_lacp  e0e,e0g

On "node1" we're seeing a good balance of I/O between ports e0e and e0g. Looking at the graphs in OnCommand UM, the two ports track each other pretty well, depending on what kind of work is going on.

On "node2" we're seeing a super-busy e0e port and a mostly idle e0g port. I did a few statit gathers and here's what we're seeing:

node   port  recv            xmit
-----  ----  --------------  --------------
node1  e0e   313,367,472.46  221,804,526.36
node1  e0g    31,750,676.82   16,976,874.19
node2  e0e   217,023,661.04  377,199,469.46
node2  e0g         4,764.75   20,303,389.42

Ignore the "lower" numbers for e0g on "node1" in the list above; it's usually just as busy as e0e. However, e0g on "node2" never seems to get much traffic. Both ports are healthy (no dropped packets, no retransmits, etc.), so it looks like e0g simply isn't carrying its share of the ifgrp on "node2".
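For anyone who wants to put a number on the split, here is a minimal sketch that turns the statit figures above into per-port shares for each direction. The dictionary layout and the `port_share` helper are made up for illustration; only the figures come from the statit output, and the units are whatever statit reported.

```python
# Figures copied from the statit output above (units as reported by statit).
# The nested-dict layout is just a convenient, hypothetical way to hold them.
stats = {
    "node1": {
        "e0e": {"recv": 313_367_472.46, "xmit": 221_804_526.36},
        "e0g": {"recv": 31_750_676.82, "xmit": 16_976_874.19},
    },
    "node2": {
        "e0e": {"recv": 217_023_661.04, "xmit": 377_199_469.46},
        "e0g": {"recv": 4_764.75, "xmit": 20_303_389.42},
    },
}


def port_share(node_stats, direction):
    """Return each port's fraction of the node's total traffic in one direction."""
    total = sum(port[direction] for port in node_stats.values())
    return {name: port[direction] / total for name, port in node_stats.items()}


for node, node_stats in stats.items():
    for direction in ("recv", "xmit"):
        shares = port_share(node_stats, direction)
        line = ", ".join(f"{name} {share:.2%}" for name, share in shares.items())
        print(f"{node} {direction}: {line}")
```

Looking at recv and xmit separately matters here, because the two directions are balanced by different devices (the switch for inbound, the node for outbound).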

I've got a note into our network team to see if they can see anything wonky with the port/switch/etc. Any ideas the community might have as to where to start would be greatly appreciated.

Thanks!

Chris

michael_england · ‎2017-11-28

I'm not sure I'd call the data you have balanced, but maybe that's just when you grabbed the stats. Do you have lots and lots of clients or just a few? You might want to check the load balancing on the switch side, as Cisco != NetApp load balancing. Maybe it's set incorrectly (the imbalance seems to be inbound to the NetApp), or maybe you've just got a couple of clients that happen to hit the same port from the switch's perspective.

https://www.cisco.com/c/en/us/support/docs/lan-switching/etherchannel/12023-4.html

In that doc there's a command to see what channel data will flow over:

e.g.

show channel hash 865 10.10.10.1 10.10.10.2

selected channel port: 1/1

That might help narrow things down if you have only a few clients. If you have lots and lots, check the load-balance method on the switch.

colsen · ‎2017-11-29

Thanks - we have a bunch of clients on these particular nodes (database servers, hypervisors, etc) all separated by VLANs but using the same IFGRP. The node1 stats I captured are somewhat of an anomaly - usually e0e and e0g are right in line on that node.

I'm hoping to engage with our network engineering team this week to see what they can see. We're Arista-based on the private storage network, but given that Cisco sued Arista for how much they copied IOS, the commands you provided should be pretty close.

The good thing is that we're not saturating any of our 10GbE pipes right now and our latency from the clients is right in alignment with what we'd expect. Just want to make sure we're not leaving throughput on the table going forward.

Thanks again,

Chris

hadrian · ‎2018-02-22

Hi Chris,

As the previous poster alluded to, incoming traffic distribution is controlled by the network switch's load balancing policy, whereas outgoing traffic from a NetApp node is controlled by the NetApp's interface group load balancing policy.

Normally, imbalances like this happen when the load-balancing hash distributes traffic unevenly because too many of the incoming IP addresses share the same third or fourth octets. Since you mentioned that multiple VLANs are served from these port channel groups, that is less likely.
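To illustrate how address patterns can skew an IP-based hash, here is a toy sketch. The XOR-and-modulo hash in `pick_link` is an assumption chosen for simplicity, not the actual algorithm used by Arista, Cisco, or ONTAP, and all of the addresses are hypothetical.

```python
import ipaddress
from collections import Counter


def pick_link(src_ip, dst_ip, num_links=2):
    """Toy hash: XOR the two addresses as integers, then take modulo num_links.

    Illustrative assumption only; real switches and ONTAP use their own hashes.
    """
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % num_links


def distribution(clients, storage_ip):
    """Count how many client-to-storage flows land on each link index."""
    return Counter(pick_link(client, storage_ip) for client in clients)


storage_ip = "10.10.10.2"  # hypothetical storage LIF address

# Hypothetical clients whose last octets are all even (10, 12, ..., 28).
even_clients = [f"10.10.10.{n}" for n in range(10, 30, 2)]
# Hypothetical clients with consecutive last octets (10, 11, ..., 19).
mixed_clients = [f"10.10.10.{n}" for n in range(10, 20)]

print("even last octets: ", dict(distribution(even_clients, storage_ip)))
print("mixed last octets:", dict(distribution(mixed_clients, storage_ip)))
```

With two links, this toy hash depends only on the lowest bit of each address, so clients that share that bit all end up on the same link no matter how many of them there are. Real hashes mix in more bits (and often ports or MACs), but the same kind of collapse can happen when the inputs share a pattern.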

Moving on:

I would validate both the health of the Arista port channel group and what load balancing policy is enabled - mac / ip / port. It is ideal but not critical for the NetApp's load balancing policy to match what is set on the switch. What is really critical is for the load balancing mode to match - dynamic/lacp vs static/multi.

You can give your switch team a head start by telling them exactly which switch interface the NetApp is connected to. The network device-discovery show command reports this, and it is really handy for this kind of thing.

After you've validated that the network looks clean, one thing you may want to try is to remove the quiet port from the port channel group, from either the NetApp side or the switch side, and then re-add it to see whether rejoining the LACP group causes it to "wake up" and start receiving more traffic.

Good luck!

Hadrian