2017-11-28 11:50 AM
So I've been scratching my head on this one and was wondering if anyone else out there has run into this. We have an AFF8080 (2-node HA pair), each node with an IFGRP of two 10GbE ports configured for LACP:
node     ifgrp distr-func mode           up-ports
-------- ----- ---------- -------------- --------
node1    a0a   ip         multimode_lacp e0e,e0g
On "node1" we're seeing a good balance of I/O between ports e0e and e0g. Looking at the graphs in OnCommand UM, the two ports track each other pretty well, whatever kind of work is going on.
On "node2" we're seeing a super-busy e0e port and a mostly idle e0g port. I did a few statit gathers and here's what we're seeing:
node1:
e0e recv 313,367,472.46
e0g recv 31,750,676.82
node2:
e0e recv 217,023,661.04
e0g recv 4,764.75
Ignore the "lower" numbers for e0g on "node1" in the list above - it's usually just as busy as e0e. However, e0g on "node2" just never seems to get much traffic. Both ports are healthy (no dropped packets, no retransmits, etc.), so it looks as though e0g just isn't carrying its share of the ifgrp on "node2".
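To put a number on the imbalance, here is a minimal Python sketch that turns the statit receive counters into per-port shares. It assumes the first pair of samples belongs to "node1" and the second pair to "node2"; the `samples` dictionary and the `port_shares` helper are names made up for this illustration, and the values are used in whatever unit statit reported.

```python
# Receive counters copied from the statit samples above.
# Assumption: first pair is node1, second pair is node2.
samples = {
    "node1": {"e0e": 313_367_472.46, "e0g": 31_750_676.82},
    "node2": {"e0e": 217_023_661.04, "e0g": 4_764.75},
}


def port_shares(ports):
    """Return each port's fraction of the node's total receive traffic."""
    total = sum(ports.values())
    return {name: value / total for name, value in ports.items()}


for node, ports in samples.items():
    shares = port_shares(ports)
    summary = ", ".join(f"{name} {share:.3%}" for name, share in shares.items())
    print(f"{node}: {summary}")
```

With two healthy ports and plenty of distinct client flows, each share would be expected to sit somewhere near 50%; a share close to zero on one port points at how the sending side is choosing links rather than at the port itself.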
I've got a note in to our network team to see if they can spot anything wonky with the port/switch/etc. Any ideas the community might have as to where to start would be greatly appreciated.
2017-11-28 03:33 PM
I'm not sure I'd call the data you have balanced, but maybe that was just when you grabbed the stats. Do you have lots and lots of clients or just a few? You might want to check the load balancing on the switch side, as Cisco load balancing != NetApp load balancing. Maybe it's set incorrectly (the imbalance seems to be inbound to the NetApp, and inbound traffic is distributed by the switch, not the filer), or maybe you've just got a couple of clients that happen to be hitting the same port from the switch's perspective.
In that doc there's a command to see what channel data will flow over:
show channel hash 865 10.10.10.1 10.10.10.2
selected channel port: 1/1
That might help to narrow things down if you have only a few clients. If you have lots and lots, check the load-balance method on the switch instead.
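To illustrate how a handful of clients can all end up on one member link, here is a toy sketch in Python. The hash below (XOR of source and destination IP, modulo the number of links) is purely hypothetical and is not the actual algorithm used by Cisco, Arista, or NetApp; the client addresses are made up, and `pick_link` is a name invented for this example.

```python
import ipaddress


def pick_link(src_ip, dst_ip, links):
    """Toy IP hash: XOR both addresses, then take the result modulo the link count."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return links[(src ^ dst) % len(links)]


links = ["e0e", "e0g"]  # member ports of the ifgrp
target = "10.10.10.2"   # storage-side address, borrowed from the example command
# Hypothetical clients that all happen to have odd last octets.
clients = ["10.10.10.1", "10.10.10.3", "10.10.10.5", "10.10.10.7"]

for client in clients:
    print(client, "->", pick_link(client, target, links))
```

With this toy hash every one of those clients maps to the same link, while a client such as 10.10.10.4 would map to the other one. A real switch hash is different, but the effect is the same in kind: the distribution depends on which addresses (or MACs, or ports) are fed into the hash, not on how busy each link is.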
2017-11-29 05:52 AM
Thanks - we have a bunch of clients on these particular nodes (database servers, hypervisors, etc) all separated by VLANs but using the same IFGRP. The node1 stats I captured are somewhat of an anomaly - usually e0e and e0g are right in line on that node.
I'm hoping to engage with our network engineering team this week to see what they can see. We're Arista-based on the private storage network, but given that Cisco sued Arista for how much they copied IOS, the commands you provided should be pretty close.
The good thing is that we're not saturating any of our 10GbE pipes right now and our latency from the clients is right in alignment with what we'd expect. Just want to make sure we're not leaving throughput on the table going forward.