Network and Storage Protocols

Cisco port-channel group: one interface up, the other interface shows "suspended - No LACP PDUs"

DBWannaBe

I did an upgrade on our AFF-C250 from 9.12.p8 to 9.12.p10 and received a bunch of error messages along the lines of "Your LIFs are non-redundant".  When I checked the switch, I found that both of our port-channel groups showed one interface up and participating in LACP, while the other interface in each group showed "suspended - No LACP PDUs".
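For anyone checking the same thing, the suspended state shows up in the usual port-channel/LACP views on the switch; these are generic commands rather than our exact session (NX-OS syntax shown; on IOS the first one is "show etherchannel summary"):

show port-channel summary
show lacp neighbor
show lacp counters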

 

Has anyone encountered this?


Things I've tried:

1. Shut / no shut the interface.

2. Replaced the cable (twice).

3. Replaced the SFPs.

4. Defaulted the port and recreated it on the Cisco side.

5. Deleted the port-channel group and recreated it on the Cisco side.

6. Configured a new port-channel group with new interfaces on the Cisco side and moved the cables there (rough switch-side sketch below).
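
For reference, a rough sketch of what steps 4-6 look like on the switch side (NX-OS-style syntax; the interface and port-channel numbers are placeholders, not our actual config):

! wipe the member port and re-add it to the bundle
default interface Ethernet1/10
interface Ethernet1/10
  switchport mode trunk
  channel-group 10 mode active
  no shutdown

! delete and recreate the port-channel itself
no interface port-channel 10
interface port-channel 10
  switchport mode trunk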

 

I also opened a case with NetApp, but so far all we've done is delete the problematic port and add it back.  They seem ready to punt this to Cisco and, honestly, I don't blame them.  While I first noticed the errors during the upgrade, I can't be certain the upgrade is what caused them.

 

JJC-NTAP::> ifgrp show
         Port       Distribution                   Active
Node     IfGrp      Function     MAC Address       Ports   Ports
-------- ---------- ------------ ----------------- ------- -------------------
JJC-NTAP-01
         a0a        ip           d2:39:ea:56:cf:67 partial e0a, e0b
JJC-NTAP-02
         a0a        ip           d2:39:ea:56:d3:f7 partial e0a, e0b

 

JJC-NTAP::> node run -node JJC-NTAP-01 ifconfig -v a0a
a0a: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        uuid: f7eaeaab-8567-11ee-81dd-d039ea56cf67
        options=4ec07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
        ether d2:39:ea:56:cf:67
        pcp 4
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l3
        lagg options:
                flags=4<USE_NUMA>
                flowid_shift: 16
        lagg statistics:
                active ports: 1
                flapping: 2
        lag id: [(8000,D2-39-EA-56-CF-67,002B,0000,0000),
                 (8000,84-78-AC-1D-C2-41,0012,0000,0000)]
        laggport: e0b flags=4<ACTIVE> state=d<ACTIVITY,AGGREGATION,SYNC>
                [(8000,D2-39-EA-56-CF-67,002B,8000,0008),
                 (8000,84-78-AC-1D-C2-41,0012,8000,0111)]
                input/output LACPDUs: 181 / 325
        laggport: e0a flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                [(8000,D2-39-EA-56-CF-67,002B,8000,0009),
                 (8000,84-78-AC-1D-C2-41,0012,8000,0A11)]
                input/output LACPDUs: 27709 / 828590

 

From the output above we can see that each ifgrp is only partially participating in LACP: e0a is fully in the bundle (COLLECTING/DISTRIBUTING), while e0b only ever reaches ACTIVITY/AGGREGATION/SYNC and never actually starts passing traffic, which matches the suspended member on the Cisco side.
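
Side note for anyone decoding the hex "state" values in that output: they are just the standard LACP actor-state bit field from IEEE 802.1AX, so a small throwaway script (a hypothetical helper, not NetApp or Cisco tooling) shows which flags each member port has reached:

# Decode the hex "state" value from the lagg/LACP output above using the
# standard IEEE 802.1AX actor-state bit definitions.
LACP_STATE_BITS = [
    (0x01, "ACTIVITY"),      # actively sends LACPDUs
    (0x02, "TIMEOUT"),       # short (fast) timeout requested
    (0x04, "AGGREGATION"),   # link is allowed to aggregate
    (0x08, "SYNC"),          # in sync with the agreed aggregation group
    (0x10, "COLLECTING"),    # accepting inbound traffic on this link
    (0x20, "DISTRIBUTING"),  # sending outbound traffic on this link
    (0x40, "DEFAULTED"),     # using default partner info (no partner PDUs)
    (0x80, "EXPIRED"),       # partner info has expired
]

def decode_lacp_state(value):
    """Return the LACP state flag names set in a state byte."""
    return [name for bit, name in LACP_STATE_BITS if value & bit]

print("e0a:", decode_lacp_state(0x3D))  # ACTIVITY, AGGREGATION, SYNC, COLLECTING, DISTRIBUTING
print("e0b:", decode_lacp_state(0x0D))  # ACTIVITY, AGGREGATION, SYNC

e0b never sets the COLLECTING/DISTRIBUTING bits, which lines up with its partner interface sitting in "suspended" on the switch.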

 

Any ideas?  Thanks!

 

 

3 Replies

DBWannaBe

NetApp Support does not have any concrete ideas yet.

 

Looking at the Cisco switch: in one port-channel we have two interfaces, one up and passing traffic, the other down/suspended.  On the NetApp, e0a is the interface passing traffic and e0b is the one linked to the suspended interface on the Cisco.  Yesterday we swapped the two cables for e0a and e0b on the NetApp, and the interfaces on the Cisco automatically swapped their up/down statuses.  In other words, the suspended state follows the NetApp port, so e0b is the "bad" port and the problem really appears to be on the NetApp side.
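
For anyone following along, the NetApp-side state after a swap can be re-checked with something along these lines (same node and ifgrp names as in the original post):

JJC-NTAP::> network port show -node JJC-NTAP-01 -port e0a,e0b
JJC-NTAP::> network port ifgrp show -node JJC-NTAP-01 -ifgrp a0a
JJC-NTAP::> node run -node JJC-NTAP-01 ifconfig -v a0a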

 

NetApp support is asking for our Cisco configurations today, so we'll see what that leads to.

 

Thanks for reading along.

DBWannaBe

Also opened a case with Cisco, but they don't see anything wrong with the switch or the configs.  They were also unaware of any bugs that might be causing this behavior. 

 

DBWannaBe (ACCEPTED SOLUTION)

It appears this has been resolved by rebooting each node.

 

NetApp support informed us that some network cards can get locked up after a reboot, and the fix is to reboot the nodes again.

 

There is a KB out on this, but we don't have the specific CNAs that the article mentions.  This may be an extension of that issue, though.  NetApp pulled some logs from my system while we were troubleshooting, so maybe they'll find something in there that points to the locking issue.
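
For anyone hitting the same thing, the reboots can be done one node at a time via takeover/giveback so the LIFs stay served (sketch only; verify "storage failover show" between steps and re-run ifgrp show afterwards):

JJC-NTAP::> storage failover takeover -ofnode JJC-NTAP-01
JJC-NTAP::> storage failover giveback -ofnode JJC-NTAP-01
JJC-NTAP::> storage failover takeover -ofnode JJC-NTAP-02
JJC-NTAP::> storage failover giveback -ofnode JJC-NTAP-02
JJC-NTAP::> ifgrp show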

 
