Solved: Re: Ontap 9.5 P1 Cluster Links both down during/after upgrade

hochschuleda · ‎2019-03-27

Hi,

I tried to update the AFF220 cluster from 9.4P3 to 9.5P1.

The first node went fine, then the second node got stuck. I got an email which said the automated NDU was paused.

Now I see both cluster links down:

e0a Cluster Cluster down 9000 1000/- - false
e0b Cluster Cluster down 9000 1000/- - false

The nodes don't see each other anymore. The cables are fine; they worked before and no one touched them. I can access both BMCs and both nodes, but this is what I get:

3/27/2019 14:13:02 aff220-01 ALERT callhome.andu.pausederr: subject="AUTOMATED NDU PAUSED", epoch="9fb37de9-7eae-497e-8a65-e2a1132d88b0"
3/27/2019 14:12:02 aff220-01 ALERT callhome.andu.pausederr: subject="AUTOMATED NDU PAUSED", epoch="60d38721-a585-42c5-83a5-bba67f05ddb9"
3/27/2019 14:11:46 aff220-01 ERROR net.ifgrp.lacp.link.inactive: ifgrp a0a, port e0d has transitioned to an inactive state. The interface group is in a degraded state.
3/27/2019 14:11:43 aff220-01 ERROR net.ifgrp.lacp.link.inactive: ifgrp a0a, port e0c has transitioned to an inactive state. The interface group is in a degraded state.

How do I get the links active again? The cables are fine.

-----------------
aff220-01
Partner: aff220-02
Hwassist Enabled: true
Hwassist IP: 10.0.220.111
Hwassist Port: 4444
Monitor Status: active
Inactive Reason: -
Corrective Action: -
Keep-Alive Status: healthy

Warning: Unable to list entries on node aff220-02. RPC: Couldn't make
connection [from mgwd on node "aff220-01" (VSID: -1) to mgwd at
169.254.23.217]

aff220::storage failover hwassist*> show
Node
-----------------

Warning: Unable to list entries on node aff220-01. RPC: Couldn't make
connection [from mgwd on node "aff220-02" (VSID: -1) to mgwd at
169.254.1.118]

aff220-02
Partner: aff220-01
Hwassist Enabled: true
Hwassist IP: 10.0.220.112
Hwassist Port: 4444
Monitor Status: active
Inactive Reason: -
Corrective Action: -
Keep-Alive Status: healthy

SpindleNinja · ‎2019-03-27

Is this production? If so, I would open a P1 ASAP.

That said...

The cluster ports and the alerts you got don't look like they are related. e0a and e0b are the cluster ports; the others, e0c and e0d, are part of an LACP group.
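To make that comparison concrete, here is a minimal Python sketch that pulls the port names out of the lines pasted in the first post and checks whether the down cluster ports and the LACP alerts overlap. The column layout of the port listing (port, IPspace, broadcast domain, link state) is an assumption read off the pasted output, and the function names are hypothetical helpers, not part of any ONTAP tooling.

```python
import re

# Port status lines as pasted in the first post. Assumed column order:
# port, IPspace, broadcast domain, link state, then the remaining fields.
PORT_LINES = """\
e0a Cluster Cluster down 9000 1000/- - false
e0b Cluster Cluster down 9000 1000/- - false
"""

# The two LACP events from the same post.
EVENT_LINES = """\
3/27/2019 14:11:46 aff220-01 ERROR net.ifgrp.lacp.link.inactive: ifgrp a0a, port e0d has transitioned to an inactive state. The interface group is in a degraded state.
3/27/2019 14:11:43 aff220-01 ERROR net.ifgrp.lacp.link.inactive: ifgrp a0a, port e0c has transitioned to an inactive state. The interface group is in a degraded state.
"""


def down_cluster_ports(text):
    """Return the ports whose IPspace is 'Cluster' and whose link is 'down'."""
    ports = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == "Cluster" and fields[3] == "down":
            ports.add(fields[0])
    return ports


def lacp_inactive_ports(text):
    """Return the ports named in net.ifgrp.lacp.link.inactive events."""
    pattern = re.compile(
        r"net\.ifgrp\.lacp\.link\.inactive: ifgrp \S+, port (\S+)"
    )
    return {match.group(1) for match in pattern.finditer(text)}


cluster_down = down_cluster_ports(PORT_LINES)
lacp_down = lacp_inactive_ports(EVENT_LINES)
overlap = cluster_down & lacp_down

print("cluster ports down:", sorted(cluster_down))
print("ports in LACP alerts:", sorted(lacp_down))
print("overlap:", sorted(overlap))
```

With the lines from this thread the overlap is empty: the alerts name e0c and e0d, while the ports that are down in the Cluster IPspace are e0a and e0b.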

Are all the ports on the controller dead?

Are both nodes currently booted?

Is the cluster currently in a mix-version state?

hochschuleda · ‎2019-03-27

Mar 27 17:30:31 [aff220-02:netif.init.failed:ALERT]: Initialization of network interface e0a failed due to unexpected software error ix:9.

Mar 27 17:30:33 [aff220-02:netif.init.failed:ALERT]: Initialization of network interface e0b failed due to unexpected software error ix:9.
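For anyone collecting these console messages from several nodes, a minimal Python sketch along these lines can group the failed interfaces per node. It assumes the bracketed `[node:event:severity]` layout seen in the two lines above; the regular expressions and the `parse_console_line` helper are illustrative, not an official log format definition.

```python
import re

# The two console messages quoted above.
CONSOLE_LINES = [
    "Mar 27 17:30:31 [aff220-02:netif.init.failed:ALERT]: Initialization of "
    "network interface e0a failed due to unexpected software error ix:9.",
    "Mar 27 17:30:33 [aff220-02:netif.init.failed:ALERT]: Initialization of "
    "network interface e0b failed due to unexpected software error ix:9.",
]

# Assumed layout: "[node:event:severity]: message".
LINE_RE = re.compile(
    r"\[(?P<node>[^:\]]+):(?P<event>[^:\]]+):(?P<severity>[^\]]+)\]: "
    r"(?P<message>.*)"
)
IFACE_RE = re.compile(r"network interface (\S+)")


def parse_console_line(line):
    """Split one console line into node, event, severity, message, interface."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    iface = IFACE_RE.search(record["message"])
    record["interface"] = iface.group(1) if iface else None
    return record


# Group interfaces that failed to initialize by node name.
failed = {}
for line in CONSOLE_LINES:
    record = parse_console_line(line)
    if record and record["event"] == "netif.init.failed":
        failed.setdefault(record["node"], []).append(record["interface"])

print(failed)
```

Lines that do not match the assumed layout are skipped rather than raising an error, so the sketch can be run over a mixed console capture.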

I moved all VMs away from it before updating 😉 So no production.

It's just the cluster ports e0a and e0b that are dead, roughly since the update of one node from 9.4 to 9.5.

Node 02 went fine; I even saw the check mark in the cluster update overview. The second node then got stuck, and I recently saw a line which said the firmware update of the NIC e0a failed. Could that be the reason why the cluster interconnect went down?

I can boot both. Yesterday I could even backup_boot 9.4 or choose a normal 9.5 boot.

Since node 01 never completed the upgrade, yes, I have a mixed-version state.

SpindleNinja · ‎2019-03-28

Wow, that's crazy/weird! I would for sure open a ticket.

But yeah, that's why your cluster can't come up. You could try rolling back and see what happens, but honestly I'm not 100% sure that would help at this point if those ports are bricked.

xandervanegmond · ‎2019-03-28

Would love to get an update on what caused this.

In ONTAP 9.5, NetApp made a lot of changes to the networking stack (see TechONTAP podcast 181).

The firmware update failure could indicate that the ASIC is broken and you would need to replace the controller.

I have not heard of any bugs with e0a/b/c/d on the AFF220, but since the platform is relatively new, our installed base is limited.

/Xander

hochschuleda · ‎2019-03-28

Actually, we just wiped it again and started all over. The good thing is that it's all working now. I didn't have time to play around longer since I need the system. The links are up on 9.5P1 (it's a Lenovo AFF220 Think System), because I can't get P2 yet. It's all back to normal; I'm going to try the cluster update again when I get P2 and will report if I see anything unusual or if the failover fails again.

Thanks for the help.

xandervanegmond · ‎2019-03-28

Good to hear, bit invasive but it works 🙂

You should at least make sure you can update your Service Processor to the latest firmware.

If Lenovo hasn't released it yet, I would be very, very surprised, considering the NetApp Security Advisory.

/Xander

Briscoe · ‎2019-05-14

Just had this issue after upgrading to 9.5P3 from 9.4P3. After two days of trying to figure it out with NetApp engineers, the solution was simply to issue the power cycle command (not a node reboot) in the BMC... I like simple solutions, but COME ON, two days and it was that simple. I feel stupid... Glad it was that easy though... 🙂