Solved: One of the network interfaces is showing up/down, need to troubleshoot

NetappGuy7 · ‎2022-02-25

Hi,

One of our interfaces (a cluster interface) inexplicably has a status of up/down. The port itself is healthy, and the is-home commands don't seem to fix it. Please advise on how to proceed.

Even setting status-admin to down and then bringing the interface back up didn't fix the issue.
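
For readers following along, the steps described above roughly correspond to the ONTAP CLI commands below. This is only a sketch: the LIF name stcffn-02_clus1 is taken from later in the thread, and the vserver name Cluster is an assumption based on the usual default for cluster LIFs. Lines starting with # are annotations, not commands.

```text
# Show cluster LIFs with their admin/oper status and Is Home column
stcffn::> network interface show -vserver Cluster

# Send the LIF back to its home port (the "is home" fix)
stcffn::> network interface revert -vserver Cluster -lif stcffn-02_clus1

# Bounce the LIF administratively: down, then up again
stcffn::> network interface modify -vserver Cluster -lif stcffn-02_clus1 -status-admin down
stcffn::> network interface modify -vserver Cluster -lif stcffn-02_clus1 -status-admin up
```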

Ontapforrum · ‎2022-02-25

Could you check this KB? (It may need a reboot to fix it.)

Network interfaces show up/down following a Vifmgr restart in ONTAP 9.5 and earlier
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/ONTAP_network_interfaces_show_up_down

Another KB along similar lines:
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Systems/FAS_Systems/Logical_Interfaces_(LIF)_up_down_after_ONTAP_node_Reboot

NetappGuy7 · ‎2022-02-25

I'm not aware of Vifmgr being restarted, but that seems like a decent solution. However, I'm hesitant to reboot one of the nodes because we're currently serving clients and I want to avoid a power outage. I know we have a failover system, but I still feel hesitant.

Ontapforrum · ‎2022-02-25

What protocols are being served? It depends, but in general, 'takeover and giveback' allows an HA configuration to perform nondisruptive operations and avoid service interruptions.

However, I think it makes sense to raise a ticket with NetApp so that they can look at the logs/messages (root cause) and suggest the next course of action.
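
As a rough illustration of the takeover and giveback flow mentioned above, the sequence might look like the sketch below. It assumes stcffn-02 is the node to be rebooted and that its HA partner is healthy; check this against NetApp support guidance before running it on a production cluster, especially while a cluster LIF is down. Lines starting with # are annotations, not commands.

```text
# Confirm that takeover is currently possible
stcffn::> storage failover show

# The partner takes over stcffn-02's storage; stcffn-02 reboots
stcffn::> storage failover takeover -ofnode stcffn-02

# Wait until stcffn-02 reports that it is waiting for giveback
stcffn::> storage failover show

# Return the storage to stcffn-02
stcffn::> storage failover giveback -ofnode stcffn-02
```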

NetappGuy7 · ‎2022-02-28

Yep I've already done so, hoping for the best

We're aware of nondisruptive operations, but unfortunately it's still not a risk we're willing to take, since we're with the government and we serve data to literally hundreds of thousands of clients.

Recently, a new error popped up when running the following command:

stcffn::> node run -node stcffn-02
telnet: connect to address ***.***.***.*: Host is down
telnet: Unable to connect to remote host

This happens despite the fact that both heads are healthy and reachable. Very odd.

DarrenJ · ‎2022-03-01

This is likely because the command being routed to the node shell of that node has to traverse the cluster LIF to reach the node, which is down. If you SSH'd into the node management LIF of that specific node and tried to run it again, I suspect you wouldn't have that issue.
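
A sketch of that check, assuming SSH access as admin and using a placeholder for the node management LIF address (not given in the thread). The cluster ping-cluster command requires advanced privilege and is included here only as one possible way to confirm that the cluster network path is the problem. Lines starting with # are annotations, not commands.

```text
# Connect directly to the node management LIF of stcffn-02 (placeholder address)
ssh admin@<node-mgmt-LIF-of-stcffn-02>

# From this session the nodeshell is local, so the cluster LIF is not traversed
stcffn::> node run -node local

# Optional: test cluster-network connectivity from this node
stcffn::> set -privilege advanced
stcffn::*> cluster ping-cluster -node local
```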

NetappGuy7 · ‎2022-03-01

You're absolutely correct!

So how do I bring the cluster LIF back up? Is deleting it and recreating it a viable option?

DarrenJ · ‎2022-03-01

Worth trying. Modifying the home port to something different and then back might also be worth trying.
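
The home-port change suggested above could be sketched as follows. The port names are placeholders because the thread does not say which ports the node uses, and the vserver name Cluster is assumed. Lines starting with # are annotations, not commands.

```text
# Temporarily point the LIF at a different cluster port
stcffn::> network interface modify -vserver Cluster -lif stcffn-02_clus1 -home-port <other-port>

# Point it back at the original port, then revert the LIF to it
stcffn::> network interface modify -vserver Cluster -lif stcffn-02_clus1 -home-port <original-port>
stcffn::> network interface revert -vserver Cluster -lif stcffn-02_clus1
```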

NetappGuy7 · ‎2022-03-02

It doesn't look like it's possible to delete a cluster LIF (command failed: LIF "stcffn-02_clus1" cannot be removed because it is required to maintain quorum on node "stcffn-02".)

Nor is it possible to change its home port. I have no idea how to proceed here. It doesn't look like faulty hardware, but I'm not sure. Really puzzling issue.

Ontapforrum · ‎2022-03-02

Any update from NetApp on this issue?

NetappGuy7 · ‎2022-03-02

I've been in contact with someone, but I haven't been able to reach them for a while. At this point I need to escalate the case.

tahmad · ‎2022-03-15

Did you end up following the previous suggestion and performing the reboot?

Do you have a case number to follow on this issue? @NetappGuy7

NetappGuy7 · ‎2022-03-16

I didn't perform a reboot... however, I was able to get a faulty cable replaced, which resolved the issue!
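
Since the root cause turned out to be a cable, a few physical-layer checks may be worth running early in a case like this. The sketch below assumes it is run from a session on stcffn-02 itself and uses a placeholder for the cluster port name. Lines starting with # are annotations, not commands.

```text
# Link state, speed and health of the node's ports
stcffn::> network port show -node stcffn-02

# What each port sees as its neighbor on the other end of the cable
stcffn::> network device-discovery show -node stcffn-02

# Per-port counters from the nodeshell, including error counts
stcffn::> system node run -node local -command ifstat <cluster-port>
```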

Ontapforrum · ‎2022-02-28

That's great. NetApp will review the event logs/cluster-core logs and suggest next steps. Most likely the mgwd process got stuck. Anyway, let us know once it's resolved.