ONTAP Discussions

One of the network interfaces is showing up/down, need to troubleshoot

NetappGuy7
10,646 Views

Hi,

One of our interfaces (a cluster interface) inexplicably has the status up/down. The port itself is healthy, and the is-home commands don't seem to fix it. Please advise on how to proceed.

Even setting the admin status to down and bringing it back up didn't fix the issue.
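
For readers following along, the checks and the admin down/up bounce described above might look roughly like the sketch below. It assumes the cluster LIF is stcffn-02_clus1 on node stcffn-02 (both names appear later in this thread) and that cluster LIFs live in a vserver named Cluster, which is an assumption about the ONTAP release in use. Lines starting with # are annotations, not commands, and exact options can differ by version.

```
# Show admin/oper status and home-port details for the cluster LIFs
stcffn::> network interface show -vserver Cluster -fields status-admin,status-oper,home-port,curr-port,is-home

# Check the physical ports on the affected node
stcffn::> network port show -node stcffn-02

# Send the LIF back to its home port
stcffn::> network interface revert -vserver Cluster -lif stcffn-02_clus1

# Bounce the LIF administratively
stcffn::> network interface modify -vserver Cluster -lif stcffn-02_clus1 -status-admin down
stcffn::> network interface modify -vserver Cluster -lif stcffn-02_clus1 -status-admin up
```

A status of up/down here means the LIF is administratively up but operationally down.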

1 ACCEPTED SOLUTION

Ontapforrum
10,596 Views

What protocols are being served? It depends, but in general 'takeover and giveback' allows an HA configuration to perform nondisruptive operations and avoid service interruptions.

However, I think it makes sense to raise a ticket with NetApp so that they can look at the logs/messages (root cause) and suggest the next course of action.


13 REPLIES

Ontapforrum
10,618 Views

Could you check this KB? (It may need a reboot to fix it.)

Network interfaces show up/down following a Vifmgr restart in ONTAP 9.5 and earlier
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/ONTAP_network_interfaces_show_up_down

Another KB along similar lines:
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Systems/FAS_Systems/Logical_Interfaces_(LIF)_up_down_after_ONTAP_node_Reboot

NetappGuy7
10,604 Views

I'm not aware of Vifmgr being restarted, but that seems like a decent solution. However, I'm hesitant to reboot one of the nodes because we're currently serving clients and I want to avoid a power outage. I know we have a failover system, but I still feel hesitant.

Ontapforrum
10,597 Views

What protocols are being served? It depends, but in general 'takeover and giveback' allows an HA configuration to perform nondisruptive operations and avoid service interruptions.

However, I think it makes sense to raise a ticket with NetApp so that they can look at the logs/messages (root cause) and suggest the next course of action.
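
As a rough illustration of the takeover and giveback flow mentioned above, the sequence might look like the sketch below. It assumes stcffn-02 (the node named later in this thread) is the node to be rebooted. Lines starting with # are annotations. With a cluster LIF already down, it would be sensible to confirm with support that takeover is safe before running any of this.

```
# Confirm that takeover is currently possible
stcffn::> storage failover show

# The partner takes over stcffn-02's storage, and stcffn-02 reboots
stcffn::> storage failover takeover -ofnode stcffn-02

# After stcffn-02 is back and waiting for giveback, return its storage
stcffn::> storage failover giveback -ofnode stcffn-02
```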

NetappGuy7
10,481 Views

Yep, I've already done so. Hoping for the best.

We're aware of nondisruptive operations, but unfortunately it's still not a risk we're willing to take, since we're with the government and we serve data to literally hundreds of thousands of clients.

Recently, a new error popped up when trying to run the following command:

stcffn::> node run -node stcffn-02
telnet: connect to address ***.***.***.*: Host is down
telnet: Unable to connect to remote host

This happens even though both heads are healthy and reachable. Very odd.

DarrenJ
10,434 Views

This is likely because a command routed to the nodeshell of that node has to traverse the node's cluster LIF, which is down. If you SSH into the node management LIF of that specific node and run it again, I suspect you won't have that issue.
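
A sketch of that workaround, assuming the admin account is used for SSH and using a placeholder for the node management address (neither is given in the thread). The -role filter is also an assumption that fits older ONTAP releases. Lines starting with # are annotations.

```
# Find the node management LIF and address for stcffn-02
stcffn::> network interface show -role node-mgmt

# From an admin host, connect directly to that LIF (placeholder address)
$ ssh admin@<stcffn-02-node-mgmt-address>

# The nodeshell session is now local to stcffn-02 and does not cross the cluster LIF
stcffn::> node run -node stcffn-02
```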

NetappGuy7
10,417 Views

You're absolutely correct!

So how do I bring the cluster LIF back up? Is deleting it and recreating it a viable option?

DarrenJ
10,401 Views

Worth trying. Changing the home port to something different and then back might also help.
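
Roughly, the home-port change could look like the sketch below. The port names e0a and e0b are hypothetical, the LIF name is the one quoted later in this thread, and the temporary port would need to be a valid cluster port on the same node. Lines starting with # are annotations.

```
# Temporarily point the LIF at a different cluster port (e0b is hypothetical)
stcffn::> network interface modify -vserver Cluster -lif stcffn-02_clus1 -home-port e0b

# Point it back at the original port (e0a is hypothetical) and revert the LIF
stcffn::> network interface modify -vserver Cluster -lif stcffn-02_clus1 -home-port e0a
stcffn::> network interface revert -vserver Cluster -lif stcffn-02_clus1
```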

NetappGuy7
10,390 Views

It doesn't look like it's possible to delete a cluster LIF (command failed: LIF "stcffn-02_clus1" cannot be removed because it is required to maintain quorum on node "stcffn-02".)

Nor is it possible to change its home port. I have no idea how to proceed here. It doesn't look like faulty hardware, but I'm not sure. Really puzzling issue.

Ontapforrum
10,385 Views

Any update from NetApp on this issue?

NetappGuy7
10,382 Views

I've been in contact with someone, but I haven't been able to reach them for a while. At this point I need to escalate the case 

tahmad
9,961 Views

Did you end up following the previous suggestion and performing the reboot?

Do you have a case number to follow on this issue? @NetappGuy7 

NetappGuy7
9,941 Views

I didn't perform a reboot... however, I was able to get a faulty cable replaced, which resolved the issue!
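
For anyone hitting the same symptom, checks along these lines may help spot a cabling problem before considering a reboot. This is a sketch only: the port name e0a is hypothetical, and the nodeshell command is best run from a session on the affected node's own management LIF, since routing it through a down cluster LIF fails as shown earlier in the thread. Lines starting with # are annotations.

```
# Link state of the port behind the cluster LIF (e0a is hypothetical)
stcffn::> network port show -node stcffn-02 -port e0a

# Which neighbor devices the node's ports can see
stcffn::> network device-discovery show -node stcffn-02

# Per-port error counters from the nodeshell
stcffn::> node run -node stcffn-02 -command "ifstat e0a"
```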

Ontapforrum
10,471 Views

That's great. NetApp will review the event logs/cluster-core logs and suggest next steps. Most likely the mgwd process got stuck. Anyway, let us know once it's resolved.
