ONTAP Hardware

Takeover Not Possible: Local node missing partner disks

Chandler
5,004 Views

I have an issue where we've removed several shelves from a 2-node cluster.  A node recently started reporting cf.disk.inventory.mismatch: "Status of the disk ?.?" along with several disk UIDs.  If I run "storage failover show -fields reason-enum", a node reports "disk_inventory_mismatch_local".

If I run a "storage failover show -node *" 1 node reports missing 2 disk UIDs and the other node reports missing over 100 disks at "Missing Disks on Local Node".  If I run a "storage disk show -uid XX:XX:XX" using a UID of any of the UIDs listed from storage failover show from either node, there are no results.

I attempted to run a takeover from the node that indicated takeover was possible, but the takeover aborted because communication with the destination node failed.

I'm at a loss at this point on how to remove the 'missing disks' from the nodes.  Since I can't get the original disk IDs (the disks are no longer connected), I can't run the commands to remove them.

Any insight would be appreciated.


12 REPLIES

Chandler
4,925 Views

I have (along with several others), and 'storage disk refresh-ownership' doesn't resolve the issue.  I'm unable to run 'storage disk replace' or 'storage disk removeowner' because those commands require a disk ID; the disks no longer exist, so there is no ID, and the missing disks are only reported by UID.
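
To illustrate, this is roughly what I've tried; the disk name below is a placeholder, since the missing disks don't resolve to any name I can reference:

::> storage disk refresh-ownership
::> storage disk show -uid XX:XX:XX:...          <- no results
::> storage disk removeowner -disk <disk_name>   <- needs a disk name, which I can't get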

 

Thank you for asking.

Ontapforrum
4,919 Views

Is the filer under support? If yes, please reach out to NetApp support; they will guide you further. If it is out of support, are you able to create a maintenance window? If so, reboot the node that is complaining, but first run "storage failover modify -node * -onreboot false". This option specifies whether the node automatically takes over for its partner if the partner reboots. After the reboot, run "node run -node * -c sysconfig -a".
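
Roughly, the sequence I have in mind (node name is a placeholder; adjust for your environment):

::> storage failover modify -node * -onreboot false
::> system node reboot -node <problem_node>
(after the node is back up)
::> node run -node * -c sysconfig -a
::> storage failover modify -node * -onreboot true     <- re-enable afterwards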


These are just a few suggestions, but if this filer has production data, I would suggest reaching out to your local NetApp Account Manager to discuss the next course of action.

Chandler
4,917 Views

Unfortunately it is no longer under support, and it is a production system.  It's possible I could bring the node down, though it seems there would be an outage since it can't be taken over by the partner.  I can reach out to our AM and see if there is any support they can provide.  I was hoping there was a way to rebuild the partner disk information without requiring a halt.

 

Thank you for the suggestion.

Chandler
4,714 Views

We have decided to migrate all of the volumes to the healthy node (which has just enough capacity) and then halt the problem node.  I'll use the storage failover command suggested by Ontapforrum and see if that helps.
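
Rough outline of the plan, with placeholder names:

::> volume move start -vserver <svm> -volume <vol> -destination-aggregate <healthy_node_aggr>
::> volume move show
(once everything is off the problem node)
::> storage failover modify -node * -onreboot false
::> system node halt -node <problem_node> -inhibit-takeover true     <- takeover isn't possible anyway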

Chandler
4,693 Views

This did not resolve the issue, and 'storage failover show' now reports "Disk inventory not exchanged".

Ontapforrum
4,642 Views

Sorry to hear that. What are the storage system model and ONTAP version? I don't know your exact situation, but I would suggest getting in touch with NetApp PS to help fix this issue. Is this filer going to be decommissioned?

Chandler
4,641 Views

NetApp did grace us with support and is currently looking into it.  The system is not planned to be decommissioned, as we have it under hardware support in an expiring remote datacenter.  This is an A700s (which is riddled with issues) running 9.12.1P4.

I will say, the node reboot did fix the issue with that specific node, as it is no longer reporting 178 missing disks.

The issue we ran into was that, after moving all volumes to the healthy node, we were left with only about 2.5TB of free space.  The plan was to move volumes back to the rebooted node, but the vol move didn't execute; it would hang with no errors.  I suspect this is due to the state of the nodes and the disk inventory.  To remediate the space issue, we deleted the rebooted node's aggregate in order to add its space to the healthy node and migrate volumes that way; however, the disk/partition ownership change did not take.  We reassigned 6 Data1 partitions to the healthy node, but it doesn't show them as spare disks.  Running sysconfig -r, each node shows the partner owns the Data1 partitions we reassigned, so basically the partition assignment didn't update.

At this time we have been migrating data off of the cluster to free up additional space.  NetApp support wants to reboot the nodes with a forced takeover, but my concern is the state of the nodes.  While HA shows healthy on both, the disk ownership has me concerned there is a communication issue.  We will wait and see.  I appreciate your responses.
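
For what it's worth, this is roughly how we've been checking the partition ownership (names are placeholders, and the exact options may differ slightly on your release):

::> storage aggregate show-spare-disks -owner <healthy_node>
::> storage disk show -partition-ownership
::> node run -node <node> -command sysconfig -r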

Ontapforrum
4,622 Views

Thanks for keeping us updated. Good luck, hopefully it will be sorted. 

 

How does this output look?

set adv

::*> system ha interconnect status show

Chandler
4,609 Views

The output of system ha interconnect status show is that all interfaces are up, for both link status and IC RDMA connections.

Support determined that our outage occurred because of Epsilon in combination with running 'storage failover modify -node * -onreboot false'; they stated that with takeover on reboot disabled, data stops being served when a node in a 2-node HA pair is rebooted.
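
For anyone who hits the same thing, the pieces involved can be checked with something like the following (exact field names may vary by release):

set adv
::*> cluster show -fields epsilon
::*> storage failover show -fields onreboot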

 

After the call today, they determined we hit a bug present in 9.12.1P1 through P4 (it's fixed starting in 9.12.1P7):
https://kb.netapp.com/Support_Bulletins/Customer_Bulletins/SU543

 

Four shelves had been removed from this cluster on two different dates.  The first two removals were performed by me, and I removed disk ownership first.  The other two were not performed by me, so I am not certain whether ownership was removed.  According to the bug, it's possible this could occur regardless of whether disk ownership was removed before removing the disks.

 

Either way, I greatly appreciate your input and time throughout this process.  We will be performing a takeover/giveback this evening, which should fix us for now until we can upgrade.

Ontapforrum
4,552 Views

Ok. Thanks for the update. I am sure there is always something to learn in such situations. Good luck with your next course of action.

Chandler
4,474 Views

Takeover/giveback was successful.  We ran: "storage failover takeover -ofnode <nodename> -allow-disk-inventory-mismatch true".  This also updated the reassigned disk partitions that were stale on both systems.  Glad to report all is normal.  I wish there had been a diag command that could have been run to update the disk inventory, but nonetheless we are back to normal.
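
For anyone finding this thread later, the full sequence was roughly (node name is a placeholder):

::> storage failover takeover -ofnode <problem_node> -allow-disk-inventory-mismatch true
(wait for takeover to complete)
::> storage failover giveback -ofnode <problem_node>
::> storage failover show
::> storage failover show -fields reason-enum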

Thanks for your input.
