ONTAP Discussions
Please bear with me here -- I'm a network guy and know very little about NetApp. But I'm trying to help a customer restore a NetApp that crashed (possibly due to power issues). It's an older single-chassis two-node cluster running an EOL version of ONTAP (8.3), so formal support isn't an option. Including an attached shelf, it has 48 disks total, 6 of which are currently in a failed/broken state. The overall system and both nodes in the cluster report healthy, but the problems I can see just from poking around in the CLI and GUI are as follows:
- The aggregates (other than aggr0 for each node) show in an "unknown" state. (The state was "failed" prior to a reboot of the cluster.) Attempting to bring them online results in "Failed to bring aggregate <name> online. Reason: Lookup failed."
- The volumes do not seem to exist. They are listed with the "volume show" command but no information about them (e.g. size) is displayed other than the aggregates they belong to. In the GUI, the volume names appear in the namespace but nothing is listed on the "volumes" tab for the Vserver. In the namespace, nothing that displays can be mounted or unmounted. Attempting to bring a volume online returns "Error onlining volume ... Unable to set volume attributes ... Reason: The volume does not exist for the operation."
- The single Vserver is running but the dashboard command shows the error "Vserver root volume and root volume load-sharing mirrors are either offline or don't exist."
This NetApp is unfortunately an SPOF for the network in question, so any assistance or advice is appreciated. Even if the aggs/volumes cannot be restored, is there any way to export the data residing on the healthy disks?
Thanks.
Commenting here based on info from /r/netapp - looks very possible that SU448 was hit.
https://kb.netapp.com/Support_Bulletins/Customer_Bulletins/SU448
Hi hartman
Sounds like the aggregates went offline, probably due to one of the following:
- Multi-disk failure
- (If encryption is in use) the key manager not being available
- Missing shelves
etc.
Can you share the results of the following commands:
aggr show
disk show
node show
set d -c off ; cluster ring show
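If encryption or a missing shelf turns out to be a factor, something along these lines may also help narrow it down (just a sketch; the key-manager command only applies if NSE with an external key manager is configured):
security key-manager show
storage disk show -container-type broken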
It seems that too many drives failed. Let's try to unfail them and see if that brings the aggregates back online:
1- Note the disk allocation using the command: node run * sysconfig -r
2- Try to unfail the failed disks (see the example sketch after this list).
3- After unfailing, if the number of failed disks is less than or equal to the RAID tolerance threshold, the aggregate will come back online.
4- If the disks can't be unfailed, reseat the failed drives and then check their status.
5- If multiple drives failed simultaneously, reconstructions can be prevented by reseating all of the impacted drives while the node is halted (without takeover).
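For step 2, a rough sketch of the unfail sequence, assuming diag privilege is available and using a placeholder disk name (verify the exact syntax on your 8.3 build first; unfail is a diagnostic-level command, so use it with care):
set -privilege diagnostic
storage disk unfail -disk 1.0.12
storage aggregate show
Leaving off the -s option is deliberate: -s would return the disk to the spare pool, whereas the goal here is for the disk to rejoin its aggregate with its RAID label intact.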
I'll give this a shot. Thanks.
Hi there,
Please also provide the output of "run local sysconfig -a" - as I commented on Reddit, the SSDs have probably failed due to a firmware BURT, meaning the Flash Pool is offline and the data is not recoverable.
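If manually copying the full sysconfig -a output is a pain, the SSD model and firmware revision can also be pulled from the clustershell with something like this (a sketch; field names may differ slightly on 8.3):
storage disk show -type SSD -fields model,firmware-revision,container-type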
Hi. Sorry, the NetApp is sitting on an isolated network segment, so I'm having to manually copy output from a console terminal. Is there something specific in the output of that command you're looking for? I see the 4 x SSDs are showing Failed and 0.0GB.
Okay, looking at SU448 / BURT1335350, this certainly seems to be the issue. The SSD models, current drive firmware, and current ONTAP version are all listed as affected. Thanks all for helping pinpoint the problem.
So, am I understanding correctly that the data is now unrecoverable, unless maybe via third party recovery tools that could possibly pull something off the HDDs?
I would personally try the data recovery route before I give up on it.
You've had Drivesavers and Kroll suggested on Reddit. These aren't random choices - if anyone can save it, they can. Do not use anyone else or attempt to DIY.
Budget in the $5-15k range for them.