ONTAP Discussions

NetApp Aggregate Offline, Volumes Gone

hartman

Please bear with me here -- I'm a network guy and know very little about NetApp. But I'm trying to help a customer restore a NetApp that crashed (possibly due to power issues). It's an older single-chassis two-node cluster running an EOL version of ONTAP (8.3), so formal support isn't an option. With a shelf, it has 48 disks total, of which 6 are currently in a failed/broken state. The overall system and both nodes in the cluster show healthy, but the problems that I see just from poking around in the CLI and GUI are as follows:

- The aggregates (other than aggr0 for each node) show in an "unknown" state. (The state was "failed" prior to a reboot of the cluster.) Attempting to bring them online results in "Failed to bring aggregate <name> online. Reason: Lookup failed."

- The volumes do not seem to exist. They are listed with the "volume show" command but no information about them (e.g. size) is displayed other than the aggregates they belong to. In the GUI, the volume names appear in the namespace but nothing is listed on the "volumes" tab for the Vserver. In the namespace, nothing that displays can be mounted or unmounted. Attempting to bring a volume online returns "Error onlining volume ... Unable to set volume attributes ... Reason: The volume does not exist for the operation."

- The single Vserver is running but the dashboard command shows the error "Vserver root volume and root volume load-sharing mirrors are either offline or don't exist."

This NetApp is unfortunately an SPOF for the network in question, so any assistance or advice is appreciated. Even if the aggs/volumes cannot be restored, is there any way to export the data residing on the healthy disks?

Thanks.

dbenadib

Hi hartman,

It sounds like the aggregates went offline, probably due to one of the following (among other possible causes):

- Multiple disk failures

- Key manager not available (if encryption is in use)

- Missing shelves

Can you share the output of the following commands?

aggr show

disk show

node show

set d -c off ; cluster ring show

hartman

Output of those commands is attached. Thanks.

dbenadib

It seems that too many drives have failed. Let's try to unfail them and see whether that brings the aggregates back online:

1- Note the disk allocation using the command: node run * sysconfig -r

2- Try to unfail the failed disks:

  • ::> set advanced
  • ::*> storage disk unfail -disk <disk>

3- After unfailing, if the number of failed disks is less than or equal to the RAID tolerance threshold, the aggregate will come back online.

4- If the disks can't be unfailed, reseat the failed drives and then check their status.

5- If multiple drives failed simultaneously, reconstructions can be prevented by reseating all the affected drives while the node is halted (without takeover).
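
As a rough illustration of the rule in step 3, here is a minimal Python sketch. It is not ONTAP code: the function name and the example layout are hypothetical, and it assumes the usual per-RAID-group tolerances (one failed disk for RAID4, two for RAID-DP). The point is that the check applies to each RAID group separately, so a single group over its limit keeps the whole aggregate offline.

```python
# Failed disks each RAID type can tolerate per RAID group (assumed standard
# values; the dictionary keys are names chosen for this sketch).
RAID_TOLERANCE = {"raid4": 1, "raid_dp": 2}


def aggregate_can_come_online(raid_groups):
    """Return True if every RAID group is within its failure tolerance.

    raid_groups: list of (raid_type, failed_disk_count) tuples.
    """
    return all(failed <= RAID_TOLERANCE[raid_type]
               for raid_type, failed in raid_groups)


# Hypothetical layout: one HDD RAID-DP group with no failures and one SSD
# cache RAID group (RAID4) in which four SSDs have failed.
example = [("raid_dp", 0), ("raid4", 4)]
print(aggregate_can_come_online(example))
```

With the hypothetical layout above, the SSD group exceeds its tolerance, so the check fails even though the HDD group is healthy.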

hartman

I'll give this a shot. Thanks.

AlexDawson

Hi there,

Please also provide the output of "run local sysconfig -a". As I commented on Reddit, the SSDs have probably failed due to a firmware bug (BURT), meaning the Flash Pool is offline and the data is not recoverable.

hartman

Hi. Sorry, the NetApp is sitting on an isolated network segment, so I'm having to manually copy output from a console terminal. Is there something specific in the output of that command you're looking for? I see the 4 x SSDs are showing Failed and 0.0GB.

dbenadib

Please share the output of:

  • node run * sysconfig -a

SpindleNinja (accepted solution)

Commenting here based on info from /r/netapp: it looks very possible that this system was hit by SU448.

https://kb.netapp.com/Support_Bulletins/Customer_Bulletins/SU448

hartman

Okay, looking at SU448 / BURT1335350, this certainly seems to be the issue. The SSD models, current drive firmware, and current ONTAP version are all listed as affected. Thanks all for helping pinpoint the problem.

So, am I understanding correctly that the data is now unrecoverable, unless maybe via third party recovery tools that could possibly pull something off the HDDs?

SpindleNinja

I would personally try the data recovery route before I give up on it. 

AlexDawson

You've had DriveSavers and Kroll suggested on Reddit. These aren't random choices: if anyone can save it, they can. Do not use anyone else or attempt to DIY.

Budget in the $5-15k range for them.
