Disk errors that have caused Aggregate failure

I have been assisting a customer with an out-of-support system. They initially reported that the filer would not boot. On investigation today I found that the root aggregate has failed: of the 14 disks assigned to the aggregate, 6 have failed. We cannot boot into ONTAP but still have BMC access. I pulled the system log and can see a sysconfig -v output plus the boot messages. There are 20 drives assigned to the filer in total, and 9 are reporting the same issue. There appear to be 3 spares left on the filer, but they were not used when the drives failed.

The boot messages report that 9 of the SAS drives cannot spin up. Below is the relevant output from the boot messages.

NetApp Release 7.3.1.1: Mon Apr 20 22:58:46 PDT 2009

Copyright (c) 1992-2008 NetApp.

Starting boot on Thu Feb  6 10:21:14 GMT 2014

Thu Feb  6 10:21:31 GMT [nvram.battery.turned.on:info]: The NVRAM battery is turned ON. It is turned OFF during system shutdown.

Thu Feb  6 10:21:31 GMT [kern.version.change:notice]: Data ONTAP kernel version was changed from NetApp Release 7.3.1.1P2 to NetApp Release 7.3.1.1.

Thu Feb  6 10:21:34 GMT [disk.init.failureBytes:error]: Disk 0b.23 failed due to failure byte setting.

Thu Feb  6 10:21:34 GMT [disk.init.failureBytes:error]: Disk 0b.22 failed due to failure byte setting.

Thu Feb  6 10:21:34 GMT [disk.init.failureBytes:error]: Disk 0b.20 failed due to failure byte setting.

Thu Feb  6 10:21:34 GMT [disk.init.failureBytes:error]: Disk 0b.17 failed due to failure byte setting.

Thu Feb  6 10:21:34 GMT [mptsas_intrd:error]: Disk 0c.00.10 has failed to spin up and cannot be used. Please replace it with a new drive.

Thu Feb  6 10:21:34 GMT [mptsas_intrd:error]: Disk 0c.00.4 has failed to spin up and cannot be used. Please replace it with a new drive.

Thu Feb  6 10:21:34 GMT [mptsas_intrd:error]: Disk 0c.00.0 has failed to spin up and cannot be used. Please replace it with a new drive.

Thu Feb  6 10:21:34 GMT [mptsas_intrd:error]: Disk 0c.00.14 has failed to spin up and cannot be used. Please replace it with a new drive.

Thu Feb  6 10:21:35 GMT [mptsas_intrd:error]: Disk 0c.00.3 has failed to spin up and cannot be used. Please replace it with a new drive.

Thu Feb  6 10:21:35 GMT [mptsas_intrd:error]: Disk 0c.00.19 has failed to spin up and cannot be used. Please replace it with a new drive.

Thu Feb  6 10:21:35 GMT [mptsas_intrd:error]: Disk 0c.00.12 has failed to spin up and cannot be used. Please replace it with a new drive.

Thu Feb  6 10:21:35 GMT [mptsas_intrd:error]: Disk 0c.00.9 has failed to spin up and cannot be used. Please replace it with a new drive.

Thu Feb  6 10:21:35 GMT [mptsas_intrd:error]: Disk 0c.00.1 has failed to spin up and cannot be used. Please replace it with a new drive.

Thu Feb  6 10:21:35 GMT [diskown.isEnabled:info]: software ownership has been enabled for this system

Ignore the FC disk failures; the DS14 shelf can be recovered. It has simply run out of spares and will be fixed once we can purchase 4 new drives to rebuild the array. The failed SAS disks are spread all over the shelf, so I doubt a backplane issue caused this many disks to fail in such a short space of time that ONTAP could not bring the spares online before shutting the system down to protect the aggregate from further damage. The disks have been reseated and still will not spin up. There is no amber LED on any disk when the filer is powered on; the only indicator we get is a flashing green LED. Before having to send the disks off for third-party recovery, is there anything else that can be tried to bring the drives back online and recover the data? Is it likely that a firmware issue has caused this?

I can provide further outputs if they would be required.

Re: Disk errors that have caused Aggregate failure

I agree that it's unlikely all the drives have simultaneously failed.  What model is the SAS shelf?  What model is the filer?  Clustered?  Are the drives connected multipath-HA?  Using the onboard adapters or PCI adapters?

What do you get with storage show disk, storage show adapter, etc, from maintenance mode?
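Even without a working root aggregate you should be able to reach maintenance mode (option 5 from the special boot menu). A first-pass triage there would be something like the following; these are standard 7-Mode maintenance-mode commands, but treat this as a sketch since I can't see your system:

```
*> storage show disk       # which disks the adapters can actually see
*> storage show adapter    # state of the adapters (0b, 0c)
*> disk show -v            # software ownership assignments
*> aggr status -r          # RAID layout and which disks are marked failed
```

If the nine SAS drives don't even appear in storage show disk, they aren't spinning up at all and no software-level command is going to help.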

Bill

Re: Disk errors that have caused Aggregate failure

Hi Bill thanks for the reply.

The filer is a FAS2050, so the SAS drives are internal. The filer was not clustered; the internal SAS drives were assigned to one controller and the DS14 shelf was assigned to the second controller.

I have had to remove the disks from the customer's site, as we were planning on using Ontrack to recover the data if possible. Given the number of simultaneous failures, I am now thinking it could have been a firmware or ONTAP bug. The filer was running 7.3.1P2.

Before sending this off and paying a few thousand just to have them check whether the data is recoverable, I would like to attempt to recover the information on a donor filer in the office.

I am sorry, but I don't have any of the logs from maintenance mode. This was the root aggregate, so I don't think the system will boot to anything at the moment.

Re: Disk errors that have caused Aggregate failure

Probably not a drive firmware issue; I wouldn't expect them all to go together like that. Very possibly an ONTAP bug or a hardware issue. Trying the drives in another controller is a good idea. Let us know how it goes.
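If the donor filer does see the drives, something along these lines should tell you whether ONTAP has merely marked them failed. Again this is 7-Mode syntax, the disk name below is just an example, and disk unfail is an advanced-privilege command, so only run it against disks whose data you are prepared to risk:

```
filer> disk show -v              # ownership and state of all visible disks
filer> priv set advanced
filer*> disk unfail -s 0c.00.4   # example disk: clear the failed label, return it as a spare
filer*> priv set
```

If the drives won't spin up on the donor either, that points at the drives themselves (or a common firmware fault) rather than the original controller.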

Bill

Re: Disk errors that have caused Aggregate failure

Hi, have you sorted out this issue? I have the same issue right now, on exactly the same system (FAS2050), with 8 SAS disks failed together. Your reply is much appreciated.