ONTAP Discussions

RAID DP Disk Failure

edlam2000
14,599 Views

I know that RAID-DP allows two disk failures.

Let's say a disk in a RAID-DP raid group has already failed. Theoretically it can tolerate another failure, but if a second disk does fail, will the system shut itself down automatically after 24 hours to protect its data?

31 REPLIES

radek_kubka
13,041 Views

It will not.

An automatic RAID rebuild onto one of the hot-spare disks will start as soon as the first drive fails.

billshaffer
13,041 Views

Assuming there are spares available.  If the system is unable to rebuild, then yes, it will shut itself down in 24 hours, though this can be altered with the raid.timeout option.
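
For reference, a rough sketch of checking and changing that timeout on a 7-Mode filer (72 is just an example value, in hours, not a recommendation):

filer> options raid.timeout        (shows the current value; the default is 24 hours)
filer> options raid.timeout 72     (raises the degraded-mode timeout to 72 hours)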

Bill

radek_kubka
12,674 Views

True. A system with no spares is a rather uncommon sight, though.

edlam2000
12,674 Views

Assuming it's the controller that shuts down the raid group, how do I restart it after the shutdown?

saranraj456
12,674 Views

Once the disks are replaced, you have to boot up the filers.

edlam2000
12,674 Views

But there are no spares any more, and the warranty has expired.

I just want to know: when one of the controllers shuts down the raid group, how do I restart it?

saranraj456
12,674 Views

Pull the failed disks out and start the filer.

You'll have to change the raid.timeout option after starting.

edlam2000
12,674 Views

Can the filer controller be remotely started or do I have to physically be there to push the button?

saranraj456
12,674 Views

It can be started from the LOADER prompt, I believe.
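
Roughly, assuming the SP/RLM is cabled and reachable (the address below is a placeholder for your remote management IP):

ssh naroot@<sp-or-rlm-ip>          (log in to the Service Processor / RLM)
SP> system power on                (power the controller back on)
SP> system console                 (attach to the console to watch the boot)

Then, if it stops at the firmware prompt:

LOADER> boot_ontap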

shaunjurr
11,912 Views

Arrange to have someone toss it in the bin or dismantle it for spare parts...  You aren't far away from losing all of your data anyway, so this way you can just eliminate the element of surprise.

SCOTT_V_RONTALE
12,279 Views

Hi Bill

Is there a document or white paper you can share stating that if the system is unable to rebuild, it will shut down in 24 hours?

SCOTT_V_RONTALE
12,279 Views

Thank you for this one, sir! Another question: what happens to the system if three disks fail without the raid.timeout option having been changed? I know there will be data loss or a 24-hour shutdown, but is there a TR or white paper that supports this?

Thank You in advance!

billshaffer
12,279 Views

A few things to consider.  Remember that there can be multiple aggregates on the system, each of which can consist of multiple raid groups - aggr status -r will show the raid groups.

You can have two failed drives in every aggregate and still not lose data, because each aggregate is a separate entity.  In addition, you can lose two drives in each raid group of a _single_ aggregate without losing data, because each raid group is its own RAID-DP setup.  So, if you have an aggregate with 4 RAID-DP raid groups, you could lose 8 drives without losing data, as long as no more than two come from any one raid group.  For the record, I've seen this - an entire shelf powered off, but only two drives from that shelf were in any single raid group, so no data loss.
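
To check where your own system stands, roughly (7-Mode syntax; if I remember the flags right, -s and -f list spares and failed disks):

filer> aggr status -r      (raid group layout per aggregate)
filer> aggr status -s      (hot spares available)
filer> aggr status -f      (failed/broken disks)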

If you lost a third drive in a raid-dp raid group, that raid group would fail and the aggregate would go offline, and you'd lose data.  Not sure if a failover would happen - assuming dual path HA, if the drive is down on one controller it'll be down on the other also.  Also not sure if raid.timeout would shut the system down - I don't know that an offline aggregate constitutes a degraded state.  Degraded implies that it's still running, which it isn't, technically.

I did a quick look, but couldn't find anything to describe what happens when you lose that third drive.  This link (https://kb.netapp.com/support/index?page=content&id=3013638) hints that the controller will panic, but it is a different situation (media error on rebuild).  One of the links is to article 2014172, which says to call support in the rare event that there was actually another disk failure - so maybe they don't publish what happens in that case.

Bill

MSTORAGE_1986
11,912 Views

I think you should change the degraded-mode timeout option to 72 hours; that way your filer will not panic, and you can also raise a case with NetApp in the meantime.

SCOTT_V_RONTALE
11,380 Views

Thank you for this, sir Bill. This helps a lot.

billshaffer
11,380 Views

No problem - and thanks for knighting me!

GOODNERD1
11,747 Views

Sorry all, but this thread was the closest to our recent failures, so I'm trying to find out more.  We recently lost 4 drives and the aggregate is listed as failed, raid_dp, partial.  We have an HA cluster, and a failover occurred when the multiple drives failed.  We've replaced the drives (physically) and are now researching whether it is possible to recover any or all of the aggregate/volumes.

This is what I have found so far.  The bold type is concerning and removes some hope.  Any insight you guys can provide to a novice NetApp admin would be great.  I can certainly provide more details if necessary.  Thanks in advance!

IBM6040SAN51B> aggr undestroy -n SATA401

To view disk list, select one of following options

        [1] abandon the command

        [2] view disks of aggregate SATA401 ID: 0x849bc76c-11e04511-a0000b95-92041398

Selection (1-2)?  2

Couldn't find sufficient disks to make any plex operable.

1 raidgroup failed in plex0

Aggregate SATA401 (failed, raid_dp, wafl inconsistent) (block checksums)

  Plex /SATA401/plex0

    RAID group /SATA401/plex0/rg0

      RAID Disk         Device          RAID size(MB/blks)

      ---------         ------          --------------

      dparity           0c.16           635555/1301618176

      parity            0a.32           635555/1301618176

      data              0a.56           635555/1301618176

      data              0c.67           635555/1301618176

      data              0a.17           635555/1301618176

      data              0a.35           635555/1301618176

      data              0a.41           635555/1301618176

      data              0c.70           635555/1301618176

      data              0c.26           635555/1301618176

      data              1d.00.10        635555/1301618176

      data              1a.01.0         635555/1301618176

      data              0a.39           635555/1301618176

      data              UNMAPPED        635555/'-'

    RAID group /SATA401/plex0/rg1

      RAID Disk         Device          RAID size(MB/blks)

      ---------         ------          --------------

      dparity           0a.54           635555/1301618176

      parity            0c.40           635555/1301618176

      data              UNMAPPED        635555/'-'

      data              0c.55           635555/1301618176

      data              0a.36           635555/1301618176

      data              0c.23           635555/1301618176

      data              1d.00.16        635555/1301618176

      data              0a.27           635555/1301618176

      data              UNMAPPED        635555/'-'

      data              0a.57           635555/1301618176

      data              1a.01.4         635555/1301618176

      data              0a.20           635555/1301618176

      data              UNMAPPED        635555/'-'

    RAID group /SATA401/plex0/rg2

      RAID Disk         Device          RAID size(MB/blks)

      ---------         ------          --------------

      dparity           0c.64           635555/1301618176

      parity            0a.34           635555/1301618176

      data              1d.00.12        635555/1301618176

      data              0a.48           635555/1301618176

      data              0c.65           635555/1301618176

      data              0a.38           635555/1301618176

      data              0a.22           635555/1301618176

      data              0a.50           635555/1301618176

      data              0c.66           635555/1301618176

      data              0c.42           635555/1301618176

      data              0a.25           635555/1301618176

      data              0c.52           635555/1301618176

      data              0c.68           635555/1301618176

    RAID group /SATA401/plex0/rg3

      RAID Disk         Device          RAID size(MB/blks)

      ---------         ------          --------------

      dparity           0c.45           635555/1301618176

      parity            0c.69           635555/1301618176

      data              0c.77           635555/1301618176

      data              0c.58           635555/1301618176

      data              0c.71           635555/1301618176

      data              0a.44           635555/1301618176

      data              0a.43           635555/1301618176

      data              0c.72           635555/1301618176

      data              0a.29           635555/1301618176

      data              0c.61           635555/1301618176

      data              0c.73           635555/1301618176

      data              0c.76           635555/1301618176

      data              0c.74           635555/1301618176

IBM6040SAN51B>

IBM6040SAN51B> disk show -n

disk show: No disks match option -n.

JGPSHNTAP
11,750 Views

^^

You're in bad shape. What version of code are you running?  You lost your aggregate because you had a triple disk failure in a single raid group.  If these are older type drives, you either had a multi-disk panic, hit the MOOS drive issue, or failed to replace failed drives in time by not having enough spares.

You need to call IBM/NetApp support, and they are going to tell you to run wafliron, but there's very little chance of recovering from a triple disk failure, in my opinion.
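
If support does walk you through it, it will look roughly like this (advanced privilege; exact syntax can vary by release, and I wouldn't run it without them on the line):

IBM6040SAN51B> priv set advanced
IBM6040SAN51B*> aggr wafliron start SATA401
IBM6040SAN51B*> aggr wafliron status SATA401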

I hope you are SnapMirroring the data somewhere; if not, you'd better touch up your resume if you're not able to recover.

GOODNERD1
10,600 Views

Version 8.0.2.  We've had a rash of failed drives recently, and I suspect bad replacements but haven't confirmed it.  There's a long back story on this solution, but in short we're migrating to another one.  This just happened to be our DR filers, which someone decided to use for a "test" that became "important", and our primary NetApp admin left.  So it appears to be a "bad shape before, worse shape now" scenario.  Our support contract is up as well, which makes this rather difficult.  But I'm willing to "learn" as much as I can while I still have this DR solution running.

They are older type drives, check.  How can I check for a multi-disk panic?  I'm not familiar with the MOOS drive issue, and we certainly didn't replace the drives fast enough: 3 failed last week and were replaced Monday, and right after that more drives failed and weren't swapped before this occurred.  So I certainly agree with your assessment.
