Re: RAID DP Disk Failure

edlam2000 · ‎2014-01-27

I know that RAID-DP tolerates two disk failures.

Let's say one disk in a RAID-DP group has already failed. Theoretically, the group can tolerate another failure, but if a second disk fails, will the system shut itself down automatically after 24 hours to protect its data?

radek_kubka · ‎2014-01-27

It will not.

As soon as the first drive fails, an automatic RAID rebuild will start using one of the hot-spare disks.

billshaffer · ‎2014-01-27

Assuming there are spares available. If the system is unable to rebuild, then yes, it will shut itself down in 24 hours, though this can be altered with the raid.timeout option.

Bill

radek_kubka · ‎2014-01-28

True. A system with no spares is a rather uncommon sight, though.

edlam2000 · ‎2014-01-28

Assuming the controller has shut down the RAID group, how do I restart it after the shutdown?

saranraj456 · ‎2014-01-28

Once the disks are replaced, you have to boot up the filers.

edlam2000 · ‎2014-01-28

But there are no spares left and the warranty has expired.

I just want to know: when one of the controllers shuts down the RAID group, how do I restart it?

saranraj456 · ‎2014-01-28

Pull the failed disks out and start the filer.

You'll have to change the raid.timeout option after it starts.

edlam2000 · ‎2014-01-28

Can the filer controller be remotely started or do I have to physically be there to push the button?

saranraj456 · ‎2014-01-28

It can be started from the loader prompt, I believe.

shaunjurr · ‎2014-01-29

Arrange to have someone toss it in the bin or dismantle it for spare parts... You aren't far away from losing all of your data anyway, so this way you can just eliminate the element of surprise.

SCOTT_V_RONTALE · ‎2014-02-05

Hi Bill

Is there a document or white paper you can share confirming that if the system is unable to rebuild, it will shut down in 24 hours?

radek_kubka · ‎2014-02-05

e.g. this one:

https://library.netapp.com/ecmdocs/ECMP1196986/html/GUID-D8B1E3D3-A0DD-4A1D-84EF-DE1DA2B9D3B7.html

SCOTT_V_RONTALE · ‎2014-02-05

Thank you for this one, sir! Another question: what happens to the system if three disks fail and the raid.timeout option has not been changed? I know there will be data loss or a 24-hour shutdown, but is there a TR or white paper that supports this?

Thank you in advance!

billshaffer · ‎2014-02-05

A few things to consider. Remember that there can be multiple aggregates on the system, each of which can consist of multiple raid groups - aggr status -r will show the raid groups.

You can have two failed drives in every aggregate and still not lose data, because each aggregate is a separate entity. In addition, you can lose two drives in each raid group of a _single_ aggregate without losing data, because each raid group is in its own raid-dp setup. So, if you have an aggregate with 4 raid-dp raid groups, you could lose 8 drives, as long as two come from each raid group, without losing data. For the record, I've seen this - an entire shelf powered off, but only two drives from that shelf were in any single raid group, so no data loss.

If you lost a third drive in a raid-dp raid group, that raid group would fail and the aggregate would go offline, and you'd lose data. Not sure if a failover would happen - assuming dual path HA, if the drive is down on one controller it'll be down on the other also. Also not sure if raid.timeout would shut the system down - I don't know that an offline aggregate constitutes a degraded state. Degraded implies that it's still running, which it isn't, technically.
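The counting rule described above (at most two failed drives per raid-dp raid group, and a third failure in any one group takes the aggregate offline) can be sketched in a few lines of Python. This is only an illustrative model, not NetApp tooling: the function name aggregate_survives and the constant MAX_FAILURES_RAID_DP are made up for the example, and the model ignores spares, rebuilds, and timing.

```python
# Illustrative model only: these names are hypothetical, not a NetApp API.
MAX_FAILURES_RAID_DP = 2  # RAID-DP tolerates two failed disks per raid group


def aggregate_survives(failed_per_raid_group):
    """Return True if no raid group has lost more than two disks.

    failed_per_raid_group: one entry per raid group in a single aggregate,
    giving the number of failed disks in that group.
    """
    return all(n <= MAX_FAILURES_RAID_DP for n in failed_per_raid_group)


# Four raid groups with two failures each: eight failed disks, no data loss.
print(aggregate_survives([2, 2, 2, 2]))  # True

# A third failure in any single raid group fails the whole aggregate.
print(aggregate_survives([3, 0, 0, 0]))  # False
```

The total number of failed disks is not what matters here; only the worst-hit raid group decides whether the aggregate stays online.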

I did a quick look, but couldn't find anything to describe what happens when you lose that third drive. This link (https://kb.netapp.com/support/index?page=content&id=3013638) hints that the controller will panic, but it is a different situation (media error on rebuild). One of the links is to article 2014172, which says to call support in the rare event that there was actually another disk failure - so maybe they don't publish what happens in that case.

Bill

MSTORAGE_1986 · ‎2014-02-05

I think you should change the degraded-mode timeout option to 72 hours so that your filer does not panic, and you can also raise a case with NetApp.

SCOTT_V_RONTALE · ‎2014-02-05

Thank you for this, Sir Bill. This helps a lot.

billshaffer · ‎2014-02-06

No problem - and thanks for knighting me!

GOODNERD1 · ‎2014-09-17

Sorry all, but this was the closest thread to our recent failures, so I'm trying to find out more here. We recently lost 4 drives and the aggregate is listed as failed, raid_dp, partial. We have an HA cluster and a failover occurred when the multiple drives failed. We've replaced the drives (physically) and are now researching whether it is possible to recover any or all of the aggregate/volumes.

This is what I have found so far. The bold type is concerning and removes some hope. Any insight you guys can provide to a novice NetApp admin would be great. I can certainly provide more details if necessary. Thanks in advance!

    IBM6040SAN51B> aggr undestroy -n SATA401
    To view disk list, select one of following options
    [1] abandon the command
    [2] view disks of aggregate SATA401 ID: 0x849bc76c-11e04511-a0000b95-92041398
    Selection (1-2)? 2
    Couldn't find sufficient disks to make any plex operable.
    1 raidgroup failed in plex0

    Aggregate SATA401 (failed, raid_dp, wafl inconsistent) (block checksums)
    Plex /SATA401/plex0

    RAID group /SATA401/plex0/rg0
    RAID Disk  Device    RAID size(MB/blks)
    ---------  ------    --------------
    dparity    0c.16     635555/1301618176
    parity     0a.32     635555/1301618176
    data       0a.56     635555/1301618176
    data       0c.67     635555/1301618176
    data       0a.17     635555/1301618176
    data       0a.35     635555/1301618176
    data       0a.41     635555/1301618176
    data       0c.70     635555/1301618176
    data       0c.26     635555/1301618176
    data       1d.00.10  635555/1301618176
    data       1a.01.0   635555/1301618176
    data       0a.39     635555/1301618176
    data       UNMAPPED  635555/'-'

    RAID group /SATA401/plex0/rg1
    RAID Disk  Device    RAID size(MB/blks)
    ---------  ------    --------------
    dparity    0a.54     635555/1301618176
    parity     0c.40     635555/1301618176
    data       UNMAPPED  635555/'-'
    data       0c.55     635555/1301618176
    data       0a.36     635555/1301618176
    data       0c.23     635555/1301618176
    data       1d.00.16  635555/1301618176
    data       0a.27     635555/1301618176
    data       UNMAPPED  635555/'-'
    data       0a.57     635555/1301618176
    data       1a.01.4   635555/1301618176
    data       0a.20     635555/1301618176
    data       UNMAPPED  635555/'-'

    RAID group /SATA401/plex0/rg2
    RAID Disk  Device    RAID size(MB/blks)
    ---------  ------    --------------
    dparity    0c.64     635555/1301618176
    parity     0a.34     635555/1301618176
    data       1d.00.12  635555/1301618176
    data       0a.48     635555/1301618176
    data       0c.65     635555/1301618176
    data       0a.38     635555/1301618176
    data       0a.22     635555/1301618176
    data       0a.50     635555/1301618176
    data       0c.66     635555/1301618176
    data       0c.42     635555/1301618176
    data       0a.25     635555/1301618176
    data       0c.52     635555/1301618176
    data       0c.68     635555/1301618176

    RAID group /SATA401/plex0/rg3
    RAID Disk  Device    RAID size(MB/blks)
    ---------  ------    --------------
    dparity    0c.45     635555/1301618176
    parity     0c.69     635555/1301618176
    data       0c.77     635555/1301618176
    data       0c.58     635555/1301618176
    data       0c.71     635555/1301618176
    data       0a.44     635555/1301618176
    data       0a.43     635555/1301618176
    data       0c.72     635555/1301618176
    data       0a.29     635555/1301618176
    data       0c.61     635555/1301618176
    data       0c.73     635555/1301618176
    data       0c.76     635555/1301618176
    data       0c.74     635555/1301618176

    IBM6040SAN51B>

    IBM6040SAN51B> disk show -n
    disk show: No disks match option -n.

JGPSHNTAP · ‎2014-09-17

^^

You're in bad shape. What version of code are you running? You lost your aggregate because you had a triple disk failure in a single RAID group. If these are older drives, you either had a multi-disk panic, hit the MOOS drive issue, or failed to replace failed drives in time because you didn't have enough spares.
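The triple failure can be read off the aggr undestroy -n listing posted above by counting the UNMAPPED entries in each RAID group. Here is a minimal Python sketch of that count, assuming the listing format shown in this thread. The function name count_unmapped is hypothetical, and the sample text is an excerpt of the posted output with most healthy disks trimmed for brevity.

```python
from collections import Counter

# Excerpt of the listing posted above; most healthy disks are trimmed.
LISTING = """\
RAID group /SATA401/plex0/rg0
dparity 0c.16 635555/1301618176
parity 0a.32 635555/1301618176
data 0a.39 635555/1301618176
data UNMAPPED 635555/'-'
RAID group /SATA401/plex0/rg1
dparity 0a.54 635555/1301618176
parity 0c.40 635555/1301618176
data UNMAPPED 635555/'-'
data 0c.55 635555/1301618176
data UNMAPPED 635555/'-'
data 0a.57 635555/1301618176
data UNMAPPED 635555/'-'
RAID group /SATA401/plex0/rg2
dparity 0c.64 635555/1301618176
parity 0a.34 635555/1301618176
data 1d.00.12 635555/1301618176
"""


def count_unmapped(listing):
    """Count UNMAPPED (missing) disks per RAID group in the listing text."""
    counts = Counter()
    group = None
    for line in listing.splitlines():
        fields = line.split()
        if line.startswith("RAID group "):
            # Third field is the group path, e.g. /SATA401/plex0/rg1
            group = fields[2]
            counts[group] = 0
        elif group and len(fields) >= 2 and fields[1] == "UNMAPPED":
            counts[group] += 1
    return dict(counts)


for group, missing in count_unmapped(LISTING).items():
    # RAID-DP tolerates two missing disks per group; a third fails the group.
    state = "FAILED" if missing > 2 else "ok"
    print(group, missing, state)
```

On the full listing this gives one missing disk in rg0 and three in rg1, which matches the four lost drives and the "1 raidgroup failed in plex0" message.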

You need to call IBM/NetApp support, and they are going to tell you to perform a wafl iron, but in my opinion there's very little chance of recovering from a triple disk failure.

I hope you are SnapMirroring the data somewhere. If not, you'd better touch up your resume in case you're not able to recover.

GOODNERD1 · ‎2014-09-17

Version 8.0.2. We've had a rash of failed drives recently and I suspect bad replacements, but I haven't confirmed that. There's a long back story on this solution, but in short we're migrating to another one. These just happened to be our DR filers that someone decided to use for a "test" that became "important", and our primary NetApp admin left. So it appears to be a "bad shape before, worse shape now" scenario. Our support contract is up as well, which makes this rather difficult. But I'm willing to "learn" as much as I can while I still have this DR solution running.

They are older type drives, check. How can I check for a multi-disk panic? I'm not familiar with the MOOS drive issue, and we certainly didn't replace the drives fast enough: three failed last week and were replaced on Monday, and right after that more drives failed and weren't swapped before this occurred. So I certainly agree with your assessment.