
ONTAP Discussions

RAID DP Disk Failure

edlam2000

I know that RAID-DP allows two disk failures.

Let's say a disk in a double-parity RAID group has already failed; theoretically, another disk can fail without data loss. But if another disk does fail, will the system shut itself down automatically after 24 hours for its own data protection?

31 REPLIES

radek_kubka

It will not.

An automatic RAID rebuild will start as soon as the first drive fails using one of the hot-spare disks.

billshaffer

Assuming there are spares available.  If the system is unable to rebuild, then yes, it will shut itself down in 24 hours, though this can be altered with the raid.timeout option.

Bill
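For anyone looking for the knob Bill mentions, here is a hedged sketch from a Data ONTAP 7-Mode console (the generic "filer>" prompt and the 48-hour value are illustrative; the default is 24 hours):

filer> options raid.timeout        (show the current value, in hours)
filer> options raid.timeout 48     (extend the window, e.g. while waiting on a replacement drive)

Raising the timeout only buys time in a degraded state; it does not add any parity protection.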

Hi Bill

Is there a document or white paper you can share stating that, if the system is unable to rebuild, it will shut down in 24 hours?

Thank you for this one, sir! Another question: if a third disk fails without the raid.timeout option having been changed, what would happen to the system? I know there will be data loss or a 24-hour shutdown, but is there a TR or white paper that supports this?

Thank You in advance!

A few things to consider. Remember that there can be multiple aggregates on the system, each of which can consist of multiple RAID groups; aggr status -r will show the RAID groups.

You can have two failed drives in every aggregate and still not lose data, because each aggregate is a separate entity. In addition, you can lose two drives in each RAID group of a _single_ aggregate without losing data, because each RAID group is its own RAID-DP set. So, if you have an aggregate with 4 RAID-DP RAID groups, you could lose 8 drives without losing data, as long as no more than two come from any one RAID group. For the record, I've seen this: an entire shelf powered off, but no more than two drives from that shelf were in any single RAID group, so there was no data loss.
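A hedged sketch of the 7-Mode commands for checking this on a live system (the "filer>" prompt and aggregate name are illustrative):

filer> aggr status -r aggr1     (per-RAID-group layout: rg0, rg1, ...)
filer> aggr status -f           (failed disks, if any)
filer> aggr status -s           (spares available for reconstruction)

Counting failed disks per RAID group, not per aggregate, is what tells you how close you are to the two-disk limit.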

If you lost a third drive in a RAID-DP RAID group, that RAID group would fail, the aggregate would go offline, and you'd lose data. I'm not sure whether a failover would happen; assuming dual-path HA, if the drive is down on one controller it will be down on the other as well. I'm also not sure whether raid.timeout would shut the system down; I don't know that an offline aggregate constitutes a degraded state. Degraded implies that it's still running, which, technically, it isn't.

I did a quick look but couldn't find anything describing what happens when you lose that third drive. This link (https://kb.netapp.com/support/index?page=content&id=3013638) hints that the controller will panic, but it covers a different situation (a media error during rebuild). One of the links is to article 2014172, which says to call support in the rare event that there was actually another disk failure; so maybe they don't publish what happens in that case.

Bill

GOODNERD1

Sorry all, but this was the closest thread to our recent failures, so I'm trying to find out more.  We recently lost 4 drives, and the aggregate is listed as failed, raid_dp, partial.  We have an HA cluster, and a failover occurred when the multiple drives failed.  We've replaced the drives (physically) and are now researching whether it is possible to recover any or all of the aggregate/volumes.

This is what I have found so far.  The bold type is concerning and removes some hope.  Any insight you guys can provide to a novice NetApp admin would be great.  I can certainly provide more details if necessary.  Thanks in advance!

IBM6040SAN51B> aggr undestroy -n SATA401

To view disk list, select one of following options

        [1] abandon the command

        [2] view disks of aggregate SATA401 ID: 0x849bc76c-11e04511-a0000b95-92041398

Selection (1-2)?  2

Couldn't find sufficient disks to make any plex operable.

1 raidgroup failed in plex0

Aggregate SATA401 (failed, raid_dp, wafl inconsistent) (block checksums)

  Plex /SATA401/plex0

    RAID group /SATA401/plex0/rg0

      RAID Disk         Device          RAID size(MB/blks)

      ---------         ------          --------------

      dparity           0c.16           635555/1301618176

      parity            0a.32           635555/1301618176

      data              0a.56           635555/1301618176

      data              0c.67           635555/1301618176

      data              0a.17           635555/1301618176

      data              0a.35           635555/1301618176

      data              0a.41           635555/1301618176

      data              0c.70           635555/1301618176

      data              0c.26           635555/1301618176

      data              1d.00.10        635555/1301618176

      data              1a.01.0         635555/1301618176

      data              0a.39           635555/1301618176

      data              UNMAPPED        635555/'-'

    RAID group /SATA401/plex0/rg1

      RAID Disk         Device          RAID size(MB/blks)

      ---------         ------          --------------

      dparity           0a.54           635555/1301618176

      parity            0c.40           635555/1301618176

      data              UNMAPPED        635555/'-'

      data              0c.55           635555/1301618176

      data              0a.36           635555/1301618176

      data              0c.23           635555/1301618176

      data              1d.00.16        635555/1301618176

      data              0a.27           635555/1301618176

      data              UNMAPPED        635555/'-'

      data              0a.57           635555/1301618176

      data              1a.01.4         635555/1301618176

      data              0a.20           635555/1301618176

      data              UNMAPPED        635555/'-'

    RAID group /SATA401/plex0/rg2

      RAID Disk         Device          RAID size(MB/blks)

      ---------         ------          --------------

      dparity           0c.64           635555/1301618176

      parity            0a.34           635555/1301618176

      data              1d.00.12        635555/1301618176

      data              0a.48           635555/1301618176

      data              0c.65           635555/1301618176

      data              0a.38           635555/1301618176

      data              0a.22           635555/1301618176

      data              0a.50           635555/1301618176

      data              0c.66           635555/1301618176

      data              0c.42           635555/1301618176

      data              0a.25           635555/1301618176

      data              0c.52           635555/1301618176

      data              0c.68           635555/1301618176

    RAID group /SATA401/plex0/rg3

      RAID Disk         Device          RAID size(MB/blks)

      ---------         ------          --------------

      dparity           0c.45           635555/1301618176

      parity            0c.69           635555/1301618176

      data              0c.77           635555/1301618176

      data              0c.58           635555/1301618176

      data              0c.71           635555/1301618176

      data              0a.44           635555/1301618176

      data              0a.43           635555/1301618176

      data              0c.72           635555/1301618176

      data              0a.29           635555/1301618176

      data              0c.61           635555/1301618176

      data              0c.73           635555/1301618176

      data              0c.76           635555/1301618176

      data              0c.74           635555/1301618176

IBM6040SAN51B>

IBM6040SAN51B> disk show -n

disk show: No disks match option -n.

aborzenkov

Unfortunately there is very little hope of recovering the RAID now, after the drives were physically replaced. As long as the original drives remained, there was hope of trying to unfail them; but now the data on those drives is gone.

It could still be possible to put them back and attempt to unfail them as a last resort.

GOODNERD1

The more I read, the less I believe we'll recover what was lost.  I can attempt to replace the failed drives, but I'm not familiar with any process to attempt recovery at this point.  I've read up on wafliron, and according to the docs it cannot be run on an aggregate that is listed as failed, raid_dp, partial, which is our status.  I'm thinking the effort required for this attempt would be better spent rebuilding the mirrors for replication and rebuilding any VMs that were affected.  Thankfully this was DR, so nothing "production" was involved.  But it does create a sizable concern for production, since the sites are basically the same clustered pair at each location.

I appreciate the responses and time spent discussing this. 

radek_kubka

I agree it probably would be too much hassle to attempt recovery if the data are just DR replicas.

If it was a "fake" disk failure caused by an ONTAP bug, then the first thing to do is upgrade, probably to the latest P release of 8.1.4. "Real" multiple disk failures are an extreme rarity.

Though I'm not saying this should immediately rebuild your confidence in NetApp...

JGPSHNTAP

Non-zeroed spares are useless... type "disk zero spares".

You should be updating production to a newer version of ONTAP immediately.  Are these N-series boxes through IBM?

Show me sysconfig -a

specifically the following two lines:

                Model Name:         N3150

                Machine Type:       IBM-2857-A25

If these are N-series boxes, you need to open a PMR with IBM and get an Upgrade Advisor, plus the proper code to upgrade to.

aborzenkov

Re non-zeroed spares: that's not quite correct. Zeroed spares are relevant only when adding disks to an aggregate (or creating one). Replacement of a failed drive starts immediately; it does not try to zero first. So they are perfectly usable as spares.
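Put differently, a hedged 7-Mode sketch (the "filer>" prompt is illustrative): zeroing matters when a disk is being added to an aggregate, not when it replaces a failed one.

filer> aggr status -s         (spares flagged "(not zeroed)" are still usable for reconstruction)
filer> disk zero spares       (background-zero them anyway, so a future aggr add/create is not delayed)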

JGPSHNTAP

Wait, I need some clarification

A drive fails.. All you have is non-zeroed spares in the spare pool. What happens

aborzenkov

Rebuild onto the non-zeroed spare starts immediately. There is no need to zero a drive that is used as a replacement for a failed one; the replacement drive is going to be rebuilt and completely rewritten anyway.

JGPSHNTAP

You need to open a PMR with IBM to get an Upgrade Advisor and the latest code for your N6040.  (An N6040 is equivalent to a FAS3140.)

GOODNERD1

It is an IBM branded NetApp.

Model Name:         N6040

Machine Type:       IBM-2858-A20

I understood non-zeroed drives would require zeroing prior to going online, which would cause a delay in the event of a failure if no zeroed spares were available.  I found and ran disk zero spares last night; it was a very non-intrusive background process.

Logs show a multi-disk failure caused the issue.  The rebuild running from the last disk replacements wasn't complete, and an HA failover occurred.  So far the lost VMs have been rebuilt, and I'm going to work on reconfiguring the aggregate for the replications from Prod.  It's better to destroy this lost aggregate and create a new one, correct?

JGPSHNTAP

^^

You're in bad shape.  What version of code are you running?  You lost your aggregate because you had a triple disk failure in a single RAID group.  If these are older-type drives, you either had a multi-disk panic, hit the MOOS drive issue, or failed to replace failed drives in time by not having enough spares.

You need to call IBM/NetApp support, and they are going to tell you to perform a wafliron, but there's very little chance of recovering from a triple disk failure, in my opinion.

I hope you are snapmirroring the data somewhere; if not, you'd better touch up your resume if you're not able to recover.

GOODNERD1

Version 8.0.2.  We've had a rash of failed drives recently, and I suspect bad replacements but haven't confirmed it.  Long back story on this solution, but in short we're migrating to another one.  These just happened to be our DR filers, which someone decided to use for a "test" that became "important", and our primary NetApp admin left.  So it appears to be a bad-shape-before, worse-shape-now scenario.  Our support contract is up as well, which makes this rather difficult.  But I'm willing to learn as much as I can while I still have this DR solution running.

They are older-type drives, check.  How can I check for a multi-disk panic?  I'm not familiar with the MOOS drive issue, and we certainly didn't replace the drives fast enough: 3 failed last week and were replaced Monday, right after which more drives failed and weren't swapped before this occurred.  So I certainly agree with your assessment.

JGPSHNTAP

OK, well, you are running code that is susceptible to multi-disk panics.  I don't know what happened because I haven't seen the logs.  Did you not have enough spares?  How many spares do you have (aggr status -s)?

8.0.2 doesn't have the code needed to help prevent this.  The newer code does preventive copies to spares and then gracefully fails the suspect drive.  I would recommend, once you get this fixed, upgrading your NetApp controllers and updating the disk/shelf firmware and the disk qualification package.

As for what happened, you need to pick through the messages file and see what's going on.
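A hedged sketch of two ways to read that file (the "filer>" prompt and the mount path are illustrative):

filer> rdfile /etc/messages

or, from an admin host with the filer's root volume mounted:

grep -iE "disk fail|raid" /mnt/filer/etc/messages*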

GOODNERD1

Non-zeroed spares?  Could those be from the "replacement" drives that were supposed to have been zeroed?  That's not good!  I'll start looking through the logs for more info.  Looks like this will be a lessons-learned event.  Good thing it happened at DR and not PROD; bad thing is we're nowhere close to being out of the red zone with this.  Great responses, JGPSHNTAP, much appreciated!

IBM6040SAN51B> aggr status -s

 

Pool1 spare disks (empty)

 

Pool0 spare disks

 

RAID Disk       Device          HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)

---------       ------          ------------- ---- ---- ---- ----- --------------    --------------

Spare disks for block or zoned checksum traditional volumes or aggregates

spare           0b.109          0b    6   13  FC:A   0  FCAL 10000 272000/557056000  274845/562884296 (not zeroed)

spare           0d.93           0d    5   13  FC:B   0  FCAL 10000 272000/557056000  280104/573653840 (not zeroed)

spare           0a.37           0a    2   5   FC:A   0  ATA   7200 635555/1301618176 635858/1302238304

spare           0a.60           0a    3   12  FC:A   0  ATA   7200 635555/1301618176 635858/1302238304

IBM6040SAN51B>
