ONTAP Discussions

RAID DP Disk Failure

edlam2000
19,245 Views

I know that RAID-DP allows two disk failures per RAID group.

Let's say a disk has already failed in a RAID-DP group. Theoretically it can tolerate one more failure, but if a second disk does fail, will the system shut itself down automatically after 24 hours for its own data protection?
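As context for the question, here is a toy model of the behavior being asked about. It is an illustrative sketch only, not ONTAP's actual logic; the 24-hour figure corresponds to the 7-Mode raid.timeout option, which defaults to 24 hours and controls how long the system keeps running while a RAID group is degraded with no spare to rebuild onto.

```python
RAID_TIMEOUT_HOURS = 24  # default value of the raid.timeout option

def aggregate_state(failed_disks, spares, hours_degraded):
    """Coarse state of a RAID-DP group (toy model, not real ONTAP logic)."""
    if failed_disks == 0:
        return "normal"
    if failed_disks > 2:
        return "failed"          # more failures than double parity can absorb
    if spares >= failed_disks:
        return "reconstructing"  # rebuild starts onto available spares
    if hours_degraded >= RAID_TIMEOUT_HOURS:
        return "shutdown"        # degraded too long with no spare: halt to protect data
    return "degraded"
```

So in this model, a second failure with no spares leaves the group degraded, and the shutdown only happens once the timeout expires.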

31 REPLIES

JGPSHNTAP
8,097 Views

OK, well, you are running code that is susceptible to multi-disk panics. I don't know what happened because I haven't seen the logs. Did you not have enough spares? How many spares do you have (aggr status -s)?

8.0.2 doesn't have the code needed to help prevent this. Newer releases do preventive copies to spares and then gracefully fail the suspect disk. I would recommend, once you get this fixed, upgrading your NetApp controllers and updating the disk/shelf firmware and disk qualification package.
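The "preventive copy" behavior mentioned above can be sketched as a toy model (illustrative only, not real ONTAP code): newer releases copy a disk that is predicted to fail onto a spare while it is still readable, then fail it gracefully, avoiding a full parity-based reconstruction.

```python
def handle_suspect_disk(still_readable, spares_available):
    """Toy decision model for a disk flagged as likely to fail."""
    if spares_available == 0:
        return "run_degraded"        # nothing to copy or rebuild onto
    if still_readable:
        return "copy_to_spare"       # fast disk-to-disk copy, then fail the disk
    return "parity_reconstruct"      # disk already dead: rebuild from parity
```

The point of the copy path is that a sequential disk-to-disk copy is much faster and less risky than reconstructing every block from parity while the RAID group runs degraded.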

As for what happened, you need to pick through the messages file and see what's going on..  

GOODNERD1
8,099 Views

Non-zeroed spares?  Could that be from the "replacement" drives that weren't zeroed?  That's not good!  I'll start looking through the logs for more info.  Looks like this will be a lesson-learned event.  Good thing it happened at DR and not PROD.  Bad thing is we're nowhere close to being out of the red zone with this.  Great responses JGPSHNTAP, much appreciated!

IBM6040SAN51B> aggr status -s

Pool1 spare disks (empty)

Pool0 spare disks

RAID Disk       Device          HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
---------       ------          --  ----- --- ---- ---- ----  ---  --------------    --------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare           0b.109          0b    6   13  FC:A   0  FCAL 10000 272000/557056000  274845/562884296 (not zeroed)
spare           0d.93           0d    5   13  FC:B   0  FCAL 10000 272000/557056000  280104/573653840 (not zeroed)
spare           0a.37           0a    2   5   FC:A   0  ATA   7200 635555/1301618176 635858/1302238304
spare           0a.60           0a    3   12  FC:A   0  ATA   7200 635555/1301618176 635858/1302238304

IBM6040SAN51B>
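For readers following along, a small hypothetical helper (written for this exact column layout, not a general ONTAP parser) that tallies the spare listing from "aggr status -s" output like the paste above:

```python
def parse_spares(output):
    """Extract spare-disk rows from an "aggr status -s" paste (toy parser)."""
    spares = []
    for line in output.splitlines():
        if not line.startswith("spare"):
            continue
        fields = line.split()
        spares.append({
            "device": fields[1],            # e.g. 0b.109
            "type": fields[7],              # e.g. FCAL or ATA
            "zeroed": "(not zeroed)" not in line,
        })
    return spares

sample = """\
spare           0b.109          0b    6   13  FC:A   0  FCAL 10000 272000/557056000  274845/562884296 (not zeroed)
spare           0d.93           0d    5   13  FC:B   0  FCAL 10000 272000/557056000  280104/573653840 (not zeroed)
spare           0a.37           0a    2   5   FC:A   0  ATA   7200 635555/1301618176 635858/1302238304
spare           0a.60           0a    3   12  FC:A   0  ATA   7200 635555/1301618176 635858/1302238304
"""
spares = parse_spares(sample)
not_zeroed = [s["device"] for s in spares if not s["zeroed"]]
```

Run against the paste above, this finds four Pool0 spares, of which the two FCAL drives are not zeroed.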

aborzenkov
8,094 Views

Unfortunately there is very little hope of recovering the RAID now, after the drives were physically replaced. While the original drives remained in place there was hope of trying to unfail them, but now the data on those drives is gone.

It might still be possible to put them back and attempt an unfail as a last resort.

GOODNERD1
8,094 Views

The more I read, the less I believe we'll recover what was lost.  I can attempt to replace the failed drives, but I'm not familiar with any process to attempt recovery at this point.  I've read up on wafliron, and according to the docs it cannot be run on an aggregate whose status is failed, raid_dp, partial, which is ours.  I'm thinking the effort required for that attempt would be better spent rebuilding the mirrors for replication and rebuilding any VMs that were affected.  Thankfully this was DR, so nothing "production" was involved.  But it does create a sizable concern for production, since the two sites are basically the same clustered pair.

I appreciate the responses and time spent discussing this. 

radek_kubka
8,094 Views

I agree it probably would be too much hassle to attempt recovery if the data are just DR replicas.

If it was a "fake" disk failure caused by an ONTAP bug, then the first thing to do is upgrade, probably to the latest 8.1.4 P version. "Real" multiple disk failures are extremely rare.

Though I'm not saying this should immediately rebuild your confidence in NetApp...

JGPSHNTAP
8,094 Views

Non-zeroed spares are useless... type "disk zero spares".

You should be updating production to a current version of ONTAP immediately.    Are these N-series boxes through IBM?

Show me the following two lines from sysconfig -a:

                Model Name:         N3150

                Machine Type:       IBM-2857-A25

If these are N-series boxes, you need to open a PMR with IBM to get an upgrade advisor, plus the proper code to upgrade to.

aborzenkov
8,095 Views

Re non-zeroed spares: that's not quite correct. Zeroed spares matter only when adding disks to an aggregate (or creating one). Replacement of a failed drive starts immediately; it does not zero the disk first. So non-zeroed disks are perfectly usable as spares.

JGPSHNTAP
8,095 Views

Wait, I need some clarification.

A drive fails, and all you have is non-zeroed spares in the spare pool. What happens?

GOODNERD1
8,097 Views

It is an IBM branded NetApp.

Model Name:         N6040

Machine Type:       IBM-2858-A20

I understood non-zeroed drives would require zeroing before going online, which would cause a delay if a failure occurred and no zeroed spares were available.  I found and ran disk zero spares last night, and it was a very non-intrusive background process.

The logs show a multi-disk failure caused the issue.  The rebuild from the last disk replacements wasn't complete when an HA failover occurred.  So far the lost VMs have been rebuilt, and I'm going to work on reconfiguring the aggregate for the replications from prod.  It's better to destroy the lost aggregate and create a new one, correct?

aborzenkov
6,271 Views

Rebuild onto a non-zeroed spare starts immediately. There is no need to zero a drive that is used as a replacement for a failed one: the replacement drive is going to be rebuilt and completely rewritten anyway.
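The distinction explained above can be summed up in a toy model (illustrative only, not ONTAP code): a non-zeroed spare serves as a rebuild target immediately, because reconstruction rewrites every block on it anyway; zeroing only gates adding a disk to an aggregate, where leftover data must be wiped first.

```python
def spare_usable_immediately(purpose, zeroed):
    """Toy model: can this spare be used right now for the given purpose?"""
    if purpose == "rebuild":
        return True        # rebuild overwrites the entire disk anyway
    if purpose == "add_to_aggr":
        return zeroed      # a non-zeroed disk is zeroed first, adding delay
    raise ValueError("unknown purpose: " + purpose)
```

Which is why "disk zero spares" is still worth running: it removes the zeroing delay for the aggregate-grow case without affecting rebuild behavior.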

JGPSHNTAP
6,272 Views

You need to open a PMR with IBM to get an upgrade advisor and the latest code for your N6040. An N6040 is a FAS3140.
