ONTAP Discussions
I know that RAID-DP allows two disk failures.
Let's say a disk in a RAID-DP group has already failed; theoretically it can tolerate another failure, but if a second disk does fail, will the system shut itself down automatically after 24 hours for its own data protection?
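As a point of reference, the degraded-mode shutdown window is governed by the raid.timeout option, which defaults to 24 hours. A quick way to check it, or hypothetically adjust it, from the console, assuming 7-Mode syntax and a placeholder prompt:

filer> options raid.timeout
filer> options raid.timeout 48

The second line is only an illustration of extending the window to 48 hours; leaving the default in place is generally the safer choice.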
Ok, well, you are running code that is susceptible to multi-disk panics. I don't know what happened because I haven't seen the logs. Did you not have enough spares? How many spares do you have (aggr status -s)?
8.0.2 doesn't have the code that is needed to help prevent this. The newer code does preventive copies to spares and then gracefully fails the suspect disk. I would recommend, once you get this fixed, that you upgrade your NetApp controllers, and update the disk/shelf firmware and the disk qualification package.
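If it helps, a rough sketch of the console side of that housekeeping, assuming 7-Mode syntax; the firmware and disk qualification files come from the support site and are copied into /etc on the controller first (placeholder prompt):

filer> version
filer> sysconfig -v
filer> disk_fw_update
filer> storage download shelf

sysconfig -v shows the current disk and shelf firmware revisions, disk_fw_update applies new disk firmware staged in /etc/disk_fw, and storage download shelf applies shelf firmware staged in /etc/shelf_fw.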
As for what happened, you need to pick through the messages file and see what's going on..
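For what it's worth, the messages file can be read straight from the console (or from the etc$ admin share), and older generations are rotated into /etc/messages.0 through /etc/messages.5, so the disk failure events may be in one of those. Assuming 7-Mode syntax and a placeholder prompt:

filer> rdfile /etc/messages
filer> rdfile /etc/messages.0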
Non-zeroed spares? Could those be the "replacement" drives that were never zeroed? That's not good! I'll start looking through the logs for more info. Looks like this will be a lesson-learned event. Good thing it happened at DR and not PROD. Bad thing is we're nowhere close to being out of the red zone with this. Great responses JGPSHNTAP, much appreciated!
IBM6040SAN51B> aggr status -s
Pool1 spare disks (empty)
Pool0 spare disks
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare 0b.109 0b 6 13 FC:A 0 FCAL 10000 272000/557056000 274845/562884296 (not zeroed)
spare 0d.93 0d 5 13 FC:B 0 FCAL 10000 272000/557056000 280104/573653840 (not zeroed)
spare 0a.37 0a 2 5 FC:A 0 ATA 7200 635555/1301618176 635858/1302238304
spare 0a.60 0a 3 12 FC:A 0 ATA 7200 635555/1301618176 635858/1302238304
IBM6040SAN51B>
Unfortunately there is very little hope of recovering the RAID now, after the drives were physically replaced. As long as the original drives remained in place, there was hope of trying to unfail them; but now the data on those drives is gone.
It could still be possible to put them back and attempt to unfail them as a last resort.
The more I read, the less I believe we'll recover what was lost. I can attempt to replace the failed drives, but I'm not familiar with any process to attempt recovery at this point. I've read up on wafliron, and according to the docs it cannot be run on an aggregate whose status is failed, raid_dp, partial, which is ours. I'm thinking the effort required to make this attempt would be better spent rebuilding the mirrors for replication and rebuilding any VMs that were affected. Thankfully this was DR, so nothing "production" was involved. But it does create a sizable concern for production, since it's basically the same clustered pair at each location.
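For context, on an aggregate that is still online or restricted, the check would normally be started from the advanced privilege level, roughly as below, assuming 7-Mode syntax; the aggregate name is hypothetical and the prompt is a placeholder. As noted, it will not start against an aggregate in the failed/partial state.

filer> priv set advanced
filer*> aggr wafliron start aggr_dr
filer*> aggr wafliron status aggr_dr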
I appreciate the responses and time spent discussing this.
I agree it probably would be too much hassle to attempt recovery if the data are just DR replicas.
If it was a "fake" disk failure caused by an ONTAP bug, then the first thing to do is to upgrade, probably to the latest 8.1.4 P-release. "Real" multiple disk failures are extremely rare.
Though I'm not saying this should immediately rebuild your confidence in NetApp...
Non-zeroed spares are useless... type "disk zero spares"
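A minimal sequence, assuming 7-Mode syntax and a placeholder prompt:

filer> disk zero spares
filer> aggr status -s

Zeroing runs as a background job; once it completes, the "(not zeroed)" tag disappears from the aggr status -s listing.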
You should be updating production to a newer version of ONTAP immediately. Are these N-series boxes through IBM?
Show me the output of sysconfig -a,
specifically the following two lines:
Model Name: N3150
Machine Type: IBM-2857-A25
If these are N-series boxes, you need to open a PMR with IBM and get an upgrade advisor, plus the proper code to upgrade to.
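Once IBM supplies the image, the 7-Mode upgrade itself usually boils down to something like the lines below; the image file name is purely hypothetical and the prompt is a placeholder. The -r flag stages the new code on the boot device without rebooting, so the actual reboot (or cf takeover/giveback across the HA pair) can be done in a maintenance window.

filer> software update <image-file.zip> -r
filer> version -b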
Re non-zeroed spares - that's not quite correct. Zeroed spares matter only when adding disks to an aggregate (or creating one). Replacement of a failed drive starts immediately; it does not try to zero first. So they are perfectly useful as spares.
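If you want to see this behavior, the RAID-level view shows the replacement disk reconstructing with a percentage regardless of whether it was zeroed; the zeroing requirement only comes up for aggr add or aggr create. Assuming 7-Mode syntax, a hypothetical aggregate name and a placeholder prompt:

filer> aggr status -r aggr_dr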
Wait, I need some clarification
A drive fails. All you have are non-zeroed spares in the spare pool. What happens?
It is an IBM branded NetApp.
Model Name: N6040
Machine Type: IBM-2858-A20
I understood that non-zeroed drives would require zeroing prior to going online, which would cause a delay in the event of a failure if no zeroed spares were available. I found and ran disk zero spares last night, and it was a very non-intrusive background process.
The logs show a multi-disk failure caused the issue. The rebuild running from the last disk replacements wasn't complete when an HA failover occurred. So far the lost VMs have been rebuilt, and I'm going to work on re-configuring the aggregate for the replications from Prod. It's better to destroy this lost aggregate and create a new one, correct?
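In case it's a useful starting point, the destroy-and-recreate path is typically along these lines, assuming 7-Mode syntax; the aggregate name, disk count and SnapMirror paths are hypothetical and the prompt is a placeholder:

filer> aggr offline aggr_dr
filer> aggr destroy aggr_dr
filer> aggr create aggr_dr -t raid_dp 14
filer> snapmirror initialize -S prodfiler:vol_vm1 drfiler:vol_vm1

aggr destroy returns the disks to the spare pool, and snapmirror initialize re-runs the full baseline from production once restricted destination volumes have been created on the new aggregate.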
Rebuild onto a non-zeroed spare starts immediately. There is no need to zero a drive that is used as a replacement for a failed one - the replacement drive is going to be rebuilt and completely rewritten anyway.
You need to open a PMR with IBM to get an upgrade advisor and the latest code for your 6040. An N6040 is a FAS3140.