Re: Recovering a RAID 4 aggregate on FAS2020

cdcditupm · ‎2012-04-09

Hi all,

This is, in short, our present (sad) situation: we had a FAS2020 system with two controllers, a base shelf of 12 SAS 300 GB disks and an external shelf of 14 SATA 1 TB disks. Among others, we had an aggregate made of 6 SATA disks configured as RAID4 (5 data disks + 1 parity disk), with 1 spare disk available.

One week ago, one of the disks failed. The system replaced it with the spare disk and began to reconstruct the RAID group. However, 20 hours later, before the reconstruction had finished, another disk failed. Reconstruction should have completed within that time, as the system was not loaded at all (from what I've read, it should have taken 10-12 hours), but unfortunately it took longer.

When the second disk failed, the first controller performed a takeover and set the aggregate offline. After we contacted NetApp support, they forced a giveback that led to the following situation:

Aggregate aggr1 (failed, raid4, partial) (block checksums)
  Plex /aggr1/plex0 (offline, failed, inactive)
    RAID group /aggr1/plex0/rg0 (partial)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      parity    0a.22   0a    1   6   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
      data      0a.16   0a    1   0   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
      data      FAILED  N/A                                847555/1735794176
      data      0a.26   0a    1   10  FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
      data      0a.24   0a    1   8   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
      data      0a.28   0a    1   12  FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 (reconstruction 99% completed)

      Raid group is missing 1 disk.

That is, of a 6-disk RAID4 aggregate we have 4 disks OK, one reconstructed to 99% (as the message suggests) and one broken.
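For readers less familiar with RAID4, here is a rough sketch in Python of why a second failure during reconstruction is fatal. It is only a simplified illustration of single XOR parity, not how Data ONTAP actually implements it; the names `xor_blocks` and `rebuild_stripe` are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_stripe(data_blocks, parity):
    """Rebuild one stripe; a missing data block is represented by None.

    With a single parity block, exactly one missing block can be recovered
    by XOR-ing the parity with all surviving data blocks.
    """
    missing = [i for i, block in enumerate(data_blocks) if block is None]
    if len(missing) > 1:
        # One parity equation, two unknowns: not solvable.
        raise ValueError("single parity cannot rebuild %d missing blocks" % len(missing))
    result = list(data_blocks)
    if missing:
        survivors = [block for block in data_blocks if block is not None]
        result[missing[0]] = xor_blocks(survivors + [parity])
    return result


# Five data blocks plus one parity block, like the 5+1 layout above.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00", b"\x0f\xf0", b"\xaa\x55"]
parity = xor_blocks(stripe)

one_lost = list(stripe)
one_lost[2] = None
print(rebuild_stripe(one_lost, parity) == stripe)
```

With one block set to `None` the stripe comes back intact; with two blocks missing in the same stripe, `rebuild_stripe` raises `ValueError`, which is the situation for every stripe the reconstruction had not yet reached.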

My question is: is there any hope of recovering the information, or should we just forget about it and restore everything from backups?

To make things worse, the two disks that failed are apparently working (at least they are detected by the system; it could have been a software problem), but an improper sequence of commands has partially 'zeroed' them.

Any idea, comment or opinion is welcome. Thanks in advance,

David

scottgelb · ‎2012-04-09

It doesn't sound good. You might check with support escalations about some hidden commands to bring back failed disks (special boot menu stuff for support only), but with 2 failed disks a quicker and safer solution would likely be a restore from backup.

cdcditupm · ‎2012-05-11

Hi again,

Sad stories sometimes have happy endings! After we escalated the problem, NetApp support engineers recovered all the information from the aggregate. Basically, they used the 99% reconstructed disk to regenerate the RAID group, taking the missing 1% of data from the two "partially zeroed" original disks (fortunately, the missing and the deleted data didn't overlap).
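To make the "didn't overlap" point concrete, here is a toy Python sketch of the idea as I understand it. It is my own simplification and certainly not the actual procedure or tooling the support engineers used; `merge_copies` and the per-stripe "ok" flags are hypothetical.

```python
def merge_copies(rebuilt, rebuilt_ok, original, original_ok):
    """Combine two incomplete copies of the same disk, stripe by stripe.

    rebuilt / original : lists of blocks from the two copies
    rebuilt_ok[i]      : True if stripe i was already reconstructed
    original_ok[i]     : True if stripe i was not zeroed on the old disk
    """
    merged = []
    for i, (new_block, old_block) in enumerate(zip(rebuilt, original)):
        if rebuilt_ok[i]:
            merged.append(new_block)
        elif original_ok[i]:
            merged.append(old_block)
        else:
            # The gaps overlap here: neither copy has this stripe.
            raise ValueError("stripe %d is missing from both copies" % i)
    return merged


# Reconstruction stopped before the last stripe; the old disk lost its first one.
rebuilt = [b"A", b"B", b"C", None]
rebuilt_ok = [True, True, True, False]
original = [b"\x00", b"B", b"C", b"D"]
original_ok = [False, True, True, True]

print(merge_copies(rebuilt, rebuilt_ok, original, original_ok))
```

If the zeroed region of the old disk had covered the same stripes that the reconstruction never reached, the function would raise instead of returning a complete copy, which is why the lack of overlap mattered.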

In summary: two hard weeks of restoring backups and reinstalling virtual machines, plus more work later to merge what we recovered with what we reinstalled, but no information lost, thanks to the professionalism of the NetApp escalation support engineers.

Best regards,
David

radek_kubka · ‎2012-05-11

Thanks for sharing this.

In my view, it illustrates that RAID-DP & no hot-spare is better than RAID-4 with a hot-spare.

cdcditupm · ‎2012-05-11

> Thanks for sharing this.

> In my view, it illustrates that RAID-DP & no hot-spare is better than RAID-4 with a hot-spare.

I couldn't agree more. I wish somebody had told me that long ago, when we installed the system.

Best regards,
David