Solved: Failed Aggregate in FAS2020 - Not rebuilding

Slartibartfast · ‎2018-01-08

I am a NetApp newb of the first degree. I inherited these things with zero knowledge and a mandate to keep it running. I am extra lucky in my learning curve, in that the support has been allowed to expire and will not be renewed.

I have a FAS2020 with two additional shelves. The software is Data ONTAP 7.3.7P1. I am running two aggregates in addition to the OS. One of the aggregates is RAID4 with three RAID groups of four disks each. All the drives are 2TB drives of the exact same model and manufacturer.

I returned from holiday to find that my aggregate was in a failed/partial condition. In actuality, 6 disks failed while I was away. That'll teach me, I guess. Calling up the condition of the aggregate, I see that one of the drives has failed, but I have no idea which one to swap. I replaced two of the drives and they were both listed with bad labels. I cleared the label off one of the replacement drives and made it a spare (I read later that this was a poor choice), so I haven't done anything with the other. I also have not replaced any of the other failed drives in the system, since every single one shows this "bad label" error and I am not sure how to fix that before putting a drive in.

Additionally, I see that my parity disk on RG1 is stuck at 83%. I am no storage guru, but I am pretty sure that can't be good.

I am fortunate that I had only moved about 30 out of about 120 VMs to this storage so, there's that. I would rather NOT have to rebuild 30 VMs though. I would be eternally grateful if someone could help me find the magic to do a couple of things:

Tell me how to properly prepare a previously used disk for placement into the FAS2020 so that the system sees it as a disk it can use in the failed array

Determine which disk to replace in the failed array

Get the repair process started so that I might recover my failed aggregate.

Thanks in advance.

JGPSHNTAP · ‎2018-01-08

Your predecessor should have been fired, because no one uses RAID4 for production. That's risky, and you lost two disks in one RG, so you lost the aggregate.

Your system is very old, running a super old OS on super old hardware. And I see these are VMFS, so it's LUNs. The chances of you recovering this are very small. Sorry, time to restore or rebuild.
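
To show why a second failure in the same RAID4 group is fatal, here is a minimal Python sketch of single-parity XOR. It is an illustration only, not how Data ONTAP lays out data on disk: the block contents and the four-disk layout (three data disks plus one parity disk) are made up, and `xor_blocks` is a hypothetical helper.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-disk group: three data blocks plus one parity block.
data = [b"\x11\x22\x33\x44", b"\xa0\xb0\xc0\xd0", b"\x0f\x0e\x0d\x0c"]
parity = xor_blocks(data)

# One disk lost (data[1]): XOR of the survivors and parity reproduces it.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Two disks lost (data[0] and data[1]): the survivors only yield the XOR
# of the two missing blocks, which is not enough to separate them again.
combined = xor_blocks([data[2], parity])
assert combined == xor_blocks([data[0], data[1]])
```

With one parity block per stripe there is one equation per stripe, so only one unknown disk can be solved for. A parity rebuild that stalls partway, as described above for RG1, points to a second unreadable disk in that group.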

Slartibartfast · ‎2018-01-08

So, am I right in guessing that the rebuild of the parity disk is jacked and that is a second disk failure, yeah?

JGPSHNTAP · ‎2018-01-08

https://library.netapp.com/ecmdocs/ECMP1196986/html/GUID-B7839E9C-E4BD-4E04-8DA0-F1E104CC20F3.html

Yes. I'm not sure whether you should start crying or quit.

Slartibartfast · ‎2018-01-08

Swell.

Just.....swell.

Ok then. I guess that I will start drinking early this week.

Can you point me in a direction for preparing disks for insertion into the system so that I don't have the bad label problem?

Thanks

JGPSHNTAP · ‎2018-01-08

The only advice I would give you is to get off that old system as fast as you can. But check your diskqual package.

Slartibartfast · ‎2018-01-12

Welp - that was that. The real scary part of this is that we have two other FAS2020 systems with shelves and they are configured the same way.

Looks like we have some work to do very quickly.

I appreciate your time.

GidonMarcus · ‎2018-01-08

wow.

As the other commenter said, this is the poorest configuration I have ever seen on any storage system.

The only way to save it is to run sysconfig -a

and, for every disk you see as failed, run

priv set diag; disk unfail <disk name, such as 0a.13, as seen in sysconfig -a>

If that doesn't bring the currently failed RAID group online, maybe the disk it needs is one of the ones you have already taken out.

If so, you'll need to put it back in (either in place of one of your current spares or, if you don't have an empty slot, you could maybe take one of the other parity disks out).

Now, take this as a very important warning: my recommendation is made without seeing the system or its overall status. It is based only on the very limited data you provided. No one can take any responsibility for data loss. If you need real consulting, get a paid consultant; maybe they will be able to help you recover. If you try any of the steps provided above, you could make the situation even worse, and I take no responsibility for that.

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

Slartibartfast · ‎2018-01-12

Thank you for this suggestion. I tried it and the disks came back online, but the aggregate was still toast. I appreciate your time.