ONTAP Discussions

inconsistent root aggregate

betterplace
6,289 Views

Hi all,

I have a semi-production filer that has gone bad. The filer has only 1 disk shelf and 1 aggr spread across all 14 disks. Now the aggr is inconsistent, and so is every volume on it.

From everything I've read, I can't run wafl_check or wafliron on the root aggr (the procedure would take vol0 offline). I tried to run those commands in maintenance mode, but they're not there.

I can't create a new aggr because I have no spare disks, and I can't add a new shelf, so every option for building a new vol0 is out.

What can I do to check the volumes and the aggregate for consistency?

7 REPLIES

dougsiggins

WAFL_check (notice the caps) does not show up in maintenance mode because NetApp advises that you run it only under the guidance of Technical Support. It is a hidden command under the 1-5 menu. A WAFL_check takes a significant amount of time and can cause issues. Do you have access to NOW?

Have you contacted support to determine why the aggr went inconsistent? You will need to make sure from the maintenance menu that the loop is up and stable.

betterplace

I hadn't talked to NetApp about it; I thought there was a known solution.

My entire aggr went inconsistent because one of the disks failed and, with no spare disk, DOT couldn't reconstruct the RAID group, causing (after the 24 h grace period) multiple panics.

I have access to NOW if you can recommend a KB or article from there.

Of course I've noticed that everything is case sensitive. Is wafliron the same deal regarding the risk and the need for NetApp involvement?

dougsiggins

I'd first get another shelf on there to have hot spares. I've never left an aggr waiting 24 hours for a spare, so I've never hit such a problem. Frankly, I've never not had enough hot spares.

Search NOW for WAFL_check, though my recommendation is that you call NetApp support.

aborzenkov

"My entire aggr went inconsistent because one of the disks failed and, with no spare disk, DOT couldn't reconstruct the RAID group, causing (after the 24 h grace period) multiple panics."

If you had a multi-disk failure, WAFL_check is not going to help you. OTOH I have yet to see a true multi-disk failure; all of them were caused by loop stability, environmental (like switching off a shelf) or operational (disk assigned off a working head ...) issues. So I would repeat the advice you were given: contact support to determine and fix the cause of the multiple failures.

Hmm ... mentioning that the system "panicked" exactly after 24 hours makes me suspect that there was actually no panic at all, just a system warning. This is normal behaviour: NetApp will shut down after 24 hours if a degraded RAID group is not repaired. This is to protect you from a possible second disk failure and data loss. Any chance you misinterpreted the system message?
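
To make the 24-hour window concrete, here is a minimal Python sketch of the arithmetic. It assumes the timeout is counted in hours from the moment the RAID group became degraded, as with the raid.timeout value of 24 mentioned elsewhere in this thread. The function names and timestamps are hypothetical, and this does not query a filer.

```python
from datetime import datetime, timedelta


def shutdown_deadline(degraded_since: datetime, raid_timeout_hours: int = 24) -> datetime:
    """Return the time at which the protective shutdown would be expected.

    Assumption: the timer starts when the RAID group becomes degraded.
    """
    return degraded_since + timedelta(hours=raid_timeout_hours)


def hours_remaining(degraded_since: datetime, now: datetime,
                    raid_timeout_hours: int = 24) -> float:
    """Return the hours left before the expected shutdown, never below zero."""
    remaining = shutdown_deadline(degraded_since, raid_timeout_hours) - now
    return max(remaining.total_seconds() / 3600, 0.0)


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    failed_at = datetime(2010, 1, 1, 8, 0)
    checked_at = datetime(2010, 1, 1, 20, 0)
    print("Expected shutdown:", shutdown_deadline(failed_at))
    print("Hours left to replace the disk:", hours_remaining(failed_at, checked_at))
```

The point of the sketch is simply that a degraded group with no hot spare leaves a fixed window in which to swap the failed disk before the system shuts itself down.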

betterplace

Thank you all for the responses.

I had no hot spares because this is only a temporary filer that will change its usage in a month or so. Because I had only 1 shelf to attach, I decided to skip the hot spare and use that disk for data, thinking nothing bad would PROBABLY go wrong in 2 months. Guess I was wrong, lol.

I can't get another shelf over there because the setup is on a different continent. Because of all the restrictions, I didn't know about it in the first 24 hours.

aborzenkov: there was only 1 failed disk; everything else is good. The system panicked because it couldn't reconstruct, and I guess it was inconsistent as well, but all of this happened after 24 hours because I had the default value of raid.timeout, 24. Maybe after that it started collapsing, causing the inconsistencies.

Worst case, I'll reinstall the filer (properly this time), but I really don't want to lose the data that's already on it, since it's a few TB already SnapMirrored over a slow WAN.

shaunjurr

Hi,

Have you reviewed the messages file (attachment) that you received in your ASUP mails? (I really hope you configured at least this for your temporary solution.) That should give you a better idea of what is going on. If you have had errors in the FC loops or from disks, that might explain the filesystem inconsistency. If you have support, you probably should have received your spare disk long ago and should have swapped the bad one out. You might even have made the problem worse by leaving the bad disk in the system, causing FC errors that eventually led to the WAFL problems. There may also have been disk firmware or shelf firmware issues that you should have given a bit of attention to, which could have avoided the WAFL problems.

The 24-hour issue has already been explained.  That shutdown should not, however, have caused any WAFL problems.

Remove the bad disk. Insert the replacement. Start in maintenance mode and I think the reconstruction will commence. A WAFL_check or wafliron is the only way out if you don't have enough disks to get an "emergency" root volume online. If you really can't afford to lose the data, you should have NetApp Support looking at things with you.

betterplace

Thank you all, the problem has been resolved.

After consulting with NetApp, I ran WAFL_check; it took almost 5 hours and fixed everything.

Appreciate all the help.
