ONTAP Hardware
Hi!
I have a problem with one filer in a FAS2020HA configuration. In the middle of the boot process, the filer crashes with the following error:
PANIC: Root volume: "aggr0" is corrupt in process config_thread on release NetApp Release 7.3.2 on Fri Oct 3 08:33:45 GMT 2014
version: NetApp Release 7.3.2: Thu Oct 15 04:17:39 PDT 2009
cc flags: 8O
halt after panic during system initialization
AMI BIOS8 Modular BIOS
Copyright (C) 1985-2006, American Megatrends, Inc. All Rights Reserved
Portions Copyright (C) 2006 Network Appliance, Inc. All Rights Reserved
BIOS Version 3.0
+++++++++++++++
Does anybody have any idea how to troubleshoot this "corrupted" aggregate? The disks seem to be OK (no failed disks, as seen from the live partner).
Will the WAFL_check command help here?
Thank you!
With regards,
Klemen
I agree with the first post. Call Support and they will help you with this. This is something that you shouldn't attempt without knowledge or experience. If you're not in the "hot seat" and want to know how to do this yourself, then here's what I'd do. In my case I had a V-Series V3020 with an HPXP10K behind it, which made it even more fun. I've done this ~10 times, so I got good at it, because the HP barfed (technical term) all the time.
I do not take any liability for you messing up your system if you proceed!
Don't expect NetApp to take responsibility for my advice either.
If you're still curious and I haven't scared you away then proceed...
Set up a TFTP server on the partner node: https://kb.netapp.com/support/index?page=content&id=1012003&locale=en_US
Netboot the node with the corrupted /vol/vol0. https://kb.netapp.com/support/index?page=content&id=3013804&locale=en_US
Run WAFL_check or wafliron on the aggregate that is corrupted (it will most likely show as "wafl inconsistent"). Try WAFL_check first, as it will run faster; if that doesn't work, then try wafliron.
The aggregate status probably looks like this. Remember mine was a V-Series, so it shows raid0 and array LUNs. Don't let that throw you. WAFL does checksums on top of software RAID.
*** This system has failed.
Any adapters shown below are those of the live partner, toaster1
Aggregate aggr1 (restricted, raid0, wafl inconsistent) (block checksums)
Plex /aggr1/plex0 (online, normal, active)
RAID group /aggr1/plex0/rg0 (normal)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
data ntcsan6:19.126L0 0d - - - LUN N/A 432876/886530048 437248/895485360
data ntcsan5:18.126L2 0a - - - LUN N/A 432876/886530048 437248/895485360
data ntcsan5:18.126L1 0a - - - LUN N/A 432876/886530048 437248/895485360
data ntcsan5:18.126L6 0a - - - LUN N/A 415681/851314688 419880/859914720
data ntcsan5:18.126L5 0a - - - LUN N/A 415681/851314688 419880/859914720
data ntcsan6:19.126L8 0d - - - LUN N/A 415681/851314688 419880/859914720
data ntcsan6:19.126L7 0d - - - LUN N/A 415681/851314688 419880/859914720
data ntcsan5:18.126L10 0a - - - LUN N/A 415681/851314688 419880/859914720
RAID group /aggr1/plex0/rg1 (normal)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
data ntcsan6:19.126L12 0d - - - LUN N/A 367837/753330176 371553/760940880
data ntcsan5:18.126L13 0a - - - LUN N/A 367837/753330176 371553/760940880
data ntcsan6:18.126L6 0d - - - LUN N/A 415681/851314688 419880/859914720
data ntcsan6:18.126L10 0d - - - LUN N/A 411063/841857024 415215/850362240
data ntcsan6:18.126L13 0d - - - LUN N/A 422730/865751040 427000/874497120
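If you save that status output to a file and want to spot the affected aggregates without reading every line, a minimal Python sketch along these lines could do it. It only assumes the header format shown above ("Aggregate <name> (<flags>) ..."); the function name and the sample text are made up for illustration and are not part of any NetApp tool.

```python
import re

# Matches header lines such as:
#   Aggregate aggr1 (restricted, raid0, wafl inconsistent) (block checksums)
AGGR_HEADER = re.compile(r"^\s*Aggregate\s+(\S+)\s+\(([^)]*)\)")


def inconsistent_aggregates(status_text):
    """Return names of aggregates whose header flags include 'wafl inconsistent'."""
    names = []
    for line in status_text.splitlines():
        match = AGGR_HEADER.match(line)
        if not match:
            continue  # plex, RAID group and disk lines are skipped
        name, flags = match.group(1), match.group(2)
        flag_list = [flag.strip() for flag in flags.split(",")]
        if "wafl inconsistent" in flag_list:
            names.append(name)
    return names


# Hypothetical sample, copied from the header lines of the output above.
sample = """\
Aggregate aggr1 (restricted, raid0, wafl inconsistent) (block checksums)
Plex /aggr1/plex0 (online, normal, active)
RAID group /aggr1/plex0/rg0 (normal)
"""
print(inconsistent_aggregates(sample))
```

This only reads text you have already captured; it does not talk to the filer and it does not replace running WAFL_check or wafliron.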
Wait while it runs. It will most likely take several hours; wafliron takes much longer. I couldn't find the doc that I used for WAFL_check and wafliron, but if you search for them, they're out there. Support keeps these pretty close to the chest so people don't break their systems, as you should really only do this if you really know what you're doing or are on the phone with support. (Have I said that enough...)
If it comes back clean, then you should be able to boot as normal, because WAFL_check or wafliron will mark the aggregate as consistent.
If it comes back with errors, there is a priv set advanced command that will allow you to trace the bad block to a file that you can restore.
If all this fails call support and have them help you. If you don't have a valid support contract, buy one. If you can't do that... Well then, how good are your backups and is your resume up to date? 😉
I like the ASAP ASUP joke, that was a nice touch...
Hi!
Thank you for your advice. I was successful with wafliron.
With regards,
Klemen
That's awesome. I'm glad it helped. I hadn't had to do this in a long time, and just the other day we hit a super nasty bug in 8.1.2 (417544) and had to WAFLiron every aggr on a node. We used the HTTP method instead of the TFTP method, as TFTP didn't work for some reason and HTTP worked just as well, for what it's worth.