Solved: PANIC: Root volume: "aggr0" is corrupt in process config_thread

klemen_bregar · ‎2014-10-06

Hi!

I have a problem with one filer in a FAS2020HA configuration. In the middle of the boot process the filer crashes with the following error:

PANIC: Root volume: "aggr0" is corrupt in process config_thread on release NetApp Release 7.3.2 on Fri Oct 3 08:33:45 GMT 2014

version: NetApp Release 7.3.2: Thu Oct 15 04:17:39 PDT 2009
cc flags: 8O
halt after panic during system initialization

AMI BIOS8 Modular BIOS
Copyright (C) 1985-2006, American Megatrends, Inc. All Rights Reserved
Portions Copyright (C) 2006 Network Appliance, Inc. All Rights Reserved
BIOS Version 3.0
+++++++++++++++

Does anybody have any idea how to troubleshoot this "corrupted" aggregate? The disks seem to be OK (no failed disks - as seen from the surviving partner).

Will the WAFL_check command help here?

Thank you!

With regards,

Klemen

YIshikawa · ‎2014-10-06

You should contact Technical Support ASUP!

GLENDONLOWDER · ‎2014-10-06

I like the ASAP ASUP joke, that was a nice touch...

GLENDONLOWDER · ‎2014-10-06

I agree with the first post. Call Support and they will help you with this. This is something that you shouldn't attempt without knowledge and/or experience. If you're not in the "hot seat" and want to know how to do this yourself, well then, here's what I'd do. In my case I had a V-Series V3020 with an HP XP10K behind it, which made it even more fun. I've done this ~10 times, so I got good at it, because the HP barfed (technical term) all the time.

I do not take any liability for you messing up your system if you proceed!

Don't expect NetApp to take responsibility for my advice either.

If you're still curious and I haven't scared you away then proceed...

Set up a TFTP server on the partner node: https://kb.netapp.com/support/index?page=content&id=1012003&locale=en_US

Netboot the node with the corrupted /vol/vol0: https://kb.netapp.com/support/index?page=content&id=3013804&locale=en_US

Run WAFL_check or wafliron on the aggregate that is corrupted (the aggr will most likely show as inconsistent). Try WAFL_check first, as it will run faster; if that doesn't work, then try wafliron.

It probably looks like this. Remember mine was a V-Series, so it shows raid0 and array LUNs. Don't let that throw you. WAFL does checksums on top of software RAID.

*** This system has failed.
Any adapters shown below are those of the live partner, toaster1
Aggregate aggr1 (restricted, raid0, wafl inconsistent) (block checksums)
Plex /aggr1/plex0 (online, normal, active)
    RAID group /aggr1/plex0/rg0 (normal)

      RAID Disk Device                  HA SHELF BAY CHAN Pool Type RPM Used (MB/blks)    Phys (MB/blks)
      --------- ------                  ------------- ---- ---- ---- ----- --------------    --------------
      data      ntcsan6:19.126L0        0d    -   -          - LUN   N/A 432876/886530048 437248/895485360
      data      ntcsan5:18.126L2        0a    -   -          - LUN   N/A 432876/886530048 437248/895485360
      data      ntcsan5:18.126L1        0a    -   -          - LUN   N/A 432876/886530048 437248/895485360
      data      ntcsan5:18.126L6        0a    -   -          - LUN   N/A 415681/851314688 419880/859914720
      data      ntcsan5:18.126L5        0a    -   -          - LUN   N/A 415681/851314688 419880/859914720
      data      ntcsan6:19.126L8        0d    -   -          - LUN   N/A 415681/851314688 419880/859914720
      data      ntcsan6:19.126L7        0d    -   -          - LUN   N/A 415681/851314688 419880/859914720
      data      ntcsan5:18.126L10       0a    -   -          - LUN   N/A 415681/851314688 419880/859914720

    RAID group /aggr1/plex0/rg1 (normal)

      RAID Disk Device                  HA SHELF BAY CHAN Pool Type RPM Used (MB/blks)    Phys (MB/blks)
      --------- ------                  ------------- ---- ---- ---- ----- --------------    --------------
      data      ntcsan6:19.126L12       0d    -   -          - LUN   N/A 367837/753330176 371553/760940880
      data      ntcsan5:18.126L13       0a    -   -          - LUN   N/A 367837/753330176 371553/760940880
      data      ntcsan6:18.126L6        0d    -   -          - LUN   N/A 415681/851314688 419880/859914720
      data      ntcsan6:18.126L10       0d    -   -          - LUN   N/A 411063/841857024 415215/850362240
      data      ntcsan6:18.126L13       0d    -   -          - LUN   N/A 422730/865751040 427000/874497120
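
If you have captured status output like the above to a text file, the state flags sit in the first parenthesised group of the "Aggregate" header line. The following is only a rough Python sketch for picking out aggregates still flagged "wafl inconsistent"; the function names and the regular expression are my own illustration based on the sample output in this post, not a NetApp tool, and the header layout may differ on other releases.

```python
import re

# Assumed header layout, taken from the sample output in this thread:
#   Aggregate aggr1 (restricted, raid0, wafl inconsistent) (block checksums)
AGGR_HEADER = re.compile(r"^\s*Aggregate\s+(\S+)\s+\(([^)]*)\)")


def aggregate_states(status_text):
    """Map each aggregate name to the list of state flags in its header line."""
    states = {}
    for line in status_text.splitlines():
        match = AGGR_HEADER.match(line)
        if match:
            name, flags = match.groups()
            states[name] = [flag.strip() for flag in flags.split(",")]
    return states


def inconsistent_aggregates(status_text):
    """Return the names of aggregates whose header carries 'wafl inconsistent'."""
    return [
        name
        for name, flags in aggregate_states(status_text).items()
        if "wafl inconsistent" in flags
    ]


# Hypothetical capture: the first two lines of the output shown above.
sample = (
    "Aggregate aggr1 (restricted, raid0, wafl inconsistent) (block checksums)\n"
    "Plex /aggr1/plex0 (online, normal, active)\n"
)
print(inconsistent_aggregates(sample))  # prints ['aggr1']
```

An empty list after the check has finished would suggest the flag has been cleared, but treat that as a convenience check only and confirm on the filer itself.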

Then wait while WAFL_check or wafliron runs. It will most likely take several hours; wafliron much longer. I couldn't find the doc that I used for WAFL_check and wafliron, but if you search for them, they're out there. Support keeps these pretty close to the chest so people don't break their systems, as you should really only do this if you really know what you're doing or are on the phone with Support. (Have I said that enough...)

If it comes back clean, then you should be able to boot it as normal, because WAFL_check or wafliron will mark the aggregate as consistent.

If it comes back with errors, there is a priv set advanced command that will allow you to trace the bad block to a file that you can restore.

If all this fails, call Support and have them help you. If you don't have a valid support contract, buy one. If you can't do that... Well then, how good are your backups and is your resume up to date? 😉

klemen_bregar · ‎2015-01-13

Hi!

Thank you for your advice. I was successful with WAFL ironing.

With regards,

Klemen

GLENDONLOWDER · ‎2015-01-14

That's awesome. I'm glad it helped. I hadn't had to do this in a long time, and just the other day we hit a super nasty bug in 8.1.2 (417544) and had to wafliron every aggr on a node. We used the HTTP method instead of the TFTP method, as the TFTP method didn't work for some reason, and HTTP worked just as well, for what it's worth...