Re: Bypassing a disk shelf/ "Replaying WAFL log"

nwleaphart · ‎2010-04-06

Sorry in advance to ask such a newbish question (our FAS3050 was donated to us with no documentation)...

Our FAS3050 has a faulty power supply on one of the disk shelves. We've had it running regardless for about a month now and have been storing non-critical data on it. Something happened over the weekend that caused the NAS to no longer boot. When I console into the NAS and attempt to boot normally, I can see that the last message is about the power supply failing on that disk shelf and "Replaying WAFL log" immediately before that (on the LCD readout on the controller, as well). I've pasted the latest log below. I've run diagnostics, and the NVRAM appears to be functioning in spite of the odd time code.

Anyway, I'm wondering if it is possible for me to simply "bypass" the faulty shelf (since it is unlikely that we will ever have the money to obtain a new power supply) by re-cabling the NAS. If I do that, will I have to rebuild the entire array? Won't that cause me to lose all the data on the NAS? I'd like to avoid losing any data if I can, but if there's no other way, I'm willing to do what I have to do.

Or is there something else going on that I'm not aware of?

I'd appreciate any help, or just someone to point me in the right direction where I can do some RTFM'ing.

NetApp Release 7.0.5: Wed Aug 9 00:27:38 PDT 2006
Copyright (c) 1992-2006 Network Appliance, Inc.
Starting boot on Tue Apr 6 20:53:25 GMT 2010
Tue Apr 6 20:53:49 GMT [fmmbx_instanceWorke:info]: Disk 0b.32 is a primary mailbox disk
Tue Apr 6 20:53:49 GMT [fmmbx_instanceWorke:info]: Disk 0c.64 is a primary mailbox disk
Tue Apr 6 20:53:49 GMT [fmmbx_instanceWorke:info]: normal mailbox instance on primary side
Tue Apr 6 20:53:56 GMT [raid.vol.replay.nvram:info]: Performing raid replay on volume(s)
Restoring parity from NVRAM
Tue Apr 6 20:53:56 GMT [raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.
Tue Apr 6 20:53:56 GMT [raid.stripe.replay.summary:info]: Replayed 0 stripes.
Tue Apr 6 20:54:00 GMT [wafl.vol.guarantee.fail:error]: Space for volume NSF_Weekly_Backup is NOT guaranteed
Replaying WAFL log

Tue Apr 6 20:54:09 GMT [ses.status.psError:CRITICAL]: DS14-Mk2-AT shelf 2 on channel 0d power error for Power supply 2: critical status; power supply failed. This module is on the rear side of the shelf, at the right.

aborzenkov · ‎2010-04-07

The log that you pasted does not show any errors (except the power supply one, which is benign). What exactly happens after the last message?

You can remove the shelf if it is self-contained (i.e. its aggregate contains only disks in this shelf). If the shelf contains the root volume, the filer won't boot. If some aggregate has disks in this shelf and in another one, that aggregate will come up incomplete and you won't be able to access the data in it. Again, if it is the root aggregate, the filer won't boot.
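The rules above can be written out as a small decision sketch. This is only an illustration of the reasoning, not a NetApp tool: the `Aggregate` class, the `effect_of_removing` function, the outcome strings, and the example layout (aggregate names and shelf numbers) are all hypothetical, and on a real filer the disk-to-aggregate mapping would have to be read from the system itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aggregate:
    """Hypothetical model: an aggregate and the shelves its disks live on."""
    name: str
    shelves: frozenset      # IDs of shelves holding this aggregate's disks
    is_root: bool = False   # True if this aggregate holds the root volume


def effect_of_removing(shelf, aggregates):
    """Map each aggregate name to the expected outcome of pulling `shelf`."""
    outcomes = {}
    for aggr in aggregates:
        if shelf not in aggr.shelves:
            # No disks on the removed shelf: nothing changes for this aggregate.
            outcomes[aggr.name] = "unaffected"
        elif aggr.is_root:
            # Root aggregate loses disks: the filer cannot boot.
            outcomes[aggr.name] = "filer will not boot"
        elif aggr.shelves == {shelf}:
            # Self-contained on the removed shelf: it leaves with the shelf.
            outcomes[aggr.name] = "removed with shelf"
        else:
            # Spans this shelf and others: comes up incomplete.
            outcomes[aggr.name] = "incomplete, data inaccessible"
    return outcomes


# Made-up layout purely for illustration.
layout = [
    Aggregate("aggr0", frozenset({1}), is_root=True),
    Aggregate("aggr1", frozenset({2})),
    Aggregate("aggr2", frozenset({1, 2})),
]
print(effect_of_removing(2, layout))
```

In this made-up layout only shelf 2 could be pulled without stopping the filer, and even then the aggregate spanning both shelves would become inaccessible.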

nwleaphart · ‎2010-04-07

Thanks for the reply!

The entire NAS freezes after that message. I still have one good power supply in that shelf, though. I assume they're built that way for redundancy, since the NAS was working fine for close to a month.

In any case, I suppose I can assume that the root aggregate may have been at least partially stored on that shelf, right? Which means I will need to rebuild the entire array?

Apologies if I'm not conveying the situation in the clearest manner. I'm not too familiar with NAS hardware/software.

aborzenkov · ‎2010-04-07

Have you tried pressing RETURN on the console at this point? What is on the LCD?

If the root aggregate were missing, the filer would reboot, not freeze.

nwleaphart · ‎2010-04-08

Nope. Nothing. I went back again to be sure. The console displays the power supply failure warning and hangs. It doesn't accept any input after that. The LCD displays "Replaying WAFL log".

After a few hours I've noticed that the LCD display may switch to "System inactive", but it still doesn't accept inputs through the console.

lovik_netapp · ‎2010-04-16

I am not sure what's wrong, but have you tried swapping the power supplies? Replace the failed power supply with the other good one. Moreover, one failed power supply shouldn't bring the filer to its knees, so check the cabling and disks, and see whether the lights come on.

BTW, are you able to get into maintenance mode by pressing CTRL+C? That prompt isn't in your logs, but I am asking just to be sure. It should look like this:

Boot Loader version 1.2
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Portions Copyright (C) 2002-2010 NetApp Inc.

CPU Type: AMD Opteron(tm) Processor 252

Starting AUTOBOOT press Ctrl-C to abort...
Loading:............0x200000/32064968 0x20945c8/34790016 0x41c2048/2371097 0x4404e61/7 Entry at 0x00202018
Starting program at 0x00202018
Press CTRL-C for special boot menu    <-- This is where CTRL-C should be pressed to access the boot menu.
Wed Apr 25 19:44:20 GMT [nvram.battery.state:info]: The NVRAM battery is currently ON.
Special boot options menu will be available.

nwleaphart · ‎2010-04-28

Apologies for the delayed response.

I am indeed able to get into maintenance mode on the NAS. I've run many of the diagnostics within it, and a SAN engineer who happened to be onsite with me determined that the root aggregate was corrupted. He recommended zeroing the disks and rebuilding the array. I have yet to put in a help ticket with NetApp, so I'll see what they say before I wipe everything.

Thanks for everyone's help.