Sorry in advance to ask such a newbish question (our FAS3050 was donated to us with no documentation)...
Our FAS3050 has a faulty power supply on one of the disk shelves. We've had it running regardless for about a month now and have been storing non-critical data on it. Something happened over the weekend, and the NAS no longer boots. When I console into the NAS and attempt to boot normally, the last message is about the power supply failing on that disk shelf, with "Replaying WAFL log" immediately before it (the LCD readout on the controller shows the same). I've pasted the latest log below. I've run diagnostics, and the NVRAM appears to be functioning in spite of the odd time code.
Anyway, I'm wondering if it is possible for me to simply "bypass" the faulty shelf (since it is unlikely that we will ever have the money to obtain a new power supply) by re-cabling the NAS. If I do that, will I have to rebuild the entire array? Won't that cause me to lose all the data on the NAS? I'd like to avoid losing any data if I can, but if there's no other way, I'm willing to do what I have to do.
Or is there something else going on that I'm not aware of?
I'd appreciate any help, or just someone to point me in the right direction where I can do some RTFM'ing.
NetApp Release 7.0.5: Wed Aug 9 00:27:38 PDT 2006
Copyright (c) 1992-2006 Network Appliance, Inc.
Starting boot on Tue Apr 6 20:53:25 GMT 2010
Tue Apr 6 20:53:49 GMT [fmmbx_instanceWorke:info]: Disk 0b.32 is a primary mail box disk
Tue Apr 6 20:53:49 GMT [fmmbx_instanceWorke:info]: Disk 0c.64 is a primary mail box disk
Tue Apr 6 20:53:49 GMT [fmmbx_instanceWorke:info]: normal mailbox instance on primary side
Tue Apr 6 20:53:56 GMT [raid.vol.replay.nvram:info]: Performing raid replay on volume(s)
Restoring parity from NVRAM
Tue Apr 6 20:53:56 GMT [raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.
Tue Apr 6 20:53:56 GMT [raid.stripe.replay.summary:info]: Replayed 0 stripes.
Tue Apr 6 20:54:00 GMT [wafl.vol.guarantee.fail:error]: Space for volume NSF_Weekly_Backup is NOT guaranteed
Replaying WAFL log
Tue Apr 6 20:54:09 GMT [ses.status.psError:CRITICAL]: DS14-Mk2-AT shelf 2 on channel 0d power error for Power supply 2: critical status; power supply failed. This module is on the rear side of the shelf, at the right.
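Each event line in the console output above carries a bracketed tag of the form [event.name:severity]. For anyone sifting through a longer paste, here is a minimal Python sketch that pulls those fields out. It assumes every event line follows that pattern; the regex and the function name `parse_event` are illustrative only and are not part of any NetApp tool.

```python
import re

# Assumed line shape: "<timestamp> [<event.name>:<severity>]: <message>"
EVENT_RE = re.compile(
    r"\[(?P<event>[\w.]+):(?P<severity>\w+)\]:\s*(?P<message>.*)"
)


def parse_event(line):
    """Return (event, severity, message) for an event line, else None.

    Severity is lower-cased so "CRITICAL" and "critical" compare equal.
    Lines without a bracketed tag (e.g. "Replaying WAFL log") give None.
    """
    match = EVENT_RE.search(line)
    if match is None:
        return None
    return (
        match.group("event"),
        match.group("severity").lower(),
        match.group("message"),
    )


def above_info(lines):
    """Keep only parsed events whose severity is something other than info."""
    parsed = (parse_event(line) for line in lines)
    return [p for p in parsed if p is not None and p[1] != "info"]
```

Feeding the pasted log through `above_info` would leave only the entries tagged with a severity other than info, which makes it easier to see what the filer itself flagged before it stopped.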
The log that you pasted does not show any errors (except the power supply one, which is benign). What exactly happens after the last message?
You can remove the shelf if it is self-contained (i.e. an aggregate contains only disks in this shelf). If the shelf contains the root volume, the filer won't boot. If some aggregate has disks in this shelf and in another one, that aggregate will come up incomplete and you won't be able to access the data in it. Again, if it is the root aggregate, the filer won't boot.
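The rule of thumb in this reply can be written down as a small Python sketch. This only encodes the reasoning above and is not how Data ONTAP itself decides anything; the data layout (a dict mapping each aggregate name to the set of shelf IDs holding its disks), the function name, and the example aggregate names are all hypothetical.

```python
def shelf_removal_impact(aggr_shelves, root_aggr, shelf):
    """Classify each aggregate if `shelf` is pulled, per the rule above.

    aggr_shelves: dict of aggregate name -> set of shelf IDs with its disks
                  (hypothetical layout, for illustration only)
    root_aggr:    name of the aggregate holding the root volume
    shelf:        ID of the shelf to be removed

    Returns (impact, boots): impact maps each aggregate to one of
    "unaffected", "removed with shelf", or "incomplete"; boots is True
    only when the root aggregate has no disks in the removed shelf.
    """
    impact = {}
    for name, shelves in aggr_shelves.items():
        if shelf not in shelves:
            impact[name] = "unaffected"
        elif shelves == {shelf}:
            # Self-contained: the whole aggregate leaves with the shelf.
            impact[name] = "removed with shelf"
        else:
            # Disks are split across shelves: comes up incomplete.
            impact[name] = "incomplete"
    boots = impact.get(root_aggr) == "unaffected"
    return impact, boots


# Made-up layout: aggr0 (root) spans shelves 1 and 2, aggr1 lives only on 2.
layout = {"aggr0": {1, 2}, "aggr1": {2}, "aggr2": {3}}
impact, boots = shelf_removal_impact(layout, "aggr0", 2)
```

In that made-up layout the root aggregate has disks in the removed shelf, so `boots` comes back false, which matches the "filer won't boot" case described above.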
The entire NAS freezes after that message. I still have one good power supply in that shelf, though. I assume they're built that way for redundancy, since the NAS was working fine for close to a month.
In any case, I suppose I can assume that the root aggregate may have been at least partially stored on that shelf, right? Which means I will need to rebuild the entire array?
Apologies if I'm not conveying the situation in the clearest manner. I'm not too familiar with NAS hardware/software.
I am not sure what's wrong, but have you tried swapping the power supplies? Swap the failed power supply with the good one. Also, one failed power supply shouldn't bring the filer to its knees, so check the cabling and disks, and see whether the status lights are lit.
BTW, are you able to get into maintenance mode by pressing CTRL+C? It isn't in your logs, but I am asking just to be sure. It should look like this:
Boot Loader version 1.2 Copyright (C) 2000,2001,2002,2003 Broadcom Corporation. Portions Copyright (C) 2002-2010 NetApp Inc.
CPU Type: AMDOpteron(tm) Processor 252
Starting AUTOBOOT press Ctrl-C to abort...
Loading:............0x200000/32064968 0x20945c8/34790016 0x41c2048/2371097 0x4404e61/7 Entry at 0x00202018
Starting program at 0x00202018
Press CTRL-C for special boot menu
(This is where CTRL-C should be pressed to access the boot menu.)
Wed Apr 25 19:44:20 GMT [nvram.battery.state:info]: The NVRAM battery is currently ON.
Special boot options menu will be available.
I am indeed able to get into maintenance mode on the NAS. I've run many of the diagnostics within it, and a SAN engineer who happened to be onsite with me determined that the root aggregate was corrupted. He recommended zeroing the disks and rebuilding the array. I have yet to put in a help ticket to NetApp, so I'll see what they say before I wipe everything.