Re: FAS3020 will not boot - any ideas

akhtara · ‎2012-07-13

All,

I have a FAS3020 that isn't booting.

Here is what I am getting...any ideas to get this filer to boot?

NetApp Release 7.3.4: Thu May 27 17:52:48 PDT 2010

Starting boot on Fri Jul 13 18:34:03 GMT 2012

Fri Jul 13 18:34:46 GMT [fci.initialization.failed:error]: Initialization failed on Fibre Channel adapter 0b.

Fri Jul 13 18:34:54 GMT [diskown.isEnabled:info]: software ownership has been enabled for this system

add net 127.0.0.Fr0: gateway 127.0i Jul 13 18:34:57 GMT [config.noPartnerDisks:CRITICAL]: No disks were detected for the partner; this node will be unable to takeover correctly

.0.1

Fri Jul 13 18:34:57 GMT [fmmb.current.lock.disk:info]: Disk 0a.18 is a local HA mailbox disk.

Fri Jul 13 18:34:57 GMT [fmmb.current.lock.disk:info]: Disk 0a.32 is a local HA mailbox disk.

Fri Jul 13 18:34:57 GMT [fmmb.instStat.change:info]: normal mailbox instance on local side.

Fri Jul 13 18:34:58 GMT [fmmb.instStat.change:info]: no mailbox instance on partner side.

Fri Jul 13 18:34:58 GMT [monitor.chassisPower.degraded:notice]: Chassis power is degraded:

Fri Jul 13 18:34:58 GMT [raid.assim.cls.outOfDate:error]: Orphaning Disk 0a.60 Shelf ? Bay ? [NETAPP X272_SCHT6073F10 NA16] S/N [3HZ7VJLV00007502BD5H] (21273/1351414276) in plex aggr0/0, because it is out-of-date (21607).

Fri Jul 13 18:34:58 GMT [raid.assim.cls.outOfDate:error]: Orphaning Disk 0a.42 Shelf ? Bay ? [NETAPP X272_HJURE073F10 NA14] S/N [41360296] (20851/1350424692) in plex aggr0/0, because it is out-of-date (21607).

Fri Jul 13 18:34:58 GMT [raid.assim.cls.outOfDate:error]: Orphaning Disk 0a.41 Shelf ? Bay ? [NETAPP X272_HJURE073F10 NA14] S/N [41360830] (20965/1350435816) in plex aggr0/0, because it is out-of-date (21607).

Root volume is failed (all disks may not yet be visible).

Rebooting... (ctrl-c to break reboot loop)

scottgelb · ‎2012-07-13

It shows you are missing drives in the root aggregate (and likely others). I would boot to maintenance mode and check disk assignment, all disk paths are available... not sure if anything changed or if some hardware was swapped but something changed or disks or a shelf are not viewable or assigned.

akhtara · ‎2012-07-13

Nothing has changed, was up yesterday and this morning "system inactive"..I rebooted it and its stuck in this loop.

storage show disk displays:

DISK

SHELF BAY SERIAL

VENDOR MODEL

REV

--------------------- --------- ---------------- -------- ---------- ----

0a.16	1	0 41317181	NETAPP X272_HJURE NA14
0a.17	1	1 41316538	NETAPP X272_HJURE NA14
0a.18	1	2 41327496	NETAPP X272_HJURE NA14
0a.19	1	3 412W4548	NETAPP X272_HJURE NA14
0a.20	1	4 3HZ7SWGJ000073369BCP NETAPP X272_SCHT6 NA16
0a.21	1	5 41360845	NETAPP X272_HJURE NA14
0a.22	1	6 41320011	NETAPP X272_HJURE NA14
0a.23	1	7 41316772	NETAPP X272_HJURE NA14
0a.24	1	8 41316694	NETAPP X272_HJURE NA14
0a.25	1	9 412W4347	NETAPP X272_HJURE NA14
0a.26	1 10 412Y8249	NETAPP X272_HJURE NA14
0a.27	1 11 41324284	NETAPP X272_HJURE NA14
0a.28	1 12 41304561	NETAPP X272_HJURE NA14
0a.29	1 13 41304745	NETAPP X272_HJURE NA14
0a.32	2	0 412W4612	NETAPP X272_HJURE NA14
0a.33	2	1 41355514	NETAPP X272_HJURE NA14
0a.35	2	3 41332956	NETAPP X272_HJURE NA14
0a.36	2	4 41356920	NETAPP X272_HJURE NA14
0a.37	2	5 41356723	NETAPP X272_HJURE NA14
0a.38	2	6 412K6952	NETAPP X272_HJURE NA14
0a.39	2	7 412W2029	NETAPP X272_HJURE NA14
0a.40	2	8 41352558	NETAPP X272_HJURE NA14
0a.41	2	9 41360830	NETAPP X272_HJURE NA14
0a.42	2 10 41360296	NETAPP X272_HJURE NA14
0a.43	2 11 412Z6774	NETAPP X272_HJURE NA14
0a.45	2 13 414J9147	NETAPP X272_HJURE NA14
0a.48	3	0 41320062	NETAPP X272_HJURE NA14
0a.49	3	1 41332516	NETAPP X272_HJURE NA14
0a.50	3	2 414J8914	NETAPP X272_HJURE NA14
0a.51	3	3 414J9151	NETAPP X272_HJURE NA14
0a.52	3	4 41444805	NETAPP X272_HJURE NA14
0a.53	3	5 3HZ74EKY00007336AHCD NETAPP X272_SCHT6 NA16
0a.54	3	6 414G9462	NETAPP X272_HJURE NA14
0a.55	3	7 414F4103	NETAPP X272_HJURE NA14
0a.56	3	8 414G0947	NETAPP X272_HJURE NA14
0a.57	3	9 414J7534	NETAPP X272_HJURE NA14
0a.59	3 11 3HZX5JQV000073482KAF NETAPP X272_SCHT6 NA16
0a.60	3 12 3HZ7VJLV00007502BD5H NETAPP X272_SCHT6 NA16
0a.61	3 13 414G0265	NETAPP X272_HJURE NA14

Looks like there 2 disks missing on Shelf 2...bay 2 and 12. I tried to reseed the drives and they aren't showing up.

scottgelb · ‎2012-07-13

It sounds like a hardware failure somewhere... I would check SFP and modules... make sure all paths are dual path with "storage show disk -p" ... also check that shelves are connected with fcstat device_map and/or sasadmin expander_map depending on shelf type.

AGUMADAVALLI · ‎2012-07-13

it seems to be the disk shelf pathing and please check the power on the shelf and controller.

On the active node, check the sysconfig and disk show -v, which will provide the information of disk pathing.

thank you,

AK G

AGUMADAVALLI · ‎2012-07-13

Download the wireguage and run it, it will show what drives or loops or hardware issues on the multipathing or looping

Thank you,

AK G

scottgelb · ‎2012-07-13

Config Advisor (wiregauge) needs the system up and running to run the commands... haven't tried in maintenance mode though which might work but not all the commands will be available but still might report.

akhtara · ‎2012-07-13

Paths look ok..I can see all 3 disk shelfs. I am getting the following error alot:

Fri Jul 13 20:08:35 GMT [raid.assim.cls.outOfDate:error]: Orphaning Disk 0a.43 Shelf 2 Bay 11 [NETAPP X272_HJURE073F10 NA14] S/N [412Z6774] (20885/1350427798) in plex aggr0/0, because it is out-of-date (21607).

Fri Jul 13 20:08:35 GMT [raid.assim.cls.outOfDate:error]: Orphaning Disk 0a.60 Shelf 3 Bay 12 [NETAPP X272_SCHT6073F10 NA16] S/N [3HZ7VJLV00007502BD5H] (21273/1351414276) in plex aggr0/0, because it is out-of-date (21607).

Fri Jul 13 20:08:35 GMT [raid.assim.cls.outOfDate:error]: Orphaning Disk 0a.41 Shelf 2 Bay 9 [NETAPP X272_HJURE073F10 NA14] S/N [41360830] (20965/1350435816) in plex aggr0/0, because it is out-of-date (21607).

Fri Jul 13 20:08:35 GMT [raid.assim.cls.outOfDate:error]: Orphaning Disk 0a.42 Shelf 2 Bay 10 [NETAPP X272_HJURE073F10 NA14] S/N [41360296] (20851/1350424692) in plex aggr0/0, because it is out-of-date (21607).

scottgelb · ‎2012-07-13

I would open a case. Could be a firmware or disk qualification file update needed.

Sent from my iPhone 4S

AGUMADAVALLI · ‎2012-07-13

Hi there:

If you visit your netapp "my autosupport" site, you will see the disk upgrade patch, apply it. If not send the drives to netapp, and get it replaced. It is essentially a bug:

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=144216

thank you,

AK G

pmeidell · ‎2012-07-13

You will need to give a little more information, like the background for this system. For example

1. I upgraded Data ONTAP and now it won't boot, or

2. We had a powe failure, and now it won't start, or

3. We move the system from one data center to another, and it won't boot after the move, or

4. I found a fas3020 controller in a back room, and an unused disk shelf in a storage facility, and when I connect it together, it won't boot, or

5. insert your story here.

Normally it's useful to know the system's serial number. The path forward will depend on your answers.

akhtara · ‎2012-07-13

Hey Philip...Fair enough.

The system has been up and running yesterday and has been for quite some time. No software or hardware changes were made. This morning the system was down "inactive system". I powered down the head and the 3 disk shelfs. I powered on the 3 disk shelfs and then the head and noticed its not booting or able to see the aggr0/root volume.

I noticed i had 4 failed disks, I unfailed the disks and rebooted again without any luck.

In maintenance mode, I noticed all my disks weren't assigned so I assigned them to pool0 and still no luck.

Thats how i got to where i am now.

aborzenkov · ‎2012-07-13

This is FAS3020 which is likely to use hardware based disk ownership. Do you know for sure it was using software based disk ownership before?

akhtara · ‎2012-07-13

I am almost 100% sure it was using hardware based disk ownership before.

aborzenkov · ‎2012-07-13

So why did you assign disks in this case?

You really need to open support case and wait for them to guide further steps. At this point any incorrect move can result in data loss.

akhtara · ‎2012-07-13

So looks like i lost too many disks in my aggr0/root volume. There is volume I need in aggr1. Anyone know of a way to rebuild/recreate aggr0/root and recover the volumes in aggr1?

scottgelb · ‎2012-07-14

Definitely work with support on this. With multiple failed drives that may be more than just failed drives. For root you can mark another aggr as root and ontap creates a volume called AUTOROOT. From maintenance mode you can check if aggr1 is online. You could also assign to the other node but again work with support. I think any system in a failure mode with possible loss of data gets risky in a forum.

MICKINCH_PATEL · ‎2012-07-23

How does sysconfig -r output return. Those failed drives appear as failed or orphened.

Have you open support case ? Do that first before experiments.