FAS2050 HA Pair: Unable to boot filer

Britto · ‎2017-11-21

Hello Team,

We have two FAS2050 filers (st10 and st11) configured as an HA pair running ONTAP 7.3.6. Recently one of the filers, st10, broke, and we replaced its hard disks. When we tried to boot the filer we could not, because it reports that aggr0, the root partition, has failed. We tried the recover/diagnostics/scan options, but nothing worked. We enabled takeover on st11; it was able to take over, and we could see the partitions of st10 from st11. However, we still cannot boot st10, and we have data stored on st10. Can you please help us with this issue?

1. When we ran disk show, we found that some of st10's disks were owned by st11. Is there any way we can get the ownership back?

2. Is there any way we can enable takeover and access the st10 volumes through st11 to copy the data?

We need to retrieve the data because SnapMirror has also failed. Any help would be greatly appreciated, since we have already spent days on this issue and nothing has worked.

thanks

britto.

AlexDawson · ‎2017-11-21

Hi there,

From what you've said, it sounds like your data is on a different aggregate of st10, and what has failed is the root aggregate of st10. Additionally, you believe there are a number of disks with incorrect ownership. To resolve this problem you will need to resolve ownership of the disks (it may be unclear because the system is in failover?), recreate aggr0 of st10, reinitialize ONTAP on that controller only, and then reconfigure failover.

This is quite possible, however it is an advanced activity, and my strong recommendation would be to contact a NetApp reseller in your area and have it done by a professional services consultant to avoid potential problems.

If you wish to attempt it yourself, please post the output from "disk show -a" and "aggr status -r", and please tell us which disks have been replaced, and what you have done to controller st10.

Britto · ‎2017-11-21

Hello Alex,

Thank you for responding. You are right about the data.

Our data is stored in other aggregates, not in the root aggregate.

I am attaching the output of "disk show -a" and "aggr status -r" for both systems (st10 and st11).

We added the following 5 disks to st10:

0a.69
0a.70
0a.76
0a.84
0a.92

But after rebooting the system, we get an error saying that aggr0 has failed.

We ran WAFL_check on the system, and st10 was then able to boot successfully, but within a minute or so st11 takes it over.

If we give back from st11 to st10, a few of st10's disks still remain with st11.

We are not sure how to return ownership of the disks retained by st11 to st10.

We physically checked st10 for errors and warnings, but there are no amber lights or other warnings.

However, we found the following error message using the "system log" command at the BMC prompt:

SYSTEM_BOOT_FAILED(POST failed.)

So we tried to troubleshoot based on this link: https://kb.netapp.com/app/answers/answer_view/a_id/1006069

That did not help either: NVRAM still holds some data, so the update_flash command could not run successfully.

So now we have the following questions:

Is there any way to get back the disks of st10 that are still under the control of st11 after giveback?
How do we repair the root partition, and will there be any data loss if we repair it?

Thanks,

Britto

AlexDawson · ‎2017-11-22

It looks like aggr0 on st10 has suffered a failure of up to 10 disks, which is more than RAID-DP can protect against and is likely to have caused data loss. With 16 disks originally, I am concerned that this aggregate did not contain only root data. On st11, can you please run "partner aggr status -r" and post the results? It is possible the aggregate only looks inconsistent on st10 because it does not have all disks visible, but I believe that is unlikely.
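
To illustrate the RAID-DP point: RAID-DP keeps two parity disks per RAID group, so a group survives at most two simultaneous disk failures. The sketch below is only a back-of-the-envelope check. The function names are made up for illustration, and the per-group failure counts are assumptions, since the real RAID group layout of aggr0 is not shown in this thread.

```python
# Rough check of RAID-DP fault tolerance (illustrative only).
# RAID-DP tolerates at most two failed disks per RAID group.
RAID_DP_MAX_FAILURES = 2


def raid_group_survives(failed_disks, max_failures=RAID_DP_MAX_FAILURES):
    """Return True if one RAID group can still be reconstructed."""
    return failed_disks <= max_failures


def aggregate_survives(failed_per_group):
    """An aggregate is only usable if every one of its RAID groups survives."""
    return all(raid_group_survives(n) for n in failed_per_group)


if __name__ == "__main__":
    # Assumed layouts: the actual grouping of the 16 disks is unknown.
    print(aggregate_survives([10]))    # all 10 failures in a single group
    print(aggregate_survives([5, 5]))  # failures split over two groups
    print(aggregate_survives([2, 2]))  # two failures per group is the limit
```

Whichever way ten failures are spread over a 16-disk aggregate, at least one group ends up with more than two, which is why I am not optimistic about aggr0.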

Unfortunately, with this many disks (94 by my count?) and a 10-year-old system, I don't feel confident that a forum post is appropriate for resolving this.

I would again strongly suggest that you find a local partner or service provider via our webpage at http://solutionconnection.netapp.com/partnerdirectory.aspx and see if they would be able to assist. I used to work for one that would do this sort of work at approximately USD$1800/day.

Britto · ‎2017-11-22

Hello Alex,

Currently st10 is not taken over by st11, so the partner command will not return any results for it.

However, I ran the "events all" command in the BMC shell and got the following output:

bmc shell -> events all
Event TimeStamp Id Sensor Description
-----------------------------------------------------------------
1 11/21/2017 13:11:50 GMT #ff -- Critical Stop during OS boot
2 11/21/2017 13:15:30 GMT #ff -- Critical Stop during OS boot
3 11/21/2017 13:19:53 GMT #ff -- Critical Stop during OS boot
4 11/21/2017 13:23:40 GMT #ff -- NOTICE: USER_TRIGGERED (system reset)
5 11/21/2017 13:27:45 GMT #b6 SAS Status State Deasserted
6 11/21/2017 13:27:45 GMT #b6 SAS Status State Asserted
7 11/21/2017 13:27:50 GMT #b6 SAS Status State Deasserted
8 11/21/2017 13:27:50 GMT #b6 SAS Status State Asserted
9 11/21/2017 17:00:12 GMT #ff -- Critical Stop during OS boot
10 11/21/2017 17:12:28 GMT #77 1.1V Below Crit-Low (1109 mV <= 2997 mV )
11 11/21/2017 17:12:33 GMT #77 1.1V Above Crit-Low (1109 mV > 0 mV )
12 11/21/2017 17:12:42 GMT #76 1.2V Below Crit-Low (1215 mV <= 2997 mV )
13 11/21/2017 17:12:47 GMT #76 1.2V Above Crit-Low (1215 mV > 0 mV )
14 11/21/2017 17:12:57 GMT #75 1.5V Below Crit-Low (1498 mV <= 2997 mV )
15 11/21/2017 17:13:02 GMT #75 1.5V Above Crit-Low (1498 mV > 0 mV )
16 11/21/2017 17:13:12 GMT #74 1.8V Below Crit-Low (1829 mV <= 2997 mV )
17 11/21/2017 17:13:17 GMT #74 1.8V Above Crit-Low (1829 mV > 0 mV )
18 11/21/2017 17:13:27 GMT #70 12.0V Below Crit-Low (12000 mV <= 20320 mV )
19 11/21/2017 17:13:32 GMT #70 12.0V Above Crit-Low (12080 mV > 0 mV )
20 11/21/2017 17:13:42 GMT #73 2.5V Below Crit-Low (2539 mV <= 4673 mV )
21 11/21/2017 17:13:47 GMT #73 2.5V Above Crit-Low (2539 mV > 0 mV )
22 11/21/2017 17:13:57 GMT #72 3.3V Below Crit-Low (3351 mV <= 5994 mV )
23 11/21/2017 17:14:02 GMT #72 3.3V Above Crit-Low (3351 mV > 0 mV )
24 11/21/2017 17:14:12 GMT #71 CPU 1.2V Below Crit-Low (1191 mV <= 2997 mV )
25 11/21/2017 17:14:17 GMT #71 CPU 1.2V Above Crit-Low (1203 mV > 0 mV )
26 11/21/2017 17:14:32 GMT #50 Batt 8.0V Above Warn-High (7488 mV >= 0 mV )
27 11/21/2017 17:14:37 GMT #50 Batt 8.0V Above Crit-High (7488 mV >= 0 mV )
28 11/21/2017 17:14:42 GMT #50 Batt 8.0V Below Crit-High (7488 mV < 8576 mV )
29 11/21/2017 17:14:42 GMT #59 Batt Amp Above Warn-High (0 mA >= -4096 mA )
30 11/21/2017 17:14:47 GMT #50 Batt 8.0V Below Warn-High (7488 mV < 8512 mV )
31 11/21/2017 17:14:47 GMT #59 Batt Amp Above Crit-High (0 mA >= -4096 mA )
32 11/21/2017 17:14:52 GMT #59 Batt Amp Below Crit-High (0 mA < 2208 mA )
33 11/21/2017 17:14:52 GMT #55 Batt Run Time Below Warn-Low (128 h <= 255 h )
34 11/21/2017 17:14:57 GMT #59 Batt Amp Below Warn-High (0 mA < 2112 mA )
35 11/21/2017 17:14:57 GMT #55 Batt Run Time Below Crit-Low (128 h <= 255 h )
36 11/21/2017 17:15:02 GMT #55 Batt Run Time Above Crit-Low (128 h > 72 h )
37 11/21/2017 17:15:02 GMT #51 Batt Temp Below Warn-Low (28 C <= 126 C )
38 11/21/2017 17:15:07 GMT #55 Batt Run Time Above Warn-Low (128 h > 80 h )
39 11/21/2017 17:15:07 GMT #51 Batt Temp Below Crit-Low (28 C <= 126 C )
40 11/21/2017 17:15:12 GMT #51 Batt Temp Above Crit-Low (28 C > -128 C )
41 11/21/2017 17:15:17 GMT #51 Batt Temp Above Warn-Low (28 C > -128 C )
42 11/21/2017 17:15:27 GMT #58 Charger Cycles Above Warn-High (47 >= 0 )
43 11/21/2017 17:15:32 GMT #58 Charger Cycles Above Crit-High (47 >= 0 )
44 11/21/2017 17:15:37 GMT #58 Charger Cycles Below Crit-High (47 < 251 )
45 11/21/2017 17:15:37 GMT #0b NVMEM 1.8V Below Warn-Low (1790 mV <= 3581 mV )
46 11/21/2017 17:15:42 GMT #58 Charger Cycles Below Warn-High (47 < 250 )
47 11/21/2017 17:15:42 GMT #0b NVMEM 1.8V Below Crit-Low (1776 mV <= 3581 mV )
48 11/21/2017 17:15:47 GMT #0b NVMEM 1.8V Above Crit-Low (1804 mV > 0 mV )
49 11/21/2017 17:15:52 GMT #0b NVMEM 1.8V Above Warn-Low (1776 mV > 0 mV )
50 11/21/2017 17:15:57 GMT #0a NVMEM 8.0V Above Warn-High (7265 mV >= 0 mV )
51 11/21/2017 17:16:02 GMT #0a NVMEM 8.0V Above Crit-High (7265 mV >= 0 mV )
52 11/21/2017 17:16:07 GMT #0a NVMEM 8.0V Below Crit-High (7265 mV < 8604 mV )
53 11/21/2017 17:16:12 GMT #08 Board Temp Bot Below Warn-Low (44 C <= 126 C )
54 11/21/2017 17:16:12 GMT #0a NVMEM 8.0V Below Warn-High (7265 mV < 8460 mV )
55 11/21/2017 17:16:17 GMT #08 Board Temp Bot Below Crit-Low (44 C <= 126 C )
56 11/21/2017 17:16:22 GMT #08 Board Temp Bot Above Crit-Low (44 C > -128 C )
57 11/21/2017 17:16:27 GMT #08 Board Temp Bot Above Warn-Low (44 C > -128 C )
58 11/21/2017 17:16:32 GMT #07 Board Temp Top Below Warn-Low (41 C <= 126 C )
59 11/21/2017 17:16:37 GMT #07 Board Temp Top Below Crit-Low (41 C <= 126 C )
60 11/21/2017 17:16:42 GMT #07 Board Temp Top Above Crit-Low (41 C > -128 C )
61 11/21/2017 17:16:47 GMT #07 Board Temp Top Above Warn-Low (41 C > -128 C )
62 11/21/2017 17:16:57 GMT #09 CPU Temp Above Crit-High (48 C >= -128 C )
63 11/21/2017 17:17:02 GMT #09 CPU Temp Below Crit-High (48 C < 126 C )
64 11/21/2017 17:24:38 GMT #77 1.1V Below Crit-Low (1109 mV <= 2997 mV )
65 11/21/2017 17:24:43 GMT #77 1.1V Above Crit-Low (1109 mV > 0 mV )
66 11/21/2017 17:24:52 GMT #76 1.2V Below Crit-Low (1215 mV <= 2997 mV )
67 11/21/2017 17:24:57 GMT #76 1.2V Above Crit-Low (1227 mV > 0 mV )
68 11/21/2017 17:25:07 GMT #75 1.5V Below Crit-Low (1498 mV <= 2997 mV )
69 11/21/2017 17:25:12 GMT #75 1.5V Above Crit-Low (1498 mV > 0 mV )
70 11/21/2017 17:25:22 GMT #74 1.8V Below Crit-Low (1829 mV <= 2997 mV )
71 11/21/2017 17:25:27 GMT #74 1.8V Above Crit-Low (1829 mV > 0 mV )
72 11/21/2017 17:25:37 GMT #70 12.0V Below Crit-Low (12080 mV <= 20320 mV )
73 11/21/2017 17:25:42 GMT #70 12.0V Above Crit-Low (12080 mV > 0 mV )
74 11/21/2017 17:25:52 GMT #73 2.5V Below Crit-Low (2539 mV <= 4673 mV )
75 11/21/2017 17:25:57 GMT #73 2.5V Above Crit-Low (2539 mV > 0 mV )
76 11/21/2017 17:26:07 GMT #72 3.3V Below Crit-Low (3351 mV <= 5994 mV )
77 11/21/2017 17:26:12 GMT #72 3.3V Above Crit-Low (3351 mV > 0 mV )
78 11/21/2017 17:26:22 GMT #71 CPU 1.2V Below Crit-Low (1203 mV <= 2997 mV )
79 11/21/2017 17:26:27 GMT #71 CPU 1.2V Above Crit-Low (1203 mV > 0 mV )
80 11/21/2017 17:26:42 GMT #50 Batt 8.0V Above Warn-High (7488 mV >= 0 mV )
81 11/21/2017 17:26:47 GMT #50 Batt 8.0V Above Crit-High (7488 mV >= 0 mV )
82 11/21/2017 17:26:52 GMT #50 Batt 8.0V Below Crit-High (7488 mV < 8576 mV )
83 11/21/2017 17:26:52 GMT #59 Batt Amp Above Warn-High (0 mA >= -4096 mA )
84 11/21/2017 17:26:57 GMT #50 Batt 8.0V Below Warn-High (7488 mV < 8512 mV )
85 11/21/2017 17:26:57 GMT #59 Batt Amp Above Crit-High (0 mA >= -4096 mA )
86 11/21/2017 17:27:02 GMT #59 Batt Amp Below Crit-High (0 mA < 2208 mA )
87 11/21/2017 17:27:02 GMT #55 Batt Run Time Below Warn-Low (128 h <= 255 h )
88 11/21/2017 17:27:07 GMT #59 Batt Amp Below Warn-High (0 mA < 2112 mA )
89 11/21/2017 17:27:07 GMT #55 Batt Run Time Below Crit-Low (128 h <= 255 h )
90 11/21/2017 17:27:12 GMT #55 Batt Run Time Above Crit-Low (128 h > 72 h )
91 11/21/2017 17:27:12 GMT #51 Batt Temp Below Warn-Low (28 C <= 126 C )
92 11/21/2017 17:27:17 GMT #55 Batt Run Time Above Warn-Low (128 h > 80 h )
93 11/21/2017 17:27:17 GMT #51 Batt Temp Below Crit-Low (28 C <= 126 C )
94 11/21/2017 17:27:22 GMT #51 Batt Temp Above Crit-Low (28 C > -128 C )
95 11/21/2017 17:27:27 GMT #51 Batt Temp Above Warn-Low (28 C > -128 C )
96 11/21/2017 17:27:37 GMT #58 Charger Cycles Above Warn-High (47 >= 0 )
97 11/21/2017 17:27:42 GMT #58 Charger Cycles Above Crit-High (47 >= 0 )
98 11/21/2017 17:27:47 GMT #58 Charger Cycles Below Crit-High (47 < 251 )

What do you think about these events?

Is the battery also failing?

Thanks,

Britto

AlexDawson · ‎2017-11-22

That output looks like the BMC's thresholds have become incorrect and it is raising an error on every one of them.

This page lists what the thresholds should look like - https://library.netapp.com/ecmdocs/ECMP1112529/html/GUID-00B725C6-DC1F-47D1-80A7-40345DB7A1E1.html - these may not be the exact values for a FAS2050, but they will be close.
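
One way to see the problem in the "events all" output is that the same threshold is reported with different values from one event to the next (for example, Crit-Low for the 1.1V sensor appears as both 2997 mV and 0 mV). The following is a minimal sketch that picks those cases out of a pasted log. The regular expression is my own guess at the line layout based on the output posted above, not an official format, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the pasted log:
#   <n> <date> <time> GMT #<id> <sensor> Below|Above <level> (<reading> <op> <threshold> )
EVENT = re.compile(
    r"#(?P<id>[0-9a-f]{2})\s+(?P<sensor>.+?)\s+(?:Below|Above)\s+"
    r"(?P<level>(?:Crit|Warn)-(?:Low|High))\s+"
    r"\(\s*-?\d+\s*\w*\s*(?:<=|>=|<|>)\s*(?P<threshold>-?\d+)"
)

SAMPLE_EVENTS = [
    "1 11/21/2017 13:11:50 GMT #ff -- Critical Stop during OS boot",
    "10 11/21/2017 17:12:28 GMT #77 1.1V Below Crit-Low (1109 mV <= 2997 mV )",
    "11 11/21/2017 17:12:33 GMT #77 1.1V Above Crit-Low (1109 mV > 0 mV )",
    "44 11/21/2017 17:15:37 GMT #58 Charger Cycles Below Crit-High (47 < 251 )",
]


def thresholds_by_sensor(lines):
    """Collect every threshold value reported for each (sensor, level) pair."""
    seen = defaultdict(set)
    for line in lines:
        match = EVENT.search(line)
        if match:  # lines without a threshold comparison are skipped
            key = (match["sensor"], match["level"])
            seen[key].add(int(match["threshold"]))
    return dict(seen)


def inconsistent_thresholds(lines):
    """Return the (sensor, level) pairs whose threshold value changes between events."""
    return {
        key: sorted(values)
        for key, values in thresholds_by_sensor(lines).items()
        if len(values) > 1
    }


if __name__ == "__main__":
    for (sensor, level), values in inconsistent_thresholds(SAMPLE_EVENTS).items():
        print(f"{sensor} {level}: thresholds seen {values}")
```

A healthy BMC should report one fixed value per threshold, so any sensor this flags points at the thresholds rather than at the readings themselves.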

Britto · ‎2017-11-22

Hello,

Thanks for your reply; we really appreciate your support.

We tried the "Sensor Show" command, and it does display the sensor values. However, the State of every sensor is shown as Unknown. I also searched for the BIOS Status sensor and have posted the result below.

Any idea what could be the issue? Thanks again for your quick support.

bmc shell -> sensors search id=#f0
Name State Id Reading Crit-Low Warn-Low Warn-High Crit-High
-------------------------------------------------------------------------------------
BIOS Status Unknown #f0 -- -- -- -- --

thanks

britto.