Solved: Re: HA Broken, want to confirm steps

Digriz60 · ‎2018-07-21

I'll try to be brief, but that's almost impossible for me, so I'll apologize now. TL;DR: I'm sorry.

We have an HA-pair of 2240-4s in our HQ. We are running Ontap 8.2.5P1 7-Mode. I know. And that's why I put it in bold because, being ancient, much of the advice I got from the NetApp Technical Support staff was slightly incorrect, because the information they researched, and the KB links, were for Cluster Mode. So when I, or we, would walk through the steps of what I'm about to describe, things didn't work as expected because of syntax changes between the two Ontap versions.

To break it down, we suffered a major problem a couple weeks ago when we lost the air conditioning to the server room, resulting in filer A losing six disks (not all at once, but over the time it took to pull the trigger on remotely powering off our server room). In the meantime, everything was overheating and we suffered a WAFL inconsistency.

NetApp techs were great and dedicated, I want to make that clear. But the issue was (and rightly so) everyone just assumes you're running cluster mode.

Now, my question. In all the ruckus, HA got broken, and NetApp's determination is that the cf.mode option is set to ha on both filers, but the chassis NVRAM reports A as being in Non-HA mode. At the end of this post, I'll paste the options cf output so you can see.

Therefore, I've been provided a KB which I've tentatively scheduled for next Tuesday night. It involves opening up the chassis and removing the CMOS and NVRAM batteries to clear them out. It will cause an outage. My only concern at this point is I'm not familiar with the internals of a 2240, never opened one up, and this is the first time I've dealt with this issue...and all the articles I seem to find are for other scenarios, other filers and, of course, clustered mode. I just want to confirm this seems like the appropriate solution, or whether there's the possibility of an online solution or PROM setting that could avoid any downtime. I love the overtime, but I also love sleep. And I want to stress, the engineer who has ownership has been great, available, informative and dedicated. I'm not questioning their (and their team's) remedy. But I have nothing to lose by asking the community.

Below is the article link, and following that is the cf information from A and B. A is the one that had the problems.

TIA,

Steve

This is the article:

Error message: Chassis FRU PROM write operation failed Replace the system chassis of controller

https://kb.netapp.com/app/answers/answer_view/a_id/1029050/loc/en_US#__highlight

Here are our options cf for A-filer, followed by B-filer (A is the one that suffered all the problems)

--------------------------------------------------------------------------------------------------

A-filer> cf enable
Controller is in Non-HA mode.
A-filer> options.cf.enable true
options.cf.enable not found. Type '?' for a list of commands
A-filer> options cf ?
Setting invalid option cf failed.
cf.giveback.auto.after.panic.takeover on
cf.giveback.auto.cancel.on_network_failure on
cf.giveback.auto.cifs.terminate.minutes 5
cf.giveback.auto.delay.seconds 600
cf.giveback.auto.enable      on
cf.giveback.auto.override.vetoes off
cf.giveback.auto.terminate.bigjobs off
cf.giveback.check.partner    on
cf.hw_assist.enable          off
cf.hw_assist.partner.address
cf.hw_assist.partner.port
cf.mode                      ha
cf.remote_syncmirror.enable off
cf.sfoaggr_maxtime           120
cf.takeover.bypass_optimization off
cf.takeover.change_fsid      on
cf.takeover.detection.seconds 15
cf.takeover.on_disk_shelf_miscompare off
cf.takeover.on_failure       off
cf.takeover.on_network_interface_failure off
cf.takeover.on_network_interface_failure.policy all_nics
cf.takeover.on_panic         off
cf.takeover.on_reboot        off
cf.takeover.on_short_uptime off
cf.takeover.use_mcrc_file    off

----------------------------------------------------------------------------------

B-filer> cf status
A-filer may be down, takeover disabled because of reason (takeover disabled by partner)
B-filer has disabled takeover by A-filer (unsynchronized log)
VIA Interconnect is up (link up).
B-filer> options cf
cf.giveback.auto.after.panic.takeover on
cf.giveback.auto.cancel.on_network_failure on
cf.giveback.auto.cifs.terminate.minutes 5
cf.giveback.auto.delay.seconds 600
cf.giveback.auto.enable      on
cf.giveback.auto.override.vetoes off
cf.giveback.auto.terminate.bigjobs off
cf.giveback.check.partner    on
cf.hw_assist.enable          on
cf.hw_assist.partner.address 192.168.blah.blah
cf.hw_assist.partner.port    4444
cf.mode                      ha
cf.remote_syncmirror.enable off        (same value required in local+partner)
cf.sfoaggr_maxtime           120        (value might be overwritten in takeover)
cf.takeover.bypass_optimization off
cf.takeover.change_fsid      on
cf.takeover.detection.seconds 15
cf.takeover.on_disk_shelf_miscompare off
cf.takeover.on_failure       on
cf.takeover.on_network_interface_failure off
cf.takeover.on_network_interface_failure.policy all_nics   (same value in local+partner recommended)
cf.takeover.on_panic         on
cf.takeover.on_reboot        off
cf.takeover.on_short_uptime on
cf.takeover.use_mcrc_file    off        (value might be overwritten in takeover)
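With two long `options cf` listings like the ones above, it is easy to miss where A and B disagree. The following is a minimal Python sketch, not a NetApp tool, for comparing two pasted listings. The function names `parse_cf_options` and `diff_cf_options` are hypothetical, and the sample strings are just a short excerpt of the output above. It assumes each option line starts with `cf.`, that a value may be missing, and that trailing parenthetical notes can be ignored.

```python
import re


def parse_cf_options(text):
    """Parse pasted `options cf` output into {option_name: value}."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("cf."):
            continue  # skip prompts, error messages and separator lines
        # Drop trailing notes such as "(same value required in local+partner)"
        line = re.sub(r"\s+\(.*\)\s*$", "", line)
        parts = line.split(None, 1)
        name = parts[0]
        # Some options (e.g. cf.hw_assist.partner.address) may have no value
        value = parts[1].strip() if len(parts) > 1 else ""
        options[name] = value
    return options


def diff_cf_options(a, b):
    """Return {name: (value_on_a, value_on_b)} for options that differ."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Short excerpt of the listings above, used only as sample input.
A_SAMPLE = """\
A-filer> options cf ?
Setting invalid option cf failed.
cf.hw_assist.enable          off
cf.hw_assist.partner.port
cf.mode                      ha
cf.remote_syncmirror.enable off
cf.takeover.on_failure       off
cf.takeover.on_panic         off
"""

B_SAMPLE = """\
B-filer> options cf
cf.hw_assist.enable          on
cf.hw_assist.partner.port    4444
cf.mode                      ha
cf.remote_syncmirror.enable off        (same value required in local+partner)
cf.takeover.on_failure       on
cf.takeover.on_panic         on
"""

if __name__ == "__main__":
    differences = diff_cf_options(
        parse_cf_options(A_SAMPLE), parse_cf_options(B_SAMPLE)
    )
    for name, (on_a, on_b) in differences.items():
        print(f"{name}: A={on_a!r} B={on_b!r}")
```

Run against the full listings, a comparison like this only shows which option values differ between the two filers. It says nothing about the chassis FRU value, which is where the mismatch in this thread actually lives.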

AlexDawson · ‎2018-07-22

Have you already tried "cf disable" on controller A yet?

There should be an FRU map on the top of the PCM (the slide-out module). Otherwise this photo shows the insides: the NVRAM battery is the large black plastic unit at the bottom of the motherboard, while the CMOS one is the coin cell at the bottom right of the board.

Digriz60 · ‎2018-07-22

Thanks for the information and illustration! I just tried disable/enable and receive the same error:

la-fas01-a> cf status
Indeterminate state. Mode is HA, FRU value is non-HA.

I was just hoping there would be a way to issue a command from the SP, but I'm sure the way things are engineered, the system has to be off, just like changing a BIOS setting. And I strongly assumed that if this is the procedure I was given, that's the only way to accomplish this. And, to be honest, even if someone came up with a hack, I'd still probably stick with the official plan. Just wanted to see what was out there, you never know!

Again, thank you!

Steve

AlexDawson · ‎2018-07-22

Yes - there is a degree of BIOS-assisted memory partitioning performed in HA mode, so it needs to be set before ONTAP boots. I've reviewed available internal documentation and there does not appear to be a workaround.

aborzenkov · ‎2018-07-22

The system cannot be "off" - HA mode is configured from a maintenance mode boot. It still means an outage for the controller in question.