I'll try to be brief, but that's almost impossible for me, so I'll apologize now. TL;DR: I'm sorry.
We have an HA pair of 2240-4s in our HQ, running ONTAP 8.2.5P1 7-Mode. I know. I'm emphasizing 7-Mode because, the release being ancient, much of the advice I got from the NetApp Technical Support staff was slightly incorrect: the information they researched, and the KB links, were for Cluster Mode. So when I, or we, walked through the steps of what I'm about to describe, things didn't work as expected because of syntax changes between the two ONTAP versions.
To break it down, we suffered a major problem a couple of weeks ago when we lost the air conditioning in the server room. Filer A lost six disks (not all at once, but over the time it took to pull the trigger on remotely powering off our server room). In the meantime, everything was overheating and we suffered a WAFL inconsistency.
NetApp techs were great and dedicated, I want to make that clear. But the issue was (and rightly so) everyone just assumes you're running cluster mode.
Now, my question. In all the ruckus, HA got broken. NetApp's determination is that the cf.mode option is set to ha on both filers, but the chassis NVRAM reports A as being in Non-HA mode. At the end of this post, I'll paste the options cf output so you can see.
Therefore, I've been provided a KB, and I've tentatively scheduled the work for next Tuesday night. It involves opening up the chassis, removing the CMOS and NVRAM batteries, and clearing both out. It will cause an outage. My only concern at this point is that I'm not familiar with the internals of a 2240, have never opened one up, and this is the first time I've dealt with this issue. All the articles I seem to find are for other scenarios, other filers and, of course, clustered mode. I just want to confirm that this seems like the appropriate solution, or whether there is an online solution or PROM setting that could avoid any downtime. I love the overtime, but I also love sleep. And I want to stress, the engineer who has ownership has been great, available, informative and dedicated. I'm not questioning their (and their team's) remedy. But I have nothing to lose by asking the community.
Below is the article link, followed by the cf information from A and B. A is the one that had the problems.
This is the article:
Error message: Chassis FRU PROM write operation failed Replace the system chassis of controller
A-filer> cf enable
Controller is in Non-HA mode.
A-filer> options.cf.enable true
options.cf.enable not found. Type '?' for a list of commands
A-filer> options cf ?
Setting invalid option cf failed.
cf.giveback.auto.after.panic.takeover on
cf.giveback.auto.cancel.on_network_failure on
cf.giveback.auto.cifs.terminate.minutes 5
cf.giveback.auto.delay.seconds 600
cf.giveback.auto.enable on
cf.giveback.auto.override.vetoes off
cf.giveback.auto.terminate.bigjobs off
cf.giveback.check.partner on
cf.hw_assist.enable off
cf.hw_assist.partner.address
cf.hw_assist.partner.port
cf.mode ha
cf.remote_syncmirror.enable off
cf.sfoaggr_maxtime 120
cf.takeover.bypass_optimization off
cf.takeover.change_fsid on
cf.takeover.detection.seconds 15
cf.takeover.on_disk_shelf_miscompare off
cf.takeover.on_failure off
cf.takeover.on_network_interface_failure off
cf.takeover.on_network_interface_failure.policy all_nics
cf.takeover.on_panic off
cf.takeover.on_reboot off
cf.takeover.on_short_uptime off
cf.takeover.use_mcrc_file off
B-filer> cf status
A-filer may be down, takeover disabled because of reason (takeover disabled by partner)
B-filer has disabled takeover by A-filer (unsynchronized log)
VIA Interconnect is up (link up).
B-filer> options cf
cf.giveback.auto.after.panic.takeover on
cf.giveback.auto.cancel.on_network_failure on
cf.giveback.auto.cifs.terminate.minutes 5
cf.giveback.auto.delay.seconds 600
cf.giveback.auto.enable on
cf.giveback.auto.override.vetoes off
cf.giveback.auto.terminate.bigjobs off
cf.giveback.check.partner on
cf.hw_assist.enable on
cf.hw_assist.partner.address 192.168.blah.blah
cf.hw_assist.partner.port 4444
cf.mode ha
cf.remote_syncmirror.enable off (same value required in local+partner)
cf.sfoaggr_maxtime 120 (value might be overwritten in takeover)
cf.takeover.bypass_optimization off
cf.takeover.change_fsid on
cf.takeover.detection.seconds 15
cf.takeover.on_disk_shelf_miscompare off
cf.takeover.on_failure on
cf.takeover.on_network_interface_failure off
cf.takeover.on_network_interface_failure.policy all_nics (same value in local+partner recommended)
cf.takeover.on_panic on
cf.takeover.on_reboot off
cf.takeover.on_short_uptime on
cf.takeover.use_mcrc_file off (value might be overwritten in takeover)
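For anyone comparing the two listings by eye, here is a rough Python sketch that diffs two saved `options cf` outputs. It is not a NetApp tool: the function names `parse_cf_options` and `diff_options` are made up for illustration, the sample text is a subset of the lines pasted above, and it assumes each option is printed as a name, an optional value, and an optional trailing note in parentheses.

```python
import re

def parse_cf_options(text):
    """Parse saved `options cf` output into {option_name: value}.

    Assumption: option lines start with "cf." and any trailing
    "(...)" note is an annotation, not part of the value.
    """
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("cf."):
            continue  # skip prompts, errors, and blank lines
        line = re.sub(r"\s*\(.*\)\s*$", "", line)  # drop trailing note
        parts = line.split(None, 1)
        opts[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return opts

def diff_options(a, b):
    """Return {name: (value_on_a, value_on_b)} for options that differ."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}

# Subset of the output pasted above (hypothetical capture files would work too).
A_OUTPUT = """\
cf.hw_assist.enable off
cf.hw_assist.partner.address
cf.hw_assist.partner.port
cf.mode ha
cf.takeover.on_failure off
cf.takeover.on_panic off
cf.takeover.on_short_uptime off
cf.takeover.use_mcrc_file off
"""

B_OUTPUT = """\
cf.hw_assist.enable on
cf.hw_assist.partner.address 192.168.blah.blah
cf.hw_assist.partner.port 4444
cf.mode ha
cf.takeover.on_failure on
cf.takeover.on_panic on
cf.takeover.on_short_uptime on
cf.takeover.use_mcrc_file off (value might be overwritten in takeover)
"""

differences = diff_options(parse_cf_options(A_OUTPUT), parse_cf_options(B_OUTPUT))
for name, (on_a, on_b) in differences.items():
    print(f"{name}: A={on_a!r} B={on_b!r}")
```

On this subset, cf.mode does not show up as a difference, which matches NetApp's finding that the option itself is set to ha on both filers; the differences are in the hw_assist and takeover options.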
Have you tried "cf disable" on controller A yet?
There should be an FRU map on the top of the PCM (the slide-out module). Otherwise, this photo shows the insides: the NVRAM battery is the large black plastic unit at the bottom of the motherboard, while the CMOS one is the coin cell at the bottom right of the board.
Thanks for the information and illustration! I just tried disable/enable and received the same error:
la-fas01-a> cf status Indeterminate state. Mode is HA, FRU value is non-HA.
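If you want to check for this condition from a script (for example, against saved `cf status` output after the maintenance window), a small sketch like the following could work. The message format is taken only from the single line quoted above, so treat the pattern as an assumption; `parse_mode_mismatch` is a made-up helper name.

```python
import re

# Pattern based solely on the one message quoted in this thread.
_MODE_RE = re.compile(r"Mode is ([\w-]+), FRU value is ([\w-]+)")

def parse_mode_mismatch(status_text):
    """Return (configured_mode, fru_value) if the text contains the
    'Mode is X, FRU value is Y' message, otherwise None."""
    match = _MODE_RE.search(status_text)
    if match is None:
        return None
    return match.group(1), match.group(2)

sample = "Indeterminate state. Mode is HA, FRU value is non-HA."
result = parse_mode_mismatch(sample)
if result is not None and result[0].lower() != result[1].lower():
    print(f"Mismatch: cf.mode says {result[0]}, chassis FRU says {result[1]}")
```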
I was just hoping there would be a way to issue a command from the SP, but I'm sure that, the way things are engineered, the system has to be off, just like changing a BIOS setting. And I strongly assumed that if this is the procedure I was given, it's the only way to accomplish this. To be honest, even if someone came up with a hack, I'd still probably stick with the official plan. Just wanted to see what was out there, you never know!
Yes - there is a degree of BIOS-assisted memory partitioning performed in HA mode, so it needs to be set before ONTAP boots. I've reviewed available internal documentation and there does not appear to be a workaround.