ONTAP Hardware
I'll try to be brief, but that's almost impossible for me, so I'll apologize now. TL;DR: I'm sorry.
We have an HA pair of 2240-4s in our HQ. We are running ONTAP 8.2.5P1 7-Mode. I know. I'm calling that out because, with a release that old, much of the advice I got from the NetApp Technical Support staff was slightly incorrect: the information they researched, and the KB links they sent, were for Cluster Mode. So when I, or we, walked through the steps for what I'm about to describe, things didn't work as expected because of syntax differences between the two ONTAP modes.
To break it down, we suffered a major problem a couple of weeks ago when we lost the air conditioning in the server room. Filer A lost six disks (not all at once, but over the time it took to pull the trigger on remotely powering off our server room). In the meantime, everything was overheating and we suffered a WAFL inconsistency.
NetApp techs were great and dedicated, I want to make that clear. But the issue was (and rightly so) everyone just assumes you're running cluster mode.
Now, my question. In all the ruckus, HA got broken, and NetApp's determination is that the cf.mode option is set to HA on both filers, but the chassis NVRAM reports A as being in Non-HA mode. At the end of this post, I'll paste the options cf output so you can see.
Therefore, I've been provided a KB, which I've tentatively scheduled for next Tuesday night. It involves opening up the chassis, removing the CMOS and NVRAM batteries, and clearing both out. It will cause an outage.
My only concern at this point is that I'm not familiar with the internals of a 2240, have never opened one up, and this is the first time I've dealt with this issue. All the articles I seem to find are for other scenarios, other filers and, of course, clustered mode. I just want to confirm that this seems like the appropriate solution, or whether there is an online solution or PROM setting that could avoid any downtime. I love the overtime, but I also love sleep. And I want to stress, the engineer who has ownership has been great, available, informative and dedicated. I'm not questioning their (and their team's) remedy. But I have nothing to lose by asking the community.
Below is the article link, and following that is the cf information from A and B. A is the one that had the problems.
TIA,
Steve
This is the article:
Error message: Chassis FRU PROM write operation failed Replace the system chassis of controller
https://kb.netapp.com/app/answers/answer_view/a_id/1029050/loc/en_US#__highlight
Here are our options cf for A-filer, followed by B-filer (A is the one that suffered all the problems)
--------------------------------------------------------------------------------------------------
A-filer> cf enable
Controller is in Non-HA mode.
A-filer> options.cf.enable true
options.cf.enable not found. Type '?' for a list of commands
A-filer> options cf ?
Setting invalid option cf failed.
cf.giveback.auto.after.panic.takeover on
cf.giveback.auto.cancel.on_network_failure on
cf.giveback.auto.cifs.terminate.minutes 5
cf.giveback.auto.delay.seconds 600
cf.giveback.auto.enable on
cf.giveback.auto.override.vetoes off
cf.giveback.auto.terminate.bigjobs off
cf.giveback.check.partner on
cf.hw_assist.enable off
cf.hw_assist.partner.address
cf.hw_assist.partner.port
cf.mode ha
cf.remote_syncmirror.enable off
cf.sfoaggr_maxtime 120
cf.takeover.bypass_optimization off
cf.takeover.change_fsid on
cf.takeover.detection.seconds 15
cf.takeover.on_disk_shelf_miscompare off
cf.takeover.on_failure off
cf.takeover.on_network_interface_failure off
cf.takeover.on_network_interface_failure.policy all_nics
cf.takeover.on_panic off
cf.takeover.on_reboot off
cf.takeover.on_short_uptime off
cf.takeover.use_mcrc_file off
----------------------------------------------------------------------------------
B-filer> cf status
A-filer may be down, takeover disabled because of reason (takeover disabled by partner)
B-filer has disabled takeover by A-filer (unsynchronized log)
VIA Interconnect is up (link up).
B-filer> options cf
cf.giveback.auto.after.panic.takeover on
cf.giveback.auto.cancel.on_network_failure on
cf.giveback.auto.cifs.terminate.minutes 5
cf.giveback.auto.delay.seconds 600
cf.giveback.auto.enable on
cf.giveback.auto.override.vetoes off
cf.giveback.auto.terminate.bigjobs off
cf.giveback.check.partner on
cf.hw_assist.enable on
cf.hw_assist.partner.address 192.168.blah.blah
cf.hw_assist.partner.port 4444
cf.mode ha
cf.remote_syncmirror.enable off (same value required in local+partner)
cf.sfoaggr_maxtime 120 (value might be overwritten in takeover)
cf.takeover.bypass_optimization off
cf.takeover.change_fsid on
cf.takeover.detection.seconds 15
cf.takeover.on_disk_shelf_miscompare off
cf.takeover.on_failure on
cf.takeover.on_network_interface_failure off
cf.takeover.on_network_interface_failure.policy all_nics (same value in local+partner recommended)
cf.takeover.on_panic on
cf.takeover.on_reboot off
cf.takeover.on_short_uptime on
cf.takeover.use_mcrc_file off (value might be overwritten in takeover)
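For anyone comparing the two listings above by eye, here is a minimal Python sketch that diffs two pasted "options cf" outputs. It is only an illustration, not a NetApp tool: the function names parse_cf_options and diff_cf_options are made up for this example, and the sample strings are just a handful of lines copied from the output above. It assumes each option line starts with "cf." and that any trailing text in parentheses is an annotation rather than part of the value.

```python
import re


def parse_cf_options(text):
    """Parse pasted 'options cf' output into {option_name: value}.

    Hypothetical helper: assumes option lines start with 'cf.' and that a
    trailing '(...)' is an annotation, not part of the value.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("cf."):
            continue  # skip prompts, error messages and separator lines
        line = re.sub(r"\s*\(.*\)\s*$", "", line)  # drop trailing annotation
        parts = line.split(None, 1)
        options[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return options


def diff_cf_options(a, b):
    """Return {name: (a_value, b_value)} for every option that differs."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# A few lines copied from the A-filer and B-filer listings above.
A_SAMPLE = """\
cf.hw_assist.enable off
cf.hw_assist.partner.address
cf.mode ha
cf.sfoaggr_maxtime 120
cf.takeover.on_failure off
"""

B_SAMPLE = """\
cf.hw_assist.enable on
cf.hw_assist.partner.address 192.168.blah.blah
cf.mode ha
cf.sfoaggr_maxtime 120 (value might be overwritten in takeover)
cf.takeover.on_failure on
"""

differences = diff_cf_options(parse_cf_options(A_SAMPLE), parse_cf_options(B_SAMPLE))
for name, (a_value, b_value) in differences.items():
    print(f"{name}: A={a_value!r} B={b_value!r}")
```

Run against the full listings, this would only surface options whose values differ between the two controllers. It says nothing about the FRU/NVRAM HA state, which is not an option value and does not appear in this output.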
Have you tried "cf disable" on controller A yet?
There should be an FRU map on the top of the PCM (the slide-out module). Otherwise this photo shows the insides: the NVRAM battery is the large black plastic unit at the bottom of the motherboard, while the CMOS one is the coin cell at the bottom right of the board.
Thanks for the information and illustration! I just tried disable/enable and received the same error:
la-fas01-a> cf status
Indeterminate state. Mode is HA, FRU value is non-HA.
I was just hoping there would be a way to issue a command from the SP, but I'm sure that, the way things are engineered, the system has to be off, just like when changing a BIOS setting. And I strongly assumed that if this is the procedure I was given, it's the only way to accomplish this. To be honest, even if someone came up with a hack, I'd still probably stick with the official plan. Just wanted to see what was out there, you never know!
Again, thank you!
Steve
Yes - there is a degree of BIOS-assisted memory partitioning performed in HA mode, so it needs to be set before ONTAP boots. I've reviewed available internal documentation and there does not appear to be a workaround.
The system cannot be "off", though - HA mode is configured from a maintenance mode boot. That still means an outage for the controller in question.