2018-07-21 02:46 PM
I'll try to be brief, but that's almost impossible for me, so I'll apologize now. TL;DR: I'm sorry.
We have an HA pair of 2240-4s in our HQ. We are running ONTAP 8.2.5P1 7-Mode. I know. I'm stressing that because, the release being ancient, much of the advice I got from the NetApp Technical Support staff was slightly incorrect: the information they researched and the KB links they sent were for Cluster Mode. So when I, or we, walked through the steps of what I'm about to describe, things didn't work as expected because of syntax changes between the two ONTAP versions.
To break it down, we suffered a major problem a couple of weeks ago when we lost the air conditioning to the server room, resulting in filer A losing six disks (not all at once, but over the time it took to pull the trigger on remotely powering off our server room). In the meantime, everything was overheating and we suffered a WAFL inconsistency.
NetApp techs were great and dedicated, I want to make that clear. But the issue was (and rightly so) everyone just assumes you're running cluster mode.
Now, my question. In all the ruckus, HA got broken, and NetApp's determination is that the cf.mode option is set to HA on both filers, but the chassis NVRAM reports A as being in Non-HA mode. At the end of this post, I'll paste the options cf output so you can see.
Therefore, I've been provided a KB procedure, which I've tentatively scheduled for next Tuesday night. It involves opening up the chassis, removing the CMOS and NVRAM batteries, and clearing both out. It will cause an outage. My only concern at this point is that I'm not familiar with the internals of a 2240, have never opened one up, and this is the first time I've dealt with this issue...and all the articles I seem to find are for other scenarios, other filers and, of course, clustered mode. I just want to confirm this seems like the appropriate solution, and to ask whether there is an online solution or PROM setting that could avoid any downtime. I love the overtime, but I also love sleep. And I want to stress, the engineer who has ownership has been great, available, informative and dedicated. I'm not questioning their (and their team's) remedy. But I have nothing to lose by asking the community.
Below is the article link, and following that is the cf information from A and B. A is the one that had the problems.
This is the article:
Error message: Chassis FRU PROM write operation failed Replace the system chassis of controller
Here are our options cf for A-filer, followed by B-filer (A is the one that suffered all the problems)
A-filer> cf enable
Controller is in Non-HA mode.
A-filer> options.cf.enable true
options.cf.enable not found. Type '?' for a list of commands
A-filer> options cf ?
Setting invalid option cf failed.
B-filer> cf status
A-filer may be down, takeover disabled because of reason (takeover disabled by partner)
B-filer has disabled takeover by A-filer (unsynchronized log)
VIA Interconnect is up (link up).
B-filer> options cf
cf.remote_syncmirror.enable off (same value required in local+partner)
cf.sfoaggr_maxtime 120 (value might be overwritten in takeover)
cf.takeover.on_network_interface_failure.policy all_nics (same value in local+partner recommended)
cf.takeover.use_mcrc_file off (value might be overwritten in takeover)
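Since several of these options are flagged as needing the same value on both nodes, a small script can compare the two outputs once A responds to `options cf` again. This is only a minimal sketch: it assumes each option line has the shape `name value (note)` as pasted above, and `parse_cf_options` and `diff_options` are hypothetical helper names, not NetApp tools.

```python
import re

# Assumed line shape, taken from the output pasted in this post:
#   cf.some.option value (optional note)
OPTION_LINE = re.compile(
    r"^(?P<name>cf\.\S+)\s+(?P<value>\S+)(?:\s+\((?P<note>.*)\))?\s*$"
)

def parse_cf_options(text):
    """Map each cf.* option name to its value; prompts and other lines are skipped."""
    options = {}
    for line in text.splitlines():
        match = OPTION_LINE.match(line.strip())
        if match:
            options[match.group("name")] = match.group("value")
    return options

def diff_options(local, partner):
    """Return {name: (local_value, partner_value)} for options that differ."""
    names = sorted(set(local) | set(partner))
    return {
        name: (local.get(name), partner.get(name))
        for name in names
        if local.get(name) != partner.get(name)
    }

if __name__ == "__main__":
    b_filer = parse_cf_options(
        "B-filer> options cf\n"
        "cf.remote_syncmirror.enable off (same value required in local+partner)\n"
        "cf.sfoaggr_maxtime 120 (value might be overwritten in takeover)\n"
    )
    # Comparing a node with itself should report no differences.
    print(diff_options(b_filer, b_filer))
```

Options missing on one side show up with `None` for that node, which would also catch a filer that returns no cf options at all.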
2018-07-22 08:18 PM
Have you tried "cf disable" on controller A yet?
There should be an FRU map on the top of the PCM (the slide-out module). Otherwise, this photo shows the insides: the NVRAM battery is the large black plastic unit at the bottom of the motherboard, while the CMOS one is the coin cell on the bottom right of the board.
Re: HA Broken, want to confirm steps
2018-07-22 10:56 PM
Thanks for the information and illustration! I just tried disable/enable and received the same error:
la-fas01-a> cf status
Indeterminate state. Mode is HA, FRU value is non-HA.
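For anyone who wants to catch this condition from saved console logs, here is a rough sketch of a check for that message. It assumes the wording is exactly what `cf status` printed above (other releases may phrase it differently), and `find_ha_mismatch` is a hypothetical helper name.

```python
import re

# Pattern based only on the single message quoted in this thread (assumption).
MISMATCH = re.compile(
    r"Mode is (?P<mode>[\w-]+), FRU value is (?P<fru>[\w-]+)",
    re.IGNORECASE,
)

def find_ha_mismatch(cf_status_output):
    """Return (mode, fru_value) when the two disagree, otherwise None."""
    match = MISMATCH.search(cf_status_output)
    if match is None:
        return None
    mode, fru = match.group("mode"), match.group("fru")
    if mode.lower() == fru.lower():
        return None
    return mode, fru

if __name__ == "__main__":
    output = "Indeterminate state. Mode is HA, FRU value is non-HA."
    print(find_ha_mismatch(output))
```

It only reports the mismatch; it does nothing to fix it, which still requires the offline procedure.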
I was just hoping there would be a way to issue a command from the SP, but I'm sure the way things are engineered, the system has to be off, just like changing a BIOS setting. And I strongly assumed that if this is the procedure I was given, it's the only way to accomplish this. And, to be honest, even if someone came up with a hack, I'd still probably stick with the official plan. Just wanted to see what was out there, you never know!
Again, thank you!
Re: HA Broken, want to confirm steps
2018-07-22 11:15 PM
Yes - there is a degree of BIOS-assisted memory partitioning performed in HA mode, so it needs to be set before ONTAP boots. I've reviewed the available internal documentation and there does not appear to be a workaround.