A question came up last week while I was setting up my new filer: how long does the partner wait for a failed system to come back before it decides to take over its identity?
Hunting around on the NOW site didn't yield any results, except for some information from TR-3450 and the Active/Active Configuration Guide.
Looking at the excerpts below, I can't tell whether the author had something to hide and chose clever wording on purpose, or whether the detail was left out unintentionally.
I'd appreciate it if someone could shed some light on this.
Enable the cf.takeover.on_panic Option Only in Block Protocol (FCP/iSCSI) Environments
Controller takeover does not occur automatically in all panic situations; panics which only trigger a reboot do not cause takeover under normal circumstances. Active/active controller takeover is generally advantageous in two scenarios:
· Situations where clients will not disconnect upon controller reboot, but which experience crippling timeouts if controller response does not resume within 90 seconds. Neither CIFS nor NFS falls into this category.
· Controller panics that cause the node to halt without rebooting (motherboard failure, NVRAM failure, Fibre Channel loop failure, etc.).
Active/active controllers can be configured to perform a takeover in all scenarios (instead of a reboot in some) by enabling the cf.takeover.on_panic option on both nodes. However, NetApp does not recommend enabling this option in a NAS environment. Enabling the cf.takeover.on_panic option requires that a spare disk is available for the core dump file.
Ontap Active/Active Configuration Guide
options cf.takeover.on_panic [on|off]
on Enables immediate takeover of a failed partner or off to disable immediate takeover.
off Disables immediate takeover. If you disable this option, normal takeover procedures apply. The node still takes over if its partner panics, but might take longer to do so.
Note: If you enter this command on one node, the value applies to both nodes. The setting of this option is persistent across reboots.
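For reference, checking and changing the option from the console would look roughly like the sketch below. The `filer1>` prompt is a placeholder hostname, not something from the guide, and the output is omitted because it varies by release. Entering `options` with just the option name should print the current value, and `cf status` reports the state of the pair.

```
filer1> options cf.takeover.on_panic
filer1> options cf.takeover.on_panic on
filer1> cf status
```

It is worth confirming the syntax against the guide for your Data ONTAP release before relying on it.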
Essentially, a cluster node will start to take over from its partner when the partner fails to send a heartbeat, or when a hardware failure is noticed. There is no "wait" period as such for recovery, so the takeover should be considered immediate. However, if you have a particularly latency-sensitive application, it's definitely worth testing the effect before going live with it. There is always a delay of up to a couple of minutes before the other node comes online, depending on the number of volumes etc.
Hmm, re-reading your question, it also looks like you were asking why cf takeover is not recommended for NAS.
Not sure I would agree with not enabling takeover for this, but there is also a giveback option that I really wouldn't enable.
Automatic takeover and giveback can cause two outages in a very short time if the panic simply bounces the node and it recovers immediately: the takeover occurs, and then the giveback. It is faster just to let the node reboot on its own, but there is always the risk that something bad happened and it stays down, in which case the takeover would need to happen anyway. If the node does recover, you can give back manually at a later time. It is most common to have takeover enabled but cf.takeover.on_panic disabled. This allows a hardware fault to be caught, but not a panic, which would have to be recovered manually. This is recommended only in a NAS (file protocol) environment. If you have FCP, then takeover on panic can prevent the disconnection of disks.
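As a rough sketch, the NAS-style setup described above could be entered like this. The `filer1>` and `filer2>` prompts are placeholder hostnames, and I am assuming the automatic giveback option is named `cf.giveback.auto.enable` on your release, so please check that against the options list before using it.

```
filer1> cf enable
filer1> options cf.takeover.on_panic off
filer1> options cf.giveback.auto.enable off
```

After a takeover, once the failed node is ready again, the giveback is started by hand from the node that took over:

```
filer2> cf giveback
```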
Apologies for the late reply; it looks like I wasn't receiving any mail in my inbox. Just to clarify my question: what I want to know is how long the partner waits before it decides that it should start a takeover when cf.takeover.on_panic is enabled. If you read carefully, the author says that ONTAP doesn't do a takeover in all panic situations.
So if I enable this option, how long does ONTAP take to decide whether it should do a takeover, or whether the filer is coming back online and should be left to take care of its own system? I remember reading in a NetApp document that it waits for 60 seconds before starting a takeover, but it looks like I lost the reference. I'd appreciate it if someone has a link to that article.