
I need to reduce the time period before takeover, as much as possible

damin
5,008 Views

I need to reduce the time period before takeover, as much as possible.

I changed the option cf.takeover.detection.seconds to 10 seconds.

But it is still taking 180 seconds to fail over, which is not acceptable to the customer.

The system is FAS3140A.

Any suggestions?

Thanks,

10 REPLIES 10

radek_kubka
4,980 Views

Hi and welcome to the Communities!

180 sec for the cluster failover sounds way too long. What ONTAP version are they on? Have you seen this thread: http://communities.netapp.com/message/11677#11677?

Regards,

Radek

damin
4,980 Views

Hi,

The Data ONTAP version is 7.3.4. Yes, I did look at that thread, but it does not help.

I am still waiting for help; we need to reduce this 180 seconds.

Thanks,

Amin Abu-Dosh

chriszurich
4,980 Views

Where are you seeing that it's taking 180 seconds? I.e., are you seeing it within logs? If so, please provide them... the logs, that is. Failover to the cf partner should be near instantaneous. Failback is another story, since the controller has to boot before it can take over services.

I'm also curious to hear what transport protocols you're using. With FC you should see no downtime whatsoever, since both boxes should be configured in "single_image" mode. NFS/CIFS and iSCSI will be impacted and require use of the NetApp Host Utilities kit, which will update the timeout settings to 120 seconds for physical hosts. For virtual hosts, a separate host utilities kit is bundled with the ESX host utilities kit, which updates each host's disk timeout settings to 180 seconds.

damin
4,980 Views

Hi,

See the log below.

alm3140b> Sun Mar 6 16:51:02 AST [alm3140b: cf.fsm.nfo.acceptTakeoverReq:warning]: Negotiated failover: accepting takeover request by partner, reason: operator initiated cf takeover. Asking partner to shutdown gracefully; will takeover in at most 180 seconds.

The protocol used is CIFS; all the storage is used for shares only.

Thanks,

eric_barlier
4,980 Views

Hi,

Can you ask the customer what the impact of 180 seconds is, other than that it might seem long? Are they experiencing loss of writes/service?

Also, if you want a shorter failover time, you could turn off snapmirror, ndmp, cifs, nfs and all other protocols to lessen the workload so that failover is faster. This should be part of a controlled failover anyway to ensure no loss of data.

Have you enabled hardware assisted failover?

https://kb.netapp.com/support/index?page=content&id=1010145&actp=search&viewlocale=en_US&searchid=1299526213692

Read the note at the bottom to make sure you can do it.

Eric

damin
4,980 Views

Hi,

Almost all the storage is used for CIFS shares.

The customer's applications hang because of the long failover period.

Are there any options to reduce the failover period when CIFS is in use?

Yes, I did configure hardware-assisted takeover, but it still has no impact on the failover period.

Thanks,

radek_kubka
4,980 Views
Almost all the storage is used for CIFS shares.

Okay, so are the apps in question relying on CIFS?

CIFS is session-based, so regardless of the fail-over time, all sessions are terminated during fail-over and nothing can be done about it (as far as the current version of CIFS is concerned).

hill
4,980 Views

I'm wondering whether the questions Eric asked were answered. Where are you seeing 180 seconds? When you initiate a takeover, there is a standard "hard coded" message that says takeover will happen within 180 seconds. It will not reflect the values that you've set. Also, how the host reacts depends on a few factors: whether the right host utilities are used, whether there are any specific corporate system settings in your environment, and, more importantly, whether your cluster has been correctly configured to assume the partner's IPs in case of a failover.

The system logs on the primary and partner systems would identify how long the takeover actually took. Out of curiosity, have you opened a case for this issue? That would be the best way to get a more detailed response to your questions. The Global Support folks could assess your environmental settings from the most recent AutoSupport, or you could generate a user-initiated ASUP when a new case is opened.

Right now I'm not certain there is enough information to look at (on this discussion) to point you in the right direction.
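As a rough illustration of checking the logs for the real takeover duration (rather than trusting the "at most 180 seconds" message), here is a minimal Python sketch that computes the time between two log lines. It assumes the timestamp format seen in the log posted earlier in this thread. The marker for the end of the takeover is NOT a real ONTAP message: it is a made-up placeholder, as is the second sample line and its timestamp, so substitute whatever completion message your own logs show.

```python
import re
from datetime import datetime

MONTHS = {name: num for num, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Matches timestamps such as "Sun Mar 6 16:51:02" (the log carries no year).
TS_RE = re.compile(
    r"\b[A-Z][a-z]{2} ([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}):(\d{2}):(\d{2})\b")


def parse_timestamp(line, year):
    """Return the datetime found in a log line, or None if there is none."""
    match = TS_RE.search(line)
    if match is None or match.group(1) not in MONTHS:
        return None
    month, day, hh, mm, ss = match.groups()
    return datetime(year, MONTHS[month], int(day), int(hh), int(mm), int(ss))


def takeover_duration(lines, start_marker, end_marker, year):
    """Seconds between the first line containing start_marker and the next
    line containing end_marker; None if either one is missing."""
    start = None
    for line in lines:
        stamp = parse_timestamp(line, year)
        if stamp is None:
            continue
        if start is None:
            if start_marker in line:
                start = stamp
        elif end_marker in line:
            return (stamp - start).total_seconds()
    return None


SAMPLE = [
    # First line: taken from the log posted in this thread (shortened).
    "Sun Mar 6 16:51:02 AST [alm3140b: cf.fsm.nfo.acceptTakeoverReq:warning]: "
    "Negotiated failover: accepting takeover request by partner",
    # Second line: HYPOTHETICAL placeholder, not a real ONTAP message or time.
    "Sun Mar 6 16:51:40 AST [alm3140b: EXAMPLE.takeover.done:info]: "
    "placeholder completion message",
]

# The year is an assumption; the log lines do not include one.
elapsed = takeover_duration(
    SAMPLE, "cf.fsm.nfo.acceptTakeoverReq", "EXAMPLE.takeover.done", 2011)
print(elapsed)
```

With the invented sample above, the result is simply the gap between the two timestamps; the number says nothing about this system's real takeover time.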

niels
4,980 Views

In addition to the session-drop also remember the network infrastructure.

With a completed takeover, IP and MAC address now appear on a different switch port.

The network infrastructure needs to deal with it as well.

regards, Niels
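Since the outage a client sees includes both the takeover itself and the time the network needs to relearn where the IP and MAC address live, one way to measure it is from a client, by probing the filer's CIFS port during a test takeover. The Python sketch below is only an illustration under a few assumptions: the host name in the usage comment is hypothetical, TCP port 445 is the standard SMB port (older setups may use 139), and a successful TCP connect only shows that the port is reachable again, not that dropped CIFS sessions have been re-established.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(probe, samples, interval,
                   clock=time.monotonic, sleep=time.sleep):
    """Call probe() `samples` times, `interval` seconds apart, and return the
    longest continuous stretch (in seconds) during which probe() failed.
    Resolution is limited by `interval` plus the probe's own timeout."""
    longest = 0.0
    down_since = None
    for _ in range(samples):
        now = clock()
        if probe():
            if down_since is not None:
                # Service is back: close the current outage window.
                longest = max(longest, now - down_since)
                down_since = None
        elif down_since is None:
            # First failed probe of a new outage window.
            down_since = now
        sleep(interval)
    if down_since is not None:
        # Still down when sampling ended; count the outage so far.
        longest = max(longest, clock() - down_since)
    return longest


# Example usage (hypothetical host name; run it while a takeover is triggered):
#   gap = longest_outage(lambda: port_is_open("filer-a.example.com", 445),
#                        samples=600, interval=0.5)
#   print(f"longest outage seen from this client: {gap:.1f} s")
```

Running this from clients on different switches or subnets can help separate the storage takeover time from delays introduced by the network.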

mcope
4,127 Views

Are these manually triggered failovers or failovers caused by actual failures? Manual failovers are going to take a bit longer because there is less urgency and the process is more graceful. In a CIFS environment, you have to specify how much time to give CIFS sessions to close out before initiating failover.

Hardware-assisted takeover only applies to failure-generated failover events and requires the RLM ports to be configured and online. The time reduction is only about 10 to 12 seconds, which is more critical for FCP than for other protocols.
