Latency issue with Windows Failover Cluster role failover and FAS2240-2/Data ONTAP 8.1.4 7-Mode

Wonkins · 2017-06-12

Greetings,

I've found an issue involving our specific filer model and ONTAP version (FAS2240-2/ONTAP 8.1.4
7-Mode) with a new implementation that we're testing, and I'm hoping that someone could provide
some thoughts. When using LUNs created on this filer in this implementation, manually failing over
a file server role in a two-node Server 2016 Windows Failover Cluster using in-guest iSCSI
consistently takes around 12 minutes for the failover between WFC nodes to complete. During the
failover between these WFC nodes, a LUN reset request is sent by the MS iSCSI initiator to our
filer, and the connection to the disk is reestablished within the Windows environment.

I have tested the same configuration on an older filer/ONTAP version (FAS2020/ONTAP 7.3.5.1) and
we do not experience the 12-minute failover time. The failover of the file server role happens within
seconds, as expected. The only part of the configuration that changes to reproduce the long
failover time is which filer the Windows source and destination disks are hosted on.

The implementation is a newer Microsoft block-level replication technology called Storage Replica.
Our configuration involves two Windows Server 2016 DCE nodes in a Windows Failover Cluster, with
each node using the in-guest MS iSCSI initiator and SnapDrive 7.1.4 x64. Each node is connected to
one separate LUN for data (2TB) and one separate LUN for logging (25GB), making four LUNs total,
each thin-provisioned with SnapDrive. The four disks are then added to the Windows Failover
Cluster and a File Server role is created using one of the 2TB disks as the source disk.
Replication is then successfully enabled between the identically-sized disks using the Storage
Replica wizard, to create a source and destination for replication. The role is supposed to
fail over to the other node (destination) within seconds, but this operation takes around 12
minutes on our specific filer and ONTAP version. As stated previously, the long failover does not
happen on an older filer, with an older ONTAP version.

We have a total of four FAS2240-2 filers; each pair is in an HA configuration and resides at a
different physical site. I have tested hosting the storage in this configuration across the
physical sites and have also isolated the configuration to each individual site, and consistently
get the same long failover time of the file server role with the FAS2240-2/ONTAP 8.1.4 7-Mode
filers. The older filer is a FAS2020 pair in an HA configuration, running ONTAP 7.3.5.1. The long
failover time does not happen when hosting the storage in this configuration on the older filer.

Since we are currently on 8.1.4 7-Mode, we are unable to get support due to the version falling
under EOVS. We intend to move to a newer version when possible to open a support case. However, in
the meantime, we've been scratching our heads on this one and are hoping to see if anyone on the
NetApp forums has any ideas/thoughts. I would be happy to answer any additional questions.

Thanks!

Jeff_Yao · 2017-06-21

Not sure if it's related to this BURT:

http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=605236

It is fixed in 8.1.4P1... maybe try to find a patch version of 8.1.4?

hopefully helps

thanks

Jeff

GidonMarcus · 2017-06-21

Hi,

@Jeff_Yao, I think @Wonkins is referring to the Windows failover rather than the filer failover.

@Wonkins, you mentioned that you see LUN resets. Is that in EMS? Could you share a packet trace (pktt) from the filer side taken while you fail over?

Gidi

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

Wonkins · 2017-06-22

@Jeff_Yao, @GidonMarcus is correct. I am referring to failover of the Windows Failover Cluster file server role as having the issue described in my post.

@GidonMarcus, I noticed the LUN reset notice in the syslog on the filer hosting the destination disk of the failover. I checked EMS and found the same message that I saw in the syslog. The specific message in EMS is:
<iscsi_notice_1 m="Initiator (iqn.1991-05.com.microsoft:server) sent LUN Reset request, aborting all SCSI commands on lun X"/>.
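
In case it helps anyone who wants to line these events up against the failover window, here is a rough Python sketch that pulls the initiator name and LUN out of a message shaped like the one above. It assumes each event is available as a single line of text in exactly that form; the function name parse_lun_reset is made up for illustration, and real EMS log files carry timestamps and wrapper elements that this does not handle.

```python
import re
import xml.etree.ElementTree as ET

# Pattern for the message text quoted in the post. The wording comes from that
# single example; other ONTAP releases may phrase the notice differently.
LUN_RESET = re.compile(
    r"Initiator \((?P<initiator>[^)]+)\) sent LUN Reset request, "
    r"aborting all SCSI commands on lun (?P<lun>\S+)"
)


def parse_lun_reset(ems_line):
    """Return (initiator, lun) for an iscsi_notice_1 LUN reset line, else None.

    Hypothetical helper: it expects one self-closing XML element per line.
    """
    try:
        # Drop surrounding whitespace and a trailing sentence period, if any.
        elem = ET.fromstring(ems_line.strip().rstrip("."))
    except ET.ParseError:
        return None
    if elem.tag != "iscsi_notice_1":
        return None
    match = LUN_RESET.search(elem.get("m", ""))
    if match is None:
        return None
    return match.group("initiator"), match.group("lun")


sample = (
    '<iscsi_notice_1 m="Initiator (iqn.1991-05.com.microsoft:server) '
    'sent LUN Reset request, aborting all SCSI commands on lun X"/>'
)
print(parse_lun_reset(sample))
```

Running this over the notices from both filers during a test failover would show which initiator reset which LUN, which may help when comparing against a packet trace.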