I've found an issue involving our specific filer model and ONTAP version (FAS2240-2/ONTAP 8.1.4 7-Mode) with a new implementation that we're testing, and I'm hoping that someone could provide some thoughts. When using LUNs created on this filer in this implementation, manually failing over a file server role in a two-node Server 2016 Windows Failover Cluster using in-guest iSCSI consistently takes around 12 minutes to complete. During the failover between these WFC nodes, a LUN reset request is sent by the MS iSCSI initiator to our filer, and the connection to the disk is reestablished within the Windows environment.
I have tested the same configuration on an older filer/ONTAP version (FAS2020/ONTAP 22.214.171.124) and we do not experience the 12-minute failover time there. The failover of the file server role happens within seconds, as expected. The only part of the configuration that changes to reproduce the long failover time is which filer hosts the Windows source and destination disks.
The implementation is a newer Microsoft block-level replication technology called Storage Replica. Our configuration involves two Windows Server 2016 DCE nodes in a Windows Failover Cluster, with each node using the in-guest MS iSCSI initiator and SnapDrive 7.1.4 x64. Each node is connected to one separate LUN for data (2TB) and one separate LUN for logging (25GB), making four LUNs total, each thin-provisioned with SnapDrive. The four disks are then added to the Windows Failover Cluster and a File Server role is created using one of the 2TB disks as the source disk. Replication is then successfully enabled between the identically sized disks using the Storage Replica wizard, to create a source and destination for replication. The role is supposed to fail over to the other node (destination) within seconds, but this operation takes around 12 minutes on our specific filer and ONTAP version. As stated previously, the long failover does not happen on an older filer with an older ONTAP version.
We have a total of four FAS2240-2 filers. Each pair is in an HA configuration and resides at a different physical site. I have tested hosting the storage in this configuration across the physical sites and have also isolated the configuration to each individual site, and I consistently see the same long failover time of the file server role with the FAS2240-2/ONTAP 8.1.4 7-Mode filers. The older filer is a FAS2020 pair in an HA configuration, running ONTAP 126.96.36.199. The long failover time does not happen when hosting the storage in this configuration on the older filer.
Since we are currently on 8.1.4 7-Mode, we are unable to get support because the version falls under EOVS. We intend to move to a newer version when possible so that we can open a support case. In the meantime, however, we've been scratching our heads on this one and are hoping that someone on the NetApp forums has some ideas or thoughts. I would be happy to answer any additional questions.
@Jeff_Yao, @GidonMarcus is correct. I am referring to failover of the Windows Failover Cluster file server role as having the issue described in my post.
@GidonMarcus, I noticed the LUN reset notice in the Syslog on the filer with the destination disk of the failover. I did check EMS and found the same message that I saw in the Syslog. The specific message in EMS is: <iscsi_notice_1 m="Initiator (iqn.1991-05.com.microsoft:server) sent LUN Reset request, aborting all SCSI commands on lun X"/>.
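In case it helps anyone correlate these notices with the failover window, here is a minimal Python sketch that pulls the initiator and LUN out of lines shaped like the EMS message above. The pattern is based only on that one quoted message, so other ONTAP versions may word it differently. The names `LUN_RESET` and `find_lun_resets` are hypothetical and are not part of any NetApp tool.

```python
import re

# Assumption: the line format matches the single EMS message quoted above.
LUN_RESET = re.compile(
    r'Initiator \((?P<initiator>[^)]+)\) sent LUN Reset request, '
    r'aborting all SCSI commands on lun (?P<lun>[^"\s]+)'
)


def find_lun_resets(lines):
    """Return an (initiator, lun) tuple for every line that matches the pattern."""
    hits = []
    for line in lines:
        match = LUN_RESET.search(line)
        if match:
            hits.append((match.group("initiator"), match.group("lun")))
    return hits


# The sample line is the message from this post, unchanged.
sample = [
    '<iscsi_notice_1 m="Initiator (iqn.1991-05.com.microsoft:server) '
    'sent LUN Reset request, aborting all SCSI commands on lun X"/>'
]
resets = find_lun_resets(sample)
print(resets)
```

Running the same function over exported log lines from both the FAS2240-2 and the FAS2020 would show whether the reset is logged on both filers or only on the one with the slow failover.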