2016-11-01 03:59 PM
We have noticed in VMware (version 5.5 Update 3 with NFS datastores) that during a controller failover (e.g. testing or a Data ONTAP upgrade), a few datastores experience an All Paths Down (APD) event for about 10 seconds.
NFS.MaxQueueDepth is set at 64 currently.
The APD events only occur on controller failover and not during normal operations.
Has anyone experienced this on controller failovers, and has anyone had success eliminating APD events by dropping NFS.MaxQueueDepth to 32 or less?
2016-11-01 06:18 PM
Which version of ONTAP?
7 or C?
and which model?
Have you applied the settings from the VSC and rebooted the hosts?
How are the switchports configured?
Sounds like you're reading VMware KB 2016122.
2016-11-01 08:07 PM
Hey Sean - Clustered ONTAP 8.3.2P5, model AFF 8060. Yes, VSC settings have been applied and the hosts rebooted.
Switches are 5k's, with a vPC port channel for the NFS ifgrp on the AFFs. 1 port from each NetApp port channel to 5k switch 1 and 2 - MTU 9000 throughout.
I have read that article, but it seems to have been written for cases where you are experiencing random APDs during a normal day.
I have a support ticket open with VMware, who have suggested dropping MaxQueueDepth to 32 based on what the storage vendor says.
I also have a ticket with NetApp, which has not gotten very far.
2016-11-02 08:26 AM - edited 2016-11-02 01:30 PM
That article does seem like a red herring in your situation.
Does it also happen during a LIF migrate across nodes, or is it just at failover?
It sounds like you've got support from the vendors on both ends engaged; just don't overlook the network in between: flow control, portfast/spanning tree port type edge trunk, MTU 9216, etc.
2016-11-02 05:07 PM
Don't see any APDs during a LIF migrate operation.
Network components all look good.
We have found in the logs that during the failover the NFS service is taking about 5 seconds to start up on the active node. We are going to drop the MaxQueueDepth to 32 on an ESXi host and test failover again.
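For anyone wanting to time the gap from the client side during a failover test, here is a rough Python sketch. It is only an illustration, not something from VMware or NetApp: it polls TCP port 2049 on the NFS LIF address and collapses the results into outage windows. The function names, the polling interval and the example address are all assumptions, and a successful TCP connect is only a proxy for the NFS service being ready.

```python
import socket
import time


def probe(host, port=2049, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def outage_windows(samples):
    """Collapse (timestamp, reachable) samples into (start, end) outage windows."""
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # first failed probe opens a window
        elif ok and start is not None:
            windows.append((start, ts))     # first good probe closes it
            start = None
    if start is not None:
        # Still unreachable when sampling stopped: close at the last sample.
        windows.append((start, samples[-1][0]))
    return windows


def watch(host, duration=120.0, interval=0.5):
    """Poll host for `duration` seconds and return the outage windows seen."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.time(), probe(host)))
        time.sleep(interval)
    return outage_windows(samples)


# Hypothetical usage: start this just before triggering the failover.
# for start, end in watch("192.0.2.10"):
#     print(f"unreachable for {end - start:.1f}s")
```

Running it once with MaxQueueDepth at 64 and once at 32 would give two sets of windows to compare against the roughly 5 second NFS startup seen in the logs, with the caveat that the resolution is no better than the polling interval plus the probe timeout.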
2016-11-02 06:31 PM - edited 2016-11-02 06:32 PM
Does that SVM have an NFS LIF on that node in a non-failover state?