We have noticed within VMware (version 5.5 Update 3 with NFS datastores) that during a controller failover (i.e. testing or a Data ONTAP upgrade), a few datastores experience an All Paths Down (APD) event for about 10 seconds.
NFS.MaxQueueDepth is set at 64 currently.
The APD events only occur on controller failover and not during normal operations.
Has anyone experienced this on controller failovers, and has anyone had success eliminating APD events by dropping NFS.MaxQueueDepth to 32 or less?
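(For anyone finding this thread later: the setting can be checked and changed per host with esxcli; a minimal sketch, assuming ESXi 5.5 and the value 32 discussed here. A host reboot is required for the change to take effect, per VMware KB 2016122.)

```shell
# Check the current NFS queue depth on this ESXi host
esxcli system settings advanced list -o /NFS/MaxQueueDepth

# Drop it to 32 (the value being considered in this thread), then reboot the host
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 32
```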
Which version of ONTAP?
7-Mode or cDOT?
and which model?
Have you applied the settings from the VSC and rebooted the hosts?
How are the switchports configured?
Sounds like you're reading VMware KB 2016122.
Hey Sean - clustered Data ONTAP 8.3.2P5, model AFF8060. Yes, VSC settings have been applied and the hosts rebooted.
Switches are Nexus 5Ks with a vPC port channel for the NFS ifgrp on the AFFs. One port from each NetApp port channel goes to 5K switch 1 and switch 2 - MTU 9000 throughout.
I have read that article, but it seems to have been written for cases where you are experiencing random APDs throughout a normal day.
I have a support ticket open with VMware, who have suggested dropping MaxQueueDepth to 32, based on what the storage vendor says.
I also have a ticket with NetApp which has not gotten very far.
That article does seem like a red herring in your situation.
Does it also happen during a LIF migrate across nodes, or only at failover?
It sounds like you've got the vendors on both ends engaged; just don't overlook the network in between: flow control, portfast/spanning-tree port type edge trunk, MTU 9216, etc.
We don't see any APDs during a LIF migrate operation.
Network components all look good.
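For reference, the LIF-migrate test can be driven manually from the clustershell; a sketch with example SVM/LIF/node names (substitute your own):

```shell
# Move the NFS LIF to its HA partner node, then send it home again
network interface migrate -vserver svm1 -lif nfs_lif1 -destination-node cluster1-02
network interface revert -vserver svm1 -lif nfs_lif1
```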
We have found in the logs that during the failover, the NFS service is taking about 5 seconds to start up on the active node. We are going to drop MaxQueueDepth to 32 on an ESXi host and test failover again.
Does that SVM have an NFS LIF on that node when it is not in a failover state?
I also had the same issue - tickets open with NetApp and VMware.
Basically it came down to the failover time between controllers. Sometimes the failover was quick (in which case we didn't experience any APD); other times it was slightly slower (depending on how busy the controller was at the time), which led to an APD. There was no guarantee that a given failover would be faster or slower - it really comes down to how busy the controller is. You can see the failover times for individual protocols with ::> event log show -event *nfs* (after failover).
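To expand on that, both checks run from the clustershell; the event pattern is the one above, and storage failover show is a standard way to confirm HA state around the test:

```shell
# NFS-related events logged around the takeover/giveback
# (shows how long the protocol took to resume)
event log show -event *nfs*

# Confirm both nodes are healthy and takeover-capable before/after testing
storage failover show
```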
Out of interest, what controllers are you running, and what's their utilization like?
Running an AFF8040 on 9.1P2.
Very low utilization currently. Just got everything stood up, issue presented during some failover testing prior to production use. Typically seeing peaks of <10MBps throughput and <10k IOPS.