We have multiple E2824 SANs and are experiencing long unstun times when creating VMware snapshots for backup purposes.
I have raised a ticket with support, and am getting sick of going round in circles: explaining the problem technically, getting robotic non-technical responses, and not having my questions answered.
I have an SSD volume on the SAN with 3 x SSDs in a RAID5. The SSD cache is not enabled for these disks, ruling out any SSD cache performance issues.
If I take a snapshot of a busy VM on that volume, I get unstun times as follows:
Create snapshot: 663 ms of stun. Delete snapshot: 27 ms, 340 ms and 410 ms of unstun.
These times are high enough to stop some of our applications from responding; the VMs drop pings, and services like Keepalived fail their VIPs over.
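For anyone wanting to reproduce the measurement: the figures above can be pulled straight out of the VM's vmware.log, which records each stun as a "Checkpoint_Unstun: vm stopped for N us" line. A small sketch below, assuming that log line format (the sample line is fabricated to match the numbers in this post):

```python
import re

# Stun-duration lines in a VM's vmware.log; format assumed from typical
# ESXi logs ("Checkpoint_Unstun: vm stopped for NNNN us").
STUN_RE = re.compile(r"Checkpoint_Unstun.*?stopped for (\d+) us")

def stun_times_ms(log_lines):
    """Return all stun durations found in the log, in milliseconds."""
    return [int(m.group(1)) / 1000.0
            for line in log_lines
            if (m := STUN_RE.search(line))]

# Fabricated example line matching the 663 ms create-snapshot stun above:
sample = [
    "2020-11-19T14:50:00.123Z| vmx| I125: "
    "Checkpoint_Unstun: vm stopped for 663000 us",
]
print(stun_times_ms(sample))  # -> [663.0]
```

Running this over the vmware.log of a VM before and after a backup window gives a comparable set of numbers without having to watch pings.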
I have upgraded the firmware on these 3 x SSDs to the latest on the advice of NetApp support, and it has made no difference.
I have then been told I need to upgrade the firmware on the other disks in the SAN, which are not involved in this process whatsoever.
I'm not against upgrading the FW on the other disks, but I need to pause I/O to the SAN to do it. That is not an easy task right now, so I don't want to do it on a whim without a technical explanation of how upgrading the FW on disks that aren't involved in this process can have any impact on the unstun time.
I have opened a support ticket with VMware, and they went through various things before saying everything is configured perfectly and that I need to check with the storage vendor.
I can compare this performance to a local disk in one of the ESXi hosts, just a SAS spinner:
Local storage: create snapshot was 70 ms of stun; commit snapshot was 4 ms and 26 ms of unstun. Zero pings dropped.
If it is the case that these E-Series controllers add this kind of latency to VMware snapshot operations, and it is normal to experience this, then I need to know...
But the support engineers just ignore my questions and keep asking me to upgrade firmwares.
NetApp support ticket 2008537431
Questions getting ignored and replied to with a suggestion to upgrade FW on irrelevant disks:
I'm 100% confident Delayed ACK is disabled, because it initially was not disabled and we had absolutely awful IO within VMs.
We raised a support ticket early on for this and rectified it, and the performance has been fine since.
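For the benefit of anyone else hitting this thread: rather than trusting the vCenter setting, Delayed ACK can be checked on each host directly, e.g. by running `esxcli iscsi adapter param get -A vmhba64` over SSH (the adapter name will vary) and reading the DelayedAck row. A minimal parser sketch for that output follows; the command and the column layout are assumptions based on typical esxcli table output, so verify them against your own hosts:

```python
def delayed_ack_enabled(esxcli_output: str):
    """Return True/False from the DelayedAck row's 'Current' column,
    or None if the row is not present in the output."""
    for line in esxcli_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "DelayedAck":
            return fields[1].lower() == "true"  # second column assumed 'Current'
    return None

# Fabricated sample output in the assumed layout:
sample_output = """\
Name        Current  Default  Settable
----------  -------  -------  --------
DelayedAck  false    true     true
"""
print(delayed_ack_enabled(sample_output))  # -> False
```

Running this per host catches the case the support engineer describes below, where the setting shows as disabled in vCenter but is still on at the host level.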
This issue is very simple: when we create VMware VM snapshots, the VMs' unstun times are a little too long, which means the VM drops a ping.
The unstun performance with a single SAS disk in an ESXi host is better than with an iSCSI LUN backed by 3 x SSDs, connected with 10Gbps DACs.
If this is normal for an E-Series in this scenario, then we need to know that.
I've given you a lot of information from my testing, can you not re-create the issue?
Am I to ask for the case to be escalated? Or to take it to the NetApp forum?
Copying my Manager in for visibility.
On Thu, 19 Nov 2020 at 14:50, wrote:
There is nothing else on the array that could possibly cause any performance degradation. The host interfaces are fine, the back-end is fine, and nothing indicates any issues with the array in general. The firmware bugs will not manifest in the logs if they are there, and Delayed ACK will not manifest if it is still enabled. If you won't upgrade the SSDs, I am fine with that; I will re-frame the issue and we can look at this again. With that in mind, how sure are you that Delayed ACK is disabled? There have been instances where it was disabled from vCenter, but still remained ON on the hosts themselves.
Thanks very much for the quick escalation. It's a shame I had to come to the forum to make progress, but equally impressed with the speed at which NetApp have called me and how seriously you guys take it.