ONTAP Hardware
Dear everyone,
When performing an ONTAP upgrade on NetApp systems, we observe an I/O interruption lasting approximately 10 to 30 seconds, which causes all applications to terminate.
This interruption occurs frequently during the upgrade process, and we are looking for ways to minimize its impact.
We have tested the following upgrade approaches:
1. Manual takeover using the storage failover takeover command
2. Step-by-step takeover (LIF migration → CFO takeover → SFO takeover)
3. ANDU (Automated Non-Disruptive Upgrade)
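For reference, the step-by-step sequence in option 2 looks roughly like the following (hypothetical cluster/node names cluster1 and node-01; exact options and sequencing can differ by ONTAP release, so verify against the command reference for your version):

```
cluster1::> network interface migrate-all -node node-01
cluster1::> storage failover takeover -ofnode node-01
cluster1::> storage failover show
cluster1::> storage failover giveback -ofnode node-01
```

`migrate-all` moves the data LIFs off the node before takeover; the CFO (root) and SFO (data) aggregate phases are then driven by the takeover and giveback steps, and `storage failover show` confirms the HA state between each step.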
We confirmed that I/O interruption occurs with all of the above methods.
Among them, option 2 (step-by-step takeover) results in the least impact, but the interruption is still noticeable.
This issue does not appear to be related to system utilization, and we have observed that it occurs more frequently on ONTAP 9.12.1 and later versions, even when running the latest patch releases.
We would appreciate your advice on:
Whether this behavior is expected in recent ONTAP versions
Any known issues or changes in takeover/upgrade behavior since 9.12.1
Best practices or configuration recommendations to further reduce or eliminate this I/O interruption during upgrades.
Thank you.
I/O interruptions are "normal" during takeovers and givebacks.
You must apply the recommended host configuration settings on every host that uses NetApp storage so they can "ride out" these I/O pauses. This mainly involves increasing I/O timeout settings on each host type.
We publish recommended settings for the various host OSes on our support site.
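For NFSv3 clients on Linux, those host-side settings largely come down to hard mounts with generous retry timers, so that I/O blocks and retries through the pause instead of erroring back to the application. A minimal /etc/fstab sketch (the export `filer01:/vol/data` and mount point `/mnt/data` are placeholders; confirm the exact values against the recommendations on the support site for your OS):

```
# hard mount: I/O retries indefinitely instead of failing during a takeover
# timeo is in tenths of a second (timeo=600 -> 60 s per retry)
filer01:/vol/data  /mnt/data  nfs  rw,hard,vers=3,tcp,timeo=600,retrans=2  0 0
```

With a soft mount, the same 10-30 second pause can surface as an I/O error to the application, which matches the application terminations described above.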
We have been informed that an I/O interruption of up to approximately 90 seconds is considered normal behavior.
However, it is not feasible for us to explain to the customer that a 90-second I/O disruption is an acceptable or expected condition.
In practice, even an I/O interruption of around 10 seconds results in NFSv3 hangs, which causes all related applications to terminate.
As a result, the customer has expressed serious concerns, as a manufacturing/process project that had been in progress for several months has been completely disrupted.
Thank you.
Due to this issue, we conducted extensive testing on the client side, including NFSv3 timeout settings, Xen tapdisk timeouts, and both hard and soft mount options.
We also confirmed with NetApp that the I/O disruption is not caused by WAFL scans triggered by TSSE, nor by background jobs or high resource utilization, as the issue occurs even under low system load.
We are currently investigating whether the issue correlates with the ONTAP release in which Fast Takeover was introduced.
Currently, we are conducting version-based testing per platform internally.
For each platform (FAS8200, FAS8300, AFF-A700, AFF-A700s, FAS9000), we are performing dozens of ARL tests per version.
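To make those per-version ARL tests comparable, it helps to measure the actual I/O pause from the client side rather than relying on application symptoms. The following is a minimal probe sketch (an assumption, not a NetApp tool): it repeatedly writes and fsyncs a heartbeat file on the mount under test during a takeover, then reports any gap between successful writes longer than a threshold.

```python
"""I/O pause probe (illustrative sketch, not a NetApp utility):
heartbeat a file on the NFS mount during a takeover and report
any stall between successful writes longer than a threshold."""
import os
import sys
import time


def detect_gaps(timestamps, threshold):
    """Return (start, duration) pairs where consecutive successful
    writes were separated by more than `threshold` seconds."""
    return [(prev, cur - prev)
            for prev, cur in zip(timestamps, timestamps[1:])
            if cur - prev > threshold]


def probe(path, interval=0.1, duration=60.0, threshold=5.0):
    """Write+fsync `path` every `interval` s for `duration` s; return
    stalls longer than `threshold` s (10-30 s matches the symptom)."""
    stamps = []
    deadline = time.monotonic() + duration
    with open(path, "wb", buffering=0) as f:
        while time.monotonic() < deadline:
            f.seek(0)
            f.write(b"heartbeat\n")
            os.fsync(f.fileno())  # push the write all the way to the filer
            stamps.append(time.monotonic())
            time.sleep(interval)
    return detect_gaps(stamps, threshold)


if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g.: python3 io_probe.py /mnt/nfs_test/probe.dat
    for start, length in probe(sys.argv[1]):
        print(f"stall of {length:.1f}s detected")
```

Running this on each platform during takeover gives a repeatable stall duration per ONTAP version, which is easier to compare across FAS8200/FAS8300/AFF-A700/AFF-A700s/FAS9000 than application-level failures.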
@HojunLim, which protocol are you seeing this I/O disruption on?
We are using NFSv3 with Xen (Citrix), VMware datastore (NFSv3), and bare-metal RHEL 7.