I have a customer with a large number of AIX servers who just experienced timeouts on 3 servers during an NDU ONTAP upgrade, and again during a non-disruptive Flash Cache card install (the same 3 servers both times).
Both times the takeover/giveback took over 30 seconds. I wouldn't expect this to be a problem, seeing as NetApp openly advertises that it won't take more than 180 seconds and these servers had the Host Utilities for AIX installed and configured.
It turns out that IBM identified a few MPIO patches missing on these servers, which is more than likely the root cause, but it got me thinking: if a takeover/giveback can take up to 180 seconds, why do the Host Utilities set the timeout value to 30 seconds?
I've attached a screenshot of the lsattr output, with the rw_timeout value highlighted.
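For anyone who wants to compare against their own hosts, the value can also be read directly from the command line. This is only a minimal sketch: hdisk2 is a placeholder disk name, rw_timeout_of is a helper name I made up, and it assumes the usual lsattr layout where the first column is the attribute name and the second is its value.

```shell
# Hypothetical helper: read `lsattr -El <disk>` style output on stdin
# and print the value column of the rw_timeout attribute.
rw_timeout_of() {
  awk '$1 == "rw_timeout" { print $2 }'
}

# On an AIX host (hdisk2 is a placeholder; substitute a NetApp LUN):
#   lsattr -El hdisk2 | rw_timeout_of
# Or ask lsattr for just that one attribute:
#   lsattr -El hdisk2 -a rw_timeout
```

Running it across the disks on an affected server and on a healthy one would show whether the setting itself differs between them.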
The Host Utilities manual also states that this is the correct value: http://support.netapp.com/knowledge/docs/hba/aix/relaixhu50/pdfs/host_set.pdf (page 7)
All of the other servers continued to run normally during the takeover/giveback, but I'd still like to know whether anyone can explain why the timeout is set to 30 seconds when a takeover can take up to 180 seconds.