Zero paths during failover/giveback

masaru_ryumae · ‎2012-02-17

All, while upgrading our HA pair (FAS6240) to Data ONTAP 8.0.2P5, we noticed that Linux/ESX clients lost all paths for a short period (15 seconds or so) when we did the final giveback. I don't believe this happened during our previous Data ONTAP upgrade. NetApp support tells me this is expected. Since I don't remember seeing it before, I never thought zero paths was something we would normally see. Support also says we need to re-check the Linux/ESX Host Utilities (5.3 or 6.0) and make sure multipath.conf is correct, so that clients don't freak out when they have zero paths for a short time during failover/giveback. I remember reading somewhere that there should never be zero paths during failover/giveback. Has anybody seen this zero-paths behavior regularly?

aborzenkov · ‎2012-02-17

Support is right here. Takeover/giveback means a filer reboot. During the reboot the filer cannot serve data, so there is a period of time when no valid path to the filer exists. It may be surprising, but during giveback the filer may be unavailable for a longer period, because a clean shutdown is first performed on the partner.

Hosts must be configured to wait for the filer to come back. That is what the Host Utilities normally do: they configure various timeouts so that the host does not hit a fatal error.
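
As a rough illustration of what "wait for the filer to come back" means on a Linux host, here is a minimal Python sketch that compares an all-paths-down period with the time dm-multipath would keep queueing I/O. It assumes a numeric `no_path_retry` in multipath.conf is counted in path-checker intervals, so the queueing window is approximately `no_path_retry * polling_interval` seconds (`no_path_retry queue` would queue indefinitely instead). The function names and the sample values are hypothetical and are not NetApp recommendations; only the 15-second outage comes from this thread.

```python
def queue_window_seconds(no_path_retry, polling_interval):
    """Approximate time multipathd keeps queueing I/O once all paths are down.

    Assumption: a numeric no_path_retry is counted in path-checker intervals,
    so the window is roughly no_path_retry * polling_interval seconds.
    """
    return no_path_retry * polling_interval


def survives_outage(no_path_retry, polling_interval, outage_seconds):
    """True if queued I/O should outlast an all-paths-down period of this length."""
    return queue_window_seconds(no_path_retry, polling_interval) > outage_seconds


# Illustrative values only; 15 seconds is the outage reported in this thread.
print(survives_outage(no_path_retry=12, polling_interval=5, outage_seconds=15))  # True
print(survives_outage(no_path_retry=2, polling_interval=5, outage_seconds=15))   # False
```

If the window is shorter than the outage, the queued I/O is failed up to the file system, which is the kind of error the timeouts are meant to prevent.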

masaru_ryumae · ‎2012-02-19

Hi, thank you so much for your response. So the question I have is this...

I would fully understand if the head being taken over lost all its paths temporarily, because it is being rebooted and no longer serving data. But for a head pair to be fully HA, I would expect the head that remains in service to keep all its paths and keep serving data. I have confirmed this in my environment by looking at the multipathd output in syslog on Linux clients.

When we do the giveback, the head being given back is booting, but the head giving back should still be serving data. So I would expect that we may lose paths on the head that is booting, but not on the head that is serving data. If both lose their paths at the same time, even for a brief moment, it defeats the purpose of advertising the pair as fully HA, doesn't it?

Anyhow, what I am hearing is that we must make sure the clients can handle the timeouts, so I will certainly look at the client configurations. For Linux clients, I didn't see anything in the Host Utilities documentation about ext3 or LVM and how the timeouts would play out for them (I may have missed it). I am assuming that as long as the timeouts are configured properly, both ext3 and LVM would survive the failover/giveback. Is this correct? I mention this because in our environment we have seen Linux clients' file systems go read-only during failover/giveback in the past, and our usual fix has been a reboot...

Thanks!

aborzenkov · ‎2012-02-20

You misunderstand how HA works in this case.

Each LUN is served by exactly one filer. Paths via the partner help to protect against path errors (cable/HBA/SFP down), but if the filer that owns the LUN is unavailable, paths via the partner cannot help you. What HA does is bring the LUN back into service on the partner automatically and reasonably fast. But there is still some period of time when nobody's home to answer requests.

Please understand that timeouts are unavoidable. Timeouts are ultimately the only way for a host to detect a dead path (for whatever reason) and retry, using the same or a different path. This is true for any storage from any vendor. So any multipath stack must be prepared to handle timeouts and retries.
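
The timeout-and-retry behavior described above can be sketched as a toy Python model. It is not how any real multipath driver is implemented: `issue_io`, its parameters, and the numbers are hypothetical, and the model simply assumes that every path, direct or via the partner, times out while the LUN's owner is out of service.

```python
def issue_io(start, outage_start, outage_end, io_timeout, give_up_after):
    """Retry one I/O until the LUN answers or the host gives up.

    Returns (completion_time, timed_out_attempts), or None if the host
    gives up first. Toy model: during [outage_start, outage_end) no path
    can reach the LUN, so every attempt costs one io_timeout.
    """
    t = start
    timed_out = 0
    while t - start <= give_up_after:
        lun_in_service = not (outage_start <= t < outage_end)
        if lun_in_service:
            return t, timed_out
        # The attempt timed out; the host retries on the same or another path.
        t += io_timeout
        timed_out += 1
    return None


# A 15-second outage, as reported in this thread; other values are illustrative.
print(issue_io(0, 0, 15, io_timeout=5, give_up_after=60))  # (15, 3)
print(issue_io(0, 0, 15, io_timeout=5, give_up_after=10))  # None
```

In the first call the host is patient enough and the I/O completes once the LUN is back. In the second the host gives up first and the error reaches the file system, which on ext3 can result in a read-only remount.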