Pop quiz: A customer lost power to one of their FAS controllers in an active/active pair, and with that, the LUNs attached to their Hyper-V server went away. The environment is such that nothing was lost, and we are able to troubleshoot why it happened. That said, I performed a cf takeover (then cf giveback) to see if the same thing happened on a graceful hand-off. The results were as one would expect when things are set up properly: the LUN remained online and visible to the server throughout the activity.
So my question: why, on a dirty shutdown, did they lose connectivity to their LUN? I am going to be troubleshooting onsite tomorrow, so any suggestions would be welcome.
Nice one - I like it!
So, let's start with this question: is it a relatively modern version of ONTAP, or an ancient one? Is FCP running in single_image mode? If that's the case, then this simply should not happen: a controller failure is equivalent to a path failover (if everything is set up properly).
Let's just say, they were "testing a theory!" 🙂
So the FAS controllers are running 8.0.2 7-Mode. It was my understanding that single_image was the default mode in 8.x+. I'll double-check tomorrow. As for the protocol, it is iSCSI over 10GbE.
My bad - I automatically assumed LUN equals FCP!
The single_image cfmode is FCP-specific and irrelevant for iSCSI (and it has actually been the default since 7.2).
What are the disk timeouts on the hosts? An unplanned takeover will presumably take longer than a clean one, which may explain the difference in observed behaviour.
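To make the timeout reasoning concrete, here is a minimal sketch of the comparison. The function name and all the numbers are hypothetical and for illustration only; nothing in this thread gives the actual takeover durations or the hosts' configured timeout. On Windows hosts the disk timeout is usually the TimeOutValue registry value under HKLM\SYSTEM\CurrentControlSet\Services\Disk, but check what is really set on these servers.

```python
def io_survives_takeover(disk_timeout_s: float, takeover_s: float,
                         margin_s: float = 0.0) -> bool:
    """Return True if the host's disk timeout outlasts the takeover.

    disk_timeout_s: how long the host waits on an unresponsive disk (seconds).
    takeover_s:     how long the partner takes to start serving the LUNs (seconds).
    margin_s:       optional safety margin for path rediscovery (an assumption).
    """
    return takeover_s + margin_s < disk_timeout_s


# Hypothetical values: a planned takeover is assumed to finish faster
# than an unplanned one, where the partner first has to detect the failure.
HOST_DISK_TIMEOUT_S = 60
scenarios = {"planned (cf takeover)": 30, "unplanned (power loss)": 90}

for name, duration_s in scenarios.items():
    ok = io_survives_takeover(HOST_DISK_TIMEOUT_S, duration_s)
    print(f"{name}: {'LUN stays online' if ok else 'host gives up on the LUN'}")
```

If the measured unplanned takeover time turns out to be longer than the host timeout, that alone could explain why the graceful test passed and the power loss did not.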
@Radek, I'll have to check the timeout settings.
@Pascal, a failover did take place. A controller power outage was all that occurred. The shelves stayed powered and disk multipath is set up properly.
What else did they do during the outage?
It seems strange, I'll give you that. Was iSCSI licensed on both controllers, and patched correctly? Did they test CFO during the initial setup of this Hyper-V environment? Do they have the ONTAP DSM and SnapDrive installed? How many LUNs and how many Hyper-V hosts are involved?
Did you grab the logs and ASUPs from this event? Did they help troubleshoot this?